History log of /linux-master/drivers/net/ethernet/microsoft/mana/gdma_main.c
Revision Date Author Comments
# 8afefc36 28-Jan-2024 Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

net: mana: Assigning IRQ affinity on HT cores

Existing MANA design assigns IRQ to every CPU, including sibling
hyper-threads. This may cause multiple IRQs to be active simultaneously
in the same core and may reduce the network performance.

Improve the performance by assigning IRQ to non sibling CPUs in local
NUMA node. The performance improvement we are getting using ntttcp with
following patch is around 15 percent against existing design and
approximately 11 percent, when trying to assign one IRQ in each core
across NUMA nodes, if enough cores are present.
The change will improve the performance for the system
with high number of CPU, where number of CPUs in a node is more than
64 CPUs. Nodes with 64 CPUs or less than 64 CPUs will not be affected
by this change.

The performance study was done using ntttcp tool in Azure.
The node had 2 nodes with 32 cores each, total 128 vCPU and number of channels
were 32 for 32 RX rings.

The below table shows a comparison between existing design and new
design:

IRQ node-num core-num CPU performance(%)
1 0 | 0 0 | 0 0 | 0-1 0
2 0 | 0 0 | 1 1 | 2-3 3
3 0 | 0 1 | 2 2 | 4-5 10
4 0 | 0 1 | 3 3 | 6-7 15
5 0 | 0 2 | 4 4 | 8-9 15
...
...
25 0 | 0 12| 24 24| 48-49 12
...
32 0 | 0 15| 31 31| 62-63 12
33 0 | 0 16| 0 32| 0-1 10
...
64 0 | 0 31| 31 63| 62-63 0

Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 91bfe210 28-Jan-2024 Yury Norov <yury.norov@gmail.com>

net: mana: add a function to spread IRQs per CPUs

Souradeep investigated that the driver performs faster if IRQs are
spread on CPUs with the following heuristics:

1. No more than one IRQ per CPU, if possible;
2. NUMA locality is the second priority;
3. Sibling dislocality is the last priority.

Let's consider this topology:

Node 0 1
Core 0 1 2 3
CPU 0 1 2 3 4 5 6 7

The most performant IRQ distribution based on the above topology
and heuristics may look like this:

IRQ Nodes Cores CPUs
0 1 0 0-1
1 1 1 2-3
2 1 0 0-1
3 1 1 2-3
4 2 2 4-5
5 2 3 6-7
6 2 2 4-5
7 2 3 6-7

The irq_setup() routine introduced in this patch leverages the
for_each_numa_hop_mask() iterator and assigns IRQs to sibling groups
as described above.

According to [1], for NUMA-aware but sibling-ignorant IRQ distribution
based on cpumask_local_spread() performance test results look like this:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:05:20 INFO: 17 threads created
08:05:28 INFO: Network activity progressing...
08:06:28 INFO: Test run completed.
08:06:28 INFO: Test cycle finished.
08:06:28 INFO: ##### Totals: #####
08:06:28 INFO: test duration :60.00 seconds
08:06:28 INFO: total bytes :630292053310
08:06:28 INFO: throughput :84.04Gbps
08:06:28 INFO: retrans segs :4
08:06:28 INFO: cpu cores :192
08:06:28 INFO: cpu speed :3799.725MHz
08:06:28 INFO: user :0.05%
08:06:28 INFO: system :1.60%
08:06:28 INFO: idle :96.41%
08:06:28 INFO: iowait :0.00%
08:06:28 INFO: softirq :1.94%
08:06:28 INFO: cycles/byte :2.50
08:06:28 INFO: cpu busy (all) :534.41%

For NUMA- and sibling-aware IRQ distribution, the same test works
15% faster:

./ntttcp -r -m 16
NTTTCP for Linux 1.4.0
---------------------------------------------------------
08:08:51 INFO: 17 threads created
08:08:56 INFO: Network activity progressing...
08:09:56 INFO: Test run completed.
08:09:56 INFO: Test cycle finished.
08:09:56 INFO: ##### Totals: #####
08:09:56 INFO: test duration :60.00 seconds
08:09:56 INFO: total bytes :741966608384
08:09:56 INFO: throughput :98.93Gbps
08:09:56 INFO: retrans segs :6
08:09:56 INFO: cpu cores :192
08:09:56 INFO: cpu speed :3799.791MHz
08:09:56 INFO: user :0.06%
08:09:56 INFO: system :1.81%
08:09:56 INFO: idle :96.18%
08:09:56 INFO: iowait :0.00%
08:09:56 INFO: softirq :1.95%
08:09:56 INFO: cycles/byte :2.25
08:09:56 INFO: cpu busy (all) :569.22%

[1] https://lore.kernel.org/all/20231211063726.GA4977@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Co-developed-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# a7f0636d 15-Dec-2023 Long Li <longli@microsoft.com>

RDMA/mana_ib: register RDMA device with GDMA

Software client needs to register with the RDMA management interface on
the SoC to access more features, including querying device capabilities
and RC queue pair.

Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1702692255-23640-2-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>


# 02fed6d9 13-Dec-2023 Konstantin Taranov <kotaranov@microsoft.com>

net: mana: add msix index sharing between EQs

This patch allows to assign and poll more than one EQ on the same
msix index.
It is achieved by introducing a list of attached EQs in each IRQ context.
It also removes the existing msix_index map that tried to ensure that there
is only one EQ at each msix_index.
This patch exports symbols for creating EQs from other MANA kernel modules.

Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 62c1bff5 02-Aug-2023 Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>

net: mana: Configure hwc timeout from hardware

At present hwc timeout value is a fixed value. This patch sets the hwc
timeout from the hardware. It now uses a new hardware capability
GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG to query and set the value
in hwc_timeout.

Signed-off-by: Souradeep Chakrabarti <schakrabarti@linux.microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# f5e39b57 17-Jul-2023 Long Li <longli@microsoft.com>

net: mana: Use the correct WQE count for ringing RQ doorbell

The hardware specification specifies that WQE_COUNT should set to 0 for
the Receive Queue. Although currently the hardware doesn't enforce the
check, in the future releases it may check on this value.

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1689622539-5334-3-git-send-email-longli@linuxonhyperv.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 2d59af83 23-Mar-2023 Cai Huoqing <cai.huoqing@linux.dev>

net: mana: Remove redundant pci_clear_master

Remove pci_clear_master to simplify the code,
the bus-mastering is also cleared in do_pci_disable_device,
like this:
./drivers/pci/pci.c:2197
static void do_pci_disable_device(struct pci_dev *dev)
{
u16 pci_command;

pci_read_config_word(dev, PCI_COMMAND, &pci_command);
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
pci_write_config_word(dev, PCI_COMMAND, pci_command);
}

pcibios_disable_device(dev);
}.
And dev->is_busmaster is set to 0 in pci_disable_device.

Signed-off-by: Cai Huoqing <cai.huoqing@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 18a04837 06-Feb-2023 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Fix accessing freed irq affinity_hint

After calling irq_set_affinity_and_hint(), the cpumask pointer is
saved in desc->affinity_hint, and will be used later when reading
/proc/irq/<num>/affinity_hint. So the cpumask variable needs to be
persistent. Otherwise, we are accessing freed memory when reading
the affinity_hint file.

Also, need to clear affinity_hint before free_irq(), otherwise there
is a one-time warning and stack trace during module unloading:

[ 243.948687] WARNING: CPU: 10 PID: 1589 at kernel/irq/manage.c:1913 free_irq+0x318/0x360
...
[ 243.948753] Call Trace:
[ 243.948754] <TASK>
[ 243.948760] mana_gd_remove_irqs+0x78/0xc0 [mana]
[ 243.948767] mana_gd_remove+0x3e/0x80 [mana]
[ 243.948773] pci_device_remove+0x3d/0xb0
[ 243.948778] device_remove+0x46/0x70
[ 243.948782] device_release_driver_internal+0x1fe/0x280
[ 243.948785] driver_detach+0x4e/0xa0
[ 243.948787] bus_remove_driver+0x70/0xf0
[ 243.948789] driver_unregister+0x35/0x60
[ 243.948792] pci_unregister_driver+0x44/0x90
[ 243.948794] mana_driver_exit+0x14/0x3fe [mana]
[ 243.948800] __do_sys_delete_module.constprop.0+0x185/0x2f0

To fix the bug, use the persistent mask, cpumask_of(cpu#), and set
affinity_hint to NULL before freeing the IRQ, as required by free_irq().

Cc: stable@vger.kernel.org
Fixes: 71fa6887eeca ("net: mana: Assign interrupts to CPUs based on NUMA nodes")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/1675718929-19565-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 20e3028c 19-Jan-2023 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Fix IRQ name - add PCI and queue number

The PCI and queue number info is missing in IRQ names.

Add PCI and queue number to IRQ names, to allow CPU affinity
tuning scripts to work.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/1674161950-19708-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 3574cfdc 11-Nov-2022 Leon Romanovsky <leon@kernel.org>

RDMA/mana: Remove redefinition of basic u64 type

gdma_obj_handle_t is no more than redefinition of basic
u64 type. Remove such obfuscation.

Link: https://lore.kernel.org/r/3c1e821279e6a165d058655d2343722d6650e776.1668160486.git.leonro@nvidia.com
Acked-by: Long Li <longli@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# 28c66cfa 03-Nov-2022 Ajay Sharma <sharmaajay@microsoft.com>

net: mana: Define data structures for protection domain and memory registration

The MANA hardware support protection domain and memory registration for use
in RDMA environment. Add those definitions and expose them for use by the
RDMA driver.

Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-12-git-send-email-longli@linuxonhyperv.com
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# fd325cd6 03-Nov-2022 Long Li <longli@microsoft.com>

net: mana: Move header files to a common location

In preparation to add MANA RDMA driver, move all the required header files
to a common location for use by both Ethernet and RDMA drivers.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-8-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# 4c0ff7a1 03-Nov-2022 Long Li <longli@microsoft.com>

net: mana: Export Work Queue functions for use by RDMA driver

RDMA device may need to create Ethernet device queues for use by Queue
Pair type RAW. This allows a user-mode context accesses Ethernet hardware
queues. Export the supporting functions for use by the RDMA driver.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-6-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# 6fe25416 03-Nov-2022 Ajay Sharma <sharmaajay@microsoft.com>

net: mana: Set the DMA device max segment size

MANA hardware doesn't have any restrictions on the DMA segment size, set it
to the max allowed value.

Signed-off-by: Ajay Sharma <sharmaajay@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-5-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# f3dc0962 03-Nov-2022 Long Li <longli@microsoft.com>

net: mana: Record the physical address for doorbell page region

For supporting RDMA device with multiple user contexts with their
individual doorbell pages, record the start address of doorbell page
region for use by the RDMA driver to allocate user context doorbell IDs.

Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://lore.kernel.org/r/1667502990-2559-3-git-send-email-longli@linuxonhyperv.com
Acked-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>


# 71fa6887 01-Nov-2022 Saurabh Sengar <ssengar@linux.microsoft.com>

net: mana: Assign interrupts to CPUs based on NUMA nodes

In large VMs with multiple NUMA nodes, network performance is usually
best if network interrupts are all assigned to the same virtual NUMA
node. This patch assigns online CPU according to a numa aware policy,
local cpus are returned first, followed by non-local ones, then it wraps
around.

Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://lore.kernel.org/r/1667282761-11547-1-git-send-email-ssengar@linux.microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 6fd2c68d 11-Sep-2022 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Add rmb after checking owner bits

Per GDMA spec, rmb is necessary after checking owner_bits, before
reading EQ or CQ entries.

Add rmb in these two places to comply with the specs.

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reported-by: Sinan Kaya <Sinan.Kaya@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1662928805-15861-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 8409fe92 27-Aug-2022 Vitaly Kuznetsov <vkuznets@redhat.com>

PCI: Move PCI_VENDOR_ID_MICROSOFT/PCI_DEVICE_ID_HYPERV_VIDEO definitions to pci_ids.h

There are already three places in kernel which define
PCI_VENDOR_ID_MICROSOFT and two for PCI_DEVICE_ID_HYPERV_VIDEO and
there's a need to use these from core VMBus code. Move the defines where
they belong.

No functional change.

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com> # pci_ids.h
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20220827130345.1320254-2-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>


# 1566e7d6 14-Jun-2022 Dexuan Cui <decui@microsoft.com>

net: mana: Add the Linux MANA PF driver

This minimal PF driver runs on bare metal.
Currently Ethernet TX/RX works. SR-IOV management is not supported yet.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Co-developed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>


# 10cdc794 24-Jan-2022 Gustavo A. R. Silva <gustavoars@kernel.org>

net: mana: Use struct_size() helper in mana_gd_create_dma_region()

Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows that,
in the worst scenario, could lead to heap overflows.

Also, address the following sparse warnings:
drivers/net/ethernet/microsoft/mana/gdma_main.c:677:24: warning: using sizeof on a flexible structure

Link: https://github.com/KSPP/linux/issues/174
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8f1bc38b 08-Nov-2021 Colin Ian King <colin.i.king@googlemail.com>

net: mana: Fix spelling mistake "calledd" -> "called"

There is a spelling mistake in a dev_info message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/20211108201817.43121-1-colin.i.king@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>


# 635096a8 29-Oct-2021 Dexuan Cui <decui@microsoft.com>

net: mana: Support hibernation and kexec

Implement the suspend/resume/shutdown callbacks for hibernation/kexec.

Add mana_gd_setup() and mana_gd_cleanup() for some common code, and
use them in the mand_gd_* callbacks.

Reuse mana_probe/remove() for the hibernation path.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 62ea8b77 29-Oct-2021 Dexuan Cui <decui@microsoft.com>

net: mana: Improve the HWC error handling

Currently when the HWC creation fails, the error handling is flawed,
e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails,
the resources acquired in mana_hwc_init_queues() is not released.

Enhance mana_hwc_destroy_channel() to do the proper cleanup work and
call it accordingly.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 3c37f357 29-Oct-2021 Dexuan Cui <decui@microsoft.com>

net: mana: Report OS info to the PF driver

The PF driver might use the OS info for statistical purposes.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# c1a3e9f9 24-Aug-2021 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Add WARN_ON_ONCE in case of CQE read overflow

This is not an expected case normally.
Add WARN_ON_ONCE in case of CQE read overflow, instead of failing
silently.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 1e2d0824 24-Aug-2021 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Add support for EQ sharing

The existing code uses (1 + #vPorts * #Queues) MSIXs, which may exceed
the device limit.

Support EQ sharing, so that multiple vPorts (NICs) can share the same
set of MSIXs.

And, report the EQ-sharing capability bit to the host, which means the
host can potentially offer more vPorts and queues to the VM.

Also update the resource limit checking and error handling for better
robustness.

Now, we support up to 256 virtual ports per VF (it was 16/VF), and
support up to 64 queues per vPort (it was 16).

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# e1b5683f 24-Aug-2021 Haiyang Zhang <haiyangz@microsoft.com>

net: mana: Move NAPI from EQ to CQ

The existing code has NAPI threads polling on EQ directly. To prepare
for EQ sharing among vPorts, move NAPI from EQ to CQ so that one EQ
can serve multiple CQs from different vPorts.

The "arm bit" is only set when CQ processing is completed to reduce
the number of EQ entries, which in turn reduce the number of interrupts
on EQ.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>


# ca9c54d2 16-Apr-2021 Dexuan Cui <decui@microsoft.com>

net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)

Add a VF driver for Microsoft Azure Network Adapter (MANA) that will be
available in the future.

Co-developed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Co-developed-by: Shachar Raindel <shacharr@microsoft.com>
Signed-off-by: Shachar Raindel <shacharr@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>