#
a251c17a |
|
05-Oct-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
treewide: use get_random_u32() when possible The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
#
91a3f14e |
|
18-Aug-2022 |
Mark Zhang <markzhang@nvidia.com> |
IB/cm: Remove the service_mask parameter from ib_cm_listen() Remove the service_mask parameter of ib_cm_listen(), as all callers use 0. Link: https://lore.kernel.org/r/20220819090859.957943-2-markzhang@nvidia.com Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
#
10f7b9bc |
|
19-Oct-2021 |
Jakub Kicinski <kuba@kernel.org> |
RDMA/ipoib: Use dev_addr_mod() Commit 406f42fa0d3c ("net-next: When a bond have a massive amount of VLANs...") introduced a rbtree for faster Ethernet address look up. To maintain netdev->dev_addr in this tree we need to make all the writes to it got through appropriate helpers. Link: https://lore.kernel.org/r/20211019182604.1441387-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
#
1f8f60f3 |
|
26-May-2021 |
YueHaibing <yuehaibing@huawei.com> |
IB/ipoib: Use DEVICE_ATTR_*() macros Use DEVICE_ATTR_*() helper instead of plain DEVICE_ATTR, which makes the code a bit shorter and easier to read. Link: https://lore.kernel.org/r/20210526132753.3092-1-yuehaibing@huawei.com Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
#
338a010c |
|
13-Apr-2021 |
Manjunath Patil <manjunath.b.patil@oracle.com> |
IB/ipoib: Improve latency in ipoib/cm connection formation Currently IPoIB connected mode queries the device to get the pkey table entry during connection formation. This will increase the time taken to form the connection, especially when limited pkeys are in use. This gets worse when multiple connection attempts are done in parallel. Since ipoib interfaces are locked to a single pkey, use the pkey index that was determined at link up time instead of searching for anything. This improved the latency from 500ms to 1ms on an internal setup. Link: https://lore.kernel.org/r/1618338965-16717-1-git-send-email-manjunath.b.patil@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
#
1c7fd726 |
|
07-Oct-2020 |
Joe Perches <joe@perches.com> |
RDMA: Convert sysfs device * show functions to use sysfs_emit() Done with cocci script: @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - sprintf(buf, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - snprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - scnprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> } @@ identifier d_show; identifier dev, attr, buf; expression chr; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... return - strcpy(buf, chr); + sysfs_emit(buf, chr); ...> } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - sprintf(buf, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - snprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... len = - scnprintf(buf, PAGE_SIZE, + sysfs_emit(buf, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; identifier len; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { <... - len += scnprintf(buf + len, PAGE_SIZE - len, + len += sysfs_emit_at(buf, len, ...); ...> return len; } @@ identifier d_show; identifier dev, attr, buf; expression chr; @@ ssize_t d_show(struct device *dev, struct device_attribute *attr, char *buf) { ... - strcpy(buf, chr); - return strlen(buf); + return sysfs_emit(buf, chr); } Link: https://lore.kernel.org/r/7f406fa8e3aa2552c022bec680f621e38d1fe414.1602122879.git.joe@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
#
119181d1 |
|
07-Sep-2020 |
Leon Romanovsky <leon@kernel.org> |
RDMA: Restore ability to fail on SRQ destroy In similar way to other IB objects, restore the ability to return error on SRQ destroy. Strictly speaking, this change is not necessary, and provided here to ensure a symmetrical interface like other destroy functions. Fixes: 68e326dea1db ("RDMA: Handle SRQ allocations by IB/core") Link: https://lore.kernel.org/r/20200907120921.476363-5-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
#
df561f66 |
|
23-Aug-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
treewide: Use fallthrough pseudo-keyword Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
#
1acba6a8 |
|
27-May-2020 |
Valentine Fatiev <valentinef@mellanox.com> |
IB/ipoib: Fix double free of skb in case of multicast traffic in CM mode When connected mode is set, and we have connected and datagram traffic in parallel, ipoib might crash with double free of datagram skb. The current mechanism assumes that the order in the completion queue is the same as the order of sent packets for all QPs. Order is kept only for specific QP, in case of mixed UD and CM traffic we have few QPs (one UD and few CM's) in parallel. The problem: ---------------------------------------------------------- Transmit queue: ----------------- UD skb pointer kept in queue itself, CM skb kept in spearate queue and uses transmit queue as a placeholder to count the number of total transmitted packets. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127 ------------------------------------------------------------ NL ud1 UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ........... ------------------------------------------------------------ ^ ^ tail head Completion queue (problematic scenario) - the order not the same as in the transmit queue: 1 2 3 4 5 6 7 8 9 ------------------------------------ ud1 CM1 UD2 ud3 cm2 cm3 ud4 cm4 ud5 ------------------------------------ 1. CM1 'wc' processing - skb freed in cm separate ring. - tx_tail of transmit queue increased although UD2 is not freed. Now driver assumes UD2 index is already freed and it could be used for new transmitted skb. 0 1 2 3 4 5 6 7 8 9 10 11 12 13 .........127 ------------------------------------------------------------ NL NL UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL NL NL ........... ------------------------------------------------------------ ^ ^ ^ (Bad)tail head (Bad - Could be used for new SKB) In this case (due to heavy load) UD2 skb pointer could be replaced by new transmitted packet UD_NEW, as the driver assumes its free. At this point we will have to process two 'wc' with same index but we have only one pointer to free. During second attempt to free the same skb we will have NULL pointer exception. 2. UD2 'wc' processing - skb freed according the index we got from 'wc', but it was already overwritten by mistake. So actually the skb that was released is the skb of the new transmitted packet and not the original one. 3. UD_NEW 'wc' processing - attempt to free already freed skb. NUll pointer exception. The fix: ----------------------------------------------------------------------- The fix is to stop using the UD ring as a placeholder for CM packets, the cyclic ring variables tx_head and tx_tail will manage the UD tx_ring, a new cyclic variables global_tx_head and global_tx_tail are introduced for managing and counting the overall outstanding sent packets, then the send queue will be stopped and waken based on these variables only. Note that no locking is needed since global_tx_head is updated in the xmit flow and global_tx_tail is updated in the NAPI flow only. A previous attempt tried to use one variable to count the outstanding sent packets, but it did not work since xmit and NAPI flows can run at the same time and the counter will be updated wrongly. Thus, we use the same simple cyclic head and tail scheme that we have today for the UD tx_ring. Fixes: 2c104ea68350 ("IB/ipoib: Get rid of the tx_outstanding variable in all modes") Link: https://lore.kernel.org/r/20200527134705.480068-1-leon@kernel.org Signed-off-by: Valentine Fatiev <valentinef@mellanox.com> Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
5d7d78ea |
|
27-Jun-2019 |
Fuqian Huang <huangfq.daxian@gmail.com> |
IB/ipoib: Remove memset after vzalloc in ipoib_cm.c vzalloc has already zeroed the memory. So a memset is unneeded. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
6ab4aba0 |
|
24-Jan-2019 |
Feras Daoud <ferasda@mellanox.com> |
IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start The following BUG was reported by kasan: BUG: KASAN: use-after-free in ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib] Read of size 80 at addr ffff88034c30bcd0 by task kworker/u16:1/24020 Workqueue: ipoib_wq ipoib_cm_tx_start [ib_ipoib] Call Trace: dump_stack+0x9a/0xeb print_address_description+0xe3/0x2e0 kasan_report+0x18a/0x2e0 ? ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib] memcpy+0x1f/0x50 ipoib_cm_tx_start+0x430/0x1390 [ib_ipoib] ? kvm_clock_read+0x1f/0x30 ? ipoib_cm_skb_reap+0x610/0x610 [ib_ipoib] ? __lock_is_held+0xc2/0x170 ? process_one_work+0x880/0x1960 ? process_one_work+0x912/0x1960 process_one_work+0x912/0x1960 ? wq_pool_ids_show+0x310/0x310 ? lock_acquire+0x145/0x440 worker_thread+0x87/0xbb0 ? process_one_work+0x1960/0x1960 kthread+0x314/0x3d0 ? kthread_create_worker_on_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x50 Allocated by task 0: kasan_kmalloc+0xa0/0xd0 kmem_cache_alloc_trace+0x168/0x3e0 path_rec_create+0xa2/0x1f0 [ib_ipoib] ipoib_start_xmit+0xa98/0x19e0 [ib_ipoib] dev_hard_start_xmit+0x159/0x8d0 sch_direct_xmit+0x226/0xb40 __dev_queue_xmit+0x1d63/0x2950 neigh_update+0x889/0x1770 arp_process+0xc47/0x21f0 arp_rcv+0x462/0x760 __netif_receive_skb_core+0x1546/0x2da0 netif_receive_skb_internal+0xf2/0x590 napi_gro_receive+0x28e/0x390 ipoib_ib_handle_rx_wc_rss+0x873/0x1b60 [ib_ipoib] ipoib_rx_poll_rss+0x17d/0x320 [ib_ipoib] net_rx_action+0x427/0xe30 __do_softirq+0x28e/0xc42 Freed by task 26680: __kasan_slab_free+0x11d/0x160 kfree+0xf5/0x360 ipoib_flush_paths+0x532/0x9d0 [ib_ipoib] ipoib_set_mode_rss+0x1ad/0x560 [ib_ipoib] set_mode+0xc8/0x150 [ib_ipoib] kernfs_fop_write+0x279/0x440 __vfs_write+0xd8/0x5c0 vfs_write+0x15e/0x470 ksys_write+0xb8/0x180 do_syscall_64+0x9b/0x420 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff88034c30bcc8 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 8 bytes inside of 512-byte region [ffff88034c30bcc8, ffff88034c30bec8) The buggy address belongs to the page: The following race between change mode and xmit flow is the reason for this use-after-free: Change mode Send packet 1 to GID XX Send packet 2 to GID XX | | | start | | | | | | | | | Create new path for GID XX | | and update neigh path | | | | | | | | | | flush_paths | | | | queue_work(cm.start_task) | | Path for GID XX not found | create new path | | start_task runs with old released path There is no locking to protect the lifetime of the path through the ipoib_cm_tx struct, so delete it entirely and always use the newly looked up path under the priv->lock. Fixes: 546481c2816e ("IB/ipoib: Fix memory corruption in ipoib cm mode connect flow") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
4d6e4d12 |
|
11-Oct-2018 |
Denis Drozdov <denisd@mellanox.com> |
IB/ipoib: Clear IPCB before icmp_send IPCB should be cleared before icmp_send, since it may contain data from previous layers and the data could be misinterpreted as ip header options, which later caused the ihl to be set to an invalid value and resulted in the following stack corruption: [ 1083.031512] ib0: packet len 57824 (> 2048) too long to send, dropping [ 1083.031843] ib0: packet len 37904 (> 2048) too long to send, dropping [ 1083.032004] ib0: packet len 4040 (> 2048) too long to send, dropping [ 1083.032253] ib0: packet len 63800 (> 2048) too long to send, dropping [ 1083.032481] ib0: packet len 23960 (> 2048) too long to send, dropping [ 1083.033149] ib0: packet len 63800 (> 2048) too long to send, dropping [ 1083.033439] ib0: packet len 63800 (> 2048) too long to send, dropping [ 1083.033700] ib0: packet len 63800 (> 2048) too long to send, dropping [ 1083.034124] ib0: packet len 63800 (> 2048) too long to send, dropping [ 1083.034387] ================================================================== [ 1083.034602] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0xf08/0x1310 [ 1083.034798] Write of size 4 at addr ffff880353457c5f by task kworker/u16:0/7 [ 1083.034990] [ 1083.035104] CPU: 7 PID: 7 Comm: kworker/u16:0 Tainted: G O 4.19.0-rc5+ #1 [ 1083.035316] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu2 04/01/2014 [ 1083.035573] Workqueue: ipoib_wq ipoib_cm_skb_reap [ib_ipoib] [ 1083.035750] Call Trace: [ 1083.035888] dump_stack+0x9a/0xeb [ 1083.036031] print_address_description+0xe3/0x2e0 [ 1083.036213] kasan_report+0x18a/0x2e0 [ 1083.036356] ? __ip_options_echo+0xf08/0x1310 [ 1083.036522] __ip_options_echo+0xf08/0x1310 [ 1083.036688] icmp_send+0x7b9/0x1cd0 [ 1083.036843] ? icmp_route_lookup.constprop.9+0x1070/0x1070 [ 1083.037018] ? netif_schedule_queue+0x5/0x200 [ 1083.037180] ? debug_show_all_locks+0x310/0x310 [ 1083.037341] ? rcu_dynticks_curr_cpu_in_eqs+0x85/0x120 [ 1083.037519] ? debug_locks_off+0x11/0x80 [ 1083.037673] ? debug_check_no_obj_freed+0x207/0x4c6 [ 1083.037841] ? check_flags.part.27+0x450/0x450 [ 1083.037995] ? debug_check_no_obj_freed+0xc3/0x4c6 [ 1083.038169] ? debug_locks_off+0x11/0x80 [ 1083.038318] ? skb_dequeue+0x10e/0x1a0 [ 1083.038476] ? ipoib_cm_skb_reap+0x2b5/0x650 [ib_ipoib] [ 1083.038642] ? netif_schedule_queue+0xa8/0x200 [ 1083.038820] ? ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib] [ 1083.038996] ipoib_cm_skb_reap+0x544/0x650 [ib_ipoib] [ 1083.039174] process_one_work+0x912/0x1830 [ 1083.039336] ? wq_pool_ids_show+0x310/0x310 [ 1083.039491] ? lock_acquire+0x145/0x3a0 [ 1083.042312] worker_thread+0x87/0xbb0 [ 1083.045099] ? process_one_work+0x1830/0x1830 [ 1083.047865] kthread+0x322/0x3e0 [ 1083.050624] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 1083.053354] ret_from_fork+0x3a/0x50 For instance __ip_options_echo is failing to proceed with invalid srr and optlen passed from another layer via IPCB [ 762.139568] IPv4: __ip_options_echo rr=0 ts=0 srr=43 cipso=0 [ 762.139720] IPv4: ip_options_build: IPCB 00000000f3cd969e opt 000000002ccb3533 [ 762.139838] IPv4: __ip_options_echo in srr: optlen 197 soffset 84 [ 762.139852] IPv4: ip_options_build srr=0 is_frag=0 rr_needaddr=0 ts_needaddr=0 ts_needtime=0 rr=0 ts=0 [ 762.140269] ================================================================== [ 762.140713] IPv4: __ip_options_echo rr=0 ts=0 srr=0 cipso=0 [ 762.141078] BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x12ec/0x1680 [ 762.141087] Write of size 4 at addr ffff880353457c7f by task kworker/u16:0/7 Signed-off-by: Denis Drozdov <denisd@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
816e846c |
|
24-Aug-2018 |
Aaron Knister <aaron.s.knister@nasa.gov> |
IB/ipoib: Avoid a race condition between start_xmit and cm_rep_handler Inside of start_xmit() the call to check if the connection is up and the queueing of the packets for later transmission is not atomic which leaves a window where cm_rep_handler can run, set the connection up, dequeue pending packets and leave the subsequently queued packets by start_xmit() sitting on neigh->queue until they're dropped when the connection is torn down. This only applies to connected mode. These dropped packets can really upset TCP, for example, and cause multi-minute delays in transmission for open connections. Here's the code in start_xmit where we check to see if the connection is up: if (ipoib_cm_get(neigh)) { if (ipoib_cm_up(neigh)) { ipoib_cm_send(dev, skb, ipoib_cm_get(neigh)); goto unref; } } The race occurs if cm_rep_handler execution occurs after the above connection check (specifically if it gets to the point where it acquires priv->lock to dequeue pending skb's) but before the below code snippet in start_xmit where packets are queued. if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE) { push_pseudo_header(skb, phdr->hwaddr); spin_lock_irqsave(&priv->lock, flags); __skb_queue_tail(&neigh->queue, skb); spin_unlock_irqrestore(&priv->lock, flags); } else { ++dev->stats.tx_dropped; dev_kfree_skb_any(skb); } The patch acquires the netif tx lock in cm_rep_handler for the section where it sets the connection up and dequeues and retransmits deferred skb's. Fixes: 839fcaba355a ("IPoIB: Connected mode experimental support") Cc: stable@vger.kernel.org Signed-off-by: Aaron Knister <aaron.s.knister@nasa.gov> Tested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
ee190ab7 |
|
29-Jul-2018 |
Jason Gunthorpe <jgg@ziepe.ca> |
IB/ipoib: Get rid of the sysfs_mutex This mutex was introduced to deal with the deadlock formed by calling unregister_netdev from within the sysfs callback of a netdev. Now that we have priv_destructor and needs_free_netdev we can switch to the more targeted solution of running the unregister from a work queue. This avoids the deadlock and gets rid of the mutex. The next patch in the series needs this mutex eliminated to create atomicity of unregisteration. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
#
577e07ff |
|
29-Jul-2018 |
Jason Gunthorpe <jgg@ziepe.ca> |
IB/ipoib: Get rid of IPOIB_FLAG_GOING_DOWN This essentially duplicates the netdev's reg_state, so just use that directly. The reg_state is updated under the rntl_lock, and all places using GOING_DOWN already acquire the rtnl_lock so checking is safe. Since the only place we use GOING_DOWN is for the parent device this does not fix any bugs, but it is a step to tidy up the unregister flow so that after later patches the flow is uniform and sane. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
#
e7ff98ae |
|
29-Jul-2018 |
Parav Pandit <parav@mellanox.com> |
RDMA/cma: Constify path record, ib_cm_event, listen_id pointers Constify several pointers such as path_rec, ib_cm_event and listen_id pointers in several functions. Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
e586e1e1 |
|
30-Jul-2018 |
Kamal Heib <kamalheib1@gmail.com> |
RDMA/ipoib: Fix check for return code from ib_create_srq Make sure to check for "-EOPNOTSUPP" instead of "-ENOSYS" which is the return code from ib_create_srq() in case that it not supported. Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
4b4671a0 |
|
18-Jul-2018 |
Bart Van Assche <bvanassche@acm.org> |
IB/IPoIB: Simplify ib_post_(send|recv|srq_recv)() calls Instead of declaring and passing a dummy 'bad_wr' pointer, pass NULL as third argument to ib_post_(send|recv|srq_recv)(). Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
b1b63970 |
|
04-Jul-2018 |
Kamal Heib <kamalheib1@gmail.com> |
RDMA/ipoib: Fix use of sizeof() Make sure to use sizeof(...) instead of sizeof ... which is more preferred. Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
0578cdad |
|
04-Jul-2018 |
Kamal Heib <kamalheib1@gmail.com> |
RDMA/ipoib: Prefer unsigned int to bare use of unsigned This commit replaces all the unsigned definitions in favour of 'unsigned int' which is preferred. Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
299c36b1 |
|
04-Jul-2018 |
Kamal Heib <kamalheib1@gmail.com> |
RDMA/ipoib: Use min_t() macro instead of min() Use min_t() macro to avoid the casting when using min() macro, also fix the type of "length" and "wc->byte_len" to be "unsigned int" and "u32" which is the right type for each one of them. Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
33023fb8 |
|
18-Jun-2018 |
Steve Wise <larrystevenwise@gmail.com> |
IB/core: add max_send_sge and max_recv_sge attributes This patch replaces the ib_device_attr.max_sge with max_send_sge and max_recv_sge. It allows ulps to take advantage of devices that have very different send and recv sge depths. For example cxgb4 has a max_recv_sge of 4, yet a max_send_sge of 16. Splitting out these attributes allows much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW API. Consider a large RDMA WRITE that has 16 scattergather entries. With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of 16, it can be done with 1 WRITE WR. Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
fad953ce |
|
12-Jun-2018 |
Kees Cook <keescook@chromium.org> |
treewide: Use array_size() in vzalloc() The vzalloc() function has no 2-factor argument form, so multiplication factors need to be wrapped in array_size(). This patch replaces cases of: vzalloc(a * b) with: vzalloc(array_size(a, b)) as well as handling cases of: vzalloc(a * b * c) with: vzalloc(array3_size(a, b, c)) This does, however, attempt to ignore constant size factors like: vzalloc(4 * 1024) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( vzalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | vzalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( vzalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | vzalloc( - sizeof(u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(__u8) * COUNT + COUNT , ...) | vzalloc( - sizeof(char) * COUNT + COUNT , ...) | vzalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( vzalloc( - sizeof(TYPE) * (COUNT_ID) + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_ID + array_size(COUNT_ID, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT_CONST + array_size(COUNT_CONST, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT_ID) + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_ID + array_size(COUNT_ID, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT_CONST) + array_size(COUNT_CONST, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT_CONST + array_size(COUNT_CONST, sizeof(THING)) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ vzalloc( - SIZE * COUNT + array_size(COUNT, SIZE) , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( vzalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | vzalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( vzalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | vzalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( vzalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | vzalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( vzalloc(C1 * C2 * C3, ...) | vzalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants. @@ expression E1, E2; constant C1, C2; @@ ( vzalloc(C1 * C2, ...) | vzalloc( - E1 * E2 + array_size(E1, E2) , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
|
#
f15ca723 |
|
25-Jan-2018 |
Nicolas Dichtel <nicolas.dichtel@6wind.com> |
net: don't call update_pmtu unconditionally Some dst_ops (e.g. md_dst_ops)) doesn't set this handler. It may result to: "BUG: unable to handle kernel NULL pointer dereference at (null)" Let's add a helper to check if update_pmtu is available before calling it. Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path") Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path") CC: Roman Kapl <code@rkapl.cz> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
809cb695 |
|
03-Jan-2018 |
Alex Estrin <alex.estrin@intel.com> |
IB/ipoib: Fix for notify send CQ failure messages If IB_CQ_REPORT_MISSED_EVENTS flag is passed in ib_req_notify_cq() it may return positive value indicating non-empty CQ. If return code not verified the log might be flooded with false warning messages "request notify on send CQ failed". Fixes: 8966e28d2e40 ("IB/ipoib: Use NAPI in UD/TX flows") Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Alex Estrin <alex.estrin@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
9d98e19b |
|
12-Dec-2017 |
Yuval Shaia <yuval.shaia@oracle.com> |
IB/ipoib: Restore MM behavior in case of tx_ring allocation failure memalloc_noio_save modifies the behavior of MM, we must restore it after we are done. Fixes: d83187dda9b9 ("IB/IPoIB: Convert IPoIB to memalloc_noio_* calls") Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
c55359a2 |
|
29-Nov-2017 |
Yuval Shaia <yuval.shaia@oracle.com> |
IB/ipoib: Replace printk with pr_warn pr_* is the preferred way to print messages, replace all printk(KERN_WARN, ...) with pr_warn. Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
|
#
8966e28d |
|
18-Oct-2017 |
Erez Shitrit <erezsh@mellanox.com> |
IB/ipoib: Use NAPI in UD/TX flows Instead of explicit call to poll_cq of the tx ring, use the NAPI mechanism to handle the completions of each packet that has been sent to the HW. The next major changes were taken: * The driver init completion function in the creation of the send CQ, that function triggers the napi scheduling. * The driver uses CQ for RX for both modes UD and CM, and CQ for TX for CM and UD. Cc: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
2c104ea6 |
|
18-Oct-2017 |
Erez Shitrit <erezsh@mellanox.com> |
IB/ipoib: Get rid of the tx_outstanding variable in all modes The first step toward using NAPI in the UD/TX flow is to separate between two flows, the NAPI and the xmit, meaning no use of shared variables between both flows. This patch takes out the tx_outstanding variable that was used in both flows and instead the driver uses the 2 cyclic ring variables: tx_head and tx_tail, tx_head used in the xmit flow and tx_tail in the NAPI flow. Cc: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
b04dc199 |
|
27-Sep-2017 |
Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> |
IB/{ipoib, iser}: Consistent print format of vendor error Vendor error print should be consistent across protocols to avoid any confusion. This patch corrects that. Suggested-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
af3c79be |
|
07-Sep-2017 |
Santosh Shilimkar <santosh.shilimkar@oracle.com> |
IB/ipoib: Suppress the retry related completion errors IPoIB doesn't support transport/rnr retry schemes as per RFC so those errors are expected. No need to flood the log files with them. Tested-by: Michael Nowak <michael.nowak@oracle.com> Tested-by: Rafael Alejandro Peralez <rafael.peralez@oracle.com> Tested-by: Liwen Huang <liwen.huang@oracle.com> Tested-by: Hong Liu <hong.x.liu@oracle.com> Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com> Reported-by: Rajiv Raja <rajiv.raja@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
69956d83 |
|
17-Aug-2017 |
Erez Shitrit <erezsh@mellanox.com> |
IB/ipoib: Sync between remove_one to sysfs calls that use rtnl_lock In order to avoid deadlock between sysfs functions (like create/delete child) and remove_one (both of them are using the sysfs lock and rtnl_lock) the driver will use a state mutex for sync. That will fix traces as the following: schedule+0x3e/0x90 kernfs_drain+0x75/0xf0 ? wait_woken+0x90/0x90 __kernfs_remove+0x12e/0x1c0 kernfs_remove+0x25/0x40 sysfs_remove_dir+0x57/0x90 kobject_del+0x22/0x60 device_del+0x195/0x230 pm_runtime_set_memalloc_noio+0xac/0xf0 netdev_unregister_kobject+0x71/0x80 rollback_registered_many+0x205/0x2f0 rollback_registered+0x31/0x40 unregister_netdevice_queue+0x58/0xb0 unregister_netdev+0x20/0x30 ipoib_remove_one+0xb7/0x240 [ib_ipoib] ib_unregister_device+0xbc/0x1b0 [ib_core] ib_unregister_mad_agent+0x29/0x30 [ib_core] mlx4_ib_remove+0x67/0x280 [mlx4_ib] INFO: task echo:24082 blocked for more than 120 seconds. Tainted: G OE 4.1.12-37.5.1.el6uek.x86_64 #2 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Call Trace: schedule+0x3e/0x90 schedule_preempt_disabled+0xe/0x10 __mutex_lock_slowpath+0x95/0x110 ? _rcu_barrier+0x177/0x220 mutex_lock+0x23/0x40 rtnl_lock+0x15/0x20 netdev_run_todo+0x81/0x1f0 rtnl_unlock+0xe/0x10 ipoib_vlan_delete+0x12f/0x1c0 [ib_ipoib] delete_child+0x69/0x80 [ib_ipoib] dev_attr_store+0x20/0x30 sysfs_kf_write+0x41/0x50 Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
1b355094 |
|
15-Jul-2017 |
Leon Romanovsky <leon@kernel.org> |
IB/ipoib: Remove double pointer assigning There is no need to assign "p" pointer twice. This patch fixes the following smatch warning: drivers/infiniband/ulp/ipoib/ipoib_cm.c:517 ipoib_cm_rx_handler() warn: missing break? reassigning 'p->id' Fixes: 839fcaba355a ("IPoIB: Connected mode experimental support") Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
|
#
d83187dd |
|
23-May-2017 |
Leon Romanovsky <leon@kernel.org> |
IB/IPoIB: Convert IPoIB to memalloc_noio_* calls Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O during memory allocation") added the memalloc_noio_(save|restore) functions to enable people to modify the MM behavior by disabling I/O during memory allocation. This was further extended in Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"). memalloc_noio_* functions prevent allocation paths recursing back into the filesystem without explicitly changing the flags for every allocation site. However the IPoIB hasn't been keeping up with the changes and missed completely these memalloc_noio_* calls. This led to update of allocation site with special QP creation flag, see commit 09b93088d750 ("IB: Add a QP creation flag to use GFP_NOIO allocations"), while this flag is supported by small number of drivers in IB stack. Let's change it by updating to memalloc_noio_* calls and allow for every driver underneath enjoy NOIO allocations. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
98e77d9f |
|
23-May-2017 |
Leon Romanovsky <leon@kernel.org> |
IB: Convert msleep below 20ms to usleep_range The msleep(1) may do not sleep 1 ms as expected and will sleep longer. The simple conversion from msleep to usleep_range between 1ms and 2ms can solve an issue. The full and comprehensive explanation can be found at [1] and [2]. [1] https://lkml.org/lkml/2007/8/3/250 [2] Documentation/timers/timers-howto.txt Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
c2f8fc4e |
|
27-Apr-2017 |
Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com> |
IB/SA: Rename ib_sa_path_rec to sa_path_rec Rename ib_sa_path_rec to a more generic sa_path_rec. This is part of extending ib_sa to also support OPA path records in addition to the IB defined path records. Reviewed-by: Don Hiatt <don.hiatt@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
c1048aff |
|
10-Apr-2017 |
Erez Shitrit <erezsh@mellanox.com> |
IB/IPoIB: Use defined function for netdev_priv function Make ipoib_priv point to netdev_priv where the code calls netdev_priv. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
174cd4b1 |
|
02-Feb-2017 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
23536dfa |
|
14-Feb-2017 |
Zhu Yanjun <yanjun.zhu@oracle.com> |
IB/ipoib: Remove redudant label There are 2 labels to mark the same statements. Replace the 2 labels with one versatile labe. Cc: Joe Jin <joe.jin@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
c5e8f57b0 |
|
09-Feb-2017 |
Zhu Yanjun <yanjun.zhu@oracle.com> |
IB/ipoib: remove the unnecessary memory free In the function ipoib_cm_nonsrq_init_rx, the memory is not allocated successfully. It is not necessary to free it. CC: Joe Jin <joe.jin@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
27d41d29 |
|
28-Dec-2016 |
Feras Daoud <ferasda@mellanox.com> |
IB/ipoib: Change list_del to list_del_init in the tx object Since ipoib_cm_tx_start function and ipoib_cm_tx_reap function belong to different work queues, they can run in parallel. In this case if ipoib_cm_tx_reap calls list_del and release the lock, ipoib_cm_tx_start may acquire it and call list_del_init on the already deleted object. Changing list_del to list_del_init in ipoib_cm_tx_reap fixes the problem. Fixes: 839fcaba355a ("IPoIB: Connected mode experimental support") Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
13ee429a |
|
28-Dec-2016 |
Feras Daoud <ferasda@mellanox.com> |
IB/ipoib: Use debug prints instead of warnings in RNR WC status If a receive request has not been posted to the work queue, the incoming message is rejected and the peer will receive a receiver-not-ready (RNR) error. In IPoIB, IB_WC_RNR_RETRY_EXC_ERR error is part of the life cycle therefore ipoib_cm_handle_tx_wc function will print to debug instead of warnings. Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
d32b9a81 |
|
28-Dec-2016 |
Feras Daoud <ferasda@mellanox.com> |
IB/ipoib: Add detailed error message to dev_queue_xmit call Add a detailed return code to dev_queue_xmit function when calling to requeue packet via __skb_dequeue. Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
0a0007f2 |
|
28-Dec-2016 |
Feras Daoud <ferasda@mellanox.com> |
IB/ipoib: Fix deadlock between rmmod and set_mode When calling set_mode from sys/fs, the call flow locks the sys/fs lock first and then tries to lock rtnl_lock (when calling ipoib_set_mod). On the other hand, the rmmod call flow takes the rtnl_lock first (when calling unregister_netdev) and then tries to take the sys/fs lock. Deadlock a->b, b->a. The problem starts when ipoib_set_mod frees it's rtnl_lck and tries to get it after that. set_mod: [<ffffffff8104f2bd>] ? check_preempt_curr+0x6d/0x90 [<ffffffff814fee8e>] __mutex_lock_slowpath+0x13e/0x180 [<ffffffff81448655>] ? __rtnl_unlock+0x15/0x20 [<ffffffff814fed2b>] mutex_lock+0x2b/0x50 [<ffffffff81448675>] rtnl_lock+0x15/0x20 [<ffffffffa02ad807>] ipoib_set_mode+0x97/0x160 [ib_ipoib] [<ffffffffa02b5f5b>] set_mode+0x3b/0x80 [ib_ipoib] [<ffffffff8134b840>] dev_attr_store+0x20/0x30 [<ffffffff811f0fe5>] sysfs_write_file+0xe5/0x170 [<ffffffff8117b068>] vfs_write+0xb8/0x1a0 [<ffffffff8117ba81>] sys_write+0x51/0x90 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b rmmod: [<ffffffff81279ffc>] ? put_dec+0x10c/0x110 [<ffffffff8127a2ee>] ? number+0x2ee/0x320 [<ffffffff814fe6a5>] schedule_timeout+0x215/0x2e0 [<ffffffff8127cc04>] ? vsnprintf+0x484/0x5f0 [<ffffffff8127b550>] ? string+0x40/0x100 [<ffffffff814fe323>] wait_for_common+0x123/0x180 [<ffffffff81060250>] ? default_wake_function+0x0/0x20 [<ffffffff8119661e>] ? ifind_fast+0x5e/0xb0 [<ffffffff814fe43d>] wait_for_completion+0x1d/0x20 [<ffffffff811f2e68>] sysfs_addrm_finish+0x228/0x270 [<ffffffff811f2fb3>] sysfs_remove_dir+0xa3/0xf0 [<ffffffff81273f66>] kobject_del+0x16/0x40 [<ffffffff8134cd14>] device_del+0x184/0x1e0 [<ffffffff8144e59b>] netdev_unregister_kobject+0xab/0xc0 [<ffffffff8143c05e>] rollback_registered+0xae/0x130 [<ffffffff8143c102>] unregister_netdevice+0x22/0x70 [<ffffffff8143c16e>] unregister_netdev+0x1e/0x30 [<ffffffffa02a91b0>] ipoib_remove_one+0xe0/0x120 [ib_ipoib] [<ffffffffa01ed95f>] ib_unregister_device+0x4f/0x100 [ib_core] [<ffffffffa021f5e1>] mlx4_ib_remove+0x41/0x180 [mlx4_ib] [<ffffffffa01ab771>] mlx4_remove_device+0x71/0x90 [mlx4_core] Fixes: 862096a8bbf8 ("IB/ipoib: Add more rtnl_link_ops callbacks") Cc: <stable@vger.kernel.org> # v3.6+ Cc: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
74226649 |
|
03-Nov-2016 |
Leon Romanovsky <leon@kernel.org> |
IB/ipoib: Remove and fix debug prints after allocation failure The prints after [k|v][m|z|c]alloc() functions are not needed, because in case of failure, allocator will print their internal error prints anyway. Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
0b59970e |
|
10-Nov-2016 |
Kamal Heib <kamalh@mellanox.com> |
IB/IPoIB: Remove can't use GFP_NOIO warning Remove the warning print of "can't use of GFP_NOIO" to avoid prints in each QP creation when devices aren't supporting IB_QP_CREATE_USE_GFP_NOIO. This print become more annoying when the IPoIB interface is configured to work in connected mode. Fixes: 09b93088d750 ('IB: Add a QP creation flag to use GFP_NOIO allocations') Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
fc791b63 |
|
13-Oct-2016 |
Paolo Abeni <pabeni@redhat.com> |
IB/ipoib: move back IB LL address into the hard header After the commit 9207f9d45b0a ("net: preserve IP control block during GSO segmentation"), the GSO CB and the IPoIB CB conflict. That destroy the IPoIB address information cached there, causing a severe performance regression, as better described here: http://marc.info/?l=linux-kernel&m=146787279825501&w=2 This change moves the data cached by the IPoIB driver from the skb control lock into the IPoIB hard header, as done before the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses"). In order to avoid GRO issue, on packet reception, the IPoIB driver stash into the skb a dummy pseudo header, so that the received packets have actually a hard header matching the declared length. To avoid changing the connected mode maximum mtu, the allocated head buffer size is increased by the pseudo header length. After this commit, IPoIB performances are back to pre-regression value. v2 -> v3: rebased v1 -> v2: avoid changing the max mtu, increasing the head buf size Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
546481c2 |
|
28-Aug-2016 |
Erez Shitrit <erezsh@mellanox.com> |
IB/ipoib: Fix memory corruption in ipoib cm mode connect flow When a new CM connection is being requested, ipoib driver copies data from the path pointer in the CM/tx object, the path object might be invalid at the point and memory corruption will happened later when now the CM driver will try using that data. The next scenario demonstrates it: neigh_add_path --> ipoib_cm_create_tx --> queue_work (pointer to path is in the cm/tx struct) #while the work is still in the queue, #the port goes down and causes the ipoib_flush_paths: ipoib_flush_paths --> path_free --> kfree(path) #at this point the work scheduled starts. ipoib_cm_tx_start --> copy from the (invalid)path pointer: (memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);) -> memory corruption. To fix that the driver now starts the CM/tx connection only if that specific path exists in the general paths database. This check is protected with the relevant locks, and uses the gid from the neigh member in the CM/tx object which is valid according to the ref count that was taken by the CM/tx. Fixes: 839fcaba35 ('IPoIB: Connected mode experimental support') Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
198b12f7 |
|
04-Jun-2016 |
Erez Shitrit <erezsh@mellanox.com> |
IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions In ipoib_remove_one the driver holds the rtnl_lock and tries to do some operation like dev_change_flags or unregister_netdev, while sysfs callback like ipoib_vlan_delete holds sysfs mutex and tries to hold the rtnl_lock via rtnl_trylock() and restart_syscall() if the lock is not free, meanwhile ipoib_remove_one tries to get the sysfs lock in order to free its sysfs directory, and we will get a->b, b->a deadlock. Trace like the following: schedule+0x37/0x80 schedule_preempt_disabled+0xe/0x10 __mutex_lock_slowpath+0xb5/0x120 mutex_lock+0x23/0x40 rtnl_lock+0x15/0x20 netdev_run_todo+0x17c/0x320 rtnl_unlock+0xe/0x10 ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib] delete_child+0x54/0x80 [ib_ipoib] dev_attr_store+0x18/0x30 sysfs_kf_write+0x37/0x40 mutex_lock+0x16/0x40 SyS_write+0x55/0xc0 entry_SYSCALL_64_fastpath+0x16/0x75 And schedule+0x37/0x80 __kernfs_remove+0x1a8/0x260 ? wake_atomic_t_function+0x60/0x60 kernfs_remove+0x25/0x40 sysfs_remove_dir+0x50/0x80 kobject_del+0x18/0x50 device_del+0x19f/0x260 netdev_unregister_kobject+0x6a/0x80 rollback_registered_many+0x1fd/0x340 rollback_registered+0x3c/0x70 unregister_netdevice_queue+0x55/0xc0 unregister_netdev+0x20/0x30 ipoib_remove_one+0x114/0x1b0 [ib_ipoib] ib_unregister_client+0x4a/0x170 [ib_core] ? find_module_all+0x71/0xa0 ipoib_cleanup_module+0x10/0x94 [ib_ipoib] SyS_delete_module+0x1b5/0x210 entry_SYSCALL_64_fastpath+0x16/0x75 The fix is by checking the flag IPOIB_FLAG_INTF_ON_DESTROY in order to get out from the sysfs function. Fixes: 862096a8bbf8 ("IB/ipoib: Add more rtnl_link_ops callbacks") Fixes: 9baa0b036410 ("IB/ipoib: Add rtnl_link_ops support") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
860e9538 |
|
03-May-2016 |
Florian Westphal <fw@strlen.de> |
treewide: replace dev->trans_start update with helper Replace all trans_start updates with netif_trans_update helper. change was done via spatch: struct net_device *d; @@ - d->trans_start = jiffies + netif_trans_update(d) Compile tested only. Cc: user-mode-linux-devel@lists.sourceforge.net Cc: linux-xtensa@linux-xtensa.org Cc: linux1394-devel@lists.sourceforge.net Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: MPT-FusionLinux.pdl@broadcom.com Cc: linux-scsi@vger.kernel.org Cc: linux-can@vger.kernel.org Cc: linux-parisc@vger.kernel.org Cc: linux-omap@vger.kernel.org Cc: linux-hams@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: linux-wireless@vger.kernel.org Cc: linux-s390@vger.kernel.org Cc: devel@driverdev.osuosl.org Cc: b.a.t.m.a.n@lists.open-mesh.org Cc: linux-bluetooth@vger.kernel.org Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Acked-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
78a50a5e |
|
02-Mar-2016 |
Hans Westgaard Ry <hans.westgaard.ry@oracle.com> |
IB/ipoib: Add handling for sending of skb with many frags IPoIB converts skb-fragments to sge adding 1 extra sge when SG is enabled. Current codepath assumes that the max number of sge a device support is at least MAX_SKB_FRAGS+1, there is no interaction with upper layers to limit number of fragments in an skb if a device suports fewer sges. The assumptions also lead to requesting a fixed number of sge when IPoIB creates queue-pairs with SG enabled. A fallback/slowpath is implemented using skb_linearize to handle cases where the conversion would result in more sges than supported. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
4a061b28 |
|
18-Dec-2015 |
Or Gerlitz <ogerlitz@mellanox.com> |
IB/ulps: Avoid calling ib_query_device Instead, use the cached copy of the attributes present on the device. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
14d3a3b2 |
|
11-Dec-2015 |
Christoph Hellwig <hch@lst.de> |
IB: add a proper completion queue abstraction This adds an abstraction that allows ULPs to simply pass a completion object and completion callback with each submitted WR and let the RDMA core handle the nitty gritty details of how to handle completion interrupts and poll the CQ. In detail there is a new ib_cqe structure which just contains the completion callback, and which can be used to get at the containing object using container_of. It is pointed to by the WR and WC as an alternative to the wr_id field, similar to how many ULPs already use the field to store a pointer using casts. A driver using the new completion callbacks allocates it's CQs using the new ib_create_cq API, which in addition to the number of CQEs and the completion vectors also takes a mode on how we poll for CQEs. Three modes are available: direct for drivers that never take CQ interrupts and just poll for them, softirq to poll from softirq context using the to be renamed blk-iopoll infrastructure which takes care of rearming and budgeting, or a workqueue for consumer who want to be called from user context. Thanks a lot to Sagi Grimberg who helped reviewing the API, wrote the current version of the workqueue code because my two previous attempts sucked too much and converted the iSER initiator to the new API. Signed-off-by: Christoph Hellwig <hch@lst.de>
|
#
e622f2f4 |
|
08-Oct-2015 |
Christoph Hellwig <hch@lst.de> |
IB: split struct ib_send_wr This patch split up struct ib_send_wr so that all non-trivial verbs use their own structure which embedds struct ib_send_wr. This dramaticly shrinks the size of a WR for most common operations: sizeof(struct ib_send_wr) (old): 96 sizeof(struct ib_send_wr): 48 sizeof(struct ib_rdma_wr): 64 sizeof(struct ib_atomic_wr): 96 sizeof(struct ib_ud_wr): 88 sizeof(struct ib_fast_reg_wr): 88 sizeof(struct ib_bind_mw_wr): 96 sizeof(struct ib_sig_handover_wr): 80 And with Sagi's pending MR rework the fast registration WR will also be down to a reasonable size: sizeof(struct ib_fastreg_wr): 64 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt] Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc] Tested-by: Haggai Eran <haggaie@mellanox.com> Tested-by: Sagi Grimberg <sagig@mellanox.com> Tested-by: Steve Wise <swise@opengridcomputing.com>
|
#
77b1f996 |
|
30-Jul-2015 |
Jason Gunthorpe <jgg@ziepe.ca> |
IB/ipoib: Remove ib_get_dma_mr calls The pd now has a local_dma_lkey member which completely replaces ib_get_dma_mr, use it instead. Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
73fec7fd |
|
30-Jul-2015 |
Haggai Eran <haggaie@mellanox.com> |
IB/cm: Remove compare_data checks Now that there are no ib_cm clients using the compare_data feature for matching IB CM requests' private data, remove the compare_data parameter of ib_cm_listen and remove the code implementing the feature. Signed-off-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
c4268778 |
|
12-Jul-2015 |
Yuval Shaia <yuval.shaia@oracle.com> |
IB/ipoib: Scatter-Gather support in connected mode By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better performance. This MTU plus overhead puts the memory allocation for IP based packets at 32 4k pages (order 5), which have to be contiguous. When the system memory under pressure, it was observed that allocating 128k contiguous physical memory is difficult and causes serious errors (such as system becomes unusable). This enhancement resolve the issue by removing the physically contiguous memory requirement using Scatter/Gather feature that exists in Linux stack. With this fix Scatter-Gather will be supported also in connected mode. This change reverts some of the change made in commit e112373fd6aa ("IPoIB/cm: Reduce connected mode TX object size"). The ability to use SG in IPoIB CM is possible because the coupling between NETIF_F_SG and NETIF_F_CSUM was removed in commit ec5f06156423 ("net: Kill link between CSUM and SG features.") Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Christian Marie <christian@ponies.io> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
8f71c1a2 |
|
05-May-2015 |
Bart Van Assche <bvanassche@acm.org> |
IPoIB/CM: Fix indentation level See also patch "IPoIB/cm: Add connected mode support for devices without SRQs" (commit ID 68e995a29572). Detected by smatch. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
0b39578b |
|
21-Feb-2015 |
Doug Ledford <dledford@redhat.com> |
IB/ipoib: Use dedicated workqueues per interface During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue means that we would also have to prevent all races between different devices (for instance, ipoib_mcast_restart_task on interface ib0 can race with ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on the rtnl_lock). There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford@redhat.com>
|
#
0306eda2 |
|
30-Jan-2015 |
Roland Dreier <roland@purestorage.com> |
Revert "IPoIB: Use dedicated workqueues per interface" This reverts commit 5141861cd5e17eac9676ff49c5abfafbea2b0e98. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
5141861c |
|
10-Dec-2014 |
Doug Ledford <dledford@redhat.com> |
IPoIB: Use dedicated workqueues per interface During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue, we can have one device in the middle of a down and holding the rtnl lock and another totally unrelated device needing to run mcast_restart_task, which wants the rtnl lock and will loop trying to take it unless is sees its own FLAG_ADMIN_UP flag go away. Because the unrelated interface will never see its own ADMIN_UP flag drop, the interface going down will deadlock trying to flush the queue. There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go away, then maybe after a few tries on the rtnl lock we can queue our own task back up as a delayed work and return and avoid dropping work on the floor that way. But I'm not 100% convinced that we won't cause other problems. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
09b93088 |
|
11-May-2014 |
Or Gerlitz <ogerlitz@mellanox.com> |
IB: Add a QP creation flag to use GFP_NOIO allocations This addresses a problem where NFS client writes over IPoIB connected mode may deadlock on memory allocation/writeback. The problem is not directly memory reclamation. There is an indirect dependency between network filesystems writing back pages and ipoib_cm_tx_init() due to how a kworker is used. Page reclaim cannot make forward progress until ipoib_cm_tx_init() succeeds and it is stuck in page reclaim itself waiting for network transmission. Ordinarily this situation may be avoided by having the caller use GFP_NOFS but ipoib_cm_tx_init() does not have that information. To address this, take a general approach and add a new QP creation flag that tells the low-level hardware driver to use GFP_NOIO for the memory allocations related to the new QP. Use the new flag in the ipoib connected mode path, and if the driver doesn't support it, re-issue the QP creation without the flag. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
22252b4e |
|
16-Oct-2013 |
Tal Alon <talal@mellanox.com> |
IPoIB: Change CM skb memory allocation to be non-atomic during init Change CM skb memory allocation to use GFP_KERNEL when possible. During device init there's no need to use GFP_ATOMIC when allocating memory for the CM skbs -- use GFP_KERNEL instead. Signed-off-by: Tal Alon <talal@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
49b8e744 |
|
08-Aug-2013 |
Jim Foraker <foraker1@llnl.gov> |
IPoIB: Fix race in deleting ipoib_neigh entries In several places, this snippet is used when removing neigh entries: list_del(&neigh->list); ipoib_neigh_free(neigh); The list_del() removes neigh from the associated struct ipoib_path, while ipoib_neigh_free() removes neigh from the device's neigh entry lookup table. Both of these operations are protected by the priv->lock spinlock. The table however is also protected via RCU, and so naturally the lock is not held when doing reads. This leads to a race condition, in which a thread may successfully look up a neigh entry that has already been deleted from neigh->list. Since the previous deletion will have marked the entry with poison, a second list_del() on the object will cause a panic: #5 [ffff8802338c3c70] general_protection at ffffffff815108c5 [exception RIP: list_del+16] RIP: ffffffff81289020 RSP: ffff8802338c3d20 RFLAGS: 00010082 RAX: dead000000200200 RBX: ffff880433e60c88 RCX: 0000000000009e6c RDX: 0000000000000246 RSI: ffff8806012ca298 RDI: ffff880433e60c88 RBP: ffff8802338c3d30 R8: ffff8806012ca2e8 R9: 00000000ffffffff R10: 0000000000000001 R11: 0000000000000000 R12: ffff8804346b2020 R13: ffff88032a3e7540 R14: ffff8804346b26e0 R15: 0000000000000246 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 #6 [ffff8802338c3d38] ipoib_cm_tx_handler at ffffffffa066fe0a [ib_ipoib] #7 [ffff8802338c3d98] cm_process_work at ffffffffa05149a7 [ib_cm] #8 [ffff8802338c3de8] cm_work_handler at ffffffffa05161aa [ib_cm] #9 [ffff8802338c3e38] worker_thread at ffffffff81090e10 #10 [ffff8802338c3ee8] kthread at ffffffff81096c66 #11 [ffff8802338c3f48] kernel_thread at ffffffff8100c0ca We move the list_del() into ipoib_neigh_free(), so that deletion happens only once, after the entry has been successfully removed from the lookup table. This same behavior is already used in ipoib_del_neighs_by_gid() and __ipoib_reap_neigh(). Signed-off-by: Jim Foraker <foraker1@llnl.gov> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com> Reviewed-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
50bea5c0 |
|
07-May-2013 |
Andrew Morton <akpm@linux-foundation.org> |
drivers/infiniband/hw: rename random32() to prandom_u32() Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
cc529c0d |
|
03-Mar-2013 |
Akinobu Mita <akinobu.mita@gmail.com> |
RDMA: Rename random32() to prandom_u32() Use more preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: Steve Wise <swise@chelsio.com> Cc: linux-rdma@vger.kernel.org Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
1ee9e2aa |
|
26-Feb-2013 |
Mike Marciniszyn <mike.marciniszyn@intel.com> |
IPoIB: Fix send lockup due to missed TX completion Commit f0dc117abdfa ("IPoIB: Fix TX queue lockup with mixed UD/CM traffic") attempts to solve an issue where unprocessed UD send completions can deadlock the netdev. The patch doesn't fully resolve the issue because if more than half the tx_outstanding's were UD and all of the destinations are RC reachable, arming the CQ doesn't solve the issue. This patch uses the IB_CQ_REPORT_MISSED_EVENTS on the ib_req_notify_cq(). If the rc is above 0, the UD send cq completion callback is called directly to re-arm the send completion timer. This issue is seen in very large parallel filesystem deployments and the patch has been shown to correct the issue. Cc: <stable@vger.kernel.org> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
7e5a90c2 |
|
04-Feb-2013 |
Shlomo Pongratz <shlomop@mellanox.com> |
IPoIB: Fix crash due to skb double destruct After commit b13912bbb4a2 ("IPoIB: Call skb_dst_drop() once skb is enqueued for sending"), using connected mode and running multithreaded iperf for long time, ie iperf -c <IP> -P 16 -t 3600 results in a crash. After the above-mentioned patch, the driver is calling skb_orphan() and skb_dst_drop() after calling post_send() in ipoib_cm.c::ipoib_cm_send() (also in ipoib_ib.c::ipoib_send()) The problem with this is, as is written in a comment in both routines, "it's entirely possible that the completion handler will run before we execute anything after the post_send()." This leads to running the skb cleanup routines simultaneously in two different contexts. The solution is to always perform the skb_orphan() and skb_dst_drop() before queueing the send work request. If an error occurs, then it will be no different than the regular case where dev_free_skb_any() in the completion path, which is assumed to be after these two routines. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
b13912bb |
|
19-Dec-2012 |
Roland Dreier <roland@purestorage.com> |
IPoIB: Call skb_dst_drop() once skb is enqueued for sending Currently, IPoIB delays collecting send completions for TX packets in order to batch work more efficiently. It does skb_orphan() right after queuing the packets so that destructors run early, to avoid problems like holding socket send buffers for too long (since we might not collect a send completion until a long time after the packet is actually sent). However, IPoIB clears IFF_XMIT_DST_RELEASE because it actually looks at skb_dst() to update the PMTU when it gets a too-long packet. This means that the packets sitting in the TX ring with uncollected send completions are holding a reference on the dst. We've seen this lead to pathological behavior with respect to route and neighbour GC. The easy fix for this is to call skb_dst_drop() when we call skb_orphan(). Also, give packets sent via connected mode (CM) the same skb_orphan() / skb_dst_drop() treatment that packets sent via datagram mode get. Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
71d9c5f9 |
|
02-Oct-2012 |
Roland Dreier <roland@purestorage.com> |
IPoIB: Fix build with CONFIG_INFINIBAND_IPOIB_CM=n With the new netlink support in commit 862096a8bbf8 ("IB/ipoib: Add more rtnl_link_ops callbacks") we need ipoib_set_mode() to be available even if connected mode isn't built. Move the function from ipoib_cm.c to ipoib_main.c (and make a few CM-related macros available unconditonally). This fixes the build error drivers/built-in.o: In function 'ipoib_changelink': ipoib_netlink.c:(.text+0x6a5fc9): undefined reference to 'ipoib_set_mode' ipoib_netlink.c:(.text+0x6a5fe3): undefined reference to 'ipoib_set_mode' when CONFIG_INFINIBAND_IPOIB_CM isn't set. Reported-by: Randy Dunlap <rdunlap@xenotime.net> Reported-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
862096a8 |
|
26-Sep-2012 |
Or Gerlitz <ogerlitz@mellanox.com> |
IB/ipoib: Add more rtnl_link_ops callbacks Add the rtnl_link_ops changelink and fill_info callbacks, through which the admin can now set/get the driver mode, etc policies. Maintain the proprietary sysfs entries only for legacy childs. For child devices, set dev->iflink to point to the parent device ifindex, such that user space tools can now correctly show the uplink relation as done for vlan, macvlan, etc devices. Pointed out by Patrick McHardy <kaber@trash.net> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fa16ebed |
|
13-Aug-2012 |
Shlomo Pongratz <shlomop@mellanox.com> |
IB/ipoib: Add missing locking when CM object is deleted Commit b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path") introduced a bug where in ipoib_cm_destroy_tx() a CM object is moved between lists without any supported locking. Under a stress test, this eventually leads to list corruption and a crash. Previously when this routine was called, callers were taking the device priv lock. Currently this function is called from the RCU callback associated with neighbour deletion. Fix the race by taking the same lock we used to before. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
b63b70d8 |
|
24-Jul-2012 |
Shlomo Pongratz <shlomop@mellanox.com> |
IPoIB: Use a private hash table for path lookup in xmit path Dave Miller <davem@davemloft.net> provided a detailed description of why the way IPoIB is using neighbours for its own ipoib_neigh struct is buggy: Any time an ipoib_neigh is changed, a sequence like the following is made: spin_lock_irqsave(&priv->lock, flags); /* * It's safe to call ipoib_put_ah() inside * priv->lock here, because we know that * path->ah will always hold one more reference, * so ipoib_put_ah() will never do more than * decrement the ref count. */ if (neigh->ah) ipoib_put_ah(neigh->ah); list_del(&neigh->list); ipoib_neigh_free(dev, neigh); spin_unlock_irqrestore(&priv->lock, flags); ipoib_path_lookup(skb, n, dev); This doesn't work, because you're leaving a stale pointer to the freed up ipoib_neigh in the special neigh->ha pointer cookie. Yes, it even fails with all the locking done to protect _changes_ to *ipoib_neigh(n), and with the code in ipoib_neigh_free() that NULLs out the pointer. The core issue is that read side calls to *to_ipoib_neigh(n) are not being synchronized at all, they are performed without any locking. So whether we hold the lock or not when making changes to *ipoib_neigh(n) you still can have threads see references to freed up ipoib_neigh objects. cpu 1 cpu 2 n = *ipoib_neigh() *ipoib_neigh() = NULL kfree(n) n->foo == OOPS [..] Perhaps the ipoib code can have a private path database it manages entirely itself, which holds all the necessary information and is looked up by some generic key which is available easily at transmit time and does not involve generic neighbour entries. See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and <http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b> for the full discussion. This patch aims to solve the race conditions found in the IPoIB driver. The patch removes the connection between the core networking neighbour structure and the ipoib_neigh structure. In addition to avoiding the race described above, it allows us to handle SKBs carrying IP packets that don't have any associated neighbour. We add an ipoib_neigh hash table with N buckets where the key is the destination hardware address. The ipoib_neigh is fetched from the hash table and instead of the stashed location in the neighbour structure. The hash table uses both RCU and reference counting to guarantee that no ipoib_neigh instance is ever deleted while in use. Fetching the ipoib_neigh structure instance from the hash also makes the special code in ipoib_start_xmit that handles remote and local bonding failover redundant. Aged ipoib_neigh instances are deleted by a garbage collection task that runs every M seconds and deletes every ipoib_neigh instance that was idle for at least 2*M seconds. The deletion is safe since the ipoib_neigh instances are protected using RCU and reference count mechanisms. The number of buckets (N) and frequency of running the GC thread (M), are taken from the exported arb_tbl. Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
6700c270 |
|
17-Jul-2012 |
David S. Miller <davem@davemloft.net> |
net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() This will be used so that we can compose a full flow key. Even though we have a route in this context, we need more. In the future the routes will be without destination address, source address, etc. keying. One ipv4 route will cover entire subnets, etc. In this environment we have to have a way to possess persistent storage for redirects and PMTU information. This persistent storage will exist in the FIB tables, and that's why we'll need to be able to rebuild a full lookup flow key here. Using that flow key will do a fib_lookup() and create/update the persistent entry. Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
d90f9b35 |
|
05-Jul-2012 |
Roland Dreier <roland@purestorage.com> |
IB: Use IS_ENABLED(CONFIG_IPV6) Instead of testing defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
fec14d2f |
|
29-Aug-2011 |
Paul Gortmaker <paul.gortmaker@windriver.com> |
infiniband: add moduleparam.h to drivers/infiniband as required These files were getting the moduleparam infrastructure from the implicit presence of module.h being everywhere, but that is going away soon. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
|
#
9e903e08 |
|
18-Oct-2011 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: add skb frag size accessors To ease skb->truesize sanitization, its better to be able to localize all references to skb frags size. Define accessors : skb_frag_size() to fetch frag size, and skb_frag_size_{set|add|sub}() to manipulate it. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
787adb9d |
|
18-Oct-2011 |
Dotan Barak <dotanb@dev.mellanox.co.il> |
IPoIB: Use the right function to do DMA unmap pages Pages that were mapped using ib_dma_map_page() should be unmapped using ib_dma_unmap_page(). Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
96104eda |
|
23-May-2011 |
Sean Hefty <sean.hefty@intel.com> |
RDMA/core: Add SRQ type field Currently, there is only a single ("basic") type of SRQ, but with XRC support we will add a second. Prepare for this by defining an SRQ type and setting all current users to IB_SRQT_BASIC. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
|
#
5581be3b |
|
24-Aug-2011 |
Ian Campbell <Ian.Campbell@citrix.com> |
IPoIB: convert to SKB paged frag API. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hal Rosenstock <hal.rosenstock@gmail.com> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
3d96c74d |
|
18-Apr-2011 |
Michał Mirosław <mirq-linux@rere.qmqm.pl> |
net: infiniband/ulp/ipoib: convert to hw_features Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
948579cd |
|
04-Nov-2010 |
Joe Perches <joe@perches.com> |
RDMA: Use vzalloc() to replace vmalloc()+memset(0) Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
8ae31e5b |
|
10-Jan-2011 |
Or Gerlitz <ogerlitz@voltaire.com> |
IPoIB: Add GRO support Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
5a0e3ad6 |
|
24-Mar-2010 |
Tejun Heo <tj@kernel.org> |
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
|
#
a48f509b |
|
04-Mar-2010 |
Or Gerlitz <ogerlitz@voltaire.com> |
IPoIB: Include return code in trace message for ib_post_send() failures Print the return code of ib_post_send() if it fails to make these debugging messages more useful. Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
f0dc117a |
|
02-Mar-2010 |
Eli Cohen <eli@mellanox.co.il> |
IPoIB: Fix TX queue lockup with mixed UD/CM traffic The IPoIB UD QP reports send completions to priv->send_cq, which is usually left unarmed; it only gets armed when the number of outstanding send requests reaches the size of the TX queue. This arming is done only in the send path for the UD QP. However, when sending CM packets, the net queue may be stopped for the same reasons but no measures are taken to recover the UD path from a lockup. Consider this scenario: a host sends high rate of both CM and UD packets, with a TX queue length of N. If at some time the number of outstanding UD packets is more than N/2 and the overall outstanding packets is N-1, and CM sends a packet (making the number of outstanding sends equal N), the TX queue will be stopped. When all the CM packets complete, the number of outstanding packets will still be higher than N/2 so the TX queue will not be restarted. Fix this by calling ib_req_notify_cq() when the queue is stopped in the CM path. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
3ffe533c |
|
18-Feb-2010 |
Alexey Dobriyan <adobriyan@gmail.com> |
ipv6: drop unused "dev" arg of icmpv6_send() Dunno, what was the idea, it wasn't used for a long time. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cd0bcf4c |
|
05-Sep-2009 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Remove unused <rdma/ib_cache.h> includes Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
451f1443 |
|
31-Aug-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
drivers: Kill now superfluous ->last_rx stores The generic packet receive code takes care of setting netdev->last_rx when necessary, for the sake of the bonding ARP monitor. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Neil Horman <nhorman@txudriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
adf30907 |
|
01-Jun-2009 |
Eric Dumazet <eric.dumazet@gmail.com> |
net: skb->dst accessors Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
26574401 |
|
13-May-2009 |
Eric W. Biederman <ebiederm@xmission.com> |
net: Fix ipoib rtnl_lock sysfs deadlock. Network device sysfs files that grab the rtnl_lock unconditionally will deadlock if accessed when the network device is being unregistered. So use trylock and syscall_restart to avoid this deadlock. Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
5b095d989 |
|
29-Oct-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
net: replace %p6 with %pI6 Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
fcace2fe |
|
28-Oct-2008 |
Harvey Harrison <harvey.harrison@gmail.com> |
infiniband: ipoib replace IPOIB_GID_FMT with %p6 Replace all uses of IPOIB_GID_FMT, IPOIB_GID_RAW_ARG() and IPOIB_GID_ARG() Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
943c246e |
|
30-Sep-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Use netif_tx_lock() and get rid of private tx_lock, LLTX Currently, IPoIB is an LLTX driver that uses its own IRQ-disabling tx_lock. Not only do we want to get rid of LLTX, this actually causes problems because of the skb_orphan() done with this tx_lock held: some skb destructors expect to be run with interrupts enabled. The simplest fix for this is to get rid of the driver-private tx_lock and stop using LLTX. We kill off priv->tx_lock and use netif_tx_lock[_bh]() instead; the patch to do this is a tiny bit tricky because we need to update places that take priv->lock inside the tx_lock to disable IRQs, rather than relying on tx_lock having already disabled IRQs. Also, there are a couple of places where we need to disable BHs to make sure we have a consistent context to call netif_tx_lock() (since we no longer can use _irqsave() variants), and we also have to change ipoib_send_comp_handler() to call drain_tx_cq() through a timer rather than directly, because ipoib_send_comp_handler() runs in interrupt context and drain_tx_cq() must run in BH context so it can call netif_tx_lock(). Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
b1404069 |
|
08-Aug-2008 |
David J. Wilder <dwilder@us.ibm.com> |
IPoIB/cm: Use vmalloc() to allocate rx_rings There are users that are running UDP applications that require a large receive queue size in order to get good performance. To prevent allocation failures for rx_rings when using non-SRQ mode and large recv_queue_size (1K or larger), use vmalloc() instead of kcalloc() to alocate rx_rings. Signed-off-by: David Wilder <dwilder@us.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
e0819816 |
|
30-Jul-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Set correct SG list in ipoib_cm_init_rx_wr() wr->sg_list should be set to the sge pointer passed in, not priv->cm.rx_sge. Reported-by: Hoang-Nam Nguyen <HNGUYEN@de.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
e112373f |
|
15-Jul-2008 |
Eli Cohen <eli@dev.mellanox.co.il> |
IPoIB/cm: Reduce connected mode TX object size Since IPoIB connected mode does not NETIF_F_SG, we only have one DMA mapping per send, so we don't need a mapping[] array. Define a new struct with a single u64 mapping member and use it for the CM tx_ring. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
bd360671 |
|
15-Jul-2008 |
Eli Cohen <eli@mellanox.co.il> |
IPoIB: Use dev_set_mtu() to change mtu When the driver sets the MTU of the net device outside of its change_mtu method, it should make use of dev_set_mtu() instead of directly setting the mtu field of struct netdevice. Otherwise functions registered to be called upon MTU change will not get called (this is done through call_netdevice_notifiers() in dev_set_mtu()). Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
c8c2afe3 |
|
15-Jul-2008 |
Eli Cohen <eli@mellanox.co.il> |
IPoIB: Use rtnl lock/unlock when changing device flags Use of this lock is required to synchronize changes to the netdvice's data structs. Also move the call to ipoib_flush_paths() after the modification of the netdevice flags in set_mode(). Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
a7d834c4 |
|
15-Jul-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Fix racy use of receive WR/SGL in ipoib_cm_post_receive_nonsrq() For devices that don't support SRQs, ipoib_cm_post_receive_nonsrq() is called from both ipoib_cm_handle_rx_wc() and ipoib_cm_nonsrq_init_rx(), and these two callers are not synchronized against each other. However, ipoib_cm_post_receive_nonsrq() always reuses the same receive work request and scatter list structures, so multiple callers can end up stepping on each other, which leads to posting garbled work requests. Fix this by having the caller pass in the ib_recv_wr and ib_sge structures to use, and allocating new local structures in ipoib_cm_nonsrq_init_rx(). Based on a patch by Pradeep Satyanarayana <pradeep@us.ibm.com> and David Wilder <dwilder@us.ibm.com>, with debugging help from Hoang-Nam Nguyen <hnguyen@de.ibm.com>. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
f89271da |
|
15-Jul-2008 |
Eli Cohen <eli@dev.mellanox.co.il> |
IPoIB: Copy small received SKBs in connected mode The connected mode implementation in the IPoIB driver has a large overhead in the way SKBs are handled in the receive flow. It usually allocates an SKB with as big as was used in the currently received SKB and moves unused fragments from the old SKB to the new one. This involves a loop on all the remaining fragments and incurs overhead on the CPU. This patch, for small SKBs, allocates an SKB just large enough to contain the received data and copies to it the data from the received SKB. The newly allocated SKB is passed to the stack and the old SKB is reposted. When running netperf, UDP small messages, without this pach I get: UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 14.4.3.178 (14.4.3.178) port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 114688 128 10.00 5142034 0 526.31 114688 10.00 1130489 115.71 With this patch I get both send and receive at ~315 mbps. The reason that send performance actually slows down is as follows: When using this patch, the overhead of the CPU for handling RX packets is dramatically reduced. As a result, we do not experience RNR NAK messages from the receiver which cause the connection to be closed and reopened again; when the patch is not used, the receiver cannot handle the packets fast enough so there is less time to post new buffers and hence the mentioned RNR NACKs. So what happens is that the application *thinks* it posted a certain number of packets for transmission but these packets are flushed and do not really get transmitted. Since the connection gets opened and closed many times, each time netperf gets the CPU time that otherwise would have been given to IPoIB to actually transmit the packets. This can be verified when looking at the port counters -- the output of ifconfig and the oputput of netperf (this is for the case without the patch): tx packets ========== port counter: 1,543,996 ifconfig: 1,581,426 netperf: 5,142,034 rx packets ========== netperf 1,1304,089 Signed-off-by: Eli Cohen <eli@mellanox.co.il>
|
#
f3781d2e |
|
15-Jul-2008 |
Roland Dreier <rolandd@cisco.com> |
RDMA: Remove subversion $Id tags They don't get updated by git and so they're worse than useless. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
f56bcd80 |
|
29-Apr-2008 |
Eli Cohen <eli@dev.mellanox.co.il> |
IPoIB: Use separate CQ for UD send completions Use a dedicated CQ for UD send completions. Also, do not arm the UD send CQ, which reduces the number of interrupts generated. This patch farther reduces overhead by not calling poll CQ for every posted send WR -- it does polls only when there 16 or more outstanding work requests. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
9fdd5e5b |
|
16-Apr-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Handle case when P_Key is deleted and re-added at same index If a P_Key is deleted and then re-added at the same index, then IPoIB gets confused because __ipoib_ib_dev_flush() only checks whether the index is the same without checking whether the P_Key was present, so the interface is stopped when the P_Key is deleted, but the event when the P_Key is re-added gets ignored and the interface never gets restarted. Also, switch to using ib_find_pkey() instead of ib_find_cached_pkey() everywhere in IPoIB, since none of the places that look for P_Keys are in a fast path or in non-sleeping context, and in general we want to kill off the whole caching infrastructure eventually. This also fixes consistency problems caused because some IPoIB queries were cached and some were uncached during the window where the cache was not updated. Thanks to Venkata Subramonyam <vsubramo@cisco.com> for debugging this problem and testing this fix. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
40ca1988 |
|
16-Apr-2008 |
Eli Cohen <eli@dev.mellanox.co.il> |
IPoIB: Add LSO support For HCAs that support TCP segmentation offload (IB_DEVICE_UD_TSO), set NETIF_F_TSO and use HW LSO to offload TCP segmentation. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
6046136c |
|
16-Apr-2008 |
Eli Cohen <eli@dev.mellanox.co.il> |
IPoIB: Use checksum offload support if available For HCAs that support checksum offload (ie that set IB_DEVICE_UD_IP_CSUM in the device capabilities flags), have IPoIB set NETIF_F_IP_CSUM and use the HCA to generate and verify IP checksums. Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
10313cbb |
|
12-Mar-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Allocate priv->tx_ring with vmalloc() Commit 7143740d ("IPoIB: Add send gather support") made struct ipoib_tx_buf significantly larger, since the mapping member changed from a single u64 to an array with MAX_SKB_FRAGS + 1 entries. This means that allocating tx_rings with kzalloc() may fail because there is not enough contiguous memory for the new, much bigger size. Fix this regression by allocating the rings with vmalloc() instead. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
4200406b |
|
11-Mar-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Set tx_wr.num_sge in connected mode post_send() Commit 7143740d ("IPoIB: Add send gather support") made it possible for tx_wr.num_sge to be != 1 -- this happens if send gather support is enabled. However, the code in the connected mode post_send() function assumes the old invariant, namely that tx_wr.num_sge is always 1. Fix this by explicitly setting tx_wr.num_sge to 1 in the CM post_send(). Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
ec229e5e |
|
12-Feb-2008 |
Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> |
IPoIB/cm: Fix ipoib_cm_dev_stop() cleanup when drain times out Commit efcd9971 ("IPoIB/cm: Factor out ipoib_cm_free_rx_reap_list()") introduced a bug in ipoib_cm_dev_stop() when the receive drain times out. In that case, the function moves all the pending rx stuff into a private list but then calls ipoib_cm_free_rx_reap_list(), which handles a different list. Fix this by moving everything to the rx_reap_list that will actually get freed up. This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=906>. Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
7143740d |
|
30-Jan-2008 |
Eli Cohen <eli@mellanox.co.il> |
IPoIB: Add send gather support This patch acts as a preparation for using checksum offload for IB devices capable of inserting/verifying checksum in IP packets. The patch does not actaully turn on NETIF_F_SG - we defer that to the patches adding checksum offload capabilities. We only add support for send gathers for datagram mode, since existing HW does not support checksum offload on connected QPs. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Eli Cohen <eli@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
586a6934 |
|
21-Dec-2007 |
Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> |
IPoIB/CM: Enable SRQ support on HCAs that support fewer than 16 SG entries Some HCAs (such as ehca2) support SRQ, but only support fewer than 16 SG entries for SRQs. Currently IPoIB/CM implicitly assumes all HCAs will support 16 SG entries for SRQs (to handle a 64K MTU with 4K pages). This patch removes that restriction by limiting the maximum MTU in connected mode to what the maximum number of SRQ SG entries allows. This patch addresses <https://bugs.openfabrics.org/show_bug.cgi?id=728> Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
68e995a2 |
|
25-Jan-2008 |
Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> |
IPoIB/cm: Add connected mode support for devices without SRQs Some IB adapters (notably IBM's eHCA) do not implement SRQs (shared receive queues). The current IPoIB connected mode support only works on devices that support SRQs. Fix this by adding support for using the receive queue of each connected mode receive QP. The disadvantage of this compared to using an SRQ is that it means a full queue of receives must be posted for each remote connected mode peer, which means that total memory usage is potentially much higher than when using SRQs. To manage this, add a new module parameter "max_nonsrq_conn_qp" that limits the number of connections allowed per interface. The rest of the changes are fairly straightforward: we use a table of struct ipoib_cm_rx to hold all the active connections, and put the table index of the connection in the high bits of receive WR IDs. This is needed because we cannot rely on the struct ib_wc.qp field for non-SRQ receive completions. Most of the rest of the changes just test whether or not an SRQ is available, and post receives or find received packets in the right place depending on the answer. Cleaning up dead connections actually becomes simpler, because we do not have to do the "last WQE reached" dance that is required to destroy QPs attached to an SRQ. We just move the QP to the error state and wait for all pending receives to be flushed. Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> [ Completely rewritten and split up, based on Pradeep's work. Several bugs fixed and no doubt several bugs introduced. - Roland ] Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
efcd9971 |
|
25-Jan-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Factor out ipoib_cm_free_rx_reap_list() Factor out the code for going through the rx_reap list of struct ipoib_cm_rx and freeing each one. This consolidates the code duplicated between ipoib_cm_dev_stop() and ipoib_cm_rx_reap() and reduces the risk of error when adding additional accounting. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
7b3687df |
|
25-Jan-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Factor out ipoib_cm_create_srq() Factor out the code to create an SRQ and allocate the receive ring in ipoib_cm_dev_init() into a new function ipoib_cm_create_srq(). This will make the code neater when support for devices that don't implement SRQs is added. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
1efb6144 |
|
25-Jan-2008 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Factor out ipoib_cm_free_rx_ring() Factor out the code to unmap/free skbs and free the receive ring in ipoib_cm_dev_cleanup() into a new function ipoib_cm_free_rx_ring(). This function will be called from a couple of other places when support for devices that don't implement SRQs is added. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
2337f809 |
|
23-Oct-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Trivial formatting cleanups Fix whitespace blunders, convert "foo* bar" to "foo *bar", etc. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
09f60f8f |
|
26-Oct-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Fix receive QP cleanup Commit 1b524963 ("IPoIB/cm: Use common CQ for CM send completions") changed how the high-order bits of work request IDs were used, which had the effect that IPOIB_CM_RX_DRAIN_WRID was no longer handled as a connected mode receive completion. This leads to the messages ib1: cm send completion event with wrid 1073741823 (> 64) ib1: RX drain timing out when an interface with connected mode QPs is brought down. Fix this by making sure that both IPOIB_OP_CM and IPOIB_OP_RECV are set in IPOIB_CM_RX_DRAIN_WRID. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
1b524963 |
|
16-Aug-2007 |
Michael S. Tsirkin <mst@mellanox.co.il> |
IPoIB/cm: Use common CQ for CM send completions Use the same CQ for CM send completions as for all other IPoIB completions. This means all completions are processed via the same NAPI polling routine. This should help reduce the number of interrupts for bi-directional traffic (such as TCP) and fixes "driver is hogging interrupts" errors reported for IPoIB send side, e.g. <https://bugs.openfabrics.org/show_bug.cgi?id=508> To do this, keep a per-interface counter of outstanding send WRs, and stop the interface when this counter reaches the send queue size to avoid CQ overruns. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
fd312561 |
|
17-Oct-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Rewrite "if (!likely(...))" as "if (unlikely(!(...)))" It's too hard to figure out what "!likely(...)" really means, and who knows how compilers interpret the hint. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
de903512 |
|
28-Sep-2007 |
Roland Dreier <rolandd@cisco.com> |
[IPoIB]: Convert to netdevice internal stats Use the stats member of struct netdevice in IPoIB, so we can save memory by deleting the stats member of struct ipoib_dev_priv, and save code by deleting ipoib_get_stats(). Signed-off-by: Roland Dreier <rolandd@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
ede6bc04 |
|
07-Oct-2007 |
Dotan Barak <dotanb@dev.mellanox.co.il> |
IPoIB/cm: Clean up initialization of QP attr in ipoib_cm_create_tx_qp() Make the way QP is being created in ipoib_cm_create_tx_qp() consistent with ipoib_cm_create_rx_qp(). Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
1d846126 |
|
18-Jun-2007 |
Sean Hefty <sean.hefty@intel.com> |
IB/cm: Include HCA ACK delay in local ACK timeout The IB CM should include the HCA ACK delay when calculating the local ACK timeout value to use for RC QPs. If the HCA ACK delay is large enough relative to the packet life time, then if it is not taken into account, the calculated timeout value ends up being too small, which can result in "retry exceeded" errors. Signed-off-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
20089ca5 |
|
10-Jul-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Fix warning if IPV6 is not enabled Fix drivers/infiniband/ulp/ipoib/ipoib_cm.c:1151: warning: unused variable 'dev' by getting rid of the variable dev, which is only used if CONFIG_IPV6 is enabled, and replacing the one use of it with the value it is assigned, namely priv->dev. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
841adfca |
|
29-Jun-2007 |
Ralph Campbell <ralph.campbell@qlogic.com> |
IPoIB/cm: Partial error clean up unmaps wrong address If a page can't be allocated for the frag list of a skb, the code to unmap the partially allocated list is off by one. For exaple, if 'frags' equals one, i == 0, and the alloc_page() fails, then the old loop would have unmapped mapping[1] which is uninitialized. The same would happen if the call to ib_dma_map_page() failed. Signed-off-by: Ralph Campbell <ralph.campbell@qlogic.com> Acked-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
13ef5f44 |
|
21-Jun-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: Remove dead definition of struct ipoib_cm_id It's completely unused. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
82c3aca6 |
|
20-Jun-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix interoperability when MTU doesn't match IPoIB connected mode currently rejects a connection request unless the supported MTU is >= the local netdevice MTU. This breaks interoperability with implementations that might have tweaked IPOIB_CM_MTU, and there's real no longer a reason to do so: this test is just a leftover from when we did not tweak MTU per-connection. Fix this by making the test as permissive as possible. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
3ec7393a |
|
19-Jun-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Initialize RX before moving QP to RTR Fix a crasher bug in IPoIB CM: once a QP is in the RTR state, a receive completion (or even an asynchronous error) might be observed on this QP, so we have to initialize all of our receive data structures before moving to the RTR state. As an optimization (since modify_qp might take a long time), the jiffies update done when moving RX to the passive_ids list is also left in place to reduce the chance of the RX being misdetected as stale. This fixes bug <https://bugs.openfabrics.org/show_bug.cgi?id=662>. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
ec56dc0b |
|
28-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix performance regression on Mellanox commit 518b1646 ("IPoIB/cm: Fix SRQ WR leak") introduced a severe performance regression on Mellanox cards, because keeping a QP in the error state for extended periods of time moves hardware to the slow path (until the QP is destroyed). For example, MPI latency goes from ~3 usecs to ~7 usecs. Fix this by posting a send WR on one of the QPs that are being flushed, instead of using a separate drain QP that is kept in the error state. This fixes bug <https://bugs.openfabrics.org/show_bug.cgi?id=636>, reported and bisected by Scott Weitzenkamp at Cisco and debugged by Sasha Mikheev at Voltaire. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
2dfbfc37 |
|
24-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Drain cq in ipoib_cm_dev_stop() Since NAPI polling is disabled while ipoib_cm_dev_stop() is running, ipoib_cm_dev_stop() must poll the CQ itself in order to see the packets draining. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
8fd357a6 |
|
24-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix timeout check in ipoib_cm_dev_stop() time_after() was used backwards, so the timeout occurred immediately. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
518b1646 |
|
21-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix SRQ WR leak SRQ WR leakage has been observed with IPoIB/CM: e.g. flipping ports on and off will, with time, leak out all WRs and then all connections will start getting RNR NAKs. Fix this in the way suggested by spec: move the QP being destroyed to the error state, wait for "Last WQE Reached" event and then post WR on a "drain QP" connected to the same CQ. Once we observe a completion on the drain QP, it's safe to call ib_destroy_qp. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
7c5b9ef8 |
|
13-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Optimize stale connection detection In the presence of some running RX connections, we repeat queue_delayed_work calls each 4 RX WRs, which is a waste. It's enough to start stale task when a first passive connection is added, and rerun it every IPOIB_CM_RX_DELAY as long as there are outstanding passive connections. This removes some code from RX data path. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
8d1cc86a |
|
06-May-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Convert to NAPI Convert the IP-over-InfiniBand network device driver over to using NAPI to handle completions for the main CQ. This covers all receives as well as datagram mode sends; send completions for connected mode connections are still handled from interrupt context. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
f4fd0b22 |
|
03-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IB: Add CQ comp_vector support Add a num_comp_vectors member to struct ib_device and extend ib_create_cq() to pass in a comp_vector parameter -- this parallels the userspace libibverbs API. Update all hardware drivers to set num_comp_vectors to 1 and have all ULPs pass 0 for the comp_vector value. Pass the value of num_comp_vectors to userspace rather than hard-coding a value of 1. We want multiple CQ event vector support (via MSI-X or similar for adapters that can generate multiple interrupts), but it's not clear how many vectors we want, or how we want to deal with policy issues such as how to decide which vector to use or how to set up interrupt affinity. This patch is useful for experimenting, since no core changes will be necessary when updating a driver to support multiple vectors, and we know that we want to make at least these changes anyway. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
d6ef7d68 |
|
02-May-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Don't crash if remote side uses one QP for both directions The IPoIB CM spec allows the use of a single connection in both active->passive and passive->active directions. The current Linux code uses one connection for both directions, but if another node only uses one connection for both directions, we oops when we try to look up the passive connection. Fix by checking that qp_context is non-NULL before dereferencing it. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il>
|
#
347fcfbe |
|
30-Apr-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix error handling in ipoib_cm_dev_open() If skb allocation fails when we start the device, we call ipoib_cm_dev_stop() even though ipoib_cm_dev_open() did not run to completion, so we pass an invalid pointer to ib_destroy_cm_id and get an oops. Fix by clearing cm.id on error, and testing it in cm_dev_stop(). This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=561> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
459a98ed |
|
19-Mar-2007 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
[SK_BUFF]: Introduce skb_reset_mac_header(skb) For the common, open coded 'skb->mac.raw = skb->data' operation, so that we can later turn skb->mac.raw into a offset, reducing the size of struct sk_buff in 64bit land while possibly keeping it as a pointer on 32bit. This one touches just the most simple case, next will handle the slightly more "complex" cases. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
37aebbde |
|
24-Apr-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB/cm: spin_lock_irqsave() -> spin_lock_irq() replacements There are quite a few places in ipoib_cm.c where we know IRQs are enabled because we do something that sleeps in the same function, so we can convert several occurrences of spin_lock_irqsave() to a plain spin_lock_irq(). This cleans up the source a little and makes the code smaller too: add/remove: 0/0 grow/shrink: 1/5 up/down: 3/-51 (-48) function old new delta ipoib_cm_tx_reap 403 406 +3 ipoib_cm_stale_task 146 145 -1 ipoib_cm_dev_stop 173 172 -1 ipoib_cm_tx_handler 964 956 -8 ipoib_cm_rx_handler 956 937 -19 ipoib_cm_skb_reap 212 190 -22 Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
a89875fc |
|
18-Apr-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Remove pointless opcode field from debugging output There's no point in printing the opcode field in the completion handling debugging output, since the type of completion is already printed at the beginning of the line. In fact the opcode field is not even defined for completions with a status other than success. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
6371ea3d |
|
10-Apr-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix DMA direction typo Receive buffers need to be mapped with DMA_FROM_DEVICE. Incorrectly mapping with DMA_TO_DEVICE causes a hard lock on ppc64 machines with an IOMMU. This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=431> Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
77d8e1ef |
|
21-Mar-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IB/ipoib: Fix thinko in packet length checks The packet length checks in ipoib are broken: we add 4 bytes (IPoIB encapsulation header) when sending a packet, not 20 bytes (hardware address length) to each packet. Therefore, if connected mode is enabled so that the interface MTU is larger than the multicast MTU, IPoIB may end up trying to send too-long multicast packets. For example, multicast is broken if a message of size 2048 bytes is sent on an interface with UD MTU 2048, because 2048 is bigger than the real limit of 2044 but the code tests against the wrong limit of 2060. This patch fixes <https://bugs.openfabrics.org/show_bug.cgi?id=418>, submitted by Scott Weitzenkamp <sweitzen@cisco.com>. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
60a596da |
|
22-Mar-2007 |
Michael S. Tsirkin <mst@dev.mellanox.co.il> |
IPoIB/cm: Fix reaping of stale connections The sense of the time_after_eq() test in ipoib_cm_stale_task() is reversed so that only non-stale connections are reaped. Fix this by changing to time_before_eq(). Noticed by Pradeep Satyanarayana <pradeep@us.ibm.com>. Signed-off-by: Michael S. Tsirkin <mst@dev.mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
1812063b |
|
20-Feb-2007 |
Michael S. Tsirkin <mst@mellanox.co.il> |
IPoIB/cm: Improve small message bandwidth Avoid the overhead of freeing/reallocating and mapping/unmapping for DMA pages that have not been written to by hardware. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
8a2e65f8 |
|
15-Feb-2007 |
Michael S. Tsirkin <mst@mellanox.co.il> |
IPoIB: CM error handling thinko fix ipoib_cm_alloc_rx_skb() might be called from IRQ context, so it must use dev_kfree_skb_any(), not kfree_skb(). Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
551fd612 |
|
16-Feb-2007 |
Roland Dreier <rolandd@cisco.com> |
IPoIB: Only allow root to change between datagram and connected mode Change the permissions of the "mode" sysfs attribute to be S_IWUSR instead of S_IWUGO. Signed-off-by: Roland Dreier <rolandd@cisco.com>
|
#
839fcaba |
|
05-Feb-2007 |
Michael S. Tsirkin <mst@mellanox.co.il> |
IPoIB: Connected mode experimental support The following patch adds experimental support for IPoIB connected mode, as defined by the draft from the IETF ipoib working group. The idea is to increase performance by increasing the MTU from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD. With this code, I'm able to get 800MByte/sec or more with netperf without options on a Mellanox 4x back-to-back DDR system. Some notes on code: 1. SRQ is used for scalability to large cluster sizes 2. Only RC connections are used (UC does not support SRQ now) 3. Retry count is set to 0 since spec draft warns against retries 4. Each connection is used for data transfers in only 1 direction, so each connection is either active(TX) or passive (RX). 2 sides that want to communicate create 2 connections. 5. Each active (TX) connection has a separate CQ for send completions - this keeps the code simple without CQ resize and other tricks 6. To detect stale passive side connections (where the remote side is down), we keep an LRU list of passive connections (updated once per second per connection) and destroy a connection after it has been unused for several seconds. The LRU rule makes it possible to avoid scanning connections that have recently been active. Signed-off-by: Michael S. Tsirkin <mst@mellanox.co.il> Signed-off-by: Roland Dreier <rolandd@cisco.com>
|