History log of /freebsd-11.0-release/sys/netinet/tcp_lro.c
Revision Date Author Comments
# 303975 11-Aug-2016 gjb

Copy stable/11@r303970 to releng/11.0 as part of the 11.0-RELEASE
cycle.

Prune svn:mergeinfo from the new branch, and rename it to RC1.

Update __FreeBSD_version.

Use the quarterly branch for the default FreeBSD.conf pkg(8) repo and
the dvd1.iso packages population.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 302408 08-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 301249 03-Jun-2016 hselasky

Use insertion sort instead of bubble sort in TCP LRO.

Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision: https://reviews.freebsd.org/D6619
Sponsored by: Mellanox Technologies
Tested by: Netflix
Suggested by: Pieter de Goeje <pieter@degoeje.nl>
Reviewed by: ed, gallatin, gnn, transport
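
For illustration, a minimal C sketch of the idea (not the committed tcp_lro
code): insertion sort over 64-bit keys degenerates to a single linear pass
when the input is already sorted.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: insertion sort over an array of 64-bit keys. */
static void
lro_insertion_sort(uint64_t *key, size_t n)
{
    size_t i, j;
    uint64_t tmp;

    for (i = 1; i < n; i++) {
        tmp = key[i];
        /* Shift larger elements right until 'tmp' fits. */
        for (j = i; j > 0 && key[j - 1] > tmp; j--)
            key[j] = key[j - 1];
        key[j] = tmp;
    }
}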


# 300731 26-May-2016 hselasky

Use an optimised, complexity-safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in the worst case spend O(N*N)
comparisons before the input array is sorted. It can also recurse a
significant number of times, using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of the fact that the sorting
key is only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 levels, bounded by the number of bits in the key. Compiled
with -O2 the sorting routine was measured to use 64 bytes of stack.
Multiplying this by 64 gives a maximum stack consumption of 4096 bytes
on AMD64. The same bound applies to the execution time: the array to be
sorted is not traversed more than 64 times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO-related sorting
as measured by Intel VTune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision: https://reviews.freebsd.org/D6472
Sponsored by: Mellanox Technologies
Tested by: Netflix
Reviewed by: gallatin, rrs, sephe, transport
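
For illustration, a hedged sketch of the approach described above (not the
committed tcp_lro_sort(); the real routine sorts mbuf pointers keyed by a
packed 64-bit value): partition by one key bit per recursion level, so both
the recursion depth and the number of array traversals are bounded by 64.

#include <stdint.h>
#include <stddef.h>

/*
 * Bitwise (MSD radix) partition sort over 64-bit keys.  Each level
 * traverses the subarray once and recurses with the next lower bit,
 * so recursion depth is at most 64.
 */
static void
lro_bit_sort(uint64_t *key, size_t n, uint64_t bit)
{
    size_t lo, hi;
    uint64_t tmp;

    if (n < 2 || bit == 0)
        return;

    /* Partition: keys with 'bit' clear first, keys with 'bit' set last. */
    lo = 0;
    hi = n;
    while (lo < hi) {
        if ((key[lo] & bit) == 0) {
            lo++;
        } else {
            hi--;
            tmp = key[lo];
            key[lo] = key[hi];
            key[hi] = tmp;
        }
    }

    /* Recurse on both partitions using the next lower bit. */
    lro_bit_sort(key, lo, bit >> 1);
    lro_bit_sort(key + lo, n - lo, bit >> 1);
}

/* Usage: lro_bit_sort(keys, nkeys, 1ULL << 63); */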


# 298974 03-May-2016 sephe

tcp/lro: Refactor the active list operation.

This eases further work on the active list, e.g. moving it to a hash table.

Reviewed by: gallatin, rrs (earlier version)
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6137


# 298730 28-Apr-2016 sephe

tcp/lro: Fix more typos

Noticed by: hiren
MFC after: 1 week
Sponsored by: Microsoft OSTC


# 298696 27-Apr-2016 sephe

tcp/lro: Fix typo.

MFC after: 1 week
Sponsored by: Microsoft OSTC


# 297483 01-Apr-2016 sephe

tcp/lro: Change SLIST to LIST, so that removing an entry is O(1)

This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.

Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5765
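
For illustration, a minimal sketch using the sys/queue.h macros: SLIST
removal has to walk the list from the head to find the predecessor, while a
LIST entry carries a back-pointer, so LIST_REMOVE() is constant time. The
structure and function names below are hypothetical.

#include <sys/queue.h>

struct lro_entry_example {
    LIST_ENTRY(lro_entry_example) next;   /* forward and back links */
    /* ... per-flow state ... */
};

LIST_HEAD(lro_head_example, lro_entry_example);

/*
 * With LIST, the entry itself is enough to unlink it.  The equivalent
 * SLIST_REMOVE(head, elm, type, field) must traverse the list from the
 * head to locate the element's predecessor.
 */
static void
lro_deactivate_example(struct lro_entry_example *le)
{
    LIST_REMOVE(le, next);   /* O(1): uses the stored back-pointer */
}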


# 297482 01-Apr-2016 sephe

tcp/lro: Use tcp_lro_flush_all in device drivers to avoid code duplication

Also factor out tcp_lro_rx_done, which deduplicates the same logic in
netinet/tcp_lro.c

Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5725


# 297265 25-Mar-2016 sephe

tcp/lro: Return TCP_LRO_NO_ENTRIES if we are short of LRO entries.

This allows callers to react accordingly.

Reviewed by: gallatin (no objection)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5695
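
A hedged sketch of a caller reacting to the new return value; the driver
function is hypothetical and simply hands the packet to the stack when LRO
declines it.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

/* Hypothetical driver RX path distinguishing TCP_LRO_NO_ENTRIES. */
static void
example_rx_one(struct ifnet *ifp, struct lro_ctrl *lc, struct mbuf *m)
{
    int error;

    error = tcp_lro_rx(lc, m, 0);
    if (error == TCP_LRO_NO_ENTRIES) {
        /* All LRO entries in use: a driver could flush here and retry. */
        (*ifp->if_input)(ifp, m);
    } else if (error != 0) {
        /* Packet not LRO-able at all; pass it to the stack directly. */
        (*ifp->if_input)(ifp, m);
    }
}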


# 295739 18-Feb-2016 sephe

tcp/lro: Allow drivers to set the TCP ACK/data segment aggregation limit

The ACK aggregation limit is append-count based, while the TCP data
segment aggregation limit is length based. Unless the network driver
sets these two limits, this is a no-op.

Reviewed by: adrian, gallatin (previous version), hselasky (previous version)
Approved by: adrian (mentor)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5185
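
A hedged sketch of how a driver might set the limits after initializing its
LRO control structure; the field names lro_ackcnt_lim and lro_length_lim and
the example values are assumptions based on this description.

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <netinet/tcp_lro.h>

/*
 * Hypothetical driver attach path: after tcp_lro_init() succeeds,
 * tighten the aggregation limits.  Leaving the fields at their
 * defaults keeps the feature a no-op.
 */
static void
example_tune_lro(struct lro_ctrl *lc)
{
    lc->lro_ackcnt_lim = 2;          /* aggregate at most 2 pure ACKs */
    lc->lro_length_lim = 32 * 1024;  /* cap aggregated data at 32KB */
}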


# 295506 11-Feb-2016 hselasky

Use a pair of ifs when comparing the 32-bit flowid integers so that
the sign bit doesn't cause an overflow. The overflow manifests itself
as a sorting index wrap around in the middle of the sorted array,
which is not a problem for the LRO code, but might be a problem for
the logic inside qsort().

Reviewed by: gnn @
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D5239
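
For illustration, a minimal comparator sketch (not the committed code):
subtraction-based comparison of unsigned 32-bit flowids can wrap when the
values differ in the sign bit, while a pair of ifs cannot.

#include <stdint.h>

/* Overflow-prone: the cast can wrap when the flowids differ by >= 2^31. */
static int
flowid_cmp_broken(uint32_t a, uint32_t b)
{
    return ((int)(a - b));
}

/* Safe: a pair of ifs never overflows, regardless of the sign bit. */
static int
flowid_cmp_safe(uint32_t a, uint32_t b)
{
    if (a > b)
        return (1);
    if (a < b)
        return (-1);
    return (0);
}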


# 295126 01-Feb-2016 glebius

These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h


# 294327 19-Jan-2016 hselasky

Add optimizing LRO wrapper:

- Add optimizing LRO wrapper which pre-sorts all incoming packets
according to the hash type and flowid. This prevents exhaustion of
the LRO entries due to too many connections at the same time.
Testing using a larger number of higher bandwidth TCP connections
showed that the incoming ACK packet aggregation rate increased from
~1.3:1 to almost 3:1. Another test showed that for a number of TCP
connections greater than 16 per hardware receive ring, where 8 TCP
connections was the LRO active entry limit, there was a significant
improvement in throughput due to being able to fully aggregate more
than 8 TCP streams. For very few very high bandwidth TCP streams, the
optimizing LRO wrapper will add CPU usage instead of reducing it. This
is expected. Network drivers which want to use the optimizing LRO
wrapper need to call "tcp_lro_queue_mbuf()" instead of "tcp_lro_rx()"
and "tcp_lro_flush_all()" instead of "tcp_lro_flush()". Further, the
LRO control structure must be initialized using "tcp_lro_init_args()",
passing a non-zero number into the "lro_mbufs" argument.

- Make LRO statistics 64-bit. Previously 32-bit integers were used for
statistics which can be prone to wrap-around. Fix this while at it
and update all SYSCTLs which expose LRO statistics.

- Ensure all data is freed when destroying an LRO control structure,
especially leftover LRO entries.

- Reduce number of memory allocations needed when setting up a LRO
control structure by precomputing the total amount of memory needed.

- Add own memory allocation counter for LRO.

- Bump the FreeBSD version to force recompilation of all KLDs due to
change of the LRO control structure size.

Sponsored by: Mellanox Technologies
Reviewed by: gallatin, sbruno, rrs, gnn, transport
Tested by: Netflix
Differential Revision: https://reviews.freebsd.org/D4914
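
A hedged sketch of how a driver might opt into the wrapper, using the
functions named above; the receive-queue structure, the entry/mbuf counts and
the exact tcp_lro_init_args() argument order are assumptions.

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

struct example_rxq {
    struct lro_ctrl lro;
    /* ... hardware ring state ... */
};

/* Non-zero mbuf count enables the pre-sorting wrapper. */
static int
example_rxq_init(struct example_rxq *rxq, struct ifnet *ifp)
{
    return (tcp_lro_init_args(&rxq->lro, ifp, 64 /* entries */,
        256 /* lro_mbufs */));
}

/* Queue instead of tcp_lro_rx(); packets are sorted at flush time. */
static void
example_rx_intr(struct example_rxq *rxq, struct mbuf *m)
{
    tcp_lro_queue_mbuf(&rxq->lro, m);
}

/* Sorts queued mbufs by hash type and flowid, then aggregates and flushes. */
static void
example_rx_done(struct example_rxq *rxq)
{
    tcp_lro_flush_all(&rxq->lro);
}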


# 284961 30-Jun-2015 np

Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough; any tags
hanging off the header need to be freed too.

Differential Revision: https://reviews.freebsd.org/D2708
Reviewed by: ae@, hiren@
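
A hedged sketch of the kind of cleanup involved (not necessarily the
committed diff): free the mbuf's tag chain before clearing M_PKTHDR.

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch only: when folding a trailing mbuf into an existing LRO
 * aggregate, simply clearing M_PKTHDR leaks any m_tag metadata hanging
 * off the packet header, so the tag chain must be freed first.
 */
static void
example_strip_pkthdr(struct mbuf *m)
{
    m_tag_delete_chain(m, NULL);   /* free all tags on the packet header */
    m->m_flags &= ~M_PKTHDR;
}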


# 255010 28-Aug-2013 np

Merge r254336 from user/np/cxl_tuning.

Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries. Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them. When this happens the final LRO flush (normally when the rx
routine is done) does not occur. Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry. Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternatives that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
work for later.

Reviewed by: andre
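
A hedged sketch of a driver's periodic tick using the flush-inactive
interface described above (assumed here to be tcp_lro_flush_inactive(),
taking the inactivity threshold as a struct timeval).

#include <sys/param.h>
#include <sys/time.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <netinet/tcp_lro.h>

/* Hypothetical driver tick: flush only entries idle longer than 'idle'. */
static void
example_lro_tick(struct lro_ctrl *lc)
{
    const struct timeval idle = { .tv_sec = 0, .tv_usec = 100 };

    tcp_lro_flush_inactive(lc, &idle);
}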


# 247104 21-Feb-2013 gallatin

Fix tcp_lro_rx_ipv4() for drivers that do not set CSUM_IP_CHECKED.
Specifically, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement hardware IPv4
header csum offload.

Sponsored by: Myricom Inc.

MFC after: 7 days
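
For illustration, a hedged sketch of the verification logic (the helper name
is hypothetical): when the driver did not set CSUM_IP_CHECKED, the header is
verified in software, and in_cksum_hdr() returning 0 means it is correct.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <machine/in_cksum.h>

/* Returns non-zero if the IPv4 header checksum is known good. */
static int
example_ipv4_hdr_ok(struct mbuf *m, struct ip *ip)
{
    if (m->m_pkthdr.csum_flags & CSUM_IP_CHECKED)
        return ((m->m_pkthdr.csum_flags & CSUM_IP_VALID) != 0);
    /* Software verification: 0, not 0xffff, indicates a correct header. */
    return (in_cksum_hdr(ip) == 0);
}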


# 236394 01-Jun-2012 bz

Make TCP LRO work properly with VIMAGE kernels rather than just panicking.
There's no VIMAGE context set there yet, as this runs before if_ethersubr.c.

MFC after: 3 days
X-MFC with: r235981


# 236093 26-May-2012 bz

Trim the extra $FreeBSD$ from the comment below the license. We use
the __FBSDID() macro on the file now instead.

MFC after: 3 days


# 235981 25-May-2012 bz

In case forwarding is turned on for a given address family, refuse to
queue the packet for LRO and tell the driver to directly pass it on.
This avoids re-assembly and later re-fragmentation problems when
forwarding.

It's not the best solution but the simplest and most effective for
the moment.

Should have been done: ages ago
Discussed with and by: many
MFC after: 3 days
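
A hedged sketch of the check (the helper is hypothetical; V_ipforwarding and
TCP_LRO_CANNOT are existing stack symbols, but the exact committed test in
tcp_lro_rx() may differ). The IPv6 case with V_ip6_forwarding is analogous.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip_var.h>
#include <netinet/tcp_lro.h>

/*
 * If the host may forward IPv4 traffic, decline LRO so the driver
 * passes the packet up unmodified instead of re-assembling it.
 */
static int
example_lro_forwarding_check(void)
{
    if (V_ipforwarding != 0)
        return (TCP_LRO_CANNOT);
    return (0);
}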


# 235944 24-May-2012 bz

MFp4 bz_ipv6_fast:

Significantly update tcp_lro, mainly for two things:
1) introduce basic support for IPv6 without extension headers.
2) try hard to also get the incremental checksum updates right,
especially in the IPv4 case for the IP and TCP headers.

Move variables around for better locality, factor things out into
functions, allow checksum updates to be compiled out, ...

Leave a few comments on further things to look at in the future,
though that is not the full list.

Update drivers with appropriate #includes as needed for IPv6 data
type in LRO.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 235474 15-May-2012 bz

Switch to a standard 2 clause BSD license (from bsd-style-copyright).

Approved by: Myricom Inc. (gallatin)
Approved by: Intel Corporation (jfv)


# 223797 05-Jul-2011 cperciva

Don't allow lro->len to exceed 65535, as this will result in overflow
when len is inserted back into the synthetic IP packet and cause a
multiple of 2^16 bytes of TCP "packet loss".

This improves Linux->FreeBSD netperf bandwidth by a factor of 300 in
testing on Amazon EC2.

Reviewed by: jfv
MFC after: 2 weeks
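
For illustration, a minimal sketch of the bound (not the committed code): the
aggregated length is written back into the 16-bit IP total-length field, so
an append that would exceed IP_MAXPACKET (65535) has to be refused and the
segment passed up unaggregated.

#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>   /* IP_MAXPACKET == 65535 */

/* Returns non-zero if the segment still fits in the synthetic IP header. */
static int
example_can_append(uint32_t lro_len, uint32_t seg_len)
{
    return (lro_len + seg_len <= IP_MAXPACKET);
}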


# 220428 07-Apr-2011 jfv

Port of the LRO fix from the mxge driver to the generic
LRO code. Thanks to Andrew Gallatin for the change.

MFC after: 7 days


# 217126 07-Jan-2011 jhb

Trim extra spaces before tabs.


# 182089 24-Aug-2008 kmacy

Don't calculate checksum if it has already been validated

Obtained from: Chelsio Inc.
MFC after: 3 days


# 179737 11-Jun-2008 jfv

Add generic TCP LRO into netinet