History log of /freebsd-11.0-release/sys/netinet/tcp_lro.c
Revision Date Author Comments
# 303975 11-Aug-2016 gjb

Copy stable/11@r303970 to releng/11.0 as part of the 11.0-RELEASE
cycle.

Prune svn:mergeinfo from the new branch, and rename it to RC1.

Update __FreeBSD_version.

Use the quarterly branch for the default FreeBSD.conf pkg(8) repo and
the dvd1.iso packages population.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 302408 08-Jul-2016 gjb

Copy head@r302406 to stable/11 as part of the 11.0-RELEASE cycle.
Prune svn:mergeinfo from the new branch, as nothing has been merged
here.

Additional commits post-branch will follow.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 301249 03-Jun-2016 hselasky

Use insertion sort instead of bubble sort in TCP LRO.

Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision: https://reviews.freebsd.org/D6619
Sponsored by: Mellanox Technologies
Tested by: Netflix
Suggested by: Pieter de Goeje <pieter@degoeje.nl>
Reviewed by: ed, gallatin, gnn, transport
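
For illustration, a minimal C sketch of the idea (not the committed tcp_lro
code): insertion sort over 64-bit keys degenerates to a single linear pass
when the input is already sorted.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: insertion sort over an array of 64-bit keys. */
static void
lro_insertion_sort(uint64_t *key, size_t n)
{
    size_t i, j;
    uint64_t tmp;

    for (i = 1; i < n; i++) {
        tmp = key[i];
        /* Shift larger elements right until 'tmp' fits. */
        for (j = i; j > 0 && key[j - 1] > tmp; j--)
            key[j] = key[j - 1];
        key[j] = tmp;
    }
}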


# 300731 26-May-2016 hselasky

Use an optimised, complexity-safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in the worst case spend O(N*N)
comparisons before the input array is sorted. It can also recurse a
significant number of times, using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of the fact that the sorting
key is only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 levels, bounded by the number of bits in the key. Compiled
with -O2 the sorting routine was measured to use 64 bytes of stack.
Multiplying this by 64 gives a maximum stack consumption of 4096 bytes
on AMD64. The same bound applies to the execution time: the array to be
sorted is not traversed more than 64 times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO-related sorting
as measured by Intel VTune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision: https://reviews.freebsd.org/D6472
Sponsored by: Mellanox Technologies
Tested by: Netflix
Reviewed by: gallatin, rrs, sephe, transport
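
For illustration, a hedged sketch of the approach described above (not the
committed tcp_lro_sort(); the real routine sorts mbuf pointers keyed by a
packed 64-bit value): partition by one key bit per recursion level, so both
the recursion depth and the number of array traversals are bounded by 64.

#include <stdint.h>
#include <stddef.h>

/*
 * Bitwise (MSD radix) partition sort over 64-bit keys.  Each level
 * traverses the subarray once and recurses with the next lower bit,
 * so recursion depth is at most 64.
 */
static void
lro_bit_sort(uint64_t *key, size_t n, uint64_t bit)
{
    size_t lo, hi;
    uint64_t tmp;

    if (n < 2 || bit == 0)
        return;

    /* Partition: keys with 'bit' clear first, keys with 'bit' set last. */
    lo = 0;
    hi = n;
    while (lo < hi) {
        if ((key[lo] & bit) == 0) {
            lo++;
        } else {
            hi--;
            tmp = key[lo];
            key[lo] = key[hi];
            key[hi] = tmp;
        }
    }

    /* Recurse on both partitions using the next lower bit. */
    lro_bit_sort(key, lo, bit >> 1);
    lro_bit_sort(key + lo, n - lo, bit >> 1);
}

/* Usage: lro_bit_sort(keys, nkeys, 1ULL << 63); */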


# 298974 03-May-2016 sephe

tcp/lro: Refactor the active list operation.

This eases further work on the active list, e.g. moving it to a hash table.

Reviewed by: gallatin, rrs (earlier version)
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6137


# 298730 28-Apr-2016 sephe

tcp/lro: Fix more typos

Noticed by: hiren
MFC after: 1 week
Sponsored by: Microsoft OSTC


# 298696 27-Apr-2016 sephe

tcp/lro: Fix typo.

MFC after: 1 week
Sponsored by: Microsoft OSTC


# 297483 01-Apr-2016 sephe

tcp/lro: Change SLIST to LIST, so that removing an entry is O(1)

This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.

Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5765
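
For illustration, a minimal sketch using the sys/queue.h macros: SLIST
removal has to walk the list from the head to find the predecessor, while a
LIST entry carries a back-pointer, so LIST_REMOVE() is constant time. The
structure and function names below are hypothetical.

#include <sys/queue.h>

struct lro_entry_example {
    LIST_ENTRY(lro_entry_example) next;   /* forward and back links */
    /* ... per-flow state ... */
};

LIST_HEAD(lro_head_example, lro_entry_example);

/*
 * With LIST, the entry itself is enough to unlink it.  The equivalent
 * SLIST_REMOVE(head, elm, type, field) must traverse the list from the
 * head to locate the element's predecessor.
 */
static void
lro_deactivate_example(struct lro_entry_example *le)
{
    LIST_REMOVE(le, next);   /* O(1): uses the stored back-pointer */
}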


# 297482 01-Apr-2016 sephe

tcp/lro: Use tcp_lro_flush_all in device drivers to avoid code duplication

Also factor out tcp_lro_rx_done, which deduplicates the same logic in
netinet/tcp_lro.c

Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5725


# 297265 25-Mar-2016 sephe

tcp/lro: Return TCP_LRO_NO_ENTRIES if we are short of LRO entries.

This allows callers to react accordingly.

Reviewed by: gallatin (no objection)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5695
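
A hedged sketch of a caller reacting to the new return value; the driver
function is hypothetical and simply hands the packet to the stack when LRO
declines it.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

/* Hypothetical driver RX path distinguishing TCP_LRO_NO_ENTRIES. */
static void
example_rx_one(struct ifnet *ifp, struct lro_ctrl *lc, struct mbuf *m)
{
    int error;

    error = tcp_lro_rx(lc, m, 0);
    if (error == TCP_LRO_NO_ENTRIES) {
        /* All LRO entries in use: a driver could flush here and retry. */
        (*ifp->if_input)(ifp, m);
    } else if (error != 0) {
        /* Packet not LRO-able at all; pass it to the stack directly. */
        (*ifp->if_input)(ifp, m);
    }
}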


# 295739 18-Feb-2016 sephe

tcp/lro: Allow drivers to set the TCP ACK/data segment aggregation limit

The ACK aggregation limit is append-count based, while the TCP data
segment aggregation limit is length based. Unless the network driver
sets these two limits, this is a no-op.

Reviewed by: adrian, gallatin (previous version), hselasky (previous version)
Approved by: adrian (mentor)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5185
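
A hedged sketch of how a driver might set the limits after initializing its
LRO control structure; the field names lro_ackcnt_lim and lro_length_lim and
the example values are assumptions based on this description.

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <netinet/tcp_lro.h>

/*
 * Hypothetical driver attach path: after tcp_lro_init() succeeds,
 * tighten the aggregation limits.  Leaving the fields at their
 * defaults keeps the feature a no-op.
 */
static void
example_tune_lro(struct lro_ctrl *lc)
{
    lc->lro_ackcnt_lim = 2;          /* aggregate at most 2 pure ACKs */
    lc->lro_length_lim = 32 * 1024;  /* cap aggregated data at 32KB */
}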


# 295506 11-Feb-2016 hselasky

Use a pair of ifs when comparing the 32-bit flowid integers so that
the sign bit doesn't cause an overflow. The overflow manifests itself
as a sorting index wrap around in the middle of the sorted array,
which is not a problem for the LRO code, but might be a problem for
the logic inside qsort().

Reviewed by: gnn @
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D5239
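
For illustration, a minimal comparator sketch (not the committed code):
subtraction-based comparison of unsigned 32-bit flowids can wrap when the
values differ in the sign bit, while a pair of ifs cannot.

#include <stdint.h>

/* Overflow-prone: the cast can wrap when the flowids differ by >= 2^31. */
static int
flowid_cmp_broken(uint32_t a, uint32_t b)
{
    return ((int)(a - b));
}

/* Safe: a pair of ifs never overflows, regardless of the sign bit. */
static int
flowid_cmp_safe(uint32_t a, uint32_t b)
{
    if (a > b)
        return (1);
    if (a < b)
        return (-1);
    return (0);
}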


# 295126 01-Feb-2016 glebius

These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h


# 294327 19-Jan-2016 hselasky

Add optimizing LRO wrapper:

- Add optimizing LRO wrapper which pre-sorts all incoming packets
according to the hash type and flowid. This prevents exhaustion of
the LRO entries due to too many connections at the same time.
Testing using a larger number of higher bandwidth TCP connections
showed that the incoming ACK packet aggregation rate increased from
~1.3:1 to almost 3:1. Another test showed that for a number of TCP
connections greater than 16 per hardware receive ring, where 8 TCP
connections was the LRO active entry limit, there was a significant
improvement in throughput due to being able to fully aggregate more
than 8 TCP streams. For very few very high bandwidth TCP streams, the
optimizing LRO wrapper will add CPU usage instead of reducing it. This
is expected. Network drivers which want to use the optimizing LRO
wrapper need to call "tcp_lro_queue_mbuf()" instead of "tcp_lro_rx()"
and "tcp_lro_flush_all()" instead of "tcp_lro_flush()". Further, the
LRO control structure must be initialized using "tcp_lro_init_args()",
passing a non-zero number into the "lro_mbufs" argument.

- Make LRO statistics 64-bit. Previously 32-bit integers were used for
statistics which can be prone to wrap-around. Fix this while at it
and update all SYSCTLs which expose LRO statistics.

- Ensure all data is freed when destroying an LRO control structure,
especially leftover LRO entries.

- Reduce number of memory allocations needed when setting up a LRO
control structure by precomputing the total amount of memory needed.

- Add own memory allocation counter for LRO.

- Bump the FreeBSD version to force recompilation of all KLDs due to
change of the LRO control structure size.

Sponsored by: Mellanox Technologies
Reviewed by: gallatin, sbruno, rrs, gnn, transport
Tested by: Netflix
Differential Revision: https://reviews.freebsd.org/D4914
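
A hedged sketch of how a driver might opt into the wrapper, using the
functions named above; the receive-queue structure, the entry/mbuf counts and
the exact tcp_lro_init_args() argument order are assumptions.

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/tcp_lro.h>

struct example_rxq {
    struct lro_ctrl lro;
    /* ... hardware ring state ... */
};

/* Non-zero mbuf count enables the pre-sorting wrapper. */
static int
example_rxq_init(struct example_rxq *rxq, struct ifnet *ifp)
{
    return (tcp_lro_init_args(&rxq->lro, ifp, 64 /* entries */,
        256 /* lro_mbufs */));
}

/* Queue instead of tcp_lro_rx(); packets are sorted at flush time. */
static void
example_rx_intr(struct example_rxq *rxq, struct mbuf *m)
{
    tcp_lro_queue_mbuf(&rxq->lro, m);
}

/* Sorts queued mbufs by hash type and flowid, then aggregates and flushes. */
static void
example_rx_done(struct example_rxq *rxq)
{
    tcp_lro_flush_all(&rxq->lro);
}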


# 284961 30-Jun-2015 np

Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough; any tags
hanging off the header need to be freed too.

Differential Revision: https://reviews.freebsd.org/D2708
Reviewed by: ae@, hiren@
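
A hedged sketch of the kind of cleanup involved (not necessarily the
committed diff): free the mbuf's tag chain before clearing M_PKTHDR.

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Sketch only: when folding a trailing mbuf into an existing LRO
 * aggregate, simply clearing M_PKTHDR leaks any m_tag metadata hanging
 * off the packet header, so the tag chain must be freed first.
 */
static void
example_strip_pkthdr(struct mbuf *m)
{
    m_tag_delete_chain(m, NULL);   /* free all tags on the packet header */
    m->m_flags &= ~M_PKTHDR;
}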


# 255010 28-Aug-2013 np

Merge r254336 from user/np/cxl_tuning.

Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries. Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them. When this happens the final LRO flush (normally when the rx
routine is done) does not occur. Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry. Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternatives that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
work for later.

Reviewed by: andre
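
A hedged sketch of a driver's periodic tick using the flush-inactive
interface described above (assumed here to be tcp_lro_flush_inactive(),
taking the inactivity threshold as a struct timeval).

#include <sys/param.h>
#include <sys/time.h>
#include <sys/queue.h>
#include <sys/mbuf.h>
#include <netinet/tcp_lro.h>

/* Hypothetical driver tick: flush only entries idle longer than 'idle'. */
static void
example_lro_tick(struct lro_ctrl *lc)
{
    const struct timeval idle = { .tv_sec = 0, .tv_usec = 100 };

    tcp_lro_flush_inactive(lc, &idle);
}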


# 247104 21-Feb-2013 gallatin

Fix tcp_lro_rx_ipv4() for drivers that do not set CSUM_IP_CHECKED.
Specifically, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement hardware IPv4
header csum offload.

Sponsored by: Myricom Inc.

MFC after: 7 days
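
For illustration, a hedged sketch of the verification logic (the helper name
is hypothetical): when the driver did not set CSUM_IP_CHECKED, the header is
verified in software, and in_cksum_hdr() returning 0 means it is correct.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <machine/in_cksum.h>

/* Returns non-zero if the IPv4 header checksum is known good. */
static int
example_ipv4_hdr_ok(struct mbuf *m, struct ip *ip)
{
    if (m->m_pkthdr.csum_flags & CSUM_IP_CHECKED)
        return ((m->m_pkthdr.csum_flags & CSUM_IP_VALID) != 0);
    /* Software verification: 0, not 0xffff, indicates a correct header. */
    return (in_cksum_hdr(ip) == 0);
}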


# 236394 01-Jun-2012 bz

Make TCP LRO work properly with VIMAGE kernels rather than just panicking.
There's no VIMAGE context set there yet, as this runs before if_ethersubr.c.

MFC after: 3 days
X-MFC with: r235981


# 236093 26-May-2012 bz

Trim the extra $FreeBSD$ from the comment below the license. We use
the __FBSDID() macro on the file now instead.

MFC after: 3 days


# 235981 25-May-2012 bz

In case forwarding is turned on for a given address family, refuse to
queue the packet for LRO and tell the driver to directly pass it on.
This avoids re-assembly and later re-fragmentation problems when
forwarding.

It's not the best solution but the simplest and most effective for
the moment.

Should have been done: ages ago
Discussed with and by: many
MFC after: 3 days
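
A hedged sketch of the check (the helper is hypothetical; V_ipforwarding and
TCP_LRO_CANNOT are existing stack symbols, but the exact committed test in
tcp_lro_rx() may differ). The IPv6 case with V_ip6_forwarding is analogous.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/in.h>
#include <netinet/ip_var.h>
#include <netinet/tcp_lro.h>

/*
 * If the host may forward IPv4 traffic, decline LRO so the driver
 * passes the packet up unmodified instead of re-assembling it.
 */
static int
example_lro_forwarding_check(void)
{
    if (V_ipforwarding != 0)
        return (TCP_LRO_CANNOT);
    return (0);
}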


# 235944 24-May-2012 bz

MFp4 bz_ipv6_fast:

Significantly update tcp_lro, mainly for two things:
1) introduce basic support for IPv6 without extension headers.
2) try hard to also get the incremental checksum updates right,
especially in the IPv4 case for the IP and TCP headers.

Move variables around for better locality, factor things out into
functions, allow checksum updates to be compiled out, ...

Leave a few comments on further things to look at in the future,
though that is not the full list.

Update drivers with appropriate #includes as needed for IPv6 data
type in LRO.

Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems

Reviewed by: gnn (as part of the whole)
MFC After: 3 days


# 235474 15-May-2012 bz

Switch to a standard 2 clause BSD license (from bsd-style-copyright).

Approved by: Myricom Inc. (gallatin)
Approved by: Intel Corporation (jfv)


# 223797 05-Jul-2011 cperciva

Don't allow lro->len to exceed 65535, as this will result in overflow
when len is inserted back into the synthetic IP packet and cause a
multiple of 2^16 bytes of TCP "packet loss".

This improves Linux->FreeBSD netperf bandwidth by a factor of 300 in
testing on Amazon EC2.

Reviewed by: jfv
MFC after: 2 weeks
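
For illustration, a minimal sketch of the bound (not the committed code): the
aggregated length is written back into the 16-bit IP total-length field, so
an append that would exceed IP_MAXPACKET (65535) has to be refused and the
segment passed up unaggregated.

#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>   /* IP_MAXPACKET == 65535 */

/* Returns non-zero if the segment still fits in the synthetic IP header. */
static int
example_can_append(uint32_t lro_len, uint32_t seg_len)
{
    return (lro_len + seg_len <= IP_MAXPACKET);
}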


# 220428 07-Apr-2011 jfv

Port of the LRO fix from the mxge driver to the generic
LRO code. Thanks to Andrew Gallatin for the change.

MFC after: 7 days


# 217126 07-Jan-2011 jhb

Trim extra spaces before tabs.


# 182089 24-Aug-2008 kmacy

Don't calculate checksum if it has already been validated

Obtained from: Chelsio Inc.
MFC after: 3 days


# 179737 11-Jun-2008 jfv

Add generic TCP LRO into netinet