stanislav shalunov: GigaTCP--TCP on Gigabit Ethernet

August 2001

Credits

I'd like to acknowledge outside advice from Phil Dykstra of WareOnEarth Communications, Matt Hodge of University of Washington, Jonathan Lemon, Ken Merry of FreeBSD zero copy fame, Joerg Micheel of University of Waikato, Bosko Milekic, David Richardson of University of Washington, and Peter Seebach of Plethora Internet.

People who were involved with this project at Internet2 are, besides myself: Guy Almes, Bill Cerveny, Ben Teitelbaum, and Matt Zekauskas.

Motivation

We at Internet2 wanted to build two or more machines that would be capable of achieving TCP throughputs of 700-800Mb/s over WAN (with a single TCP connection). The purposes of such an exercise are:
  1. to be able to tell how the networks we engineer today will behave a few years down the road when TCP connections with such throughput become more commonplace;
  2. more specifically, to be able to measure the impact of things like minor packet re-ordering on TCP throughput;
  3. to validate the assumption of low loss in Internet2 networks;
  4. to learn more about effects of network techniques such as SACK;
  5. to learn more about effects of host techniques such as zero copy TCP;
  6. to demonstrate that in tomorrow's network TCP can still provide viable transport even with today's algorithms;
  7. to evaluate the real need for dangerous host techniques such as hardware checksumming;
  8. to understand just exactly how hard it is to run hundreds-of-megabits-per-second TCP flows over WAN today and what difficulties are involved in doing so;
  9. possibly, to break or at least approach Microsoft's current land speed record of 786Mb/s from Seattle to Atlanta (not a reason to start building machines, but if we already have them, we should try).

Operating System

We wanted something with available source code. BSD/OS seemed unnecessarily expensive, given its competition, and our past experience with its drivers was less than glamorous. Of the other BSDs, FreeBSD was the clear winner with its ubiquity and stack optimizations. We also considered Linux, including Linux with Web100 patches.

We decided to put Linux on ice for the time being. It lacks zero copy support (sendfile(2) notwithstanding: sendfile() doesn't do anything about the receive side, and it only works with disk files; disk would easily become our bottleneck, and we're not into I/O monstrosity, we're into TCP monstrosity). It also lacks any clear advantages over FreeBSD for the task at hand. The option to use it later--if we wanted--was left open, and hardware was chosen so that it'd work well with both FreeBSD and Linux.

Hardware

First, we wanted to buy a commercial off-the-shelf Dell server and go from there. Once we realized how much we would have to pay for the features we want and how much unnecessary I/O fanciness we'd get with that, we started to evaluate the option of custom-built machines.
PCI bus
A regular 32-bit 33MHz PCI bus would afford a nominal theoretical throughput of just over 1Gb/s, half-duplex. With overhead and direction switches, that would come uncomfortably close to the desired TCP goodput. Therefore, we wanted a 64-bit 66MHz PCI bus.
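The nominal figures behind this choice can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes one full-width transfer per clock cycle and ignores arbitration, turnaround, and burst-setup overhead, so real throughput is lower. The function name is made up for illustration.

```python
def pci_peak_mbps(width_bits: int, clock_mhz: int) -> int:
    """Nominal peak PCI throughput in Mb/s, assuming one bus-width
    transfer per clock and no protocol overhead (an idealization)."""
    return width_bits * clock_mhz

# The two bus variants discussed above.
for width_bits, clock_mhz in [(32, 33), (64, 66)]:
    peak = pci_peak_mbps(width_bits, clock_mhz)
    print(f"{width_bits}-bit {clock_mhz}MHz PCI: {peak} Mb/s nominal")
```

The 32-bit 33MHz bus comes out at 1056Mb/s, shared between both directions, which leaves essentially no headroom over a 1Gb/s link; the 64-bit 66MHz bus has four times that.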
The NIC
Based on the experience of various people we consulted and hardware compatibility lists, we narrowed the choice to 3Com 3c985B-SX, Netgear GA620 [Bay Networks has apparently removed all information about it from its web site], or one of the four SysKonnect cards. The former two cards are the same Tigon II chipset with different amounts of on-board SRAM; the latter is substantially different. FreeBSD zero copy receive side requires special header-splitting firmware, which is only available for Tigon cards; additionally, SysKonnect can't do hardware checksumming and jumbo frames at the same time. On the other hand, SysKonnect was rumored to have slightly higher packet-per-second rates. We decided to go with the Tigon. We chose the more expensive 3c985B-SX with its 1MB of SRAM rather than the cheaper GA620 with its 512KB. Even though 1MB is still way below our target window size, we decided that every little bit would help.
FSB
With software checksumming all the memory bandwidth we could get would help with a fast CPU. Initially, we wanted to look at RDRAM. We eliminated RAMBUS for reliability reasons. Then, we considered DDR SDRAM; it didn't look like it's supported by solid chipsets and it didn't look like we could get it with ECC; so we discarded DDR. The only thing left other than regular SDR SDRAM was interleaved SDR. We decided we want at least two-way interleaving.
The chipset
ServerWorks ServerSet III HE was in all sources of information we could find considered good. It's the highest-end ServerWorks chipset. Its four-way memory interleaving with 32Gb/s theoretical memory throughput with ECC sure looked attractive, too.
The motherboard
We have considered the following motherboards with this chipset: Tyan S2597, SuperMicro 370DLE, and SuperMicro 370DE6. Of these boards only the last actually appeared to provide any memory interleaving (based on lack of its mention in specifications of the other two boards and on private third-party benchmarks). This board also had the advantage of having an on-board SCSI adaptor that was supported by FreeBSD.
Memory
The 370DE6 actually only provides two-way interleaving, not taking advantage of the full four-way interleaving that the chipset is supposed to support. So, we needed two DIMMs. We decided we'd get 512MB of memory total, so that we have room to play with. With the price of the rest of the components two 256MB registered ECC DIMMs weren't expensive.
The CPU(s)
The fastest we could afford seemed like a good choice. We went with 1GHz Pentium III (the fastest available was 1.1GHz, but going one step down the ladder provided such a sizeable discount that doing so appeared to make sense). We decided to get a dual-CPU configuration, even though we understood the benefits of the second CPU would be slim to none in many configurations (FreeBSD-current being a notable potential exception).
I/O
We didn't care much for I/O in these systems, but we wanted to be able to occasionally use one of them for snooping. This meant SCSI disks. I like to get two identical disks into everything, money permitting (nightly dd or playing around with OSes). The smallest available disks were 18GB.

So, the final spec was:

	SuperMicro 370DE6 Motherboard
	2*256MB ECC registered DIMMs
	2*IBM 18.2GB ultra160 drives
	2*Intel Pentium III/1000 (133MHz FSB)
	SuperMicro SC760 Chassis
	3Com 3c985B-SX Gigabit Ethernet
	Cheap AGP video, 32x IDE CD-ROM, TEAC floppy

We ordered three of these machines from Plethora Internet at $3,000 apiece with an educational discount (we ordered and received these machines in June 2001). Thanks for the nice machines, Peter!

The chassis is the smallest that SuperMicro recommends for this motherboard. It's a full tower case (tall). With a few fans in the case, these beasts still generated enough heat to raise the temperature between the backplane and the wall to 35C in an air-conditioned lab. We moved them so that they were better ventilated.

Initial Setup

The machines were hooked up to the network with their fxp0 interfaces acting as general access points. Two of them were hooked up back-to-back with multi-mode fiber through their ti0 interfaces, with private IP numbers.

Initially, we put FreeBSD 4.3-RELEASE on them. I ramped maxusers up to 512 and NMBCLUSTERS to 102400 (feeling generous), enabled SMP and APIC_IO, removed a bunch of unnecessary devices, and added the ti driver. Then, I set kern.ipc.maxsockbuf=20480000, net.inet.tcp.sendspace=1024000, net.inet.tcp.recvspace=1024000, and set the MTU size on ti0 to 4470 bytes (the SONET MTU, which we would have in Abilene).

Preliminary Baseline Results

The latency across the strip of fiber, as measured by ping, was around 100us (obviously the actual propagation delay is much lower).

TCP throughput with a single Iperf connection was about 700-850Mb/s (yes, with tens of Mb/s of variation across individual runs) with send and receive buffer space of 1MB. CPU utilization during the test, as reported by top, was stable 50% (give or take 0.1%), which suggests that as we expected the machines were CPU-bound with software checksumming, and they could not make any use of the second CPU.

It's worth noticing that Iperf wouldn't correctly report throughput for runs longer than something over 20s (apparently since a 32-bit counter for bytes transferred would wrap around).
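A quick estimate of when a 32-bit byte counter would wrap at these rates is sketched below. Whether Iperf's counter was signed or unsigned is an assumption here, not something checked against its source; the sketch simply computes both cases. The function name is hypothetical.

```python
def seconds_until_wrap(rate_mbps: float, counter_bits: int = 32,
                       signed: bool = False) -> float:
    """Seconds of transfer at rate_mbps before a byte counter overflows.

    A signed counter overflows at 2**(bits-1) bytes, an unsigned one
    at 2**bits bytes. Which kind Iperf used is an assumption.
    """
    limit_bytes = 2 ** (counter_bits - 1 if signed else counter_bits)
    return limit_bytes * 8 / (rate_mbps * 1e6)

# Rates in the range observed in the baseline runs.
for rate in (700, 850, 975):
    print(f"{rate} Mb/s: signed wraps after "
          f"{seconds_until_wrap(rate, signed=True):.1f}s, "
          f"unsigned after {seconds_until_wrap(rate):.1f}s")
```

At 850Mb/s a signed 32-bit counter would overflow after roughly 20 seconds and an unsigned one after roughly 40, so the observed limit of "something over 20s" is at least consistent with a signed counter.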

With send and receive buffer space of 100KB, TCP throughput would rise to a stable 970-975Mb/s. With send and receive buffer space of 64KB, the throughput would drop to 865-870Mb/s. Notice that this is without RFC1323 extensions (no window scaling). So, increasing buffer space beyond the advertised window size did make a difference in performance.

More systematically, the following table shows TCP throughput dependence on buffer space, as reported by Iperf with 20-second measurement interval (at least three runs per reported range):

Buffer space	Throughput w/o RFC1323, Mb/s	With RFC1323, Mb/s
4096		2.2-3.4				2.1-3.7
8192		170-180				157-188
16384		104-144				62-91
32768		623-648				659-660
65536		861-865				795-821
131072		973-975				827-829
262144		870-884				692-693
524288		836-842				674-675
786432						.2/Didn't work
1048576		578-723				216-410/6/Didn't work
2097152		635-836				Didn't work
4194304		70-819
8388608		584-739				Didn't work

The machines were sensitive to any other activity, even running top when there are CPU cycles to burn on both CPUs, concurrently with Iperf. I suppose the context switching really hurts performance.

Problems with large window sizes

With window sizes over half a megabyte, our TCP transmission wouldn't work reliably. Sometimes, it would be just ridiculously slow (kilobits or single-digit megabits per second). Sometimes, it would freeze completely. In cases when it froze, one would be able to interrupt the server (receiver) with Control-C, but not the client (sender); the sender would miss its scheduled completion time of 10 or 20 seconds and would run for a long time, ignoring the interrupt character, but accepting the quit character. Usually the run would still complete in less than 10 minutes.

I've collected tcpdump traces from a stalled sender and stalled receiver (different views of the same connection); xplot time-series graphs produced by tcptrace from the sender side and the receiver side might prove useful.

Additionally, traces from a slow connection are available (sender side and receiver side). These files actually contain two connections, and the first is represented by these time-series graphs: sender side, receiver side.

It should be recognized that the process of collecting tcpdump data by itself changes the pattern of load on the machine. Additionally, it's not guaranteed that BPF sees the same things TCP stack sees.

This stalling happens both with NMBCLUSTERS=102400 and with the default value for maxusers 512. In fact, I increased NMBCLUSTERS while trying to troubleshoot the problem.

An annotated tcpdump of the stalled connection follows (sender view). Everywhere where ellipsis ("...") is used in the annotation, there are packets that we omit.

23:54:23.817524 10.0.0.2.1072 > 10.0.0.1.5001: S 577076198:577076198(0) win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 2323869 0> (DF) (ttl 64, id 32041)
23:54:23.817674 10.0.0.1.5001 > 10.0.0.2.1072: S 168494072:168494072(0) ack 577076199 win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18233)
23:54:23.817715 10.0.0.2.1072 > 10.0.0.1.5001: . ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32042)
23:54:23.817864 10.0.0.1.5001 > 10.0.0.2.1072: . ack 1 win 32768 <nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18234)

Connection initiated. Actual window advertised by both sides (taking window scaling into account) is 32768*2^6=2097152=2^21=2MB.

23:54:23.818382 10.0.0.2.1072 > 10.0.0.1.5001: P 1:4097(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32043)
23:54:23.818403 10.0.0.2.1072 > 10.0.0.1.5001: P 4097:8193(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32044)
23:54:23.818475 10.0.0.2.1072 > 10.0.0.1.5001: P 8193:12289(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32046)
23:54:23.818493 10.0.0.2.1072 > 10.0.0.1.5001: P 12289:16385(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32047)

...We proceed to blast packets...

23:54:23.818613 10.0.0.1.5001 > 10.0.0.2.1072: . ack 8193 win 32640 <nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18235)
23:54:23.818645 10.0.0.2.1072 > 10.0.0.1.5001: P 32769:36865(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32052)
23:54:23.818664 10.0.0.2.1072 > 10.0.0.1.5001: P 36865:40961(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32053)
23:54:23.818692 10.0.0.2.1072 > 10.0.0.1.5001: P 40961:45057(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32054)
23:54:23.818712 10.0.0.2.1072 > 10.0.0.1.5001: P 45057:49153(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32055)
23:54:23.818726 10.0.0.1.5001 > 10.0.0.2.1072: . ack 16385 win 32512 <nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18236)

...We see some ACKs coming in, but at this point we really don't care as we can send 2MB (512 packets) before we need to even bother to look at anything the remote side says...

23:54:23.822377 10.0.0.2.1072 > 10.0.0.1.5001: P 528385:532481(4096) ack 1 win 32768 <nop,nop,timestamp 2323870 2321728> (DF) (ttl 64, id 32173)
23:54:23.822391 10.0.0.1.5001 > 10.0.0.2.1072: . ack 466945 win 25600 <nop,nop,timestamp 2321728 2323870> (DF) (ttl 64, id 18291)
23:54:23.822419 10.0.0.2.1072 > 10.0.0.1.5001: P 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2323870 2321728> (DF) (ttl 64, id 32174)

Everything goes well until this point. The preceding packet is the first packet lost in this connection. We can blast a lot more according to the window size. Since there were no losses up to this point (130 packets), our cwnd is the same as window size for a while now. We go on to send more stuff...

23:54:23.996387 10.0.0.2.1072 > 10.0.0.1.5001: . 2293761:2297857(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32604)
23:54:23.996407 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18389)
23:54:23.996444 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18390)
23:54:23.996495 10.0.0.2.1072 > 10.0.0.1.5001: . 2297857:2301953(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32605)
23:54:23.996579 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18391)
23:54:23.996597 10.0.0.2.1072 > 10.0.0.1.5001: . 2301953:2306049(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32606)
23:54:23.996657 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18392)
23:54:23.996698 10.0.0.2.1072 > 10.0.0.1.5001: . 2306049:2310145(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32607)
23:54:23.996753 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18393)
23:54:23.996801 10.0.0.2.1072 > 10.0.0.1.5001: . 2310145:2314241(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32608)
23:54:23.996853 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18394)
23:54:23.996898 10.0.0.2.1072 > 10.0.0.1.5001: . 2314241:2318337(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32609)
23:54:23.996967 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18395)

The other end has noticed the loss and is reporting it with duplicate ACKs. There are lots more of these duplicates down the road, but even at this point it should be obvious that there was a loss. Fast Retransmit should kick in and we need to resend starting from offset 532481. Instead, we proceed to send the full window before we turn around to face the problem of loss. Why don't we resend starting from offset 532481 at this point? This isn't good. We proceed to send...

23:54:24.005786 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18469)
23:54:24.005875 10.0.0.2.1072 > 10.0.0.1.5001: . 2621441:2625537(4096) ack 1 win 32768 <nop,nop,timestamp 2323888 2321745> (DF) (ttl 64, id 32684)
23:54:24.005899 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18470)
23:54:24.006007 10.0.0.2.1072 > 10.0.0.1.5001: P 2625537:2629633(4096) ack 1 win 32768 <nop,nop,timestamp 2323888 2321745> (DF) (ttl 64, id 32685)
23:54:24.006037 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2323888 2321745> (DF) (ttl 64, id 32686)

At this point, we have sent 2629633-532481=2097152 bytes ahead, which is exactly our window size. Finally, we notice the duplicate ACKs are coming in and resend the segment starting from 532481.

23:54:24.006060 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18471)
23:54:24.006178 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18472)

Up to this point, everything was happening in rapid fire mode. Now's the first time there's any waiting. We just resent the segment starting from 532481, and there were only two more duplicate ACKs for it since then: not enough to warrant another Fast Retransmit. So, we wait 1 second.

23:54:25.000538 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2323988 2321746> (DF) (ttl 64, id 32687)

We resend the unlucky packet again, without getting any ACK back. (Peek ahead: What's actually happening is that all large packets are being lost due to jumbo slots shortage in the driver while small packets get through fine.)

23:54:27.000562 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2324188 2321746> (DF) (ttl 64, id 32688)

We wait 2 seconds now, and send again. No response.

23:54:31.000646 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2324588 2321746> (DF) (ttl 64, id 32689)

Wait 4 seconds, send again, no response.

23:54:39.000751 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2325388 2321746> (DF) (ttl 64, id 32690)

Wait 8 seconds, send again, no response.

23:54:55.001004 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2326988 2321746> (DF) (ttl 64, id 32691)

Wait 16 seconds, send again, no response.

23:55:27.001500 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2330188 2321746> (DF) (ttl 64, id 32692)

Wait 32 seconds, send again, no response...

00:00:47.006327 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2362188 2321746> (DF) (ttl 64, id 32697)
00:01:51.007294 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 <nop,nop,timestamp 2368588 2321746> (DF) (ttl 64, id 32698)

A pause of 64 seconds between these last packets. It's 446 seconds since we have sent the first retransmission.

00:02:55.008237 10.0.0.2.1072 > 10.0.0.1.5001: R 2629633:2629633(0) ack 1 win 32768 (DF) (ttl 64, id 32699)

Our patience is exhausted. Reset the connection.
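Two pieces of arithmetic in the sender trace can be checked with a short sketch: the effective window implied by the 16-bit window field and the negotiated scale factor, and the retransmission timer doubling seen in the timestamps. The initial timeout of 1 second, the cap of 64 seconds, and the count of 12 timeout retransmissions are all read off this trace (IP ids 32687 through 32698); they are not taken from the FreeBSD source, and the function names are made up for illustration.

```python
def advertised_window(win_field: int, wscale: int) -> int:
    """Effective window in bytes: the window field shifted left by the
    scale factor negotiated in the SYN exchange (RFC 1323)."""
    return win_field << wscale

def backoff_schedule(initial_rto: int = 1, cap: int = 64,
                     retries: int = 12) -> list:
    """Waits (in seconds) before each timeout retransmission, doubling
    each time up to a cap. All three defaults are inferred from the
    trace above, not from the kernel source."""
    waits = []
    rto = initial_rto
    for _ in range(retries):
        waits.append(rto)
        rto = min(rto * 2, cap)
    return waits

window = advertised_window(32768, 6)
print("window:", window, "bytes =", window // 4096, "segments of 4096 bytes")

waits = backoff_schedule()
print("waits:", waits)
print("fast retransmit to last timeout retransmission:", sum(waits), "seconds")
```

With these assumptions the waits add up to 447 seconds, which matches the gap between the fast retransmit at 23:54:24 and the last retransmission at 00:01:51; one more 64-second wait leads to the RST at 00:02:55.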

Looking at the sender's behavior in this example we see that:

[2001-08-28 update: Matt Mathis seems to have an explanation for the apparent delay of fast retransmit behavior: since BPF is below IP, we are capturing one queue down the line when we capture at the "sender". So, the remainder of the window's worth of data could have been queued above BPF in the sending machine. This looks plausible, but the setup is disassembled for WAN testing, so I can't really verify that this is what's happening.]

Let's now examine the receiver's view of the same connection:

23:54:23.681963 10.0.0.2.1072 > 10.0.0.1.5001: S 577076198:577076198(0) win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 2323869 0> (DF) (ttl 64, id 32041)
23:54:23.682021 10.0.0.1.5001 > 10.0.0.2.1072: S 168494072:168494072(0) ack 577076199 win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18233)
23:54:23.682161 10.0.0.2.1072 > 10.0.0.1.5001: . ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32042)
23:54:23.682208 10.0.0.1.5001 > 10.0.0.2.1072: . ack 1 win 32768 <nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18234)
23:54:23.682875 10.0.0.2.1072 > 10.0.0.1.5001: P 1:4097(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32043)
23:54:23.682907 10.0.0.2.1072 > 10.0.0.1.5001: P 4097:8193(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32044)
23:54:23.682949 10.0.0.1.5001 > 10.0.0.2.1072: . ack 8193 win 32640 <nop,nop,timestamp 2321727 2323869> (DF) (ttl 64, id 18235)
23:54:23.682969 10.0.0.2.1072 > 10.0.0.1.5001: P 8193:12289(4096) ack 1 win 32768 <nop,nop,timestamp 2323869 2321727> (DF) (ttl 64, id 32046)

Initial handshake, start of transmission, everything goes well for a while...

23:54:23.687188 10.0.0.2.1072 > 10.0.0.1.5001: P 524289:528385(4096) ack 1 win 32768 <nop,nop,timestamp 2323870 2321728> (DF) (ttl 64, id 32172)
23:54:23.687198 10.0.0.1.5001 > 10.0.0.2.1072: . ack 524289 win 24704 <nop,nop,timestamp 2321728 2323870> (DF) (ttl 64, id 18298)
23:54:23.687222 10.0.0.2.1072 > 10.0.0.1.5001: P 528385:532481(4096) ack 1 win 32768 <nop,nop,timestamp 2323870 2321728> (DF) (ttl 64, id 32173)
23:54:23.799127 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24576 <nop,nop,timestamp 2321739 2323870> (DF) (ttl 64, id 18299)
23:54:23.855860 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24704 <nop,nop,timestamp 2321744 2323870> (DF) (ttl 64, id 18300)
23:54:23.855957 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24832 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18302)

Apparently, BPF or tcpdump has lost some incoming packets here, which the TCP stack has seen, and which caused the TCP stack to generate those duplicate ACKs.

23:54:23.856019 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24960 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18303)
23:54:23.856070 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25088 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18304)
23:54:23.856123 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25216 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18305)
23:54:23.856180 10.0.0.2.1072 > 10.0.0.1.5001: . 2105345:2109441(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321744> (DF) (ttl 64, id 32558)

Good grief! That's what I call burst loss. Everything between 532481 and 2105345 (384 packets) was lost. (Well, almost everything: something must have caused us to generate those duplicate ACKs.)

23:54:23.856220 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25280 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18306)
23:54:23.856264 10.0.0.2.1072 > 10.0.0.1.5001: . 2109441:2113537(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321744> (DF) (ttl 64, id 32559)
23:54:23.856288 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25408 <nop,nop,timestamp 2321745 2323870> (DF) (ttl 64, id 18307)
23:54:23.856345 10.0.0.2.1072 > 10.0.0.1.5001: . 2113537:2117633(4096) ack 1 win 32768 <nop,nop,timestamp 2323887 2321745> (DF) (ttl 64, id 32560)

They go on blasting, we go on sending ACK for 532481...

23:54:23.870256 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18470)
23:54:23.870374 10.0.0.2.1072 > 10.0.0.1.5001: . 2621441:2625537(4096) ack 1 win 32768 <nop,nop,timestamp 2323888 2321745> (DF) (ttl 64, id 32684)
23:54:23.870403 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18471)
23:54:23.870495 10.0.0.2.1072 > 10.0.0.1.5001: P 2625537:2629633(4096) ack 1 win 32768 <nop,nop,timestamp 2323888 2321745> (DF) (ttl 64, id 32685)
23:54:23.870526 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 <nop,nop,timestamp 2321746 2323870> (DF) (ttl 64, id 18472)

They have almost exhausted the window now (there's one packet's worth left in it, and from the examination of the sender's view it's known that this packet was in fact sent, but we don't see it).

00:02:54.876791 10.0.0.2.1072 > 10.0.0.1.5001: R 2629633:2629633(0) ack 1 win 32768 (DF) (ttl 64, id 32699)

And now, out of the blue, and after quite some time, they just send us this RST packet. We just ignore it, without sending an RST back, even though it's within our window (2629633-532481=2097152=2^21=window size). Is this simply an off-by-one error in TCP code ("<" instead of "<=") somewhere? Or maybe tcpdump just never gets to see the outgoing RST packet (unlikely, given there was no load on the machine)?

To summarize, on the receiver side we see that:

Jonathan Lemon seems to have confirmed the off-by-one bug, but not the lack of Fast Retransmit.
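The boundary case in question can be made concrete with a small sketch. The RST's sequence number sits exactly on the right edge of the advertised window, so a strict and an inclusive comparison disagree about it. The two functions below are illustrative only: they ignore 32-bit sequence number wraparound, and they make no claim about which form the FreeBSD 4.3 code actually used.

```python
def in_window_strict(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Accept seq only if rcv_nxt <= seq < rcv_nxt + rcv_wnd."""
    return rcv_nxt <= seq < rcv_nxt + rcv_wnd

def in_window_inclusive(seq: int, rcv_nxt: int, rcv_wnd: int) -> bool:
    """Accept seq if rcv_nxt <= seq <= rcv_nxt + rcv_wnd."""
    return rcv_nxt <= seq <= rcv_nxt + rcv_wnd

# Values from the receiver trace (relative sequence numbers).
RCV_NXT = 532481    # last cumulative ACK sent by the receiver
RCV_WND = 2097152   # 32768 << 6
RST_SEQ = 2629633   # sequence number carried by the RST

print("offset into window:", RST_SEQ - RCV_NXT)
print("strict test accepts RST:   ", in_window_strict(RST_SEQ, RCV_NXT, RCV_WND))
print("inclusive test accepts RST:", in_window_inclusive(RST_SEQ, RCV_NXT, RCV_WND))
```

Only the inclusive form accepts this particular RST, which is why a one-character difference in the comparison is enough to explain the receiver ignoring it.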

While loss is happening, communication using small packet sizes works fine. However, while loss is happening, and for a long while after the TCP connection completes, sometimes ping -s4096 from the receiver gets zero responses, while from the sender it works fine; sometimes it's just the other way around.

The apparent cause of the loss is in the ti driver on the receiver's side. The receiver's kernel keeps printing these messages:

ti0: no free jumbo buffers
ti0: jumbo allocation failed -- packet dropped!
ti0: no free jumbo buffers
ti0: jumbo allocation failed -- packet dropped!

The relevant files appear to be if_ti.c and if_tireg.h. The former file contains the following comment:

/*
 * Memory management for the jumbo receive ring is a pain in the
 * butt. We need to allocate at least 9018 bytes of space per frame,
 * _and_ it has to be contiguous (unless you use the extended
 * jumbo descriptor format). Using malloc() all the time won't
 * work: malloc() allocates memory in powers of two, which means we
 * would end up wasting a considerable amount of space by allocating
 * 9K chunks. We don't have a jumbo mbuf cluster pool. Thus, we have
 * to do our own memory management.
 *
 * The driver needs to allocate a contiguous chunk of memory at boot
 * time. We then chop this up ourselves into 9K pieces and use them
 * as external mbuf storage.
 *
 * One issue here is how much memory to allocate. The jumbo ring has
 * 256 slots in it, but at 9K per slot than can consume over 2MB of
 * RAM. This is a bit much, especially considering we also need
 * RAM for the standard ring and mini ring (on the Tigon 2). To
 * save space, we only actually allocate enough memory for 64 slots
 * by default, which works out to between 500 and 600K. This can
 * be tuned by changing a #define in if_tireg.h.
 */
It appears, however, that the values have changed since then, as if_tireg.h contains:
/*
 * Memory management stuff. Note: the SSLOTS, MSLOTS and JSLOTS
 * values are tuneable. They control the actual amount of buffers
 * allocated for the standard, mini and jumbo receive rings.
 */

#define TI_SSLOTS       256
#define TI_MSLOTS       256
#define TI_JSLOTS       384
Now, I don't have any memory shortage here, and would even be willing to spend a whole other TCP window (the target is to have it at roughly 8-16MB) on driver buffers (why it should be so escapes me, but it still would be OK). I wish I knew what values make sense, though! The value for TI_JSLOTS (384) is already greater than the claimed number of jumbo ring slots (256).

ti Driver Problem Solved, Sort Of

I haven't heard any concrete suggestions from freebsd-net after sending the question about TI_JSLOTS values, so I decided to be on the safe side and changed TI_JSLOTS in /sys/pci/if_tireg.h from 384 to 8192 (twice the number of 4KB packets in a 16MB window). This appears to have removed the symptoms of stalling and burst loss.
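The memory cost of the various TI_JSLOTS values, and the reasoning behind 8192, can be sketched as follows. The sketch assumes 9K means 9216 bytes per slot; the driver's exact per-slot length may differ slightly. The function names and the safety factor parameter are illustrative, not part of the driver.

```python
SLOT_BYTES = 9 * 1024  # assumption: "9K" per jumbo slot, as in the driver comment

def jumbo_pool_bytes(slots: int, slot_bytes: int = SLOT_BYTES) -> int:
    """Memory consumed by a jumbo buffer pool of the given size."""
    return slots * slot_bytes

def slots_for_window(window_bytes: int, packet_bytes: int = 4096,
                     safety: int = 2) -> int:
    """Slots needed to hold a full window of packets, times a safety factor."""
    return safety * window_bytes // packet_bytes

for slots in (64, 256, 384, 8192):
    kib = jumbo_pool_bytes(slots) // 1024
    print(f"TI_JSLOTS={slots}: {kib} KB")

print("slots for a 16MB window of 4KB packets, doubled:",
      slots_for_window(16 * 2**20))
```

Under these assumptions 64 slots take 576KB (matching the "between 500 and 600K" in the comment), 384 slots take about 3.4MB, and 8192 slots take 72MB, which is affordable on a machine with 512MB of RAM.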

Here are the throughput values (as reported by Iperf, in Mb/s, three runs of 20 seconds, extreme values reported) that I get from it now (with RFC1323):

Window Size		4470 bytes MTU		8940 bytes MTU
524288			690			855-986
1048576			657-659			986
2097152			561-563			984-988
4194304			217-218			986-989
8388608			93			985-988
16777216		86			984-986

To a large extent, as far as I am concerned, the problem is now "solved." However, an unanswered question remains: Why didn't the ti driver recover very quickly from the shortage of jumbo slots? The interface would malfunction for different periods of time after it ran out of jumbo slots. In some cases, it was enough to completely stall the connection and have it time out; in other cases it produced a one-second-step "ladder" on the time-series graph, indicating that the driver would recover fairly quickly (but not as quickly as one would expect or desire).

Wide Area Network Testing

On 2001-10-02 we have two GigaTCP machines installed at SoX and PNW GigaPoPs (with jumbo frames connectivity to Abilene) for WAN testing. The (symmetric) path between machines is:
gigatcp1 - Extreme Summit 7i - SoX M40 - ATLA - IPLS - KSCY - 
		DNVR - STTL - PNW pnwgpi2 M40 - PNW uwbr4 M40 - gigatcp2
(The four-letter codes are Internet2 backbone Abilene core routers. All links are OC-48c POS, except for terminating Gigabit Ethernet SX on the two sides going into the machines.)

The bottleneck link capacity is therefore 1Gb/s. All OC-48c POS circuits on the path are generally uncongested. The round-trip time is 58.3ms:

100 packets transmitted, 100 packets received, 0% packet loss
round-trip min/avg/max/stddev = 58.300/58.346/58.434/0.029 ms
This comes out to be a link that can hold 7,287,500 bytes. This number is the minimal TCP window size at which we can hope to fully use the link. TCP throughput vs. window size (as reported by Iperf, in Mb/s, three runs of 20 seconds, extreme values reported) was as reported in the following table. In some cases, TCP performance was unstable and I would get a number that's an order of magnitude lower than the bulk of results for a given window size; these data were discarded.
Window Size	SoX->PNW	PNW->SoX
524288		53-53.5		69.6
1048576		86.6-94.1	139
1166000		92.2-93.4	153-154
1457500		106-158		192
1572864		156-160		200-201
2097152		157-160		198-203
4194304		155-161		206-207
8388608		156-160		203-204
16777216	158-162		199-203
The machines are identical with identical OS and configuration.

SoX to PNW case: Note how throughput never goes above 160Mb/s. A window size of 1.1MB would in principle (taking into consideration delay*bandwidth only) allow one to achieve such throughput. In reality, one only achieves 93Mb/s with this window size.

PNW to SoX case: One could achieve 200Mb/s at 1457500-byte window. At that window size, one gets a respectable 192Mb/s.
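The window sizes quoted in this section all come from the delay*bandwidth product. A minimal sketch of that arithmetic follows; it uses the measured minimum round-trip time of 58.3ms (58300 microseconds) and ignores header overhead and loss, so it gives upper bounds only. The function names are made up for illustration.

```python
RTT_US = 58300  # measured minimum round-trip time, in microseconds

def window_bytes(rate_mbps: float, rtt_us: int = RTT_US) -> float:
    """Bytes in flight needed to sustain rate_mbps over the given RTT.
    Mb/s times microseconds gives bits; divide by 8 for bytes."""
    return rate_mbps * rtt_us / 8

def window_limited_mbps(window: int, rtt_us: int = RTT_US) -> float:
    """Upper bound on throughput (Mb/s) for a given window: one window per RTT."""
    return window * 8 / rtt_us

for rate in (1000, 200, 160):
    print(f"{rate} Mb/s needs a window of {window_bytes(rate):,.0f} bytes")

for window in (524288, 1048576, 1166000, 1457500):
    print(f"window {window}: at most {window_limited_mbps(window):.1f} Mb/s")
```

By this bound a 524288-byte window allows about 72Mb/s and a 1048576-byte window about 144Mb/s, so the PNW to SoX results (69.6 and 139Mb/s) track the window limit closely, while the SoX to PNW results fall well short of it.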

In both cases, examination of top output while Iperf with different window sizes is running seems to indicate that once throughput is saturated, one of the CPUs of the sender (but not the receiver) is fully used and becomes the bottleneck. Two identical CPUs become bottlenecks of different throughputs! (Is electricity in Seattle better than it is in Atlanta?) One (weak) explanation of such behavior could be that some data structures in the TCP/IP stack or in the driver are getting set up in a particular way (which can depend on semi-random circumstances of the beginning of the test) and then the machine becomes stuck in this state. To confirm or deny this explanation, I rebooted both machines and ran a subset of the tests again (identical methodology):

Window Size	SoX->PNW	PNW->SoX
524288		57.7-58.4	69.6
2097152		182-184		212-216
8388608		174-178		204-208

The difference in results between batches and the consistency within a batch seem to support the hypothetical explanation, but don't appear to be conclusive. To get more data, I performed another pair of reboots and another small batch of testing:

Window Size	SoX->PNW	PNW->SoX
2097152		180-185		215-218
In this case, there appears to be no difference from the previous batch.

An additional piece of data is that in every batch of testing after a fresh reboot the first data connection from SoX to PNW would become seemingly stuck and would only finish after about a minute (with `-t20' option for Iperf) with a throughput of hundreds of Kb/s. This never repeats with subsequent connections until the machine is rebooted and doesn't appear to occur in the other direction, even though during the last batch I started with PNW to SoX connections while in other batches I was starting with the other direction.

After-hours testing on 2001-10-02 with UDP showed that in the SoX to PNW direction one can get up to 528Mb/s with 3-4% packet loss and minor reordering (which could be sustained for a few minutes and externally verified by the Abilene weather map). In the opposite direction, up to 650Mb/s is possible with a small fraction of a percentage point of loss, but only for short time intervals (such as 10s); at some point the loss shoots up very sharply to 60-90% (and performance deteriorates accordingly). Bandwidth similar to that of the other direction could be sustained and verified externally. In other words, UDP testing found no definitive path asymmetry.

Typical sender CPU states:

CPU states:  1.4% user,  0.0% nice, 47.7% system,  1.0% interrupt, 50.0% idle
Typical receiver CPU states with major loss:
CPU states:  0.0% user,  0.0% nice, 11.2% system, 40.3% interrupt, 48.4% idle
Typical receiver CPU states without major loss:
CPU states:  0.4% user,  0.0% nice, 22.0% system, 16.7% interrupt, 60.9% idle
Typical sender top display of the Iperf process:
  PID USERNAME PRI NICE  SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  505 shalunov  59   0  1292K   912K CPU1   0   2:32 99.06% 99.02% iperf
Typical receiver top display of the Iperf process with major loss is similar to that of the sender. Typical receiver top display of the Iperf process without major loss:
  PID USERNAME PRI NICE  SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
  519 shalunov  53   0  1292K   912K CPU1   0   1:15 58.68% 58.59% iperf
Please remember that these are dual-CPU machines with FreeBSD-4.3, where the giant kernel lock is present (so the receiver uses slightly more than one CPU when it is reporting major packet loss).
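The "slightly more than one CPU" remark can be checked from the CPU states lines above. A small sketch, assuming that top averages its percentages over both CPUs; the helper name is hypothetical:

```python
import re

def busy_cpus(cpu_states_line, ncpu=2):
    """CPUs' worth of non-idle time implied by a top 'CPU states' line.

    Assumes the percentages are averaged over all ncpu processors.
    """
    idle = float(re.search(r"([\d.]+)% idle", cpu_states_line).group(1))
    return (100.0 - idle) / 100.0 * ncpu

sender = "CPU states:  1.4% user,  0.0% nice, 47.7% system,  1.0% interrupt, 50.0% idle"
recv_loss = "CPU states:  0.0% user,  0.0% nice, 11.2% system, 40.3% interrupt, 48.4% idle"
recv_ok = "CPU states:  0.4% user,  0.0% nice, 22.0% system, 16.7% interrupt, 60.9% idle"

for name, line in [("sender", sender), ("receiver, major loss", recv_loss),
                   ("receiver, no major loss", recv_ok)]:
    print(f"{name}: {busy_cpus(line):.2f} CPUs busy")
```

Under that assumption the sender keeps exactly one CPU busy, the receiver with major loss slightly more than one, and the receiver without major loss less than one.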

To verify whether the possible asymmetry is explained by a fragmented data structure, I have followed advice from Bosko Milekic and increased NCL_INIT and NMB_INIT in uipc_mbuf.c (both to 32768, after looking at `netstat -m' output). The results of testing after the change:

Window Size	SoX->PNW	PNW->SoX
524288		58.6-60		68.9-69
2097152		184-188		220-222
4194304		174-176		202-206
8388608		173-176		207-209
Stalling of the first connection after reboot in the SoX to PNW direction is still there (not reflected in the table, as it only happens once). The increase of initial allocation appeared to make no significant difference in the test results.

The next step seems to be to enable hardware checksums, even though trouble was reported with them:

--- if_ti.c.orig        Wed Aug 23 20:07:58 2000
+++ if_ti.c     Fri Oct  5 15:42:13 2001
@@ -125,9 +124,0 @@
-/*
- * Temporarily disable the checksum offload support for now.
- * Tests with ftp.freesoftware.com show that after about 12 hours,
- * the firmware will begin calculating completely bogus TX checksums
- * and refuse to stop until the interface is reset. Unfortunately,
- * there isn't enough time to fully debug this before the 4.1
- * release, so this will need to stay off for now.
- */
-#ifdef notdef
@@ -135,3 +125,0 @@
-#else
-#define TI_CSUM_FEATURES       0
-#endif
Surprisingly, it seems that hardware checksums actually hurt performance quite significantly, reducing it to 35-40Mb/s for all considered window sizes in the SoX to PNW direction. In this direction, packets sent were also 954 payload bytes in size. Later, the payload size changed to 1440 bytes without any other visible changes, and then reverted back to 954. Throughput measured at a different time was about 53-172Mb/s.

Nothing seems to be changed in the opposite direction. That is, for PNW->SoX direction, throughput, packet size, CPU utilization all stay the same after turning on hardware checksums.

The change in packet size does not appear to be explained by anything in the TCP negotiation:

15:36:21.838823 gigatcp1.1051 > gigatcp2.5001: S 2343373891:2343373891(0) win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 164007190 0> (DF)
15:36:21.897227 gigatcp2.5001 > gigatcp1.1051: S 4012531915:4012531915(0) ack 2343373892 win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 163922984 164007190> (DF)
15:36:21.897266 gigatcp1.1051 > gigatcp2.5001: . ack 1 win 32782 <nop,nop,timestamp 164007196 163922984> (DF)
15:36:21.897821 gigatcp1.1051 > gigatcp2.5001: . 1:1441(1440) ack 1 win 32782 <nop,nop,timestamp 164007196 163922984> (DF)
15:36:21.955612 gigatcp2.5001 > gigatcp1.1051: . ack 1 win 32768 <nop,nop,timestamp 163922990 164007196> (DF)
15:36:22.049129 gigatcp2.5001 > gigatcp1.1051: . ack 1441 win 32768 <nop,nop,timestamp 163923000 164007196> (DF)
15:36:22.049184 gigatcp1.1051 > gigatcp2.5001: . 1441:2881(1440) ack 1 win 32782 <nop,nop,timestamp 164007211 163923000> (DF)
[...]
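To compare advertised MSS with actual payload sizes across a longer capture, one can pull both numbers out of the tcpdump text. A rough sketch, assuming output in exactly the format of the trace above; the regular expressions and the function name are illustrative and not part of any tool:

```python
import re

# Three lines copied from the trace: both SYNs and the first data segment.
TRACE = """\
15:36:21.838823 gigatcp1.1051 > gigatcp2.5001: S 2343373891:2343373891(0) win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 164007190 0> (DF)
15:36:21.897227 gigatcp2.5001 > gigatcp1.1051: S 4012531915:4012531915(0) ack 2343373892 win 65535 <mss 4430,nop,wscale 6,nop,nop,timestamp 163922984 164007190> (DF)
15:36:21.897821 gigatcp1.1051 > gigatcp2.5001: . 1:1441(1440) ack 1 win 32782 <nop,nop,timestamp 164007196 163922984> (DF)
"""

MSS_RE = re.compile(r"<mss (\d+)")              # MSS option in a SYN
SEG_RE = re.compile(r" (\d+):(\d+)\((\d+)\)")   # first:last(length)

def summarize(trace):
    """Return (advertised MSS values, non-zero payload lengths)."""
    advertised, payloads = [], []
    for line in trace.splitlines():
        mss = MSS_RE.search(line)
        if mss:
            advertised.append(int(mss.group(1)))
        seg = SEG_RE.search(line)
        if seg and int(seg.group(3)) > 0:
            payloads.append(int(seg.group(3)))
    return advertised, payloads

advertised, payloads = summarize(TRACE)
print("advertised MSS:", advertised)
print("payload sizes: ", payloads)
```

On these lines both sides advertise 4430 bytes while the data segment carries 1440, which is the mismatch described in the text.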

SC2001

Our third machine, which was sent to SC2001, was smashed in transit and would not boot on arrival. That was bad, but the folks at RackSaver were incredibly kind and provided a nice machine for us: dual Pentium4, 512MB DRAM, and a 64-bit 66MHz PCI bus into which we plugged our Gigabit Ethernet card. The machine was then installed in the SCinet NOC and directly connected to a Gigabit Ethernet interface on a Juniper M160, which had an OC-192c POS connection to a Cisco GSR, which had 2*OC-48c POS to DENV (Denver core node of Abilene).

We have mostly used the machine at SCinet and the machine at PNW for subsequent testing. The RTT between these machines, according to ping, was 28.6ms:

round-trip min/avg/max/stddev = 28.617/28.636/28.726/0.018 ms
The Path MTU was 4470 bytes.

We continued to observe asymmetry in the TCP payload size: in one direction, it would work as expected (sending 4096-byte TCP payloads), but in the other direction, there would be a handshake where both sides would advertise a 4430-byte MSS and then the sender would proceed to send TCP packets with 1448-byte payloads without ever getting any ICMP Fragmentation Needed messages.

An investigation has revealed that in tcp_input.c there are two different functions: tcp_mssopt(), which determines what MSS value is advertised during the handshake, and tcp_mss(), which determines the actual MSS value that will be used. The latter function has a peculiar bug: when one tries to connect to a previously unknown destination via TCP, it grabs the MTU attribute of a freshly cloned route, and this attribute has nothing to do with the appropriate interface MTU, but rather is based on a driver-dependent constant; needless to say, for all Ethernet drivers this constant is 1500 bytes. Since this value then gets stuck in the cloned rtentry for a long time (about an hour), all TCP connections sending data from the machine to the particular remote destination will not use jumbo frames. The opposite direction works fine. The MTU attribute can be printed by issuing the command `route get destination'.
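The observed sizes are consistent with this explanation. A back-of-the-envelope sketch, assuming IPv4 and TCP headers without options other than a TCP timestamp on every data segment (12 bytes with padding); that timestamps were in use on this path is an assumption here, and the function names are made up:

```python
IP_HEADER = 20          # bytes, IPv4 header without options
TCP_HEADER = 20         # bytes, TCP header without options
TIMESTAMP_OPTION = 12   # bytes: nop,nop,timestamp (assumed to be in use)

def mss_for_mtu(mtu):
    """MSS a host would derive from a given MTU."""
    return mtu - IP_HEADER - TCP_HEADER

def payload_for_mtu(mtu):
    """Payload per segment once the timestamp option is subtracted."""
    return mss_for_mtu(mtu) - TIMESTAMP_OPTION

print("advertised MSS for the 4470-byte path MTU:", mss_for_mtu(4470))
print("payload with the stuck 1500-byte route MTU:", payload_for_mtu(1500))
```

The first figure is the 4430-byte MSS seen in the handshake, and the second is the 1448-byte payload the sender actually used.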

This is a bug in FreeBSD. A very similar pattern of behavior was observed on Linux with Web100 patches, but there I could not easily track down where the PMTU information is cached.

Throughput at SC2001 with different window sizes (during prime time) was (methodology as in all other tests, results in Mb/s):

Window		PNW->SC2001
1048576		242-282
2097152		307-320
4194304		307-344
8388608		346-371
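For comparison, the window size alone puts an upper bound of window/RTT on throughput. A minimal sketch, assuming the 28.6ms RTT reported by ping and decimal megabits; the names are illustrative:

```python
RTT_S = 28.6e-3  # RTT between the SCinet and PNW machines, in seconds

# Window size in bytes -> observed (low, high) throughput in Mb/s, from the table.
OBSERVED = {
    1048576: (242, 282),
    2097152: (307, 320),
    4194304: (307, 344),
    8388608: (346, 371),
}

def ceiling_mbps(window_bytes, rtt_s=RTT_S):
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_s / 1e6

for window, (low, high) in OBSERVED.items():
    print(f"{window:>8} bytes: ceiling {ceiling_mbps(window):7.1f} Mb/s, "
          f"observed {low}-{high} Mb/s")
```

Only the 1048576-byte window comes close to its bound (a little under 300Mb/s); with larger windows the bound is far above the measured numbers, so something other than window size would have to be the limit there.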