I'd like to acknowledge outside advice from Phil Dykstra of
WareOnEarth Communications, Matt Hodge of University of Washington,
Jonathan Lemon, Ken Merry of FreeBSD zero copy fame, Joerg Micheel of
University of Waikato, Bosko Milekic, David Richardson of University
of Washington, and Peter
Seebach of Plethora Internet.
We at Internet2 wanted to build two or more machines that would be
capable of achieving TCP throughputs of 700-800Mb/s over WAN (with a
single TCP connection). The purposes of such an exercise are:
to be able to tell how the networks we engineer today will behave
a few years down the road when TCP connections with such throughput
become more commonplace;
more specifically, to be able to measure impact of things like
minor packet re-ordering on TCP throughput;
to validate the assumption of low loss in Internet2 networks;
to learn more about effects of network techniques such as SACK;
to learn more about effects of host techniques such as zero copy TCP;
to demonstrate that in tomorrow's network TCP can still provide
viable transport even with today's algorithms;
to evaluate the real need for dangerous host techniques such as
hardware checksumming;
to understand just exactly how hard it is to run
hundreds-of-megabits-per-second TCP flows over WAN today and what will
be the difficulties involved in doing so;
possibly, to break or at least approach Microsoft's current
land speed record of 786Mb/s from Seattle to Atlanta (not a reason to
start building machines, but if we already have them, we should try).
Operating System
We wanted something that'd have available source code. BSD/OS
seemed unnecessarily expensive, given its competition, and our past
experience with its drivers was less than glamorous. Of the other
BSDs, FreeBSD was the clear winner with its ubiquity and stack
optimizations. We also considered Linux, including Linux with
Web100 patches.
We decided to put Linux on ice for the time being. It lacks zero
copy support (sendfile(2) notwithstanding: sendfile() doesn't do
anything about the receive side, and it only works with disk files;
disk would easily become our bottleneck, and we're not into I/O
monstrosity, we're into TCP monstrosity). It also lacks any clear
advantages over FreeBSD for the task at hand. The option to use it
later--if we wanted--was left open, and hardware was chosen so that
it'd work well with both FreeBSD and Linux.
Hardware
First, we wanted to buy a commercial off-the-shelf Dell server and
go from there. Once we realized how much we would have to pay for the
features we want and how much unnecessary I/O fanciness we'd get with
that, we started to evaluate the option of custom-built machines.
PCI bus
Regular 32-bit 33MHz PCI bus would afford nominal theoretical
throughput of just over 1Gb/s, half-duplex. With overhead and
direction switches that would come uncomfortably close to the desired
TCP goodput. Therefore, we wanted 64-bit 66MHz PCI bus.
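The bus arithmetic above can be sketched as follows. This is only an illustration: the function name is made up, and the rounded 33 and 66 MHz clock figures are used rather than the exact PCI clock rates.

```python
# Nominal PCI bandwidth = bus width (bits) x clock rate (Hz).
# Assumption: the rounded 33 MHz and 66 MHz clock names are used as-is.

def pci_peak_gbps(width_bits, clock_mhz):
    """Return the theoretical peak transfer rate in Gb/s (half-duplex)."""
    return width_bits * clock_mhz * 1e6 / 1e9

plain_pci = pci_peak_gbps(32, 33)   # regular 32-bit 33MHz PCI
wide_pci = pci_peak_gbps(64, 66)    # 64-bit 66MHz PCI

print(f"32-bit 33MHz PCI: {plain_pci:.3f} Gb/s")
print(f"64-bit 66MHz PCI: {wide_pci:.3f} Gb/s")
```

The wider, faster bus has four times the nominal capacity, which leaves ample headroom above a 1Gb/s NIC even after bus overhead.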
The NIC
Based on the experience of various people we consulted and hardware
compatibility lists, we narrowed the choice to 3Com 3c985B-SX,
Netgear GA620 [Bay Networks has apparently removed all information
about it from its web site], or one of the four SysKonnect cards. The first
two cards use the same Tigon II chipset with different amounts of
on-board SRAM; the SysKonnect cards are substantially different. FreeBSD zero
copy receive side requires special header-splitting firmware, which is
only available for Tigon cards; additionally, SysKonnect can't do
hardware checksumming and jumbo frames at the same time. On the other
hand, SysKonnect was rumored to have slightly higher packet-per-second
rates. We decided to go with the Tigon. We have chosen the more
expensive 3c985B-SX with its 1MB of SRAM rather than the cheaper GA620
with its 512KB. Even though 1MB is still way below our target window
size, we decided that every little bit would help.
FSB
With software checksumming, a fast CPU would benefit from all the memory
bandwidth we could get. Initially, we wanted to look at RDRAM. We
eliminated RAMBUS for reliability reasons. Then, we considered DDR
SDRAM; it didn't look like it's supported by solid chipsets and it
didn't look like we could get it with ECC; so we discarded DDR.
The only thing left other than regular SDR SDRAM was interleaved SDR.
We decided we want at least two-way interleaving.
The chipset
ServerWorks ServerSet III HE was considered good in all sources of
information we could find. It's the highest-end ServerWorks chipset.
Its four-way memory
interleaving with 32Gb/s theoretical memory throughput with ECC sure
looked attractive, too.
The motherboard
We have considered the following motherboards with this chipset: Tyan
S2597, SuperMicro 370DLE, and SuperMicro 370DE6. Of these boards only
the last actually appeared to
provide any memory interleaving (based on lack of its mention in
specifications of the other two boards and on private third-party
benchmarks). This board also had the advantage of having an on-board
SCSI adaptor that was supported by FreeBSD.
Memory
The 370DE6 actually only provides two-way interleaving, not taking
advantage of the full four-way interleaving that the chipset is
supposed to support. So, we needed two DIMMs. We decided we'd get
512MB of memory total, so that we have room to play with. Given
the price of the rest of the components, two 256MB registered ECC DIMMs
weren't expensive.
The CPU(s)
The fastest we could afford seemed like a good choice. We went with
1GHz Pentium III (the fastest available was 1.1GHz, but going one step
down the ladder provided such a sizeable discount that doing so
appeared to make sense). We decided to get a dual-CPU configuration,
even though we understood the benefits of the second CPU would be slim
to none in many configurations (FreeBSD-current being a notable
potential exception).
I/O
We didn't care much for I/O in these systems, but we wanted to be able
to occasionally use one of them for snooping. This meant SCSI disks.
I like to get two identical disks into everything, money permitting
(nightly dd or playing around with OSes). The smallest
available disks were 18GB.
We ordered three of these machines from Plethora Internet
at $3,000 apiece with an educational discount (we ordered and received
these machines in June 2001). Thanks for the nice machines, Peter!
The chassis is the smallest that SuperMicro recommends for this
motherboard. It's a full tower case (tall). With a few fans in the
case, these beasts still generated enough heat to raise the
temperature between the backplane and the wall to 35C in an
air-conditioned lab. We moved them so that they were better
ventilated.
Initial Setup
The machines were hooked up to the network with their
fxp0 interfaces acting as general access points. Two of
them were hooked up back-to-back with multi-mode fiber through their
ti0 interfaces, with private IP numbers.
Initially, we put FreeBSD 4.3-RELEASE on them. I ramped maxusers
up to 512 and NMBCLUSTERS to 102400 (feeling generous), enabled SMP
and APIC_IO, removed a bunch of unnecessary devices, and added the
ti driver. Then, I set
kern.ipc.maxsockbuf=20480000,
net.inet.tcp.sendspace=1024000,
net.inet.tcp.recvspace=1024000, and set the MTU size on
ti0 to 4470 bytes (the SONET MTU, which we would have in
Abilene).
Preliminary Baseline Results
The latency across the strip of fiber, as measured by
ping, was around 100us (obviously the actual propagation
delay is much lower).
TCP throughput with a single Iperf connection was
about 700-850Mb/s (yes, with tens of Mb/s of variation across
individual runs) with send and receive buffer space of 1MB. CPU
utilization during the test, as reported by top, was
a stable 50% (give or take 0.1%), which suggests that, as we expected, the
machines were CPU-bound with software checksumming, and they could not
make any use of the second CPU.
It's worth noticing that Iperf wouldn't correctly report throughput
for runs longer than something over 20s (apparently since a 32-bit
counter for bytes transferred would wrap around).
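A rough check of the wrap-around guess. The sketch below assumes a byte counter with either 31 usable bits (signed 32-bit) or 32 usable bits (unsigned); which one Iperf actually used is not established here, and the function name is hypothetical.

```python
def seconds_until_wrap(counter_bits, throughput_mbps):
    """Seconds until a byte counter with counter_bits usable bits overflows."""
    bytes_per_second = throughput_mbps * 1e6 / 8
    return 2 ** counter_bits / bytes_per_second

for rate in (700, 850):
    signed = seconds_until_wrap(31, rate)    # signed 32-bit counter (assumption)
    unsigned = seconds_until_wrap(32, rate)  # unsigned 32-bit counter (assumption)
    print(f"{rate} Mb/s: signed wraps after {signed:.1f}s, unsigned after {unsigned:.1f}s")
```

At the observed 700-850Mb/s a signed counter overflows after roughly 20-25 seconds, which fits "something over 20s"; an unsigned one would last about twice as long.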
With send and receive buffer space of 100KB, TCP throughput would
rise to a stable 970-975Mb/s. With send and receive buffer space of
64KB, the throughput would drop to 865-870Mb/s. Notice that this is
without RFC1323 extensions (no window scaling). So, increasing buffer
space beyond the advertised window size did make a difference in
performance.
More systematically, the following table shows TCP throughput
dependence on buffer space, as reported by Iperf with 20-second
measurement interval (at least three runs per reported range):
Buffer space, bytes   Throughput w/o RFC1323, Mb/s   With RFC1323, Mb/s
4096                  2.2-3.4                        2.1-3.7
8192                  170-180                        157-188
16384                 104-144                        62-91
32768                 623-648                        659-660
65536                 861-865                        795-821
131072                973-975                        827-829
262144                870-884                        692-693
524288                836-842                        674-675
786432                .2/Didn't work
1048576               578-723                        216-410/6/Didn't work
2097152               635-836                        Didn't work
4194304               70-819
8388608               584-739                        Didn't work
The machines were sensitive to any other activity running concurrently
with Iperf, even top, and even when there were CPU cycles to burn on
both CPUs. I suppose the context switching really hurts
performance.
Problems with large window sizes
With window sizes over half a megabyte, our TCP transmission
wouldn't work reliably. Sometimes, it would be just ridiculously slow
(kilobits or single-digit megabits per second). Sometimes, it would
freeze completely. In cases when it froze, one would be able to
interrupt the server (receiver) with Control-C, but not the client
(sender); the sender would miss its scheduled completion time of 10 or
20 seconds and would run for a long time, ignoring the interrupt
character, but accepting the quit character. Usually the run would
still complete in less than 10 minutes.
I've collected tcpdump traces from a stalled sender and stalled receiver (different views
of the same connection); xplot time-series graphs
produced by tcptrace from the sender side and the receiver side might prove useful.
Additionally, traces from a slow connection are available (sender side and receiver side). These files actually
contain two connections, and the first is represented by these
time-series graphs: sender side, receiver side.
It should be recognized that the process of collecting
tcpdump data by itself changes the pattern of load on the
machine. Additionally, it's not guaranteed that BPF sees the same
things TCP stack sees.
This stalling happens both with NMBCLUSTERS=102400 and with
the default value for maxusers 512. In fact, I
increased it while trying to troubleshoot the problem.
An annotated tcpdump of the stalled connection follows
(sender view). Wherever an ellipsis ("...") is used in the
annotation, there are packets that we omit.
23:54:23.817524 10.0.0.2.1072 > 10.0.0.1.5001: S 577076198:577076198(0) win 65535 (DF) (ttl 64, id 32041)
23:54:23.817674 10.0.0.1.5001 > 10.0.0.2.1072: S 168494072:168494072(0) ack 577076199 win 65535 (DF) (ttl 64, id 18233)
23:54:23.817715 10.0.0.2.1072 > 10.0.0.1.5001: . ack 1 win 32768 (DF) (ttl 64, id 32042)
23:54:23.817864 10.0.0.1.5001 > 10.0.0.2.1072: . ack 1 win 32768 (DF) (ttl 64, id 18234)
Connection initiated. Actual window advertised by both sides (taking
window scaling into account) is 32768*2^6=2097152=2^21=2MB.
23:54:23.818382 10.0.0.2.1072 > 10.0.0.1.5001: P 1:4097(4096) ack 1 win 32768 (DF) (ttl 64, id 32043)
23:54:23.818403 10.0.0.2.1072 > 10.0.0.1.5001: P 4097:8193(4096) ack 1 win 32768 (DF) (ttl 64, id 32044)
23:54:23.818475 10.0.0.2.1072 > 10.0.0.1.5001: P 8193:12289(4096) ack 1 win 32768 (DF) (ttl 64, id 32046)
23:54:23.818493 10.0.0.2.1072 > 10.0.0.1.5001: P 12289:16385(4096) ack 1 win 32768 (DF) (ttl 64, id 32047)
...We proceed to blast packets...
23:54:23.818613 10.0.0.1.5001 > 10.0.0.2.1072: . ack 8193 win 32640 (DF) (ttl 64, id 18235)
23:54:23.818645 10.0.0.2.1072 > 10.0.0.1.5001: P 32769:36865(4096) ack 1 win 32768 (DF) (ttl 64, id 32052)
23:54:23.818664 10.0.0.2.1072 > 10.0.0.1.5001: P 36865:40961(4096) ack 1 win 32768 (DF) (ttl 64, id 32053)
23:54:23.818692 10.0.0.2.1072 > 10.0.0.1.5001: P 40961:45057(4096) ack 1 win 32768 (DF) (ttl 64, id 32054)
23:54:23.818712 10.0.0.2.1072 > 10.0.0.1.5001: P 45057:49153(4096) ack 1 win 32768 (DF) (ttl 64, id 32055)
23:54:23.818726 10.0.0.1.5001 > 10.0.0.2.1072: . ack 16385 win 32512 (DF) (ttl 64, id 18236)
...We see some ACKs coming in, but at this point we really don't care
as we can send 2MB (512 packets) before we need to even bother to look
at anything the remote side says...
23:54:23.822377 10.0.0.2.1072 > 10.0.0.1.5001: P 528385:532481(4096) ack 1 win 32768 (DF) (ttl 64, id 32173)
23:54:23.822391 10.0.0.1.5001 > 10.0.0.2.1072: . ack 466945 win 25600 (DF) (ttl 64, id 18291)
23:54:23.822419 10.0.0.2.1072 > 10.0.0.1.5001: P 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32174)
Everything goes well until this point. The preceding packet is the
first packet lost in this connection. We can blast a lot more
according to the window size. Since there were no losses up to this
point (130 packets), our cwnd is the same as window size for a while
now. We go on to send more stuff...
23:54:23.996387 10.0.0.2.1072 > 10.0.0.1.5001: . 2293761:2297857(4096) ack 1 win 32768 (DF) (ttl 64, id 32604)
23:54:23.996407 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18389)
23:54:23.996444 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18390)
23:54:23.996495 10.0.0.2.1072 > 10.0.0.1.5001: . 2297857:2301953(4096) ack 1 win 32768 (DF) (ttl 64, id 32605)
23:54:23.996579 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18391)
23:54:23.996597 10.0.0.2.1072 > 10.0.0.1.5001: . 2301953:2306049(4096) ack 1 win 32768 (DF) (ttl 64, id 32606)
23:54:23.996657 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18392)
23:54:23.996698 10.0.0.2.1072 > 10.0.0.1.5001: . 2306049:2310145(4096) ack 1 win 32768 (DF) (ttl 64, id 32607)
23:54:23.996753 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18393)
23:54:23.996801 10.0.0.2.1072 > 10.0.0.1.5001: . 2310145:2314241(4096) ack 1 win 32768 (DF) (ttl 64, id 32608)
23:54:23.996853 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18394)
23:54:23.996898 10.0.0.2.1072 > 10.0.0.1.5001: . 2314241:2318337(4096) ack 1 win 32768 (DF) (ttl 64, id 32609)
23:54:23.996967 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18395)
The other end has noticed the loss and is reporting it with duplicate
ACKs. There are lots more of these duplicates down the road, but even
at this point it should be obvious that there was a loss. Fast
Retransmit should kick in and we need to resend starting from offset
532481. Instead, we proceed to send the full window before we turn
around to face the problem of loss. Why don't we resend starting
from offset 532481 at this point? This isn't good. We proceed to
send...
23:54:24.005786 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18469)
23:54:24.005875 10.0.0.2.1072 > 10.0.0.1.5001: . 2621441:2625537(4096) ack 1 win 32768 (DF) (ttl 64, id 32684)
23:54:24.005899 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18470)
23:54:24.006007 10.0.0.2.1072 > 10.0.0.1.5001: P 2625537:2629633(4096) ack 1 win 32768 (DF) (ttl 64, id 32685)
23:54:24.006037 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32686)
At this point, we have sent 2629633-532481=2097152 bytes ahead, which
is exactly our window size. Finally, we notice the duplicate ACKs are
coming in and resend the segment starting from 532481.
23:54:24.006060 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18471)
23:54:24.006178 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18472)
Up to this point, everything was happening in rapid fire mode. Now's
the first time there's any waiting. We just resent the segment starting
from 532481, and there were only two more duplicate ACKs for it since
then: not enough to warrant another Fast Retransmit. So, we wait 1
second.
23:54:25.000538 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32687)
We resend the unlucky packet again, without getting any ACK back.
(Peek ahead: What's actually happening is that all large packets are
being lost due to jumbo slots shortage in the driver while small
packets get through fine.)
23:54:27.000562 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32688)
We wait 2 seconds now, and send again. No response.
23:54:31.000646 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32689)
Wait 4 seconds, send again, no response.
23:54:39.000751 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32690)
Wait 8 seconds, send again, no response.
23:54:55.001004 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32691)
Wait 16 seconds, send again, no response.
23:55:27.001500 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32692)
Wait 32 seconds, send again, no response...
00:00:47.006327 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32697)
00:01:51.007294 10.0.0.2.1072 > 10.0.0.1.5001: . 532481:536577(4096) ack 1 win 32768 (DF) (ttl 64, id 32698)
A pause of 64 seconds between these last packets. It's 446 seconds
since we have sent the first retransmission.
00:02:55.008237 10.0.0.2.1072 > 10.0.0.1.5001: R 2629633:2629633(0) ack 1 win 32768 (DF) (ttl 64, id 32699)
Our patience is exhausted. Reset the connection.
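The retransmission timeline in this trace can be reproduced with a simple doubling-with-cap schedule. This is a sketch and not the FreeBSD timer code: the 64-second cap is read off the trace, and the count of 12 timeout retransmissions is an assumption chosen because it matches the 446 seconds quoted above (the middle retransmissions are elided in the trace).

```python
def backoff_intervals(initial_s, cap_s, retransmissions):
    """Wait before each timeout retransmission: doubling, capped at cap_s."""
    intervals = []
    wait = initial_s
    for _ in range(retransmissions):
        intervals.append(wait)
        wait = min(wait * 2, cap_s)
    return intervals

# Assumption: 12 timeout retransmissions, the first at 23:54:25.
waits = backoff_intervals(initial_s=1, cap_s=64, retransmissions=12)

# waits[0] is the 1 s wait before the first timeout retransmission;
# the remaining waits separate it from the last one at 00:01:51.
elapsed_since_first = sum(waits[1:])

print(waits)
print("seconds between first and last retransmission:", elapsed_since_first)
```

Under these assumptions the sender then waits one more capped interval of 64 seconds, which lines up with the reset at 00:02:55.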
Looking at the sender's behavior in this example we see that:
The sender never sends beyond the advertised window size (good).
When it notices the loss, it starts to resend the first lost
packet with exponentially increasing intervals between attempts
(good), but never sees any ACKs coming back.
Fast Retransmit doesn't happen right after the sender gets
duplicate ACKs; instead, the sender blasts the remainder of the window
and only then resends.
[2001-08-28 update: Matt Mathis seems to have an explanation for
the apparent delay of fast retransmit behavior: since BPF is below IP,
we are capturing one queue down the line when we capture at the
"sender". So, the remainder of the window's worth of data could have
been queued above BPF in the sending machine. This looks plausible,
but the setup is disassembled for WAN testing, so I can't really
verify that this is what's happening.]
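The numbers used in the sender-side annotations can be rechecked with a few lines. The constants are taken from the trace and its annotations; the variable names are only for illustration.

```python
SEGMENT = 4096         # payload bytes per segment in the trace
WINDOW_FIELD = 32768   # "win 32768" as printed by tcpdump
WINDOW_SHIFT = 6       # window scale shift quoted in the annotation

# Effective window once the scale factor is applied.
window = WINDOW_FIELD << WINDOW_SHIFT
segments_per_window = window // SEGMENT

# Offsets are the relative sequence numbers shown by tcpdump.
first_lost_offset = 532481
highest_sent = 2629633

segments_before_loss = (first_lost_offset - 1) // SEGMENT
in_flight_at_retransmit = highest_sent - first_lost_offset

print("window:", window, "bytes =", segments_per_window, "segments")
print("segments delivered before the first loss:", segments_before_loss)
print("bytes sent past the lost segment:", in_flight_at_retransmit)
```

The last value equals the window exactly, which is what "sent the full window before resending" looks like in the capture.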
Let's now examine the receiver's view of the same connection:
23:54:23.681963 10.0.0.2.1072 > 10.0.0.1.5001: S 577076198:577076198(0) win 65535 (DF) (ttl 64, id 32041)
23:54:23.682021 10.0.0.1.5001 > 10.0.0.2.1072: S 168494072:168494072(0) ack 577076199 win 65535 (DF) (ttl 64, id 18233)
23:54:23.682161 10.0.0.2.1072 > 10.0.0.1.5001: . ack 1 win 32768 (DF) (ttl 64, id 32042)
23:54:23.682208 10.0.0.1.5001 > 10.0.0.2.1072: . ack 1 win 32768 (DF) (ttl 64, id 18234)
23:54:23.682875 10.0.0.2.1072 > 10.0.0.1.5001: P 1:4097(4096) ack 1 win 32768 (DF) (ttl 64, id 32043)
23:54:23.682907 10.0.0.2.1072 > 10.0.0.1.5001: P 4097:8193(4096) ack 1 win 32768 (DF) (ttl 64, id 32044)
23:54:23.682949 10.0.0.1.5001 > 10.0.0.2.1072: . ack 8193 win 32640 (DF) (ttl 64, id 18235)
23:54:23.682969 10.0.0.2.1072 > 10.0.0.1.5001: P 8193:12289(4096) ack 1 win 32768 (DF) (ttl 64, id 32046)
Initial handshake, start of transmission, everything goes well for a while...
23:54:23.687188 10.0.0.2.1072 > 10.0.0.1.5001: P 524289:528385(4096) ack 1 win 32768 (DF) (ttl 64, id 32172)
23:54:23.687198 10.0.0.1.5001 > 10.0.0.2.1072: . ack 524289 win 24704 (DF) (ttl 64, id 18298)
23:54:23.687222 10.0.0.2.1072 > 10.0.0.1.5001: P 528385:532481(4096) ack 1 win 32768 (DF) (ttl 64, id 32173)
23:54:23.799127 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24576 (DF) (ttl 64, id 18299)
23:54:23.855860 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24704 (DF) (ttl 64, id 18300)
23:54:23.855957 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24832 (DF) (ttl 64, id 18302)
Apparently, BPF or tcpdump has lost some incoming packets
here, which the TCP stack has seen, and which caused the TCP stack to
generate those duplicate ACKs.
23:54:23.856019 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 24960 (DF) (ttl 64, id 18303)
23:54:23.856070 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25088 (DF) (ttl 64, id 18304)
23:54:23.856123 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25216 (DF) (ttl 64, id 18305)
23:54:23.856180 10.0.0.2.1072 > 10.0.0.1.5001: . 2105345:2109441(4096) ack 1 win 32768 (DF) (ttl 64, id 32558)
Good grief! That's what I call burst loss. Everything between 532481
and 2105345 (384 packets) was lost. (Well, almost everything:
something must have caused us to generate those duplicate ACKs.)
23:54:23.856220 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25280 (DF) (ttl 64, id 18306)
23:54:23.856264 10.0.0.2.1072 > 10.0.0.1.5001: . 2109441:2113537(4096) ack 1 win 32768 (DF) (ttl 64, id 32559)
23:54:23.856288 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 25408 (DF) (ttl 64, id 18307)
23:54:23.856345 10.0.0.2.1072 > 10.0.0.1.5001: . 2113537:2117633(4096) ack 1 win 32768 (DF) (ttl 64, id 32560)
They go on blasting, we go on sending ACK for 532481...
23:54:23.870256 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18470)
23:54:23.870374 10.0.0.2.1072 > 10.0.0.1.5001: . 2621441:2625537(4096) ack 1 win 32768 (DF) (ttl 64, id 32684)
23:54:23.870403 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18471)
23:54:23.870495 10.0.0.2.1072 > 10.0.0.1.5001: P 2625537:2629633(4096) ack 1 win 32768 (DF) (ttl 64, id 32685)
23:54:23.870526 10.0.0.1.5001 > 10.0.0.2.1072: . ack 532481 win 32768 (DF) (ttl 64, id 18472)
They have almost exhausted the window now (there's one packet's worth
left in it, and from the examination of the sender's view it's known
that this packet was in fact sent, but we don't see it).
00:02:54.876791 10.0.0.2.1072 > 10.0.0.1.5001: R 2629633:2629633(0) ack 1 win 32768 (DF) (ttl 64, id 32699)
And now, out of the blue, and after quite some time, they just send us
this RST packet. We just ignore it, without sending an RST back, even
though it's within our window (2629633-532481=2097152=2^21=window
size). Is this simply an off-by-one error in TCP code
("<" instead of "<=") somewhere?
Or maybe tcpdump just never gets to see the outgoing RST
packet (unlikely, given there was no load on the machine)?
To summarize, on the receiver side we see that:
Huge burst loss occurs in the middle of the sender's window.
Other losses occur later, but we get most packets starting from
some point (but, notably, not the last packet in the succession).
We possibly reject a valid reset (BAD, if true, i.e., if
tcpdump hasn't missed the outgoing reset).
Jonathan Lemon seems to
have confirmed the off-by-one bug, but not the lack of Fast
Retransmit.
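The comparison at issue can be illustrated with a small sketch. This is not the FreeBSD code; the function names and the inclusive_right_edge switch are hypothetical. RFC 793 writes the acceptability test for a zero-length segment as RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND, and the RST in the trace carries a sequence number equal to RCV.NXT+RCV.WND, so the outcome depends entirely on whether the right edge is treated as inclusive.

```python
MOD = 2 ** 32

def seq_lt(a, b):
    """True if a comes before b in 32-bit TCP sequence space."""
    return a != b and ((b - a) % MOD) < 2 ** 31

def seq_leq(a, b):
    return a == b or seq_lt(a, b)

def rst_acceptable(seg_seq, rcv_nxt, rcv_wnd, inclusive_right_edge):
    """Window check for an incoming RST (illustrative only)."""
    right_edge = (rcv_nxt + rcv_wnd) % MOD
    if not seq_leq(rcv_nxt, seg_seq):
        return False
    if inclusive_right_edge:
        return seq_leq(seg_seq, right_edge)
    return seq_lt(seg_seq, right_edge)

# Relative sequence numbers from the receiver-side trace.
rcv_nxt, rcv_wnd, rst_seq = 532481, 2097152, 2629633

strict = rst_acceptable(rst_seq, rcv_nxt, rcv_wnd, inclusive_right_edge=False)
inclusive = rst_acceptable(rst_seq, rcv_nxt, rcv_wnd, inclusive_right_edge=True)
print("strict '<' accepts RST:", strict)
print("inclusive '<=' accepts RST:", inclusive)
```

A sender that has filled the whole window naturally resets from its next sequence number, which is exactly this right-edge value.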
While loss is happening, communication using small packet sizes
works fine. However, while loss is happening, and for a long while
after the TCP connection completes, sometimes ping -s4096
from the receiver gets zero responses, while from the sender it works
fine; sometimes it's just the other way around.
The apparent cause of the loss is in the ti driver on
the receiver's side. The receiver's kernel keeps printing these
messages:
The relevant files appear to be if_ti.c and
if_tireg.h. The former file contains the following
comment:
/*
* Memory management for the jumbo receive ring is a pain in the
* butt. We need to allocate at least 9018 bytes of space per frame,
* _and_ it has to be contiguous (unless you use the extended
* jumbo descriptor format). Using malloc() all the time won't
* work: malloc() allocates memory in powers of two, which means we
* would end up wasting a considerable amount of space by allocating
* 9K chunks. We don't have a jumbo mbuf cluster pool. Thus, we have
* to do our own memory management.
*
* The driver needs to allocate a contiguous chunk of memory at boot
* time. We then chop this up ourselves into 9K pieces and use them
* as external mbuf storage.
*
* One issue here is how much memory to allocate. The jumbo ring has
* 256 slots in it, but at 9K per slot than can consume over 2MB of
* RAM. This is a bit much, especially considering we also need
* RAM for the standard ring and mini ring (on the Tigon 2). To
* save space, we only actually allocate enough memory for 64 slots
* by default, which works out to between 500 and 600K. This can
* be tuned by changing a #define in if_tireg.h.
*/
It appears, however, that the values have changed since then, as
if_tireg.h contains:
/*
* Memory management stuff. Note: the SSLOTS, MSLOTS and JSLOTS
* values are tuneable. They control the actual amount of buffers
* allocated for the standard, mini and jumbo receive rings.
*/
#define TI_SSLOTS 256
#define TI_MSLOTS 256
#define TI_JSLOTS 384
Now, I don't have any memory shortage here, and would even be willing
to spend a whole other TCP window (the target is to have it at roughly
8-16MB) on driver buffers (why it should be so escapes me, but it
still would be OK). I wish I knew what values make sense, though!
The value for TI_JSLOTS (384) is already greater than the claimed
number of jumbo ring slots (256).
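To put the slot counts in perspective, here is a rough estimate of the memory they imply. It assumes each slot takes exactly the 9018-byte minimum named in the if_ti.c comment; the driver's real per-slot size may be somewhat larger.

```python
SLOT_BYTES = 9018  # minimum per-frame space quoted in the if_ti.c comment

def ring_bytes(slots, slot_bytes=SLOT_BYTES):
    """Approximate memory needed for a jumbo ring with the given slot count."""
    return slots * slot_bytes

for slots in (64, 256, 384):
    size = ring_bytes(slots)
    print(f"{slots} slots: {size} bytes ({size / 1024:.0f}K)")

# Size of the burst loss seen in the receiver trace, in 4096-byte segments.
lost_segments = (2105345 - 532481) // 4096
print("segments lost in the burst:", lost_segments)
```

The burst works out to 384 segments, the same number as TI_JSLOTS. That is only a numerical match; nothing here proves that one jumbo slot is consumed per lost segment.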
ti Driver Problem Solved, Sort Of
I haven't heard any concrete suggestions from freebsd-net after sending
the question about TI_JSLOTS values, so I decided to
be on the safe side and changed TI_JSLOTS in
/sys/pci/if_tireg.h from 384 to 8192 (twice the number of
4KB packets in a 16MB window). This appears to have removed the
symptoms of stalling and burst loss.
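The choice of 8192 follows from simple arithmetic, sketched below. The helper name and the safety_factor parameter are made up for illustration, and the memory estimate again assumes 9018 bytes per slot.

```python
def slots_for_window(window_bytes, packet_bytes, safety_factor=2):
    """Slots needed to hold safety_factor windows' worth of packets."""
    packets = -(-window_bytes // packet_bytes)  # ceiling division
    return safety_factor * packets

new_jslots = slots_for_window(16 * 1024 * 1024, 4096)

# Assumption: each slot occupies at least 9018 bytes, as in the driver comment.
buffer_mib = new_jslots * 9018 / 2 ** 20

print("TI_JSLOTS:", new_jslots)
print(f"approximate jumbo buffer memory: {buffer_mib:.1f} MiB")
```

Under that assumption the jumbo buffers take on the order of 70MB, a noticeable share of the 512MB in these machines.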
Here are the throughput values (as reported by Iperf, in Mb/s, three
runs of 20 seconds, extreme values reported) that I get from it now
(with RFC1323):
To a large extent, as far as I am concerned, the problem is now
"solved." However, an unanswered question remains: Why didn't the
ti driver recover very quickly from the shortage of jumbo
slots? The interface would malfunction for different periods of
time after it would run out of jumbo slots. In some cases, it was
enough to completely stall the connection and have it timeout, in
other cases it produced a one-second-step "ladder" on the time-series
graph indicating that the driver would recover fairly quickly (but not
as quickly as one would expect or desire).
Wide Area Network Testing
On 2001-10-02 we have two GigaTCP machines installed at SoX and PNW
GigaPoPs (with jumbo frames connectivity to Abilene) for WAN testing.
The (symmetric) path between machines is:
(The four-letter codes are Internet2 backbone Abilene core routers.
All links are OC-48c POS, except for terminating Gigabit Ethernet SX
on the two sides going into the machines.)
The bottleneck link capacity is therefore 1Gb/s. All OC-48c POS
circuits on the path are generally uncongested. The round-trip time
is 58.3ms:
100 packets transmitted, 100 packets received, 0% packet loss
round-trip min/avg/max/stddev = 58.300/58.346/58.434/0.029 ms
This comes out to be a link that can hold 7,287,500 bytes. This
number is the minimal TCP window size at which we can hope to fully
use the link. TCP throughput vs. window size (as reported by Iperf,
in Mb/s, three runs of 20 seconds, extreme values reported) was as
reported in the following table. In some cases, TCP performance was
unstable and I would get a number that's an order of magnitude lower
than the bulk of results for a given window size; these data were
discarded.
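The 7,287,500-byte figure is the bandwidth-delay product of the path. A minimal sketch, using integer arithmetic and a hypothetical function name:

```python
def bdp_bytes(bottleneck_bps, rtt_us):
    """Bytes in flight needed to fill a path: bandwidth x round-trip time."""
    bits_in_flight = bottleneck_bps * rtt_us // 1_000_000
    return bits_in_flight // 8

# 1Gb/s bottleneck, 58.3ms round-trip time.
bdp = bdp_bytes(1_000_000_000, 58_300)
print("bandwidth-delay product:", bdp, "bytes")
```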
The machines are identical with identical OS and configuration.
SoX to PNW case: Note how throughput never goes
above 160Mb/s. A window size of 1.1MB would in principle (taking into
consideration delay*bandwidth only) allow one to achieve such
throughput. In reality, one only achieves 93Mb/s with this window
size.
PNW to SoX case: One could achieve 200Mb/s at
1457500-byte window. At that window size, one gets a respectable
192Mb/s.
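The window sizes quoted for both directions come from the same relation run the other way: a window of W bytes can carry at most W*8/RTT bits per second. The sketch below uses made-up helper names and compares the measured figures quoted above with that ceiling.

```python
RTT_US = 58_300  # 58.3ms round-trip time, in microseconds

def window_for_rate(rate_mbps, rtt_us=RTT_US):
    """Window in bytes needed to sustain rate_mbps (Mb/s x us = bits)."""
    return rate_mbps * rtt_us // 8

def rate_for_window(window_bytes, rtt_us=RTT_US):
    """Ceiling in Mb/s that a window of window_bytes allows."""
    return window_bytes * 8 / rtt_us

print("window for 160Mb/s:", window_for_rate(160), "bytes")
print("window for 200Mb/s:", window_for_rate(200), "bytes")

# Measured throughput as a fraction of the window-limited ceiling.
sox_to_pnw = 93 / rate_for_window(window_for_rate(160))
pnw_to_sox = 192 / rate_for_window(window_for_rate(200))
print(f"SoX to PNW: {sox_to_pnw:.0%} of the ceiling")
print(f"PNW to SoX: {pnw_to_sox:.0%} of the ceiling")
```

Seen this way, one direction runs close to what its window permits while the other reaches a little over half of it.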
In both cases, examination of top output while Iperf
with different window sizes is running seems to indicate that once
throughput is saturated, one of the CPUs of the sender (but not the
receiver) is fully used and becomes the bottleneck. Two identical
CPUs become bottlenecks of different throughputs! (Is electricity in
Seattle better than it is in Atlanta?) One (weak) explanation of such
behavior could be that some data structures in the TCP/IP stack or in
the driver are getting set up in a particular way (which can depend on
semi-random circumstances of the beginning of the test) and then the
machine becomes stuck in this state. To confirm or deny this
explanation, I rebooted both machines and ran a subset of the
tests again (identical methodology):
The difference in results between batches and the consistency
within a batch seems to support the hypothetical explanation, but
doesn't appear to be conclusive. To get more data, I performed
another pair of reboots and another small batch of testing:
In this case, there appears to be no difference from the previous
batch.
An additional piece of data is that in every batch of testing after
a fresh reboot the first data connection from SoX to PNW would become
seemingly stuck and would only finish after about a minute (with
`-t20' option for Iperf) with a throughput of hundreds of
Kb/s. This never repeats with subsequent connections until the machine
is rebooted and doesn't appear to occur in the other direction, even
though during the last batch I started with PNW to SoX connections
while in other batches I was starting with the other direction.
Afterhours testing on 2001-10-02 with UDP has shown that in SoX to
PNW direction one can get up to 528Mb/s with 3-4% packet loss and
minor reordering (which could be sustained for a few minutes and
externally verified by the Abilene weather map), while in the opposite
direction up to 650Mb/s is possible with a small fraction of a
percentage point loss, but only for short time intervals (such as
10s), and at some point the loss shoots up very sharply to
60-90% (and performance deteriorates accordingly); bandwidth similar
to the other direction could be sustained and verified externally. In
other words, no definitive path asymmetry was found to be present with
UDP testing.
Typical sender top display of the Iperf process:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
505 shalunov 59 0 1292K 912K CPU1 0 2:32 99.06% 99.02% iperf
Typical receiver top display of the Iperf process with
major loss is similar to that of the sender.
Typical receiver top display of the Iperf process without
major loss:
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU CPU COMMAND
519 shalunov 53 0 1292K 912K CPU1 0 1:15 58.68% 58.59% iperf
Please remember that these are dual-CPU machines with FreeBSD-4.3,
where the giant kernel lock is present (so the receiver uses slightly
more than one CPU when it is reporting major packet loss).
To verify whether the possible asymmetry is explained by a
fragmented data structure, I have followed advice from Bosko Milekic
and increased NCL_INIT and NMB_INIT in
uipc_mbuf.c (both to 32768, after looking at
`netstat -m' output). The results of testing after the
change:
Stalling of the first connection after reboot in the SoX to PNW
direction is still there (not reflected in the table, as it only
happens once). The increase of initial allocation appeared to make no
significant difference in the test results.
The next step seems to be to enable hardware checksums, even though
trouble was reported with them:
--- if_ti.c.orig Wed Aug 23 20:07:58 2000
+++ if_ti.c Fri Oct 5 15:42:13 2001
@@ -125,9 +124,0 @@
-/*
- * Temporarily disable the checksum offload support for now.
- * Tests with ftp.freesoftware.com show that after about 12 hours,
- * the firmware will begin calculating completely bogus TX checksums
- * and refuse to stop until the interface is reset. Unfortunately,
- * there isn't enough time to fully debug this before the 4.1
- * release, so this will need to stay off for now.
- */
-#ifdef notdef
@@ -135,3 +125,0 @@
-#else
-#define TI_CSUM_FEATURES 0
-#endif
Surprisingly, it seems that hardware checksums actually have hurt
performance quite significantly, reducing it to 35-40Mb/s for all
considered window sizes in SoX to PNW direction. In this direction,
packets sent were also 954 payload bytes in size. Later, the payload
size changed to 1440 bytes without any other visible changes, and then
reverted back to 954. Throughput measured at a different time was
about 53-172Mb/s.
Nothing seems to be changed in the opposite direction. That is,
for PNW->SoX direction, throughput, packet size, CPU utilization all
stay the same after turning on hardware checksums.
The change in packet size does not appear to be explained by
anything in the TCP negotiation:
Our third machine that was sent to SC2001 was smashed in transit and
would not boot on arrival. That was bad enough, but the folks at
RackSaver were incredibly kind and provided a nice machine for us:
dual Pentium4, 512MB DRAM,
and a 64-bit 66MHz PCI bus into which we plugged in our Gigabit
Ethernet card. The machine was then installed in SCinet NOC and
directly connected to a Gigabit Ethernet interface on Juniper M160,
which had an OC-192c POS connection to a Cisco GSR, which had 2*OC-48c
POS to DENV (Denver core node of Abilene).
We have mostly used the machine at SCinet and the machine at PNW
for subsequent testing. The RTT between these machines, according to
ping, was 28.6ms:
round-trip min/avg/max/stddev = 28.617/28.636/28.726/0.018 ms
The Path MTU was 4470 bytes.
We continued to observe asymmetry in the TCP payload size: in one
direction, it would work as expected (sending 4096-byte TCP payloads),
but in the other direction, it would have a handshake where both sides
would advertise 4430-byte MSS and then the sender would proceed
sending 1448-byte payload TCP packets without ever getting any ICMP
Fragmentation Needed messages.
An investigation has revealed that in tcp_input.c
there are two different functions, tcp_mssopt() that
determines what MSS value would be advertised during handshake and
tcp_mss() that determines the actual MSS value that will
be used. The latter function has a peculiar bug: when one tries to
connect to a previously unknown destination via TCP, it grabs the MTU
attribute of a freshly cloned route, and this attribute has nothing to
do with the appropriate interface MTU, but rather is based on a
driver-dependent constant; needless to say, for all Ethernet drivers
this constant is 1500 bytes. Since this value then gets stuck in the
cloned rtentry for a long time (about an hour), all TCP connections
sending data from the machine to the particular remote destination
will not use jumbo frames. The opposite direction works fine. The
MTU attribute can be printed by issuing the command route get
destination.
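The observed segment sizes are consistent with plain header arithmetic. The sketch below assumes IPv4 without IP options and the RFC1323 timestamp option on every data segment; the function names are illustrative.

```python
IP_HEADER = 20         # IPv4 header without options (assumption)
TCP_HEADER = 20        # TCP header without options
TIMESTAMP_OPTION = 12  # RFC1323 timestamps, padded to 12 bytes (assumption)

def mss_for_mtu(mtu):
    """MSS a host would derive from an MTU."""
    return mtu - IP_HEADER - TCP_HEADER

def payload_with_timestamps(mtu):
    """Largest data payload per segment once the timestamp option is carried."""
    return mss_for_mtu(mtu) - TIMESTAMP_OPTION

print("MSS for the 4470-byte path MTU:", mss_for_mtu(4470))
print("payload with a stale 1500-byte route MTU:", payload_with_timestamps(1500))
print("payload with the 4470-byte MTU:", payload_with_timestamps(4470))
```

So a 4430-byte advertised MSS matches the interface MTU, 1448-byte payloads match a route stuck at 1500 bytes, and in the working direction a 4096-byte payload fits within a single segment.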
This is a bug in FreeBSD. A very similar pattern of behavior was
observed on Linux with Web100 patches, but there I could not easily
track down where the PMTU information is cached.
Throughput at SC2001 with different window sizes (during prime
time) was (methodology as in all other tests, results in Mb/s):