$Id: tcp-perf.html,v 1.5 2005/09/06 19:18:29 shalunov Exp $
Summary: End-to-end TCP performance, its tuning and troubleshooting are discussed. Common problems and ways to identify them are presented. An algorithm to follow when starting to troubleshoot end-to-end performance is suggested.
Note that most commodity Internet users (i.e., virtually all home users and corporate users) are excluded from the applicability scope by the preceding paragraph. However, most Internet2 users (mostly people at U.S. research universities) are covered -- at least when workstations are involved (rather than computers at dormitories, which many network administrators put behind a shared half-duplex 10Mb/s Ethernet or, worse, behind a NAT or a QoS appliance).
Sometimes, we delve into details that might be helpful to an interested reader but can confuse someone who is only looking for guidance to solve a specific TCP performance problem. Those parts of the document that are fairly technical, or of interest only to a subset of the intended readers, and that are not always necessary for understanding the procedure to be applied are presented in a smaller type. The author apologizes if the document is presented to you in a form that does not distinguish between the two sizes (a mitigating circumstance is that if you are reading this in Lynx or w3m, you are likely to be a geek who would want to know the gory details anyway).
If you're a student at a university in a dormitory, you should check what kind of outside Internet2 connectivity your university provides to the dormitory. Internet2 doesn't charge per bit, so there's no good excuse for a university not to provide a 100Mb/s full-duplex connection except for the cost of switches, which is a one-time expense of only tens of dollars per capita. You might politely ask your overworked and underpaid campus network administrator (who stays because of the nice work environment) about the Internet2 connection that the dormitory has. If it's less than a switched full-duplex unlimited 100Mb/s connection, you might petition the university to get a better connection. Note that your institution pays per bit for commodity Internet traffic (essentially, those packets that don't go to other universities); don't expect them to work too hard to make it possible for you to run up their bills if you're paying a flat monthly networking fee -- on the other hand, if you are willing to let them pass the costs of commodity Internet transit to you, let them know: the administration probably thinks that there will be riots if they depart from a flat-fee policy. (If you think that you're entitled to eat as much commodity Internet capacity as you want because you already pay your tuition, you might also think that you are entitled to use arbitrary amounts of printer paper and toner.)
If you are stuck with 10Mb/s half-duplex shared Ethernet for your connection, you might get 2-3Mb/s from a single TCP connection and almost no increase from any connection banding.
If you are not a part of the intended audience of this document, you should stop reading now unless you want to turn green with envy.
There are three most common causes for TCP performance problems, in the order of frequency:
In two years with Internet2 [as of summer 2002] the author has not encountered a single TCP performance problem that could be traced to the core network. He routinely obtains hundreds of megabits per second across North America with a single TCP connection between specifically well-tuned and well-connected hosts. Internet2's measurement devices do not register any packet loss in the core and jitter only in microseconds.
If you have a TCP performance problem, a problem in the core probably has about the same probability as little green men twisting the wires in your patch cable -- you are likely better off spending your time investigating more likely causes, starting with the three listed above.
To troubleshoot the problem, follow these three steps (briefly; background for each step is given below, followed by a detailed troubleshooting algorithm):
In traditional TCP, the window size cannot be larger than 65,535 bytes (because the unsigned integer field that holds it is only 16 bits wide). You will probably need a bigger window than 64KB. This is achieved by turning on TCP extensions specified in RFC1323, `TCP Extensions for High Performance', published in 1992 and now supported by most operating systems. The TCP window scaling option allows one to use TCP window sizes of up to 1,073,741,823 bytes, which should be good until speeds that approach roughly 85Gb/s (window/RTT with RTT of 100ms). Turn on this option. (It might already be on. No operating-system-specific instructions are provided in this document, but there are some links at the end that might help. If you have or know of more publicly available documents describing operating-system-specific steps to set TCP options and window sizes, propose them to be included.) Once you figure out how to do this, you should check if your TCP stack supports selective acknowledgment (SACK) or explicit congestion notification (ECN) options; if it does, turn these on, too (SACK and ECN are not, strictly speaking, necessary to get good performance, but they might help in certain environments).
The sure way to learn the TCP window sizes in question is to perform a packet capture of your connection (e.g., with tcpdump, snoop, Etherpeek, or equivalent tool for your OS) and examine it.
You will need a few packets from the beginning of the connection, including the SYN packets. (If you don't know what a SYN packet is or cannot do packet capture yourself, find someone local who has some networking expertise -- such as your network support technician. Otherwise, prepare to spend some time figuring this stuff out, but you must be the type that likes to fix car engines by yourself.) For each direction, the window scaling option will be in the SYN packet (if it is not, it is implicitly 0). For each direction, take the actual window size number from the first data or pure ACK packet and multiply it by 2^{window scaling}; this is the actual window size at the start of the connection. (For example, if window scale in the SYN packet is 3 and the window size in the first data or pure ACK packet is specified as 32768, the actual window size for that direction is 32768*2^3 = 262144; this figure is in bytes.) Make note of the window size numbers in both directions. To make matters more confusing, some packet sniffers will already have done the multiplication for you; this has happened if and only if the window size advertised in the first data or pure ACK packet is already presented as being larger than 65535; if this is the case, you do not need to multiply the number by the exponentiated window scale (it is already usable as is). Having established the window sizes, make note of the minimum of the two numbers for each direction and use it in subsequent paragraphs.
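The multiplication described above can be sketched in a few lines of Python. This is only an illustration: the function name `effective_window` and its arguments are made up for this example and do not come from any packet capture tool. It follows the rule given in the text, including the case where the sniffer has already applied the scale.

```python
def effective_window(advertised, scale=0):
    """Return the actual window size in bytes for one direction.

    advertised: window size as shown in the first data or pure ACK packet.
    scale: window scale option from that direction's SYN packet
           (0 if the option is absent).
    """
    if not 0 <= scale <= 14:
        # RFC 1323 limits the shift count to 14.
        raise ValueError("window scale must be between 0 and 14")
    if advertised > 65535:
        # The raw 16-bit field cannot exceed 65535, so the sniffer
        # has already multiplied by 2**scale; use the number as is.
        return advertised
    return advertised * 2 ** scale


# The example from the text: scale 3, advertised window 32768.
print(effective_window(32768, 3))    # 262144
# Same connection, shown by a sniffer that pre-multiplies.
print(effective_window(262144, 3))   # 262144
```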
Measure the round-trip time. The simplest way is to use the ping utility. Make it send 60 packets slowly (e.g., 1 packet/second) while the network does not carry your traffic and take the minimum and the average of the 60 numbers (these are normally printed at the bottom along with some other numbers). On a properly functioning high-speed network, both numbers should be within a factor of two of your propagation delay (to be estimated in seconds as the distance between hosts in kilometers divided by 200,000km/s, which is approximately equal to the speed of light in fiber; the delay should be fudged 50% up because the network path is not a part of a great circle of the planet). Take the minimum as your RTT, to be used subsequently.
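As a rough sketch of the sanity check just described, the following Python computes the estimated delay from the distance and tests whether a measured value is within a factor of two of it. The function names and the 4,000km example distance are illustrative assumptions; the 200,000km/s figure and the 50% fudge factor are taken from the text.

```python
FIBER_KM_PER_S = 200_000  # approximate speed of light in fiber
FUDGE = 1.5               # path is not a great circle: add 50%


def estimated_delay_s(distance_km):
    """Propagation delay estimate in seconds, as described in the text."""
    return distance_km / FIBER_KM_PER_S * FUDGE


def looks_plausible(measured_s, distance_km):
    """True if the measured value is within a factor of two of the estimate."""
    estimate = estimated_delay_s(distance_km)
    return estimate / 2 <= measured_s <= estimate * 2


# Hypothetical example: hosts about 4,000km apart, ping minimum of 45ms.
print(round(estimated_delay_s(4000) * 1000, 1), "ms estimated")
print(looks_plausible(0.045, 4000))
```

If the check fails by a wide margin, the path is probably much longer than the geographic distance suggests, or something other than propagation dominates the delay.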
The theoretical maximum TCP throughput that you can obtain with your tuning parameters is window/RTT. Rule of thumb: For a realistic number, divide the theoretical hard limit by two.
With 70ms RTT (typical cross-country delay), you might see the following throughputs for different window sizes (assumed equal at both sides):
| Window Size | Theoretical max throughput | Realistic throughput |
|---|---|---|
| 8KB | 0.9Mb/s | 0.8Mb/s |
| 16KB | 1.9Mb/s | 1.8Mb/s |
| 32KB | 3.7Mb/s | 2-3.5Mb/s |
| 64KB | 7.5Mb/s | 3-7Mb/s |
| 128KB | 15.0Mb/s | 6-14Mb/s |
| 256KB | 30.0Mb/s | 10-25Mb/s |
| 512KB | 59.9Mb/s | 20-40Mb/s |
| 1MB | 119.8Mb/s | 30-60Mb/s |
| 2MB | 239.7Mb/s | 60-100Mb/s |
(I.e., for what are currently considered large values of window size, perhaps a factor of two of safety margin is required; for small window sizes, the safety margin is largely determined simply by the remainder of the division of allocated buffer space by TCP Maximum Segment Size (MSS); MSS is simply the maximum number of payload bytes that can be sent inside a single TCP packet on a given link.)
The theoretical maximum is a hard limit of sustained performance with arbitrary TCP options, only a small number of non-consecutive drops per RTT, and a perfect network. The realistic values are what the author would personally expect in a typical case, based on experience with real-world hosts, TCP implementations, and networks (in particular, they assume no SACK, no ECN, standard aggressiveness, 1500-byte MTU, and no RED at the bottleneck).
For example, if a 64KB window size is used on both sides and one sees 3Mb/s actual throughput with 70ms RTT, this is nothing out of the ordinary: it's only slightly less than half the 7.5Mb/s hard limit one would see in a perfect world.
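The window/RTT limit is easy to compute for your own numbers. The sketch below reproduces the "theoretical max" column of the table for a 70ms RTT and applies the divide-by-two rule of thumb; the function names are illustrative, and KB is taken to mean 1024 bytes, which is what the table's figures imply.

```python
KB = 1024  # bytes; the table's figures imply binary kilobytes


def max_throughput_mbps(window_bytes, rtt_seconds):
    """Hard limit window/RTT, converted to megabits per second."""
    return window_bytes * 8 / rtt_seconds / 1e6


def rule_of_thumb_mbps(window_bytes, rtt_seconds):
    """The text's rule of thumb: half of the hard limit."""
    return max_throughput_mbps(window_bytes, rtt_seconds) / 2


RTT = 0.070  # seconds, typical cross-country delay
for size_kb in (8, 16, 32, 64, 128, 256, 512, 1024, 2048):
    limit = max_throughput_mbps(size_kb * KB, RTT)
    print(f"{size_kb:5d}KB  {limit:6.1f}Mb/s max  "
          f"{limit / 2:6.1f}Mb/s rule of thumb")
```

The rule-of-thumb column is only the crude halving; the table's "realistic" ranges reflect the author's experience and are not derived from a formula.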
To make the long story short: On a high-speed wide-area network, hosts should have RFC1323 extensions turned on and default TCP send and receive buffers that are equal to each other and are at least 1 megabyte.
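For paths faster or longer than the ones the 1-megabyte figure is meant for, Step 6 of the troubleshooting algorithm in this document gives the rule 2*link_capacity*RTT. A minimal sketch of that rule follows; the function name is made up, and it assumes that link capacity is given in bits per second and that the result is wanted in bytes.

```python
MEGABYTE = 2 ** 20  # bytes


def suggested_buffer_bytes(link_capacity_bps, rtt_seconds):
    """Buffer size suggestion following Step 6 of the algorithm.

    Assumption of this sketch: capacity is in bits per second,
    so it is divided by 8 to obtain bytes.
    """
    if link_capacity_bps <= 100e6 and rtt_seconds <= 0.100:
        # Fast Ethernet and at most 100ms RTT: the 1MB default.
        return MEGABYTE
    return round(2 * (link_capacity_bps / 8) * rtt_seconds)


print(suggested_buffer_bytes(100e6, 0.070))  # Fast Ethernet, 70ms
print(suggested_buffer_bytes(1e9, 0.100))    # Gigabit Ethernet, 100ms
```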
NB (added in 2005): This description was a work-in-progress and is preserved on this web page for historical record. If you want to learn about duplex mismatch (rather than how I understood it in 2002), you should read the following paper:
Fast Ethernet (the 100Mb/s version of Ethernet) can operate in two modes: half-duplex and full-duplex. In full-duplex mode, both sides can send simultaneously. In half-duplex mode, only one network interface card at a time can send.
Some older (or cheaper) network interface cards (NICs) support only half-duplex mode; most modern cards from a brand-name manufacturer support both half-duplex and full-duplex modes. Each NIC can either have duplex mode hard-coded by the host computer's operating system, or have no preset mode. If a NIC is configured to use a certain mode, it will use it without regard for the other side of the connection. If a NIC does not have a hard-set mode, it will attempt to engage in auto-duplex negotiations with the other side. The process attempts to establish a common duplex mode of communication, with preference for full duplex if both sides support it; on a connection that is not fully switched, full-duplex communication makes no sense, so the auto-negotiation process attempts to discover if more than one other NIC is present and uses half-duplex mode if it senses more than one other NIC. This process works fairly well for modern cards, but for old cards it used to have a success rate of 52% (anecdotally); in other words, the results were essentially random.
If the two sides of a switched Fast Ethernet connection disagree on the duplex mode they use (i.e., one is using full-duplex mode and the other is using half-duplex mode), a duplex mismatch is said to occur.
A Fast Ethernet duplex mismatch can happen in one of the following ways (not meant to be an exhaustive list of causes):
When duplex mismatch happens, a peculiar breakdown in communication occurs; an attempt to analyze it will be made here (the analysis is tentative). Denote the NIC that thinks that the connection is in full-duplex mode F and the NIC that thinks that the connection is in half-duplex mode H. F sends data whenever it has any and can receive data at any time; H does not receive bits while sending. Thus, frames going from F to H will sometimes be lost if there are frames going in the opposite direction. Frames going from H to F should not suffer. Assume that TCP data flow is essentially unidirectional. Denote the NIC on whose side the sender is located S, and the NIC on whose side the receiver is located R. A stream of MTU-sized frames will go from S to R and a stream (with typically half as many frames per second) of small frames containing packets that are TCP ACKs will go from R to S. ACKs are cumulative: if an ACK is lost and a subsequent ACK is delivered, the same information is conveyed (and the effect of the missed increment of the TCP congestion window is mild). Consider two cases:
A duplex mismatch can be detected by two means:
netstat -s') counters increase while data are sent (note that no particular value of these counters is `bad'; only increase is meaningful).
Do not take preventive action with respect to duplex mismatch: If you don't have it, leave the configuration of duplex as it was before (you might experiment with it to see if duplex mismatch is indeed the problem; if it isn't, return everything to the state it was in before you started).
To correct a duplex mismatch problem, you should set both NICs to auto-negotiate, if both cards are relatively new brand-name cards. If the cards are very old, or cheap no-name cards, or if you ascertain that auto-negotiation fails, hard-code both to the same duplex mode; if both cards support full duplex, use it; if you cannot use full duplex and hard-code both to half duplex, do not expect more than perhaps 20Mb/s with TCP.
The cheapest, quickest, and simplest way to find out if a patch cable is the culprit is to replace the patch cable with a new one. If the problem consistently goes away, this was it. Curing the problem for a single boot cycle does not distinguish a bad cable from one of the duplex mismatch scenarios, in which the problem occurs in about 50% of boot cycles. Reboot the host (or reset the interface, re-initializing the driver, on the host or on the switch) several times before concluding that a bad cable was the problem. (If you establish that it was, do not forget to cut the bad cable in half and put it in the garbage, or someone else might use it.)
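To see why several reboots are needed, consider the chance that an intermittent duplex mismatch happens to stay hidden through every trial. The sketch below assumes that each boot cycle is independent and shows the problem with probability 50% (the figure quoted above); the function name is illustrative.

```python
def false_all_clear_probability(trials, p_bad=0.5):
    """Chance that `trials` independent boot cycles all look fine
    even though the problem shows up with probability `p_bad` per cycle."""
    return (1 - p_bad) ** trials


for n in (1, 3, 5):
    print(n, "trials:", f"{false_all_clear_probability(n):.1%}")
```

With one trial the chance of being fooled is 50%; with five it drops to about 3%, which matches the figure quoted in Step 8 of the algorithm.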
Bad cables normally introduce erroneous phenomena that are associated with fairly constant bit-error rate situations. The problem might manifest itself in one direction only, but more commonly both directions suffer from bit corruption about equally. A different kind of twisted pair cable damage can result in crosstalk; this case is hard to differentiate from a consistent duplex mismatch due to, e.g., different modes hard-coded at the host and the switch. For this reason, and because even excellent patch cables are cheap, replacement is recommended as a diagnostic procedure.
The algorithm is to be interpreted with some common sense (for example, it is hard to automate it). Comments are in {braces}. The default action after performing a step is to follow to the next one (the default action doesn't apply if an explicit goto action is given).
$Revision: 1.5 $ (mention the revision in any correspondence).
Step 0 [Consistency within boot cycle]: Set duplex_changed to false. Set cables_changed to false. Run the application five times in a row without rebooting any devices or resetting any interfaces. Did you always get bad results? Yes: goto step 1; No: goto step FAIL. {Each of the three conditions that we test for will give consistently bad results during one reboot cycle; if results are inconsistent, something more complicated is probably happening.}
Step 1 [Packet capture]: Run the application and perform a packet capture on one of the communicating hosts or on a third snooping machine. Were you successful in displaying an understandable packet dump? Yes: goto step 3; No: goto step 2.
Step 2 [Assistance with packet capture]: Find someone older than seven years who is not already troubleshooting the problem and get them to help. Goto step 1. {Packet capture is vital and must be available. The algorithm will always leave the loop between steps 1 and 2 because there is only a finite number of people who cannot perform a packet capture and a non-zero number of those who can; assuming that there are 5 billion people who cannot perform a packet capture and 50,000 of those who can, at most 5 billion iterations will be required in the worst case and 100,000 iterations in the average case even without any optimization; optimization is left as a recommended exercise for the reader. Hint: Instead of adding just anyone, add someone with more networking experience than anyone in the set of people currently troubleshooting the problem. A different approach: Choose people randomly, but have everyone in the set of troubleshooters read the manual page for tcpdump while the loop is executing.}
Step 3 [TCP?]: Based on packet capture examination, is the application using TCP? Yes: goto step 4; No: goto step 8. {The documentation might not reliably tell you whether TCP is the relevant protocol, but packet capture will tell reliably.}
Step 4 [Window size too small?]: Based on packet capture, is the window size less than 1 megabyte on either side (taking window scaling into account)? Yes: goto step 5; No: goto step 8. {If window size is OK, do not fiddle with it.}
Step 5 [Enable Window Scaling]: Enable RFC1323 TCP extensions on each side. {Setting buffer space larger than 64KB will not have much effect without RFC1323 options on both sides.}
Step 6 [Set TCP window size]: Set both default TCP send and receive buffer space to 1 megabyte on each side and make the changes go into effect (on some systems it will require rebooting). {The 1MB figure assumes Fast Ethernet connectivity and at most 100ms RTT, not Gigabit Ethernet, which would require at least 8MB to have full benefits over a WAN. If bottleneck link capacity is larger than 100Mb/s or the RTT is larger than 100ms, use 2*link_capacity*RTT as the window size.}
Step 7 [OPTIONAL: Advanced TCP options]: On each side: Enable SACK and ECN options if supported. {SACK helps to recover gracefully and without the aid of a timeout from multiple packet losses per round-trip time. ECN, if supported by routers, lets one sense congestion without any losses. These options -- especially SACK -- might allow one to sustain good throughput even in the face of network conditions that might be very difficult to detect, analyze, or fix.}
Step 8 [Problem consistently fixed?] : Reboot the host or reset the host interface five times on each side. Between reboots or resets, run the application each time. What results did you get? Consistently satisfactory: goto SUCCESS; consistently unsatisfactory: goto step 14; inconsistent: goto step 9. {Many tries are important because the problem might have been temporarily `fixed' by a reboot or an interface reset and reappear after more reboots. With five attempts on each side, the probability of incorrectly identifying certain kinds of problems and falsely declaring them solved is about 3%.}
Step 9 [Fix duplex mismatch]: If duplex_changed, goto FAIL. Set duplex_changed to true. Are both interface cards and both switches next to them relatively new, not the cheapest kind from a toy store, and made by a brand-name manufacturer other than SGI? Yes: goto step 13; No: goto step 10. {Duplex will not be fixed more than once. Auto-negotiate, if the cards can do it reliably. The author fairly recently had a poor experience with the auto-negotiation of then-new SGI cards.}
Step 10 [Support full duplex?]: Do both switches and both network interface cards support full-duplex mode? Yes: goto step 11; No: goto step 12. {Full duplex is preferred.}
Step 11 [Full duplex]: Set all four interfaces (both hosts and both ports on the switches) to full-duplex mode, 100Mb/s. Goto step 8.
Step 12 [Half duplex]: Set all four interfaces (both hosts and both ports on the switches) to half-duplex mode, 100Mb/s. Goto step 8. {But consistent half duplex is much better than a duplex mismatch.}
Step 13 [Auto-negotiation]: Set all four interfaces (both hosts and both ports on the switches) to auto-negotiate duplex and speed. Goto step 8. {And auto-negotiation is best of all -- if it has chances to work.}
Step 14 [Replace cables]: If cables_changed, then goto step 9. Set cables_changed to true. Replace patch cables on both sides with new ones, known to be good. Goto step 8. {Cables really get damaged in subtle ways that affect TCP performance but not basic connectivity. Replacing cables will not happen more than once.}
Step SUCCESS: Terminate and send the author a postcard via snail mail; indicate algorithm revision number and the steps that were executed. {Plain-text email is also OK.}
Step FAIL: This algorithm could not help you. If duplex_changed, then restore duplex settings to what they were before you started. Terminate with an error and seek expert assistance. (Start with someone local; local people should know proper escalation procedures.) The failure might be reported to the author via email especially if the problem is later traced to small window size, duplex mismatch, or bad patch cable; indicate algorithm revision number, the steps that were executed, and symptoms description (please do not expect that you would get help debugging the problem, but the author is interested in hearing about failures and might attempt to investigate). {There are also many other things to try based on information in this document.}
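The control flow of the steps above can be sketched as a small state machine. This is only an illustration of the goto structure, not an automation of the procedure (which, as noted, needs common sense): the function `troubleshoot` and its two callbacks are hypothetical. `ask(step)` stands for a human answering the question posed at that step, and `act(step)` stands for a human carrying out the action. Restoring duplex settings on FAIL is left to the caller.

```python
def troubleshoot(ask, act):
    """Follow the gotos of the algorithm; return (outcome, steps visited)."""
    duplex_changed = False
    cables_changed = False
    trace = []
    step = 0
    while True:
        trace.append(step)
        if step == 0:      # consistently bad within one boot cycle?
            step = 1 if ask(0) == "yes" else "FAIL"
        elif step == 1:    # understandable packet capture obtained?
            step = 3 if ask(1) == "yes" else 2
        elif step == 2:    # get help, then retry the capture
            act(2)
            step = 1
        elif step == 3:    # is the application using TCP?
            step = 4 if ask(3) == "yes" else 8
        elif step == 4:    # window smaller than 1 megabyte?
            step = 5 if ask(4) == "yes" else 8
        elif step in (5, 6, 7):   # RFC1323, buffers, optional SACK/ECN
            act(step)
            step += 1
        elif step == 8:    # results after repeated reboots or resets
            step = {"satisfactory": "SUCCESS",
                    "unsatisfactory": 14,
                    "inconsistent": 9}[ask(8)]
        elif step == 9:    # duplex is fixed at most once
            if duplex_changed:
                step = "FAIL"
            else:
                duplex_changed = True
                step = 13 if ask(9) == "yes" else 10
        elif step == 10:   # full duplex supported everywhere?
            step = 11 if ask(10) == "yes" else 12
        elif step in (11, 12, 13):  # set duplex mode, then re-test
            act(step)
            step = 8
        elif step == 14:   # cables are replaced at most once
            if cables_changed:
                step = 9
            else:
                cables_changed = True
                act(14)
                step = 8
        else:              # "SUCCESS" or "FAIL"
            return step, trace


# Hypothetical run: small window, and tuning fixes the problem.
answers = {0: "yes", 1: "yes", 3: "yes", 4: "yes", 8: "satisfactory"}
print(troubleshoot(answers.get, lambda step: None))
```

Because each of the two flags can be set only once, the loop through steps 8, 9, and 14 always terminates, even when the results never improve.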