Aug 21, 2009

TCP window scale option

After reading articles about TCP window size for so long, I finally understand how it is possible to exceed the maximum TCP window size (2^16 values, i.e. 0–65,535 bytes). I had long been puzzled by the TCP header layout: the Window Size field is only 16 bits wide, so how can a window larger than 65,535 bytes be expressed?

Strangely, although the underlying idea is quite simple, I could never find a simple article that explains why it works. Drawing on several related articles, I will try to explain it here in plain language.

Let's first take a look at the TCP header:

The basic TCP header is 20 bytes long and includes a 16-bit Window Size field, so the original TCP window size cannot exceed 65,535 bytes. IETF RFC 1323 therefore defined the TCP window scale option: a one-byte shift count (at most 14) carried in the TCP options field, by which the 16-bit window value is left-shifted. The maximum window size can thus reach roughly 2^(16+14) = 2^30 bytes, i.e. about 1 GB (1,073,741,824 bytes).


TCP window scale option

From Wikipedia, the free encyclopedia

The TCP window scale option is an option to increase the TCP receive window size above its maximum value of 65,535 bytes. This TCP option, along with several others, is defined in IETF RFC 1323 which deals with Long-Fat Networks, or LFN.

In fact, the throughput of a communication is limited by two windows: congestion window and receive window. The first one tries not to exceed the capacity of the network (congestion control) and the second one tries not to exceed the capacity of the receiver to process data (flow control). The receiver may be overwhelmed by data if for example it is very busy (such as a Web server). Each TCP segment contains the current value of the receive window. If for example a sender receives an ack which acknowledges byte 4000 and specifies a receive window of 10000 (bytes), the sender will not send packets after byte 14000, even if the congestion window allows it.
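The flow-control rule in the paragraph above can be sketched in a few lines. This is a toy illustration, not code from any real TCP stack, and the function name is mine:

```python
def highest_sendable_byte(acked_byte, receive_window, congestion_window):
    """The sender may transmit up to the last acknowledged byte plus the
    smaller of the advertised receive window and its congestion window."""
    return acked_byte + min(receive_window, congestion_window)

# The example from the text: an ACK covers byte 4000 and advertises a
# 10000-byte receive window, so the sender stops at byte 14000 even if
# the congestion window would allow more.
print(highest_sendable_byte(4000, 10000, 50000))  # 14000
```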


The TCP window scale option is needed for efficient transfer of data when the bandwidth-delay product is greater than 64K. For instance, if a T1 transmission line of 1.5 Mbit/s is used over a satellite link with a 513 millisecond round trip time (RTT), the bandwidth-delay product is 1,500,000 * 0.513 = 769,500 bits, or 96,188 bytes. With a maximum window of 64K, the pipe can only be filled to 68% of the theoretical maximum of 1.5 Mbit/s, i.e. about 1.02 Mbit/s.
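The satellite-link arithmetic can be reproduced with a short sketch (the helper name is mine, not a standard API):

```python
def bdp_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth-delay product: bits in flight on the link, in bytes."""
    return bandwidth_bps * rtt_seconds / 8

bdp = bdp_bytes(1_500_000, 0.513)   # 96187.5 bytes, ~96 KB in flight
utilization = 65536 / bdp           # a 64K window fills only ~68% of the pipe
print(f"BDP = {bdp:.0f} bytes, link utilization = {utilization:.0%}")
```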

By using the window scale option, files can be transferred at nearly 1.5Mbit/second utilizing nearly all of the available bandwidth.

This option is also useful when sending large files greater than 64KB over slow networks.

By using the window scale option, the receive window size may be increased up to a maximum value of 1 gigabyte (1,073,741,824 bytes). This is done by specifying a one byte shift count in the header options field. The true receive window size is left shifted by the value in shift count. A maximum value of 14 may be used for the shift count value.
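The left-shift mechanics described above can be sketched as follows (a toy model assuming a stack that honors RFC 1323's cap of 14 on the shift count):

```python
def effective_window(window_field, shift_count):
    """Scale the 16-bit Window Size field by the negotiated shift count.
    RFC 1323 caps the usable shift at 14."""
    return window_field << min(shift_count, 14)

# Maximum 16-bit field value shifted by the maximum allowed count:
print(effective_window(65535, 14))  # 1073725440, just under 1 GiB
```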

The following article explains TCP window size and the BDP (bandwidth-delay product) in more depth, and is comparatively easy to follow and understand:

How to Calculate TCP throughput for long distance WAN links

So you just lit up your new high-speed link between Data Centers, but are unpleasantly surprised to see relatively slow file transfers across this high-speed, long-distance link. Bummer! Before you call Cisco TAC and start troubleshooting your network, do a quick calculation of what you should realistically expect in terms of TCP throughput from one host to another over this long-distance link.

When using TCP to transfer data the two most important factors are the TCP window size and the round trip latency. If you know the TCP window size and the round trip latency you can calculate the maximum possible throughput of a data transfer between two hosts, regardless of how much bandwidth you have.

Here is how you make the calculation:

TCP-Window-Size-in-bits / Latency-in-seconds = Bits-per-second-throughput

So let's work through a simple example. I have a 1 Gigabit Ethernet link from Chicago to New York with a round trip latency of 30 milliseconds. If I try to transfer a large file from a server in Chicago to a server in New York using FTP, what is the best throughput I can expect?

First, let's convert the TCP window size from bytes to bits. In this case we are using the standard 64 KB TCP window size of a Windows machine.

64KB = 65536 Bytes. 65536 * 8 = 524288 bits

Next, let's take the TCP window in bits and divide it by the round trip latency of our link in seconds. So if our latency is 30 milliseconds, we will use 0.030 in our calculation.

524288 bits / 0.030 seconds = 17,476,266 bits per second, or roughly 17.5 Mbps maximum possible throughput

So, although I may have a 1GE link between these Data Centers I should not expect any more than 17Mbps when transferring a file between two servers, given the TCP window size and latency.
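The Chicago-to-New York calculation above can be checked with a two-line sketch (the function name is my own):

```python
def max_throughput_bps(window_bytes, rtt_seconds):
    """Window-limited TCP throughput: at most one full window per round trip."""
    return window_bytes * 8 / rtt_seconds

tput = max_throughput_bps(65536, 0.030)
print(f"{tput / 1e6:.1f} Mbit/s")  # about 17.5 Mbit/s
```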

What can you do to make it faster? Increase the TCP window size and/or reduce latency.

To increase the TCP window size you can make manual adjustments on each individual server to negotiate a larger window size. This leads to the obvious question: What size TCP window should you use? We can use the reverse of the calculation above to determine optimal TCP window size.

Formula to calculate the optimal TCP window size:

Bandwidth-in-bits-per-second * Round-trip-latency-in-seconds = TCP-window-size-in-bits

TCP-window-size-in-bits / 8 = TCP-window-size-in-bytes

So in our example of a 1GE link between Chicago and New York with 30 milliseconds round trip latency we would work the numbers like this…

1,000,000,000 bps * 0.030 seconds = 30,000,000 bits / 8 = 3,750,000 Bytes

Therefore if we configured our servers for a 3750KB TCP Window size our FTP connection would be able to fill the pipe and achieve 1Gbps throughput.
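The reverse calculation, from link speed and RTT to the window size that fills the pipe, as a quick sketch (helper name is mine):

```python
def optimal_window_bytes(bandwidth_bps, rtt_seconds):
    """Window needed to keep a link of the given speed full at the given RTT."""
    return bandwidth_bps * rtt_seconds / 8

# 1 Gbit/s link, 30 ms round trip:
print(optimal_window_bytes(1_000_000_000, 0.030))  # 3750000.0 bytes
```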

One downside to increasing the TCP window size on your servers is that it requires more memory for buffering on the server, because all outstanding unacknowledged data must be held in memory in case it needs to be retransmitted. Another potential pitfall, ironically, is performance where there is packet loss, because a lost packet within a window requires that the entire window be retransmitted – unless the TCP/IP stack on the server employs a TCP enhancement called "selective acknowledgements", which most do not.

Another option is to place a WAN accelerator at each end that uses a larger TCP window and other TCP optimizations such as TCP selective acknowledgements just between the accelerators on each end of the link, and does not require any special tuning or extra memory on the servers. The accelerators may also be able to employ Layer 7 application specific optimizations to reduce round trips required by the application.

Reduce latency? How is that possible? Unless you can figure out how to overcome the speed of light, there is nothing you can do to reduce the real latency between sites. One option is, again, to place a WAN accelerator at each end that locally acknowledges the TCP segments to the local server, thereby fooling the servers into seeing very low, LAN-like latency for the TCP data transfers. Because the local server sees very fast local acknowledgments rather than waiting for the far-end server to acknowledge, there is no need to adjust the TCP window size on the servers.

RiOS 5.5 SSL Enhancements

With 5.0, we have SSL auto-discovery so that administrators can whitelist or blacklist peers very easily and the peers are automatically discovered upon the first SSL connection and appear in the self-signed peer gray list. You simply mark them as trusted. The connections are not optimized until after you move the peers to the trusted whitelist. Both the client-side and server-side Steelhead appliances must use RiOS 5.0 or later.

- SSL certificates and private keys copied to the server-side Steelhead appliance (no certificate faking in branch offices)
- Auto-discovery of SSL Steelhead peers with gray-list capability
- Automatic optimization of SSL traffic
- Support for certificate domain wildcards