Analyzing TCP
Performance
Stephen Hemminger
Sr. Staff Engineer
Linux Kongress 2004
2004-09-09
Copyright 2004 OSDL, All rights reserved.
Agenda
■ Introduction
■ TCP for muggles
■ Engineering Process
■ Problem examples
■ Network Tools
■ Wrapup
Outside of scope
■ Non TCP protocols
■ SCTP, multicast, etc
■ Queuing theory - “no math”
■ Hardware and product comparisons
My Background
■ Did TCP back in the “old school”
■ BSD 4.2, Ethernet
■ SMP Unix versions of OSI, Netware, Appletalk, ...
■ Plan9 Hypercube communication
■ Linux
■ Incorporation of TCP research in 2.6 kernel
■ Performance tests for LWE
■ Wizard gap
Limits of my knowledge
■ Only worked with current Linux (2.4/2.6)
■ Will mention tools here that I have not used extensively
■ Involved in development of Linux, not deployment or research
TCP for “muggles”
■ connection establishment
■ slow start
■ windows
■ congestion control
■ silly window
Connection establishment
Client                              Server
connect   ---- SYN ------------->
          <--- SYN + ACK --------   accept
write     ---- Data 1 (10) ----->
          <--- Ack 11 -----------   read
ethereal
tcpdump trace
13:28:21.745624 IP 172.20.1.60.38052 > 216.239.39.99.http: S 1765497548:1765497548(0)
win 5840 <mss 1460,sackOK,timestamp 1563951453 0,nop,wscale 7>
13:28:21.831935 IP 216.239.39.99.http > 172.20.1.60.38052: S 227058185:227058185(0)
ack 1765497549 win 8190 <mss 1460>
13:28:21.832035 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 1 win 5840
13:28:21.832321 IP 172.20.1.60.38052 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840
13:28:21.939237 IP 216.239.39.99.http > 172.20.1.60.38052: . ack 126 win 31460
13:28:21.972448 IP 216.239.39.99.http > 172.20.1.60.38052: P 1:485(484) ack 126 win 31460
13:28:21.972529 IP 172.20.1.60.38052 > 216.239.39.99.http: . ack 485 win 6432
13:28:21.973016 IP 172.20.1.60.38052 > 216.239.39.99.http: F 126:126(0) ack 485 win 6432
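The gap between the SYN and the SYN/ACK in a trace like this gives a first estimate of the round trip time. A minimal sketch in Python, using the two timestamps from the trace above; the function name is made up for illustration:

```python
from datetime import datetime

def handshake_rtt_ms(syn_time: str, synack_time: str) -> float:
    """Milliseconds between two tcpdump timestamps (HH:MM:SS.ffffff)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(synack_time, fmt) - datetime.strptime(syn_time, fmt)
    return delta.total_seconds() * 1000.0

# Timestamps of the SYN and the SYN/ACK copied from the trace
rtt = handshake_rtt_ms("13:28:21.745624", "13:28:21.831935")
print(f"SYN -> SYN/ACK round trip: {rtt:.3f} ms")  # 86.311 ms
```

This assumes both packets were captured on the same day and on the client side, so the difference covers the full path to the server and back.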
Flow control
Sender                                Receiver
          <--- Ack 1010 (5000) -----
write     ---- Data 1011 (1400) --->
          ---- Data 2411 (1400) --->
          ---- Data 3811 (1400) --->
          ---- Data 5211 (800) ---->
          <--- Ack 6010 (0) --------
                                      read (1000)
          <--- Ack 6010 (1000) -----
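The segment sizes in the flow control exchange follow from the offered window and the segment size. A rough sketch, assuming a 5000 byte window and 1400 byte segments as in the diagram; the helper name is hypothetical:

```python
from typing import List

def segment_sizes(window: int, mss: int) -> List[int]:
    """Split an offered window into full segments plus one short tail segment."""
    sizes = []
    remaining = window
    while remaining > 0:
        size = min(mss, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

print(segment_sizes(5000, 1400))  # [1400, 1400, 1400, 800]
```

Once the segments add up to the offered window, the sender has to stop until the receiver reads data and opens the window again.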
Retransmission
write     ---- Data 1 ---->
          <--- Ack 1 ------
          <--- Ack 1 ------
Multiple acks = fast retransmit
          ---- Data 2 ---->
Tcptrace
http://tcptrace.org
Tool to convert captured data into graphs
■ Time sequence graph
■ Throughput
■ RTT
Lots more than there is time to cover here!
Xplot
http://xplot.org
■ Takes plot command scripts
■ Mouse
■ Zoom – drag with the left button
■ Zoom out – click the left button
■ Scroll – drag with middle button
■ Dump – shift-left button produces postscript
■ Shift-middle and shift-right also
Time Sequence Graph
Windows & Buffering
■ Used to isolate TCP from application read/write
■ Used for congestion control
■ Upper bound determined by system parameters
Congestion window
■ slow start
■ Window normally starts small
■ Grows in response to ack
■ congestion control
■ Packet loss = congestion
Silly Window
Sender                                Receiver
write 8k bytes
          <--- Ack [10] -----------
“Hey, I am not going to try and
send this data now, give me a
bigger window first”
                                      read 8k bytes
          <--- Ack [2000] ---------
“OK, thanks”
          ---- Data (2000) ------->
Model of TCP networks
Sender (Send Window)   ---- Data ---> Network ---> Receiver (Receive Window)
Sender (Send Window)   <--- Ack ----- Network <--- Receiver (Receive Window)
BDP = Bandwidth (bytes/sec) * Delay (secs)
BDP - Bandwidth Delay Product
■ BDP = amount of data in transit
■ Examples
■ DSL/Cable modem (international)
1,000,000 bit/sec
* 1/8 byte/bit
* 500 ms = 62500 bytes
■ Gigabit across US
1,000,000,000 bit/sec
* 1/8 byte/bit
* 70 ms = 8.75 Mbytes
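The two examples can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the function name is made up for illustration:

```python
def bdp_bytes(bandwidth_bits_per_sec: float, delay_sec: float) -> float:
    """Bandwidth delay product: bytes in transit on a full link."""
    return bandwidth_bits_per_sec / 8 * delay_sec

# DSL/Cable modem (international): 1 Mbit/sec, 500 ms
print(bdp_bytes(1_000_000, 0.500))       # 62500.0

# Gigabit across US: 1 Gbit/sec, 70 ms, roughly 8.75 million bytes
print(bdp_bytes(1_000_000_000, 0.070))
```

A window smaller than the BDP cannot keep the link full, which is why the gigabit case needs megabytes of buffering.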
Bandwidth Delay Product (BDP)
[Figure: Bandwidth (Mbits/sec) vs Delay (ms), both axes logarithmic, with BDP lines at 8K, 64K and 1M; regions labeled LAN, Research and Broadband]
Internet
■ Router queues
■ Delays
■ Speed of light (70ms coast/coast)
■ Slow routers
■ Packet correlation, sizes
■ DoS
Extensions for larger windows
■ TCP Selective Acknowledgement (SACK) (RFC2018)
■ Don't have to retransmit everything
■ Window scaling (RFC1323)
■ Window size multiplied by 2^n
■ Protection Against Wrapped Sequence (PAWS)
■ Timestamp inside each packet
TCP options negotiation 1
Window scale by 4
IP 172.20.1.60.32820 > 216.239.39.99.http: S 3599527174:3599527174(0) win 5840
<mss 1460,sackOK,timestamp 2519711 0,nop,wscale 2>
IP 216.239.39.99.http > 172.20.1.60.32820: S 3820474812:3820474812(0) ack 3599527175
win 8190 <mss 1460>
IP 172.20.1.60.32820 > 216.239.39.99.http: . ack 1 win 5840
IP 172.20.1.60.32820 > 216.239.39.99.http: P 1:126(125) ack 1 win 5840
But server doesn't support scaling
TCP options negotiation 2
Window scale by 4
IP 172.20.1.60.32823 > 65.172.181.13.http: S 4120108902:4120108902(0) win 5840
<mss 1460,sackOK,timestamp 3036627 0,nop,wscale 2>
IP 65.172.181.13.http > 172.20.1.60.32823: S 2295773021:2295773021(0) ack 4120108903
win 5792
<mss 1460,sackOK,timestamp 1818411318 3036627,nop,wscale 0>
IP 172.20.1.60.32823 > 65.172.181.13.http: . ack 1 win 1460 <nop,nop,timestamp
3036628 1818411318>
IP 172.20.1.60.32823 > 65.172.181.13.http: P 1:144(143) ack 1 win 1460
<nop,nop,timestamp 3036628 1818411318>
Your scaling is okay, but don't scale mine
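In the second trace the client's advertised window drops from 5840 in the SYN to 1460 afterwards. Windows in SYN packets are never scaled; once scaling is negotiated, later values are shifted by the sender's own scale. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def effective_window(advertised: int, wscale: int) -> int:
    """Real window in bytes for a non-SYN segment, given the sender's shift count."""
    return advertised << wscale

# Client announced wscale 2, so "win 1460" after the handshake means:
print(effective_window(1460, 2))  # 5840

# In the first trace the server ignored scaling, so the value is taken as-is:
print(effective_window(5840, 0))  # 5840
```

Both connections therefore start with the same real window; only the encoding on the wire differs.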
Linux TCP window tuning
■ Send window – net.ipv4.tcp_wmem
■ three values: min default max
■ default is 4K 16K 128K
■ also limited by net.core.wmem_max
■ Receive window – net.ipv4.tcp_rmem
■ three values: min default max
■ default is 4K 85K 170K
■ also limited by net.core.rmem_max
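On Linux these settings can be read from /proc/sys/net/ipv4/tcp_wmem and tcp_rmem as three whitespace-separated numbers (the kernel documentation calls them min, default and max). A minimal parsing sketch; the function name and the sample string are illustrative, with the sample matching the 4K 16K 128K default above:

```python
from typing import Dict

def parse_tcp_mem_triple(text: str) -> Dict[str, int]:
    """Parse the three fields of tcp_wmem or tcp_rmem into a dict of byte counts."""
    minimum, default, maximum = (int(field) for field in text.split())
    return {"min": minimum, "default": default, "max": maximum}

# Sample content; on a real system, read it from /proc/sys/net/ipv4/tcp_wmem
sample = "4096\t16384\t131072\n"
limits = parse_tcp_mem_triple(sample)
print(limits)  # {'min': 4096, 'default': 16384, 'max': 131072}
```

The sample is parsed from a string rather than from /proc so the sketch also runs on machines that are not Linux.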
Linux TCP window tuning
■ Overall memory – net.ipv4.tcp_mem
■ three values : low pressure max
■ automatic value based on system memory
■ Application window – net.ipv4.tcp_app_win
■ reserved space to handle slow applications
But!
■ Some firewalls and routers are buggy
■ Corrupt the window scale option, changing N to 0
■ Forget to track state, or read the RFC wrong
■ Connections will hang because the initial window looks like a silly window
■ 1% of the net is buggy
■ Linux 2.6.9 chooses window scale based on maximum possible receive window
■ Default tcp_rmem => window scale of 2
■ Buggy devices will see ¼ of the real window
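The scale of 2 can be reproduced from the default receive limit. The sketch below picks the smallest shift that lets a 16 bit window field cover the maximum receive window; it illustrates the idea and is not the kernel's code, and it assumes "170K" means 170 * 1024 bytes:

```python
def min_window_scale(max_window: int) -> int:
    """Smallest shift (0..14) such that a 16 bit window field can express max_window."""
    shift = 0
    while (65535 << shift) < max_window and shift < 14:
        shift += 1
    return shift

max_rcv = 170 * 1024                 # assumed value of the default tcp_rmem max
shift = min_window_scale(max_rcv)
print(shift)                          # 2

# A device that forces the scale to 0 reads the raw field without the shift
print(1 / (1 << shift))               # 0.25 of the real window
```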
Break
Performance Engineering process
■ Define your goal
■ Capture information
■ Analyze and form hypothesis
■ Prototype to validate hypothesis
■ If successful
■ Make changes on production system
■ Report problems or patches to others
Goal setting
■ Know what is possible:
■ bus bandwidth, network latency, etc.
■ Know your application
■ Compare with similar applications
TCP performance testing
■ Goal: Improve TCP performance over high
bandwidth * delay links
■ Plan:
■ New TCP congestion control
■ Validate and test
Testing TCP over WAN
■ Want to test performance of TCP over high BDP
links
■ Can't afford a 10Gbit trans-continental link
■ Proposal: emulate network delay over 1Gbit
Ethernet
Existing network emulation tools
■ Dummynet
http://info.iet.unipi.it/~luigi/ip_dummynet/
I don't want to set up a separate FreeBSD machine
■ NISTnet
http://snad.ncsl.nist.gov/itg/nistnet/
Only on 2.4 and not ready to be in the main tree
Netem
TCP
IP
netem
Ethernet (eth0)
http://developer.osdl.org/shemminger/netem
■ Started out as simple delay only hack
■ Has grown to cover all the functionality of NISTnet
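Netem is configured as a queueing discipline through the tc command. The sketch below only builds the command line for a fixed delay and prints it; the helper name is hypothetical, and actually running the command needs root privileges and a kernel with netem available:

```python
from typing import List

def netem_delay_command(device: str, delay_ms: int) -> List[str]:
    """Argument list for adding a fixed netem delay on one interface."""
    return ["tc", "qdisc", "add", "dev", device, "root",
            "netem", "delay", f"{delay_ms}ms"]

# Emulate a 100 ms link on eth0 (printed here, not executed)
print(" ".join(netem_delay_command("eth0", 100)))
# tc qdisc add dev eth0 root netem delay 100ms
```

Returning a list rather than a string keeps the arguments ready for subprocess.run without shell quoting.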
Current TCP research
■ Alternative TCP congestion control
■ Vegas
■ Westwood
■ Binary Increase Congestion Control (BIC)
■ Research community based around Web100
TCP Reno
■ Standard default in 2.4/2.6
■ Adjusts congestion window based on packet loss
■ Slow start – window starts small and grows with each Ack
■ Additive Increase window on each Ack
■ Multiplicative Decrease on loss
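The shape of the Reno window over time can be sketched with a toy model that steps once per round trip. It is a simplification under stated assumptions, not the kernel algorithm: the window is counted in packets, slow start doubles the window each round trip up to a threshold, and a loss halves it. All names are illustrative:

```python
from typing import List, Set

def reno_window_trace(rounds: int, ssthresh: float, loss_rounds: Set[int]) -> List[float]:
    """Congestion window (packets) at the start of each round trip."""
    cwnd = 1.0
    trace = []
    for rnd in range(rounds):
        trace.append(cwnd)
        if rnd in loss_rounds:
            ssthresh = max(cwnd / 2, 2.0)   # multiplicative decrease on loss
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per round trip
        else:
            cwnd += 1.0                     # additive increase
    return trace

print(reno_window_trace(10, 8, {6}))
# [1.0, 2.0, 4.0, 8.0, 9.0, 10.0, 11.0, 5.5, 6.5, 7.5]
```

The sawtooth after the loss is what limits throughput on long delay links: each lost packet costs many round trips of additive increase to recover.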
TCP Vegas
■ Original work by Larry Peterson
■ Patches existed for 2.2, 2.4 and part of web100
■ sysctl net.ipv4.tcp_cong_avoid
■ Measure bandwidth based on RTT
■ Adjust congestion window on bandwidth
■ Avoids packet loss
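The Vegas idea can be sketched as one window update per round trip: compare the rate expected at the minimum RTT with the rate actually seen, and treat the difference as packets queued in the network. The thresholds alpha and beta below are illustrative assumptions, as are the function and parameter names:

```python
def vegas_adjust(cwnd: float, base_rtt: float, rtt: float,
                 alpha: float = 2.0, beta: float = 4.0) -> float:
    """Return the next congestion window (packets) after one round trip."""
    expected = cwnd / base_rtt            # rate if nothing were queued
    actual = cwnd / rtt                   # rate actually achieved
    queued = (expected - actual) * base_rtt  # estimated packets sitting in queues
    if queued < alpha:
        return cwnd + 1                   # path looks underused: grow
    if queued > beta:
        return cwnd - 1                   # queues building: back off before loss
    return cwnd

print(vegas_adjust(10, 0.100, 0.100))  # 11, no queueing delay seen
print(vegas_adjust(20, 0.100, 0.200))  # 19, RTT doubled so queues are building
```

Because the decision depends on RTT measurements, any error in the base RTT or in the bandwidth estimate directly changes how much of the window gets used.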
TCP Westwood
■ Work by Claudio Casetti
■ Patches for 2.4 by Angelo Dell'Aera
■ sysctl net.ipv4.tcp_westwood
■ Focused on wireless
■ packet loss != congestion
■ Measure bandwidth based on RTT
■ Use normal Reno till congestion then adjust
congestion window based on bandwidth
Binary Increase Congestion Control (BIC)
■ Work by Lisong Xu
■ Patches for Web100 (2.4)
■ sysctl net.ipv4.tcp_bic
■ Designed for high speed networks
■ Modification of Reno
■ Use additive increase when congestion window
is large
■ Binary search increase when window is small
Tuning
■ Default tcp parameters not big enough
■ Need bigger send and receive window
■ Send window autosized based on rtt already
■ Receive window autosizing was done in Web100
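Why the defaults are too small follows from a simple bound: a connection can have at most one window of data in flight per round trip. A rough sketch, assuming the 85K default receive window mentioned earlier means 85 * 1024 bytes; the function name is made up:

```python
def max_throughput_mbits(window_bytes: int, rtt_sec: float) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 / rtt_sec / 1e6

default_window = 85 * 1024  # assumed default receive window in bytes
for delay_ms in (1, 10, 100):
    limit = max_throughput_mbits(default_window, delay_ms / 1000)
    print(f"{delay_ms:4d} ms delay: at most {limit:.1f} Mbits/sec")
```

With a 100 ms delay the bound is about 7 Mbits/sec, far below a gigabit link, so the window has to grow for the link to fill.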
Receiver Tuning
■ Patches from John Heffner
■ sysctl net.ipv4.tcp_moderate_rcvbuf
■ Dynamic Right Sizing (DRS)
■ adjust receive window based on RTT
■ If application doesn't set window then do it for them
■ Window will grow from default to max
Receiver auto-tuning
[Figure: Throughput (Mbits/sec) vs Delay (ms), 0 to 200 ms; series: Default, Auto Tuned]
Throughput vs Delay (initial run)
[Figure: Bandwidth (Mbits/sec) vs Delay (ms), 0 to 200 ms; series: Reno, Vegas, Westwood, BIC]
What's happening
■ NAPI
■ Driver API to allow avoiding interrupts
■ Trades off latency for overall performance
■ E1000 driver
■ Uses NAPI for transmit
Answer: Transmit ring gets full and driver flow
blocks
Solution: set TxDescriptors=1000
Throughput vs Delay (rerun)
[Figure: Throughput (bits/sec) vs Delay (ms), 0 to 200 ms; series: Reno, Vegas, Westwood, BIC]
Performance still slow
■ Vegas and Westwood are terrible
■ Not at full link speed
■ Performance falling off with delay
[Figure: Vegas trace with 100ms delay]
[Figure: Vegas detail]
[Figure: Westwood (70ms)]
[Figure: Westwood detail]
[Figure: BIC trace (100ms)]
[Figure: BIC detail (100ms)]
How to squeeze out more performance
■ Large MTU (4k): +63%
■ LAN driver built in (not a module): up to 10%
■ Turn off timestamps: +4%
■ Bind IRQ to processor: varies
Congestion: more work needed
■ Vegas doesn't use available window
■ Does it underestimate bandwidth?
■ Westwood
■ Another bandwidth problem
■ BIC
■ When does it make it into binary mode?
■ What is holding back the window?
■ Netem
■ Higher resolution? Packet groups?
Break
Other tools
■ Information about
■ ISP connection
■ Sockets open
■ Testing infrastructure
■ More data capture
■ Monitoring
Tools: basic
■ Network path information
■ Ping – send icmp echo
■ Measure of round trip time and loss
■ Can be blocked by firewall
■ Traceroute – probe the path with increasing TTL values
■ IP source routing is usually blocked now
■ Pathcapture (pcap)
■ Bandwidth and delay measurement
Tools: Network interface
■ ifconfig
■ Basic statistics, packets sent/received/errors
■ ip -stats link
■ Newer alternative, may have more info
■ SNMP
■ Remote access to same information
■ Slightly more work
Tools: Sockets
■ Netstat
■ TCP statistics
■ Open sockets
■ Ss
■ More statistics available (rtt, etc)
■ Recvmsg
■ Application can see TCP info (cmsg)
Tools: test servers
■ SYN test
telnet syntest.psc.edu 7960
■ TCP bandwidth
http://www.epm.ornl.gov/~dunigan/java/misc/tcpbw.html
http://dslreports.com
■ ANL network config
http://miranda.ctd.anl.gov:7123
■ Path MTU
http://www.ncne.org/jumbogram/mtu_discovery.php
Tools: testing
■ Ttcp
■ Basic send/receive throughput
■ Iperf
■ Longer running tests and turnaround
■ Netperf
■ Includes cpu and other statistics
■ Dbs
■ Multiclient testing
Tools: monitoring
■ Ntop
■ Measure of network activity by service
■ Nice web interface
■ Mailgraph
■ Long term mail statistics
■ Web server activity log analysis
Tools: data capture
■ Tcpdump
■ Filter packets by protocol, address, etc
■ Decode many protocols
■ Ethereal
■ GUI interface
■ RMON
■ Remote monitoring
■ Kismet
■ Wireless activity
Tools: generators
■ Pktgen
■ Kernel level packet generation
■ Can generate maximum hardware packet rate
■ Network packet generator
■ Application level
Tools: simulation
■ Ns
■ Describe overall system
■ Event based simulation
■ Used for protocol analysis
■ SSFnet
■ More detailed models of real hardware
Tools: client simulator
■ Web
■ SPECweb, Apache (ab), httpload
■ NFS
■ Nfsstone
■ FTP
■ Dkftpbench
Conclusion
■ Data capture can provide clues to:
■ Application problems
■ Device problems
■ TCP/IP problems
■ Nothing is ever simple