


WebTP - A Transport-Layer Differentiated Services Architecture

Ye Xia, Hoi-Sheung Wilson So, Jeng Lung, Yogesh K. Bhumralkar, Jean Walrand, David Tse
Department of Electrical Engineering and Computer Science, University of California, Berkeley
(This project is funded by NSF Award #9872764.)

Abstract

We introduce an experimental transport protocol, WebTP, that can be used to provide differentiated quality-of-service (QOS) to flows in a "best-effort" fashion on network segments. WebTP encapsulates TCP and UDP packets, aggregates many flows for integrated congestion control, and schedules the transmission of their packets. Compared with layer-3 QOS-provisioning frameworks, WebTP can be deployed gradually with less overall planning, mainly due to its adaptive nature and its automatic gateway discovery mechanism. In this paper, we describe the WebTP protocol and its implementation on the Linux operating system.

1 Introduction

In this paper, we report the design and implementation of the WebTP protocol. The WebTP protocol can be viewed as a layer-3/layer-4 hybrid under the OSI model, and can be understood in relation to two streams of active research during the past decade. It was first conceived as a new end-to-end transport protocol that tries to integrate and implement various recommendations for protocol improvement based on extensive research on TCP. Second, it tries to provide quality of service (QOS) to connections, which is a goal shared by some IP-layer network control frameworks, most notably DiffServ [8] and MPLS [16].

1.1 Transport Functions: Application Support and Network Control

The primary function of the transport layer is to establish a logical end-to-end communication path between the sending application and the receiving application, shielding the applications from the actual network, which may consist of heterogeneous pieces with different technologies, topologies, bandwidths, routing/forwarding behaviors and reliability levels. A transport packet typically has identifiers for the sending and receiving applications, such as the port numbers in TCP and UDP. The network may corrupt data packets in a variety of ways. It may also delay or drop packets when congested. It may route packets through different network paths, causing them to arrive out of order at the receiver. If reliable communication is required, a reliable transport protocol such as TCP is used to correct packet errors caused by imperfect networks. TCP can recover packet losses through retransmission of the lost packets, re-sequence packets that are misordered by the network, and detect duplicated or erroneous packets. In order to do so, the transport layers at the two hosts typically establish a reverse communication channel for passing feedback or acknowledgment information, notifying the sender of the reception status of data packets. When the communication is not required to be reliable, an unreliable transport protocol such as UDP is used. UDP does not have a feedback path and cannot perform most of the error recovery functions. It has the advantages of simplicity and improved response time. We call the transport functions outlined above application-support functions. Congestion control capability was added to TCP later; it belongs to what we call network control functions. Other such functions include end-to-end flow control, which ensures that the amount of transmitted data does not overflow the receiver's capacity to consume or buffer it, and bandwidth allocation for connections.
Note that TCP does not explicitly allocate bandwidth to connections, but does so implicitly through its window-based congestion control algorithm. (See [12] and [14] on this point.)

Figure 1: OSI networking model with WebTP

Our discussion in this paper only covers WebTP's implementation of the network control functions. The WebTP protocol can be viewed as a transport protocol not only because it implements the network control part of the transport functions, but also because it conceptually sits above the IP layer in the OSI model. An extension to WebTP discussed in this paper can also be used in end hosts as an alternative to TCP or UDP. Figure 1 shows the OSI model with WebTP. Note that, in the figure, WebTP is one of the choices for the transport protocol at the end hosts. However, this does not have to be the case.

1.2 Differences with Other Layer-3 or Layer-4 Approaches

Important differences exist between WebTP and TCP. First, WebTP's feedback control loop does not have to run end-to-end. In fact, it is mainly meant to run on network segments between two WebTP gateways, which are machines on network paths (i.e., IP routes) that run the WebTP protocol. An end-to-end connection may see multiple such segments on its route. By doing so, we break down an end-to-end network control loop into multiple smaller control loops. We call this segment-by-segment network control. An illustration of network control with WebTP gateways is shown in figure 2.

Figure 2: Control of network segments with WebTP

Second, WebTP's congestion control is performed over a set of connections grouped into a pipe, which is a logical communication path between two adjacent WebTP gateways on an IP route. We call this integrated congestion control. Since the bandwidth probed by the congestion control algorithm is shared by multiple connections, it is possible to give differential treatment to the connections in a pipe. Thus, WebTP can explicitly implement bandwidth allocation by scheduling packet transmissions from connections. If IP routers implement some form of active queue management (AQM), it is also possible to give differential treatment of delays to the connections.

Since WebTP relies on IP routing, the QOS that it can provide is limited by its inability to select routes. Hence, WebTP's QOS-related capabilities are less powerful than those of proposed systems that allow QOS-based routing. However, it is not entirely impossible to allow some form of QOS-based routing among WebTP gateways in the future. Unlike QOS control frameworks with reservation and admission control, such as DiffServ, WebTP can only provide QOS on a "best-effort" basis and cannot promise hard guarantees. This is because WebTP probes the available bandwidth on a network segment using a feedback-based, adaptive control scheme, much like TCP's congestion control. WebTP gateways do not know the topology of the surrounding network and do not exchange complicated control or signaling messages with other WebTP gateways, except messages used in gateway discovery and simple ACK packets that acknowledge successful transmissions of data packets exchanged between two adjacent gateways on each route.
As a result, WebTP can only make "local" trade-offs among existing connections within the same pipe, as opposed to the network-wide trade-offs envisioned by DiffServ. Nevertheless, WebTP still represents a step forward from the even more "myopic" nature of the original TCP/IP. If a pair of WebTP gateways controls a high-speed network segment with many connections, such local trade-offs are likely to be feasible and can be made useful. WebTP's approach has a clear advantage over reservation-based frameworks in its simplicity and ease of deployment. WebTP gateways can first be deployed at important locations in the network as needed, and gradually extend their coverage, before the network managers have a full understanding of or an overall plan for the entire network.

As a final note, there is nothing radical about pushing closed-loop control inside the network when one realizes that functions such as routing and resource reservation are naturally thought to operate from within the network. Here, we simply move more transport functions into the network.

1.3 Differences with the Congestion Manager, SCTP and DCP

The idea of integrated congestion control and its benefits can be found in many previous studies [4] [6] [9] [15] [19]. With respect to this feature, WebTP is most related to the congestion manager (CM) reported in [4] and [3]. WebTP differs from the above in its realization of this feature. In [3], we see that CM is not implemented as a new protocol, but as an algorithmic enhancement in the networking stack at end hosts. In WebTP, which defines a new layer of header for packets, the congestion control is truly integrated because WebTP completely serializes the transmission of packets that are subject to the integrated control. Together with WebTP's automatic gateway discovery algorithm, i.e., the Hello protocol, we can close the control loop between two arbitrary points in the network, and hence realize domain-to-domain-based integrated congestion control. As a result, WebTP is more powerful than CM, and can be used as a general framework for providing differentiated services.

The Stream Control Transmission Protocol (SCTP) [18] is a reliable end-to-end transport protocol proposed at the IETF. It has an array of features, most of which are orthogonal to WebTP's non-native mode but quite related to its native mode. Since this paper is about WebTP's non-native mode, we won't compare SCTP with WebTP's native mode, which is described in [20].

The Datagram Control Protocol (DCP) [11], as an end-to-end protocol for unreliable datagram flows, fills a certain void left by SCTP. DCP has handshakes for connection setup and tear-down. It defines request and response control messages for feature negotiation, and is therefore quite flexible and extensible. For instance, at connection setup, the end hosts negotiate which of potentially many congestion control algorithms to use. Being an end-to-end protocol, DCP can stop traffic from entering an already congested network. On the other hand, if we don't extend WebTP to the end hosts, the combination of non-native WebTP and UDP can only drop excess packets already in the network. Similar to WebTP, DCP uses ACK vectors for feedback when such an option is requested. DCP compresses a variable-length ACK vector with run-length encoding into a variable-length ACK option. Moreover, the ACK traffic itself can also be acknowledged in DCP, and can therefore be made fully reliable.
WebTP uses a simpler 32-bit ACK vector and relies on appropriate ACK placement to combat ACK losses. DCP's ACK traffic is congestion-controlled through ACK-ratio negotiation; WebTP can only use a fixed ACK ratio. DCP does not share other features of the non-native WebTP. We omit the more appropriate comparison of DCP with the native WebTP.

2 Protocol and Architecture

2.1 WebTP's Non-native Mode: Encapsulation of TCP or UDP Packets

We now discuss the essential components of the WebTP protocol. WebTP has two modes of operation: the native mode and the non-native mode. When operating in the native mode, WebTP replaces TCP or UDP at the end hosts as the end-to-end transport protocol with a complete set of transport functions. The focus of this paper is WebTP running on WebTP gateways in the non-native mode. In this mode, applications on the end hosts are unaware of WebTP and use TCP or UDP as their transport as usual. WebTP gateways capture TCP or UDP packets and encapsulate them into WebTP packets. The last WebTP gateway on an IP route removes the WebTP header and re-injects the TCP or UDP packet.

Figure 2 depicts a situation where a TCP connection between end hosts H1 and H2 encounters three WebTP gateways, W1, W2 and W3, on its IP route, drawn in dotted lines. Host H1 sends a normal TCP packet, which is captured by W1. Suppose W1 knows that packets destined for H2 will go through W2 next. It then establishes a pipe between itself and W2, if one does not already exist. W1 then encapsulates the entire TCP packet, including the IP header, into a WebTP packet. A new IP header is added to the WebTP packet, with the source address set to the outgoing interface of the route at W1 and the destination address set to the incoming interface of the route at W2. The new IP packet is then released to the network. When W2 receives the IP packet that contains the WebTP packet, it removes the IP header and sends an acknowledgment (ACK) packet back to W1. Suppose that, at this point, there is also a pipe between W2 and W3. W2 modifies the WebTP header using information relevant to this pipe and encapsulates the resulting WebTP packet into a new IP packet with the addresses corresponding to the interfaces of W2 and W3. When the new WebTP packet arrives at W3, W3 removes the IP header and the WebTP header to recover the original TCP packet, which is then released to the network. Relying on the IP header, the TCP packet will eventually arrive at H2.

Figure 3 illustrates what we mean by a TCP packet or a WebTP packet. To distinguish the two IP headers, we call the outer IP header the IP2 header and the inner one the IP1 header. For simplicity, we call a TCP or UDP packet a TU packet.

Figure 3: TCP packet and WebTP packet
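To make the encapsulation step concrete, the following sketch shows how a captured TU packet could be wrapped for the pipe toward the next-hop gateway. The fixed 16-byte header length, the helper functions and all names are illustrative assumptions based on the description above, not the implementation's actual interfaces (in the implementation described in Section 3.3, the kernel builds the IP2 header).

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define WEBTP_HDR_LEN 16   /* fixed non-native WebTP header, see figure 7 */

    /* Assumed helpers, not part of WebTP's actual interface. */
    void build_ip2_header(unsigned char *buf, uint32_t src, uint32_t dst,
                          size_t payload_len);
    void build_webtp_header(unsigned char *buf, uint32_t packet_number);

    /* Wrap a captured TU packet (its IP1 header included) into a WebTP
     * packet for the pipe between local_addr and the next-hop gateway. */
    unsigned char *encapsulate_tu(const unsigned char *tu, size_t tu_len,
                                  uint32_t local_addr, uint32_t gw_addr,
                                  uint32_t packet_number, size_t *out_len)
    {
        size_t ip2_len = 20;                  /* outer IPv4 header, no options */
        unsigned char *pkt = malloc(ip2_len + WEBTP_HDR_LEN + tu_len);
        if (!pkt)
            return NULL;

        build_ip2_header(pkt, local_addr, gw_addr, WEBTP_HDR_LEN + tu_len);
        build_webtp_header(pkt + ip2_len, packet_number);
        memcpy(pkt + ip2_len + WEBTP_HDR_LEN, tu, tu_len);  /* IP1 + TCP/UDP + data */

        *out_len = ip2_len + WEBTP_HDR_LEN + tu_len;
        return pkt;
    }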
2.2 Definition of a Flow and a Pipe

We say two WebTP gateways are adjacent if they are on a common IP route and there are no other WebTP gateways between them on that route. A pipe is the logical communication channel between two interfaces of two adjacent WebTP gateways. Inside a WebTP gateway, it is specified by a pair of IP addresses corresponding to the two interfaces, one local address and one foreign address. A pipe is used as the device to aggregate all flows that go between the two IP addresses that specify the pipe, for the purposes of integrated congestion control and packet transmission scheduling. A flow resembles a uni-directional TCP connection. More precisely, in a WebTP gateway, a flow is specified by a 5-tuple: the protocol, and the port numbers and IP addresses of the source and destination hosts, where the protocol can be either TCP or UDP. Unlike a TCP connection, a flow is uni-directional. This is in accordance with the possibility of asymmetric routes, in which the two directions of a TCP connection go through different sets of WebTP gateways.

As examples, figure 4 shows that, at gateway A, flows 1 and 2 are multiplexed into pipe 2, and flows 3 and 4 are multiplexed into pipe 3. All these flows belong to pipe 1 when they enter gateway A. Note that each flow is associated with a receiving pipe and a sending pipe. Figure 5 shows how a bi-directional TCP connection between hosts A and B becomes two uni-directional flows at each WebTP gateway. Figure 6 shows the same connection in the situation where the forward and reverse IP routes are asymmetric. Note that the two directions involve two different sets of WebTP gateways.

Figure 4: Multiplexing flows into pipes

Figure 5: A TCP connection between host A and B becomes two uni-directional flows in a WebTP gateway.

Figure 6: TCP connection between host A and B with asymmetric routes

2.3 Packet Header Format

The packet format for WebTP's non-native mode is very simple, as shown in figure 7. The fields in the packet header are used by congestion control, the Hello protocol, and fragmentation and reassembly.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         Packet Number                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                     Acknowledgment Number                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                     Acknowledgment Vector                     |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | Data  |S|A|N|M|               |                               |
    | Offset|Y|C|A|F|      RES      |        Fragment Offset        |
    |       |N|K|V| |               |                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                              data                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 7: WebTP packet header format for the non-native mode (NAV = 0)

The meanings of the fields are as follows.

Packet Number (PN): The sequence number of the packet, which is incremented by one for each WebTP packet. The sequence number is shared by all flows of the pipe.

Acknowledgment Number: The highest packet sequence number for which the ACK packet contains acknowledgment information.

Acknowledgment Vector: A bit map in which each entry indicates whether the corresponding packet has been received (1) or not (0).

Fragment Offset: The byte offset of the WebTP packet fragment. This is used when the WebTP packet is fragmented.

A value of 1 for the control bits has the following meanings.

SYN (Synchronization): This is a packet for initiating the Hello protocol.

ACK (Acknowledgment): This packet carries acknowledgment information. When SYN is also 1, this packet is a response in the Hello protocol.

NAV: Native WebTP mode. In this mode, extra header fields can be defined.

MF: More fragments are coming. This is used together with Fragment Offset when a WebTP packet is fragmented. MF is 0 in the last fragment and 1 in all other fragments.
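For concreteness, the 16-byte fixed header of figure 7 can be written down in C roughly as follows. This is only an illustrative sketch: the widths of RES and Fragment Offset are inferred from the figure, the names are ours, and a real implementation would pack and byte-swap fields explicitly rather than rely on compiler bit-field layout.

    #include <stdint.h>

    /* Illustrative layout of the non-native WebTP header (figure 7).
     * Bit-field ordering is compiler- and endianness-dependent. */
    struct webtp_hdr {
        uint32_t packet_number;      /* pipe-level sequence number       */
        uint32_t ack_number;         /* highest packet number acked      */
        uint32_t ack_vector;         /* 32-bit receive bit map           */
        uint8_t  data_offset:4,      /* header length, in 32-bit words   */
                 syn:1, ack:1, nav:1, mf:1;
        uint8_t  reserved;           /* RES, assumed to be 8 bits        */
        uint16_t fragment_offset;    /* byte offset of this fragment     */
    };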
Note that in the non-native mode, the NAV bit must be 0. In the native mode (NAV = 1), WebTP can be used as the transport protocol on end hosts. For that purpose, many more fields need to be defined. We do not discuss this scenario further, but refer interested readers to [20] for some possibilities.

2.4 Hello Protocol: Automatic Discovery of WebTP Gateways

Two adjacent WebTP gateways must detect the presence of each other in order to close the control loop for a pipe between them. One of our main considerations in the design of WebTP is to allow incremental deployment of WebTP gateways. For that purpose, a WebTP gateway should discover its peer automatically if one exists on the IP route that is in use. This is achieved by the Hello protocol.

When a WebTP gateway captures a TU packet whose destination host is, say, D, it tries to find out whether there is a downstream adjacent WebTP gateway on the route to D, which we call the next-hop WebTP gateway. For that, the gateway queries a table that keeps the mapping between the IP addresses of destination hosts and the next-hop WebTP gateways, called the host-to-gateway mapping table (HTG map). If the HTG map shows that the next-hop gateway exists, the TU packet goes through WebTP processing and is sent out encapsulated in a WebTP packet. If the next-hop gateway does not exist, the TU packet does not go through WebTP processing and is released to the network. If the HTG map does not know whether the next-hop WebTP gateway exists, the TU packet is released to the network first and the Hello protocol is initiated to discover the next-hop gateway.

A Hello packet is constructed with the source IP address set to the outgoing interface's address at the gateway and the destination IP address set to the final destination host's IP address, in this case D, and is released to the network. Under normal IP routing, the Hello packet follows the same route as regular TU packets. When it passes through the next-hop WebTP gateway, that gateway intercepts it and returns a Hello-ACK packet, with the source address set to one of its interface addresses and the destination address set to the source address of the Hello packet. The Hello-ACK packet then returns to the upstream WebTP gateway. Both the Hello packet and the Hello-ACK packet use the packet headers of the original TU packet as their payload, including the IP and TCP/UDP headers. Thus, the upstream gateway can establish the mapping between the IP address of the destination host and the next-hop WebTP gateway's address. At this point, a pipe can be established, since both its local and foreign IP addresses are known. There is also enough information to establish the mapping from the flow associated with the TU packet to the pipe.

A Hello packet is one with SYN = 1 and ACK = 0. A Hello-ACK packet is one with SYN = 1 and ACK = 1. The payload of either packet must start with an IP header, followed by another header depending on the IP header's protocol field. If the IP protocol field is TCP, a TCP packet header follows the IP header; if the protocol is UDP, a UDP packet header follows the IP header. As mentioned earlier, the payload comes from the headers of the TU packet that triggered the Hello protocol. Figure 8 shows the various packets involved in a situation where host A sends a TCP packet to host D via WebTP gateways B and C, in that order. Assume B1 is the IP address of the egress interface at B and C1 is the IP address of the ingress interface at C for the IP route from A to D. Note that the source and destination IP addresses are B1 and D, respectively, for the Hello packet's IP2 header. For the Hello-ACK packet, the source and destination IP addresses are C1 and B1, respectively. Also notice that the IP and TCP headers of the original TCP packet are encapsulated as the payload of the Hello and Hello-ACK packets.

Figure 8: TCP, Hello and Hello-ACK packets for the case where host A sends data to host D via TCP, passing WebTP gateways B and C.
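The decision a gateway makes when it captures a TU packet can be summarized in the following sketch. All types and helper names (htg_lookup, send_hello, and so on) are illustrative assumptions drawn from the description above, not functions of the actual WebTP implementation.

    #include <netinet/in.h>
    #include <stddef.h>

    enum htg_state { HTG_AVAILABLE, HTG_UNAVAILABLE, HTG_UNKNOWN };

    struct htg_entry { struct in_addr gateway; enum htg_state state; };
    struct tu_packet { struct in_addr dst_addr; unsigned char *headers; size_t header_len; };
    struct pipe;

    /* Assumed helpers (illustrative only). */
    struct htg_entry *htg_lookup(struct in_addr dst_host);
    struct pipe *pipe_find_or_create(struct in_addr local, struct in_addr foreign);
    struct in_addr local_addr_for(const struct htg_entry *e);
    void pipe_enqueue(struct pipe *p, struct tu_packet *tu);
    void reinject_to_network(struct tu_packet *p);
    void send_hello(struct in_addr dst_host, const unsigned char *tu_headers, size_t len);

    /* Handling of a captured TU packet, per Section 2.4. */
    void handle_captured_tu(struct tu_packet *p)
    {
        struct htg_entry *e = htg_lookup(p->dst_addr);

        if (e && e->state == HTG_AVAILABLE) {
            /* Next-hop gateway known: encapsulate and schedule on the pipe. */
            pipe_enqueue(pipe_find_or_create(local_addr_for(e), e->gateway), p);
        } else if (e && e->state == HTG_UNAVAILABLE) {
            /* No WebTP gateway downstream: release the packet unchanged. */
            reinject_to_network(p);
        } else {
            /* Unknown: release the packet, then probe for a next-hop gateway.
             * The Hello packet is addressed to the destination host and carries
             * the TU packet's IP and TCP/UDP headers as its payload. */
            reinject_to_network(p);
            send_hello(p->dst_addr, p->headers, p->header_len);
        }
    }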
2.5 WebTP's Feedback Scheme

Traditionally, feedback or acknowledgment schemes are used for packet loss detection, which serves as the basis for congestion control and error recovery (also known as reliability control). In the non-native mode, WebTP does not have the error recovery capability; in other words, WebTP does not require the correct receipt of all transmitted packets. For reliable communication, end hosts must use a reliable end-to-end protocol, such as TCP. Thus, WebTP's feedback is used for congestion control only.

WebTP cannot use TCP's cumulative ACK (CACK) because a lost packet prevents the cumulative ACK number from increasing. Instead, it uses a bit-vector-based ACK (BACK) scheme in which each ACK packet contains an ACK number acknowledging the packet that was just received (this is true in most cases; occasionally, the ACK number corresponds to a packet that has not been received, in which case the most significant bit of the ACK vector is 0), together with a vector of bits representing the receiving status of a set of earlier packets.

A pipe is bi-directional, allowing WebTP data packets to travel in both directions. We arbitrarily choose one of the directions and call it the forward direction. Each WebTP data packet is given a pipe-level sequence number, carried in the Packet Number field. After a data packet arrives at the receiving WebTP gateway, WebTP returns an acknowledgment packet, either in the form of a pure ACK packet or in a data packet that travels in the reverse direction of the pipe, called a piggybacked ACK. An ACK packet has SYN = 0 and ACK = 1. A pure ACK does not have any WebTP payload, and a piggybacked ACK has WebTP payload that follows the format of a WebTP data packet. We also call an ACK packet with data a WebTP data packet, depending on the context.

In addition to ACK = 1, an ACK packet contains an ACK number and a 32-bit ACK vector, which is a bit map with each bit indicating whether the corresponding data packet has been received: 1 for received and 0 for not received. The correspondence between bits in the ACK vector and data packets in the forward direction is called the ACK placement. In all ACK placement schemes, the most significant bit corresponds to the data packet whose packet number is the ACK number of the ACK packet. A study of BACK with various ACK placement schemes can be found in [17]. We will not discuss them further in this paper, but simply point out that WebTP uses a hybrid of the Consecutive Placement (CP) and the Exponential Placement (EP), called CEP(16,16). In this scheme, the first 16 bits are placed consecutively; that is, if the ACK number is n, then the ith most significant bit corresponds to data packet n − i + 1, for i = 1, 2, ..., 16. The next 16 bits are placed exponentially; that is, the ith bit corresponds to data packet n − 15 − 2^(i−16), for i = 17, 18, ..., 32. This scheme has the property that it guards against bursts of ACK losses while keeping the average ACK delay low, where the ACK delay is the interval between the arrival of a data packet and the successful receipt of the corresponding ACK. With very few control bits, the scheme is robust against both forward losses and ACK losses. It achieves good performance even when the actual network conditions deviate from the designer's estimate, and it works in networks having orders of magnitude differences in bandwidth and loss probability.
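The bit-to-packet mapping of CEP(16,16) follows directly from the two formulas above; the helper below only spells that mapping out and is not code from the implementation.

    #include <stdint.h>

    /* Packet number covered by the i-th most significant bit (i = 1..32)
     * of an ACK vector whose ACK number is n, under CEP(16,16). */
    static uint32_t cep_packet_for_bit(uint32_t n, int i)
    {
        if (i <= 16)
            return n - (uint32_t)(i - 1);   /* consecutive part: n, n-1, ..., n-15  */
        return n - 15 - (1u << (i - 16));   /* exponential part: n-17, n-19, n-23, ... */
    }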
2.6 WebTP Congestion Control

The so-called integrated congestion control operates at the pipe level. Since all flows within a pipe go through the same network segment represented by the pipe, integrated control is only natural. Because the aggregated traffic from many flows tends to be smoother and longer-lasting, congestion control can be made more effective. Furthermore, differentiated services can be provided to connections by aggregating the bandwidth at the pipe level and redistributing it to the connections according to their QOS requirements.

WebTP uses a window-based congestion control algorithm very similar to TCP's algorithm [10] [2]. The algorithm has the slow start and congestion avoidance phases, each with a meaning similar to its counterpart in TCP. In order to be competitive with other TCP connections, the linear increase speed in the congestion avoidance phase is made proportional to the number of flows in the pipe. Many improvements can be made to the basic window control algorithm (see [13] for some examples). We plan to write another paper to specifically address these improvements.

2.7 WebTP Transmission Scheduling

Packet transmission scheduling for a pipe can be viewed as a single-server scheduling problem, where the server capacity is the available bandwidth of the pipe, determined by the congestion control algorithm. The basic goal is to distribute the available bandwidth of the pipe to the flows so as to achieve some QOS differentiation. Since the pipe's bandwidth is generally time-varying, it is not always possible to provide hard QOS guarantees to flows. However, if the pipe aggregates a large number of connections and/or passes through high-bandwidth networks, the bandwidth variation may be highly regular or even predictable, or may have a pattern from which certain QOS guarantees can be extracted. For instance, if the bandwidth consistently varies between 50 Mbps and 70 Mbps, it is then possible to provide 10 Mbps of bandwidth to a certain flow. Moreover, if the routers on the path of the pipe have AQM capability, so that the queueing delays at the routers are kept small, proper scheduling can make the total delay small for some flows.

Many scheduling algorithms can be used, such as weighted fair queueing, priority-based scheduling, or some combination of them. The scheduler should expose an API that allows flows to specify their QOS requirements and allows some authority to specify a policy for conflict resolution when the set of QOS requirements of all flows cannot be satisfied simultaneously. Currently, we do not have a complete solution for this issue.
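As a concrete reading of the window rules in Section 2.6 (and the credit definition in Section 3.5), the following sketch shows one way the pipe-level congestion window could be adjusted. The constants, the choice of packets as the unit, and the names are illustrative assumptions, not the exact WebTP code.

    /* Illustrative pipe-level window update.  cwnd and ssthresh are in
     * packets; num_flows is the number of flows multiplexed into the pipe. */
    struct pipe_cc {
        double cwnd;
        double ssthresh;
        int    num_flows;
    };

    /* Called once per positively acknowledged packet. */
    void on_positive_ack(struct pipe_cc *cc)
    {
        if (cc->cwnd < cc->ssthresh)
            cc->cwnd += 1.0;                              /* slow start */
        else
            cc->cwnd += (double)cc->num_flows / cc->cwnd; /* linear increase,
                                                             proportional to the
                                                             number of flows */
    }

    /* Called when a loss is detected by timeout or by an ACK-vector gap. */
    void on_loss_detected(struct pipe_cc *cc)
    {
        cc->ssthresh = cc->cwnd / 2.0;
        cc->cwnd     = cc->cwnd / 2.0;                    /* multiplicative decrease */
    }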
3 Implementation

Depending on the capacity of the network, the WebTP protocol can run on a variety of systems, ranging from personal computers to specialized hardware equipment. In this section, we share the experience of our experimental implementation of WebTP on Linux PCs (kernel 2.4.7 or above). We decided to run WebTP in user space as a "transport server". This term betrays a philosophical undertone: the transport is just another network service provided by a server, not so different from other servers, and it does not have to be tied to a single connection or a single process when it is invoked. Having WebTP in user space means ease of implementation and flexibility in experimenting with variations of the protocol and its associated algorithms.

3.1 Overview of Implementation

Our WebTP implementation is divided into a few modules, as shown in figure 9. The arrows in the figure roughly correspond to the paths of control. A WebTP process has three threads: one belongs to the alarm module and is responsible for generating timeout events, one belongs to the event queue module and services the event queue, and one belongs to the input part of the NIO module and waits for incoming packets from the network.

The alarm module implements a potentially large number of outstanding alarms using only one timer provided by the kernel. It generates an event upon each timeout. We should mention that most modules need to set up timeout events; however, for simplicity, we omit the arrows pointing to the alarm module in figure 9.

The NIO module stands for network input and output and is responsible for (i) extracting packets from the network and calling input processing, and (ii) putting packets onto the network when requested. The input part of NIO is currently implemented with Linux's packet-capturing facility, netfilter/iptables, and the output part is implemented using raw IP.

The input demultiplexing module pre-processes the packets captured by the NIO module. It differentiates the different types of packets, delivers legitimate WebTP data and/or ACK packets for further WebTP input processing, and finishes processing other packets, including Hello, Hello-ACK and TU packets. It generates responding packets and sends them to the network when necessary; for example, the handshake of the Hello protocol is carried out in this module. In cases where a TU packet cannot be encapsulated in a WebTP packet, this module re-injects it into the network. Depending on whether the next-hop WebTP gateway exists, a received WebTP data packet is either queued in the buffer of the corresponding flow, waiting to be sent to the next hop, or is released as a TU packet.

In addition to the Hello protocol, WebTP's protocol processing includes WebTP input (also known as pipe input) processing, WebTP output (pipe output) processing, and congestion control. The input module takes a WebTP packet, generates an ACK packet in response to a data packet, and/or invokes ACK processing and congestion control in response to an ACK packet. The scheduler module decides which packet to send out and calls the output processing on the packet. Before sending out a packet, the scheduler must make sure that the congestion control allows it to do so. The output processing formats the correct WebTP header, generates an ACK vector when required, and calls the NIO output to send the packet.
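The alarm module's idea of multiplexing many alarms onto one kernel timer can be sketched as a dedicated thread that walks a list of pending alarms sorted by expiry time. This is only an illustration of the general technique; post_event and all other names are our own assumptions, not the module's actual code.

    #include <sys/time.h>
    #include <unistd.h>

    /* One pending alarm; the list is kept sorted by expiry, earliest first. */
    struct alarm {
        struct timeval expiry;
        void (*fire)(void *arg);
        void *arg;
        struct alarm *next;
    };

    static struct alarm *pending;        /* protected by a mutex in real code */

    void post_event(void (*fn)(void *), void *arg);   /* assumed event-queue call */

    /* Body of the alarm thread: turn every expired alarm into an event for
     * the event-queue thread, then sleep until the next check. */
    void *alarm_thread(void *unused)
    {
        (void)unused;
        for (;;) {
            struct timeval now;
            gettimeofday(&now, NULL);
            while (pending && !timercmp(&pending->expiry, &now, >)) {
                struct alarm *a = pending;
                pending = a->next;
                post_event(a->fire, a->arg);
            }
            usleep(10000);               /* 10 ms polling granularity (illustrative) */
        }
        return NULL;
    }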
Figure 9: Modules of the WebTP implementation

Most modules access a core data structure, collectively called the control blocks. These include the flow and pipe control blocks and the HTG mapping table. Each pipe also has a congestion control block. WebTP's protocol processing requires manipulation of these control blocks. To resolve the concurrency issue surrounding this shared data structure in our multi-threaded implementation, we serialize all access to the control blocks through an event queue. Any function call that requires access to the control blocks is made into an event and added to the event queue. A single thread is used to service the events from the queue one by one. Only this event-service thread has direct access to the control blocks.

3.2 Capturing Packets into the User-Space

We use netfilter/iptables [1], the firewalling subsystem of Linux kernels 2.4.x/2.5.x, for packet capture. We use the user-space tool, iptables, to set up filtering rules so that the kernel captures and queues (i) transit TCP, UDP or WebTP packets that use the current machine as a gateway, and (ii) WebTP packets that are destined for this machine. In the terminology of iptables, the filtering rules are set on the FORWARD chain for case (i) and on the INPUT chain for case (ii). With some complication, we can also set up proper rules to capture locally generated TU packets with non-local destinations, so that these packets can use a local WebTP server. This requires filtering rules on the OUTPUT chain.

Packets that satisfy the filtering rules are initially queued in the kernel. Netfilter has a user-space library, libipq, that allows user-space programs to retrieve a copy of each queued packet and either drop the original copy or release it to the network. Currently, we always drop the original copy of the packet, process the copy in user space, and, when needed, re-inject the captured copy into the network. Some of this inefficiency could be reduced if we implemented some packet pre-processing functions of the input demultiplexing module as a kernel module.
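As an illustration of this capture path, a minimal libipq read loop looks roughly like the following. Error handling is trimmed, webtp_input_demux is an assumed entry point, and the iptables rule mentioned in the comment is one plausible way to populate the kernel queue, not necessarily the exact rules WebTP installs.

    /* Minimal libipq capture loop (sketch).  Packets reach this loop after
     * rules such as "iptables -A FORWARD -p tcp -j QUEUE" send them to the
     * kernel packet queue. */
    #include <libipq.h>
    #include <linux/netfilter.h>
    #include <stddef.h>

    #define BUFSIZE 65536

    void webtp_input_demux(unsigned char *pkt, size_t len);  /* assumed */

    int capture_loop(void)
    {
        unsigned char buf[BUFSIZE];
        struct ipq_handle *h = ipq_create_handle(0, NFPROTO_IPV4);
        if (!h)
            return -1;
        if (ipq_set_mode(h, IPQ_COPY_PACKET, BUFSIZE) < 0)
            goto out;

        for (;;) {
            if (ipq_read(h, buf, BUFSIZE, 0) <= 0)
                break;
            if (ipq_message_type(buf) != IPQM_PACKET)
                continue;
            ipq_packet_msg_t *m = ipq_get_packet(buf);

            /* Hand the copy to WebTP's input demultiplexing, then drop the
             * kernel's original; WebTP re-injects packets itself when needed. */
            webtp_input_demux(m->payload, m->data_len);
            ipq_set_verdict(h, m->packet_id, NF_DROP, 0, NULL);
        }

    out:
        ipq_destroy_handle(h);
        return -1;
    }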
3.3 Raw IP

For sending a packet onto the network, WebTP uses the raw IP facility, which is accessible by opening a socket of the SOCK_RAW type. A captured TU packet eventually needs to be released into the network. In such a case, the packet comes with its own IP header, and we let raw IP send the entire packet, including the IP header. A drawback is that the Linux kernel will not fragment the packet; as a result, the kernel cannot send a packet larger than the maximum transmission unit (MTU) and will generate an error when forced to do so. However, we expect a TU packet to fit within an MTU in typical setups. If we wish to let the kernel fragment a TU packet, we have to discard the IP header and let raw IP rebuild it. However, we then do not know how to preserve the source address, because sendto() uses the IP address of the outgoing interface at the gateway as the source IP address in the packet header.

On the other hand, a WebTP data packet contains a TU packet and may easily be larger than the MTU. Currently, we let the kernel build the IP2 header and fragment the WebTP packet into multiple segments when necessary. An alternative approach is for WebTP to build link-level packets and send them through raw IP; this way, both TU packets and WebTP packets can be fragmented in user space. In situations where none of the above approaches work, WebTP can fragment packets at the WebTP level, producing WebTP fragments, with the help of the fragmentation-related fields in the WebTP packet header.

3.4 HTG Map and Algorithms

Currently, the HTG map is implemented as a hash table whose search key is the host IP address. In addition to the host IP address, each entry of the mapping table has three other fields: a state denoting the availability of the next-hop gateway, the next-hop gateway's IP address, and a timeout counter. The possible states are AVAILABLE, UNAVAILABLE and UNKNOWN. When the next-hop gateway returns a Hello-ACK packet in the process of running the Hello protocol, the corresponding entry is marked AVAILABLE. When the next-hop gateway does not exist, the sender of the Hello packets will eventually time out after making several attempts; at this point, the corresponding HTG entry is marked UNAVAILABLE. Both the AVAILABLE and UNAVAILABLE values need to be timed out in anticipation of events such as IP route changes or WebTP gateway failure or recovery. When an entry times out, it goes into the UNKNOWN state.

Besides Hello packet timeout, there is a second way of putting HTG entries into the UNAVAILABLE state. When a pipe fails to receive ACKs from the foreign gateway for a prolonged period of time while sending data, we assume that the foreign gateway is no longer available. We search the HTG table (by the gateway address) and put all entries that contain the foreign gateway address into the UNAVAILABLE state.

The Hello protocol is initiated by each flow instead of by each destination host address. This means that two or more flows with the same final destination can all be running the protocol simultaneously and discover the same gateway. This design is chosen for simplicity, in spite of its slight inefficiency.
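The HTG entry described above can be pictured as the following record; the field and constant names are ours and the struct is only a sketch of the three fields plus the hash key.

    #include <netinet/in.h>

    /* One entry of the host-to-gateway (HTG) map, keyed by the destination
     * host address (illustrative sketch). */
    enum htg_state { HTG_AVAILABLE, HTG_UNAVAILABLE, HTG_UNKNOWN };

    struct htg_entry {
        struct in_addr host;           /* destination host IP (hash key)      */
        struct in_addr gateway;        /* next-hop WebTP gateway, if known     */
        enum htg_state state;
        unsigned int   timeout_count;  /* incremented by the periodic timeout
                                          routine, reset on use (Section 3.7) */
    };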
3.5 Congestion Control

WebTP's congestion control operates on each pipe and tries to emulate TCP's window control algorithm. In the congestion avoidance phase, a pipe's congestion window size increases by a constant proportion of 1/cwnd, where cwnd is the window size, on every positive acknowledgment, and decreases by half when congestion is detected through the observation of packet losses. The quantity "window size minus the amount of outstanding data" is called the congestion control credit. This is the maximal amount of extra data that can be sent into the network at that time.

The details of WebTP's congestion control differ from TCP's, mainly because a pipe does not retransmit lost packets and, as a result, WebTP uses the bit-vector-based acknowledgment scheme. At the sender, the successful transmission of a packet is recognized explicitly by a received ACK packet that contains a 1 in the bit-vector location corresponding to that packet. Similar to TCP, packet loss detection in WebTP is through either packet timeout or sequence-number gaps in the ACK packets. For the former, we use the timeout scheme of TCP Vegas [7] to mitigate the effect of the timer-granularity constraint. When a data packet is transmitted, a record containing its departure time is put on the outstanding packet queue. Every time an ACK packet arrives, we check the system clock, update the round-trip time (RTT) estimate, visit the outstanding packet queue, and time out packets that have been there for longer than one timeout interval. The timeout interval is derived from the RTT estimate, similarly to TCP; we omit the details here.

For loss detection based on sequence-number gaps, we have implemented a very simple algorithm that resembles TCP's three-duplicate-ACK mechanism. Specifically, suppose an ACK packet for data packet n has just been received. If packet n − k has not been acked by then, where k is greater than an integer K, called the Misordering Threshold, we declare packet n − k lost. When the network does not misorder packets, a gap in the ACK sequence numbers can only be caused by a data packet loss. With K > 0, the algorithm tolerates some amount of packet misordering by the network.

3.6 Scheduler

We now briefly introduce the simple class-based weighted-round-robin scheduler we have implemented. Related details can be found in [5]. The scheduler is not fully general, but it is probably enough for many applications. In our implementation, the scheduler has its own flow and pipe control blocks that contain scheduling-related data and are linked to the corresponding control blocks in the control blocks module in figure 9. This way, we can replace the current scheduler with a new one fairly easily.

Figure 10: WebTP scheduler. W1 and W2 are weights for NBE class 1 and NBE class 2. Flows also have weights or rates associated with them.

As shown in figure 10, each pipe has one best-effort (BE) traffic class and up to n non-best-effort (NBE) classes, where n is a constant. NBE classes always have priority over the best-effort class and are served in weighted-round-robin fashion. Specifically, the weight of each NBE class is converted to a credit, in number of bytes, that is proportional to the weight. The scheduler visits an NBE class, transmits data up to the credit, then moves to the next class. The process repeats after all NBE classes have been visited once. When none of the NBE classes has data to send, the scheduler visits the best-effort class. Note that we do not have priority classes among the NBE classes, but we can approximate a two-level priority scheduler: we pick one NBE class and make it approximately a high-priority class by giving it much more credit than all other classes combined. Each class can have many flows, which are also serviced in weighted-round-robin fashion. A flow's requested weight or requested rate for transmission is also converted into credit. Finally, we use a simple buffer management technique to control the packet queues of the flows: each queue has a threshold beyond which newly arrived packets are dropped.

The scheduler is executed through the event queue thread. In order not to block other activities, such as input processing, the scheduler checks whether the event queue is empty after it transmits a certain amount of data. If the queue is not empty, the scheduler saves its state and adds itself to the end of the event queue; otherwise, it continues to execute. The interaction of the scheduler with the congestion control is mainly through checking the congestion control credit. If the credit is not available, i.e., transmission is blocked by the congestion control, the scheduler moves into the idle mode by getting out of the event queue. When the credit becomes non-zero, the scheduler is put into the event queue again, waiting for execution.
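One class-level weighted-round-robin pass, as described above, could be sketched as follows. The credit scaling factor, the types and all helper names are illustrative assumptions rather than the scheduler's actual interfaces; flow-level weighted round robin inside each class is not shown.

    /* Illustrative types and helpers (not the implementation's actual API). */
    struct sched_class { int weight; /* ... queued flows ... */ };
    struct pipe {
        int num_nbe_classes;
        struct sched_class nbe_class[8];
        /* ... */
    };
    long cc_credit(const struct pipe *p);                /* congestion control credit */
    int  class_has_data(const struct sched_class *c);
    long transmit_next_packet(struct sched_class *c);    /* returns bytes sent */
    void serve_best_effort(struct pipe *p, long credit);

    #define CREDIT_PER_WEIGHT 1500   /* illustrative byte credit per unit of weight */

    /* One weighted-round-robin pass over the NBE classes (Section 3.6). */
    void serve_one_round(struct pipe *p)
    {
        int nbe_sent = 0;

        for (int c = 0; c < p->num_nbe_classes; c++) {
            struct sched_class *cls = &p->nbe_class[c];
            long credit = (long)cls->weight * CREDIT_PER_WEIGHT;

            /* Send from this class until its credit, its queued data, or the
             * pipe's congestion control credit runs out. */
            while (credit > 0 && class_has_data(cls) && cc_credit(p) > 0) {
                long sent = transmit_next_packet(cls);
                if (sent <= 0)
                    break;
                credit -= sent;
                nbe_sent = 1;
            }
        }

        /* The best-effort class is served only when the NBE classes sent nothing. */
        if (!nbe_sent)
            serve_best_effort(p, cc_credit(p));
    }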
3.7 Soft States and Timeout Mechanism

For simplicity, the WebTP protocol does not have flow or pipe termination handshakes. As a result, no implementation can maintain consistency of the key data structures, e.g., the flow and pipe control blocks, across gateways. It is therefore possible to have a flow established at the upstream gateway but not at the downstream gateway, or vice versa. Flows and pipes must be established as needed, and freed when they have been idle for a prescribed amount of time. For this purpose, the alarm module is used to maintain a potentially large number of outstanding timeout events. We also need to time out an HTG entry when the time since its last use exceeds a threshold.

We implement batch timeout for the HTG entries and the flow and pipe control blocks. As an example, we discuss the mechanism used to time out an HTG entry. For an entry in the AVAILABLE state, the timeout counter is reset to zero whenever the corresponding flow sends packets. A periodic timeout routine visits all entries and increments the counter by one on every timeout. When the counter reaches a threshold, say K, the entry is timed out and put in the UNAVAILABLE state. This way, the entry is timed out after it has been idle for roughly KT seconds, where T is the timeout period. Another periodic timeout routine is used to time out the UNAVAILABLE state in a similar fashion.

4 Conclusion and Discussion

Finally, we summarize the desirable properties of the WebTP protocol and discuss possible future work. We emphasize that the WebTP protocol is a platform for providing differentiated QOS that can be gradually incorporated into today's network. For instance, one can start with two WebTP gateways placed at the entry points of two sites of the same corporate intranet and control the bandwidth allocations to different connections between the two sites. WebTP can easily be deployed in today's network thanks to the Hello protocol for automatic discovery of WebTP gateways.

WebTP extends the end-to-end, feedback-based, adaptive congestion control of TCP to segment-by-segment control, and augments it from operating over a single isolated connection to operating over many connections in an integrated fashion. This enhancement makes it more straightforward to apply QOS control at arbitrary locations in the network than TCP can. Compared with layer-3 QOS control frameworks, the control of WebTP is adaptive and does not require knowledge of the network topology or the capacities of the links. Since a WebTP pipe can contain both TCP and UDP flows, WebTP can congestion-control UDP traffic. We have specifically designed a bit-vector-based acknowledgment scheme that can be used for unreliable traffic. The design of WebTP's non-native mode is not complicated, compared with many other major protocols, such as TCP or IP. WebTP's major algorithms are rooted in the well-studied areas of scheduling and congestion control. The WebTP header is extremely simple. However, we also have to add another layer of IP header to the WebTP packet; hence, the header overhead of each packet increases slightly.

Our WebTP implementation on Linux PCs provides an example for other possible implementations. With this implementation, we plan to learn how to apply WebTP to network control problems, study its capabilities and performance, and tune its many algorithms. Residing in user space, our implementation is especially useful and flexible as an experimental platform for the study and design of transport protocols.
Besides improving the built-in algorithms of WebTP, such as scheduling and congestion control, there are many other possible directions for future work related to WebTP. For instance, we can extend the current WebTP architecture to a full transport protocol, including all application-support functions in TCP and UDP plus many additional features, as outlined in the original WebTP proposal [20]. Most notably, one of these new features is fine-grained reliability control at the level of the application data unit (ADU). To extend these transport functions end-to-end, we may either use the native mode of WebTP or add a thin WebTP layer that runs on end hosts and cooperates with the "fat" WebTP gateways. A second direction is to devise an API and a flow classifier for configuring and controlling the scheduler in each WebTP gateway. Related to this, a thorough understanding of where to place WebTP gateways and how to control a collection of schedulers has both practical and theoretical importance, as does an understanding of how the end-to-end control loop interacts with the segment-by-segment control loops. Finally, we need to address the danger caused by the fact that WebTP relies on, and also interferes with, IP routing. Currently, a route change for a connection is not reflected in the HTG map of the WebTP gateways, so there is a possibility of forming routing loops. For instance, a WebTP gateway may continue to forward the connection's packets to the next-hop gateway on the connection's old IP route. After recovering the TU packet inside a WebTP packet, the downstream gateway may forward the TU packet back toward the upstream gateway according to the new IP route. In this way, the packet could circulate between the two gateways indefinitely.

References

[1] Netfilter/iptables web page. http://netfilter.samba.org/.

[2] M. Allman, V. Paxson, and W. Stevens. TCP Congestion Control, RFC 2581. IETF, April 1999.

[3] David Andersen, Deepak Bansal, Dorothy Curtis, Srinivasan Seshan, and Hari Balakrishnan. System Support for Bandwidth Management and Content Adaptation in Internet Applications. In Proceedings of OSDI 2000, 4th Symposium on Operating Systems Design and Implementation, San Diego, CA, October 2000.

[4] H. Balakrishnan, H. S. Rahul, and S. Seshan. An Integrated Congestion Management Architecture for Internet Hosts. In Proc. ACM SIGCOMM '99, Cambridge, MA, 1999.

[5] Yogesh Krishnakant Bhumralkar. Generic scheduler framework for priority queueing and bandwidth sharing. Master's thesis, University of California, Berkeley, 2001.

[6] R. Braden. T/TCP – TCP Extensions for Transactions: Functional Specification, RFC 1644. IETF, July 1994. ftp://ftp.isi.edu/in-notes/rfc1644.txt.

[7] L. Brakmo and L. Peterson. TCP Vegas: End to End Congestion Avoidance on a Global Internet. IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, pages 1465–1480, October 1995.

[8] S. Blake et al. An Architecture for Differentiated Services, RFC 2475. IETF, December 1998.

[9] Jim Gettys and Henrik Frystyk Nielsen. The WebMUX Protocol, Internet Draft. IETF, August 1998.

[10] Van Jacobson. Congestion Avoidance and Control. In Proc. ACM SIGCOMM '88, Stanford, CA, August 1988.

[11] Eddie Kohler, Mark Handley, Sally Floyd, and Jitendra Padhye. Datagram Control Protocol, Internet Draft. IETF, March 2002.

[12] T. V. Lakshman and U. Madhow. The Performance of TCP/IP for Networks with High Bandwidth-Delay Products. IEEE/ACM Transactions on Networking, vol. 5, June 1997.
[13] Jeng Lung. Network adaptive TCP/WebTP congestion control. Master's thesis, University of California, Berkeley, 2000.

[14] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP throughput: A simple model and its empirical validation. In Proceedings of SIGCOMM '98, pages 22–35, 1998.

[15] Venkata N. Padmanabhan. Addressing the Challenges of Web Data Transport. PhD thesis, University of California, Berkeley, 1998.

[16] E. Rosen, A. Viswanathan, and R. Callon. Multiprotocol Label Switching Architecture, RFC 3031. IETF, January 2001.

[17] Hoi-Sheung Wilson So, Ye Xia, and Jean Walrand. A Robust Acknowledgement Scheme for Unreliable Flows. In Proceedings of IEEE Infocom 2002, New York, June 2002.

[18] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream Control Transmission Protocol, RFC 2960. IETF, October 2000.

[19] J. Touch. TCP Control Block Interdependence, RFC 2140. IETF, April 1997. http://info.internet.isi.edu/in-notes/rfc/files/rfc2140.txt.

[20] Y. Xia, H.-S. So, V. Anantharam, S. McCanne, D. Tse, J. Walrand, and P. Varaiya. The WebTP Architecture and Algorithm. Technical Report, Memorandum No. UCB/ERL M00/53, Electronics Research Laboratory, University of California, Berkeley, January 2000.