Low Latency TOE with Double-Queue Structure for 10Gbps Ethernet on FPGA
Figures
- Figure 1. The TOE framework architecture.
- Figure 2. The TOE reception principle.
- Figure 3. Proposed FPGA-based TOE reception transmission model structure. Parsing and Pre: parsing and preprocessing. APP layer: application layer.
- Figure 4. Query interaction and data transmission process between TOE and the application layer.
- Figure 5. APP query interface signal diagram.
- Figure 6. Transmission scheduling strategy diagram.
- Figure 7. Multi-mode update address method structure diagram.
- Figure 8. Multi-mode update length method structure diagram.
- Figure 9. Multi-session priority arbitration method structure diagram.
- Figure 10. Test connection topology diagram.
- Figure 11. TOE received TCP data frame packet capture results.
- Figure 12. Input interface signal format.
- Figure 13. Data storage signals.
- Figure 14. Direct reading of control signals during data reading.
- Figure 15. Indirect reading of control signals during data reading.
- Figure 16. Transmission latency of different hardware approaches for different payload lengths. TOE—DQ: double-queue storage structure TOE. TOE—DDR: single DDR3 storage structure TOE. Sidler—BRAM: BRAM storage structure from [9]. Xie—BRAM: BRAM storage structure from [21]. Sidler—DDR: single DDR3 storage structure from [8].
- Figure 17. Comparison of TOE transmission delay for different packet indexes. TOE—DDR: single DDR3 storage structure TOE. TOE—DQ: double-queue storage structure TOE.
- Figure 18. Comparison of TOE transmission delay between the double-queue structure and the DDR3 structure for different update time intervals t_s. Payload: payload length, plotted on the right vertical axis (blue circles and arrows). TOE—DDR: single DDR3 storage structure TOE, plotted on the left vertical axis (green circles and arrows). TOE—DQ: double-queue storage structure TOE, plotted on the left vertical axis (purple circles and arrows).
- Figure 19. Comparison of TOE transmission delay between the double-queue structure and the DDR3 structure for different release space capacities B_s. Payload: payload length, plotted on the right vertical axis. TOE—DDR: single DDR3 storage structure TOE, plotted on the left vertical axis. TOE—DQ: double-queue storage structure TOE, plotted on the left vertical axis.
- Figure 20. TCP/UDP performance testing tool settings.
- Figure 21. Packet capture results of Wireshark. Frames with different colors represent connection-establishment (chain-building) operations for different TCP sessions; the three packets in each colored box form one three-way handshake.
- Figure 22. Conversation statistics of all TCP sessions captured by Wireshark.
- Figure 23. Slicing results of the multi-session interactive query mechanism. (a) Session index 1 with 50% sufficient probability. (b) Session index 1 with 25% sufficient probability. (c) Session index 2 with 50% sufficient probability. (d) Session index 2 with 25% sufficient probability. (e) Session index 3 with 50% sufficient probability. (f) Session index 3 with 25% sufficient probability.
- Figure 24. The experimental results of TOE reception data rate performance.
Abstract
1. Introduction
2. Related Works
- We analyze the TOE transmission structure for 10-Gigabit Ethernet and build an end-to-end TOE transmission delay model. The correctness of the model is confirmed through both theoretical analysis and experimental verification.
- A double-queue storage structure combining a first-in first-out (FIFO) queue and DDR3 is proposed, which dynamically switches transmission channels and achieves a minimum end-to-end transmission delay of 600 ns for 1024 TCP sessions. We also use a multi-mode address and length update method to achieve consistency in data transmission.
- A non-blocking data transmission method for multi-session server application-layer reception is proposed. A priority-based handshake query and update mechanism is used to obtain the amount of transferable data at the application layer and to achieve efficient slicing and transmission of the stored data.
3. TOE Reception Transmission Delay Theoretical Analysis Model
3.1. TOE Framework Architecture
- Tx engine, which is used to generate new packets to send to the physical (PHY) layer.
- Rx engine, which is used to process incoming packets to send to the application layer.
- TCP session management pool, also called the TCP PCB block (a behavioral sketch of one pool entry follows this list).
- TCP session state manager, which is used to switch and transfer TCP state.
- Tx buffer and Rx buffer, which are used to store data.
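To make the roles of these modules concrete, the following is a minimal behavioral sketch, in software, of one entry of the TCP session management pool and the pool consulted by the Rx/Tx engines. It is not the authors' RTL; the state set and field names (e.g., `rx_read_ptr`, `rx_stored_len`) are illustrative assumptions.

```python
# Minimal behavioral sketch of the TCP session management pool (TCP PCB block).
# Illustrative software only, not the authors' FPGA implementation; field names
# and the state set are assumptions for readability.
from dataclasses import dataclass
from enum import Enum, auto


class TcpState(Enum):
    CLOSED = auto()
    SYN_RECEIVED = auto()
    ESTABLISHED = auto()
    CLOSE_WAIT = auto()


@dataclass
class TcpPcb:
    """One session entry (protocol control block) in the pool."""
    session_index: int
    state: TcpState = TcpState.CLOSED
    remote_ip: str = ""
    remote_port: int = 0
    rx_read_ptr: int = 0      # DDR3 read pointer into the Rx buffer
    rx_stored_len: int = 0    # bytes buffered but not yet delivered to the APP layer


class SessionPool:
    """Fixed-size pool indexed by session index (the design supports 1024 sessions)."""

    def __init__(self, max_sessions: int = 1024) -> None:
        self.entries = [TcpPcb(i) for i in range(max_sessions)]

    def lookup(self, session_index: int) -> TcpPcb:
        return self.entries[session_index]


pool = SessionPool()
pcb = pool.lookup(3)
pcb.state = TcpState.ESTABLISHED  # e.g., updated by the TCP session state manager
```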
3.2. TOE Reception Principle
3.3. Proposed FPGA-Based TOE Reception Transmission Model Structure
3.4. Delay Model Parameterization
- Direct storage latency. This covers the latency of the TCP session state manager module switching the state of the current TCP session, updating the state FIFO, and feeding the payload data into the payload FIFO before a direct query is conducted. Since the direct query state machine opens a query only when both the state FIFO and the payload FIFO are non-empty, the direct storage latency does not vary with quantities such as the payload length and is a constant, expressed by Equation (3):
- Indirect storage latency. The constant part includes the TCP session state manager processing latency, the latency of payload data waiting to be stored, and the latency of updating the finished FIFO after storage completes. The non-constant part comes from the DDR3 write operations. Using the AXI interface protocol in burst mode to control the MIG IP core to write to DDR3 requires write-address, write-data, and write-response operations, which are affected by the payload length and by DDR3's own characteristics. Ideally, when continuously writing payload data of a given bit width (in bits) and length (in bytes), the indirect storage latency can be expressed by Equation (4), where ⌈·⌉ denotes rounding up:
- When the direct query yields X ≤ Y, i.e., the application layer receives data without blocking, the data should be read directly. The read delay then includes the direct query, the notification of the indirect query, reading the payload FIFO, and the interface arbitration delay. Among these, the direct query uses a handshake interaction whose duration mainly depends on the application layer response time, while the remaining parts are pipelined with a fixed delay. Assuming the query request is initiated and waits a number of clock cycles for the application layer to respond with the ack signal, the data read delay is expressed by Equation (5):
- When the direct query yields X > Y, i.e., the application layer is slow to transfer data, the data should be read indirectly. The read delay then includes the indirect query, reading DDR3, and the interface arbitration time. In an indirect query, if the application layer feeds back a transferable data amount of Y = 0, the engine must wait for subsequent rounds of queries until Y becomes non-zero before slicing and reading can start. If the application layer responds with a non-zero Y after a number of clock cycles, and the remaining parts are pipelined with a fixed delay, the data read delay is expressed by Equation (6) (a hedged notational sketch of both cases follows this list):
- X ≤ Y:
- X > Y:
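The inline symbols of Equations (3)–(6) did not survive extraction. The following LaTeX block is only a notational sketch restating the structure described above; every symbol in it (the fixed latencies C_ds, C_is, C_p, C'_p, the data bit width W, the payload length L, the application-layer response times N and N', the DDR3 read time T_DDR3,rd, and the clock period t_clk) is an assumed placeholder, not the authors' notation.

```latex
% Notational sketch only; all symbols are assumed placeholders, in the order of
% Equations (3)-(6) as described in the prose.
% C_{ds}, C_{is}, C_{p}, C'_{p}: fixed state-machine/pipeline latencies,
% W: data bit width (bits), L: payload length (bytes),
% N, N': application-layer response times (clock cycles), t_{clk}: TOE clock period.
\begin{align}
  T_{\mathrm{store}}^{\mathrm{direct}}   &= C_{ds} \\
  T_{\mathrm{store}}^{\mathrm{indirect}} &= C_{is} + \left\lceil \frac{8L}{W} \right\rceil t_{clk} \\
  T_{\mathrm{read}}^{\,X \le Y}          &= N\, t_{clk} + C_{p} \\
  T_{\mathrm{read}}^{\,X > Y}            &= N'\, t_{clk} + T_{\mathrm{DDR3,rd}}(L) + C'_{p}
\end{align}
```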
3.5. Analysis of Factors Affecting TOE Transmission Delay
3.5.1. Data Configuration Parameter Factors
3.5.2. DDR3 Read/Write Characteristic Factor
3.5.3. Application Layer Processing Data Rate Factor
3.5.4. State Machine Processing Fixed Latency Factor
4. TOE Transmission Structure Design Key Factors
4.1. Transmission Scheduling Strategy
- If X ≤ Y and X′ = 0, set the direct length to X and the channel flag to 1.
- If X > Y or X′ ≠ 0, set the direct length to 0 and the channel flag to 2.
- If the channel flag equals 1, the indirect read control module writes the direct length value into the direct report FIFO and returns to the idle state.
- If the channel flag equals 2, the module reads RAM2 to obtain the remaining length X′ and requests an interaction with the application layer to obtain Y. It then takes the smaller of X′ and Y as the slice length and reads the address in RAM3 to obtain the read pointer. Finally, the module reads the payload data stored in DDR3 in segments and transfers them to APP data interface II (a behavioral sketch of this scheduling decision follows this list).
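A compact software model of the channel-selection rule above may help. It is only a sketch of the scheduling decision (X: newly stored payload length, Y: transferable amount reported by the application layer, X′: length still pending in DDR3); function and variable names are chosen for illustration rather than taken from the design.

```python
# Behavioral sketch of the transmission scheduling strategy (not the authors' RTL).
# x: newly stored payload length, y: transferable amount reported by the APP layer,
# x_remaining: remaining length still stored in DDR3 (X' in the text).
def schedule(x: int, y: int, x_remaining: int) -> tuple[int, int]:
    """Return (channel_flag, direct_length)."""
    if x <= y and x_remaining == 0:
        return 1, x   # application layer not blocked: low-latency direct (FIFO) channel
    return 2, 0       # otherwise: indirect (DDR3) channel


def indirect_read_control(channel_flag: int, direct_length: int,
                          x_remaining: int, y: int) -> dict:
    """Model of the indirect read control module's two branches."""
    if channel_flag == 1:
        # Record the directly transmitted length so the DDR3 read pointer
        # can later skip it (direct report FIFO), then return to idle.
        return {"direct_report": direct_length}
    # channel_flag == 2: slice min(X', Y) bytes out of DDR3 for APP data interface II.
    return {"slice_length": min(x_remaining, y)}


# Example: 1460 B just stored, APP can accept 4096 B, nothing pending in DDR3.
flag, dlen = schedule(1460, 4096, 0)
print(flag, dlen)                                  # 1 1460
print(indirect_read_control(flag, dlen, 0, 4096))  # {'direct_report': 1460}
```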
4.2. Data Transmission Consistency
- The first mode is the chain-building (connection establishment) mode: the current TCP connection has just been established, the session index and DDR3 write address assigned by the previous module are obtained, and the module requests that the read pointer be set to that DDR3 write address.
- The second mode is the direct mode, indicating that the DDR3 read pointer needs to skip the data already transmitted through the direct channel. When the status-information FIFO and the direct report FIFO are non-empty, the module obtains the current session index, reads the direct length, and requests that the direct length be added to the read pointer.
- The third mode is the slicing mode, indicating that part of this payload has already been transferred from DDR3; the module obtains the session index and slice length of the current connection and requests that the slice length be added to the read pointer (a sketch of the three modes follows this list).
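The three update modes can be summarized by the following sketch. It is a software illustration of the pointer arithmetic described above; the mode strings and dictionary fields are assumptions, not the authors' interface.

```python
# Illustrative sketch of the multi-mode read pointer update (not the authors' RTL).
def update_read_pointer(mode: str, pcb: dict, *, ddr3_write_addr: int = 0,
                        direct_length: int = 0, slice_length: int = 0) -> dict:
    if mode == "chain_building":
        # New connection: the read pointer starts at the assigned DDR3 write address.
        pcb["read_ptr"] = ddr3_write_addr
    elif mode == "direct":
        # Skip data that was already delivered over the direct (FIFO) channel.
        pcb["read_ptr"] += direct_length
    elif mode == "slicing":
        # Advance past the part of the payload already sliced out of DDR3.
        pcb["read_ptr"] += slice_length
    return pcb


pcb = update_read_pointer("chain_building", {"read_ptr": 0}, ddr3_write_addr=0x4000)
pcb = update_read_pointer("direct", pcb, direct_length=1460)
pcb = update_read_pointer("slicing", pcb, slice_length=512)
print(hex(pcb["read_ptr"]))  # 0x47b4
```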
4.3. Multi-Session Priority Arbitration
5. Experimental Design and Analysis of Results
5.1. TCP Data Transmission Validation Experiment
5.2. TOE Transmission Latency Performance Experiments
- Double-queue storage structure TOE.
- Single DDR3 storage structure TOE, in which the internal logic is modified so that a direct query still selects the DDR3 transfer path.
- BRAM storage structures from the literature [9,21].
- Single DDR3 storage structure from the literature [8].
5.2.1. Application Layer Unblockage Scenario Latency Experiment
5.2.2. Application Layer Blockage Scenario Latency Experiment
5.3. Maximum Number Experiment of TCP Sessions
5.4. Interactive Query Mechanism Experiment
5.5. TOE Reception Performance Experiment
5.6. TOE Resource Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Lim, S.S.; Park, K.H. TPF: TCP Plugged File System for Efficient Data Delivery over TCP. IEEE Trans. Comput. 2007, 56, 459–473.
- Thomas, Y.; Karaliopoulos, M.; Xylomenos, G.; Polyzos, G.C. Low Latency Friendliness for Multipath TCP. IEEE/ACM Trans. Netw. 2020, 28, 248–261.
- DPU White Paper. 2023. Available online: https://www.xdyanbao.com/doc/gct8cww2xv?bd_vid=10958326526835579132 (accessed on 3 January 2023).
- Balanici, M.; Pachnicke, S. Hybrid electro-optical intra-data center networks tailored for different traffic classes. J. Opt. Commun. Netw. 2018, 10, 889–901.
- Jia, W.K. A Scalable Multicast Source Routing Architecture for Data Center Networks. IEEE J. Sel. Areas Commun. 2014, 32, 116–123.
- Kant, K. TCP offload performance for front-end servers. In Proceedings of the 2003 IEEE Global Telecommunications Conference, San Francisco, CA, USA, 1–5 December 2003; pp. 3242–3247.
- Langenbach, U.; Berthe, A.; Traskov, B.; Weide, S.; Hofmann, K.; Gregorius, P. A 10 GbE TCP/IP hardware stack as part of a protocol acceleration platform. In Proceedings of the 2013 IEEE Third International Conference on Consumer Electronics (ICCE-Berlin), Berlin, Germany, 9–11 September 2013; pp. 381–384.
- Sidler, D.; Alonso, G.; Blott, M.; Karras, K.; Vissers, K.; Carley, R. Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware. In Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, Vancouver, BC, Canada, 2–6 May 2015; pp. 36–43.
- Sidler, D.; Istvan, Z.; Alonso, G. Low-latency TCP/IP stack for data center applications. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications, Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–4.
- Ruiz, M.; Sidler, D.; Sutter, G.; Alonso, G.; López-Buedo, S. Limago: An FPGA-Based Open-Source 100 GbE TCP/IP Stack. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 286–292.
- Wang, W.; Zheng, J.S. Design and implementation of FPGA-based TCP/IP network communication system. Mod. Electron. Technol. 2018, 41, 5–9.
- Yu, H.S.; Deng, H.W.; Wu, C. Design of multi-channel acquisition and TCP/IP transmission system based on FPGA. Telecom Power Technol. 2019, 36, 25–27.
- Wu, H.; Liu, Y.Q. FPGA-based TCP/IP protocol processing architecture for 10 Gigabit Ethernet. Electron. Des. Eng. 2020, 28, 81–87.
- Yang, Y.; Zhou, S.Y.; Wang, S.P. Design of TCP/IP protocol offload engine based on FPGA. Pract. Electron. 2023, 31, 48–53.
- Intilop. 2023. Available online: http://www.intilop.com/tcpipengines.php/ (accessed on 20 January 2023).
- Dini Group. 2023. Available online: http://www.dinigroup.com/new/TOE.php/ (accessed on 20 January 2023).
- PLDA. 2023. Available online: https://www.plda.com/products/fpga-ip/xilinx/fpga-ip-tcpip/quicktcp-xilinx/ (accessed on 3 January 2023).
- Ding, L.; Kang, P.; Yin, W.B.; Wang, L.L. Hardware TCP Offload Engine based on 10-Gbps Ethernet for low-latency network communication. In Proceedings of the 2016 International Conference on Field-Programmable Technology (FPT), Xi'an, China, 7–9 December 2016; pp. 269–272.
- Xiong, X.J.; Tan, L.B.; Zhang, J.J.; Chen, T.Y.; Song, Y.X. FPGA-based implementation of low-latency TCP protocol stack. Electron. Des. Eng. 2020, 43, 43–48.
- Xilinx. Virtex 7 FPGA VC709. 2023. Available online: https://china.xilinx.com/products/boards-and-kits/dk-v7-vc709-g.html (accessed on 22 April 2023).
- Xie, J.; Yin, W.; Wang, L. Achieving Flexible, Low-Latency and 100 Gbps Line-rate Load Balancing over Ethernet on FPGA. In Proceedings of the 2020 IEEE 33rd International System-on-Chip Conference (SOCC), Las Vegas, NV, USA, 8–11 September 2020; pp. 201–206.
- Kumar, M.; Gavrilovska, A. TCP Ordo: The cost of ordered processing in TCP servers. In Proceedings of the IEEE INFOCOM 2016—The 35th Annual IEEE International Conference on Computer Communications, San Francisco, CA, USA, 10–14 April 2016; pp. 1–9.
| System Parameter | Category | Value | Unit |
|---|---|---|---|
| Data clock frequency | MAC | 156.25 | MHz |
| | TOE | 200 | MHz |
| | APP | 200 | MHz |
| Data bit width | MAC | 64 | bits |
| | TOE | 512 | bits |
| | APP | 128 | bits |
| IP address | Host A | 192.168.116.20 | / |
| | TOE | 192.168.116.1 | / |
| Port number | Host A | 30604 | / |
| | TOE | 10000 | / |
Resource | Utilization | Available | Utilization% |
---|---|---|---|
LUT | 51,591 | 433,200 | 11.91 |
FF | 69,031 | 866,400 | 7.97 |
BRAM | 363 | 1470 | 24.69 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).