Abstract
Utility Computing Data Centres (UCDCs), large-scale computing systems that offer computational services to concurrently running jobs through virtual servers, are an important part of the computing landscape. As UCDC systems grow in size and the mean time between failures decreases, it becomes increasingly important to tolerate failures quickly and dynamically, while distributing the effects of each failure amongst the virtual servers according to their service level agreements. We propose and evaluate a strategy for offering predictable service in fat-trees experiencing faults by reprioritising packets. The strategy distributes the effect of network faults so as to satisfy a number of quality-of-service demands. Which demands to favour depends on the computer system and the characteristics of the jobs it is running; in the presence of a moderate number of faults, the demands can be met to some degree.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sem-Jacobsen, F.O., Skeie, T. (2008). Maintaining Quality of Service with Dynamic Fault Tolerance in Fat-Trees. In: Sadayappan, P., Parashar, M., Badrinath, R., Prasanna, V.K. (eds) High Performance Computing - HiPC 2008. HiPC 2008. Lecture Notes in Computer Science, vol 5374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89894-8_40
DOI: https://doi.org/10.1007/978-3-540-89894-8_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89893-1
Online ISBN: 978-3-540-89894-8
eBook Packages: Computer Science, Computer Science (R0)