
Features correlation-based workflows for high-performance computing systems diagnosis

WRAP_Theses_Chuah_2020.pdf - Submitted Version (PDF, 5MB)

Abstract

Analysing failures is important for improving the reliability of high-performance computing (HPC) systems and data centres. System logs are the primary source of information for diagnosing system failures, but it is widely known that system logs alone are insufficient for finding the cause of a failure. Resource utilisation data, which has recently been made available, is another potentially useful source of information for failure analysis. However, large HPC systems generate a lot of data, and processing this volume of data presents a significant challenge for online failure diagnosis. Most work on failure diagnosis has studied only errors that lead to system failures; little work has studied, on real data, errors that lead to either a system failure or a recovery.

In this thesis, we design, implement and evaluate two failure diagnostics frameworks, CORRMEXT and EXERMEST. We implement four modules: Data Type Extraction, Feature Extraction, Correlation and Time-bin Extraction. CORRMEXT integrates the Data Type Extraction, Correlation and Time-bin Extraction modules. It identifies error cases that occur frequently and reports the success and failure of error recovery protocols. EXERMEST integrates the Feature Extraction and Correlation modules. It extracts significant errors and resource use counters and identifies error cases that are rare. We apply the diagnostics frameworks to the resource use data and system logs of three HPC systems operated by the Texas Advanced Computing Center (TACC). Our results show that: (i) multiple correlation methods are required to identify more dates with groups of correlated resource use counters and groups of correlated errors, (ii) the earliest hour of change in system behaviour can only be identified by using the correlated resource use counters and correlated errors, (iii) multiple feature extraction methods are required to identify the rare error cases, and (iv) time-bins of multiple granularities are necessary to identify the rare error cases. CORRMEXT and EXERMEST are available in the public domain to support system administrators in failure diagnosis.
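As a rough illustration of the ideas named in the abstract (time-bins of multiple granularities, and correlation between error occurrences and resource use), the sketch below bins two event streams at several bin widths and computes a Pearson correlation per width. It is not the CORRMEXT or EXERMEST implementation: the function names, the sample timestamps, the choice of Pearson correlation, and the convention of returning 0.0 for a constant series are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def bin_counts(timestamps, bin_seconds, start, end):
    """Count events per fixed-width time-bin over the window [start, end)."""
    n_bins = (end - start + bin_seconds - 1) // bin_seconds
    counts = Counter(
        (t - start) // bin_seconds for t in timestamps if start <= t < end
    )
    return [counts.get(i, 0) for i in range(n_bins)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series.

    Returns 0.0 when either series is constant (an assumed convention,
    chosen here to avoid dividing by zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlate_at_granularities(a_times, b_times, granularities, start, end):
    """Correlate two event streams at each time-bin width (in seconds)."""
    return {
        g: pearson(
            bin_counts(a_times, g, start, end),
            bin_counts(b_times, g, start, end),
        )
        for g in granularities
    }


# Hypothetical data: seconds since the start of a three-hour window.
error_times = [30, 45, 50, 3700, 3720, 7300]        # logged error events
counter_spikes = [20, 40, 55, 3650, 3710, 7250]     # resource-counter spikes

result = correlate_at_granularities(
    error_times, counter_spikes, granularities=[600, 3600], start=0, end=10800
)
for width, r in result.items():
    print(f"bin width {width}s: r = {r:.3f}")
```

Comparing the coefficients across bin widths gives a simple way to see whether a relationship between two streams is visible only at a particular granularity.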

Item Type: Thesis (PhD)
Subjects: Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software
Library of Congress Subject Headings (LCSH): Computer system failures, High performance computing, Data centers -- Reliability, Supercomputers -- Reliability
Official Date: June 2020
Dates: June 2020 (Event: UNSPECIFIED)
Institution: University of Warwick
Theses Department: Department of Computer Science
Thesis Type: PhD
Publication Status: Unpublished
Supervisor(s)/Advisor: Jhumka, Arshad
Sponsors: Alan Turing Institute ; University of Warwick ; National Science Foundation (U.S.)
Format of File: pdf
Extent: xxii, 180 leaves : illustrations (some colour)
Language: eng
Persistent URL: https://wrap.warwick.ac.uk/147261/
