For the implementation of data-intensive C++ applications on cache-coherent Non-Uniform Memory Access (NUMA) systems, both massive parallelism and data locality have to be considered. While massive parallelism is largely understood, the shared-memory paradigm is still deeply entrenched in the mindset of many C++ software developers, so the data-locality aspects of NUMA systems have been widely neglected thus far. At first sight, applying shared-nothing approaches might seem like a viable workaround to address locality. However, we argue that developers should be enabled to address locality without having to surrender the advantages of the shared address space of cache-coherent NUMA systems. Based on an extensive review of parallel programming languages and frameworks, we propose a programming model specialized for NUMA-aware C++ development that incorporates essential mechanisms for parallelism and data locality. We suggest that these mechanisms be used to implement specialized data structures and algorithm templates that encapsulate locality, data distribution, and implicit data parallelism. We present an implementation of the proposed programming model in the form of a C++ framework. To demonstrate the applicability of our programming model, we implement a prototypical application on top of this framework and evaluate its performance.
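To make concrete what such locality-encapsulating data structures and algorithm templates would hide from application code, here is a hedged sketch built directly on libnuma; the container and loop names below are invented for illustration and are not the framework's API. Data is allocated per NUMA node, and the loop pins one worker to each node so that computation stays close to its data.

```cpp
// Sketch only: the kind of per-node placement a NUMA-aware container could
// encapsulate. Uses libnuma (compile with -lnuma); error handling is trimmed.
#include <numa.h>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical node-partitioned array: each NUMA node holds one contiguous chunk.
struct NodePartitionedArray {
    std::vector<double*> chunks;   // one allocation per NUMA node
    std::size_t chunk_size;

    explicit NodePartitionedArray(std::size_t per_node_elems)
        : chunks(numa_num_configured_nodes()), chunk_size(per_node_elems) {
        for (int node = 0; node < (int)chunks.size(); ++node)
            chunks[node] = static_cast<double*>(
                numa_alloc_onnode(per_node_elems * sizeof(double), node));
    }
    ~NodePartitionedArray() {
        for (double* p : chunks)
            if (p) numa_free(p, chunk_size * sizeof(double));
    }
};

// Hypothetical locality-aware loop template: one worker per node, pinned to
// that node, touching only the node-local chunk.
template <typename F>
void for_each_node_local(NodePartitionedArray& a, F f) {
    std::vector<std::thread> workers;
    for (int node = 0; node < (int)a.chunks.size(); ++node) {
        workers.emplace_back([&, node] {
            numa_run_on_node(node);                  // keep computation near the data
            for (std::size_t i = 0; i < a.chunk_size; ++i)
                f(a.chunks[node][i]);
        });
    }
    for (auto& t : workers) t.join();
}

int main() {
    if (numa_available() < 0) return 1;              // libnuma requires this check
    NodePartitionedArray arr(1 << 20);               // ~1M doubles per NUMA node
    for_each_node_local(arr, [](double& x) { x = 0.0; });
    return 0;
}
```

In the spirit of the abstract, the point of the sketch is that placement and pinning logic can live inside reusable containers and loop templates rather than being scattered across application code.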
Energy efficiency has become a crucial aspect in the domain of High Performance Computing, since running costs for electricity often exceed the initial acquisition costs. As a consequence, low-power System-on-a-Chip (SoC) designs are drawing much attention from the HPC community. Driven by the demand for high performance and long battery life in mobile consumer devices, all building blocks of SoCs are undergoing drastic improvements. In addition to the end-user availability of SoCs based on the ARMv8-A instruction set architecture, heterogeneous aspects ranging from the big.LITTLE paradigm to compute-capable GPUs are gaining popularity. Focusing on the heterogeneous nature of SoCs, we investigate both performance and energy consumption of today's state-of-the-art SoCs for heterogeneous workloads using the Rodinia benchmark suite. Based on the results, we anticipate the potential of forthcoming SoC designs in the HPC domain.
This publication gives new results in applying theoretical knowledge based on the Laplace-Stieltjes transform. The main purpose is to predict packet transmission delay in networks based on Internet technology. A new method for modeling the real-time networking process is designed.
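For context, these are the standard definitions this kind of delay analysis builds on; they are not reproduced from the publication itself. The Laplace-Stieltjes transform of a delay distribution F(t) and the moment relation are

\tilde{F}(s) = \int_{0}^{\infty} e^{-st}\, dF(t), \qquad \mathbb{E}[T^{n}] = (-1)^{n}\, \tilde{F}^{(n)}(0),

and for an end-to-end delay composed of independent per-hop delays the transforms combine multiplicatively, \tilde{F}_{\mathrm{e2e}}(s) = \prod_{i} \tilde{F}_{i}(s), which is what makes per-hop models composable into a network-wide delay prediction.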
Due to the increasing heterogeneity of parallel and distributed systems, the coordination of data (placement) and tasks (scheduling) becomes increasingly complex. Many traditional solutions do not take into account the details of modern system topologies and consequently suffer unacceptable performance penalties on modern hierarchical interconnect technologies and memory architectures. Others offload the coordination of tasks and data to the programmer by requiring explicit information about thread and data creation and placement. While allowing full control of the system, explicit coordination severely decreases programming productivity and prevents implementing best practices in a reusable layer. In this paper, we introduce Claud, a locality-preserving, latency-aware hierarchical object space. Claud is based on the understanding that productivity-oriented programmers prefer simple programming constructs for data access (like key-value stores) and task coordination (like parallel loops). Instead of providing explicit facilities for coordination, our approach places and moves data and tasks implicitly, based on a detailed topology model of the system and relying on best performance practices such as hierarchical task queues, concurrent data structures, and similarity-based placement.
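As a purely illustrative sketch of similarity-based placement on a topology model (the types and function below are invented here and are not Claud's API), a key can be hashed level by level down a machine/socket/core hierarchy, so that placement stays implicit from the programmer's point of view while related keys tend to land on nearby targets.

```cpp
// Illustrative only: map a key onto a hierarchical topology by consuming hash
// bits one level at a time. Not Claud's implementation.
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct TopologyNode {
    std::string name;                      // e.g. "machine0", "socket1", "core3"
    std::vector<TopologyNode> children;    // empty for leaves (placement targets)
};

// Walk the topology top-down, picking one child per level from the key's hash.
const TopologyNode& place(const TopologyNode& root, const std::string& key) {
    const TopologyNode* node = &root;
    std::size_t h = std::hash<std::string>{}(key);
    while (!node->children.empty()) {
        node = &node->children[h % node->children.size()];
        h /= node->children.size();        // consume hash bits level by level
    }
    return *node;
}
```

A real runtime would derive the topology from the hardware and rebalance under load, but the sketch shows how a key-value interface can keep placement decisions entirely inside the runtime.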
Parallel workloads using compute resources such as GPUs and accelerators are a rapidly developing trend in the field of high performance computing. At the same time, virtualization is a generally accepted solution for sharing compute resources with remote users in a secure and isolated way. However, accessing compute resources from inside virtualized environments still poses a huge problem without any generally accepted and vendor-independent solution. This work presents a brief experimental evaluation of dOpenCL as an approach to solve this problem. dOpenCL extends OpenCL for distributed computing by forwarding OpenCL calls to remote compute nodes. We evaluate the dOpenCL implementation for accessing local GPU resources from inside virtual machines, thus eliminating the need for any specialized or proprietary GPU virtualization software. Our measurements reveal that the overhead of using dOpenCL from inside a VM, compared to utilizing OpenCL directly on the host, is less than 10% for average-sized and large data sets. For very small data sets, it may even provide a performance benefit. Furthermore, dOpenCL greatly simplifies distributed programming compared to, e.g., MPI-based approaches, as it only requires a single programming paradigm and is mostly binary compatible with plain OpenCL implementations.
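Because dOpenCL is reported to be mostly binary compatible with plain OpenCL, ordinary host code is meant to run unchanged, with forwarded remote devices simply showing up during device enumeration. The snippet below is generic OpenCL host code, not dOpenCL-specific API; it only illustrates where such forwarded devices would become visible.

```cpp
// Plain OpenCL 1.x host code: enumerate platforms and devices.
// Under a forwarding setup such as dOpenCL, remote GPUs are expected to
// appear in this very enumeration without any code changes.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    std::vector<cl_platform_id> platforms(num_platforms);
    clGetPlatformIDs(num_platforms, platforms.data(), nullptr);

    for (cl_platform_id p : platforms) {
        cl_uint num_devices = 0;
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        std::vector<cl_device_id> devices(num_devices);
        clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, num_devices, devices.data(), nullptr);

        for (cl_device_id d : devices) {
            char name[256] = {0};
            clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("device: %s\n", name);
        }
    }
    return 0;
}
```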
Certain workloads such as in-memory databases are inherently hard to scale out and rely on cache-coherent scale-up non-uniform memory access (NUMA) systems to keep up with the ever-increasing demand for compute resources. However, many parallel programming frameworks such as OpenMP do not make efficient use of large scale-up NUMA systems, as they do not consider data locality sufficiently. In this work, we present PGASUS, a C++ framework for NUMA-aware application development that provides integrated facilities for NUMA-aware task parallelism and data placement. The framework is based on an extensive review of parallel programming languages and frameworks in order to incorporate the best practices of the field. In a comprehensive evaluation, we demonstrate that PGASUS provides both average and peak performance improvements across a wide range of workloads.
To ease the development of FPGA-based accelerator function units for software engineers, the OpenPOWER Accelerator Work Group has recently introduced the CAPI Storage, Network, and Analytics Programming (SNAP) framework. However, we found that software engineers are still overwhelmed by many aspects of this novel hardware development framework. This paper provides background and instructions for mastering the first steps of hardware development using the CAPI SNAP framework. The insights reported in this paper are based on the experiences of software engineering students with little to no prior knowledge of hardware development.
Modern distributed systems have reached a level of complexity where software bugs and hardware failures are no longer exceptional, but a permanent operational threat. This holds especially for cloud infrastructures, which need to deliver resources to their customers under well-defined service-level agreements. Dependability therefore needs to be assessed carefully. This article presents a structured approach for dependability stress testing in a cloud infrastructure. We automatically determine and inject the maximum amount of simultaneous non-fatal errors in different variations. This puts the existing resiliency mechanisms under heavy load, so that their effectiveness is tested in corner cases. The starting point is a failure space dependability model of the system. It includes the notion of fault tolerance dependencies, which encode fault-triggering relations between different software layers. From the model, our deterministic algorithm automatically derives fault injection campaigns that maximize dependability stress. The article demonstrates the feasibility of the approach with an assessment of a fault-tolerant OpenStack cloud infrastructure deployment.
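To illustrate the underlying idea in strongly simplified form (a greedy sketch, not the deterministic algorithm from the article): given the failure-causing fault combinations extracted from a failure space model, a large non-fatal faultload can be grown by adding faults as long as no failure-causing combination is completed.

```cpp
// Greedy sketch: build a large set of simultaneous faults that does not fully
// contain any failure-causing combination (e.g. a minimal cut set).
// Simplified illustration only, not the article's deterministic algorithm.
#include <set>
#include <string>
#include <vector>

using Fault = std::string;
using FaultSet = std::set<Fault>;

static bool containsAll(const FaultSet& selected, const FaultSet& combo) {
    for (const Fault& f : combo)
        if (!selected.count(f)) return false;
    return true;
}

// Add candidate faults one by one as long as no fatal combination is completed.
FaultSet maxNonFatalLoad(const std::vector<Fault>& candidates,
                         const std::vector<FaultSet>& fatalCombos) {
    FaultSet selected;
    for (const Fault& f : candidates) {
        selected.insert(f);
        for (const FaultSet& combo : fatalCombos) {
            if (containsAll(selected, combo)) {   // would cause a system failure
                selected.erase(f);                // back this fault out again
                break;
            }
        }
    }
    return selected;
}
```

Note that the greedy result is maximal only with respect to the chosen candidate order; deriving different variations of such maximal campaigns is exactly what the article's algorithm automates.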
The ever-growing demand for computing resources has reached a wide range of application domains. Even though the ubiquitous availability of cloud-based GPU instances provides an abundance of computing resources, the programmatic complexity of utilizing heterogeneous hardware in a scale-out scenario is not yet addressed sufficiently. We address this issue by introducing the CloudCL framework, which enables developers to focus their implementation efforts on compute kernels without having to consider inter-node communication. Using CloudCL, developers can access the resources of an entire cluster as if they were local resources. The framework also facilitates the development of cloud-native application behavior by supporting the dynamic addition and removal of resources at runtime. The combination of a straightforward job design and the corresponding job scheduling framework ensures that cluster resources are used efficiently and fairly. In an extensive performance evaluation, we demonstrate that the framework provides close-to-linear scale-out capabilities in multi-node deployment scenarios.
The ideas and findings in this report should not be construed as an official DoD position. It is published in the interest of scientific and technical information exchange.
The Future SOC Lab at HPI is a cooperation between the Hasso Plattner Institute and various industry partners. Its mission is to enable and promote exchange between the research community and industry. The Lab provides interested researchers with an infrastructure of state-of-the-art hardware and software, free of charge, for research purposes. This includes technologies that are in part not yet available on the market and that would typically be unaffordable in an ordinary university environment, for example servers with up to 64 cores and 2 TB of main memory. These offerings are aimed in particular at researchers in computer science and business information systems. Focus areas include cloud computing, parallelization, and in-memory technologies. This technical report presents the results of the research projects of the year 2015. Selected projects presented their results on April 15, 2015 and November 4, 2015 as part of the Future SOC Lab ...
A justifiably trustworthy provisioning of cloud services can only be ensured if reliability, availability, and other dependability attributes are assessed accordingly. We present a structured approach for deriving fault injection campaigns from a failure space model of the system. Fault injection experiments are selected based on criteria of coverage, efficiency, and maximality of the faultload. The resulting campaign is enacted automatically and shows the performance impact of the tested worst-case non-failure scenarios. We demonstrate the feasibility of our approach with a fault-tolerant deployment of an OpenStack cloud infrastructure.
Cloud-based exchange of sensitive data demands the enforcement of fine-grained and flexible access rights that can be time-bounded and revoked at any time. In a setting that does not rely on trusted computing bases on the client side, these access control features require a trusted authorization service that mediates access control decisions. Using threshold cryptography, we present an implementation scheme for a distributed authorization service which improves reliability over a single service instance and limits the power and responsibility of single authorization service nodes.
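As background on the kind of threshold primitive such a distributed service can be built on (generic (k, n) secret sharing, not necessarily the exact scheme used in the paper): a secret s is embedded in a random polynomial over a prime field,

f(x) = s + a_{1} x + \dots + a_{k-1} x^{k-1} \pmod{p}, \qquad \mathrm{share}_{i} = (i, f(i)), \quad i = 1, \dots, n,

and any k nodes can jointly recover s = f(0) by Lagrange interpolation,

s = \sum_{j \in S} f(j) \prod_{m \in S,\, m \neq j} \frac{-m}{j - m} \pmod{p}, \qquad |S| = k,

while any coalition of fewer than k nodes learns nothing about s. This threshold property is what limits the power and responsibility of individual authorization service nodes.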
GPU compute devices have become very popular for general-purpose computations. However, the SIMD-like hardware of graphics processors is currently not well suited for irregular workloads, such as searching unbalanced trees. In order to mitigate this drawback, NVIDIA introduced an extension to GPU programming models called dynamic parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing the refinement of subsequent work items based on intermediate results without any involvement of the main CPU. This work investigates methods for employing dynamic parallelism with the goal of improved workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the n-queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh the high management overhead and runtime limitations caused by the very fine granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of dynamic parallelism.
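To make the irregularity concrete, here is the classic CPU backtracking formulation of n-queens (independent of the paper's GPU variants): subtrees rooted at different placements are pruned at very different depths, which is exactly what makes static work distribution on SIMD-like hardware difficult.

```cpp
// Standard bitmask backtracking count of n-queens solutions. Subtree sizes
// vary wildly between branches, illustrating the unbalanced search tree.
#include <cstdio>

// cols/diag1/diag2 are bitmasks of attacked columns and diagonals.
static long long solve(int n, int row, unsigned cols, unsigned diag1, unsigned diag2) {
    if (row == n) return 1;                               // queen placed in every row
    long long count = 0;
    unsigned free_cells = ~(cols | diag1 | diag2) & ((1u << n) - 1);
    while (free_cells) {
        unsigned bit = free_cells & (0u - free_cells);    // lowest free column
        free_cells -= bit;
        count += solve(n, row + 1, cols | bit,
                       (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

int main() {
    int n = 10;
    std::printf("%d-queens solutions: %lld\n", n, solve(n, 0, 0, 0, 0));
    return 0;
}
```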
With memory-centric architectures appearing on the horizon as potential candidates for future computer architectures, we propose that the tuple space paradigm is well suited to managing the large shared memory pools that are a central concept of these new architectures. We support this hypothesis by presenting MemSpaces, an implementation of the tuple space paradigm based on POSIX shared memory objects. To demonstrate both the efficacy and the efficiency of the approach, we provide a performance evaluation that compares MemSpaces to message-based implementations of the tuple space paradigm. Due to the lack of commercially available hardware of this kind, we perform the evaluation inside an emulated environment that mimics the general characteristics of memory-centric architectures. For many operations, MemSpaces performs an order of magnitude faster than state-of-the-art implementations.
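As a hedged illustration of the building block named in the abstract (plain POSIX shared memory usage, not MemSpaces' internal data layout): a named shared memory object is created with shm_open, sized with ftruncate, and mapped with mmap, after which every process mapping the same name sees the same bytes.

```cpp
// Minimal POSIX shared memory example (older glibc may need -lrt).
// Error handling is trimmed to keep the sketch short.
#include <fcntl.h>      // shm_open, O_* flags
#include <sys/mman.h>   // mmap, munmap, shm_unlink
#include <unistd.h>     // ftruncate, close
#include <cstdio>
#include <cstring>

int main() {
    const char* name = "/memspaces_demo";          // name chosen for this example
    const size_t size = 4096;

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { std::perror("shm_open"); return 1; }
    (void)ftruncate(fd, size);                     // set the object's size

    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Any other process that shm_open()s and mmap()s "/memspaces_demo"
    // observes this write -- the basis for exchanging tuples through memory.
    std::strcpy(static_cast<char*>(mem), "hello through shared memory");
    std::printf("%s\n", static_cast<char*>(mem));

    munmap(mem, size);
    close(fd);
    shm_unlink(name);                              // remove the named object
    return 0;
}
```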
Besides the design and implementation of components, software engineering for component-based systems has to deal with component integration issues whose impact is not restricted to separate components but rather affects the system as a whole. The bigger the software system, the more difficult these issues are to deal with. Aspect-oriented programming (AOP) addresses such cross-cutting, multi-component concerns. AOP describes system properties and component interactions in terms of so-called aspects. Often, aspects express non-functional component properties, such as resource usage (CPU, memory, network bandwidth), component and object (co-)locations, fault tolerance, timing behavior, or security settings. Typically, these properties do not manifest in the components' functional interfaces. Aspects often constrain the design space for a given software system. System designers have to trade off multiple, possibly contradicting aspects affecting a set of components (e.g., the fault-toleran...
Traditional encryption schemes can effectively ensure the confidentiality of sensitive data stored on cloud infrastructures. Unfortunately, by design they also prevent most operations on the data, such as search. As a solution, searchable encryption schemes have been proposed that provide keyword-search capability on encrypted content. In this paper, we evaluate the practical usability of searchable encryption schemes and analyze the trade-off between performance, functionality, and security. We present a prototypical implementation of such a scheme embedded in a document-oriented database, report on performance benchmarks under realistic conditions, and analyze the threats to data confidentiality and corresponding countermeasures.
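One common construction behind such schemes (generic background, not necessarily the scheme evaluated in the paper) is a token-based index: the client derives a search token for each keyword with a secret-keyed pseudorandom function F_K,

t_{w} = F_{K}(w), \qquad \mathrm{Index} = \{ (t_{w}, \mathrm{id}(d)) : w \in d \}, \qquad \mathrm{query}(w) = F_{K}(w),

so the server can match query tokens against the index and return the matching encrypted documents without ever seeing the keywords in the clear. The deterministic tokens are also the source of the leakage (search and access patterns) that any security analysis of such a scheme has to account for.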
The "HPI Future SOC Lab" is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners. The... more
The "HPI Future SOC Lab" is a cooperation of the Hasso Plattner Institute (HPI) and industry partners. Its mission is to enable and promote exchange and interaction between the research community and the industry partners. The HPI Future SOC Lab provides researchers with free of charge access to a complete infrastructure of state of the art hard and software. This infrastructure includes components, which might be too expensive for an ordinary research environment, such as servers with up to 64 cores and 2 TB main memory. The offerings address researchers particularly from but not limited to the areas of computer science and business information systems. Main areas of research include cloud computing, parallelization, and In-Memory technologies. This technical report presents results of research projects executed in 2017. Selected projects have presented their results on April 25th and November 15th 2017 at the Future SOC Lab Day events.
The recent restructuring of the electricity grid (i.e., the smart grid) introduces a number of challenges for today's large-scale computing systems. To operate reliably and efficiently, computing systems must not only adhere to technical limits (e.g., thermal constraints) but also reduce operating costs, for example by increasing their energy efficiency. Efforts to improve energy efficiency, however, are often hampered by inflexible software components that hardly adapt to underlying hardware characteristics. In this paper, we propose an approach to bridge the gap between inflexible software and heterogeneous hardware architectures. Our proposal introduces adaptive software components that dynamically adapt to heterogeneous processing units (i.e., accelerators) at runtime to improve the energy efficiency of computing systems.
Detection of the QRS complex is a long-standing topic in the context of electrocardiography, and many algorithms build upon knowledge of the QRS positions. Although the first solutions to this problem were proposed in the 1970s and 1980s, there is still potential for improvement. Advancements in neural network technology made in recent years have also led to the emergence of enhanced QRS detectors based on artificial neural networks. In this work, we propose a method for assessing the certainty of each detected QRS complex, i.e. how confident the QRS detector is that there is, in fact, a QRS complex at the position where it was detected. We further show how this metric can be utilised to distinguish correctly detected QRS complexes from false detections.
In contrast to applications relying on specialized and expensive highly-available infrastructure, the basic approach of microservice architectures to achieve fault tolerance – and finally high availability – is to modularize the software system into small, self-contained services that are connected via implementation-independent interfaces. Microservices and all their dependencies are deployed into self-contained environments called containers, which are executed as multiple redundant instances. If a service fails, other instances will often still work and can take over. Due to the possibility of failing infrastructure, these services have to be deployed on several physical systems. This horizontal scaling of redundant service instances can also be used for load balancing. Decoupling the service communication using asynchronous message queues can increase fault tolerance, too. The Deutsche Bahn AG (German railway company) uses a system called EPA for seat reservations for inter-urban rail serv...
In an age of ever-growing data volumes, lossless data compression is unarguably one of the most relevant techniques for handling vast data sets. To facilitate high-throughput compression, modern IBM POWER CPUs provide hardware acceleration for the proprietary 842 compression algorithm. The 842 algorithm is optimized for main memory compression, and a software-based implementation of the algorithm is available as part of the Linux kernel. Even though GPU-equipped computers are vital for many of today's data-intensive applications, GPUs have thus far been unable to interoperate with 842-compressed data due to the lack of GPU-based decompressors. The main contribution of this paper is to fill this gap by providing optimized implementations of 842 decompression in both CUDA and OpenCL. We demonstrate that GPU-based decompression provides a 4.5-9.5x speed-up on integrated GPUs and a 30-34x speed-up on dedicated GPUs when compared to software-based decompression on CPUs for various test system...
The increasing complexity of software systems challenges the assurance of the likewise increasing dependability demands. Software fault injection is a widely accepted means of assessing dependability, but it is far less accessible and far less integrated into engineering practices than unit or integration testing. To address this issue, we present a dataset of existing fault injection tools in a programmatically evaluable model. Our Fault Injection ADvisor (FIAD) suggests applicable fault injectors mainly by analyzing definitions from Infrastructure as Code (IaC) solutions. In the longer term, FIAD can yield findings on how to classify fault injectors and is extensible in such a way that it can additionally suggest workloads or run fault injectors.
This paper introduces ReDAC, a new algorithm for dynamic reconfiguration of multi-threaded applications with cyclic dependencies. In order to achieve high reliability and availability, distributed component software has to support dynamic reconfiguration. Typical examples include the application of hot-fixes to deal with security vulnerabilities. ReDAC can be implemented on top of the modern component platforms Java and .NET. We extend the static term component, denoting a unit of deployment, to runtime by defining a capsule (a runtime component instance) as a set of interconnected objects. This allows us to apply dynamic updates at the level of components during runtime without stopping whole applications. Using system-wide unique identifiers for threads (logical thread IDs), we can detect and also bring capsules into a reconfigurable state by selectively blocking threads, relying on data structures maintained by additional logic integrated into the capsules using aspect-orie...
Dynamic reconfiguration provides a powerful mechanism to adapt component-based distributed applications to changing environmental conditions. We have designed and implemented a framework for dynamic component reconfiguration on the basis of the Microsoft .NET environment. In this paper, we present an experimental evaluation of our infrastructure for dynamic reconfiguration of component-based applications. Our framework supports the description of application configurations and profiles and allows for the selection of a particular configuration and object/component instantiation based on measured environmental conditions. In response to changes in the environment, our framework will dynamically load new configurations, thus implementing dynamic reconfiguration of an application. Configuration code for components and applications has to interact with many functional modules and is therefore often scattered around the whole application. We use aspect-oriented programming techniqu...
