In literature, computer architectures are frequently claimed to be highly flexible, typically implying the existence of trade-offs between flexibility and performance or energy efficiency. Processor flexibility, however, is not very sharply defined, and consequently these claims cannot be validated, nor can such hypothetical relations be fully understood and exploited in the design of computing systems. This paper is an attempt to introduce scientific rigour to the notion of flexibility in computing systems. A survey is conducted to provide an overview of references to flexibility in literature, both in the computer architecture domain and in related fields. A classification is introduced to categorize different views on flexibility, which ultimately form the foundation for a qualitative definition of flexibility. Departing from the qualitative definition of flexibility, a generic quantifiable metric is proposed, enabling valid quantitative comparison of the flexibility of var...
High Level Synthesis tools have reduced accelerator design time. However, a complex scaling problem that remains is the data transfer bottleneck. Accelerators require huge amounts of data and are often limited by interconnect resources. Local buffers can reduce communication by exploiting data reuse, but the data access order has a substantial impact on the amount of reuse that can be utilized. With loop transformations such as interchange and tiling, the data access order can be modified. However, for real applications the design space is huge, and finding the best set of transformations is often intractable. Therefore, we present a new methodology that minimizes the data transfer by loop interchange and tiling. In contrast to other methods, we take inter-tile reuse and loop bounds into account. For real-world applications we show buffer size trade-offs that can give speedups up to 14x; alternatively, these can reduce the required FPGA resources substantially.
2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), 2016
Memristor-based Computation-in-Memory (CIM) is one of the emerging architectures proposed to deal with Big Data problems. The design of such architectures requires a radically new automatic design flow because the memristor is a passive device that uses resistance to encode its logic value. This paper proposes a design flow for mapping parallel algorithms on the CIM architecture. Algorithms with similar data flow graphs can be mapped on the crossbar using the same template containing scheduling, placement, and routing information; this template is called a skeleton. By configuring such a skeleton with different pre-designed circuits, we can build CIM implementations of the corresponding algorithms in that class. This approach not only maps an algorithm on a memristor crossbar, but also gives an estimation of its performance, area, and energy consumption. It also supports user-defined constraints and parallel SystemC simulation. Experimental results demonstrate the feasibility and the pot...
High Level Synthesis tools have reduced accelerator design time. However, a complex scaling problem that remains is the data transfer bottleneck. Accelerators require huge amounts of data and are often limited by interconnect resources. Local buffers can reduce communication by exploiting data reuse, but the data access order has a substantial impact on the amount of reuse that can be utilized. With loop transformations such as interchange and tiling, the data access order can be modified. However, for real applications the design space is huge, and finding the best set of transformations is often intractable. Therefore, we present a new methodology that minimizes the data transfer by loop interchange and tiling. In contrast to other methods, we take inter-tile reuse and loop bounds into account. For real-world applications we show buffer size trade-offs that can give speedups up to 14x; alternatively, these can reduce the required FPGA resources substantially.
IMPACT 2018: Eighth International Workshop on Polyhedral Compilation Techniques, in conjunction with HiPEAC 2018. January 23, 2018, Manchester, United Kingdom
Proceedings of the 56th Annual Design Automation Conference 2019, 2019
The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models. We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously unseen applications.
ACM Transactions on Architecture and Code Optimization, 2019
Increasingly complex hardware makes the design of effective compilers difficult. To reduce this problem, we introduce Declarative Loop Tactics, which is a novel framework of composable program transformations based on an internal tree-like program representation of a polyhedral compiler. The framework is based on a declarative C++ API built around easy-to-program matchers and builders, which provide the foundation to develop loop optimization strategies. Using our matchers and builders, we express computational patterns and core building blocks, such as loop tiling, fusion, and data-layout transformations, and compose them into algorithm-specific optimizations. Declarative Loop Tactics (Loop Tactics for short) can be applied to many domains. For two of them, stencils and linear algebra, we show how developers can express sophisticated domain-specific optimizations as a set of composable transformations or calls to optimized libraries. By allowing developers to add highly customized...
ACM Transactions on Architecture and Code Optimization, 2017
Specialized Digital Signal Processors (DSPs), which can be found in a wide range of modern devices, play an important role in power-efficient, high-performance image processing. Applications including camera sensor post-processing and computer vision benefit from being (partially) mapped onto such DSPs. However, due to their specialized instruction sets and dependence on low-level code optimization, developing applications for DSPs is more time-consuming and error-prone than for general-purpose processors. Halide is a domain-specific language (DSL) that enables low-effort development of portable, high-performance imaging pipelines, a combination of qualities that is hard, if not impossible, to find among DSP programming models. We propose a set of extensions and modifications to Halide to generate code for DSP C compilers, focusing specifically on diverse SIMD target instruction sets and heterogeneous scratchpad memory hierarchies. We implement said techniques for a commercial DSP fo...