2018 7th International Conference on Modern Circuits and Systems Technologies (MOCAST)
In this paper, the Logarithmic Number System (LNS) is adopted to implement Long Short-Term Memory (LSTM), the basic component of a type of deep learning network. Initially, piecewise approximations to the activation functions σ and tanh are proposed and evaluated in LNS. Subsequently, LNS multipliers and adders are implemented for wordlengths of 9, 10, and 11 bits. The circuits are implemented in a 90-nm 1.0 V CMOS standard-cell library and quantitative results are reported. The results demonstrate that LNS is a good candidate for data representation and processing in deep learning networks, as an area reduction of up to 36% is possible.
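The core appeal of LNS for deep learning workloads is that multiplication reduces to addition of the log-domain representations. A minimal sketch, assuming an illustrative sign/log encoding and quantization grid (not the paper's exact format or wordlengths):

```python
import math

def to_lns(x, frac_bits=6):
    # Represent |x| by its base-2 logarithm, quantized to a fixed-point grid
    # with frac_bits fractional bits; the sign is kept separately.
    assert x != 0, "zero needs a special encoding in LNS"
    sign = 1 if x > 0 else -1
    log2x = round(math.log2(abs(x)) * 2**frac_bits) / 2**frac_bits
    return sign, log2x

def lns_mul(a, b):
    # In LNS, a multiplication is just an addition of the log parts
    # (and an XOR of the sign bits) -- no multiplier array is needed.
    (sa, la), (sb, lb) = a, b
    return sa * sb, la + lb

def from_lns(v):
    sign, l = v
    return sign * 2**l

prod = from_lns(lns_mul(to_lns(3.0), to_lns(0.5)))  # close to 1.5, up to quantization error
```

The adder, by contrast, is the expensive LNS operation (see the LUT-based addition discussed in a later abstract), which is why the wordlength trade-off matters.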
2019 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), 2019
Deep-learning-based decoders have recently been introduced for use with short-length codes. They have been found to act as a soft-decision decoding method, achieving near-maximum-likelihood error-correcting capability. However, deep-learning decoding methods are hard to implement, as they normally require millions of operations for inference. For deep-learning decoding to become a competitive candidate for practical applications, research effort is required to reduce the computational complexity and storage requirements of the neural networks involved. In this work, a structured flow is presented that significantly compresses a trained syndrome-based neural network decoder by pruning up to 80% of the network's weights and quantizing them to an 8-bit fixed-point representation, with no loss in BER performance. The resulting compressed neural network can then be used for inference, either by designing specific hardware or by using a generic deep-learning hardware accelerator that exploits the compressed structure of the network. The deployment of the DL decoder in an embedded application is showcased using the Xilinx AI Edge platform. To accomplish this, a simple method to obtain a computationally equivalent convolutional layer from a fully-connected one is introduced. Implementation results are provided for the compressed DL decoder, regarding throughput rate and BER performance. To our knowledge, this is the first DL decoder reported in hardware.
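The compression flow combines weight pruning with fixed-point quantization. A minimal sketch of one layer's treatment, assuming magnitude-based pruning and symmetric 8-bit quantization (the paper's exact criteria and retraining steps are not reproduced here):

```python
import numpy as np

def compress_layer(w, prune_ratio=0.8, bits=8):
    # Magnitude pruning: zero out the smallest prune_ratio fraction of the
    # weights, then quantize the survivors to signed fixed point.
    threshold = np.quantile(np.abs(w), prune_ratio)
    pruned = np.where(np.abs(w) >= threshold, w, 0.0)
    scale = np.max(np.abs(pruned)) / (2**(bits - 1) - 1)
    q = np.round(pruned / scale).astype(np.int8)
    return q, scale  # dequantize as q * scale at inference time

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16)).astype(np.float32)  # stand-in for trained weights
q, scale = compress_layer(w)
sparsity = float(np.mean(q == 0))  # roughly 0.8
```

In practice the pruned network would be fine-tuned before quantization to recover any BER loss; the sketch only shows the data transformation.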
2018 25th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2018
Deep-learning technology proliferates across a wide spectrum of applications. Modern communications applications require innovative solutions that can deliver near-optimum performance in diverse and evolving environments. This paper proposes several substantial simplifications to deep-learning networks applied to the decoding of linear block codes. All proposed techniques reduce the computational and interconnection complexity required for inference in a deep-learning network over prior art. The proposed techniques build on inducing or exploiting sparsity in the trained network. Complexity savings of 60% to more than 80% are achieved without any practical degradation in decoding performance, quantified as coding gain.
2021 IEEE International Symposium on Circuits and Systems (ISCAS), 2021
In this paper, a simplified hardware implementation of a dot-product arithmetic operation with constant coefficients is presented. The proposed methodology exploits a combination of distributed arithmetic and common-subexpression-sharing techniques. An algorithm is introduced for systematically identifying the common partial sums. Subsequently, a hardware architecture is proposed, and the obtained circuits are synthesized in a 90-nm 1.0 V CMOS standard-cell library using Synopsys Design Compiler. Comparisons reveal significant reductions of 52% in area and 23% in power, respectively, at a 1.5 ns delay over a regular constant-coefficient dot-product multiplier.
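Distributed arithmetic replaces the multipliers in a constant-coefficient dot product with a lookup table indexed by one bit of each input per cycle, followed by shift-accumulation. A minimal bit-serial sketch for unsigned inputs, with hypothetical coefficients (the paper's subexpression-sharing step, which would further compress the table, is not shown):

```python
# Distributed arithmetic evaluates y = sum(c_i * x_i) for constant c_i by
# precomputing, for every possible slice of the inputs' j-th bits, the sum
# of the selected constants, then shift-accumulating over bit positions.
COEFFS = [3, 5, 7, 2]   # hypothetical constant coefficients
WIDTH = 8               # unsigned input wordlength (assumption)

# LUT indexed by a 4-bit pattern: one bit taken from each of the 4 inputs.
LUT = [sum(c for i, c in enumerate(COEFFS) if (pattern >> i) & 1)
       for pattern in range(2**len(COEFFS))]

def da_dot(xs):
    acc = 0
    for j in reversed(range(WIDTH)):      # from MSB down to LSB
        # Gather bit j of every input into one LUT address.
        pattern = sum(((x >> j) & 1) << i for i, x in enumerate(xs))
        acc = (acc << 1) + LUT[pattern]   # shift-accumulate
    return acc
```

The LUT grows as 2^N for N inputs, which is exactly where common-subexpression sharing across partial sums pays off in hardware.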
2021 10th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2021
In this paper, a parallel neural network architecture is proposed, targeting efficient hardware implementation on low-resource devices. Following the introduction of the proposed technique, the novel concept is applied to two basic function-approximation examples, namely cos(x) and sin(x). Quantitative results are offered and discussed in terms of accuracy and hardware complexity. It is shown that the proposed technique achieves promising results for low-power, high-performance hardware implementations targeted at edge devices.
This paper introduces hardware architectures for encoding Quasi-Cyclic Low-Density Parity-Check (QC-LDPC) codes. The proposed encoders are based on appropriate factorization and subsequent compression of the involved matrices by means of a novel technique that exploits features of recursively constructed QC-LDPC codes. The particular approach achieves linear encoding time complexity and requires a constant number of clock cycles to compute the parity bits for all constructed codes of various lengths that stem from a common base matrix. The proposed architectures are flexible, as they are parameterized and can support multiple code rates and codes of different lengths simply by appropriate initialization of memories and selection of data-bus widths. Implementation results show that the proposed encoding technique is more efficient for some LDPC codes than previously proposed solutions. Both serial and parallel architectures are proposed. Hardware instantiations of the proposed serial encoders demonstrate high throughput with low area complexity for codewords of many thousands of bits, achieving area reduction compared to prior art. Furthermore, parallelization is shown to efficiently support multi-Gbps solutions at the cost of a moderate area increase. The proposed encoders are shown to outperform the current state of the art in terms of throughput-to-area ratio and area-time complexity by 10 to 80 times for codes of comparable error-correction strength.
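The quasi-cyclic structure is what makes compact encoders possible: every block of the matrix is a circulant, so a matrix-vector product over GF(2) reduces to cyclic shifts and XORs. A toy sketch with hypothetical parameters (not a code from the paper, and the shift-direction convention is an assumption):

```python
import numpy as np

Z = 4                      # circulant (lifting) size
SHIFTS = [[1, 3], [0, 2]]  # cyclic shift of each Z x Z block of the matrix

def qc_mat_vec(shifts, v):
    # v is split into Z-bit chunks; each circulant block contributes a
    # cyclically shifted copy of its input chunk, accumulated with XOR.
    out = np.zeros((len(shifts), Z), dtype=np.uint8)
    chunks = v.reshape(-1, Z)
    for r, row in enumerate(shifts):
        for c, s in enumerate(row):
            out[r] ^= np.roll(chunks[c], -s)   # shift left by s (convention assumed)
    return out.reshape(-1)

v = np.array([1, 0, 0, 0, 0, 1, 0, 0], dtype=np.uint8)
y = qc_mat_vec(SHIFTS, v)
```

In hardware, each `np.roll` is just a barrel shifter or rewired bus, which is why the encoder cost scales with the base matrix rather than with the full parity-check matrix.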
This paper discusses techniques for low-power addition/subtraction in the logarithmic number system (LNS) and evaluates their impact on digital filter implementation. Initially, the impact of partitioning the look-up tables (LUTs) required for addition/subtraction on complexity, performance, and power dissipation is studied. Subsequently, techniques for the low-power implementation of an LNS multiply-accumulate (MAC) unit are investigated. The obtained LNS MACs are used for the design of digital filters. Synthesis of LNS-based digital filters using a 0.18 μm 1.8 V CMOS standard-cell library reveals that significant power-dissipation savings are possible at no performance penalty, compared to linear two's-complement equivalents.
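LNS addition is the operation that needs those look-up tables: adding two numbers given only their logarithms requires evaluating log2(1 + 2^d), which hardware typically tabulates. A minimal sketch of the underlying identity, for positive operands (the paper's LUT partitioning and quantization are not modeled):

```python
import math

def sb(d):
    # The "sum" function tabulated in LNS adder LUTs; d = small - big <= 0,
    # so sb(d) lies in (0, 1] and shrinks quickly as d decreases.
    return math.log2(1 + 2**d)

def lns_add(la, lb):
    # log2(2^la + 2^lb) = max + sb(min - max); both operands positive.
    big, small = max(la, lb), min(la, lb)
    return big + sb(small - big)

s = lns_add(math.log2(3.0), math.log2(5.0))  # log2(8) up to float rounding
```

Subtraction uses the analogous function log2(1 - 2^d), which is harder to approximate near d = 0 and dominates the LUT cost that the partitioning study targets.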
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 1997
In this paper, a systematic graph-based methodology for synthesizing VLSI RNS architectures using full adders as the basic building block is introduced. The design methodology derives array architectures starting from the algorithm level and ending with the bit-level design. Using the regular array processor as the target architectural style, the proposed procedure constructs the two-dimensional (2-D) dependence graph of the bit-level algorithm, which is formally described by sets of uniform recurrent equations. The main characteristic of the proposed architectures is that they can operate at very high throughput rates. The proposed architectures exhibit significantly reduced complexity over ROM-based ones.
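For context, the Residue Number System (RNS) represents an integer by its residues modulo a set of pairwise-coprime moduli, so addition and multiplication act independently, carry-free, on each channel — which is what makes regular full-adder arrays attractive. A minimal sketch with a hypothetical moduli set:

```python
MODULI = (3, 5, 7)  # hypothetical pairwise-coprime moduli; dynamic range 3*5*7 = 105

def to_rns(x):
    # Each channel holds x reduced by one modulus.
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Channels multiply independently -- no carries propagate between them.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

product = rns_mul(to_rns(8), to_rns(9))  # equals to_rns(72), since 72 < 105
```

Recovering the integer from its residues (via the Chinese Remainder Theorem) is the costly step; the paper's architectures target the per-modulus arithmetic itself.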
IEEE Transactions on Circuits and Systems I: Regular Papers, 2000
A graph-based technique is introduced for the design of a class of residue arithmetic multipliers, as well as a family of new high-radix digit adders. The proposed design technique derives simple high-radix modulo-r^n multipliers by optimally selecting, among the variety of introduced digit adders, the ones that compose a minimal-area multiplier. The proposed technique minimizes multiplier complexity by selecting ...
IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 2004
... Giorgos Dimitrakopoulos and Vassilis Paliouras, Member, IEEE ... transform (DFT) [7]. Elleithy and Bayoumi presented an architecture for modular multiplication, which consists mainly of modular adders and is suitable for medium and large moduli [8]. Stouraitis et al. ...
... The remainder of the paper is as follows: Section 2 discusses the basics of OFDM transmission and defines PAPR, the efficiency of a class-A power amplifier, and its relationship to PAPR reduction. Section 3 outlines the PTS approach. ...
... 2. BASIC NOTATION This section discusses the basics of OFDM transmission, defines PAPR and Crest Factor (CF), and outlines the PTS approach. Initially, the binary input data are mapped onto QAM symbols. An IFFT/FFT pair is used as a modulator/demodulator. ...