Tibetan Studies and Digital Tibetan by Kurt Keutzer
ACM Transactions on Asian and Low-Resource Language Information Processing, 2021
Over the past decade, through a mixture of optical character recognition and manual input, there is now a growing corpus of Tibetan literature available as e-texts in Unicode format. With the creation of such a corpus, the techniques of text analytics that have been applied in the analysis of English and other modern languages may now be applied to Tibetan. In this work, we narrow our focus to examine a modest portion of that literature, the Mind-section portion of the literature of the Tibetan tradition of the Great Perfection. Here, we will use the lens of text analytics tools based on machine learning techniques to investigate a number of questions of interest to scholars of this and related traditions of the Great Perfection. It has been necessary for us to participate in all portions of this process: corpora identification and text edition selection, rendering the text as e-texts in Unicode using both Optical Character Recognition and manual entry, data cleaning and transformation, implementation of software for text analysis, and interpretation of results. For this reason, we hope this study can serve as a model for other low-resource languages that are just beginning to approach the problem of providing text analytics for their language.
Revue d’Etudes Tibétaines, 2012
In sGa ston's list of the Southern Treasures discovered by gShen chen Klu dga', a series of texts referred to as the Facets of Mind, Nine Minor Texts on Mind are mentioned. The Bon tradition has acknowledged from that time to the present day that these are seminal texts in the literature of Bon. Furthermore, these texts would eventually be classified as the exemplary works of the Mind Section of Bon Dzogchen. Nevertheless, the precise content of these texts has been unclear to modern scholars, both Tibetan and Western, working outside of Tibet. With the publication in 1999 of Mongyal Lhase's Edition of the Bon Kangyur, as well as with other subsequent publications, we are now in a better position to identify and understand these works. The aim of this paper is to clearly identify the titles of these texts, to identify the various editions in which they are available, and to begin to understand how they work together with tantric elements to form a holistic system of training.
The use of advanced computational methods for the analysis of large corpora of electronic texts is becoming increasingly popular in humanities and social science research. Unfortunately, Tibetan Studies has lacked such a repository of electronic, searchable texts. The automated recognition of printed texts, known as Optical Character Recognition (OCR), offers a solution to this problem; however, until recently, robust OCR systems for the Tibetan language have not been available. In this paper, we introduce one new system, called Namsel, which uses OCR to support the production, review, and distribution of searchable Tibetan texts at a large scale. Namsel tackles a number of challenges unique to the recognition of complex scripts such as Tibetan uchen and has been able to achieve high accuracy rates on a wide range of machine-printed works. In this paper, we discuss the details of Tibetan OCR, how Namsel works, and the problems it is able to solve. We also discuss the collaborative work between Namsel and its partner libraries aimed at building a comprehensive database of historical and modern Tibetan works—a database that consists of more than one million pages of texts spanning over a thousand years of literary production.
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data - MOCR_AND '11, 2011
Recognition of Tibetan wood block print is a difficult problem that has many challenging steps. We propose a two-stage framework involving image preprocessing, which consists of noise removal and baseline detection, and simultaneous character segmentation and recognition with the aid of a generalized hidden Markov model (also known as gHMM). For the latter stage, we train a gHMM and run the generalized Viterbi algorithm on our image to decode observations. There are two major motivations for using gHMM. First, it incorporates a language model into our recognition system, which in turn enforces grammar and disambiguates classification errors caused by printing errors and image noise. Second, gHMM solves the segmentation challenge. Simply put, gHMM is an HMM whose emission model allows multiple consecutive observations to be mapped to the same state. For the features of our emission model, we apply the line and circle Hough transforms to stroke detection, and use class-specific scaling for feature weighting. With gHMM, we find KMQDF to be the most effective distance metric for discriminating character classes. The accuracy of our system is 91.29%.
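The decoding idea in this abstract—an HMM whose states may absorb a variable-length run of consecutive observations, solving segmentation and recognition jointly—can be sketched with a small dynamic program. This is an illustrative reconstruction, not the paper's implementation: the state set, transition table, and segment-emission function below are hypothetical stand-ins for the paper's language model and KMQDF-based emission model.

```python
def generalized_viterbi(obs, states, log_trans, log_emit_seg, max_seg=3):
    """Decode a best state sequence where each state may emit a run
    ("segment") of up to max_seg consecutive observations.

    obs          : list of observations (e.g. glyph feature vectors)
    states       : list of state ids
    log_trans    : dict (s_prev, s) -> log transition score;
                   s_prev None marks the start of the sequence
    log_emit_seg : function (state, segment) -> log score of that state
                   emitting the whole segment
    """
    T = len(obs)
    # best[t][s] = (log score of best path covering obs[:t] ending in s,
    #               backpointer (t_prev, s_prev))
    best = [dict() for _ in range(T + 1)]
    best[0] = {None: (0.0, None)}  # virtual start state
    for t in range(1, T + 1):
        for s in states:
            for L in range(1, min(max_seg, t) + 1):
                seg = obs[t - L:t]  # candidate segment absorbed by s
                for s_prev, (lp, _) in best[t - L].items():
                    cand = (lp
                            + log_trans.get((s_prev, s), float("-inf"))
                            + log_emit_seg(s, seg))
                    if s not in best[t] or cand > best[t][s][0]:
                        best[t][s] = (cand, (t - L, s_prev))
    # Backtrack from the best final state.
    s = max(best[T], key=lambda k: best[T][k][0])
    path, t = [], T
    while t > 0:
        path.append(s)
        t, s = best[t][s][1]
    return path[::-1]
```

Compared with ordinary Viterbi, the extra inner loop over segment lengths `L` is what lets one character class claim several image columns at once; the language model enters through `log_trans`, exactly as the abstract describes.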
Deep Learning/Deep Neural Nets by Kurt Keutzer
2011 IEEE International Parallel & Distributed Processing Symposium, 2011
2013 IEEE International Conference on Image Processing, 2013
Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, and it has been a challenge to deploy these models for edge applications and devices that have resource constraints. While quantization can be a viable solution to this, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, thus limiting model deployment on many edge processors. In this work, we propose a novel integer-only quantization scheme for Transformer-based models that quantizes the entire inference process. In particular, we demonstrate how to approximate nonlinear operations in Transformer architectures, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations. We use those approximations in our method, I-BERT, with an end-to-end integer-only inference, and...
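The core trick behind approximating nonlinearities with integer computations can be illustrated with a second-order polynomial evaluated entirely on quantized integers. The sketch below shows the general pattern—folding the input scale into the polynomial's constants so the integer part needs only adds and multiplies—under assumed coefficients; the actual I-BERT approximations of GELU, Softmax, and Layer Normalization are more involved.

```python
import math

def int_poly2(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)**2 + c for x = S * q.

    q : quantized integer input, with real value x = S * q
    S : input quantization scale (a float, known ahead of time)
    a, b, c : polynomial coefficients (floats, known ahead of time)

    Returns (q_out, S_out) such that a*(x + b)**2 + c ~= S_out * q_out.
    The constants q_b and q_c are precomputed from S offline, so the
    per-element arithmetic on q uses only integers.
    """
    q_b = int(math.floor(b / S))            # b expressed in input units
    q_c = int(math.floor(c / (a * S * S)))  # c expressed in output units
    q_out = (q + q_b) ** 2 + q_c            # integer-only computation
    S_out = a * S * S                       # output scale, folded offline
    return q_out, S_out
```

The design point is that `S`, `a`, `b`, and `c` are compile-time constants, so `q_b`, `q_c`, and `S_out` can be baked into the model; at inference time only the integer line executes, which is exactly what makes deployment on integer-only edge processors possible.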
Efficient Machine Learning by Kurt Keutzer
... Theory: L1-SPIRiT reconstruction requires solving a non-linear constrained optimization problem: minimize ... Using our OpenMP calibration and our CUDA POCS solver results in 97-second ... Our solvers are scalable to larger image sizes, more channels, and to larger processing ...
Journal of Logic and Computation, 2014
IEEE Transactions on Medical Imaging, 2000
IEEE Signal Processing Magazine, 2000
2011 IEEE Workshop on Applications of Computer Vision (WACV), 2011
... In Workshop on Generative-Model Based Vision, CVPR, 2004. [10] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008. [11] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. ...
2009 IEEE 12th International Conference on Computer Vision, 2009
Image contour detection is fundamental to many image analysis applications, including image segmentation, object recognition and classification. However, highly accurate image contour detection algorithms are also very computationally intensive, ...
ing and 2,300 manually curated and corrected bitext pairs for the evaluation of machine translation models for this language. We train a number of sequence-to-sequence models and compare their translation performance against commercial models. We also provide limited case studies in which we examine the performance of different machine translation models on a selection of Buddhist Chinese passages.