US20130158996A1 - Acoustic Processing Unit - Google Patents
- Publication number
- US20130158996A1 (application US13/489,799)
- Authority
- US
- United States
- Prior art keywords
- senone
- gaussian probability
- ssu
- distance
- gaussian
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Definitions
- Embodiments of the present invention generally relate to speech recognition. More particularly, embodiments of the present invention relate to the implementation of an acoustic modeling process on a dedicated processing unit.
- Real-time data pattern recognition is increasingly used to analyze data streams in electronic systems.
- On vocabularies of tens of thousands of words, speech recognition systems have achieved improved accuracy, making speech recognition an attractive feature for electronic systems.
- Speech recognition systems are increasingly common in consumer markets targeted to data pattern recognition applications such as, for example, the mobile device, server, automobile, and PC markets.
- An embodiment of the present invention includes a senone scoring unit (SSU).
- the SSU can include an SSU control module, a distance calculator, and an addition module.
- the SSU control module can be configured to receive a feature vector.
- the distance calculator can be configured to receive a plurality of Gaussian probability distributions via a data bus having a width of at least one Gaussian probability distribution (e.g., 768 bits) and the feature vector from the SSU control module.
- the distance calculator can include a plurality of arithmetic logic units (ALUs) and an accumulator.
- Each of the ALUs can be configured to receive a portion of the at least one Gaussian probability distribution and to calculate a dimension distance score between a dimension of the feature vector and a corresponding dimension of the at least one Gaussian probability distribution.
- the accumulator can be configured to sum the dimension distance scores from the plurality of ALUs to generate a Gaussian distance score.
- the addition module can be configured to sum a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score.
- the SSU can also include a feature vector matrix module configured to store a scaling factor for the dimension of the feature vector.
- Another embodiment of the present invention includes a method for acoustic modeling.
- the method can include the following: receiving a plurality of Gaussian probability distributions via a data bus having a width of at least one Gaussian probability distribution and a feature vector from an external computing device; calculating a plurality of dimension distance scores based on a plurality of dimensions of the feature vector and a corresponding plurality of dimensions of the at least one Gaussian probability distribution; summing the plurality of dimension distance scores to generate a Gaussian distance score for the at least one Gaussian probability distribution; and, summing a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score.
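The per-senone computation this method describes can be sketched in software. The following is a simplified model, not the patented hardware: the array shapes (an 8-component, 39-dimension senone), the squared-distance form of the dimension distance score, and the final log-domain combination are illustrative assumptions. The claims describe the addition module as summing the per-Gaussian scores; in log-domain GMM scoring that combination is typically a log-add, which is what is shown here.

```python
import numpy as np

def senone_score(feature_vector, means, var_weights):
    """Sketch of the claimed pipeline: per-dimension distance scores are
    summed into one Gaussian distance score per component, and the
    component scores are combined into a single senone score.
    `means` and `var_weights` have shape (n_components, n_dims),
    e.g. (8, 39); all names here are illustrative assumptions."""
    # Dimension distance score: weighted squared distance per dimension
    dim_dists = var_weights * (feature_vector - means) ** 2   # (8, 39)
    # Accumulator: sum dimension distances into a Gaussian distance score
    gaussian_scores = dim_dists.sum(axis=1)                   # (8,)
    # Addition module: combine component scores into the senone score
    # (log-sum-exp of negated distances, a common log-domain choice)
    return np.logaddexp.reduce(-gaussian_scores)
```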
- a further embodiment of the present invention includes a system for acoustic modeling.
- the system can include a memory module and a senone scoring unit (SSU).
- the memory module can be configured to interface with an external computing device to receive a feature vector.
- the SSU can include a distance calculator and an addition module, where the distance calculator includes a plurality of arithmetic logic units (ALUs) and an accumulator.
- Each of the ALUs can be configured to receive a portion of the at least one Gaussian probability distribution and to calculate a dimension distance score between a dimension of the feature vector and a corresponding dimension of the at least one Gaussian probability distribution.
- the accumulator can be configured to sum the dimension distance scores from the plurality of ALUs to generate a Gaussian distance score.
- the addition module can be configured to sum a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score.
- FIG. 1 is an illustration of an exemplary flowchart of a speech recognition process according to an embodiment of the present invention.
- FIG. 2 is an illustration of a conventional speech recognition system.
- FIG. 3 is an illustration of a conventional speech recognition system with speech recognition processes performed by an individual processing unit.
- FIG. 4 is an illustration of an embodiment of speech recognition processes performed by an Acoustic Processing Unit (APU) and a Central Processing Unit (CPU).
- FIG. 5 is an illustration of an embodiment of a Peripheral Controller Interface (PCI) bus architecture for a speech recognition system.
- FIG. 6 is an illustration of an embodiment of an Advanced Peripheral Bus (APB) architecture for a speech recognition system.
- FIG. 7 is an illustration of an embodiment of a Low Power Double Data Rate (LPDDR) bus architecture for a speech recognition system.
- FIG. 8 is an illustration of an embodiment of a system-level architecture for a speech recognition system.
- FIG. 9 is an illustration of an embodiment of a method for data pattern analysis.
- FIG. 10 is an illustration of an embodiment of a system-level architecture for a speech recognition system with an integrated Application-Specific Integrated Circuit (ASIC) and memory device.
- FIG. 11 is an illustration of an embodiment of a system-level architecture for a speech recognition system with an integrated Application-Specific Integrated Circuit (ASIC), volatile memory device, and non-volatile memory device.
- FIG. 12 is an illustration of an embodiment of a system-level architecture for a speech recognition system with a System-On-Chip that includes an Application-Specific Integrated Circuit (ASIC) and a Central Processing Unit (CPU).
- FIG. 13 is an illustration of another embodiment of a system-level architecture for a speech recognition system with a System-On-Chip that includes an Application-Specific Integrated Circuit (ASIC) and a Central Processing Unit (CPU).
- FIG. 14 is an illustration of an embodiment of an Acoustic Processing Unit (APU).
- FIG. 15 is an illustration of an embodiment of a Senone Scoring Unit (SSU) controller for an Acoustic Processing Unit (APU).
- FIG. 16 is an illustration of an embodiment of a distance calculator for an Acoustic Processing Unit (APU).
- FIG. 17 is an illustration of an embodiment of a method of an acoustic modeling process for an Acoustic Processing Unit (APU).
- FIG. 18 is an illustration of an embodiment of an arithmetic logic unit, according to an embodiment of the present invention.
- FIG. 19 is an illustration of an embodiment of the arithmetic logic unit shown in FIG. 18 , according to an embodiment of the present invention.
- FIG. 20 is an illustration of an embodiment of a computational unit, according to an embodiment of the present invention.
- FIG. 21 is an illustration of an embodiment of a method for computing a one-dimensional distance score.
- FIGS. 22 and 23 are illustrations of embodiments of an acoustic processing system.
- FIG. 24 is an illustration of an embodiment of a hardware accelerator.
- FIG. 25 is a block diagram illustrating an APU software stack.
- FIG. 26 is an illustration of an embodiment of concurrent processing.
- FIG. 27 is an illustration of an embodiment of a method of acoustic processing.
- FIG. 28 is an illustration of an embodiment of an example computer system in which embodiments of the present invention, or portions thereof, can be implemented as computer readable code.
- FIG. 1 is an illustration of an exemplary flowchart of a speech recognition process 100 according to an embodiment of the present invention.
- Speech recognition process 100 includes a signal processing stage 110 , an acoustic modeling stage 120 , a phoneme evaluation stage 130 , and a word modeling stage 140 .
- an analog signal representation of an incoming voice signal 105 can be filtered to eliminate high frequency components of the signal that lie outside the range of frequencies that the human ear can hear.
- the filtered signal is then digitized using sampling and quantization techniques well known to a person skilled in the relevant art.
- One or more parametric digital representations can be extracted from the digitized waveform using techniques such as, for example, linear predictive coding and fast Fourier transforms. This extraction can occur at regular time intervals, or frames, of approximately 10 ms, for example.
- In acoustic modeling stage 120, feature vectors 115 from signal processing stage 110 are compared to one or more multivariate Gaussian probability distributions (also referred to herein as “Gaussian probability distributions”) stored in memory.
- the one or more Gaussian probability distributions stored in memory can be part of an acoustic library, in which the Gaussian probability distributions represent senones.
- a senone refers to a sub-phonetic unit for a language of interest, as would be understood by a person skilled in the relevant art.
- An individual senone can be made up of, for example, 8 components, in which each of the components can represent a 39-dimension Gaussian probability distribution.
- Acoustic modeling stage 120 can process over 1000 senones, for example.
- the comparison of feature vectors 115 to the one or more Gaussian probability distributions can be a computationally-intensive task, as thousands of Gaussian probability distributions, for example, can be compared to feature vectors 115 every time interval or frame (e.g., 10 ms).
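Using the example figures above (over 1000 senones, 8 components per senone, 39 dimensions per Gaussian, one frame every 10 ms), the computational load can be estimated with a quick back-of-the-envelope calculation; the figures are only the examples given in the text:

```python
# Illustrative workload estimate from the example figures in the text
senones = 1000           # "over 1000 senones"
components = 8           # Gaussian components per senone
dims = 39                # dimensions per Gaussian probability distribution
frames_per_second = 100  # one frame every 10 ms

dim_distances_per_frame = senones * components * dims                   # 312,000
dim_distances_per_second = dim_distances_per_frame * frames_per_second  # 31,200,000
```

That is, on the order of hundreds of thousands of dimension-distance computations per frame, which is why the comparison is described as computationally intensive.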
- a set of scores for each of the senones represented in the acoustic library (also referred to herein as “senone scores”) results from the comparison of each of feature vectors 115 to each of the one or more Gaussian probability distributions.
- Acoustic modeling stage 120 provides senone scores 125 to phoneme evaluation stage 130 .
- In phoneme evaluation stage 130, phonemes can be characterized using Hidden Markov Models (HMMs).
- a Viterbi algorithm can be used to find the likelihood of each HMM corresponding to a phoneme.
- the Viterbi algorithm performs a computation that starts with a first frame and then proceeds to subsequent frames one-at-a-time in a time-synchronous manner. A probability score is computed for each senone in the HMMs being considered. Therefore, a cumulative probability score can be successively computed for each of the possible senone sequences as the Viterbi algorithm analyzes sequential frames.
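The time-synchronous computation described above can be sketched as follows. This is a generic log-domain Viterbi recursion, not the patent's implementation; the argument names and the use of senone scores as per-state frame scores are illustrative assumptions.

```python
import numpy as np

def viterbi_log(senone_scores, log_trans, log_init):
    """Time-synchronous Viterbi over one HMM.
    senone_scores[t, s]: (log-domain) score of state s's senone for frame t.
    log_trans[i, j]: log transition probability from state i to state j.
    log_init[s]: log initial probability of state s."""
    n_frames, _ = senone_scores.shape
    # Cumulative scores after the first frame
    score = log_init + senone_scores[0]
    for t in range(1, n_frames):
        # Best predecessor for each state, then add the frame's senone score
        score = np.max(score[:, None] + log_trans, axis=0) + senone_scores[t]
    return np.max(score)  # score of the most likely senone sequence
```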
- Phoneme evaluation stage 130 provides the phoneme likelihoods or probabilities 135 (also referred to herein as a “phoneme score”) to word modeling stage 140 .
- searching techniques are used to determine a most-likely string of phonemes and subsequent words, over time. Searching techniques such as, for example, tree-based algorithms can be used to determine the most-likely string of phonemes.
- FIG. 2 is an illustration of a conventional speech recognition system 200 .
- Speech recognition system 200 includes an input device 210 , a processing unit 220 , a memory device 230 , and a data bus 240 , all of which are separate physical components.
- Memory device 230 can be, for example, a Dynamic Random Access Memory (DRAM) device that is external to processing unit 220 and in communication with processing unit 220 via data bus 240 .
- Input device 210 is also in communication with processing unit 220 via data bus 240 .
- Data bus 240 has a typical bus width of, for example, 8 to 32 bits.
- Input device 210 is configured to receive an incoming voice signal (e.g., incoming voice signal 105 of FIG. 1 ) and convert acoustical vibrations associated with the incoming voice signal to an analog signal.
- the analog signal is digitized using an analog to digital converter (not shown in FIG. 2 ), and the resulting digital signal is transferred to processing unit 220 over data bus 240 .
- Input device 210 can be, for example, a microphone.
- Processing unit 220 is configured to process the digital input signal in accordance with the signal processing stage 110 , acoustic modeling stage 120 , phoneme evaluation stage 130 , and word modeling stage 140 described above with respect to FIG. 1 .
- FIG. 3 is an illustration of speech recognition system 200 with speech recognition modules performed by processing unit 220 .
- Processing unit 220 includes signal processing module 310 , acoustic modeling module 320 , phoneme evaluation module 330 , and word modeling module 340 , which operate in a similar manner as signal processing stage 110 , acoustic modeling stage 120 , phoneme evaluation stage 130 , and word modeling stage 140 of FIG. 1 , respectively.
- signal processing module 310 can convert a digital input signal representation of incoming voice signal 305 (e.g., from input device 210 ) into one or more feature vectors 315 .
- Acoustic modeling module 320 compares one or more feature vectors 315 to one or more Gaussian probability distributions stored in an acoustic library in memory device 230 . That is, for each of the comparisons of one or more feature vectors 315 to the one or more Gaussian probability distributions, processing unit 220 accesses memory device 230 via data bus 240 .
- Phoneme evaluation module 330 receives senone scores 325 from acoustic modeling module 320 .
- HMMs can be used to characterize a phoneme as a set of states and an a priori set of transition probabilities between each of the states, where a state is composed of a sequence of senones.
- the sets of states and a priori sets of transition probabilities used by phoneme evaluation module 330 can be stored in memory device 230 .
- Phoneme evaluation module 330 provides phoneme scores 335 to word modeling module 340 .
- Word modeling module 340 uses searching techniques such as, for example, tree-based algorithms to determine a most-likely string of phonemes (e.g., most-likely phoneme 335 ), and subsequent words, over time.
- An issue with conventional speech recognition system 300 of FIG. 3 is the significant load on processing unit 220 due to the acoustic modeling process. For example, for each comparison of one or more feature vectors 315 to the one or more Gaussian probability distributions stored in memory device 230 , memory device 230 is accessed by processing unit 220 . As a result, significant computing resources are dedicated to the acoustic modeling process, in turn placing a significant load on processing unit 220 .
- the load placed on processing unit 220 by the acoustic modeling process affects the speed at which processing unit 220 can process digital signals from input device 210 as well as data from other applications (e.g., where processing unit 220 can operate in a multiuser/multiprogramming environment that concurrently processes data from a plurality of applications). Further, for computing systems with limited memory resources (e.g., handheld devices), the acoustic modeling process not only places a significant load on processing unit 220 , but also consumes a significant portion of memory device 230 and bandwidth of data bus 240 . These issues, among others, with processing capabilities, speed, and memory resources are further exacerbated by the need to process incoming voice signals in real-time or substantially close to real-time in many applications.
- Embodiments of the present invention address the issues discussed above with respect to conventional speech recognition systems 200 and 300 of FIGS. 2 and 3 , respectively.
- the acoustic modeling process is performed by a dedicated processing unit (also referred to herein as an “Acoustic Processing Unit” or “APU”).
- APU operates in conjunction with processing unit 220 of FIG. 3 (also referred to herein as a “Central Processing Unit” or “CPU”).
- the APU receives one or more feature vectors (e.g., feature vectors 315 of FIG. 3 ) from the CPU, calculates a senone score (e.g., senone score 325 of FIG. 3 ) based on one or more Gaussian probability distributions, and returns the senone score to the CPU.
- the one or more Gaussian probability distributions can be stored in the APU.
- the one or more Gaussian probability distributions can be stored externally to the APU, in which the APU receives the one or more Gaussian probability distributions from an external memory device. Based on the architecture of the APU, which is described in further detail below, an accelerated calculation for the senone score is achieved.
- FIG. 4 is an illustration of an embodiment of a speech recognition process 400 performed by the APU and CPU.
- the CPU performs a signal processing process 410 , a phoneme evaluation process 430 , and a word modeling process 440 .
- the APU performs an acoustic modeling process 420 .
- Signal processing process 410 , acoustic modeling process 420 , phoneme evaluation process 430 , and word modeling process 440 operate in a similar manner as signal processing stage 110 , acoustic modeling stage 120 , phoneme evaluation stage 130 , and word modeler stage 140 of FIG. 1 , respectively, except as otherwise described herein.
- feedback 450 is an optional feature of speech recognition process 400 , in which phoneme evaluation process 430 can provide an active senone list to acoustic modeling process 420 , according to an embodiment of the present invention.
- the APU can compare one or more feature vectors to one or more senones indicated in the active senone list. Such feedback 450 is further discussed below.
- acoustic modeling process 420 can compare the one or more feature vectors to all of the senones associated with an acoustic library. In this case, feedback 450 is not required, as phoneme evaluation process 430 receives an entire set of senone scores (e.g., “score all” function) from the APU for further processing.
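The two modes above, scoring against an active senone list fed back from phoneme evaluation versus the "score all" function, can be sketched as follows. The data layout and function names are illustrative assumptions, with a placeholder standing in for the APU's per-senone distance/addition pipeline.

```python
def score_one(feature_vector, senone):
    # Placeholder for the APU's per-senone computation
    return sum((x - m) ** 2 for x, m in zip(feature_vector, senone["mean"]))

def score_senones(feature_vector, library, active_list=None):
    """'Score all' when no active list is given; otherwise score only the
    senones named on the active senone list (feedback 450)."""
    ids = active_list if active_list is not None else list(library)
    return {sid: score_one(feature_vector, library[sid]) for sid in ids}
```

In the "score all" case no feedback is required, since the full set of scores is returned for every frame; the active-list case trades that simplicity for fewer per-frame computations.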
- the APU and CPU can be in communication with one another over a Serial Peripheral Interface (SPI) bus, a Peripheral Controller Interface (PCI) bus, an Application Programming Interface (API) bus, an Advanced Microcontroller Bus Architecture High-Performance Bus (AHB), an Advanced Peripheral Bus (APB), a memory bus, or any other type of bus.
- Example, non-limiting embodiments of system bus architectures for speech recognition process 400 of FIG. 4 are described in further detail below.
- FIG. 5 is an illustration of an embodiment of a bus architecture for a speech recognition system 500 .
- Speech recognition system 500 includes an APU 510, a CPU 520, a processor/memory bus 530, a cache 540, a system controller 550, a main memory 560, a plurality of PCI devices 570-1 through 570-M, an Input/Output (I/O) bus 580, and a PCI bridge 590.
- Cache 540 can be, for example, a second-level cache implemented on a Static Random Access Memory (SRAM) device.
- main memory 560 can be, for example, a Dynamic Random Access Memory (DRAM) device.
- Speech recognition system 500 can be implemented as a system-on-chip (SOC), according to an embodiment of the present invention.
- APU 510 is communicatively coupled to I/O bus 580 through PCI bridge 590 .
- I/O bus 580 can be, for example, a PCI bus.
- Through PCI bridge 590 and I/O bus 580, APU 510 is communicatively coupled to system controller 550 and CPU 520.
- APU 510 can be directly coupled to processor/memory bus 530 and, in turn, communicatively coupled to CPU 520 .
- FIG. 6 is an illustration of another embodiment of a bus architecture for a speech recognition system 600 .
- Speech recognition system 600 includes APU 510, CPU 520, cache 540, an AHB 610, a system controller 620, a non-volatile memory device 630, a main memory 640, an APB bridge 650, an APB 660, and a plurality of devices 670-1 through 670-M.
- Non-volatile memory device 630 can be, for example, a Flash memory device.
- Main memory 640 can be, for example, a DRAM device.
- CPU 520 can be, for example, an ARM processor (developed by ARM Holdings plc).
- Speech recognition system 600 can be implemented as an SOC, according to an embodiment of the present invention.
- APU 510 is communicatively coupled to system controller 620 through APB bridge 650 and APB 660 .
- System controller 620 is also communicatively coupled to CPU 520 through AHB 610 .
- FIG. 7 is an illustration of another embodiment of a bus architecture for a speech recognition system 700 .
- Speech recognition system 700 includes APU 510 , CPU 520 , cache 540 , AHB 610 , system controller 620 , non-volatile memory device 630 , a Low Power Double Data Rate (LPDDR) interface 710 , LPDDR memory bus 720 , and a main memory 730 .
- Main memory 730 can be, for example, a DRAM device.
- CPU 520 can be, for example, an ARM processor (developed by ARM Holdings plc).
- Speech recognition system 700 can be implemented as an SOC, according to an embodiment of the present invention.
- APU 510 and main memory 730 are communicatively coupled to LPDDR interface 710 via LPDDR memory bus 720 .
- APU 510 is also communicatively coupled to system controller 620 through LPDDR memory bus 720 and LPDDR interface 710 .
- system controller 620 is communicatively coupled to CPU 520 via AHB 610 .
- FIG. 8 is an illustration of an embodiment of a system-level architecture for a speech recognition system 800 .
- Speech recognition system 800 includes an APU 810 , a memory controller 820 , a non-volatile memory device 830 , and a volatile memory device 840 .
- Memory controller 820 is communicatively coupled to APU 810 via a bus 815 and coupled to non-volatile memory device 830 and volatile memory device 840 via a bus 825 (which may represent two or more buses in certain embodiments).
- APU 810 and memory controller 820 are integrated on a single chip. Alternatively, in an embodiment, APU 810 and memory controller 820 are integrated on separate chips.
- Non-volatile memory device 830 can be a NAND memory module, a NOR memory module, or another type of non-volatile memory device.
- volatile memory device 840 can be a DRAM device.
- APU 810 can communicate with a CPU (not shown in FIG. 8 ) using, for example, one of the bus architectures described above with respect to FIGS. 5-7 , according to an embodiment of the present invention.
- Non-volatile memory device 830 can store an acoustic library to be used in a speech recognition process, in which the acoustic library can include over 1000 senones, according to an embodiment of the present invention.
- memory controller 820 copies the acoustic library from non-volatile memory device 830 to volatile memory device 840 via bus 825 .
- the acoustic library transfer process between the non-volatile and volatile memory devices can be implemented using, for example, a direct memory access (DMA) operation.
- speech recognition system 800 can be powered on in anticipation of a senone scoring request.
- the acoustic library from non-volatile memory device 830 is immediately copied to volatile memory device 840 .
- Once volatile memory device 840 has received the acoustic library, APU 810 is ready to begin processing senone scoring requests (e.g., acoustic modeling process 420 of FIG. 4 ) using the acoustic library stored in volatile memory device 840 .
- When a senone scoring request is received by APU 810 , a selected senone from the acoustic library is copied from volatile memory device 840 to APU 810 via memory controller 820 .
- APU 810 calculates a senone score based on the selected senone and a data stream received by APU 810 (e.g., one or more feature vectors 315 of FIG. 3 ). After completing the calculation, APU 810 transfers the senone score to the requesting system (e.g., the CPU).
- volatile memory device 840 can be powered down after a predetermined time of inactivity (e.g., senone scoring inactivity by APU 810 ).
- volatile memory device 840 can be powered down.
- power efficiency in speech recognition system 800 can be improved, as a periodic refresh of memory cells in volatile memory device 840 will not be required.
- the acoustic library is still stored in non-volatile memory device 830 such that the acoustic library can be retained when volatile memory device 840 is powered down.
- when volatile memory device 840 is powered down, the contents stored therein (e.g., the acoustic library) will be lost.
- the other components of speech recognition system 800 can be powered down as well.
- FIG. 9 is an illustration of an embodiment of a method 900 for data pattern analysis.
- Speech recognition system 800 of FIG. 8 can be used, for example, to perform the steps of method 900 .
- method 900 can be used to perform acoustic modeling process 420 of FIG. 4 .
- a person skilled in the relevant art will recognize that method 900 can be used in other data pattern recognition applications such as, for example, image processing, audio processing, and handwriting recognition.
- a plurality of data patterns is copied from a non-volatile memory device (e.g., non-volatile memory device 830 of FIG. 8 ) to a volatile memory device (e.g., volatile memory device 840 of FIG. 8 ).
- the plurality of data patterns can be one or more senones associated with an acoustic library.
- a data pattern from the volatile memory device is requested by a computational unit (e.g., APU 810 of FIG. 8 ) and transferred to the computational unit via a memory controller and bus (e.g., memory controller 820 and bus 825 , respectively, of FIG. 8 ).
- the requested data pattern is a senone from an acoustic library stored in the volatile memory device.
- the computational unit (e.g., APU 810 of FIG. 8 ) performs a data pattern analysis on a data stream received by the computational unit.
- the data pattern analysis is a senone score calculation based on a selected senone and the data stream received by the computational unit (e.g., one or more feature vectors 315 of FIG. 3 ).
- the computational unit transfers the data pattern analysis result to the requesting system (e.g., the CPU).
- the volatile memory device powers down.
- the volatile memory device powers down after a predetermined time of inactivity (e.g., inactivity in the data pattern analysis by the computational unit).
- power efficiency can be improved, as a periodic refresh of memory cells in the volatile memory device will not be required.
- the other components of the system (e.g., other components of speech recognition system 800 ) can be powered down as well.
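The steps of method 900 can be sketched in software terms as follows. This is an illustrative sketch only: the class and method names, the idle-timeout parameter, and the placeholder analysis function are assumptions, not elements of the patent.

```python
import time

class DataPatternAnalyzer:
    """Illustrative software sketch of method 900; names are assumptions."""

    def __init__(self, non_volatile, idle_timeout_s=5.0):
        self.non_volatile = non_volatile      # pattern_id -> data pattern
        self.volatile = None                  # powered down until power_up()
        self.idle_timeout_s = idle_timeout_s
        self.last_activity = time.monotonic()

    def power_up(self):
        # Copy the plurality of data patterns (e.g., senones) from the
        # non-volatile device to the volatile device.
        self.volatile = dict(self.non_volatile)

    def analyze(self, pattern_id, data_stream):
        # Fetch the requested pattern from volatile memory, perform the
        # data pattern analysis, and return the result to the requester.
        if self.volatile is None:
            self.power_up()
        self.last_activity = time.monotonic()
        pattern = self.volatile[pattern_id]
        # Placeholder analysis: an L1 distance between pattern and stream.
        return sum(abs(a - b) for a, b in zip(pattern, data_stream))

    def maybe_power_down(self):
        # After a predetermined idle period, discard the volatile copy;
        # the library is retained in non-volatile memory.
        if time.monotonic() - self.last_activity > self.idle_timeout_s:
            self.volatile = None
```

Because the volatile copy is rebuilt from non-volatile memory on the next request, powering down costs nothing but the re-copy latency.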
- FIG. 10 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1000 .
- Speech recognition system 1000 includes an APU 1010 , a SOC 1040 , a DRAM device 1060 , a Flash memory device 1070 , and an I/O interface 1080 .
- APU 1010 is an integrated chip that includes a memory device 1020 configured to store an acoustic library and an Application-Specific Integrated Circuit (ASIC) 1030 configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4 ).
- ASIC 1030 and memory device 1020 can be integrated on two separate chips.
- SOC 1040 includes a CPU 1050 configured to perform a signal processing process, a phoneme evaluation process, and a word modeling process (e.g., signal processing process 410 , phoneme evaluation process 430 , and word modeling process 440 , respectively, of FIG. 4 ), according to an embodiment of the present invention.
- APU 1010 and SOC 1040 are integrated on two separate chips.
- FIG. 11 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1100 .
- Speech recognition system 1100 includes an APU 1110 , SOC 1040 , DRAM device 1060 , Flash memory device 1070 , and I/O interface 1080 .
- APU 1110 is an integrated chip that includes an ASIC 1120 , a volatile memory device 1130 , and a non-volatile memory device 1140 .
- ASIC 1120 , volatile memory device 1130 , and non-volatile memory device 1140 can be integrated on two chips—e.g., ASIC 1120 and memory device 1130 on one chip with non-volatile memory device 1140 on another chip; ASIC 1120 on one chip with volatile memory device 1130 and non-volatile memory device 1140 on another chip; or, ASIC 1120 and non-volatile memory device 1140 on one chip with volatile memory device 1130 on another chip.
- ASIC 1120 , volatile memory device 1130 , and non-volatile memory device 1140 can each be integrated on a separate chip—i.e., three separate chips.
- Non-volatile memory device 1140 can be configured to store an acoustic model that is copied to volatile memory device 1130 upon power-up of APU 1110 , according to an embodiment of the present invention.
- non-volatile memory device 1140 can be a Flash memory device and volatile memory device 1130 can be a DRAM device.
- ASIC 1120 can be configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4 ), according to an embodiment of the present invention.
- FIG. 12 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1200 .
- Speech recognition system 1200 includes DRAM device 1060 , Flash memory device 1070 , I/O interface 1080 , a memory device 1210 , and an SOC 1220 .
- SOC 1220 is an integrated chip that includes an ASIC 1230 and a CPU 1240 .
- ASIC 1230 can be configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4 ), according to an embodiment of the present invention.
- CPU 1240 can be configured to perform a signal processing process, a phoneme evaluation process, and a word modeling process (e.g., signal processing process 410 , phoneme evaluation process 430 , and word modeling process 440 , respectively, of FIG. 4 ), according to an embodiment of the present invention.
- Memory device 1210 can be configured to store an acoustic library and to transfer one or more senones to ASIC 1230 via an I/O bus 1215 , according to an embodiment of the present invention.
- memory device 1210 can be a DRAM device or a Flash memory device.
- the acoustic library can be stored in a memory device located within ASIC 1230 (not shown in FIG. 12 ) rather than memory device 1210 .
- the acoustic library can be stored in system memory for SOC 1220 (e.g., DRAM device 1060 ).
- FIG. 13 is another illustration of an embodiment of a system-level architecture for a speech recognition system 1300 .
- Speech recognition system 1300 includes DRAM device 1060 , Flash memory device 1070 , I/O interface 1080 , a memory device 1210 , and an SOC 1220 .
- DRAM device 1060 can be configured to store an acoustic library and to transfer one or more senones to ASIC 1230 via an I/O bus 1315 , according to an embodiment of the present invention.
- FIG. 14 is an illustration of an embodiment of an APU 1400 .
- APU 1400 is an integrated chip that includes a memory module 1420 and a Senone Scoring Unit (SSU) 1430 .
- memory module 1420 and SSU 1430 can be integrated on two separate chips.
- APU 1400 is in communication with a CPU (not shown in FIG. 14 ) via I/O signals 1410 , in which APU 1400 is configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4 ), according to an embodiment of the present invention.
- I/O signals 1410 can include an input feature vector data line for feature vector information, an input clock signal, an input APU enable signal, an output senone score data line for senone score information, and other I/O control signals for APU 1400 .
- APU 1400 can be configured to receive one or more feature vectors (calculated by the CPU) via the feature vector data line from the CPU and to transmit a senone score via the senone score data line to the CPU for further processing, according to an embodiment of the present invention.
- I/O signals 1410 can be implemented as, for example, an SPI bus, a PCI bus, an API bus, an AHB, an APB, a memory bus, or any other type of bus to provide a communication path between APU 1400 and the CPU (see, e.g., FIGS. 5-7 and associated description).
- An interface between APU 1400 and the CPU, as well as control signals for the interface, are described in further detail below.
- memory module 1420 and SSU 1430 can operate in two different clock domains.
- Memory module 1420 can operate at the clock frequency associated with the input clock signal to APU 1400 (e.g., from I/O signals 1410 ) and SSU 1430 can operate at a faster clock frequency based on the input clock signal, according to an embodiment of the present invention. For example, if the clock frequency associated with the input clock signal is 12 MHz, then SSU 1430 can operate at a clock-divided frequency of 60 MHz—five times faster than the clock frequency associated with the input clock signal. Techniques and methods for implementing clock dividers are known to a person skilled in the relevant art. As will be described in further detail below, the architecture of SSU 1430 can be based on the clock domain at which it operates.
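The relationship between the two clock domains can be checked with simple arithmetic; the helper below is an illustrative sketch (the function name is ours), using the 12 MHz / 60 MHz figures from the example above.

```python
def ssu_cycles_per_memory_cycle(memory_hz, ssu_hz):
    # Number of SSU clock cycles that elapse in one memory-module
    # clock cycle; this sketch assumes the SSU clock is an integer
    # multiple of the input clock.
    assert ssu_hz % memory_hz == 0
    return ssu_hz // memory_hz

cycles = ssu_cycles_per_memory_cycle(12_000_000, 60_000_000)  # -> 5
```

That 5:1 ratio is what lets SSU 1430 complete multi-cycle work within a single memory-module clock cycle, as described further below.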
- memory module 1420 includes a bus controller 1422 , a memory controller 1424 , a memory device 1426 , and a bridge controller 1428 .
- Memory device 1426 is configured to store an acoustic model to be used in a speech recognition process.
- memory device 1426 can be a non-volatile memory device such as, for example, a Flash memory device.
- the acoustic library can be pre-loaded into the non-volatile memory device prior to operation of APU 1400 (e.g., during manufacturing and/or testing of APU 1400 ).
- memory device 1426 can be a volatile memory device such as, for example, a DRAM device.
- memory controller 1424 can copy the acoustic library from a non-volatile memory device (either integrated on the same chip as APU 1400 or located external to APU 1400 ) to the volatile memory device.
- the acoustic library transfer process between the non-volatile and volatile memory devices can be implemented using, for example, a DMA operation.
- Bus controller 1422 is configured to control data transfer between APU 1400 and an external CPU. In an embodiment, bus controller 1422 can control the receipt of feature vectors from the CPU and the transmission of senone scores from APU 1400 to the CPU. In an embodiment, bus controller 1422 is configured to transfer one or more feature vectors from the CPU to bridge controller 1428 , which serves as an interface between memory module 1420 and SSU 1430 . In turn, bridge controller 1428 transfers the one or more feature vectors to SSU 1430 for further processing. Upon calculation of a senone score, the senone score is transferred from SSU 1430 to memory module 1420 via bridge controller 1428 , according to an embodiment of the present invention.
- bus controller 1422 can receive a control signal (via I/O signals 1410 ) that provides an active senone list.
- the active senone list can be transferred to APU 1400 as a result of the phoneme evaluation process performed by the CPU (e.g., phoneme evaluation process 430 of FIG. 4 ). That is, in an embodiment, a feedback process can occur between the acoustic modeling process performed by APU 1400 and the phoneme evaluation process performed by the CPU (e.g., feedback 450 of FIG. 4 ).
- the active senone list can be used in senone score calculations for incoming feature vectors into APU 1400 , according to an embodiment of the present invention.
- the active senone list indicates one or more senones stored in memory device 1426 to be used in a senone score calculation.
- the active senone list can include a base address associated with an address space of memory device 1426 and a list of indices related to the base address at which the one or more senones are located in memory device 1426 .
- Bus controller 1422 can send the active senone list to SSU 1430 via bridge controller 1428 , in which SSU 1430 is in communication with memory device 1426 (via memory controller 1424 ) to access the one or more senones associated with the active senone list.
- bus controller 1422 can receive a control signal (via I/O signals 1410 ) that instructs APU 1400 to perform the senone score calculation using all of the senones contained in the acoustic library (e.g., “score all” function).
- Bus controller 1422 sends the “score all” instruction to SSU 1430 via bridge controller 1428 , in which SSU 1430 is in communication with memory device 1426 (via memory controller 1424 ) to access all of the senones associated with the acoustic library.
- Conventional speech recognition systems typically incorporate a feedback loop between acoustic modeling and phoneme evaluation modules (e.g., acoustic modeling module 320 and phoneme evaluation module 330 of FIG. 3 ) within the CPU to limit the number of senones used in senone score calculations. This is because, as discussed above with respect to speech recognition system 300 of FIG. 3 , significant computing resources are dedicated to the acoustic modeling process where thousands of senones can be compared to a feature vector. This places a significant load on the CPU and the bandwidth of the data bus (e.g., data 240 of FIG. 3 ) transferring the senones from the memory device (e.g., memory device 230 of FIG. 3 ) to the CPU.
- active senone lists are used to limit the impact of the acoustic modeling process on the CPU.
- the use of active senone lists by the CPU can place limitations on the ability to process incoming voice signals in real-time or substantially close to real time.
- the “score all” function of APU 1400 not only alleviates the load on the CPU and the bandwidth of the data bus, but also provides processing of incoming voice signals in real-time or substantially close to real time.
- features of APU 1400 such as, for example, the bus width of data bus 1427 and the architecture of distance calculator 1436 of FIG. 14 provide a system for real-time or substantially close to real time speech recognition.
- SSU 1430 includes an output buffer 1432 , an SSU control module 1434 , a feature vector matrix module 1435 , a distance calculator 1436 , and an addition module 1438 .
- SSU 1430 is configured to calculate a Mahalanobis distance between one or more feature vectors and one or more senones stored in memory device 1426 , according to an embodiment of the present invention.
- Each of the one or more feature vectors can be composed of N dimensions, where N can equal, for example, 39.
- each of the N dimensions in the one or more feature vectors can be a 16-bit mean value.
- each of the one or more senones stored in memory device 1426 is composed of one or more Gaussian probability distributions, where each of the one or more Gaussian probability distributions has the same number of dimensions as each of the one or more feature vectors (e.g., N dimensions).
- Each of the one or more senones stored in memory device 1426 can have, for example, 32 Gaussian probability distributions.
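As a rough software analogue of the computation SSU 1430 performs in hardware, a senone score over diagonal-covariance Gaussians might be sketched as follows. This is a simplified sketch under stated assumptions: the function names are ours, the mixture weighting and floating-point math stand in for the fixed-point hardware datapath, and log-domain tricks used in practice are omitted.

```python
import math

def gaussian_distance(feature, means, variances):
    """Mahalanobis-style distance between an N-dimensional feature
    vector and one diagonal-covariance Gaussian probability
    distribution (fixed-point hardware details omitted)."""
    return sum((x - m) ** 2 / v for x, m, v in zip(feature, means, variances))

def senone_score(feature, gaussians, weights):
    """Score one senone as a weighted mixture over its Gaussian
    probability distributions (e.g., 32 Gaussians of N = 39
    dimensions each)."""
    return sum(
        w * math.exp(-0.5 * gaussian_distance(feature, means, variances))
        for w, (means, variances) in zip(weights, gaussians)
    )
```

Each Gaussian contributes one distance term; the per-Gaussian weighting corresponds to the Gaussian weighting factor handled by accumulator 1640 described below.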
- SSU control module 1434 is configured to receive a clock signal from memory module 1420 via bridge controller 1428 .
- the frequency of the clock signal received by SSU control module 1434 can be the same or substantially the same as the clock frequency associated with the input clock signal to APU 1400 (e.g., input clock signal from I/O signals 1410 ), according to an embodiment of the present invention.
- SSU control module 1434 can divide the frequency of its incoming clock signal and distribute that divided clock signal to other components of SSU 1430 —e.g., output buffer 1432 , feature vector matrix module 1435 , distance calculator 1436 , and addition module 1438 —such that these other components operate at the clock-divided frequency. For example, if the clock frequency associated with the input clock signal (e.g., from I/O signals 1410 ) is 12 MHz, then SSU control module 1434 can receive the same or substantially the same clock signal from bridge controller 1428 and divide that clock frequency using known clock-dividing techniques and methods to a frequency of, for example, 60 MHz. SSU control module 1434 can distribute this clock-divided signal to the other components of SSU 1430 such that these other components operate at, for example, 60 MHz—five times faster than the clock frequency associated with the input clock signal.
- the clock signals distributed from SSU control module 1434 to the other components of SSU 1430 are not illustrated in FIG. 14 .
- the frequency associated with this clock signal is also referred to herein as the “SSU clock frequency.”
- the frequency associated with the input clock signal to SSU control module 1434 is also referred to herein as the “memory module clock frequency.”
- FIG. 15 is an illustration of an embodiment of SSU control module 1434 .
- SSU control module 1434 includes an input buffer 1510 and a control unit 1520 .
- SSU control module 1434 is configured to receive one or more control signals from memory module 1420 via bridge controller 1428 .
- the one or more control signals can be associated with I/O signals 1410 and with control information associated with a Gaussian probability distribution outputted by memory device 1426 .
- the control signals associated with I/O signals 1410 can include, for example, an active senone list and a “score all” function.
- the control information associated with the Gaussian probability distribution can include, for example, address information for a subsequent Gaussian probability distribution to be outputted by memory device 1426 .
- when bus controller 1422 receives an active senone list via I/O signals 1410 , the base address associated with the address space of memory device 1426 and the list of indices related to the base address at which the one or more senones are located in memory device 1426 can be stored in input buffer 1510 of FIG. 15 .
- Control unit 1520 is in communication with input buffer 1510 to monitor the list of the senones to be applied by distance calculator 1436 of FIG. 14 in the senone score calculation.
- the active senone list can contain a base address associated with an address space of memory device 1426 and 100 indices pointing to 100 senones stored in memory device 1426 .
- the indices can refer to pointers or memory address offsets in reference to the base address associated with the address space of memory device 1426 .
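The base-plus-index addressing described above can be sketched as follows. The function name and the senone-sized stride are illustrative assumptions; the text does not specify how indices map to byte offsets.

```python
def senone_addresses(base_address, indices, senone_stride):
    """Resolve active-senone-list indices into memory addresses: each
    index is an offset relative to the base address of the acoustic
    library in memory device 1426 (stride units are an assumption)."""
    return [base_address + i * senone_stride for i in indices]

# E.g., a library at base 0x1000 with a hypothetical 3072-byte senone:
addrs = senone_addresses(0x1000, [0, 2, 5], 3072)
```

Control unit 1520 would then walk this resolved list, issuing one senone access per entry.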
- a senone can be composed of one or more Gaussian probability distributions, where each of the one or more Gaussian probability distributions has the same number of dimensions as each of one or more feature vectors (e.g., N dimensions) received by APU 1400 .
- each senone stored in memory device 1426 is composed of 32 Gaussian probability distributions. Based on the description herein, a person skilled in the relevant art will understand that each of the senones can be composed of more or fewer than 32 Gaussian probability distributions.
- control unit 1520 communicates with memory controller 1424 of FIG. 14 to access the first senone in memory device 1426 based on the base address and the first index information contained in the active senone list.
- the senone associated with the first index can include memory address information of the first 2 Gaussian probability distributions associated with that senone, according to an embodiment of the present invention.
- memory device 1426 accesses two Gaussian probability distributions associated with the first senone in, for example, a sequential manner. For example, memory device 1426 accesses the first Gaussian probability distribution and outputs this Gaussian probability distribution to distance calculator 1436 via data bus 1427 . As memory device 1426 outputs the first Gaussian probability distribution, memory device 1426 can also access the second Gaussian probability distribution.
- the second Gaussian probability distribution can include memory address information for a third Gaussian probability distribution to be accessed by memory device 1426 .
- Memory device 1426 can communicate this memory address information to control unit 1520 of FIG. 15 via bridge controller 1428 of FIG. 14 . Control unit 1520 , in turn, communicates with memory controller 1424 of FIG. 14 to access the third Gaussian probability distribution.
- the second Gaussian probability distribution can be outputted to distance calculator 1436 via data bus 1427 .
- This iterative, overlapping process of accessing a subsequent Gaussian probability distribution while outputting a current Gaussian probability distribution is performed for all of the Gaussian probability distributions associated with the senone (e.g., for all of the 32 Gaussian probability distributions associated with the senone).
- a benefit, among others, of the iterative, overlapping (or parallel) processing is faster performance in senone score calculations.
- Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 such that the memory access and transfer process occurs in a pipeline manner, according to an embodiment of the present invention. After the 32 Gaussian probability distributions associated with the first senone are outputted to distance calculator 1436 of FIG. 14 , control unit 1520 repeats the above process for the one or more remaining senones in the active senone list.
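The overlapping access pattern — fetching Gaussian k+1 while streaming Gaussian k to the distance calculator — is a classic two-stage pipeline. A software sketch (the callback names are assumptions; in hardware the two stages overlap in time rather than running sequentially):

```python
def stream_gaussians(fetch, emit, num_gaussians=32):
    """Two-stage pipeline sketch: access the next Gaussian probability
    distribution while the current one is output to the distance
    calculator. 'fetch' and 'emit' are illustrative callbacks."""
    current = fetch(0)                  # prime the pipeline
    for k in range(1, num_gaussians):
        next_dist = fetch(k)            # access Gaussian k ...
        emit(current)                   # ... while Gaussian k-1 is output
        current = next_dist
    emit(current)                       # drain the final Gaussian
```

Every Gaussian is emitted exactly once and in order, but each memory access after the first is hidden behind the output of its predecessor, which is the source of the speed-up noted above.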
- memory module 1420 can receive a control signal via I/O signals 1410 that indicates that the active senone list from the current feature vector is to be used in senone score calculations for a subsequent feature vector, according to an embodiment of the present invention.
- SSU control module 1434 uses the same active senone list from the current feature vector in the senone score calculations for the subsequent feature vector.
- control unit 1520 of FIG. 15 applies the same base address and list of indices related to the base address stored in input buffer 1510 to the subsequent feature vector.
- Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 for the subsequent feature vector in a similar manner as described above with respect to the active senone list example.
- memory module 1420 can receive a control signal via I/O signals 1410 that indicates a “score all” operation.
- the “score all” function refers to an operation where a feature vector is compared to all of the senones contained in an acoustic library stored in memory device 1426 .
- control unit 1520 of FIG. 15 communicates with memory controller 1424 of FIG. 14 to access a first senone in memory device 1426 .
- the first senone can be, for example, located at a beginning memory address associated with an address space of memory device 1426 .
- the first senone in memory device 1426 can include memory address information of the first 2 Gaussian probability distributions associated with that senone, according to an embodiment of the present invention.
- memory device 1426 accesses two Gaussian probability distributions associated with the first senone in, for example, a sequential manner.
- the second Gaussian probability distribution can include memory address information on a third Gaussian probability distribution to be accessed by memory device 1426 .
- Memory device 1426 can communicate this memory address information to control unit 1520 of FIG. 15 via bridge controller 1428 of FIG. 14 .
- Control unit 1520 communicates with memory controller 1424 of FIG. 14 to access the third Gaussian probability distribution.
- the second Gaussian probability distribution can be outputted to distance calculator 1436 via data bus 1427 .
- This iterative, overlapping process of accessing a subsequent Gaussian probability distribution while outputting a current Gaussian probability distribution is performed for all of the Gaussian probability distributions associated with the senone (e.g., for all of the 32 Gaussian probability distributions associated with the senone).
- Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 such that the memory access and transfer process occurs in a pipeline manner, according to an embodiment of the present invention. After the Gaussian probability distributions associated with the first senone are outputted to distance calculator 1436 of FIG. 14 , control unit 1520 repeats the above process for the one or more remaining senones in the acoustic library.
- feature vector matrix module 1435 is used for speaker adaptation in APU 1400 .
- feature vector matrix module 1435 receives a feature vector transform matrix (FVTM) from the CPU via I/O signals 1410 .
- the FVTM can be loaded into feature vector matrix module 1435 periodically such as, for example, once per utterance.
- the FVTM can be stored in a Static Random Access Memory (SRAM) device located within feature vector matrix module 1435 .
- an index can also be stored for each senone, in which the index points to a row in the FVTM, according to an embodiment of the present invention.
- the number of rows in the FVTM can vary (e.g., 10, 50, or 100 rows) and can be specific to a voice recognition system implementing APU 1400 .
- Each row in the FVTM can have the same number of entries as the N dimensions of a feature vector (e.g., 39), where each entry is a scaling factor that is multiplied with its corresponding feature vector dimension to produce a new feature vector, according to an embodiment of the present invention.
- the selected row from the FVTM (e.g., row of 39 scaling factors) is transferred to distance calculator 1436 via data bus 1439 , in which distance calculator 1436 performs the multiplication operation to generate the new feature vector, as will be described in further detail below.
- SSU control module 1434 provides a feature vector received from the CPU and an index associated with a senone to feature vector matrix module 1435 .
- the index indicates a particular row in the FVTM for scaling the feature vector.
- the FVTM can have 100 rows and the index can be equal to 10.
- the 10th row of the FVTM contains 39 scaling factors, in which the row of scaling factors is transferred to distance calculator 1436 to generate the new feature vector.
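The row-selection and scaling step can be sketched as follows; the function name is an assumption, and a 3-dimensional example stands in for the 39-dimensional case.

```python
def scale_feature_vector(feature, fvtm, row_index):
    """Select the indexed row of scaling factors from the feature
    vector transform matrix (FVTM) and multiply element-wise to
    produce the new, speaker-adapted feature vector."""
    row = fvtm[row_index]
    assert len(row) == len(feature)   # e.g., 39 entries per row
    return [x * s for x, s in zip(feature, row)]

# A senone whose stored index is 1 selects the second FVTM row:
fvtm = [[1.0, 1.0, 1.0], [0.5, 2.0, 1.0]]
new_fv = scale_feature_vector([2.0, 3.0, 4.0], fvtm, 1)  # -> [1.0, 6.0, 4.0]
```

In APU 1400 this multiplication is not precomputed per row; it is folded into the distance calculation itself, as discussed below.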
- distance calculator 1436 is configured to calculate a distance between one or more dimensions of a senone stored in memory device 1426 and a corresponding one or more dimensions of a feature vector.
- FIG. 16 is an illustration of an embodiment of distance calculator 1436 .
- Distance calculator 1436 includes a datapath multiplexer (MUX) 1610 , a feature vector buffer 1620 , arithmetic logic units (ALUs) 1630 1 - 1630 8 , and an accumulator 1640 .
- Datapath MUX 1610 is configured to receive a Gaussian probability distribution from memory device 1426 of FIG. 14 via data bus 1427 .
- the width of data bus 1427 is equal to the number of bits associated with one Gaussian probability distribution. For example, if one Gaussian probability distribution is 768 bits, then the width of data bus 1427 is also 768 bits. Across the dimensions of the Gaussian probability distribution, those 768 bits can be allocated to a 16-bit mean value, a 16-bit variance value, and other attributes per dimension. As discussed above, the Gaussian probability distribution can have the same number of dimensions as a feature vector—e.g., 39 dimensions. In another embodiment, the width of data bus 1427 can be greater than 256 bits.
- memory device 1426 and distance calculator 1436 can be integrated on the same chip, where data bus 1427 is a wide bus (of the width discussed above) integrated on the chip to provide data transfer of the Gaussian probability distribution from memory device 1426 to distance calculator 1436 .
- memory device 1426 and distance calculator 1436 can be integrated on two separate chips, where data bus 1427 is a wide bus (of the width discussed above) that is tightly coupled between the two chips such that degradation of data due to noise and interconnect parasitic effects is minimized.
- a benefit of a wide data bus 1427 (of the width discussed above), among others, is to increase performance of APU 1400 in the calculation of senone scores.
- Datapath MUX 1610 is also configured to receive one or more control signals and a feature vector from SSU control module 1434 via data bus 1437 , as well as feature vector scaling factors from feature vector buffer 1620 .
- feature vector buffer 1620 can be configured to store scaling factors (associated with a selected row of the FVTM) transferred from feature vector matrix module 1435 via data bus 1439 .
- feature vector buffer 1620 can be configured to store the FVTM.
- one or more control signals from SSU control module 1434 via data bus 1437 can be used to select the FVTM row.
- Datapath MUX 1610 outputs the feature vector, selected feature vector scaling factors from the FVTM, and Gaussian probability distribution information to ALUs 1630 1 - 1630 8 via data bus 1612 for further processing.
- datapath MUX 1610 is also configured to receive a Gaussian weighting factor from the one or more control signals from SSU control module 1434 via data bus 1437 .
- Datapath MUX 1610 is configured to output the Gaussian weighting factor to accumulator 1640 for further processing.
- each of ALUs 1630 1 - 1630 8 is configured, per SSU clock cycle, to calculate a distance score between a dimension of a Gaussian probability distribution received from datapath MUX 1610 and a corresponding dimension of a feature vector, according to an embodiment of the present invention.
- ALUs 1630 1 - 1630 8 can operate at the SSU clock frequency (e.g., 5 times faster than the memory module clock frequency) such that, for every read operation from memory device 1426 of FIG. 14 , a distance score associated with a Gaussian probability distribution (also referred to herein as a “Gaussian distance score”) is outputted from distance calculator 1436 to addition module 1438 .
- datapath MUX 1610 is configured to distribute feature vector information associated with one dimension, a mean value associated with a corresponding dimension of a Gaussian probability distribution, a variance value associated with the corresponding dimension of the Gaussian probability distribution, and feature vector scaling factors to each of ALUs 1630 1 - 1630 8 .
- each of ALUs 1630 1 - 1630 8 is configured to generate a new feature vector by multiplying dimensions of the feature vector by respective scaling factors.
- the multiplication of the feature vector dimensions by the corresponding scaling factors is performed “on-the-fly,” meaning that the multiplication operation is performed during the calculation of the distance score.
- This is in contrast to the multiplication operation being performed for each of the rows in a FVTM, with the results of the multiplication operation stored in memory to be later accessed by each of ALUs 1630 1 - 1630 8 .
- a benefit of the “on-the-fly” multiplication operation, among others, is that memory storage is not required for the results of the multiplication operation associated with non-indexed (or non-selected) rows of the FVTM. This, in turn, results in a faster generation of the new feature vector since additional clock cycles are not required to store the feature vector scaling results associated with the non-indexed rows in memory and also results in a smaller die size area for ALUs 1630 1 - 1630 8 .
- each of ALUs 1630 1 - 1630 8 is configured to calculate a distance score based on a feature vector dimension and a corresponding Gaussian probability distribution dimension per SSU clock cycle, according to an embodiment of the present invention. Cumulatively, in one clock cycle, ALUs 1630 1 - 1630 8 generate distance scores for 8 dimensions (i.e., 1 dimension calculation per ALU).
- the architecture and operation of the ALU is described in further detail below.
- the number of ALUs in distance calculator 1436 can be dependent on the SSU clock frequency and the memory module clock frequency discussed above such that distance calculator 1436 outputs a distance score for one Gaussian probability distribution for every read access to memory device 1426 , according to an embodiment of the present invention.
- the memory module clock frequency can have an operating frequency of 12 MHz, where memory device 1426 also operates at 12 MHz (e.g., for a read access of approximately 83 ns).
- SSU 1430 can have an SSU clock frequency of, for example, 60 MHz to operate five times faster than the memory module clock frequency.
- a Gaussian distance score for one Gaussian probability distribution can be calculated in 5 SSU clock cycles or 1 memory module clock cycle.
- the 5 SSU clock cycles is a predetermined number of clock cycles that corresponds to 1 memory module clock cycle, such that, as one Gaussian probability distribution is read from memory device 1426 during 1 memory module clock cycle, a Gaussian distance score for another Gaussian probability distribution is calculated by accumulator 1640 .
- a portion of ALUs 1630 1 - 1630 8 can be activated on a rising edge of an SSU clock cycle, while the remaining portion of ALUs 1630 1 - 1630 8 can be activated on a falling edge of the SSU clock cycle.
- ALUs 1630 1 - 1630 4 can be activated on the rising edge of the SSU clock cycle and ALUs 1630 5 - 1630 8 can be activated on the falling edge of the SSU clock cycle.
- distance calculator 1436 is not limited to the above example. Rather, as would be understood by a person skilled in the relevant art, distance calculator 1436 can operate at a clock frequency faster or slower than 60 MHz, and distance calculator 1436 can include more or fewer than 8 ALUs.
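The sizing relationship described above can be sketched numerically. This is an illustrative back-of-the-envelope calculation, not text from the description; the 39-dimension figure is taken from the Set Feature discussion later in the document, and the 60 MHz / 12 MHz clocks from the example above.

```python
# Hypothetical sizing sketch: given the clock ratio between the SSU and
# the memory module, estimate how many ALUs are needed so that one
# Gaussian probability distribution is scored per memory read.
import math

def alus_required(n_dims, ssu_mhz, mem_mhz):
    """Each ALU scores one dimension per SSU cycle, so all dimensions of
    a Gaussian must be covered within the SSU cycles that fit inside one
    memory module cycle."""
    ssu_cycles_per_read = ssu_mhz // mem_mhz        # e.g., 60 / 12 = 5
    return math.ceil(n_dims / ssu_cycles_per_read)  # ALUs working in parallel

print(alus_required(39, 60, 12))  # -> 8, matching ALUs 1630_1-1630_8
```

With these assumed numbers, 8 ALUs over 5 SSU cycles yield 40 dimension scores per memory read, which covers a 39-dimensional Gaussian with one cycle of slack.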
- accumulator 1640 is configured to receive the outputs from each of ALUs 1630 1 - 1630 8 and the Gaussian weighting factor from datapath MUX 1610 (via data bus 1614 ). As discussed above, in an embodiment, for every SSU clock cycle, a distance score for a Gaussian probability distribution dimension is outputted by each of ALUs 1630 1 - 1630 8 . These distance scores from each of ALUs 1630 1 - 1630 8 are stored and accumulated by accumulator 1640 to generate a distance score for the Gaussian probability distribution, or Gaussian distance score; e.g., accumulator 1640 adds the respective distance scores calculated by ALUs 1630 1 - 1630 8 per SSU clock cycle.
- accumulator 1640 multiplies the total sum by the Gaussian weighting factor to generate a weighted Gaussian distance score.
- the Gaussian weighting factor is optional, where accumulator 1640 outputs the Gaussian distance score.
- the Gaussian weighting factor is specific to each Gaussian and is stored in memory device 1426 .
- Addition module 1438 is configured to add one or more Gaussian distance scores (or weighted Gaussian distance scores) to generate a senone score.
- each senone can be composed of one or more Gaussian probability distributions, in which each Gaussian probability distribution can be associated with a Gaussian distance score.
- addition module 1438 sums the Gaussian distance scores associated with all of the Gaussian probability distributions to generate the senone score.
- addition module 1438 is configured to perform the summation operation in the log domain to generate the senone score.
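The document does not spell out the log-domain summation used by addition module 1438; a common way to sum mixture-component probabilities when each score is stored as a log-probability is the numerically stable "log-add" (log-sum-exp) recurrence, sketched below as an assumption rather than the patented circuit.

```python
# Illustrative log-domain summation of per-Gaussian log scores into a
# senone score, assuming scores are natural-log probabilities.
import math
from functools import reduce

def log_add(log_a, log_b):
    # log(exp(log_a) + exp(log_b)), computed without overflow by
    # factoring out the larger term
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

def senone_score(gaussian_log_scores):
    # fold the per-Gaussian scores together, as addition module 1438 would
    return reduce(log_add, gaussian_log_scores)

print(senone_score([math.log(0.25), math.log(0.25)]))  # -> log(0.5)
```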
- Output buffer 1432 is configured to receive a senone score from addition module 1438 and transfer the senone score to bridge controller 1428 .
- Bridge controller 1428 transfers the senone score to the external CPU via bus controller 1422 .
- output buffer 1432 can include a plurality of memory buffers such that, as a first senone score in a first memory buffer is being transferred to bridge controller 1428 , a second senone score generated by addition module 1438 can be transferred to a second memory buffer for a subsequent transfer to bridge controller 1428 .
- FIG. 17 is an illustration of an embodiment of a method 1700 for acoustic modeling. The steps of method 1700 can be performed using, for example, APU 1400 of FIG. 14 .
- a plurality of Gaussian probability distributions is received via a data bus having a width of at least one Gaussian probability distribution, and a feature vector is received from an external computing device.
- the Gaussian probability distribution can be composed of, for example, 768 bits, where the width of the data bus is at least 768 bits.
- APU 1400 of FIG. 14 can receive the feature vector from the external computing device (e.g., a CPU in communication with APU 1400 via I/O signals 1410 of FIG. 14 ).
- information associated with a plurality of dimensions of the feature vector, a plurality of mean values associated with the corresponding plurality of dimensions of the at least one Gaussian probability distribution, and a plurality of variance values associated with the corresponding plurality of dimensions of the at least one Gaussian probability distribution are distributed to, for example, arithmetic logic units (e.g., ALUs 1630 1 - 1630 8 of FIG. 16 ).
- a plurality of dimension distance scores is calculated based on a plurality of dimensions of the feature vector and a corresponding plurality of dimensions of the at least one Gaussian probability distribution.
- the distance score calculations are based on at least one senone from an active senone list.
- the active senone list can include a base address associated with an address space of a memory device and one or more indices related to the base address at which the at least one senone is located in the memory device.
- a plurality of scaling factors for the plurality of dimensions of the feature vector are stored, where the plurality of scaling factors are applied to the plurality of dimensions of the feature vector during the calculation of the plurality of dimension distance scores.
- Step 1720 can be performed by, for example, distance calculator 1436 of FIG. 14 .
- the plurality of dimension distance scores are summed to generate a Gaussian distance score for the at least one Gaussian probability distribution.
- the Gaussian distance score is generated over a predetermined number of senone scoring unit (SSU) clock cycles.
- the predetermined number of SSU clock cycles can equate to a read access time of the at least one Gaussian probability distribution from a memory device.
- Step 1730 can be performed by, for example, distance calculator 1436 of FIG. 14 .
- in step 1740 , a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions is summed to generate a senone score.
- Step 1740 can be performed by, for example, addition module 1438 of FIG. 14 .
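The flow of steps 1710 through 1740 can be illustrated with a plain floating-point sketch. This is a software analogue only (the hardware works in fixed point, and weighting factors and log-domain accumulation are omitted); function names and the simple sum in the last step are assumptions for illustration.

```python
# Illustrative software analogue of method 1700: per-dimension distance
# scores (step 1720) are summed into a Gaussian distance score (step
# 1730), and the per-Gaussian scores are summed into a senone score
# (step 1740).
def gaussian_distance(x, mean, var):
    # one dimension distance per (xi, mi, vi) triple, then accumulate
    return sum(((xi - mi) ** 2) / vi for xi, mi, vi in zip(x, mean, var))

def senone_distance(x, gaussians):
    # gaussians: list of (mean_vector, variance_vector) pairs
    return sum(gaussian_distance(x, m, v) for m, v in gaussians)

x = [1.0, 2.0]
gaussians = [([1.0, 2.0], [1.0, 1.0]), ([0.0, 0.0], [1.0, 1.0])]
print(senone_distance(x, gaussians))  # -> 5.0
```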
- Embodiments of the present invention address and solve the issues discussed above with respect to conventional speech recognition system 200 of FIG. 3 .
- the acoustic modeling process is performed by, for example, APU 1400 of FIG. 14 .
- the APU operates in conjunction with a CPU, in which the APU can receive one or more feature vectors (e.g., feature vectors 315 of FIG. 3 ) from the CPU, calculate a senone score (e.g., senone score 325 of FIG. 3 ) based on one or more Gaussian probability distributions, and output the senone score to the CPU.
- the one or more Gaussian probability distributions can be stored in the APU.
- the one or more Gaussian probability distributions can be stored externally to the APU, in which the APU receives the one or more Gaussian probability distributions from an external memory device. Based on embodiments of the APU architecture described above, an accelerated calculation for the senone score is achieved.
- FIG. 18 is a block diagram of an ALU 1800 , according to an embodiment of the present invention.
- ALU 1800 is configured to compute a one-dimensional distance score between a feature vector and a Gaussian probability distribution vector.
- ALU 1800 can be configured to compute the one-dimensional distance score as score ij = M 2 ·[(x i − μ ij )/var ij ] 2 + M 1 ·(ln(var ij ) − C), referred to as Equation (1), where:
- var ij is the variance value of the i th dimension of the j th Gaussian probability distribution vector
- M 1 and M 2 are scaling factors
- C is a constant
- x i is the value of the feature vector in the ith dimension
- ⁇ ij is the mean value of the ith dimension of the jth Gaussian probability distribution vector.
- the one-dimensional distance score output by ALU 1800 is dependent on three variables: x i , ⁇ ij , and var ij .
- One technique for implementing this equation in software is to generate a look up table (LUT) that is indexed with these three variables.
- this LUT can be further simplified into a two-dimensional LUT indexed by (x i − μ ij ) and var ij , since the score depends on x i and μ ij only through their difference.
- a two-dimensional LUT could be used to implement ALUs 1630 1 - 1630 8 .
- a two-dimensional LUT could have substantial drawbacks if used to implement ALUs 1630 1 - 1630 8 in the hardware implementation of FIG. 16 .
- ALUs 1630 1 - 1630 8 that each compute a respective one-dimensional distance score
- such a two-dimensional LUT is approximately 32 Kbytes, although other embodiments and applications may require larger LUTs.
- eight copies of a 32 Kbyte LUT would be needed. If implemented in such a manner, a large amount of the total board space for the SSU would be allocated to only the eight two-dimensional LUTs. This problem would be exacerbated if larger LUTs were required or desired.
- ALU 1800 overcomes this drawback of two-dimensional LUTs by implementing a scoring function using a combination of computational logic and a one-dimensional LUT.
- Equation (1) can be split into two parts: an alu ij part and a LUT ij part, with each specified below.
- alu ij = M 2 ·[(x i − μ ij )/var ij ] 2 (2)
- LUT ij = M 1 ·(ln(var ij ) − C) (3)
- ALU 1800 computes alu ij and, in parallel with the computing, retrieves LUT ij .
- the alu ij and LUT ij are then combined to form the distance score.
- ALU 1800 includes a computational logic unit 1802 and a LUT module 1804 .
- computational logic unit 1802 can compute value alu ij and LUT module 1804 can be used to retrieve value LUT ij .
- ALU 1800 additionally includes a combination module 1806 .
- Combination module 1806 combines the outputs of computational unit 1802 and LUT module 1804 and outputs the distance score.
- Computational logic unit 1802 and LUT module 1804 only receive the inputs that are needed to determine their respective value. Specifically, as described above, alu ij depends on three variables: x i , ⁇ ij , and var ij . Thus, as shown in FIG. 18 , computational logic unit 1802 receives these three values as inputs. Moreover, the values retrieved from LUT module 1804 are indexed using value var ij alone. Thus, as shown in FIG. 18 , LUT module 1804 only receives value var ij .
- FIG. 19 shows a detailed block diagram of ALU 1800 , according to an embodiment of the present invention.
- computational logic unit 1802 includes a subtraction module 1910 , a squaring module 1912 , a LUT 1914 , a multiplier 1916 , and a formatting module 1918 .
- Subtraction module 1910 computes the difference between x i and μ ij , i.e., subtraction module 1910 computes x i − μ ij .
- Squaring module 1912 squares the difference output by subtraction module 1910 , generating an integer representing (x i − μ ij ) 2 .
- LUT 1914 outputs a value that corresponds to M 2 /var ij 2 .
- Multiplier 1916 computes a product of two terms: (1) the value retrieved from LUT 1914 and (2) the square output by squaring module 1912 . Thus, the output of multiplier 1916 is M 2 ·(x i − μ ij ) 2 /var ij 2 , i.e., alu ij .
- This product value is received by formatting module 1918 , which formats the result so that it can be effectively combined with the output of LUT module 1804 .
- LUT module 1804 includes a LUT 1920 and a formatting module 1922 .
- LUT 1920 stores values corresponding to LUT ij , as expressed in Equation (3), and is indexed using var ij .
- the value retrieved from LUT 1920 is received by formatting module 1922 .
- Formatting module 1922 formats the output of LUT 1920 so that it can be effectively combined with the output of computational logic unit 1802 .
- Combination module 1806 includes an adder 1930 , a shift module 1932 , a rounding module 1934 , and a saturation module 1936 .
- Adder 1930 computes the sum of the two received values and outputs the sum.
- Shift module 1932 is configured to remove the fractional portion of the sum output by adder 1930 .
- Rounding module 1934 is configured to round down the output of shift module 1932 .
- Saturation module 1936 is configured to receive the rounded sum and saturate the value to a specific number of bits.
- the output of saturation module 1936 is a value having a specific number of bits that represents the one-dimensional distance score.
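The add/shift/round/saturate pipeline of combination module 1806 can be sketched in fixed-point arithmetic. The bit widths below are invented for illustration; the document does not specify them.

```python
# Illustrative fixed-point combination: add the alu and LUT parts, drop
# the fractional bits with a right shift (which also rounds toward
# negative infinity, i.e., rounds down), then saturate to an unsigned
# score of out_bits bits.  Bit widths are assumptions.
def combine(alu_q, lut_q, frac_bits=4, out_bits=8):
    total = alu_q + lut_q                  # adder 1930
    shifted = total >> frac_bits           # shift module 1932 + round down
    max_val = (1 << out_bits) - 1
    return min(max(shifted, 0), max_val)   # saturation module 1936

print(combine(100, 50))    # 150 >> 4 = 9, within range
print(combine(10000, 0))   # 625 exceeds 8 bits, saturates to 255
```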
- FIG. 20 is a block diagram of computational unit 1802 , according to another embodiment of the present invention.
- the embodiment shown in FIG. 20 is similar to the embodiment of FIG. 19 , except that the embodiment of FIG. 20 additionally includes a transform module 2002 , an exception handling module 2012 , a formatting module 2014 , and a multiplexer 2018 .
- Transform module 2002 includes a multiplier 2020 , a scale bit module 2022 , and a saturation module 2024 .
- values of feature vector can be transformed by respective entries in a feature vector transform matrix to, for example, account for learned characteristics of a speaker.
- transform module 2002 can be configured to scale individual feature vector values x i by corresponding transform values ⁇ i .
- multiplier 2020 computes a product of the feature vector value x i and the corresponding transform value ⁇ i and outputs a value to scale bit module 2022 .
- Scale bit module 2022 shifts the product to the right and outputs the resulting integer to saturation module 2024 .
- Saturation module 2024 , which is similar to saturation module 1936 described with reference to FIG. 19 , saturates the received value to a specific number of bits.
- the output of saturation module 2024 is a value that represents the scaled feature vector value.
- Exception handling module 2012 and multiplexer 2018 are configured to address specific errors present in LUT 1914 .
- the size of LUT 1914 can be reduced. This reduction in size can cause specific values of LUT 1914 to have an error.
- exception handling module 2012 can recognize if the output of LUT 1914 will be one of those values, and output the correct value. Put another way, exception handling module 2012 can act as a LUT that includes an entry for each entry of LUT 1914 that may have an error due to size restrictions. Because LUT 1914 is indexed based on var ij , exception handling module 2012 can recognize whether the output of LUT 1914 needs to be corrected based on the value of var ij .
- exception handling module 2012 can act as a two-dimensional LUT that also receives ⁇ ij .
- exception handling module 2012 can output specific values of alu ij (e.g., as opposed to the corresponding entry from LUT 1914 ). Because the number of these possible errors in LUT 1914 is relatively small, exception handling module 2012 does not occupy a significant amount of space, as would other, larger two-dimensional LUTs.
- exception handling module 2012 can ensure that the stored value for alu ij rather than the value of alu ij calculated using the incorrect output of LUT 1914 is finally output to combination module 1806 .
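The exception-handling behavior can be sketched as a small override table sitting in front of the main LUT. All variance values and table entries below are invented purely to illustrate the mechanism.

```python
# Hypothetical sketch of exception handling module 2012: a small table
# of corrected entries overrides the handful of LUT 1914 entries that
# became inaccurate when the LUT was reduced in size.  Indices and
# values here are invented for illustration.
LUT_1914 = {1: 100, 2: 25, 3: 11}   # var -> stored (approximate) term
EXCEPTIONS = {3: 12}                 # var values with corrected outputs

def lut_lookup(var):
    # the exception entry, when present, wins over the main LUT entry
    return EXCEPTIONS.get(var, LUT_1914[var])

print(lut_lookup(2))  # -> 25 (main LUT)
print(lut_lookup(3))  # -> 12 (corrected by the exception table)
```

Because only a few indices need correction, the exception table stays tiny compared with the full two-dimensional LUT it helps avoid.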
- Formatting module 2014 receives the product computed by multiplier 1916 .
- formatting module 2014 is configured to reduce the number of bits in the result. While not necessary, this operation can save space and power by reducing the number of bits on the output.
- FIG. 20 shows subtraction module 1910 as including multiplexers 2004 and 2006 , comparison module 2008 , and a subtractor 2010 .
- squaring module 1912 may be configured to square only non-negative values.
- Thus, the output of subtraction module 1910 in such an embodiment must be non-negative.
- the two operands, i.e., the feature vector value (optionally scaled by transform value η i ) and the mean value μ ij , can be compared by comparison module 2008 .
- Comparison module 2008 then outputs a control signal to multiplexers 2004 and 2006 to ensure that the first operand into subtractor 2010 is at least as large as the second operand.
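The compare-and-swap arrangement described above amounts to computing an absolute difference without ever producing a negative intermediate value, which can be sketched as:

```python
# Sketch of subtraction module 1910 with the FIG. 20 additions: compare
# the operands (comparison module 2008) and route the larger one to the
# first subtractor input (multiplexers 2004/2006), so the subtractor
# 2010 output is always non-negative for squaring module 1912.
def unsigned_diff(a, b):
    hi, lo = (a, b) if a >= b else (b, a)  # comparison + muxes
    return hi - lo                          # subtractor, never negative

print(unsigned_diff(3, 5))  # -> 2
print(unsigned_diff(5, 3))  # -> 2
```

Since the difference is squared immediately afterward, discarding the sign this way changes nothing in the final distance score.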
- FIG. 21 is an illustration of an embodiment of a method 2100 for computing a one-dimensional distance score.
- the steps of method 2100 can be performed using, for example, ALU 1800 shown in FIG. 18 .
- a feature vector dimension is scaled by a transform value.
- a first value is computed based on the feature vector value and a mean and a variance associated with a Gaussian probability distribution vector.
- a second value is retrieved based on the variance value.
- LUT module 1804 can be used to retrieve the second value.
- the first and second values are combined to generate the one-dimensional score.
- FIG. 22 is a block diagram of an acoustic processing system 2200 , according to an embodiment of the present invention.
- Acoustic processing system includes a central processing unit (CPU) 2210 and an acoustic processing unit (APU) 2220 .
- Running on CPU 2210 are an application 2212 , a voice recognition engine 2214 , and an API 2216 .
- Voice recognition engine 2214 is a process that includes at least two threads: a search thread 2250 and a distance thread 2260 .
- APU 2220 includes an acoustic model memory 2222 , a first bus 2224 , a memory buffer 2226 , a second bus 2228 , and a senone scoring unit 2230 .
- Acoustic model memory 2222 can be configured to store a plurality of senones that together form one or more acoustic models.
- First bus 2224 is a wide bus that is configured to allow acoustic model memory 2222 to output an entire Gaussian probability distribution vector to memory buffer 2226 .
- Senone scoring unit 2230 computes a senone score against a feature vector received from CPU 2210 .
- Senone scoring unit 2230 can be implemented as described above. For example, senone scoring unit can be implemented as shown in FIG. 15 . For more information on senone scoring unit 2230 , see Section 4, above.
- Memory buffer 2226 can hold a Gaussian probability distribution vector until senone scoring unit 2230 is ready to compute a Gaussian distance score for it. That is, if senone scoring unit 2230 is scoring a feature vector received from CPU 2210 against a Gaussian probability distribution vector q, memory buffer 2226 can hold the next Gaussian probability distribution vector to be scored, i.e., vector q+1.
- the inputs to APU 2220 include a reference to a specific senone (senone #) and the feature vector.
- the senone # input addresses the stored vector information corresponding to that particular senone in the acoustic model memory.
- the output of APU 2220 is the senone score, which represents the probability that the referenced senone emits the feature vector in a given time frame.
- acoustic model memory 2222 utilizes a parallel read architecture and a very large internal bandwidth bus 2224 .
- the number of bits read in parallel is greater than 256 (e.g., 768 bits wide—sufficient to load an entire Gaussian probability distribution vector at once).
- the values read from the acoustic model memory 2222 are then latched into memory buffer 2226 , using very large bandwidth bus 2224 .
- Both the output from memory buffer 2226 and the observation vector information are input into senone scoring unit 2230 , which performs the multiplications and additions required to compute the senone score.
- Bus 2228 , over which memory buffer 2226 communicates with senone scoring unit 2230 , is substantially similar to bus 2224 .
- the senone score is computed by calculating the scores of the J Gaussian probability distribution vectors of dimension N, and by then summing them together to get the total score.
- Some scoring algorithms use only the most significant Gaussians in the calculation to increase the speed of the computation.
- only those bits associated with the required Gaussians need to be transferred from the acoustic model memory to senone scoring unit 2230 .
- the largest number of contiguous bits in memory which will always be required by senone scoring unit 2230 is equal to the number of bits used to store a single Gaussian probability distribution vector.
- the bandwidth requirements of the memory bus, as well as the number of bits that need to be read in parallel, will be minimized by transferring only those bits comprising a single Gaussian probability distribution vector in each transfer.
- the power requirements of APU 2220 can be reduced and the transfer rate of the necessary data to senone scoring unit 2230 will be increased, resulting in an improvement of the overall system performance.
- acoustic modeling is one of the major bottlenecks in many types of speech recognition systems (e.g., keyword recognition or large-vocabulary continuous speech recognition). Because of the large number of comparisons and calculations, high-performance and/or parallel microprocessors are commonly used, and a high-bandwidth bus between the memory storing the acoustic models and the processors is required.
- the acoustic model memory 2222 can be incorporated into APU 2220 , which is integrated into a single die with senone scoring unit 2230 , with both of them connected using wide, high-bandwidth internal buses 2224 and 2228 to improve the data transfer rate.
- the number of bits per transfer can also be a function of the algorithms used for acoustic modeling. When scoring algorithms based on a partial set of Gaussians are used (i.e., Gaussian Selection), the number of bits per transfer can be equal to the size of the Gaussian used by the algorithm. Fewer bits per transfer require multiple cycles to transfer the data comprising the Gaussian, while greater numbers of bits per transfer are inefficient due to data non-locality.
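To make the transfer sizing concrete, a Gaussian's footprint can be estimated from its dimensionality and per-parameter precision. The bit split below (10-bit means, 9-bit variances, a 27-bit weighting factor over 39 dimensions) is an invented example chosen only to show how such a layout can land on the 768-bit width mentioned earlier; the document does not specify the actual field widths.

```python
# Back-of-the-envelope sketch (assumed layout): bits needed to store one
# Gaussian probability distribution vector, i.e., the natural unit of
# transfer on the wide internal bus.
def gaussian_bits(n_dims, mean_bits, var_bits, weight_bits):
    return n_dims * (mean_bits + var_bits) + weight_bits

print(gaussian_bits(39, 10, 9, 27))  # -> 768, one bus transfer
```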
- an optimized architecture is used for acoustic modeling hardware accelerators when the scoring algorithm is at least partially based on a partial set of Gaussians (i.e., Gaussian Selection). This optimized architecture can result in a significant improvement in the overall system performance compared to other architectures.
- FIG. 23 is a block diagram of an acoustic processing system 2300 , according to an embodiment of the present invention.
- Acoustic processing system 2300 includes a processor 2310 , a dedicated DRAM module 2302 , a DRAM module 2304 , and a non-volatile memory module 2306 .
- Non-volatile memory module 2306 can be implemented as, e.g., an embedded FLASH memory block.
- Processor 2310 includes a CPU 2312 , a hardware accelerator 2314 , and a memory interface 2316 .
- Hardware accelerator 2314 includes a senone scoring unit 2320 .
- Senone scoring unit 2320 can be implemented as described above. For example, senone scoring unit can be implemented as shown in FIG. 15 .
- dedicated DRAM module 2302 is dedicated to senone scoring unit 2320 to, for example, store senones.
- memory interface 2316 can couple senone scoring unit 2320 to dedicated DRAM 2302 .
- FIG. 24 is a block diagram of a hardware accelerator 2400 , according to an embodiment of the present invention.
- Hardware accelerator 2400 includes a processor 2402 and a dedicated DRAM module 2404 .
- Processor 2402 includes a serial peripheral interface (SPI) bus interface module 2412 , a senone scoring unit 2414 , and a memory interface 2416 .
- Senone scoring unit 2414 can be implemented as described above (e.g., as shown in FIG. 15 ).
- dedicated DRAM module 2404 stores one or more acoustic models.
- DRAM module 2404 can instead be a non-volatile memory module, e.g., a FLASH memory module.
- DRAM module 2404 can instead be a memory module that includes a volatile memory module (e.g., DRAM) and a non-volatile memory module (e.g., FLASH).
- the acoustic model can initially be stored in the non-volatile memory module and can be copied to the volatile memory module for senone scoring.
- SPI interface module 2412 can provide an interface to an SPI bus, which, in turn, can couple hardware accelerator 2400 to a CPU.
- Memory interface 2416 couples senone scoring unit 2414 to dedicated DRAM module 2404 .
- a voice-recognition system can be implemented in a cloud-based solution in which the senone scoring and processing necessary for voice-recognition is performed in a cloud-based voice-recognition application.
- FIG. 25 is a block diagram illustrating an APU software stack 2500 , according to an embodiment of the present invention.
- Software stack 2500 can be used to conceptually illustrate the communications between components in an acoustic processing system, e.g., acoustic processing system 2200 described with reference to FIG. 22 .
- Stack 2500 includes an application 2502 , a voice recognition engine 2504 , an application programming interface (API) 2550 , an SPI bus controller 2512 , an SPI bus 2514 , and an APU 2516 .
- API 2550 includes a Generic DCA 2506 , a low level driver (LLD) 2508 , and a hardware abstraction layer (HAL) 2510 .
- application 2502 , voice recognition engine 2504 , API 2550 , and APU 2516 can correspond to application 2212 , voice recognition engine 2214 , API 2216 , and APU 2220 of FIG. 22 , respectively.
- application 2502 communicates with voice recognition engine 2504 , which in turn, communicates with Generic DCA 2506 .
- voice recognition engine 2504 is coupled to the Generic DCA 2506 via a DCA API.
- Generic DCA 2506 can be coupled to LLD 2508 via a LLD API.
- LLD 2508 can be coupled to HAL 2510 via an HAL API.
- HAL 2510 is communicatively coupled to SPI Bus Controller 2512 which is communicatively coupled to SPI bus 2514 .
- APU 2516 is communicatively coupled to SPI bus 2514 and is communicatively coupled to the HAL 2510 via bus controller 2512 and SPI bus 2514 .
- software stack 2500 provides a software interface between APU 2516 and application 2502 (e.g., an application that employs voice recognition).
- application 2502 and voice recognition engine 2504 can be “hardware agnostic.” That is, the application 2502 and voice recognition engine 2504 can complete their respective operations without detailed knowledge about how the distance, or senone, scoring is taking place.
- Generic DCA 2506 , LLD layer 2508 , and HAL layer 2510 include hardware-specific API calls.
- the API calls of HAL 2510 depend on the type of controller to which it is connected.
- the bus interface for APU 2516 can be a different bus and controller combination, requiring a different HAL (with different API calls).
- Generic DCA 2506 is a distance computational API.
- the DCA can be defined by a software developer.
- the DCA API is specifically defined to support a voice recognition engine, such as voice recognition engine 2504 .
- Generic DCA 2506 can be implemented specifically for APU 2516 .
- LLD 2508 can be a functional abstraction of the senone scoring unit commands and can be a one-to-one mapping to the senone scoring unit commands. As shown in FIG. 25 , low-level driver 2508 is coupled to HAL 2510 .
- the DCA API can include the following five functions: Create, Close, Set Feature, Compute Distance Score, and Fill Scores.
- the Create function specifies which acoustic model is to be used.
- There can be one or more acoustic models stored in memory (e.g., one or more acoustic models for each language).
- dedicated acoustic model memory 2222 of APU 2220 can store the acoustic model (e.g., senone library(s)).
- the Set Feature function is used to set the senone scoring requests into their respective frames by passing a specific frameID, a passID, and the feature vector.
- the input audio signal can be broken up into frames (e.g., by voice recognition engine 2504 ).
- An exemplary frame comprises spectral characteristics of a portion of the audio input signal.
- a frame can be 12 milliseconds (ms) long.
- the Set Feature function can convert each frame into 39 dimensions (e.g., 39 8-bit values).
- the Set Feature function can specify a particular frame's ID and the associated feature vector.
- the Compute Distance Score function calculates the senone score (e.g., a Gaussian probability), which, as noted above, can be implemented as a distance calculation. This function can be used to begin and prepare the senone scoring. For example, the feature vector can be input into APU 2516 , and APU 2516 will score it against all the senones stored in the acoustic model, or at least a selected portion of the senones. This score will then be given back to the upper layer. In an embodiment, the Compute Distance Score function can specify that a portion or the complete acoustic model will be used for the senone scoring.
- the Fill Scores function takes the senone scoring result and returns it to the upper software layers, including application 2502 and voice recognition engine 2504 .
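The five Generic DCA functions described above might be rendered, purely as an illustration, as the following Python interface. The real Generic DCA is a library API whose exact signatures the document does not give; every parameter name and the in-memory bookkeeping below are assumptions.

```python
# Hypothetical sketch of the Generic DCA's five functions: Create,
# Set Feature, Compute Distance Score, Fill Scores, and Close.
class GenericDCA:
    def create(self, acoustic_model_id):
        """Select which stored acoustic model (senone library) to use."""
        self.model = acoustic_model_id
        self.frames = {}

    def set_feature(self, frame_id, pass_id, feature_vector):
        """Register a scoring request: bind a feature vector to a frame."""
        self.frames[(frame_id, pass_id)] = feature_vector

    def compute_distance_score(self, frame_id, pass_id, senones=None):
        """Begin senone scoring on the APU, for all senones or a subset."""
        ...  # would translate into APU library calls

    def fill_scores(self, frame_id, pass_id):
        """Return finished senone scores to the upper software layers."""
        ...  # would collect results from the APU

    def close(self):
        """Release the model and any in-flight scoring state."""
        self.frames.clear()
```

A caller such as a voice recognition engine would invoke create once, then set_feature / compute_distance_score / fill_scores per frame, and close at shutdown, without ever touching APU-specific calls directly.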
- voice recognition engine 2504 can be used for any form of pattern recognition, e.g., pattern recognition forms that use a Hidden Markov model for pattern recognition.
- pattern recognition also uses Gaussian calculations. Examples of pattern recognition can include, but are not limited to the above described senone scoring for speech recognition, image processing and handwritten recognition.
- application 2502 and voice recognition engine 2504 are agnostic to any hardware used to determine the senone score.
- a particular APU can be swapped out for different hardware without application 2502 and voice recognition engine 2504 knowing or being affected.
- because application 2502 and voice recognition engine 2504 are agnostic to any type of hardware used for the senone scoring, a first hardware accelerator can be replaced with a second hardware accelerator of a different design without requiring any redesign of application 2502 and voice recognition engine 2504 .
- while the APU library calls are specific to the type and design of hardware accelerator used, the Generic DCA library calls are not hardware specific.
- the software architecture illustrated in FIG. 25 can be described in terms of a data and control flow through the software stack.
- Application 2502 can be any application that uses the voice recognition engine.
- voice recognition engine 2504 is the Vocon Engine provided by Nuance, Inc.
- other speech recognition engines or pattern recognition engines that make use of a Gaussian Mixture Model (GMM) for probability estimation may be used.
- APU 2516 computes senone scores using the Gaussian Mixture Model.
- APU 2516 can compute these scores much faster (e.g., by an order of magnitude) than an embedded processor (e.g., a cortex A8 embedded processor) making speech recognition more practical in on-board speech recognition systems with APU 2516 .
- Offloading the senone scoring (or distance computation) to APU 2516 not only improves the user experience (by reducing the computational latency) but also allows CPU 2210 to attend to other tasks in the system.
- the software architecture plays an important role in reducing the CPU load and the latency.
- voice recognition engine 2504 is not directly aware of APU 2516 .
- voice recognition engine 2504 can use Generic DCA API 2506 to compute the distances (also referred to as senone scores).
- the specific implementation of the Generic DCA library discussed here has been designed specifically to use APU 2516 , with a plurality of function calls to the APU discussed below. This differs from a fully software implementation of the Generic DCA library.
- This specific implementation translates the Generic DCA library calls to a sequence of APU library calls. The details of the implementation are described below.
- the definition and implementation of the APU library is specific to the current implementation of the APU and is also described below.
- Generic DCA 2506 operates as an interface layer between the voice recognition engine 2504 and APU 2516 .
- voice recognition engine 2504 can utilize generic API calls to the Generic DCA to request senone scoring.
- Generic DCA 2506 then utilizes an APU-specific library of API calls, described further below, to direct the APU hardware accelerator to perform the requested senone scoring.
- Because voice recognition engine 2504 is not aware of APU 2516 , voice recognition engine 2504 can take advantage of the following benefits. For example, voice recognition engine 2504 may only need to know the message-passing formats of APU 2516 . Voice recognition engine 2504 also does not need to know the tasks to be performed by APU 2516 . Moreover, there is a swap-out benefit.
- APU 2516 can be replaced or redesigned without requiring any redesign of voice recognition engine 2504 .
- Generic DCA 2506 needs to have the hardware specific API calls to ensure the required interoperability between voice recognition engine 2504 and APU 2516 .
- a Generic DCA Library comprises the following list of functions:
- the distance_computation_setfeaturematrix function is called between utterances to adapt the recognition to the specific speaker.
- the APU uses this matrix when computing the senone scores for the next utterance.
- “distance_computation_computescores” and “distance_computation_fillscores” can be implemented such that the computational latency and the CPU load are minimized.
- these functions can be implemented so as to achieve the concurrent operation embodied in FIG. 26 .
- an APU Library supports the following functions:
- the APU can be used for scoring the senones for each frame of a given utterance.
- the acoustic model of choice is communicated to the APU at the beginning as part of the function distance_computation_create.
- the feature vector for a given frame is passed to the APU via the function distance_computation_setfeature.
- the senones to be scored for a given frame are passed to the APU via the function distance_computation_computescores.
- the actual scores computed by the APU can be passed back to the Voice Recognition Engine via the function distance_computation_fillscores.
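The per-utterance call sequence described above can be sketched as follows. The four `distance_computation_*` names come from the APU library in the text; the `FakeAPU` class, its scoring rule (sum of squared distances to each Gaussian mean), and all argument shapes are illustrative assumptions standing in for the hardware.

```python
class FakeAPU:
    """Software stand-in for the APU (illustrative only): scores a senone as
    the sum of squared distances between the feature vector and each of the
    senone's Gaussian mean vectors."""

    def distance_computation_create(self, acoustic_model):
        # The acoustic model of choice is communicated once, at the start.
        return {"model": acoustic_model, "feature": None, "scores": None}

    def distance_computation_setfeature(self, ctx, feature_vector):
        # The feature vector for the current frame.
        ctx["feature"] = feature_vector

    def distance_computation_computescores(self, ctx, senones):
        # The senones to be scored for the current frame.
        f = ctx["feature"]
        ctx["scores"] = {
            s: sum(sum((x - m) ** 2 for x, m in zip(f, mean))
                   for mean in ctx["model"][s])
            for s in senones
        }

    def distance_computation_fillscores(self, ctx):
        # The computed scores are passed back to the voice recognition engine.
        return ctx["scores"]


def score_utterance(apu, acoustic_model, frames, senone_lists):
    """Score the active senones for every frame of an utterance."""
    ctx = apu.distance_computation_create(acoustic_model)
    all_scores = []
    for feature_vector, senones in zip(frames, senone_lists):
        apu.distance_computation_setfeature(ctx, feature_vector)
        apu.distance_computation_computescores(ctx, senones)
        all_scores.append(apu.distance_computation_fillscores(ctx))
    return all_scores
```

In the real stack, as noted below, distance_computation_computescores is asynchronous so that scoring can overlap the search.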
- the control flows from top to bottom of stack 2500 illustrated in FIG. 25 . All the functions are synchronous and they complete before returning except for the function distance_computation_computescores. As noted below, the scoring can be implemented as a separate thread to maximize the concurrency of distance computation and the search as described above. This thread yields the CPU to the rest of voice recognition engine 2214 whenever it is waiting for APU 2220 to complete the distance computation. This asynchronous computation is important to minimize the latency as well as the CPU load.
- A thread (e.g., an executable process) can be created for APU 2516 .
- For parallel operation, a further action of a first actor must not be dependent upon the actions of a second actor. Breaking any dependency between application 2502 and voice recognition engine 2504 on one hand and APU 2516 on the other allows application 2502 and voice recognition engine 2504 to operate in parallel with APU 2516 .
- a dependency between application 2502 and voice recognition engine 2504 on one hand and APU 2516 on the other can be avoided through the use of frames, e.g., lasting approximately 10-12 ms (although the invention is not limited to this embodiment). For example, while the application 2502 is using the senone score for frame n, APU 2516 can be performing a senone score for frame n+1.
- a voice recognition operation requires two discrete operations: scoring and searching.
- the scoring operation involves a comparison between Gaussian probability distribution vectors of a senone with the feature vector corresponding to a specific frame.
- software stack 2500 can be configured such that these two operations occur in parallel.
- voice recognition engine 2214 can include search thread 2250 and distance thread 2260 .
- Distance thread 2260 can manage distance calculations completed on APU 2220 and search thread 2250 can use the results of the distance calculations to determine which sound was received (e.g., by searching a library of senone scores to determine the best match).
- distance thread 2260 can perform the operations needed to start the scoring operation on APU 2220 .
- The distance thread 2260 can then be put to sleep. While asleep, search thread 2250 can be activated and can search using the results of the last distance operation. Because the length of time needed to complete a distance computation is relatively predictable, distance thread 2260 can be put to sleep for a predetermined amount of time.
- distance thread 2260 can be put to sleep indefinitely and an interrupt from APU 2220 can instead be used to wake up distance thread 2260 . In doing so, APU 2220 can be used to compute a distance score for a frame n+1, while CPU 2210 performs a searching operation using the previously calculated score for frame n.
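The two-thread scheme described above can be sketched with ordinary threads and a one-slot queue, which lets the distance computation for frame n+1 proceed while frame n is being searched. Everything here (the function names and the queue-based hand-off) is an illustrative assumption; in the text's implementation the blocking wait stands in for the timed sleep or the APU completion interrupt.

```python
import queue
import threading

def run_pipeline(frames, compute_distance, search):
    """Overlap distance computation (distance thread) with the search
    (search thread): a one-slot queue hands each frame's scores across."""
    scores = queue.Queue(maxsize=1)  # hand-off slot between the two threads
    results = []

    def distance_thread():
        for frame in frames:
            scores.put(compute_distance(frame))  # stands in for APU scoring
        scores.put(None)  # end-of-utterance marker

    def search_thread():
        while True:
            score = scores.get()  # wake when a frame's scores are ready
            if score is None:
                break
            results.append(search(score))

    producer = threading.Thread(target=distance_thread)
    consumer = threading.Thread(target=search_thread)
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    return results
```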
- the search can follow the distance computation as illustrated in FIG. 26 .
- the distance computation for frame (i+1) can be performed while the search for frame i is being conducted.
- the distance computation performed by the APU can be performed concurrently with the search function performed by the CPU.
- a call sequence to the DCA library is arranged to effect this operation.
- the Generic DCA is implemented so that the concurrency of the search computation and the distance computation is maximized.
- An implementation of the Generic DCA library uses the API provided by the APU library.
- FIG. 27 is an illustration of an embodiment of a method 2700 for acoustic processing.
- the steps of method 2700 can be performed using, for example, acoustic processing system 2200 , shown in FIG. 22 , along with software stack 2500 , shown in FIG. 25 .
- the received audio signal is divided into frames.
- voice recognition engine 2214 can divide a received audio signal into frames that are, for example, 10-12 ms in length.
- a search thread and a distance computation thread are created.
- voice recognition engine 2214 can create search thread 2250 and distance thread 2260 .
- a distance score is computed using an APU.
- senone scoring unit 2230 of APU 2220 can compute a distance score between a feature vector corresponding to a frame and a Gaussian probability distribution vector.
- a search operation is performed using the computed score for the frame.
- search thread 2250 can use the distance score computed in step 2706 to search different senones to determine which sound was included in the frame.
- In step 2710 , it is determined whether the frame was the last frame of the audio signal. If so, method 2700 ends. If not, method 2700 proceeds to step 2712 .
- In step 2712 , concurrently with the search operation of step 2708 , a distance score for the next frame is computed using the APU.
- APU 2220 can be used to compute a distance score for a frame i+1 concurrently with search thread 2250 performing a search operation using the distance score for frame i.
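Steps 2706-2712 can be expressed compactly as a loop that keeps one APU scoring job in flight while the CPU searches the previous frame. The helper names `apu_score` and `cpu_search` are hypothetical stand-ins for the APU distance computation and the search thread's work.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize(frames, apu_score, cpu_search):
    """Search frame i on the CPU while the APU scores frame i+1."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as apu:
        pending = apu.submit(apu_score, frames[0])   # step 2706: score frame 0
        for i in range(len(frames)):
            score = pending.result()                 # wait for frame i's score
            if i + 1 < len(frames):                  # steps 2710/2712:
                pending = apu.submit(apu_score, frames[i + 1])  # score frame i+1...
            results.append(cpu_search(score))        # ...while searching frame i (step 2708)
    return results
```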
- FIG. 28 is an illustration of an example computer system 2800 in which embodiments of the present invention, or portions thereof, can be implemented as computer-readable code.
- The method illustrated by flowchart 900 of FIG. 9 , the method illustrated by flowchart 1700 of FIG. 17 , the method illustrated by flowchart 2100 of FIG. 21 , software stack 2500 illustrated in FIG. 25 , and/or the method illustrated by flowchart 2700 of FIG. 27 can be implemented in system 2800 .
- Various embodiments of the present invention are described in terms of this example computer system 2800 . After reading this description, it will become apparent to a person skilled in the relevant art how to implement embodiments of the present invention using other computer systems and/or computer architectures.
- simulation, synthesis and/or manufacture of various embodiments of this invention may be accomplished, in part, through the use of computer readable code, including general programming languages (such as C or C++), hardware description languages (HDL) such as, for example, Verilog HDL, VHDL, Altera HDL (AHDL), or other available programming and/or schematic capture tools (such as circuit capture tools).
- This computer readable code can be disposed in any known computer-usable medium including a semiconductor, magnetic disk, optical disk (such as CD-ROM, DVD-ROM). As such, the code can be transmitted over communication networks including the Internet.
- the functions accomplished and/or structure provided by the systems and techniques described above can be represented in a core (e.g., an APU core) that is embodied in program code and can be transformed to hardware as part of the production of integrated circuits.
- Computer system 2800 includes one or more processors, such as processor 2804 . Processor 2804 may be a special-purpose or a general-purpose processor such as, for example, the APU and CPU of FIG. 4 , respectively.
- Processor 2804 is connected to a communication infrastructure 2806 (e.g., a bus or network).
- Computer system 2800 also includes a main memory 2808 , preferably random access memory (RAM), and may also include a secondary memory 2810 .
- Secondary memory 2810 can include, for example, a hard disk drive 2812 , a removable storage drive 2814 , and/or a memory stick.
- Removable storage drive 2814 can include a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like.
- the removable storage drive 2814 reads from and/or writes to a removable storage unit 2818 in a well-known manner.
- Removable storage unit 2818 can comprise a floppy disk, magnetic tape, optical disk, etc. which is read by and written to by removable storage drive 2814 .
- removable storage unit 2818 includes a computer-usable storage medium having stored therein computer software and/or data.
- Computer system 2800 (optionally) includes a display interface 2802 (which can include input and output devices such as keyboards, mice, etc.) that forwards graphics, text, and other data from communication infrastructure 2806 (or from a frame buffer not shown) for display on display unit 2830 .
- secondary memory 2810 can include other similar devices for allowing computer programs or other instructions to be loaded into computer system 2800 .
- Such devices can include, for example, a removable storage unit 2822 and an interface 2820 .
- Examples of such devices can include a program cartridge and cartridge interface (such as those found in video game devices), a removable memory chip (e.g., EPROM or PROM) and associated socket, and other removable storage units 2822 and interfaces 2820 which allow software and data to be transferred from the removable storage unit 2822 to computer system 2800 .
- Computer system 2800 can also include a communications interface 2824 .
- Communications interface 2824 allows software and data to be transferred between computer system 2800 and external devices.
- Communications interface 2824 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like.
- Software and data transferred via communications interface 2824 are in the form of signals which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 2824 . These signals are provided to communications interface 2824 via a communications path 2826 .
- Communications path 2826 carries signals and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, a RF link or other communications channels.
- The terms “computer program medium” and “computer-usable medium” are used to generally refer to media such as removable storage unit 2818 , removable storage unit 2822 , and a hard disk installed in hard disk drive 2812 .
- Computer program medium and computer-usable medium can also refer to memories, such as main memory 2808 and secondary memory 2810 , which can be memory semiconductors (e.g., DRAMs, etc.). These computer program products provide software to computer system 2800 .
- Computer programs are stored in main memory 2808 and/or secondary memory 2810 . Computer programs may also be received via communications interface 2824 . Such computer programs, when executed, enable computer system 2800 to implement embodiments of the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 2804 to implement processes of embodiments of the present invention, such as the steps in the methods illustrated by flowchart 900 of FIG. 9 , flowchart 1700 of FIG. 17 , flowchart 2100 of FIG. 21 , and flowchart 2700 of FIG. 27 , and/or the functions in software stack 2500 illustrated in FIG. 25 , discussed above. Accordingly, such computer programs represent controllers of computer system 2800 . Where embodiments of the present invention are implemented using software, the software can be stored in a computer program product and loaded into computer system 2800 using removable storage drive 2814 , interface 2820 , hard drive 2812 , or communications interface 2824 .
- Embodiments of the present invention are also directed to computer program products including software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein.
- Embodiments of the present invention employ any computer-usable or -readable medium, known now or in the future.
- Examples of computer-usable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.).
Description
- This application claims the benefit of U.S. Provisional Patent Application No. 61/577,595, filed Dec. 19, 2011, titled “Senone Scoring Unit” and U.S. Provisional Patent Application No. 61/589,113, filed Jan. 20, 2012, titled “HW/SW Architecture for Speech Recognition,” both of which are incorporated herein by reference in their entireties.
- 1. Field
- Embodiments of the present invention generally relate to speech recognition. More particularly, embodiments of the present invention relate to the implementation of an acoustic modeling process on a dedicated processing unit.
- 2. Background
- Real-time data pattern recognition is increasingly used to analyze data streams in electronic systems. With vocabularies of tens of thousands of words, speech recognition systems have achieved improved accuracy, making speech recognition an attractive feature for electronic systems. For example, speech recognition systems are increasingly common in consumer markets targeted to data pattern recognition applications such as, for example, the mobile device, server, automobile, and PC markets.
- Despite the improved accuracy in speech recognition systems, significant computing resources are dedicated to the speech recognition process, in turn placing a significant load on computing systems such as, for example, multiuser/multiprogramming environments. Multiprogramming computing systems concurrently process data from various applications and, as a result, the load placed on these computing systems by the speech recognition process affects the speed at which the computing systems can process incoming voice signals as well as data from other applications. Further, for handheld devices that typically include limited memory resources (as compared to desktop computing systems), speech recognition applications not only place significant load on the handheld device's computing resources but also consume a significant portion of the handheld device's memory resources. The above speech recognition system issues of processing capability, speed, and memory resources are further exacerbated by the need to process incoming voice signals in real-time or substantially close to real-time.
- Therefore, there is a need to reduce the load that speech recognition systems place on the processing capability, speed, and memory resources of computing systems.
- An embodiment of the present invention includes a senone scoring unit (SSU). The SSU can include a SSU control module, a distance calculator, and an addition module. The SSU control module can be configured to receive a feature vector. The distance calculator can be configured to receive a plurality of Gaussian probability distributions via a data bus having a width of at least one Gaussian probability distribution (e.g., 768 bits) and the feature vector from the SSU control module. The distance calculator can include a plurality of arithmetic logic units (ALUs) and an accumulator. Each of the ALUs can be configured to receive a portion of the at least one Gaussian probability distribution and to calculate a dimension distance score between a dimension of the feature vector and a corresponding dimension of the at least one Gaussian probability distribution. The accumulator can be configured to sum the dimension distance scores from the plurality of ALUs to generate a Gaussian distance score. Further, the addition module can be configured to sum a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score. The SSU can also include a feature vector matrix module configured to store a scaling factor for the dimension of the feature vector.
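A software model of this data path can be sketched as follows: one function stands in for an ALU (a single dimension distance score, here using a scaled squared difference as an assumed distance measure), the accumulator sums across dimensions, and the addition module sums across the senone's Gaussian probability distributions. The arithmetic is illustrative; the SSU's actual distance function and the per-dimension scaling are not pinned down at this level of description.

```python
def dimension_distance(x_d, mean_d, scale_d=1.0):
    """One ALU: distance score along a single dimension (assumed to be a
    scaled squared difference for illustration)."""
    return scale_d * (x_d - mean_d) ** 2

def gaussian_distance(feature, mean, scales=None):
    """Accumulator: sum the dimension distance scores for one Gaussian
    probability distribution into a Gaussian distance score."""
    scales = scales or [1.0] * len(feature)
    return sum(dimension_distance(x, m, s)
               for x, m, s in zip(feature, mean, scales))

def senone_score(feature, gaussians, scales=None):
    """Addition module: sum the Gaussian distance scores across the
    senone's distributions to generate the senone score."""
    return sum(gaussian_distance(feature, g, scales) for g in gaussians)
```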
- Another embodiment of the present invention includes a method for acoustic modeling. The method can include the following: receiving a plurality of Gaussian probability distributions via a data bus having a width of at least one Gaussian probability distribution and a feature vector from an external computing device; calculating a plurality of dimension distance scores based on a plurality of dimensions of the feature vector and a corresponding plurality of dimensions of the at least one Gaussian probability distribution; summing the plurality of dimension distance scores to generate a Gaussian distance score for the at least one Gaussian probability distribution; and, summing a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score.
- A further embodiment of the present invention includes a system for acoustic modeling. The system can include a memory module and a senone scoring unit (SSU). The memory module can be configured to interface with an external computing device to receive a feature vector. The SSU can include a distance calculator and an addition module, where the distance calculator includes a plurality of arithmetic logic units (ALUs) and an accumulator. Each of the ALUs can be configured to receive a portion of the at least one Gaussian probability distribution and to calculate a dimension distance score between a dimension of the feature vector and a corresponding dimension of the at least one Gaussian probability distribution. The accumulator can be configured to sum the dimension distance scores from the plurality of ALUs to generate a Gaussian distance score. Further, the addition module can be configured to sum a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions to generate a senone score. The memory module and SSU can be integrated on the same chip.
- Further features and advantages of embodiments of the invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art based on the teachings contained herein.
- The accompanying drawings, which are incorporated herein and form a part of the specification, illustrate embodiments of the present invention and, together with the description, further serve to explain the principles of the invention and to enable a person skilled in the relevant art to make and use the invention.
- FIG. 1 is an illustration of an exemplary flowchart of a speech recognition process according to an embodiment of the present invention.
- FIG. 2 is an illustration of a conventional speech recognition system.
- FIG. 3 is an illustration of a conventional speech recognition system with speech recognition processes performed by an individual processing unit.
- FIG. 4 is an illustration of an embodiment of speech recognition processes performed by an Acoustic Processing Unit (APU) and a Central Processing Unit (CPU).
- FIG. 5 is an illustration of an embodiment of a Peripheral Controller Interface (PCI) bus architecture for a speech recognition system.
- FIG. 6 is an illustration of an embodiment of an Advanced Peripheral Bus (APB) architecture for a speech recognition system.
- FIG. 7 is an illustration of an embodiment of a Low Power Double Data Rate (LPDDR) bus architecture for a speech recognition system.
- FIG. 8 is an illustration of an embodiment of a system-level architecture for a speech recognition system.
- FIG. 9 is an illustration of an embodiment of a method for data pattern analysis.
- FIG. 10 is an illustration of an embodiment of a system-level architecture for a speech recognition system with an integrated Application-Specific Integrated Circuit (ASIC) and memory device.
- FIG. 11 is an illustration of an embodiment of a system-level architecture for a speech recognition system with an integrated Application-Specific Integrated Circuit (ASIC), volatile memory device, and non-volatile memory device.
- FIG. 12 is an illustration of an embodiment of a system-level architecture for a speech recognition system with a System-On-Chip that includes an Application-Specific Integrated Circuit (ASIC) and a Central Processing Unit (CPU).
- FIG. 13 is an illustration of another embodiment of a system-level architecture for a speech recognition system with a System-On-Chip that includes an Application-Specific Integrated Circuit (ASIC) and a Central Processing Unit (CPU).
- FIG. 14 is an illustration of an embodiment of an Acoustic Processing Unit (APU).
- FIG. 15 is an illustration of an embodiment of a Senone Scoring Unit (SSU) controller for an Acoustic Processing Unit (APU).
- FIG. 16 is an illustration of an embodiment of a distance calculator for an Acoustic Processing Unit (APU).
- FIG. 17 is an illustration of an embodiment of a method of an acoustic modeling process for an Acoustic Processing Unit (APU).
- FIG. 18 is an illustration of an arithmetic logic unit, according to an embodiment of the present invention.
- FIG. 19 is an illustration of the arithmetic logic unit shown in FIG. 18 , according to an embodiment of the present invention.
- FIG. 20 is an illustration of a computational unit, according to an embodiment of the present invention.
- FIG. 21 is an illustration of an embodiment of a method for computing a one-dimensional distance score.
- FIGS. 22 and 23 are illustrations of embodiments of an acoustic processing system.
- FIG. 24 is an illustration of an embodiment of a hardware accelerator.
- FIG. 25 is a block diagram illustrating an APU software stack.
- FIG. 26 is an illustration of an embodiment of concurrent processing.
- FIG. 27 is an illustration of an embodiment of a method of acoustic processing.
- FIG. 28 is an illustration of an embodiment of an example computer system in which embodiments of the present invention, or portions thereof, can be implemented as computer-readable code.
- The following detailed description refers to the accompanying drawings that illustrate exemplary embodiments consistent with this invention. Other embodiments are possible, and modifications can be made to the embodiments within the spirit and scope of the invention. Therefore, the detailed description is not meant to limit the scope of the invention. Rather, the scope of the invention is defined by the appended claims.
- It would be apparent to a person skilled in the relevant art that the present invention, as described below, can be implemented in many different embodiments of software, hardware, firmware, and/or the entities illustrated in the figures. Thus, the operational behavior of embodiments of the present invention will be described with the understanding that modifications and variations of the embodiments are possible, given the level of detail presented herein.
- This specification discloses one or more embodiments that incorporate the features of this invention. The disclosed embodiments merely exemplify the invention. The scope of the invention is not limited to the disclosed embodiments. The invention is defined by the claims appended hereto.
- The embodiments described, and references in the specification to “one embodiment”, “an embodiment”, “an example embodiment”, etc., indicate that the embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is understood that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
- FIG. 1 is an illustration of an exemplary flowchart of a speech recognition process 100 according to an embodiment of the present invention. Speech recognition process 100 includes a signal processing stage 110 , an acoustic modeling stage 120 , a phoneme evaluation stage 130 , and a word modeling stage 140 .
- In signal processing stage 110 , an analog signal representation of an incoming voice signal 105 can be filtered to eliminate high-frequency components of the signal that lie outside the range of frequencies that the human ear can hear. The filtered signal is then digitized using sampling and quantization techniques well known to a person skilled in the relevant art. One or more parametric digital representations (also referred to herein as "feature vectors 115") can be extracted from the digitized waveform using techniques such as, for example, linear predictive coding and fast Fourier transforms. This extraction can occur at regular time intervals, or frames, of approximately 10 ms, for example.
- In acoustic modeling stage 120 , feature vectors 115 from signal processing stage 110 are compared to one or more multivariate Gaussian probability distributions (also referred to herein as "Gaussian probability distributions") stored in memory. The one or more Gaussian probability distributions stored in memory can be part of an acoustic library, in which the Gaussian probability distributions represent senones. A senone refers to a sub-phonetic unit for a language of interest, as would be understood by a person skilled in the relevant art. An individual senone can be made up of, for example, 8 components, in which each of the components can represent a 39-dimension Gaussian probability distribution.
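The comparison in acoustic modeling stage 120 can be illustrated, under common modeling assumptions (diagonal-covariance components combined by log-sum-exp; the dimensions and parameter values below are invented, not taken from the document), as scoring a feature vector against a senone's mixture of Gaussian probability distributions:

```python
import math

def log_gaussian(x, mean, var):
    """Log-likelihood of feature vector x under one diagonal-covariance
    Gaussian component (a real senone might use 39 dimensions)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def senone_log_score(x, components):
    """Combine the senone's weighted components (e.g., 8 of them) with
    log-sum-exp to obtain a numerically stable mixture log-likelihood."""
    logs = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))
```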
- Acoustic modeling stage 120 can process over 1000 senones, for example. As a result, the comparison of feature vectors 115 to the one or more Gaussian probability distributions can be a computationally intensive task, as thousands of Gaussian probability distributions, for example, can be compared to feature vectors 115 every time interval or frame (e.g., 10 ms). A set of scores for each of the senones represented in the acoustic library (also referred to herein as "senone scores") results from the comparison of each of feature vectors 115 to each of the one or more Gaussian probability distributions. Acoustic modeling stage 120 provides senone scores 125 to phoneme evaluation stage 130 .
- In phoneme evaluation stage 130 , Hidden Markov Models (HMMs) can be used to characterize a phoneme as a set of states and an a priori set of transition probabilities between each of the states, where a state is associated with a senone. For a given observed sequence of senones, there is a most-likely sequence of states in a corresponding HMM. This corresponding HMM can be associated with an observed phoneme. A Viterbi algorithm can be used to find the likelihood of each HMM corresponding to a phoneme.
- The Viterbi algorithm performs a computation that starts with a first frame and then proceeds to subsequent frames one at a time in a time-synchronous manner. A probability score is computed for each senone in the HMMs being considered. Therefore, a cumulative probability score can be successively computed for each of the possible senone sequences as the Viterbi algorithm analyzes sequential frames.
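The time-synchronous computation just described can be sketched as a standard Viterbi recursion. The state space, transition scores, and emission function here are invented for illustration; `log_emit(t, j)` would stand in for the senone score of state j's senone at frame t.

```python
def viterbi(n_states, n_frames, log_trans, log_emit, log_init):
    """Best-path log-likelihood, updated frame by frame: a cumulative
    score is kept per HMM state and advanced one frame at a time."""
    # Frame 0: initial state scores plus the first frame's senone scores.
    scores = [log_init[j] + log_emit(0, j) for j in range(n_states)]
    for t in range(1, n_frames):
        # For each state, take the best predecessor and add the emission.
        scores = [
            max(scores[i] + log_trans[i][j] for i in range(n_states))
            + log_emit(t, j)
            for j in range(n_states)
        ]
    return max(scores)
```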
- Phoneme evaluation stage 130 provides the phoneme likelihoods or probabilities 135 (also referred to herein as a "phoneme score") to word modeling stage 140 .
- In word modeling stage 140 , searching techniques are used to determine a most-likely string of phonemes and subsequent words over time. Searching techniques such as, for example, tree-based algorithms can be used to determine the most-likely string of phonemes.
FIG. 2 is an illustration of a conventional speech recognition system 200. Speech recognition system 200 includes an input device 210, a processing unit 220, a memory device 230, and a data bus 240, all of which are separate physical components. Memory device 230 can be, for example, a Dynamic Random Access Memory (DRAM) device that is external to processing unit 220 and in communication with processing unit 220 via data bus 240. Input device 210 is also in communication with processing unit 220 via data bus 240. Data bus 240 has a typical bus width of, for example, 8 to 32 bits.
-
Input device 210 is configured to receive an incoming voice signal (e.g., incoming voice signal 105 of FIG. 1) and convert acoustical vibrations associated with the incoming voice signal to an analog signal. The analog signal is digitized using an analog to digital converter (not shown in FIG. 2), and the resulting digital signal is transferred to processing unit 220 over data bus 240. Input device 210 can be, for example, a microphone.
- Processing unit is configured to process the digital input signal in accordance with the
signal processing stage 110, acoustic modeling stage 120, phoneme evaluation stage 130, and word modeler stage 140 described above with respect to FIG. 1. FIG. 3 is an illustration of speech recognition system 200 with speech recognition modules performed by processing unit 220. Processing unit includes signal processing module 310, acoustic modeling module 320, phoneme evaluation module 330, and word modeling module 340, which operate in a similar manner as signal processing stage 110, acoustic modeling stage 120, phoneme evaluation stage 130, and word modeler stage 140 of FIG. 1, respectively.
- In reference to
FIG. 3, signal processing module 310 can convert a digital input signal representation of incoming voice signal 305 (e.g., from input device 210) into one or more feature vectors 315. Acoustic modeling module 320 compares one or more feature vectors 315 to one or more Gaussian probability distributions stored in an acoustic library in memory device 230. That is, for each of the comparisons of one or more feature vectors 315 to the one or more Gaussian probability distributions, processing unit 220 accesses memory device 230 via data bus 240. For an acoustic library with thousands of senones (in which each of the senones is composed of a plurality of Gaussian probability distributions), not only are the comparisons performed by acoustic modeling module 320 computationally-intensive but the thousands of accesses to memory device 230 via data bus 240 by acoustic modeling module 320 are also computationally-intensive and time consuming. The thousands of accesses to memory device 230 are further exacerbated by the bus width of data bus 240 (e.g., typically 8 to 32 bits), in which multiple accesses to memory device 230 may be required by acoustic modeling module 320 to access each Gaussian probability distribution. Further, interconnect parasitics associated with data bus 240 may corrupt data transfer between memory device 230 and acoustic modeling module 320.
-
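The cost of those multiple accesses per Gaussian probability distribution can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 39 dimensions and 32 Gaussian probability distributions per senone are figures used later in this disclosure, while the assumed 32 bits of storage per dimension (a 16-bit mean plus a 16-bit variance) is our assumption, not a value fixed by the disclosure.

```python
def accesses_per_gaussian(dims, bits_per_dim, bus_width_bits):
    """Bus transfers needed to fetch one Gaussian probability distribution."""
    total_bits = dims * bits_per_dim
    return -(-total_bits // bus_width_bits)  # ceiling division

# 39 dimensions at an assumed 32 bits each (16-bit mean + 16-bit variance),
# fetched over a 32-bit data bus:
per_gaussian = accesses_per_gaussian(39, 32, 32)
per_senone = per_gaussian * 32         # assuming 32 Gaussians per senone
per_frame = per_senone * 1000          # ~1000 senones scored per 10 ms frame
```

Under these assumptions the bus carries on the order of a million transfers per frame, which illustrates why a narrow shared data bus becomes a bottleneck for the acoustic modeling process.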
Phoneme evaluation module 330 receives senone scores 325 from acoustic modeling module 320. As discussed above with respect to speech recognition process 100 of FIG. 1, HMMs can be used to characterize a phoneme as a set of states and an a priori set of transition probabilities between each of the states, where a state is composed of a sequence of senones. The sets of states and a priori sets of transition probabilities used by phoneme evaluation module 330 can be stored in memory device 230. Phoneme evaluation module 330 provides phoneme scores 335 to word modeling module 340.
-
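The senone scores 325 exchanged above can be sketched in software. This is an illustrative model, not the patented implementation: it assumes diagonal-covariance Gaussian probability distributions with log-domain mixture weights, and every name is hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log-likelihood of feature vector x under one diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def senone_score(x, components):
    """Log-likelihood of x under a senone modeled as a Gaussian mixture.

    components: list of (log_weight, mean, var) tuples.
    """
    logs = [log_w + log_gaussian(x, mean, var) for log_w, mean, var in components]
    peak = max(logs)  # log-sum-exp, computed stably
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

# Toy senone with two 2-D components (a real library might use 39 dimensions
# and 32 components per senone).
senone = [(math.log(0.5), [0.0, 0.0], [1.0, 1.0]),
          (math.log(0.5), [1.0, 1.0], [1.0, 1.0])]
score = senone_score([0.2, 0.1], senone)
```

Every feature vector must be scored against every component of every candidate senone, which is the comparison workload the preceding paragraphs attribute to acoustic modeling module 320.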
Word modeling module 340 uses searching techniques such as, for example, tree-based algorithms to determine a most-likely string of phonemes (e.g., most-likely phoneme 335), and subsequent words, over time. - An issue with conventional
speech recognition system 300 of FIG. 3, among others, is the significant load on processing unit 220 due to the acoustic modeling process. For example, for each comparison of one or more feature vectors 315 to the one or more Gaussian probability distributions stored in memory device 230, memory device 230 is accessed by processing unit 220. As a result, significant computing resources are dedicated to the acoustic modeling process, in turn placing a significant load on processing unit 220. The load placed on processing unit 220 by the acoustic modeling process affects the speed at which processing unit 220 can process digital signals from input device 210 as well as data from other applications (e.g., where processing unit 220 can operate in a multiuser/multiprogramming environment that concurrently processes data from a plurality of applications). Further, for computing systems with limited memory resources (e.g., handheld devices), the acoustic modeling process not only places a significant load on processing unit 220, but also consumes a significant portion of memory device 230 and bandwidth of data bus 240. These issues, among others, with processing capabilities, speed, and memory resources are further exacerbated by the need to process incoming voice signals in real-time or substantially close to real-time in many applications.
- Embodiments of the present invention address the issues discussed above with respect to conventional
speech recognition systems 200 and 300 of FIGS. 2 and 3, respectively. In an embodiment, the acoustic modeling process is performed by a dedicated processing unit (also referred to herein as an “Acoustic Processing Unit” or “APU”). The APU operates in conjunction with processing unit 220 of FIG. 3 (also referred to herein as a “Central Processing Unit” or “CPU”). For example, the APU receives one or more feature vectors (e.g., feature vectors 315 of FIG. 3) from the CPU, calculates a senone score (e.g., senone score 325 of FIG. 3) based on one or more Gaussian probability distributions, and outputs the senone score to the CPU. In an embodiment, the one or more Gaussian probability distributions can be stored in the APU. Alternatively, in another embodiment, the one or more Gaussian probability distributions can be stored externally to the APU, in which the APU receives the one or more Gaussian probability distributions from an external memory device. Based on the architecture of the APU, which is described in further detail below, an accelerated calculation for the senone score is achieved.
- Although portions of the present disclosure are described in the context of a speech recognition system, a person skilled in the relevant art will recognize that the embodiments described herein are applicable to any data pattern recognition applications based on the description herein. These other data pattern recognition applications include, but are not limited to, image processing, audio processing, and handwriting recognition. These other data pattern recognition applications are within the spirit and scope of the embodiments disclosed herein.
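The division of labor just described (the CPU extracts feature vectors, the APU scores them against the acoustic library) can be sketched as a simple offload loop. This is a behavioral model only, not the hardware interface: all names are hypothetical and the scoring function is a stand-in.

```python
class AcousticProcessingUnit:
    """Behavioral stand-in for the APU: holds the library, scores feature vectors."""
    def __init__(self, acoustic_library, score_fn):
        self.library = acoustic_library      # e.g., senones stored in the APU
        self.score_fn = score_fn             # stand-in for the distance calculation

    def score(self, feature_vector):
        """Return a senone score for every senone in the library."""
        return [self.score_fn(feature_vector, senone) for senone in self.library]

def cpu_pipeline(frames, feature_extractor, apu):
    """CPU side: extract one feature vector per frame, offload scoring to the APU."""
    return [apu.score(feature_extractor(frame)) for frame in frames]
```

The point of the split is that the inner scoring loop, with its heavy memory traffic, never touches the CPU or its system bus.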
-
FIG. 4 is an illustration of an embodiment of a speech recognition process 400 performed by the APU and CPU. In an embodiment, the CPU performs a signal processing process 410, a phoneme evaluation process 430, and a word modeling process 440. The APU performs an acoustic modeling process 420. Signal processing process 410, acoustic modeling process 420, phoneme evaluation process 430, and word modeling process 440 operate in a similar manner as signal processing stage 110, acoustic modeling stage 120, phoneme evaluation stage 130, and word modeler stage 140 of FIG. 1, respectively, except as otherwise described herein.
- In reference to the embodiment of
FIG. 4, feedback 450 is an optional feature of speech recognition process 400, in which phoneme evaluation process 430 can provide an active senone list to acoustic modeling process 420, according to an embodiment of the present invention. The APU can compare one or more feature vectors to one or more senones indicated in the active senone list. Such feedback 450 is further discussed below.
- In another embodiment,
acoustic modeling process 420 can compare the one or more feature vectors to all of the senones associated with an acoustic library. In this case, feedback 450 is not required, as phoneme evaluation process 430 receives an entire set of senone scores (e.g., “score all” function) from the APU for further processing.
- A. System Bus Architectures for Speech Recognition Systems with an Acoustic Processing Unit
- In an embodiment, the APU and CPU can be in communication with one another over a Serial Peripheral Interface (SPI) bus, a Peripheral Controller Interface (PCI) bus, an Application Programming Interface (API) bus, an Advanced Microcontroller Bus Architecture High-Performance Bus (AHB), an Advanced Peripheral Bus (APB), a memory bus, or any other type of bus. Example, non-limiting embodiments of system bus architectures for
speech recognition process 400 of FIG. 4 are described in further detail below.
-
FIG. 5 is an illustration of an embodiment of a bus architecture for a speech recognition system 500. Speech recognition system 500 includes an APU 510, a CPU 520, a processor/memory bus 530, a cache 540, a system controller 550, a main memory 560, a plurality of PCI devices 570 1-570 M, an Input/Output (I/O) bus 580, and a PCI bridge 590. Cache 540 can be, for example, a second-level cache implemented on a Static Random Access Memory (SRAM) device. Further, main memory 560 can be, for example, a Dynamic Random Access Memory (DRAM) device. Speech recognition system 500 can be implemented as a system-on-chip (SOC), according to an embodiment of the present invention.
- As illustrated in
FIG. 5, APU 510 is communicatively coupled to I/O bus 580 through PCI bridge 590. I/O bus 580 can be, for example, a PCI bus. Through PCI bridge 590 and I/O bus 580, APU 510 is communicatively coupled to system controller 550 and CPU 520. In another embodiment (not illustrated in FIG. 5), APU 510 can be directly coupled to processor/memory bus 530 and, in turn, communicatively coupled to CPU 520.
-
FIG. 6 is an illustration of another embodiment of a bus architecture for a speech recognition system 600. Speech recognition system 600 includes APU 510, CPU 520, cache 540, an AHB 610, a system controller 620, a non-volatile memory device 630, a main memory 640, an APB bridge 650, an APB 660, and a plurality of devices 670 1-670 M. Non-volatile memory device 630 can be, for example, a Flash memory device. Main memory 640 can be, for example, a DRAM device. CPU 520 can be, for example, an ARM processor (developed by ARM Holdings plc). Speech recognition system 600 can be implemented as an SOC, according to an embodiment of the present invention.
- As illustrated in
FIG. 6, APU 510 is communicatively coupled to system controller 620 through APB bridge 650 and APB 660. In turn, system controller 620 is communicatively coupled to CPU 520 through AHB 610.
-
FIG. 7 is an illustration of another embodiment of a bus architecture for a speech recognition system 700. Speech recognition system 700 includes APU 510, CPU 520, cache 540, AHB 610, system controller 620, non-volatile memory device 630, a Low Power Double Data Rate (LPDDR) interface 710, LPDDR memory bus 720, and a main memory 730. Main memory 730 can be, for example, a DRAM device. CPU 520 can be, for example, an ARM processor (developed by ARM Holdings plc). Speech recognition system 700 can be implemented as an SOC, according to an embodiment of the present invention.
- As illustrated in
FIG. 7, APU 510 and main memory 730 are communicatively coupled to LPDDR interface 710 via LPDDR memory bus 720. APU 510 is also communicatively coupled to system controller 620 through LPDDR memory bus 720 and LPDDR interface 710. In turn, system controller 620 is communicatively coupled to CPU 520 via AHB 610.
- B. System-Level Architectures for Speech Recognition Systems with an Acoustic Processing Unit
-
FIG. 8 is an illustration of an embodiment of a system-level architecture for a speech recognition system 800. Speech recognition system 800 includes an APU 810, a memory controller 820, a non-volatile memory device 830, and a volatile memory device 840. Memory controller 820 is communicatively coupled to APU 810 via a bus 815 and coupled to non-volatile memory device 830 and volatile memory device 840 via a bus 825 (which may represent two or more buses in certain embodiments). In an embodiment, APU 810 and memory controller 820 are integrated on a single chip. Alternatively, in an embodiment, APU 810 and memory controller 820 are integrated on separate chips. Non-volatile memory device 830 can be a NAND memory module, a NOR memory module, or another type of non-volatile memory device. In an embodiment, volatile memory device 840 can be a DRAM device. Further, APU 810 can communicate with a CPU (not shown in FIG. 8) using, for example, one of the bus architectures described above with respect to FIGS. 5-7, according to an embodiment of the present invention.
-
Non-volatile memory device 830 can store an acoustic library to be used in a speech recognition process, in which the acoustic library can include over 1000 senones, according to an embodiment of the present invention. In an embodiment, when a senone request is received by speech recognition system 800, memory controller 820 copies the acoustic library from non-volatile memory device 830 to volatile memory device 840 via bus 825. The acoustic library transfer process between the non-volatile and volatile memory devices can be implemented using, for example, a direct memory access (DMA) operation.
- In an embodiment,
speech recognition system 800 can be powered on in anticipation of a senone scoring request. After power up, the acoustic library from non-volatile memory device 830 is immediately copied to volatile memory device 840. Once volatile memory device 840 has received the acoustic library, APU 810 is ready to begin processing senone scoring requests (e.g., acoustic modeling process 420 of FIG. 4) using the acoustic library stored in volatile memory device 840.
- When the senone scoring request is received by
APU 810, a selected senone from the acoustic library is copied from volatile memory device 840 to APU 810 via memory controller 820. APU 810 calculates a senone score based on the selected senone and a data stream received by APU 810 (e.g., one or more feature vectors 315 of FIG. 3). After completing the calculation, APU 810 transfers the senone score to the requesting system (e.g., the CPU).
- In an embodiment, after a predetermined time of inactivity (e.g., senone scoring inactivity by APU 810),
volatile memory device 840 can be powered down. As a result, power efficiency in speech recognition system 800 can be improved, as a periodic refresh of memory cells in volatile memory device 840 will not be required. Here, the acoustic library is still stored in non-volatile memory device 830 such that the acoustic library can be retained when volatile memory device 840 is powered down. As would be understood by a person skilled in the art, when volatile memory device 840 is powered down, the contents stored therein (e.g., the acoustic library) will be lost. In an embodiment, when volatile memory device 840 is powered down, the other components of speech recognition system 800 can be powered down as well.
-
FIG. 9 is an illustration of an embodiment of a method 900 for data pattern analysis. Speech recognition system 800 of FIG. 8 can be used, for example, to perform the steps of method 900. In an embodiment, method 900 can be used to perform acoustic modeling process 420 of FIG. 4. Based on the description herein, a person skilled in the relevant art will recognize that method 900 can be used in other data pattern recognition applications such as, for example, image processing, audio processing, and handwriting recognition.
- In
step 910, a plurality of data patterns is copied from a non-volatile memory device (e.g., non-volatile memory device 830 of FIG. 8) to a volatile memory device (e.g., volatile memory device 840 of FIG. 8). In an embodiment, the plurality of data patterns can be one or more senones associated with an acoustic library.
- In
step 920, a data pattern from the volatile memory device is requested by a computational unit (e.g., APU 810 of FIG. 8) and transferred to the computational unit via a memory controller and bus (e.g., memory controller 820 and bus 825, respectively, of FIG. 8). In an embodiment, the requested data pattern is a senone from an acoustic library stored in the volatile memory device.
- In
step 930, after receiving the requested data pattern, the computational unit (e.g., APU 810 of FIG. 8) performs a data pattern analysis on a data stream received by the computational unit. In an embodiment, the data pattern analysis is a senone score calculation based on a selected senone and the data stream received by the computational unit (e.g., one or more feature vectors 315 of FIG. 3). After completing the data pattern analysis, the computational unit transfers the data pattern analysis result to the requesting system (e.g., the CPU).
- In
step 940, the volatile memory device powers down. In an embodiment, the volatile memory device powers down after a predetermined time of inactivity (e.g., inactivity in the data pattern analysis by the computational unit). As a result, power efficiency can be improved, as a periodic refresh of memory cells in the volatile memory device will not be required. In an embodiment, when the volatile memory device is powered down, the other components of the system (e.g., other components of speech recognition system 800) can be powered down as well. -
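Steps 910 through 940 can be sketched as a small software state machine. This is an illustrative model only; the class name, the dict-based "memories," and the scoring callback are our assumptions, not elements of the disclosure.

```python
class PatternRecognitionUnit:
    """Toy model of method 900: non-volatile storage survives power cycles."""
    def __init__(self, non_volatile_store, score_fn):
        self.non_volatile = non_volatile_store
        self.volatile = None                     # powered down until power_up()
        self.score_fn = score_fn

    def power_up(self):
        # Step 910: copy the data patterns (e.g., the acoustic library)
        # from non-volatile to volatile memory (a DMA-like bulk copy).
        self.volatile = dict(self.non_volatile)

    def analyze(self, pattern_id, data_stream):
        # Steps 920-930: fetch the requested pattern, then score the data stream.
        pattern = self.volatile[pattern_id]
        return self.score_fn(pattern, data_stream)

    def power_down(self):
        # Step 940: volatile contents are lost; the non-volatile copy is retained.
        self.volatile = None
```

Because the non-volatile copy is never modified, a subsequent `power_up()` restores the library without any involvement from the requesting system.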
FIG. 10 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1000. Speech recognition system 1000 includes an APU 1010, a SOC 1040, a DRAM device 1060, a Flash memory device 1070, and an I/O interface 1080. In an embodiment, APU 1010 is an integrated chip that includes a memory device 1020 configured to store an acoustic library and an Application-Specific Integrated Circuit (ASIC) 1030 configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4). In another embodiment, ASIC 1030 and memory device 1020 can be integrated on two separate chips. SOC 1040 includes a CPU 1050 configured to perform a signal processing process, a phoneme evaluation process, and a word modeling process (e.g., signal processing process 410, phoneme evaluation process 430, and word modeling process 440, respectively, of FIG. 4), according to an embodiment of the present invention. In an embodiment, APU 1010 and SOC 1040 are integrated on two separate chips.
-
FIG. 11 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1100. Speech recognition system 1100 includes an APU 1110, SOC 1040, DRAM device 1060, Flash memory device 1070, and I/O interface 1080. In an embodiment, APU 1110 is an integrated chip that includes an ASIC 1120, a volatile memory device 1130, and a non-volatile memory device 1140. In another embodiment, ASIC 1120, volatile memory device 1130, and non-volatile memory device 1140 can be integrated on two chips—e.g., ASIC 1120 and volatile memory device 1130 on one chip with non-volatile memory device 1140 on another chip; ASIC 1120 on one chip with volatile memory device 1130 and non-volatile memory device 1140 on another chip; or, ASIC 1120 and non-volatile memory device 1140 on one chip with volatile memory device 1130 on another chip. In yet another embodiment, ASIC 1120, volatile memory device 1130, and non-volatile memory device 1140 can each be integrated on a separate chip—i.e., three separate chips.
-
Non-volatile memory device 1140 can be configured to store an acoustic model that is copied to volatile memory device 1130 upon power-up of APU 1110, according to an embodiment of the present invention. In an embodiment, non-volatile memory device 1140 can be a Flash memory device and volatile memory device 1130 can be a DRAM device. Further, ASIC 1120 can be configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4), according to an embodiment of the present invention.
-
FIG. 12 is an illustration of another embodiment of a system-level architecture for a speech recognition system 1200. Speech recognition system 1200 includes DRAM device 1060, Flash memory device 1070, I/O interface 1080, a memory device 1210, and an SOC 1220. In an embodiment, SOC 1220 is an integrated chip that includes an ASIC 1230 and a CPU 1240. ASIC 1230 can be configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4) and CPU 1240 can be configured to perform a signal processing process, a phoneme evaluation process, and a word modeling process (e.g., signal processing process 410, phoneme evaluation process 430, and word modeling process 440, respectively, of FIG. 4), according to an embodiment of the present invention.
-
Memory device 1210 can be configured to store an acoustic library and to transfer one or more senones to ASIC 1230 via an I/O bus 1215, according to an embodiment of the present invention. In an embodiment, memory device 1210 can be a DRAM device or a Flash memory device. In another embodiment, the acoustic library can be stored in a memory device located within ASIC 1230 (not shown in FIG. 12) rather than memory device 1210. In yet another embodiment, the acoustic library can be stored in system memory for SOC 1220 (e.g., DRAM device 1060).
-
FIG. 13 is another illustration of an embodiment of a system-level architecture for a speech recognition system 1300. Speech recognition system 1300 includes DRAM device 1060, Flash memory device 1070, I/O interface 1080, a memory device 1210, and an SOC 1220. DRAM device 1060 can be configured to store an acoustic library and to transfer one or more senones to ASIC 1230 via an I/O bus 1315, according to an embodiment of the present invention.
-
FIG. 14 is an illustration of an embodiment of an APU 1400. In an embodiment, APU 1400 is an integrated chip that includes a memory module 1420 and a Senone Scoring Unit (SSU) 1430. In another embodiment, memory module 1420 and SSU 1430 can be integrated on two separate chips.
-
APU 1400 is in communication with a CPU (not shown in FIG. 14) via I/O signals 1410, in which APU 1400 is configured to perform an acoustic modeling process (e.g., acoustic modeling process 420 of FIG. 4), according to an embodiment of the present invention. In an embodiment, I/O signals 1410 can include an input feature vector data line for feature vector information, an input clock signal, an input APU enable signal, an output senone score data line for senone score information, and other I/O control signals for APU 1400. APU 1400 can be configured to receive one or more feature vectors (calculated by the CPU) via the feature vector data line from the CPU and to transmit a senone score via the senone score data line to the CPU for further processing, according to an embodiment of the present invention. In an embodiment, I/O signals 1410 can be implemented as, for example, an SPI bus, a PCI bus, an API bus, an AHB, an APB, a memory bus, or any other type of bus to provide a communication path between APU 1400 and the CPU (see, e.g., FIGS. 5-7 and associated description). An interface between APU 1400 and the CPU, as well as control signals for the interface, are described in further detail below.
- In an embodiment,
memory module 1420 and SSU 1430 can operate in two different clock domains. Memory module 1420 can operate at the clock frequency associated with the input clock signal to APU 1400 (e.g., from I/O signals 1410) and SSU 1430 can operate at a faster clock frequency based on the input clock signal, according to an embodiment of the present invention. For example, if the clock frequency associated with the input clock signal is 12 MHz, then SSU 1430 can operate at a clock-divided frequency of 60 MHz—five times faster than the clock frequency associated with the input clock signal. Techniques and methods for implementing clock dividers are known to a person skilled in the relevant art. As will be described in further detail below, the architecture of SSU 1430 can be based on the clock domain at which it operates.
- In reference to
FIG. 14, memory module 1420 includes a bus controller 1422, a memory controller 1424, a memory device 1426, and a bridge controller 1428. Memory device 1426 is configured to store an acoustic model to be used in a speech recognition process. In an embodiment, memory device 1426 can be a non-volatile memory device such as, for example, a Flash memory device. The acoustic library can be pre-loaded into the non-volatile memory device prior to operation of APU 1400 (e.g., during manufacturing and/or testing of APU 1400).
- In another embodiment,
memory device 1426 can be a volatile memory device such as, for example, a DRAM device. In an embodiment, when a senone request is received by APU 1400, memory controller 1424 can copy the acoustic library from a non-volatile memory device (either integrated on the same chip as APU 1400 or located external to APU 1400) to the volatile memory device. The acoustic library transfer process between the non-volatile and volatile memory devices can be implemented using, for example, a DMA operation.
- Bus controller 1422 is configured to control data transfer between
APU 1400 and an external CPU. In an embodiment, bus controller 1422 can control the receipt of feature vectors from the CPU and the transmission of senone scores from APU 1400 to the CPU. In an embodiment, bus controller 1422 is configured to transfer one or more feature vectors from the CPU to bridge controller 1428, which serves as an interface between memory module 1420 and SSU 1430. In turn, bridge controller 1428 transfers the one or more feature vectors to SSU 1430 for further processing. Upon calculation of a senone score, the senone score is transferred from SSU 1430 to memory module 1420 via bridge controller 1428, according to an embodiment of the present invention.
- In an embodiment, bus controller 1422 can receive a control signal (via I/O signals 1410) that provides an active senone list. In an embodiment, the active senone list can be transferred to
APU 1400 as a result of the phoneme evaluation process performed by the CPU (e.g., phoneme evaluation process 430 of FIG. 4). That is, in an embodiment, a feedback process can occur between the acoustic modeling process performed by APU 1400 and the phoneme evaluation process performed by the CPU (e.g., feedback 450 of FIG. 4). The active senone list can be used in senone score calculations for incoming feature vectors into APU 1400, according to an embodiment of the present invention.
- The active senone list indicates one or more senones stored in
memory device 1426 to be used in a senone score calculation. In an embodiment, the active senone list can include a base address associated with an address space of memory device 1426 and a list of indices related to the base address at which the one or more senones are located in memory device 1426. Bus controller 1422 can send the active senone list to SSU 1430 via bridge controller 1428, in which SSU 1430 is in communication with memory device 1426 (via memory controller 1424) to access the one or more senones associated with the active senone list.
- In another embodiment, bus controller 1422 can receive a control signal (via I/O signals 1410) that instructs
APU 1400 to perform the senone score calculation using all of the senones contained in the acoustic library (e.g., “score all” function). Bus controller 1422 sends the “score all” instruction to SSU 1430 via bridge controller 1428, in which SSU 1430 is in communication with memory device 1426 (via memory controller 1424) to access all of the senones associated with the acoustic library.
- Conventional speech recognition systems typically incorporate a feedback loop between acoustic modeling and phoneme evaluation modules (e.g.,
acoustic modeling module 320 and phoneme evaluation module 330 of FIG. 3) within the CPU to limit the number of senones used in senone score calculations. This is because, as discussed above with respect to speech recognition system 300 of FIG. 3, significant computing resources are dedicated to the acoustic modeling process where thousands of senones can be compared to a feature vector. This places a significant load on the CPU and the bandwidth of the data bus (e.g., data bus 240 of FIG. 3) transferring the senones from the memory device (e.g., memory device 230 of FIG. 3) to the CPU. Thus, for conventional speech recognition systems, active senone lists are used to limit the impact of the acoustic modeling process on the CPU. However, the use of active senone lists by the CPU can place limitations on the need to process incoming voice signals in real-time or substantially close to real time.
- The “score all” function of
APU 1400 not only alleviates the load on the CPU and the bandwidth of the data bus, but also provides processing of incoming voice signals in real-time or substantially close to real time. As discussed in further detail below, features of APU 1400 such as, for example, the bus width of data bus 1427 and the architecture of distance calculator 1436 of FIG. 14 provide a system for real-time or substantially close to real time speech recognition.
- In reference to
FIG. 14, SSU 1430 includes an output buffer 1432, an SSU control module 1434, a feature vector matrix module 1435, a distance calculator 1436, and an addition module 1438. SSU 1430 is configured to calculate a Mahalanobis distance between one or more feature vectors and one or more senones stored in memory device 1426, according to an embodiment of the present invention. Each of the one or more feature vectors can be composed of N dimensions, where N can equal, for example, 39. In an embodiment, each of the N dimensions in the one or more feature vectors can be a 16-bit mean value.
- Further, each of the one or more senones stored in
memory device 1426 is composed of one or more Gaussian probability distributions, where each of the one or more Gaussian probability distributions has the same number of dimensions as each of the one or more feature vectors (e.g., N dimensions). Each of the one or more senones stored in memory device 1426 can have, for example, 32 Gaussian probability distributions.
- As discussed above,
memory module 1420 and SSU 1430 can operate in two different clock domains. In an embodiment, SSU control module 1434 is configured to receive a clock signal from memory module 1420 via bridge controller 1428. The frequency of the clock signal received by SSU control module 1434 can be the same or substantially the same as the clock frequency associated with the input clock signal to APU 1400 (e.g., input clock signal from I/O signals 1410), according to an embodiment of the present invention.
- In an embodiment,
SSU control module 1434 can divide the frequency of its incoming clock signal and distribute that divided clock signal to other components of SSU 1430—e.g., output buffer 1432, feature vector matrix module 1435, distance calculator 1436, and addition module 1438—such that these other components operate at the clock-divided frequency. For example, if the clock frequency associated with the input clock signal (e.g., from I/O signals 1410) is 12 MHz, then SSU control module 1434 can receive the same or substantially the same clock signal from bridge controller 1428 and divide that clock frequency using known clock-dividing techniques and methods to a frequency of, for example, 60 MHz. SSU control module 1434 can distribute this clock-divided signal to the other components of SSU 1430 such that these other components operate at, for example, 60 MHz—five times faster than the clock frequency associated with the input clock signal.
- For simplicity purposes,
SSU control module 1434 to the other components ofSSU 1430 are not illustrated inFIG. 14 . For ease of reference, the frequency associated with this clock signal is also referred to herein as the “SSU clock frequency.” Further, for ease of reference, the frequency associated with the input clock signal toSSU control module 1434 is also referred to herein as the “memory module clock frequency.” -
FIG. 15 is an illustration of an embodiment of SSU control module 1434. SSU control module 1434 includes an input buffer 1510 and a control unit 1520. SSU control module 1434 is configured to receive one or more control signals from memory module 1420 via bridge controller 1428. In an embodiment, the one or more control signals can be associated with I/O signals 1410 and with control information associated with a Gaussian probability distribution outputted by memory device 1426. The control signals associated with I/O signals 1410 can include, for example, an active senone list and a "score all" function. The control information associated with the Gaussian probability distribution can include, for example, address information for a subsequent Gaussian probability distribution to be outputted by memory device 1426. - In reference to
FIG. 14, in an embodiment, when bus controller 1422 receives an active senone list via I/O signals 1410, the base address associated with the address space of memory device 1426 and the list of indices related to the base address at which the one or more senones are located in memory device 1426 can be stored in input buffer 1510 of FIG. 15. Control unit 1520 is in communication with input buffer 1510 to monitor the list of the senones to be applied by distance calculator 1436 of FIG. 14 in the senone score calculation. - For example, the active senone list can contain a base address associated with an address space of
memory device 1426 and a list of indices related to the base address at which the one or more senones are located in memory device 1426. As would be understood by a person skilled in the relevant art, the indices can refer to pointers or memory address offsets in reference to the base address associated with the address space of memory device 1426. Further, as discussed above, a senone can be composed of one or more Gaussian probability distributions, where each of the one or more Gaussian probability distributions has the same number of dimensions as each of one or more feature vectors (e.g., N dimensions) received by APU 1400. For explanation purposes, this example will assume that each senone stored in memory device 1426 is composed of 32 Gaussian probability distributions. Based on the description herein, a person skilled in the relevant art will understand that each of the senones can be composed of more or less than 32 Gaussian probability distributions. - In an embodiment, for the first senone in the active senone list,
control unit 1520 communicates with memory controller 1424 of FIG. 14 to access the first senone in memory device 1426 based on the base address and the first index information contained in the active senone list. The senone associated with the first index can include memory address information of the first 2 Gaussian probability distributions associated with that senone, according to an embodiment of the present invention. In turn, memory device 1426 accesses two Gaussian probability distributions associated with the first senone in, for example, a sequential manner. For example, memory device 1426 accesses the first Gaussian probability distribution and outputs this Gaussian probability distribution to distance calculator 1436 via data bus 1427. As memory device 1426 outputs the first Gaussian probability distribution, memory device 1426 can also access the second Gaussian probability distribution. - In an embodiment, the second Gaussian probability distribution can include memory address information for a third Gaussian probability distribution to be accessed by
memory device 1426. Memory device 1426 can communicate this memory address information to control unit 1520 of FIG. 15 via bridge controller 1428 of FIG. 14. Control unit 1520, in turn, communicates with memory controller 1424 of FIG. 14 to access the third Gaussian probability distribution. In an embodiment, as the third Gaussian probability distribution is being accessed by memory device 1426, the second Gaussian probability distribution can be outputted to distance calculator 1436 via data bus 1427. This iterative, overlapping process of accessing a subsequent Gaussian probability distribution while outputting a current Gaussian probability distribution is performed for all of the Gaussian probability distributions associated with the senone (e.g., for all of the 32 Gaussian probability distributions associated with the senone). A benefit, among others, of the iterative, overlapping (or parallel) processing is faster performance in senone score calculations. -
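The base-address-plus-index addressing scheme described above can be sketched as follows. This is a hypothetical illustration only: the function name, the fixed senone record size, and the example numbers are assumptions, not details from this description.

```python
# Hypothetical sketch of active-senone-list addressing: each index in the
# list is an offset from the base address of the acoustic model's address
# space. The fixed 1024-byte senone record size is an assumed value.
SENONE_SIZE_BYTES = 1024

def senone_addresses(base_address, indices, senone_size=SENONE_SIZE_BYTES):
    """Resolve each index in the active senone list to an absolute
    memory address relative to the base address."""
    return [base_address + i * senone_size for i in indices]

# Active senone list: base address plus the indices of senones to score.
addrs = senone_addresses(0x4000, [0, 3, 7])
```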
Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 such that the memory access and transfer process occurs in a pipelined manner, according to an embodiment of the present invention. After the 32 Gaussian probability distributions associated with the first senone are outputted to distance calculator 1436 of FIG. 14, control unit 1520 repeats the above process for the one or more remaining senones in the active senone list. - After the senones in the active senone list are used in the senone score calculations for a current feature vector,
memory module 1420 can receive a control signal via I/O signals 1410 that indicates that the active senone list from the current feature vector is to be used in senone score calculations for a subsequent feature vector, according to an embodiment of the present invention. Upon receipt of the control signal from memory module 1420 via bridge controller 1428, SSU control module 1434 uses the same active senone list from the current feature vector in the senone score calculations for the subsequent feature vector. In particular, control unit 1520 of FIG. 15 applies the same base address and list of indices related to the base address stored in input buffer 1510 to the subsequent feature vector. Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 for the subsequent feature vector in a similar manner as described above with respect to the active senone list example. - In another embodiment,
memory module 1420 can receive a control signal via I/O signals 1410 that indicates a "score all" operation. As discussed above, the "score all" function refers to an operation where a feature vector is compared to all of the senones contained in an acoustic library stored in memory device 1426. In an embodiment, control unit 1520 of FIG. 15 communicates with memory controller 1424 of FIG. 14 to access a first senone in memory device 1426. The first senone can be, for example, located at a beginning memory address associated with an address space of memory device 1426. Similar to the active senone list example above, the first senone in memory device 1426 can include memory address information of the first 2 Gaussian probability distributions associated with that senone, according to an embodiment of the present invention. In turn, memory device 1426 accesses two Gaussian probability distributions associated with the first senone in, for example, a sequential manner. - In an embodiment, similar to the active senone list example above, the second Gaussian probability distribution can include memory address information on a third Gaussian probability distribution to be accessed by
memory device 1426. Memory device 1426 can communicate this memory address information to control unit 1520 of FIG. 15 via bridge controller 1428 of FIG. 14. Control unit 1520, in turn, communicates with memory controller 1424 of FIG. 14 to access the third Gaussian probability distribution. In an embodiment, as the third Gaussian probability distribution is being accessed by memory device 1426, the second Gaussian probability distribution can be outputted to distance calculator 1436 via data bus 1427. This iterative, overlapping process of accessing a subsequent Gaussian probability distribution while outputting a current Gaussian probability distribution is performed for all of the Gaussian probability distributions associated with the senone (e.g., for all of the 32 Gaussian probability distributions associated with the senone). -
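The overlapped access/output pattern described above can be modeled with a short sketch. The dictionary-based memory model and the field names are invented for illustration and are not taken from this description.

```python
# Hypothetical model of the pipelined transfer: while the current Gaussian
# probability distribution is output on the data bus, the address embedded
# in it is used to start the access for the next one.
def stream_gaussians(memory, first_addr, count):
    """Yield `count` Gaussian records, prefetching the next record
    while the current one is being output."""
    current = memory[first_addr]            # access the first Gaussian
    for _ in range(count):
        next_addr = current["next_addr"]    # address info carried in-band
        prefetched = memory.get(next_addr)  # start the next access ...
        yield current["data"]               # ... while outputting this one
        current = prefetched

mem = {
    0: {"data": "g0", "next_addr": 1},
    1: {"data": "g1", "next_addr": 2},
    2: {"data": "g2", "next_addr": None},
}
out = list(stream_gaussians(mem, 0, 3))
```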
Control unit 1520 of FIG. 15 monitors the transfer process of Gaussian probability distributions from memory device 1426 to distance calculator 1436 such that the memory access and transfer process occurs in a pipelined manner, according to an embodiment of the present invention. After the Gaussian probability distributions associated with the first senone are outputted to distance calculator 1436 of FIG. 14, control unit 1520 repeats the above process for the one or more remaining senones in the acoustic library. - In reference to
FIG. 14, feature vector matrix module 1435 is used for speaker adaptation in APU 1400. In an embodiment, feature vector matrix module 1435 receives a feature vector transform matrix (FVTM) from the CPU via I/O signals 1410. The FVTM can be loaded into feature vector matrix module 1435 periodically such as, for example, once per utterance. In an embodiment, the FVTM can be stored in a Static Random Access Memory (SRAM) device located within feature vector matrix module 1435. - Along with mean and variance values stored for each senone in
memory device 1426, an index can also be stored for each senone, in which the index points to a row in the FVTM, according to an embodiment of the present invention. The number of rows in the FVTM can vary (e.g., 10, 50, or 100 rows) and can be specific to a voice recognition system implementing APU 1400. Each row in the FVTM can have an equal number of entries as the N number of dimensions for a feature vector (e.g., 39), where each of the entries is a scaling factor that is multiplied by its corresponding feature vector dimension to produce a new feature vector, according to an embodiment of the present invention. The selected row from the FVTM (e.g., a row of 39 scaling factors) is transferred to distance calculator 1436 via data bus 1439, in which distance calculator 1436 performs the multiplication operation to generate the new feature vector, as will be described in further detail below. - In an embodiment,
SSU control module 1434 provides a feature vector received from the CPU and an index associated with a senone to feature vector matrix module 1435. The index indicates a particular row in the FVTM for scaling the feature vector. For example, the FVTM can have 100 rows and the index can be equal to 10. Here, for a feature vector with 39 dimensions, the 10th row of the FVTM contains 39 scaling factors, in which the row of scaling factors is transferred to distance calculator 1436 to generate the new feature vector. - In reference to
FIG. 14, distance calculator 1436 is configured to calculate a distance between one or more dimensions of a senone stored in memory device 1426 and a corresponding one or more dimensions of a feature vector. FIG. 16 is an illustration of an embodiment of distance calculator 1436. Distance calculator 1436 includes a datapath multiplexer (MUX) 1610, a feature vector buffer 1620, arithmetic logic units (ALUs) 1630 1-1630 8, and an accumulator 1640. -
Datapath MUX 1610 is configured to receive a Gaussian probability distribution from memory device 1426 of FIG. 14 via data bus 1427. In an embodiment, the width of data bus 1427 is equal to the number of bits associated with one Gaussian probability distribution. For example, if one Gaussian probability distribution is 768 bits, then the width of data bus 1427 is also 768 bits. Over a plurality of Gaussian probability distribution dimensions, the 768 bits associated with the Gaussian probability distribution can be allocated to a 16-bit mean value, a 16-bit variance value, and other attributes per Gaussian probability distribution dimension. As discussed above, the Gaussian probability distribution can have the same number of dimensions as a feature vector—e.g., 39 dimensions. In another embodiment, the width of data bus 1427 can be greater than 256 bits. - Further, in an embodiment,
memory device 1426 and distance calculator 1436 can be integrated on the same chip, where data bus 1427 is a wide bus (of the width discussed above) integrated on the chip to provide data transfer of the Gaussian probability distribution from memory device 1426 to distance calculator 1436. In another embodiment, memory device 1426 and distance calculator 1436 can be integrated on two separate chips, where data bus 1427 is a wide bus (of the width discussed above) that is tightly coupled between the two chips such that degradation of data due to noise and interconnect parasitic effects is minimized. As will be discussed below, a benefit of a wide data bus 1427 (of the width discussed above), among others, is to increase performance of APU 1400 in the calculation of senone scores. -
Datapath MUX 1610 is also configured to receive one or more control signals and a feature vector from SSU control module 1434 via data bus 1437, as well as feature vector scaling factors from feature vector buffer 1620. In an embodiment, feature vector buffer 1620 can be configured to store scaling factors (associated with a selected row of the FVTM) transferred from feature vector matrix module 1435 via data bus 1439. In another embodiment, feature vector buffer 1620 can be configured to store the FVTM. Here, one or more control signals from SSU control module 1434 via data bus 1437 can be used to select the FVTM row. Datapath MUX 1610 outputs the feature vector, selected feature vector scaling factors from the FVTM, and Gaussian probability distribution information to ALUs 1630 1-1630 8 via data bus 1612 for further processing. - In an embodiment,
datapath MUX 1610 is also configured to receive a Gaussian weighting factor from the one or more control signals from SSU control module 1434 via data bus 1437. Datapath MUX 1610 is configured to output the Gaussian weighting factor to accumulator 1640 for further processing. - In reference to
FIG. 16, each of ALUs 1630 1-1630 8 is configured, per SSU clock cycle, to calculate a distance score between a dimension of a Gaussian probability distribution received from datapath MUX 1610 and a corresponding dimension of a feature vector, according to an embodiment of the present invention. In an embodiment, ALUs 1630 1-1630 8 can operate at the SSU clock frequency (e.g., 5 times faster than the memory module clock frequency) such that for every read operation from memory device 1426 of FIG. 14 (e.g., to transfer a Gaussian probability distribution to distance calculator 1436), a distance score associated with a Gaussian probability distribution (also referred to herein as a "Gaussian distance score") is outputted from distance calculator 1436 to addition module 1438. - In an embodiment,
datapath MUX 1610 is configured to distribute feature vector information associated with one dimension, a mean value associated with a corresponding dimension of a Gaussian probability distribution, a variance value associated with the corresponding dimension of the Gaussian probability distribution, and feature vector scaling factors to each of ALUs 1630 1-1630 8. Based on the feature vector information and the feature vector scaling factors allocated to a respective ALU, each of ALUs 1630 1-1630 8 is configured to generate a new feature vector by multiplying dimensions of the feature vector by respective scaling factors. - In an embodiment, the multiplication of the feature vector dimensions by the corresponding scaling factors is performed "on-the-fly," meaning that the multiplication operation is performed during the calculation of the distance score. This is in contrast to the multiplication operation being performed for each of the rows in a FVTM and the results of the multiplication operation being stored in memory to be later accessed by each of ALUs 1630 1-1630 8. A benefit of the "on-the-fly" multiplication operation, among others, is that memory storage is not required for the results of the multiplication operation associated with non-indexed (or non-selected) rows of the FVTM. This, in turn, results in a faster generation of the new feature vector, since additional clock cycles are not required to store the feature vector scaling results associated with the non-indexed rows in memory, and also results in a smaller die size area for ALUs 1630 1-1630 8.
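The "on-the-fly" scaling can be modeled as below. The function name and toy values are assumptions for illustration, and the squared-difference-over-variance expression is a simplified stand-in for the ALU distance computation described later.

```python
# Sketch of "on-the-fly" feature-vector scaling: the scaling factor is
# applied inside the distance calculation itself, so scaled feature
# vectors for non-selected FVTM rows are never computed or stored.
def scaled_distance_dim(x, alpha, mean, var):
    """One dimension's distance with the scaling folded in; the
    distance form here is a simplified illustration."""
    new_x = x * alpha            # scaling performed during the calculation
    return (new_x - mean) ** 2 / var

d = scaled_distance_dim(x=2.0, alpha=0.5, mean=0.0, var=1.0)  # new_x = 1.0
```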
- Based on the new feature vector, the mean value, and the variance value for a respective ALU, each of ALUs 1630 1-1630 8 is configured to calculate a distance score based on a feature vector dimension and a corresponding Gaussian probability distribution dimension per SSU clock cycle, according to an embodiment of the present invention. Cumulatively, in one clock cycle, ALUs 1630 1-1630 8 generate distance scores for 8 dimensions (i.e., 1 dimension calculation per ALU). The architecture and operation of the ALU is described in further detail below.
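The cycle arithmetic behind the example figures used in this description (39 feature vector dimensions, 8 ALUs, a 60 MHz SSU clock, and a 12 MHz memory module clock) can be checked directly:

```python
# Worked check of the pipeline balance: one dimension per ALU per SSU
# cycle, so 39 dimensions on 8 ALUs take ceil(39/8) = 5 SSU cycles,
# matching the 5 SSU cycles available per memory read cycle.
import math

DIMENSIONS = 39
NUM_ALUS = 8
SSU_MHZ = 60
MEM_MHZ = 12

ssu_cycles_per_gaussian = math.ceil(DIMENSIONS / NUM_ALUS)
ssu_cycles_per_mem_read = SSU_MHZ // MEM_MHZ
# The distance calculation for one Gaussian finishes in exactly one
# memory read cycle, so neither side of the pipeline stalls.
balanced = ssu_cycles_per_gaussian == ssu_cycles_per_mem_read
```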
- The number of ALUs in
distance calculator 1436 can be dependent on the SSU clock frequency and the memory module clock frequency discussed above such that distance calculator 1436 outputs a distance score for one Gaussian probability distribution for every read access to memory device 1426, according to an embodiment of the present invention. For example, the memory module clock frequency can have an operating frequency of 12 MHz, where memory device 1426 also operates at 12 MHz (e.g., for a read access of approximately 83 ns). SSU 1430 can have an SSU clock frequency of, for example, 60 MHz to operate five times faster than the memory module clock frequency. With a feature vector of 39 dimensions and 8 ALUs, a Gaussian distance score for one Gaussian probability distribution can be calculated in 5 SSU clock cycles or 1 memory module clock cycle. Therefore, by design, the 5 SSU clock cycles is a predetermined number of clock cycles that corresponds to 1 memory module clock cycle, where, as one Gaussian probability distribution is read from memory device 1426 in 1 memory module clock cycle, a Gaussian distance score for another Gaussian probability distribution is calculated by accumulator 1640. - In an embodiment, a portion of ALUs 1630 1-1630 8 can be activated on a rising edge of an SSU clock cycle, while the remaining portion of ALUs 1630 1-1630 8 can be activated on a falling edge of the SSU clock cycle. For example, ALUs 1630 1-1630 4 can be activated on the rising edge of the SSU clock cycle and ALUs 1630 5-1630 8 can be activated on the falling edge of the SSU clock cycle. As a result of staggering the activation of ALUs 1630 1-1630 8, the peak current (and peak power) generated by
distance calculator 1436 can be minimized, thus decreasing the susceptibility of distance calculator 1436 to reliability issues. - Based on the description herein, a person skilled in the relevant art will recognize that the architecture of
distance calculator 1436 is not limited to the above example. Rather, as would be understood by a person skilled in the relevant art, distance calculator 1436 can operate at a clock frequency faster or slower than 60 MHz, and distance calculator 1436 can include more or less than 8 ALUs. - In reference to
FIG. 16, accumulator 1640 is configured to receive the outputs from each of ALUs 1630 1-1630 8 and the Gaussian weighting factor from datapath MUX 1610 (via data bus 1614). As discussed above, in an embodiment, for every SSU clock cycle, a distance score for a Gaussian probability distribution dimension is outputted by each of ALUs 1630 1-1630 8. These distance scores from each of ALUs 1630 1-1630 8 are stored and accumulated by accumulator 1640 to generate a distance score for the Gaussian probability distribution, or Gaussian distance score—e.g., accumulator 1640 adds the respective distance scores calculated by ALUs 1630 1-1630 8 per SSU clock cycle. - After the Gaussian distance scores associated with all of the Gaussian probability distribution dimensions are accumulated in accumulator 1640 (e.g., 39 dimensions),
accumulator 1640 multiplies the total sum by the Gaussian weighting factor to generate a weighted Gaussian distance score. In an embodiment, the Gaussian weighting factor is optional, where accumulator 1640 outputs the Gaussian distance score. In another embodiment, the Gaussian weighting factor is specific to each Gaussian and is stored in memory device 1426. -
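Numerically, the accumulator's role reduces to the following sketch; the function name and example values are illustrative, not taken from this description.

```python
# Sketch of the accumulator: sum the per-dimension distance scores
# produced by the ALUs across SSU clock cycles, then apply the optional
# Gaussian weighting factor to the accumulated total.
def gaussian_distance_score(dimension_scores, weight=None):
    total = sum(dimension_scores)        # accumulate all dimension scores
    return total * weight if weight is not None else total

unweighted = gaussian_distance_score([1.0, 2.0, 3.0])
weighted = gaussian_distance_score([1.0, 2.0, 3.0], weight=0.5)
```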
Addition module 1438 is configured to add one or more Gaussian distance scores (or weighted Gaussian distance scores) to generate a senone score. As discussed above, each senone can be composed of one or more Gaussian probability distributions, in which each Gaussian probability distribution can be associated with a Gaussian distance score. For a senone with a plurality of Gaussian probability distributions (e.g., 32 Gaussian probability distributions), addition module 1438 sums the Gaussian distance scores associated with all of the Gaussian probability distributions to generate the senone score. In an embodiment, addition module 1438 is configured to perform the summation operation in the log domain to generate the senone score. -
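One common way to realize log-domain summation is the log-sum-exp identity; whether addition module 1438 uses this exact formulation is an assumption of this sketch.

```python
# If each Gaussian distance score is a log-probability, the senone
# (mixture) score is the log of the sum of the underlying probabilities,
# computed stably by factoring out the maximum term.
import math

def log_domain_sum(log_scores):
    m = max(log_scores)
    return m + math.log(sum(math.exp(s - m) for s in log_scores))

# log(0.25) and log(0.75) combine to log(0.25 + 0.75) = log(1) = 0.
senone_score = log_domain_sum([math.log(0.25), math.log(0.75)])
```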
Output buffer 1432 is configured to receive a senone score from addition module 1438 and transfer the senone score to bridge controller 1428. Bridge controller 1428, in turn, transfers the senone score to the external CPU via bus controller 1422. In an embodiment, output buffer 1432 can include a plurality of memory buffers such that, as a first senone score in a first memory buffer is being transferred to bridge controller 1428, a second senone score generated by addition module 1438 can be transferred to a second memory buffer for a subsequent transfer to bridge controller 1428. -
FIG. 17 is an illustration of an embodiment of a method 1700 for acoustic modeling. The steps of method 1700 can be performed using, for example, APU 1400 of FIG. 14. - In
step 1710, a plurality of Gaussian probability distributions is received via a data bus having a width of at least one Gaussian probability distribution, and a feature vector is received from an external computing device. The Gaussian probability distribution can be composed of, for example, 768 bits, where the width of the data bus is at least 768 bits. Further, APU 1400 of FIG. 14 can receive the feature vector from the external computing device (e.g., a CPU in communication with APU 1400 via I/O signals 1410 of FIG. 14). - In an embodiment, information associated with a plurality of dimensions of the feature vector, a plurality of mean values associated with the corresponding plurality of dimensions of the at least one Gaussian probability distribution, and a plurality of variance values associated with the corresponding plurality of dimensions of the at least one Gaussian probability distribution are distributed to, for example, arithmetic logic units (e.g., ALUs 1630 1-1630 8 of
FIG. 16 ). - In
step 1720, a plurality of dimension distance scores is calculated based on a plurality of dimensions of the feature vector and a corresponding plurality of dimensions of the at least one Gaussian probability distribution. In an embodiment, the distance score calculations are based on at least one senone from an active senone list. The active senone list can include a base address associated with an address space of a memory device and one or more indices related to the base address at which the at least one senone is located in the memory device. Further, a plurality of scaling factors for the plurality of dimensions of the feature vector are stored, where the plurality of scaling factors are applied to the plurality of dimensions of the feature vector during the calculation of the plurality of dimension distance scores. Step 1720 can be performed by, for example, distance calculator 1436 of FIG. 14. - In
step 1730, the plurality of dimension distance scores are summed to generate a Gaussian distance score for the at least one Gaussian probability distribution. In an embodiment, the Gaussian distance score is generated over a predetermined number of senone scoring unit (SSU) clock cycles. The predetermined number of SSU clock cycles can equate to a read access time of the at least one Gaussian probability distribution from a memory device. Step 1730 can be performed by, for example, distance calculator 1436 of FIG. 14. - In
step 1740, a plurality of Gaussian distance scores corresponding to the plurality of Gaussian probability distributions is summed to generate a senone score. Step 1740 can be performed by, for example, addition module 1438 of FIG. 14. - Embodiments of the present invention address and solve the issues discussed above with respect to conventional
speech recognition system 200 of FIG. 3. In summary, the acoustic modeling process is performed by, for example, APU 1400 of FIG. 14. The APU operates in conjunction with a CPU, in which the APU can receive one or more feature vectors (e.g., feature vectors 315 of FIG. 3) from the CPU, calculate a senone score (e.g., senone score 325 of FIG. 3) based on one or more Gaussian probability distributions, and output the senone score to the CPU. In an embodiment, the one or more Gaussian probability distributions can be stored in the APU. Alternatively, in another embodiment, the one or more Gaussian probability distributions can be stored externally to the APU, in which the APU receives the one or more Gaussian probability distributions from an external memory device. Based on embodiments of the APU architecture described above, an accelerated calculation for the senone score is achieved. -
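The overall flow summarized above can be condensed into a short functional sketch. The squared-difference-over-variance distance and the plain (non-log-domain) sums are simplifying assumptions, as are all of the names used here.

```python
# End-to-end sketch of senone scoring: per-dimension distances (step 1720),
# summed per Gaussian (step 1730), then summed over a senone's Gaussians
# (step 1740).
def dimension_distances(fv, means, variances):
    return [(x - m) ** 2 / v for x, m, v in zip(fv, means, variances)]

def gaussian_score(fv, gaussian):
    return sum(dimension_distances(fv, gaussian["mean"], gaussian["var"]))

def senone_score(fv, senone):
    return sum(gaussian_score(fv, g) for g in senone)

senone = [
    {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"mean": [1.0, 1.0], "var": [1.0, 1.0]},
]
score = senone_score([1.0, 0.0], senone)  # 1.0 from each Gaussian
```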
FIG. 18 is a block diagram of an ALU 1800, according to an embodiment of the present invention. In an embodiment, one or more of ALUs 1630 1-1630 8 can be implemented according to the architecture shown in FIG. 18. ALU 1800 is configured to compute a one-dimensional distance score between a feature vector and a Gaussian probability distribution vector. For example, ALU 1800 can be configured to compute the one-dimensional distance score as, - distanceij=(M1·Δij 2)/varij+M2·log(varij)+C,  (1)
- where:
-
Δij =x i−μij, - varij, is the variance value of the ith dimension of the jth Gaussian probability distribution vector;
M1 and M2 are scaling factors;
C is a constant;
xi is the value of the feature vector in the ith dimension; and
μij is the mean value of the ith dimension of the jth Gaussian probability distribution vector. - Thus, in an embodiment, for a given dimension and a given Gaussian probability distribution, the one-dimensional distance score output by
ALU 1800 is dependent on three variables: xi, μij, and varij. One technique for implementing this equation in software is to generate a look-up table (LUT) that is indexed with these three variables. Moreover, because the score does not specifically depend on the values of xi and μij, but rather on the difference between them, Δij, this LUT can be further simplified into a two-dimensional LUT indexed by Δij and varij. Thus, a two-dimensional LUT could be used to implement ALUs 1630 1-1630 8. - A two-dimensional LUT, however, could have substantial drawbacks if used to implement ALUs 1630 1-1630 8 in the hardware implementation of
FIG. 16. In particular, for example, because there are eight ALUs 1630 1-1630 8 that each compute a respective one-dimensional distance score, there would have to be eight copies of this two-dimensional LUT. In one embodiment, such a two-dimensional LUT is approximately 32 Kbytes, although other embodiments and applications may require larger LUTs. Thus, in such an embodiment, eight copies of a 32 Kbyte LUT would be needed. If implemented in such a manner, a large amount of the total board space for the SSU would be allocated to only the eight two-dimensional LUTs. This problem would be exacerbated if larger LUTs were required or desired. - In an embodiment,
ALU 1800 overcomes this drawback of two-dimensional LUTs by implementing a scoring function using a combination of computational logic and a one-dimensional LUT. Importantly, Equation (1) can be split into two parts: an aluij part and a LUTij part, with each specified below. -
- aluij=(M1·Δij 2)/varij  (2)
- LUTij=M2·log(varij)+C  (3) - Thus,
ALU 1800 computes aluij and, in parallel with the computing, retrieves LUTij. The aluij and LUTij are then combined to form the distance score. In particular, as shown in FIG. 18, ALU 1800 includes a computational logic unit 1802 and a LUT module 1804. As described in further detail below, computational logic unit 1802 can compute value aluij and LUT module 1804 can be used to retrieve value LUTij. Moreover, ALU 1800 additionally includes a combination module 1806. Combination module 1806 combines the outputs of computational logic unit 1802 and LUT module 1804 and outputs the distance score. -
Computational logic unit 1802 and LUT module 1804 only receive the inputs that are needed to determine their respective values. Specifically, as described above, aluij depends on three variables: xi, μij, and varij. Thus, as shown in FIG. 18, computational logic unit 1802 receives these three values as inputs. Moreover, the values retrieved from LUT module 1804 are indexed using value varij alone. Thus, as shown in FIG. 18, LUT module 1804 only receives value varij. -
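A numeric model of the split is sketched below. The values of M1, M2, and C, the log form of the variance-only term, and the quantized variance levels are all assumptions chosen only to make the structure concrete.

```python
# Model of ALU 1800's decomposition: the variance-only term comes from a
# one-dimensional LUT while the computational logic evaluates the
# delta-dependent term; the combiner adds the two parts.
import math

M1, M2, C = 1.0, 0.5, 0.0
VAR_LEVELS = [0.5, 1.0, 2.0, 4.0]                    # quantized variances
LUT = {v: M2 * math.log(v) + C for v in VAR_LEVELS}  # indexed by varij only

def one_dim_distance(x, mu, var):
    delta = x - mu
    alu_part = M1 * delta * delta / var  # computational logic unit
    lut_part = LUT[var]                  # retrieved in parallel from the LUT
    return alu_part + lut_part           # combination module

d = one_dim_distance(x=3.0, mu=1.0, var=2.0)
```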
FIG. 19 shows a detailed block diagram of ALU 1800, according to an embodiment of the present invention. In the embodiment of FIG. 19, computational logic unit 1802 includes a subtraction module 1910, a squaring module 1912, a LUT 1914, a multiplier 1916, and a formatting module 1918. Subtraction module 1910 computes the difference between xi and μij, i.e., subtraction module 1910 computes Δij. Squaring module 1912 squares the difference output by subtraction module 1910, generating an integer representing Δij 2. - In an embodiment,
LUT 1914 outputs a value that corresponds to -
- M1/varij
Multiplier 1916 computes a product of two terms: (1) the value retrieved fromLUT 1914 and (2) the square output by squaringmodule 1912. Thus, the output ofmultiplier 1916 is -
- (M1·Δij 2)/varij
formatting module 1918, which formats the result so that it can be effectively combined with the output ofLUT module 1804. - As shown in
FIG. 19, LUT module 1804 includes a LUT 1920 and a formatting module 1922. LUT 1920 stores values corresponding to LUTij, as expressed in Equation (3), and is indexed using varij. The value retrieved from LUT 1920 is received by formatting module 1922. Formatting module 1922 formats the output of LUT 1920 so that it can be effectively combined with the output of computational logic unit 1802. - The outputs from
computational logic unit 1802 and LUT module 1804 are received at combination module 1806. Combination module 1806 includes an adder 1930, a shift module 1932, a rounding module 1934, and a saturation module 1936. Adder 1930 computes the sum of the two received values and outputs the sum. Shift module 1932 is configured to remove the fractional portion of the sum output by adder 1930. Rounding module 1934 is configured to round down the output of shift module 1932. Saturation module 1936 is configured to receive the rounded sum and saturate the value to a specific number of bits. Thus, the output of saturation module 1936 is a value having a specific number of bits that represents the one-dimensional distance score. -
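The add/shift/round/saturate chain can be sketched with assumed bit widths; the fractional-bit count and output width below are illustrative, not values given in this description.

```python
# Sketch of combination module 1806's fixed-point post-processing:
# add the two parts, shift off the fractional bits (which also rounds
# down), and saturate to a signed output width.
FRAC_BITS = 4   # assumed number of fractional bits removed by the shift
OUT_BITS = 8    # assumed signed output width enforced by saturation

def combine(alu_part, lut_part, frac_bits=FRAC_BITS, out_bits=OUT_BITS):
    total = alu_part + lut_part                   # adder 1930
    total >>= frac_bits                           # shift 1932 / round down
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, total))                # saturation module 1936

in_range = combine(40, 24)      # (40 + 24) >> 4 = 4
clipped = combine(1 << 15, 0)   # 32768 >> 4 = 2048, saturates to 127
```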
FIG. 20 is a block diagram of computational logic unit 1802, according to another embodiment of the present invention. The embodiment shown in FIG. 20 is similar to the embodiment of FIG. 19, except that the embodiment of FIG. 20 additionally includes a transform module 2002, an exception handling module 2012, a formatting module 2014, and a multiplexer 2018. -
Transform module 2002 includes a multiplier 2020, a scale bit module 2022, and a saturation module 2024. As described above, values of a feature vector can be transformed by respective entries in a feature vector transform matrix to, for example, account for learned characteristics of a speaker. In an embodiment, transform module 2002 can be configured to scale individual feature vector values xi by corresponding transform values αi. Specifically, multiplier 2020 computes a product of the feature vector value xi and the corresponding transform value αi and outputs a value to scale bit module 2022. Scale bit module 2022 shifts the value to the right and outputs the resulting integer to saturation module 2024. Saturation module 2024, which is similar to saturation module 1936 described with reference to FIG. 19, saturates the received value to a specific number of bits. Thus, the output of saturation module 2024 is a value that represents the scaled feature vector value. -
Exception handling module 2012 and multiplexer 2018 are configured to address specific errors present in LUT 1914. For example, in an effort to save space, the size of LUT 1914 can be reduced. This reduction in size can cause specific values of LUT 1914 to have an error. In such an embodiment, exception handling module 2012 can recognize if the output of LUT 1914 will be one of those values, and output the correct value. Put another way, exception handling module 2012 can act as a LUT that includes an entry for each entry of LUT 1914 that may have an error due to size restrictions. Because LUT 1914 is indexed based on varij, exception handling module 2012 can recognize whether the output of LUT 1914 needs to be corrected based on the value of varij. - In a further embodiment,
exception handling module 2012 can act as a two-dimensional LUT that also receives Δij. In such an embodiment, exception handling module 2012 can output specific values of aluij (e.g., as opposed to the corresponding entry from LUT 1914). Because the number of these possible errors in LUT 1914 is relatively small, exception handling module 2012 does not occupy a significant amount of space, as would other, larger two-dimensional LUTs. Furthermore, by controlling multiplexer 2018 to output the output of exception handling module 2012 instead of the output of sign bit module 1918, exception handling module 2012 can ensure that the stored value for aluij, rather than the value of aluij calculated using the incorrect output of LUT 1914, is finally output to combination module 1806. -
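The correction scheme can be sketched as a small exception table that is consulted ahead of the main LUT; the table contents and index values below are hypothetical:

```python
def lut_lookup(var, exceptions, lut):
    """Sketch of exception handling module 2012 plus multiplexer 2018:
    if the index is known to produce a bad entry in the reduced LUT,
    output the corrected stored value instead of the LUT output."""
    if var in exceptions:           # exception handling module 2012
        return exceptions[var]      # multiplexer 2018 selects the correction
    return lut[var]                 # normal path through LUT 1914
```

Because the exception table holds only the handful of indices known to be wrong, it stays far smaller than a full two-dimensional LUT would be.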
Formatting module 2014 receives the product computed by multiplier 1916. In an embodiment, formatting module 2014 is configured to reduce the number of bits in the result. While not necessary, this operation can save space and power by reducing the number of bits on the output. - Moreover, the embodiment of
FIG. 20 shows subtraction module 1810 as including multiplexers, a comparison module 2008, and a subtractor 2010. In an embodiment, squaring module 1912 may be configured to square specifically positive values. Thus, the output of subtraction module 1910 in such an embodiment must be positive. To achieve this result, the two operands, i.e., the feature vector value (optionally scaled with transform value αij) and the mean value μij, can be compared by comparison module 2008. Comparison module 2008 then outputs a control signal to the multiplexers such that the first operand received by subtractor 2010 is at least as large as the second operand. -
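Putting the FIG. 19 and FIG. 20 datapaths together, the one-dimensional score can be sketched in floating point. The hardware performs fixed-point equivalents of each step, and the log-term formula below assumes the standard Gaussian log-likelihood with the stored variance value encoding 1/(2σ²); that storage format is an assumption, not stated in the text:

```python
import math

def one_dim_score(x, alpha, mean, var):
    """Floating-point sketch of the one-dimensional distance datapath of
    FIGS. 19-20. 'var' is assumed to encode 1/(2*sigma^2)."""
    xs = alpha * x                          # transform module 2002
    hi, lo = max(xs, mean), min(xs, mean)   # comparison module 2008 orders
    d = hi - lo                             # operands so the difference is
                                            # never negative before squaring
    weighted = d * d * var                  # squaring module and multiplier
    log_term = 0.5 * math.log(math.pi / var)  # the LUT's role in hardware
    return weighted + log_term              # combination module 1806
```

For σ² = 2 (so var = 0.25), a scaled input one unit from the mean yields 0.25 plus the log normalization term, matching the negative log of the Gaussian density.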
FIG. 21 is an illustration of an embodiment of a method 2100 for computing a one-dimensional distance score. The steps of method 2100 can be performed using, for example, ALU 1800 shown in FIG. 18. In step 2102, a feature vector dimension is scaled by a transform value. In step 2104, a first value is computed based on the feature vector value and a mean and a variance associated with a Gaussian probability distribution vector. In step 2106, a second value is retrieved based on the variance value. For example, in FIG. 19, LUT module 1804 can be used to retrieve the second value based on the variance. In step 2108, the first and second values are combined to generate the one-dimensional distance score. - A. System Overview
-
FIG. 22 is a block diagram of an acoustic processing system 2200, according to an embodiment of the present invention. Acoustic processing system 2200 includes a central processing unit (CPU) 2210 and an acoustic processing unit (APU) 2220. Running on CPU 2210 are an application 2212, a voice recognition engine 2214, and an API 2216. Voice recognition engine 2214 is a process that includes at least two threads: a search thread 2250 and a distance thread 2260. -
APU 2220 includes an acoustic model memory 2222, a first bus 2224, a memory buffer 2226, a second bus 2228, and a senone scoring unit 2230. Acoustic model memory 2222 can be configured to store a plurality of senones that together form one or more acoustic models. First bus 2224 is a wide bus that is configured to allow acoustic model memory 2222 to output an entire Gaussian probability distribution vector to memory buffer 2226. Senone scoring unit 2230 computes a senone score for a feature vector received from CPU 2210. Senone scoring unit 2230 can be implemented as described above. For example, senone scoring unit 2230 can be implemented as shown in FIG. 15. For more information on senone scoring unit 2230, see Section 4, above. -
Memory buffer 2226 can hold a Gaussian probability distribution vector until senone scoring unit 2230 is ready to compute a Gaussian distance score for it. That is, if senone scoring unit 2230 is scoring a feature vector received from CPU 2210 against a Gaussian probability distribution vector q, memory buffer 2226 can hold the next Gaussian probability distribution vector to be scored, i.e., vector q+1. - As shown in
FIG. 22, the inputs to APU 2220 include a reference to a specific senone (senone #) and the feature vector. The senone # input addresses the stored vector information corresponding to that particular senone in the acoustic model memory. The output of APU 2220 is the senone score, which represents the probability that the referenced senone emits the feature vector in a given time frame. In an embodiment, acoustic model memory 2222 utilizes a parallel read architecture and a very large internal bandwidth bus 2224. The number of bits read in parallel is greater than 256 (e.g., 768 bits wide, sufficient to load an entire Gaussian probability distribution vector at once). The values read from acoustic model memory 2222 are then latched into memory buffer 2226 using very large bandwidth bus 2224. Both the output from memory buffer 2226 and the observation vector information are input into senone scoring unit 2230, which performs the multiplications and additions required to compute the senone score. Bus 2228, over which memory buffer 2226 communicates with senone scoring unit 2230, is substantially similar to bus 2224. - As noted above, the senone score is computed by calculating the scores of the J Gaussian probability distribution vectors of dimension N, and by then summing them together to get the total score. Some scoring algorithms, however, use only the most significant Gaussians in the calculation to increase the speed of the computation. When utilizing algorithms based on a partial set of Gaussians, only those bits associated with the required Gaussians need to be transferred from the acoustic model memory to
senone scoring unit 2230. In other words, the largest number of contiguous bits in memory that will always be required by senone scoring unit 2230 is equal to the number of bits used to store a single Gaussian probability distribution vector. The bandwidth requirements of the memory bus, as well as the number of bits that need to be read in parallel, can be minimized by transferring only those bits comprising a single Gaussian probability distribution vector in each transfer. Put another way, by limiting each transfer to the bits of a single Gaussian probability distribution vector, the power requirements of APU 2220 can be reduced and the transfer rate of the necessary data to senone scoring unit 2230 can be increased, resulting in an improvement of the overall system performance. - As discussed above, acoustic modeling is one of the major bottlenecks in many types of speech recognition systems (e.g., keyword recognition or large vocabulary continuous speech recognition). Because of the large number of comparisons and calculations, high performance and/or parallel microprocessors are commonly used, and a high bandwidth bus between the memory storing the acoustic models and the processors is required. In the embodiment of
FIG. 22, acoustic model memory 2222 can be incorporated into APU 2220, which is integrated into a single die with senone scoring unit 2230, with both of them connected using wide, high bandwidth internal buses. - The number of bits per transfer can also be a function of the algorithms used for acoustic modeling. When scoring algorithms based on a partial set of Gaussians are used (e.g., Gaussian Selection), the number of bits per transfer can be equal to the size of the Gaussian used by the algorithm. A smaller number of bits per transfer requires multiple cycles to transfer the data comprising the Gaussian, while a greater number of bits per transfer is inefficient due to data non-locality.
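The summation over the J Gaussian probability distribution vectors of dimension N described above can be sketched as follows. The per-dimension weighted squared distance stands in for the full fixed-point score, and mixture weights and log-domain combination are omitted for brevity, so this is an assumption-laden simplification rather than the unit's exact arithmetic:

```python
def senone_score(feature, gaussians):
    """Sketch of the scoring described for APU 2220: each of the J
    Gaussian probability distribution vectors contributes a distance
    score over the N dimensions, and the per-Gaussian scores are
    summed into the senone score."""
    total = 0.0
    for means, variances in gaussians:       # J Gaussian vectors
        total += sum(v * (x - m) ** 2        # N-dimensional distance
                     for x, m, v in zip(feature, means, variances))
    return total
```

A Gaussian Selection variant would simply iterate over the chosen subset of `gaussians`, which is what motivates transferring only the bits of the required Gaussians.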
- In an embodiment, an architecture is used for acoustic modeling hardware accelerators in which the scoring algorithm used is at least partially based on a partial set of Gaussians (e.g., Gaussian Selection). This optimized architecture can result in a significant improvement in the overall system performance compared to other architectures.
-
FIG. 23 is a block diagram of an acoustic processing system 2300, according to an embodiment of the present invention. Acoustic processing system 2300 includes a processor 2310, a dedicated DRAM module 2302, a DRAM module 2304, and a non-volatile memory module 2306. Non-volatile memory module 2306 can be implemented as, e.g., an embedded FLASH memory block. Processor 2310 includes a CPU 2312, a hardware accelerator 2314, and a memory interface 2316. Hardware accelerator 2314 includes a senone scoring unit 2320. Senone scoring unit 2320 can be implemented as described above. For example, senone scoring unit 2320 can be implemented as shown in FIG. 15. -
dedicated DRAM module 2302 is dedicated to senone scoring unit 2320 to, for example, store senones. Thus, memory interface 2316 can couple senone scoring unit 2320 to dedicated DRAM 2302. -
FIG. 24 is a block diagram of a hardware accelerator 2400, according to an embodiment of the present invention. Hardware accelerator 2400 includes a processor 2402 and a dedicated DRAM module 2404. Processor 2402 includes a serial peripheral interface (SPI) bus interface module 2412, a senone scoring unit 2414, and a memory interface 2416. Senone scoring unit 2414 can be implemented as described above (e.g., as shown in FIG. 15). As shown in FIG. 24, dedicated DRAM module 2404 stores one or more acoustic models. In an alternate embodiment, DRAM module 2404 can instead be a non-volatile memory module, e.g., a FLASH memory module. In still another embodiment, DRAM module 2404 can instead be a memory module that includes a volatile memory module (e.g., DRAM) and a non-volatile memory module (e.g., FLASH). In such an embodiment, the acoustic model can initially be stored in the non-volatile memory module and can be copied to the volatile memory module for senone scoring. -
SPI interface module 2412 can provide an interface to an SPI bus, which, in turn, can couple hardware accelerator 2400 to a CPU. Memory interface 2416 couples senone scoring unit 2414 to dedicated DRAM module 2404. In an embodiment, a voice-recognition system can be implemented in a cloud-based solution in which the senone scoring and processing necessary for voice recognition is performed in a cloud-based voice-recognition application. - B. Software Stack
-
FIG. 25 is a block diagram illustrating an APU software stack 2500, according to an embodiment of the present invention. Software stack 2500 can be used to conceptually illustrate the communications between components in an acoustic processing system, e.g., acoustic processing system 2200 described with reference to FIG. 22. Stack 2500 includes an application 2502, a voice recognition engine 2504, an application programming interface (API) 2550, an SPI bus controller 2512, an SPI bus 2514, and an APU 2516. API 2550 includes a Generic DCA 2506, a low level driver (LLD) 2508, and a hardware abstraction layer (HAL) 2510. In an embodiment, application 2502, voice recognition engine 2504, API 2550, and APU 2516 can correspond to application 2212, voice recognition engine 2214, API 2216, and APU 2220 of FIG. 22, respectively. - In
software stack 2500, application 2502 communicates with voice recognition engine 2504, which, in turn, communicates with Generic DCA 2506. In an embodiment, voice recognition engine 2504 is coupled to Generic DCA 2506 via a DCA API. Generic DCA 2506 can be coupled to LLD 2508 via an LLD API. LLD 2508 can be coupled to HAL 2510 via an HAL API. HAL 2510 is communicatively coupled to SPI bus controller 2512, which is communicatively coupled to SPI bus 2514. APU 2516 is communicatively coupled to SPI bus 2514 and is communicatively coupled to HAL 2510 via bus controller 2512 and SPI bus 2514. - In an embodiment,
software stack 2500 provides a software interface between APU 2516 and application 2502 (e.g., an application that employs voice recognition). In particular, application 2502 and voice recognition engine 2504 can be "hardware agnostic." That is, application 2502 and voice recognition engine 2504 can complete their respective operations without detailed knowledge about how the distance, or senone, scoring is taking place. -
Generic DCA 2506, LLD layer 2508, and HAL layer 2510 include hardware-specific API calls. In an embodiment, the API calls of HAL 2510 depend on the type of controller to which it is connected. In an embodiment, the bus interface for APU 2516 can be a different bus and controller combination, requiring a different HAL (with different API calls). -
Generic DCA 2506 is a distance computation API. The DCA can be defined by a software developer. In an embodiment, the DCA API is specifically defined to support a voice recognition engine, such as voice recognition engine 2504. Also, Generic DCA 2506 can be implemented specifically for APU 2516. Moreover, LLD 2508 can be a functional abstraction of the senone scoring unit commands and can provide a one-to-one mapping to those commands. As shown in FIG. 25, low-level driver 2508 is coupled to HAL 2510. - The DCA API can include the following five functions: Create, Close, Set Feature, Compute Distance Score, and Fill Scores. In an embodiment, the Create function specifies which acoustic model is to be used. There can be one or more acoustic models stored in memory (e.g., one or more acoustic models for each language). For example, as discussed above with reference to
FIG. 22, dedicated acoustic model memory 2222 of the APU can store the acoustic model (e.g., one or more senone libraries). Moreover, given an acoustic model (e.g., a library of senones that stores the Gaussian distribution of the sound corresponding to the various senones) and a feature vector, the Create function can specify the number of dimensions in the feature vector. In an embodiment, for English the feature vector can have 39 dimensions. In another embodiment, for other languages, the feature vector can have another number of dimensions. More generally, the number of dimensions can vary depending on the specific spoken language selected for voice recognition processing. Thus, the Create function specifies the acoustic model selected, the number of dimensions, and the number of senones. The Close function ends delivery of feature vectors, audio sample portions, and senone scoring requests to the hardware accelerator (e.g., APU 2516). - In an embodiment, the Set Feature function is used to set the senone scoring requests into their respective frames by passing a specific frame ID, a pass ID, and the feature vector. As noted above, the input audio signal can be broken up into frames (e.g., by voice recognition engine 2504). An exemplary frame comprises spectral characteristics of a portion of the audio input signal. In an embodiment, a frame can be 12 milliseconds (ms) long. The Set Feature function can convert each frame into 39 dimensions (e.g., 39 8-bit values). The Set Feature function can specify a particular frame's ID and the associated feature vector.
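The framing step described for the Set Feature function can be sketched as follows. The 16 kHz sample rate is an assumption; the 12 ms frame length comes from the text, and per-frame extraction of the 39 feature dimensions would follow separately:

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=12):
    """Sketch of breaking an input audio signal into fixed-length
    frames before feature extraction. The sample rate is assumed."""
    n = sample_rate * frame_ms // 1000           # samples per frame
    return [samples[i:i + n]                     # one list per full frame
            for i in range(0, len(samples) - n + 1, n)]
```

At 16 kHz, a 12 ms frame spans 192 samples, so 400 input samples yield two complete frames (trailing partial samples are dropped in this sketch).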
- In an embodiment, the Distance Compute Score function calculates the senone score (e.g., Gaussian probability), which, as noted above, can be implemented as a distance calculation. This function can be used to begin and prepare the senone scoring. For example, the feature vector can be input into
APU 2516, and APU 2516 will score it against all the senones stored in the acoustic model, or at least a selected portion of the senones. These scores will then be given back to the upper layer. In an embodiment, the Distance Compute Score function can specify whether a portion of the acoustic model or the complete acoustic model will be used for the senone scoring. - In an embodiment, the Fill Scores function takes the senone scoring result and returns it to the upper software layers, including
application 2502 and voice recognition engine 2504. - In an embodiment,
voice recognition engine 2504 can be used for any form of pattern recognition, e.g., forms of pattern recognition that use a Hidden Markov Model. In another embodiment, another form of pattern recognition that uses Gaussian calculations can be supported. Examples of pattern recognition can include, but are not limited to, the above-described senone scoring for speech recognition, image processing, and handwriting recognition. - As noted above,
application 2502 and voice recognition engine 2504 are agnostic to any hardware used to determine the senone score. In an embodiment, a particular APU can be swapped out for different hardware without application 2502 and voice recognition engine 2504 knowing or being affected. When application 2502 and voice recognition engine 2504 are agnostic to the type of hardware used for the senone scoring, a first hardware accelerator can be replaced with a second hardware accelerator of a different design without requiring any redesign of application 2502 and voice recognition engine 2504. In other words, as discussed herein, while the APU Library calls are specific to the type and design of hardware accelerator used, the Generic DCA Library calls are not hardware specific. - In an embodiment, a software architecture, as illustrated in
FIG. 25, can be described in terms of the data and control flow through the software stack illustrated in FIG. 25. Application 2502 can be any application that uses the voice recognition engine. In an embodiment, voice recognition engine 2504 is the Vocon Engine provided by Nuance, Inc. In alternate embodiments, other speech recognition engines or pattern recognition engines that make use of a Gaussian Mixture Model (GMM) for probability estimation may be used. - In an embodiment,
APU 2516 computes senone scores using the Gaussian Mixture Model. APU 2516 can compute these scores much faster (e.g., by an order of magnitude) than an embedded processor (e.g., a Cortex-A8 embedded processor), making speech recognition more practical in on-board speech recognition systems that include APU 2516. Offloading the senone scoring (or distance computation) to APU 2516 not only improves the user experience (by reducing the computational latency) but also allows CPU 2210 to attend to other tasks in the system. The software architecture plays an important role in reducing the CPU load and the latency. - In an embodiment,
voice recognition engine 2504 is not directly aware of APU 2516. For example, voice recognition engine 2504 can use Generic DCA API 2506 to compute the distances (also referred to as senone scores). The specific implementation of the Generic DCA library discussed here has been designed specifically to use APU 2516, with a plurality of function calls to the APU discussed below. This differs from a fully software implementation of the Generic DCA library. This specific implementation translates the Generic DCA library calls to a sequence of APU library calls. The details of the implementation are described below. The definition and implementation of the APU library are specific to the current implementation of the APU and are also described below. - In an embodiment,
Generic DCA 2506 operates as an interface layer between voice recognition engine 2504 and APU 2516. For example, voice recognition engine 2504 can utilize generic API calls to the Generic DCA to request senone scoring. Generic DCA 2506 then utilizes an APU-specific library of API calls, described further below, to direct the APU hardware accelerator to perform the requested senone scoring. Because voice recognition engine 2504 is not aware of APU 2516, voice recognition engine 2504 can take advantage of the following benefits. For example, voice recognition engine 2504 may only need to know the message passing formats of APU 2516. Voice recognition engine 2504 also does not need to know the tasks to be performed by APU 2516. Moreover, there is a swap-out benefit. That is, APU 2516 can be replaced or redesigned without requiring any redesign of voice recognition engine 2504. Only the interface, in this embodiment Generic DCA 2506, needs to have the hardware-specific API calls to ensure the required interoperability between voice recognition engine 2504 and APU 2516. - In one exemplary embodiment, a Generic DCA Library comprises the following list of functions:
- Function name: distance_computation_create
-
- input parameters:
- acoustic model.
- number of dimensions in the feature vector.
- total number of senones in the acoustic model.
- description: stores these parameters as part of the state of distance computation.
- Function name: distance_computation_setfeature
-
- Input parameters:
- Frame Id
- feature vector
- Description: store the feature vector corresponding to the frame Id.
- Function name: distance_computation_computescores
- Input parameters:
-
- Frame Id
- List of Senones to score
- Description: specifies the senones to be scored for a given frame.
- Function name: distance_computation_fillscores
-
- Input parameters:
- Buffer containing the scores
- Description: store the senone scores in the buffer.
- Function name: distance_computation_setfeaturematrix
-
- Input parameters:
- pMatrix
- Description: stores the feature vector transformation matrix given by “pMatrix” in APU.
- The distance_computation_setfeaturematrix function is called between utterances to adapt the recognition to the specific speaker. The APU uses this matrix when computing the senone scores for the next utterance.
- In an embodiment, “distance_computation_computescores” and “distance_computation_fillscores” can be implemented such that the computational latency and the CPU load are minimized. For example, these functions can be implemented so as to achieve the concurrent operation embodied in
FIG. 26 . - In one exemplary embodiment, an APU Library supports the following functions:
- Function name: apu_set_acoustic_model
-
- Input parameters:
- Acoustic model
- Description: sets the acoustic model to be used for senone scoring.
- Function name: apu_load_feature_vector
-
- Input parameters:
- Feature vector
- Description: Loads the feature vector into the APU.
- Function name: apu_score_senone_chunk
-
- Input parameters:
- Senone list
- Description: Loads the senone list into the APU for scoring.
- Function name: apu_score_range
-
- Input parameters:
- Range of senones specified by the first and last index
- Description: Instructs APU to score all the senones in the range.
- Function name: apu_read_senone_scores
-
- Input parameters:
- Number of scores to read
- Destination buffer
- Description: Reads the scores and stores them in the destination buffer.
- Function name: apu_check_score_ready_status
-
- Input parameters:
- none
- Description: Checks if the scores are ready to be read from the APU.
- Function name: apu_read_score_length
-
- Input parameters:
- none
- Description: Reads the status register to find the number of score entries available.
- Function name: apu_read_status
-
- Input parameters:
- Register index
- Description: Reads the status register specified by register index.
- Function name: apu_read_configuration
-
- Input parameters:
- none
- Description: Reads the configuration register.
- Function name: apu_write_configuration
-
- Input parameters:
- Configuration data
- Description: Writes to the configuration register.
- In an embodiment, the APU can be used for scoring the senones for each frame of a given utterance. The acoustic model of choice is communicated to the APU at the beginning as part of the function distance_computation_create. The feature vector for a given frame is passed to the APU via the function distance_computation_setfeature. The senones to be scored for a given frame are passed to the APU via the function distance_computation_computescores. The actual scores computed by the APU can be passed back to the Voice Recognition Engine via the function distance_computation_fillscores.
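The per-utterance call sequence just described can be sketched with a hypothetical Python binding (here named `dca`) whose method names mirror the Generic DCA functions; the binding object itself is an assumption, and only the call order follows the text:

```python
def score_utterance(dca, model, frames, senones, dims=39):
    """Sketch of the per-utterance Generic DCA call sequence:
    create once, then set the feature, compute, and collect the
    scores for each frame. 'dca' is a hypothetical binding."""
    dca.distance_computation_create(model, dims, len(senones))
    all_scores = []
    for frame_id, feature in enumerate(frames):
        dca.distance_computation_setfeature(frame_id, feature)
        dca.distance_computation_computescores(frame_id, senones)
        buf = [0] * len(senones)            # caller-provided score buffer
        dca.distance_computation_fillscores(buf)
        all_scores.append(buf)
    return all_scores
```

In a real system the compute and fill steps for successive frames would be overlapped, as discussed in the concurrency section below.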
- The control flows from top to bottom of
stack 2500 illustrated in FIG. 25. All the functions are synchronous and complete before returning, except for the function distance_computation_computescores. As noted below, the scoring can be implemented as a separate thread to maximize the concurrency of the distance computation and the search, as described above. This thread yields the CPU to the rest of voice recognition engine 2214 whenever it is waiting for APU 2220 to complete the distance computation. This asynchronous computation is important to minimize the latency as well as the CPU load. - C. Concurrent Search and Distance Score Computation
- In one embodiment, a thread (e.g. an executable process) separate from a thread that is being executed by
application 2502 or voice recognition engine 2504 can be created for APU 2516. For there to be separate threads, there must be no dependency between them, i.e., no further action of one actor may depend on the actions of the other. Breaking any dependency between application 2502 and voice recognition engine 2504 on the one hand and APU 2516 on the other allows application 2502 and voice recognition engine 2504 to operate in parallel with APU 2516. In one exemplary embodiment, such a dependency can be avoided through the use of frames, e.g., lasting approximately 10-12 ms (although the invention is not limited to this embodiment). For example, while application 2502 is using the senone score for frame n, APU 2516 can be computing a senone score for frame n+1. - More specifically, a voice recognition operation requires two discrete operations: scoring and searching. As described above, the scoring operation involves a comparison between Gaussian probability distribution vectors of a senone and the feature vector corresponding to a specific frame. In an embodiment,
software stack 2500 can be configured such that these two operations occur in parallel. In particular, as shown in FIG. 22, voice recognition engine 2214 can include search thread 2250 and distance thread 2260. Distance thread 2260 can manage distance calculations completed on APU 2220, and search thread 2250 can use the results of the distance calculations to determine which sound was received (e.g., by searching a library of senone scores to determine the best match). By setting distance thread 2260 to a higher priority than search thread 2250, distance thread 2260 can perform the operations needed to start the scoring operation on APU 2220. Distance thread 2260 can then be put to sleep. While it is asleep, search thread 2250 can be activated and can search using the results of the last distance operation. Because the length of time needed to complete a distance computation is relatively predictable, the distance thread can be put to sleep for a predetermined amount of time. In alternative embodiments, distance thread 2260 can be put to sleep indefinitely and an interrupt from APU 2220 can instead be used to wake up distance thread 2260. In doing so, APU 2220 can be used to compute a distance score for a frame n+1 while CPU 2210 performs a searching operation using the previously calculated score for frame n. - For any given frame, the search can follow the distance computation as illustrated in
FIG. 26. In particular, the distance computation for frame (i+1) can be performed while the search for frame i is being conducted. Thus, as shown in FIG. 26, the distance computation performed by the APU can be performed concurrently with the search function performed by the CPU. In an embodiment, a call sequence to the DCA library is arranged to effect this operation. In a further embodiment, the Generic DCA is implemented so that the concurrency of the search computation and the distance computation is maximized. In an embodiment, an implementation of the Generic DCA library uses the API provided by the APU library. -
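A minimal sketch of this overlap, with plain Python threads standing in for distance thread 2260 (driving the APU) and the search work running on the main thread (as search thread 2250 would on the CPU); `compute_distance` and `search` are placeholder callables, not the real library functions:

```python
import queue
import threading

def pipeline(frames, compute_distance, search):
    """Sketch of the concurrency described above: the distance thread
    scores frame i+1 while the consumer searches using the score for
    frame i. A one-slot queue lets scoring run one frame ahead."""
    scores = queue.Queue(maxsize=1)

    def distance_thread():
        for f in frames:
            scores.put(compute_distance(f))   # blocks when one frame ahead
        scores.put(None)                      # end-of-utterance marker

    results = []
    t = threading.Thread(target=distance_thread)
    t.start()
    while (s := scores.get()) is not None:
        results.append(search(s))             # search frame i while frame
    t.join()                                  # i+1 is being scored
    return results
```

The bounded queue mirrors the single memory buffer in the hardware: scoring may run exactly one frame ahead of the search, which is the overlap FIG. 26 depicts.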
FIG. 27 is an illustration of an embodiment of a method 2700 for acoustic processing. The steps of method 2700 can be performed using, for example, acoustic processing system 2200, shown in FIG. 22, along with software stack 2500, shown in FIG. 25. - In
step 2702, the received audio signal is divided into frames. For example, in FIG. 22, voice recognition engine 2214 can divide a received audio signal into frames that are, for example, 10-12 ms in length. - In
step 2704, a search thread and a distance computation thread are created. For example, in FIG. 22, voice recognition engine 2214 can create search thread 2250 and distance thread 2260. - In
step 2706, a distance score is computed using an APU. For example, in FIG. 22, at the direction of distance thread 2260, senone scoring unit 2230 of APU 2220 can compute a distance score between a feature vector corresponding to a frame and a Gaussian probability distribution vector. - In
step 2708, a search operation is performed using the computed score for the frame. For example, in FIG. 22, search thread 2250 can use the distance score computed in step 2706 to search different senones to determine which sound was included in the frame. - In
step 2710, it is determined whether the frame was the last frame of the audio signal. If so, method 2700 ends. If not, method 2700 proceeds to step 2712. - In
step 2712, concurrently with the search operation of step 2708, a distance score for the next frame is computed using the APU. For example, in FIG. 22, APU 2220 can be used to compute a distance score for a frame i+1 concurrently with search thread 2250 performing a search operation using the distance score for frame i. - Various aspects of the present invention may be implemented in software, firmware, hardware, or a combination thereof.
FIG. 28 is an illustration of an example computer system 2800 in which embodiments of the present invention, or portions thereof, can be implemented as computer-readable code. For example, the method illustrated by flowchart 900 of FIG. 9, the method illustrated by flowchart 1700 of FIG. 17, the method illustrated by flowchart 2100 of FIG. 21, software stack 2500 illustrated in FIG. 25, and/or the method illustrated by flowchart 2700 of FIG. 27 can be implemented in system 2800. Various embodiments of the present invention are described in terms of this example computer system 2800. After reading this description, it will become apparent to a person skilled in the relevant art how to implement embodiments of the present invention using other computer systems and/or computer architectures. -
-
Computer system 2800 includes one or more processors, such as processor 2804. Processor 2804 may be a special purpose or a general-purpose processor such as, for example, the APU and CPU of FIG. 4, respectively. Processor 2804 is connected to a communication infrastructure 2806 (e.g., a bus or network). -
Computer system 2800 also includes a main memory 2808, preferably random access memory (RAM), and may also include a secondary memory 2810. Secondary memory 2810 can include, for example, a hard disk drive 2812, a removable storage drive 2814, and/or a memory stick. Removable storage drive 2814 can include a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, or the like. The removable storage drive 2814 reads from and/or writes to a removable storage unit 2818 in a well-known manner. Removable storage unit 2818 can comprise a floppy disk, magnetic tape, optical disk, etc., which is read by and written to by removable storage drive 2814. As will be appreciated by persons skilled in the relevant art, removable storage unit 2818 includes a computer-usable storage medium having stored therein computer software and/or data. - Computer system 2800 (optionally) includes a display interface 2802 (which can include input and output devices such as keyboards, mice, etc.) that forwards graphics, text, and other data from communication infrastructure 2806 (or from a frame buffer not shown) for display on
display unit 2830. - In alternative implementations,
secondary memory 2810 can include other similar devices for allowing computer programs or other instructions to be loaded into computer system 2800. Such devices can include, for example, a removable storage unit 2822 and an interface 2820. Examples of such devices can include a program cartridge and cartridge interface (such as those found in video game devices), a removable memory chip (e.g., EPROM or PROM) and associated socket, and other removable storage units 2822 and interfaces 2820 which allow software and data to be transferred from the removable storage unit 2822 to computer system 2800. -
Computer system 2800 can also include a communications interface 2824. Communications interface 2824 allows software and data to be transferred between computer system 2800 and external devices. Communications interface 2824 can include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, or the like. Software and data transferred via communications interface 2824 are in the form of signals, which may be electronic, electromagnetic, optical, or other signals capable of being received by communications interface 2824. These signals are provided to communications interface 2824 via a communications path 2826. Communications path 2826 carries signals and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, or other communications channels. - In this document, the terms “computer program medium” and “computer-usable medium” are used to generally refer to media such as
removable storage unit 2818, removable storage unit 2822, and a hard disk installed in hard disk drive 2812. Computer program medium and computer-usable medium can also refer to memories, such as main memory 2808 and secondary memory 2810, which can be memory semiconductors (e.g., DRAMs, etc.). These computer program products provide software to computer system 2800. - Computer programs (also called computer control logic) are stored in
main memory 2808 and/or secondary memory 2810. Computer programs may also be received via communications interface 2824. Such computer programs, when executed, enable computer system 2800 to implement embodiments of the present invention as discussed herein. In particular, the computer programs, when executed, enable processor 2804 to implement processes of embodiments of the present invention, such as the steps in the methods illustrated by flowchart 900 of FIG. 9, flowchart 1700 of FIG. 17, flowchart 2100 of FIG. 21, and flowchart 2700 of FIG. 27, and/or the functions in software stack 2500 illustrated in FIG. 25, discussed above. Accordingly, such computer programs represent controllers of the computer system 2800. Where embodiments of the present invention are implemented using software, the software can be stored in a computer program product and loaded into computer system 2800 using removable storage drive 2814, interface 2820, hard drive 2812, or communications interface 2824. - Embodiments of the present invention are also directed to computer program products including software stored on any computer-usable medium. Such software, when executed in one or more data processing devices, causes the data processing device(s) to operate as described herein. Embodiments of the present invention employ any computer-usable or -readable medium, known now or in the future. Examples of computer-usable mediums include, but are not limited to, primary storage devices (e.g., any type of random access memory), secondary storage devices (e.g., hard drives, floppy disks, CD ROMS, ZIP disks, tapes, magnetic storage devices, optical storage devices, MEMS, nanotechnological storage devices, etc.), and communication mediums (e.g., wired and wireless communications networks, local area networks, wide area networks, intranets, etc.).
- It is to be appreciated that the Detailed Description section, and not the Summary and Abstract sections, is intended to be used to interpret the claims. The Summary and Abstract sections may set forth one or more but not all exemplary embodiments of the present invention as contemplated by the inventors, and thus, are not intended to limit the present invention and the appended claims in any way.
- Embodiments of the present invention have been described above with the aid of functional building blocks illustrating the implementation of specified functions and relationships thereof. The boundaries of these functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternate boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed.
- The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying knowledge within the skill of the relevant art, readily modify and/or adapt for various applications such specific embodiments, without undue experimentation, without departing from the general concept of the present invention. Therefore, such adaptations and modifications are intended to be within the meaning and range of equivalents of the disclosed embodiments, based on the teaching and guidance presented herein. It is to be understood that the phraseology or terminology herein is for the purpose of description and not of limitation, such that the terminology or phraseology of the present specification is to be interpreted by the skilled artisan in light of the teachings and guidance.
- The breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
Claims (20)
Priority Applications (11)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/489,799 US20130158996A1 (en) | 2011-12-19 | 2012-06-06 | Acoustic Processing Unit |
US13/669,907 US8996374B2 (en) | 2012-06-06 | 2012-11-06 | Senone scoring for multiple input streams |
US13/669,926 US9009049B2 (en) | 2012-06-06 | 2012-11-06 | Recognition of speech with different accents |
JP2014547494A JP2015505993A (en) | 2011-12-19 | 2012-12-14 | Sound processing unit |
CN201280070070.3A CN104126200A (en) | 2011-12-19 | 2012-12-14 | Acoustic processing unit |
KR1020147020293A KR20140106723A (en) | 2011-12-19 | 2012-12-14 | acoustic processing unit |
EP12859602.0A EP2795614A4 (en) | 2011-12-19 | 2012-12-14 | Acoustic processing unit |
PCT/US2012/069787 WO2013096124A1 (en) | 2011-12-19 | 2012-12-14 | Acoustic processing unit |
US13/725,260 US9514739B2 (en) | 2012-06-06 | 2012-12-21 | Phoneme score accelerator |
US13/725,173 US9230548B2 (en) | 2012-06-06 | 2012-12-21 | Hybrid hashing scheme for active HMMS |
US13/725,224 US9224384B2 (en) | 2012-06-06 | 2012-12-21 | Histogram based pre-pruning scheme for active HMMS |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201161577595P | 2011-12-19 | 2011-12-19 | |
US201261589113P | 2012-01-20 | 2012-01-20 | |
US13/489,799 US20130158996A1 (en) | 2011-12-19 | 2012-06-06 | Acoustic Processing Unit |
Publications (1)
Publication Number | Publication Date |
---|---|
US20130158996A1 true US20130158996A1 (en) | 2013-06-20 |
Family
ID=48611061
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/489,799 Abandoned US20130158996A1 (en) | 2011-12-19 | 2012-06-06 | Acoustic Processing Unit |
US13/490,129 Active 2033-03-12 US8924453B2 (en) | 2011-12-19 | 2012-06-06 | Arithmetic logic unit architecture |
US13/490,124 Active 2035-07-24 US9785613B2 (en) | 2011-12-19 | 2012-06-06 | Acoustic processing unit interface for determining senone scores using a greater clock frequency than that corresponding to received audio |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/490,129 Active 2033-03-12 US8924453B2 (en) | 2011-12-19 | 2012-06-06 | Arithmetic logic unit architecture |
US13/490,124 Active 2035-07-24 US9785613B2 (en) | 2011-12-19 | 2012-06-06 | Acoustic processing unit interface for determining senone scores using a greater clock frequency than that corresponding to received audio |
Country Status (6)
Country | Link |
---|---|
US (3) | US20130158996A1 (en) |
EP (3) | EP2795614A4 (en) |
JP (3) | JP2015505993A (en) |
KR (3) | KR20140106723A (en) |
CN (3) | CN104126200A (en) |
WO (3) | WO2013096124A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140180694A1 (en) * | 2012-06-06 | 2014-06-26 | Spansion Llc | Phoneme Score Accelerator |
US20140309754A1 (en) * | 2013-04-10 | 2014-10-16 | Robert Bosch Gmbh | Method and device for creating a data-based function model |
US20140309973A1 (en) * | 2013-04-10 | 2014-10-16 | Robert Bosch Gmbh | Method and control for calculating a data-based function model |
WO2016191031A1 (en) * | 2015-05-27 | 2016-12-01 | Intel Corporation | Gaussian mixture model accelerator with direct memory access engines corresponding to individual data streams |
EP3783606A1 (en) * | 2019-08-20 | 2021-02-24 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10007724B2 (en) * | 2012-06-29 | 2018-06-26 | International Business Machines Corporation | Creating, rendering and interacting with a multi-faceted audio cloud |
JP6052814B2 (en) * | 2014-09-24 | 2016-12-27 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Speech recognition model construction method, speech recognition method, computer system, speech recognition apparatus, program, and recording medium |
KR102299330B1 (en) * | 2014-11-26 | 2021-09-08 | 삼성전자주식회사 | Method for voice recognition and an electronic device thereof |
CN105869641A (en) * | 2015-01-22 | 2016-08-17 | 佳能株式会社 | Speech recognition device and speech recognition method |
PH12018050262A1 (en) * | 2017-07-21 | 2019-06-17 | Accenture Global Solutions Ltd | Automatic provisioning of a software development environment |
US11043218B1 (en) * | 2019-06-26 | 2021-06-22 | Amazon Technologies, Inc. | Wakeword and acoustic event detection |
CN112307986B (en) * | 2020-11-03 | 2022-02-08 | 华北电力大学 | Load switch event detection method and system by utilizing Gaussian gradient |
CN113703579B (en) * | 2021-08-31 | 2023-05-30 | 北京字跳网络技术有限公司 | Data processing method, device, electronic device and storage medium |
KR20240173495A (en) * | 2023-06-05 | 2024-12-12 | 한양대학교 산학협력단 | A method and apparatus for performing speech recognition using artificial intelligence |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5864810A (en) * | 1995-01-20 | 1999-01-26 | Sri International | Method and apparatus for speech recognition adapted to an individual speaker |
US5937384A (en) * | 1996-05-01 | 1999-08-10 | Microsoft Corporation | Method and system for speech recognition using continuous density hidden Markov models |
US6542866B1 (en) * | 1999-09-22 | 2003-04-01 | Microsoft Corporation | Speech recognition method and apparatus utilizing multiple feature streams |
US20030093269A1 (en) * | 2001-11-15 | 2003-05-15 | Hagai Attias | Method and apparatus for denoising and deverberation using variational inference and strong speech models |
US20030182121A1 (en) * | 2002-03-20 | 2003-09-25 | Hwang Mei Yuh | Generating a task-adapted acoustic model from one or more different corpora |
US20030182120A1 (en) * | 2002-03-20 | 2003-09-25 | Mei Yuh Hwang | Generating a task-adapted acoustic model from one or more supervised and/or unsupervised corpora |
US20040260548A1 (en) * | 2003-06-20 | 2004-12-23 | Hagai Attias | Variational inference and learning for segmental switching state space models of hidden speech dynamics |
US20050180547A1 (en) * | 2004-02-12 | 2005-08-18 | Microsoft Corporation | Automatic identification of telephone callers based on voice characteristics |
US20060287856A1 (en) * | 2005-06-17 | 2006-12-21 | Microsoft Corporation | Speech models generated using competitive training, asymmetric training, and data boosting |
US20080221896A1 (en) * | 2007-03-09 | 2008-09-11 | Microsoft Corporation | Grammar confusability metric for speech recognition |
US20120072215A1 (en) * | 2010-09-21 | 2012-03-22 | Microsoft Corporation | Full-sequence training of deep structures for speech recognition |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH01298400A (en) * | 1988-05-26 | 1989-12-01 | Ricoh Co Ltd | Continuous speech recognition device |
JPH04232998A (en) * | 1990-12-27 | 1992-08-21 | Nec Corp | Speech recognition device |
CN1112269A (en) * | 1994-05-20 | 1995-11-22 | 北京超凡电子科技有限公司 | HMM speech recognition technique based on Chinese pronunciation characteristics |
US5604839A (en) | 1994-07-29 | 1997-02-18 | Microsoft Corporation | Method and system for improving speech recognition through front-end normalization of feature vectors |
US5710866A (en) * | 1995-05-26 | 1998-01-20 | Microsoft Corporation | System and method for speech recognition using dynamically adjusted confidence measure |
CN1061451C (en) * | 1996-09-26 | 2001-01-31 | 财团法人工业技术研究院 | Chinese Word Sound Recognition Method Based on Hidden Markov Model |
US7295978B1 (en) * | 2000-09-05 | 2007-11-13 | Verizon Corporate Services Group Inc. | Systems and methods for using one-dimensional gaussian distributions to model speech |
JP3932789B2 (en) * | 2000-09-20 | 2007-06-20 | セイコーエプソン株式会社 | HMM output probability calculation method and speech recognition apparatus |
WO2002029617A1 (en) * | 2000-09-30 | 2002-04-11 | Intel Corporation (A Corporation Of Delaware) | Method, apparatus, and system for building a compact model for large vocabulary continuous speech recognition (lvcsr) system |
CA2359544A1 (en) | 2001-10-22 | 2003-04-22 | Dspfactory Ltd. | Low-resource real-time speech recognition system using an oversampled filterbank |
US20030097263A1 (en) * | 2001-11-16 | 2003-05-22 | Lee Hang Shun | Decision tree based speech recognition |
US7571097B2 (en) * | 2003-03-13 | 2009-08-04 | Microsoft Corporation | Method for training of subspace coded gaussian models |
US7480615B2 (en) * | 2004-01-20 | 2009-01-20 | Microsoft Corporation | Method of speech recognition using multimodal variational inference with switching state space models |
US20060058999A1 (en) * | 2004-09-10 | 2006-03-16 | Simon Barker | Voice model adaptation |
GB0420464D0 (en) | 2004-09-14 | 2004-10-20 | Zentian Ltd | A speech recognition circuit and method |
US7930180B2 (en) * | 2005-01-17 | 2011-04-19 | Nec Corporation | Speech recognition system, method and program that generates a recognition result in parallel with a distance value |
KR100664960B1 (en) * | 2005-10-06 | 2007-01-04 | 삼성전자주식회사 | Speech recognition device and method |
KR100764247B1 (en) | 2005-12-28 | 2007-10-08 | 고려대학교 산학협력단 | Apparatus and Method for speech recognition with two-step search |
EP1840822A1 (en) * | 2006-03-29 | 2007-10-03 | Sony Deutschland Gmbh | Method for deriving noise statistical properties of a signal |
US7774202B2 (en) | 2006-06-12 | 2010-08-10 | Lockheed Martin Corporation | Speech activated control system and related methods |
KR100974871B1 (en) * | 2008-06-24 | 2010-08-11 | 연세대학교 산학협력단 | Feature vector selection method and device, and music genre classification method and device using same |
JP5714495B2 (en) | 2008-10-10 | 2015-05-07 | スパンション エルエルシー | Analysis system and data pattern analysis method |
US8818802B2 (en) | 2008-10-10 | 2014-08-26 | Spansion Llc | Real-time data pattern analysis system and method of operation thereof |
JP5609182B2 (en) * | 2010-03-16 | 2014-10-22 | 日本電気株式会社 | Speech recognition apparatus, speech recognition method, and speech recognition program |
-
2012
- 2012-06-06 US US13/489,799 patent/US20130158996A1/en not_active Abandoned
- 2012-06-06 US US13/490,129 patent/US8924453B2/en active Active
- 2012-06-06 US US13/490,124 patent/US9785613B2/en active Active
- 2012-12-14 KR KR1020147020293A patent/KR20140106723A/en not_active Application Discontinuation
- 2012-12-14 JP JP2014547494A patent/JP2015505993A/en active Pending
- 2012-12-14 EP EP12859602.0A patent/EP2795614A4/en not_active Ceased
- 2012-12-14 CN CN201280070070.3A patent/CN104126200A/en active Pending
- 2012-12-14 WO PCT/US2012/069787 patent/WO2013096124A1/en active Application Filing
- 2012-12-18 KR KR1020147020295A patent/KR102048893B1/en active IP Right Grant
- 2012-12-18 JP JP2014547557A patent/JP6138148B2/en active Active
- 2012-12-18 EP EP12859642.6A patent/EP2795461A4/en not_active Withdrawn
- 2012-12-18 WO PCT/US2012/070332 patent/WO2013096303A1/en active Application Filing
- 2012-12-18 JP JP2014547556A patent/JP2015501011A/en active Pending
- 2012-12-18 EP EP12860893.2A patent/EP2795615A4/en not_active Withdrawn
- 2012-12-18 CN CN201280070112.3A patent/CN104126165A/en active Pending
- 2012-12-18 WO PCT/US2012/070329 patent/WO2013096301A1/en active Application Filing
- 2012-12-18 KR KR1020147020294A patent/KR20140106724A/en not_active Application Discontinuation
- 2012-12-18 CN CN201280070114.2A patent/CN104137178B/en active Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5864810A (en) * | 1995-01-20 | 1999-01-26 | Sri International | Method and apparatus for speech recognition adapted to an individual speaker |
US5937384A (en) * | 1996-05-01 | 1999-08-10 | Microsoft Corporation | Method and system for speech recognition using continuous density hidden Markov models |
US6542866B1 (en) * | 1999-09-22 | 2003-04-01 | Microsoft Corporation | Speech recognition method and apparatus utilizing multiple feature streams |
US20030093269A1 (en) * | 2001-11-15 | 2003-05-15 | Hagai Attias | Method and apparatus for denoising and deverberation using variational inference and strong speech models |
US20060036444A1 (en) * | 2002-03-20 | 2006-02-16 | Microsoft Corporation | Generating a task-adapted acoustic model from one or more different corpora |
US20030182121A1 (en) * | 2002-03-20 | 2003-09-25 | Hwang Mei Yuh | Generating a task-adapted acoustic model from one or more different corpora |
US20030182120A1 (en) * | 2002-03-20 | 2003-09-25 | Mei Yuh Hwang | Generating a task-adapted acoustic model from one or more supervised and/or unsupervised corpora |
US20040260548A1 (en) * | 2003-06-20 | 2004-12-23 | Hagai Attias | Variational inference and learning for segmental switching state space models of hidden speech dynamics |
US20050180547A1 (en) * | 2004-02-12 | 2005-08-18 | Microsoft Corporation | Automatic identification of telephone callers based on voice characteristics |
US20060287856A1 (en) * | 2005-06-17 | 2006-12-21 | Microsoft Corporation | Speech models generated using competitive training, asymmetric training, and data boosting |
US20100161330A1 (en) * | 2005-06-17 | 2010-06-24 | Microsoft Corporation | Speech models generated using competitive training, asymmetric training, and data boosting |
US20080221896A1 (en) * | 2007-03-09 | 2008-09-11 | Microsoft Corporation | Grammar confusability metric for speech recognition |
US20120072215A1 (en) * | 2010-09-21 | 2012-03-22 | Microsoft Corporation | Full-sequence training of deep structures for speech recognition |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140180694A1 (en) * | 2012-06-06 | 2014-06-26 | Spansion Llc | Phoneme Score Accelerator |
US9514739B2 (en) * | 2012-06-06 | 2016-12-06 | Cypress Semiconductor Corporation | Phoneme score accelerator |
US20140309754A1 (en) * | 2013-04-10 | 2014-10-16 | Robert Bosch Gmbh | Method and device for creating a data-based function model |
US20140309973A1 (en) * | 2013-04-10 | 2014-10-16 | Robert Bosch Gmbh | Method and control for calculating a data-based function model |
US9709967B2 (en) * | 2013-04-10 | 2017-07-18 | Robert Bosch Gmbh | Method and device for creating a data-based function model |
US9977842B2 (en) * | 2013-04-10 | 2018-05-22 | Robert Bosch Gmbh | Method and control for calculating a data-based function model |
WO2016191031A1 (en) * | 2015-05-27 | 2016-12-01 | Intel Corporation | Gaussian mixture model accelerator with direct memory access engines corresponding to individual data streams |
EP3783606A1 (en) * | 2019-08-20 | 2021-02-24 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device |
US11545149B2 (en) | 2019-08-20 | 2023-01-03 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device |
EP4220633A3 (en) * | 2019-08-20 | 2023-08-23 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device |
US11967325B2 (en) | 2019-08-20 | 2024-04-23 | Samsung Electronics Co., Ltd. | Electronic device and method for controlling the electronic device |
Also Published As
Publication number | Publication date |
---|---|
JP2015505993A (en) | 2015-02-26 |
US8924453B2 (en) | 2014-12-30 |
WO2013096301A1 (en) | 2013-06-27 |
EP2795615A4 (en) | 2016-01-13 |
KR20140106723A (en) | 2014-09-03 |
WO2013096124A1 (en) | 2013-06-27 |
JP6138148B2 (en) | 2017-05-31 |
US20130159371A1 (en) | 2013-06-20 |
US20130158997A1 (en) | 2013-06-20 |
CN104137178B (en) | 2018-01-19 |
EP2795615A1 (en) | 2014-10-29 |
EP2795614A1 (en) | 2014-10-29 |
KR102048893B1 (en) | 2019-11-26 |
JP2015501011A (en) | 2015-01-08 |
EP2795461A4 (en) | 2015-08-12 |
CN104126165A (en) | 2014-10-29 |
EP2795614A4 (en) | 2015-07-22 |
WO2013096303A1 (en) | 2013-06-27 |
KR20140107537A (en) | 2014-09-04 |
US9785613B2 (en) | 2017-10-10 |
CN104137178A (en) | 2014-11-05 |
JP2015501012A (en) | 2015-01-08 |
KR20140106724A (en) | 2014-09-03 |
CN104126200A (en) | 2014-10-29 |
EP2795461A1 (en) | 2014-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8924453B2 (en) | Arithmetic logic unit architecture | |
US20160293161A1 (en) | System and Method for Combining Geographic Metadata in Automatic Speech Recognition Language and Acoustic Models | |
Lin et al. | A 1000-word vocabulary, speaker-independent, continuous live-mode speech recognizer implemented in a single FPGA | |
US8818802B2 (en) | Real-time data pattern analysis system and method of operation thereof | |
US20140244248A1 (en) | Conversion of non-back-off language models for efficient speech decoding | |
Nedevschi et al. | Hardware speech recognition for user interfaces in low cost, low power devices | |
JP2023529801A (en) | Attention Neural Network with Sparse Attention Mechanism | |
You et al. | Parallel scalability in speech recognition | |
US9230548B2 (en) | Hybrid hashing scheme for active HMMS | |
US9514739B2 (en) | Phoneme score accelerator | |
Lin et al. | A multi-FPGA 10x-real-time high-speed search engine for a 5000-word vocabulary speech recognizer | |
Choi et al. | An FPGA implementation of speech recognition with weighted finite state transducers | |
Price | Energy-scalable speech recognition circuits | |
Buthpitiya et al. | A parallel implementation of viterbi training for acoustic models using graphics processing units | |
Yoshizawa et al. | Scalable architecture for word HMM-based speech recognition | |
Cheng et al. | Speech recognition system for embedded real-time applications | |
Tambe | Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning | |
Noguchi et al. | VLSI architecture of GMM processing and Viterbi decoder for 60,000-word real-time continuous speech recognition | |
US8996374B2 (en) | Senone scoring for multiple input streams | |
Kim et al. | Multi-user real-time speech recognition with a GPU | |
Choi et al. | FPGA-based implementation of a real-time 5000-word continuous speech recognizer | |
Pazhayaveetil | Hardware implementation of a low power speech recognition system | |
Pinto et al. | Exploiting beam search confidence for energy-efficient speech recognition | |
Kent et al. | Contextual partitioning for speech recognition | |
Bhagavatheeswaran | Hardware Accelerator for HMM Based Speech Recognition using Approximate Computing Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: SPANSION LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FASTOW, RICHARD;OLSON, JENS;REEL/FRAME:028328/0719 Effective date: 20120605 |
|
AS | Assignment |
Owner name: SPANSION LLC, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FASTOW, RICHARD;OLSON, JENS;LOHANI, SUMIT;SIGNING DATES FROM 20130611 TO 20130620;REEL/FRAME:030725/0303 |
|
AS | Assignment |
Owner name: MORGAN STANLEY SENIOR FUNDING, INC., NEW YORK Free format text: SECURITY INTEREST;ASSIGNORS:CYPRESS SEMICONDUCTOR CORPORATION;SPANSION LLC;REEL/FRAME:035240/0429 Effective date: 20150312 |
|
AS | Assignment |
Owner name: CYPRESS SEMICONDUCTOR CORPORATION, CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:SPANSION LLC;REEL/FRAME:035860/0001 Effective date: 20150601 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: MORGAN STANLEY SENIOR FUNDING, INC., NEW YORK Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE 8647899 PREVIOUSLY RECORDED ON REEL 035240 FRAME 0429. ASSIGNOR(S) HEREBY CONFIRMS THE SECURITY INTERST;ASSIGNORS:CYPRESS SEMICONDUCTOR CORPORATION;SPANSION LLC;REEL/FRAME:058002/0470 Effective date: 20150312 |