
LU507134B1 - Intelligent voice recognition method and system for ar helmets - Google Patents


Info

Publication number
LU507134B1
Authority
LU
Luxembourg
Prior art keywords
voice
user
feedback
model
voice input
Prior art date
Application number
LU507134A
Other languages
French (fr)
Inventor
Keyan Liu
Jiuxing Liao
Gang Yu
Ming Chen
Yuan Wang
Ziguo Jiang
Original Assignee
Shuozhou Pinglu Distr Tianrui Wind Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuozhou Pinglu Distr Tianrui Wind Power Co Ltd filed Critical Shuozhou Pinglu Distr Tianrui Wind Power Co Ltd
Application granted granted Critical
Publication of LU507134B1 publication Critical patent/LU507134B1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/183Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/20Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L2015/088Word spotting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • User Interface Of Digital Computer (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

This invention provides an intelligent voice recognition method and system for AR helmets. It relates to the field of voice recognition technology. The method includes: monitoring the user’s voice input through pre-deployed audio sensors on the AR helmet; capturing the voice wake word based on the voice wake model, thereby obtaining the user’s voice input command; preprocessing the voice input command to extract voice features, and matching these features against a voice recognition model to analyze and ascertain the user's voice intent and determine the feedback command; controlling the AR helmet to perform the feedback operation and display related outputs to the user. This not only enhances the accuracy and efficiency of voice recognition but also facilitates intelligent interpretation of user intent by the AR helmet.

Description

INTELLIGENT VOICE RECOGNITION METHOD AND SYSTEM FOR AR
HELMETS
Technical Field
This invention relates to the field of voice recognition technology, specifically relating to an intelligent voice recognition method and system for AR helmets.
Background Technology
With the rapid development of science and technology, voice recognition technology has also advanced significantly. However, voice recognition is susceptible to environmental noise and interference, leading to low recognition accuracy.
Additionally, the user's speech rate and volume can affect the recognition process, resulting in inefficiencies and errors in voice recognition. Moreover, situations may arise where a user is conversing with others or talking to themselves, and in such cases, the AR helmet needs to identify and disregard these non-interactive voice inputs.
Therefore, this invention provides an intelligent voice recognition method and system for AR helmets.
Content of the Invention
This invention offers an intelligent voice recognition method for AR helmets, utilizing pre-deployed audio sensors on the AR helmet to monitor user voice inputs; capturing voice wake words using a voice wake model to obtain user voice input commands; preprocessing these voice input commands to extract voice features and performing matching analysis to determine the user's voice intent and the corresponding feedback commands; controlling the AR helmet to execute feedback operations and display related outputs to the user. This method enhances both the accuracy and efficiency of voice recognition and enables intelligent interpretation of user intent by the AR helmet.
The present invention provides an intelligent voice recognition method for AR helmets, including:
Step 1: Monitor the user's voice input using the pre-deployed audio sensors of the AR helmet;
Step 2: Capture the voice wake word from the voice input based on the voice wake model, and obtain the user’s voice input command;
Step 3: Preprocess the user’s voice input command to extract voice features, and perform matching analysis of these features using a voice recognition model to capture the user's voice intent and determine the feedback command;
Step 4: Control the AR helmet to perform the feedback operation according to the feedback command and display related outputs to the user.
Preferably, monitoring of the user's voice input using the pre-deployed audio sensors of the AR helmet includes:
Pre-setting the voice collection area and pre-deploying multiple directional audio sensors within the AR helmet device;
When it is detected that the user is using the AR helmet, automatically adjust the reception volume of the directional audio sensors to capture continuous sound signals from each sensor in the voice collection area, and combine the captured signals to be recorded as the user’s voice input.
Preferably, prior to capturing the voice wake word based on the voice wake model, it further includes: performing initial recognition processing of the user’s voice input, including:
Extracting the pitch lines of the input, obtaining the start and end points of each pitch line, dividing the user's voice input based on these points, and inspecting the divided segments to discard segments without human voice input;
Using a pre-deployed adaptive filter to apply nonlinear processing to the voice input after discard processing to eliminate the speaker audio of the AR helmet used for feedback information.
Preferably, capturing the voice wake word from the voice input based on the voice wake model, and obtaining the user's voice input command includes:
Capturing the initial segments of the user's voice input using a pre-built voice wake model, and upon detecting the presence of a voice wake word, acquiring the complete voice input data including the voice wake word, recorded as the user’s voice input command.
Preferably, preprocessing the user's voice input command to extract voice features includes:
Framing and windowing the user's voice input command for preprocessing, and initially removing noise from the voice input command;
Performing endpoint detection on each frame of the voice input command after initial noise removal;
Applying noise reduction to frames detected to contain human voice;
Extracting voice features from each noise-reduced frame, then normalizing these features to capture the voice characteristics of the user's voice input command.
Preferably, matching analysis of extracted voice features using a voice recognition model to ascertain user voice intent and determine feedback commands, including:
Randomly obtaining N first standard signals from a preset voice library and acquiring each of these first standard signals’ weighted signal-to-noise ratio with a randomly added noise signal;
Based on the ratio of preset SNR to weighted SNR, amplifying the randomly added noise signal, and synthesizing the corresponding first standard signal with the amplified noise signal to construct training voice data for an acoustic model;
Inputting N1 training voice data sequentially into the acoustic model, statistically analyzing each training voice data under the model's transition state structure to determine the number of transition states and parameters;
Comparing the first transition count, determined by the phoneme count of each training voice data, with the corresponding number of transition states, and obtaining a parameter set that matches the random added noise signal associated with the respective training voice data;
Assigning a minimum value to the corresponding transition state parameters when there is a zero value and the comparison result is outside the preset range;
Assigning the highest frequency value to the corresponding transition state parameters when there is a zero value and the comparison result is within the preset range;
Maintaining the values of the counted transition state parameters unchanged if there are no zero values and the comparison result is within the preset range;
Otherwise, adding a new noise signal to the corresponding first standard signal;
Updating the acoustic model based on the final values of the corresponding transition state parameters, obtaining the second intrinsic state parameters of the updated model and the first intrinsic state parameters of the model before the update;
Deeming the updated model as the optimal acoustic model if the differences between the first and second intrinsic state parameters meet the preset convergence conditions;
Randomly selecting a training voice data input into the optimal acoustic model, obtaining the corresponding first text sequence, and performing redundancy preprocessing to generate a training text sequence;
Converting the user's voice input command into a corresponding second text sequence and performing redundancy processing to generate the current text sequence;
Inputting both the training and current text sequences into a language model, calculating the analysis capability coefficient of the language model:

P = \frac{\sum_{i1=1}^{c1} f0_{i1}\,(1 + \beta1_{i1}/z1)\,\frac{T1a}{G1}\,\frac{z11 + z12}{z01} + \sum_{i2=1}^{c2} f1_{i2}\,(1 + \beta2_{i2}/z1)\,\frac{T2a}{G2}\,\frac{z11 + z12}{z02}}{z01 + z02}
Wherein, P represents the analysis capability coefficient of the language model; c1 represents the number of times the language model analyzes the training text sequence; c2 represents the number of times the language model analyzes the current text sequence; f0_{i1} represents the number of sequence combinations under the i1-th analysis of the training text sequence; β1_{i1} represents the computational power for the i1-th analysis of the training text sequence; z01 represents the total number of sequences in the training text sequence; z02 represents the total number of sequences in the current text sequence; T1a represents the analysis time by the language model for the training text sequence; G1 represents the estimated analysis time for the training text sequence; f1_{i2} represents the number of sequence combinations under the i2-th analysis of the current text sequence; β2_{i2} represents the computational power for the i2-th analysis of the current text sequence; T2a represents the analysis time by the language model for the current text sequence; G2 represents the estimated analysis time for the current text sequence; z1 represents the number of times the language model is trained simultaneously on the current and training text sequences; z11 represents the number of times the language model analyzes key sequences in the training text sequence; z12 represents the number of times the language model analyzes key sequences in the current text sequence.
When the analysis capability coefficient exceeds a preset capability threshold, the predictive ability of the language model is considered satisfactory;
Combining the optimal acoustic model with the satisfactory language model for decoding, yielding the best output sequence for the voice input command, thereby determining user intent, where the post-decoding model is the voice recognition model;
Determining the current scenario type based on historical voice commands, and if the user's intent does not belong to the intent set of the current scenario, the input is deemed invalid, and the AR helmet does not generate a corresponding feedback command;
If the user's intent belongs to the intent set of the current scenario, gathering each historical voice command's speech rate, volume, and speaking style to construct the current voice characteristics, and if the voice input features of the user's intent match the current voice characteristics, the input is considered valid, and the AR helmet's feedback command is determined according to a user intent-feedback command mapping;
If the voice input features of the user's intent do not match the current voice characteristics, the input is deemed invalid, and no corresponding feedback command is generated;
When multiple user intents exist, determining the execution order of feedback commands based on the valid input order of user intents.
Preferably, controlling the AR helmet to perform feedback operations based on feedback commands and displaying related outputs to the user, including:
Determining the AR helmet's feedback outputs to the user according to a feedback command-output mapping;
Locking the speaker audio used for feedback operations, and when it is detected that the user has ceased voice input and the speaker audio volume is not below a preset feedback volume, controlling the locked speaker to display the relevant output to the user;
Otherwise, visualizing the feedback output according to the user's scenario type.
Preferably, an intelligent voice recognition system for AR helmets, including:
A voice monitoring module: Monitoring the user's voice input using the AR helmet's pre-deployed audio sensors;
A voice collection module: Capturing the voice wake word from the voice input based on the voice wake model, obtaining the user's voice input command;
A voice recognition module: Preprocessing the user's voice input command to extract voice features, and performing matching analysis with a voice recognition model to ascertain the user's voice intent and determine the feedback command;
A voice feedback module: Controlling the AR helmet to perform feedback operations based on the feedback command and displaying related outputs to the user.
Other features and advantages of the present invention will be set forth in the subsequent specification and, in part, will become apparent from the specification or will be understood by carrying out the invention. The objects and other advantages of the present invention may be realized and obtained by means of the structure particularly indicated in the specification as written and in the accompanying drawings.
The technical embodiments of the present invention are described in further detail below by means of the accompanying drawings and embodiments.
Description of the Drawings
The accompanying drawings are used to provide a further understanding of the present invention and form part of the specification and are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation of the invention. In the accompanying drawings:
FIG.1: a flowchart of the intelligent voice recognition method for AR helmets according to an embodiment of the invention;
FIG.2: a structural diagram of the intelligent voice recognition system for AR helmets according to an embodiment of the invention.
Specific Embodiments
Preferred embodiments of the present invention are described below in conjunction with the accompanying drawings, and it should be understood that the preferred embodiments described herein are intended only to illustrate and explain the present invention and are not intended to limit the invention.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, as shown in FIG. 1, including:
Step 1: Monitor the user's voice input using the pre-deployed audio sensors of the AR helmet;
Step 2: Capture the voice wake word from the voice input based on the voice wake model, and obtain the user’s voice input command;
Step 3: Preprocess the user’s voice input command to extract voice features, and perform matching analysis of these features using a voice recognition model to capture the user's voice intent and determine the feedback command;
Step 4: Control the AR helmet to perform the feedback operation according to the feedback command and display related outputs to the user.
In this embodiment, the pre-deployed audio sensors are directional audio sensors, positioned to enhance the monitoring efficiency and accuracy of user voice inputs.
The user's voice inputs may include the user's own voice signals, environmental noise, other users' voice signals, and audio from the AR helmet’s speakers used for feedback.
In this embodiment, the voice wake words activate the AR helmet to perform voice recognition tasks when detected by the helmet's voice monitoring system. These wake words can be preset or added by the user, and there is no limit to the number of wake words.
The voice wake model refers to a voice recognition model that identifies specific voice wake words.
The voice recognition model refers to a model that transforms voice signals into text signals or other recognizable signals through recognition and parsing processes.
The user's voice intent refers to the user's desire for the AR helmet to perform certain actions, such as playing music.
Feedback operations are actions that the AR helmet performs after successfully recognizing the user's intent, such as playing music when the intent to play music is recognized.
Displaying related outputs to the user means showing what the AR helmet has done in response to the feedback operations, for example, playing music through the speaker audio or displaying visual outputs.
The technical solution's beneficial effects include: using the AR helmet’s pre-deployed audio sensors to monitor user voice inputs; capturing voice wake words using a voice wake model to obtain voice input commands; preprocessing these commands to extract voice features and performing matching analysis to determine user voice intent and feedback commands; and controlling the AR helmet to execute feedback operations and display related outputs, thereby enhancing voice recognition accuracy and efficiency, and enabling intelligent determination of user intent.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, the monitoring of the user's voice input using the pre-deployed audio sensors of the AR helmet includes:
Pre-setting the voice collection area and pre-deploying multiple directional audio sensors within the AR helmet device;
When it is detected that the user is using the AR helmet, automatically adjust the reception volume of the directional audio sensors to capture continuous sound signals from each sensor in the voice collection area, and combine the captured signals to be recorded as the user’s voice input.
In this embodiment, the voice collection area is set directly in front of the AR helmet to maximize the monitoring of user voice inputs while minimizing interference from other users’ voices.
Directional audio sensors are designed to monitor voice signals from a fixed direction, effectively reducing environmental noise.
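As an illustration of how the continuous signals from the individual directional sensors could be combined into a single recorded voice input, the following is a minimal Python sketch; the energy gate, the block interface, and the averaging strategy are assumptions, since the patent does not fix a combination rule.

```python
import numpy as np

def combine_directional_sensors(blocks, energy_gate=1e-4):
    """Merge synchronized sample blocks from several directional audio sensors.

    blocks: list of equal-length 1-D numpy arrays, one per sensor.
    energy_gate: sensors whose mean-square energy falls below this value are
                 treated as hearing no voice and are ignored (illustrative value).
    Returns one 1-D array recorded as the user's voice input.
    """
    stacked = np.stack(blocks)                    # shape: (n_sensors, n_samples)
    energies = np.mean(stacked ** 2, axis=1)      # per-sensor energy
    active = stacked[energies >= energy_gate]     # sensors that captured sound
    if active.size == 0:
        return np.zeros(stacked.shape[1])         # nothing useful was captured
    return active.mean(axis=0)                    # average the active channels
```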
The beneficial effects of this technical solution include monitoring continuous sound signals in the voice collection area using multiple pre-deployed directional audio sensors, effectively monitoring user voice inputs, and significantly reducing interference from other users’ voice signals, thereby improving the accuracy of intelligent voice recognition.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, prior to capturing the voice wake word based on the voice wake model, it further includes: performing initial recognition processing of the user’s voice input, including:
Extracting the pitch lines of the input, obtaining the start and end points of each pitch line, dividing the user's voice input based on these points, and inspecting the divided segments to discard segments without human voice input;
Using a pre-deployed adaptive filter to apply nonlinear processing to the voice input after discard processing to eliminate the speaker audio of the AR helmet used for feedback information.
In this embodiment, an adaptive filter removes the speaker audio detected by the directional sensors, performing noise reduction on the voice signals.
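The patent does not name the adaptive filtering algorithm; a normalized LMS (NLMS) canceller driven by the signal sent to the helmet speaker is one common choice and is sketched below, with the filter length and step size chosen purely for illustration.

```python
import numpy as np

def nlms_cancel_speaker(mic, speaker, taps=64, mu=0.5, eps=1e-8):
    """Suppress the helmet's own speaker audio picked up by the microphones.

    mic:     microphone samples containing user voice plus speaker feedback.
    speaker: the signal that was actually played through the speaker.
    Returns the residual signal, i.e. the microphone signal with the
    speaker contribution adaptively subtracted.
    """
    w = np.zeros(taps)                       # adaptive filter coefficients
    buf = np.zeros(taps)                     # most recent speaker samples
    out = np.zeros(len(mic), dtype=float)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = speaker[n]
        echo_est = np.dot(w, buf)            # estimated speaker leakage
        e = mic[n] - echo_est                # residual (ideally the user's voice)
        w += (mu / (np.dot(buf, buf) + eps)) * e * buf   # NLMS coefficient update
        out[n] = e
    return out
```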
If the user's voice input contains a lot of meaningless noise, recognizing all voice inputs can lead to low efficiency. Therefore, segments without human voice input are discarded.
For example, if a voice input includes a phone vibration, this vibration would start at moment 1 and end at moment 2, with moment 1 being the start point and moment 2 the endpoint.
The beneficial effect of this technical solution is that it performs initial recognition processing on the user's voice input, discarding invalid and interfering voice inputs, which enhances the efficiency of subsequent data processing and ensures the efficiency of voice recognition.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, capturing the voice wake word from the voice input based on the voice wake model, and obtaining the user's voice input command includes:
Capturing the initial segments of the user's voice input using a pre-built voice wake model, and upon detecting the presence of a voice wake word, acquiring the complete voice input data including the voice wake word, recorded as the user’s voice input command.
In this embodiment, the initial capture of user voice input is based on a pre-built voice wake model. The voice wake model monitors the user's speech signals, filtering and comparing them with preset wake words. When a match is found, the AR helmet begins to monitor and analyze the user's voice input.
When the user wears the AR helmet, it starts monitoring for voice inputs. The voice recognition functionality of the AR helmet activates upon detecting a wake word and commences voice recognition operations. If no wake word is detected, it is determined that the user has no interaction intent, and the intelligent voice recognition functionality of the helmet does not activate.
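A minimal sketch of this gating behavior is shown below; the `transcribe_segment` callable and the stream interface are hypothetical placeholders, since the internal form of the voice wake model is not specified.

```python
WAKE_WORDS = {"hello helmet", "hey assistant"}    # preset; users may add their own

def detect_wake_word(initial_segment, transcribe_segment, wake_words=WAKE_WORDS):
    """Return True when the initial segment of the voice input contains a wake word.

    transcribe_segment is a hypothetical callable mapping audio to text; a real
    voice wake model may instead score acoustic features directly.
    """
    text = transcribe_segment(initial_segment).lower()
    return any(word in text for word in wake_words)

def capture_command(stream, transcribe_segment):
    """Record a complete voice input command only after a wake word is detected."""
    head = stream.read_initial_segment()          # hypothetical stream interface
    if not detect_wake_word(head, transcribe_segment):
        return None                               # no interaction intent, stay idle
    # The command is recorded including the wake word itself.
    return head + stream.read_until_silence()
```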
The beneficial effect of this technical solution is that it captures user voice inputs based on a pre-built voice wake model, avoiding the capture of irrelevant voices, reducing ineffective voice inputs, and facilitating subsequent parsing of voice inputs.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, preprocessing the user's voice input command to extract voice features includes:
Framing and windowing the user's voice input command for preprocessing, and initially removing noise from the voice input command;
Performing endpoint detection on each frame of the voice input command after initial noise removal;
Applying noise reduction to frames detected to contain human voice;
Extracting voice features from each noise-reduced frame, then normalizing these features to capture the voice characteristics of the user's voice input command.
In this embodiment, framing and windowing preprocessing use the Hamming window, which has the lowest sidelobe height compared to rectangular and Hann windows, thus offering smoother low-pass characteristics for processing voice inputs.
Endpoint detection in this embodiment involves identifying the start and end points of the user's speech from the voice input command, excluding non-speaking segments to reduce voice recognition time and enhance recognition efficiency.
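The framing, Hamming windowing, and energy-based endpoint detection can be sketched as follows; the frame length, hop size, and energy threshold are illustrative values not taken from the patent.

```python
import numpy as np

def frame_and_window(command, frame_len=400, hop=160):
    """Split the voice input command into overlapping frames and apply a Hamming window."""
    if len(command) < frame_len:                  # pad very short inputs
        command = np.pad(command, (0, frame_len - len(command)))
    window = np.hamming(frame_len)
    n_frames = 1 + (len(command) - frame_len) // hop
    frames = np.stack([command[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * window

def detect_speech_frames(frames, energy_ratio=0.1):
    """Simple endpoint detection: mark frames whose energy exceeds a fraction
    of the loudest frame's energy as containing speech."""
    energy = np.sum(frames ** 2, axis=1)
    return energy > energy_ratio * energy.max()   # boolean mask over frames
```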
Voice features represent the information carried by a voice, varying significantly across different voice signals, with high distinctiveness. The extraction of voice features and the selection of feature parameters influence the subsequent accuracy of voice recognition.
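Whatever feature parameters are chosen, the normalization step referred to above can be as simple as the per-utterance mean-variance scaling sketched here.

```python
import numpy as np

def normalize_features(features, eps=1e-8):
    """Mean-variance normalize a (n_frames, n_features) matrix so that every
    feature dimension has zero mean and unit variance over the utterance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)
```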
The beneficial effect of this technical approach is that it applies noise reduction to the user's voice input commands, enhancing the accuracy of voice recognition; obtaining voice features from the user's voice input commands aids in voice recognition and the subsequent intelligent determination of user intent.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, the matching analysis of extracted voice features using a voice recognition model to ascertain user voice intent and determine feedback commands, including:
Randomly obtaining N first standard signals from a preset voice library and acquiring each of these first standard signals’ weighted signal-to-noise ratio with a randomly added noise signal;
Based on the ratio of preset SNR to weighted SNR, amplifying the randomly added noise signal, and synthesizing the corresponding first standard signal with the amplified noise signal to construct training voice data for an acoustic model;
Inputting N1 training voice data sequentially into the acoustic model, statistically analyzing each training voice data under the model's transition state structure to determine the number of transition states and parameters;
Comparing the first transition count, determined by the phoneme count of each training voice data, with the corresponding number of transition states, and obtaining a parameter set that matches the random added noise signal associated with the respective training voice data;
Assigning a minimum value to the corresponding transition state parameters when there is a zero value and the comparison result is outside the preset range;
Assigning the highest frequency value to the corresponding transition state parameters when there is a zero value and the comparison result is within the preset range;
Maintaining the values of the counted transition state parameters unchanged if there are no zero values and the comparison result is within the preset range;
Otherwise, adding a new noise signal to the corresponding first standard signal;
Updating the acoustic model based on the final values of the corresponding transition state parameters, obtaining the second intrinsic state parameters of the updated model and the first intrinsic state parameters of the model before the update;
Deeming the updated model as the optimal acoustic model if the differences between the first and second intrinsic state parameters meet the preset convergence conditions;
Randomly selecting a training voice data input into the optimal acoustic model, obtaining the corresponding first text sequence, and performing redundancy preprocessing to generate a training text sequence;
Converting the user's voice input command into a corresponding second text sequence and performing redundancy processing to generate the current text sequence;
Inputting both the training and current text sequences into a language model, calculating the analysis capability coefficient of the language model:

P = \frac{\sum_{i1=1}^{c1} f0_{i1}\,(1 + \beta1_{i1}/z1)\,\frac{T1a}{G1}\,\frac{z11 + z12}{z01} + \sum_{i2=1}^{c2} f1_{i2}\,(1 + \beta2_{i2}/z1)\,\frac{T2a}{G2}\,\frac{z11 + z12}{z02}}{z01 + z02}
Wherein, P represents the analysis capability coefficient of the language model; c1 represents the number of times the language model analyzes the training text sequence; c2 represents the number of times the language model analyzes the current text sequence; f0_{i1} represents the number of sequence combinations under the i1-th analysis of the training text sequence; β1_{i1} represents the computational power for the i1-th analysis of the training text sequence; z01 represents the total number of sequences in the training text sequence; z02 represents the total number of sequences in the current text sequence; T1a represents the analysis time by the language model for the training text sequence; G1 represents the estimated analysis time for the training text sequence; f1_{i2} represents the number of sequence combinations under the i2-th analysis of the current text sequence; β2_{i2} represents the computational power for the i2-th analysis of the current text sequence; T2a represents the analysis time by the language model for the current text sequence; G2 represents the estimated analysis time for the current text sequence; z1 represents the number of times the language model is trained simultaneously on the current and training text sequences; z11 represents the number of times the language model analyzes key sequences in the training text sequence; z12 represents the number of times the language model analyzes key sequences in the current text sequence.
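Assuming the formula takes the form reconstructed above, the coefficient can be evaluated directly from the logged per-analysis quantities, as in the following sketch whose argument names mirror the symbol definitions.

```python
def analysis_capability(f0, b1, f1, b2, T1a, G1, T2a, G2, z1, z01, z02, z11, z12):
    """Evaluate the analysis capability coefficient P of the language model.

    f0, b1: sequence-combination counts and computational power for each of the
            c1 analyses of the training text sequence.
    f1, b2: the same quantities for the c2 analyses of the current text sequence.
    The remaining arguments follow the symbol definitions given in the text.
    """
    train_term = sum(
        f0_i * (1 + b1_i / z1) * (T1a / G1) * (z11 + z12) / z01
        for f0_i, b1_i in zip(f0, b1)
    )
    current_term = sum(
        f1_i * (1 + b2_i / z1) * (T2a / G2) * (z11 + z12) / z02
        for f1_i, b2_i in zip(f1, b2)
    )
    return (train_term + current_term) / (z01 + z02)
```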
When the analysis capability coefficient exceeds a preset capability threshold, the predictive ability of the language model is considered satisfactory;
Combining the optimal acoustic model with the satisfactory language model for decoding, yielding the best output sequence for the voice input command, thereby determining user intent, where the post-decoding model is the voice recognition model;
Determining the current scenario type based on historical voice commands, and if the user's intent does not belong to the intent set of the current scenario, the input is deemed invalid, and the AR helmet does not generate a corresponding feedback command;
If the user's intent belongs to the intent set of the current scenario, gathering each historical voice command's speech rate, volume, and speaking style to construct the current voice characteristics, and if the voice input features of the user's intent match the current voice characteristics, the input is considered valid, and the AR helmet's feedback command is determined according to a user intent-feedback command mapping;
If the voice input features of the user's intent do not match the current voice characteristics, the input is deemed invalid, and no corresponding feedback command is generated;
When multiple user intents exist, determining the execution order of feedback commands based on the valid input order of user intents.
In this embodiment, the first standard signal refers to a pure voice signal from the preset voice library, recorded in an environment free of interference.
Signal-to-noise ratio (SNR) is the relative strength or power ratio between the signal and noise, used to assess signal quality. A higher SNR of a voice signal indicates better quality of the corresponding voice signal.
Due to the inevitable presence of various noise signals in the user's environment, it is necessary to test the voice recognition model under different SNR conditions to meet the real-world environmental needs of the user.
Training voice data refers to voice data used to test the voice recognition model, created by amplifying randomly added noise signals based on a preset SNR and synthesizing them with the first standard signal, where the randomly added noise signals are selected from a clean noise database.
The formula for calculating the SNR is as follows:
SNR = 10\log_{10}\left(\frac{\sum_{j1=1}^{T} s_{j1}^{2}}{\sum_{j1=1}^{T} n_{j1}^{2}}\right)

Where SNR is the signal-to-noise ratio; s_{j1} is the amplitude of the j1-th frame of the voice signal; n_{j1} is the amplitude of the j1-th frame of the noise signal; \sum_{j1=1}^{T} s_{j1}^{2} is the power of the voice signal; \sum_{j1=1}^{T} n_{j1}^{2} is the power of the noise signal; T is the length of the training voice signal.
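A sketch of how a training voice data item might be synthesized from a clean first standard signal and a noise recording follows; the exact weighting used in the patent is not spelled out, so plain power-based scaling of the noise to reach the preset SNR is assumed (both arrays are taken to have equal length).

```python
import numpy as np

def snr_db(voice, noise):
    """Signal-to-noise ratio in dB, following the formula above."""
    return 10 * np.log10(np.sum(voice ** 2) / np.sum(noise ** 2))

def mix_at_preset_snr(standard_signal, noise, preset_snr_db):
    """Amplify the randomly added noise so that the synthesized training voice
    data has the preset SNR, then add it to the clean first standard signal."""
    current = snr_db(standard_signal, noise)
    # Lowering the SNR from `current` to `preset_snr_db` dB means scaling the
    # noise amplitude by 10 ** ((current - preset_snr_db) / 20).
    gain = 10 ** ((current - preset_snr_db) / 20)
    return standard_signal + gain * noise
```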
In this embodiment, the acoustic model and the language model are key components of the voice recognition system. The acoustic model transforms input voice signals into a sequence of phonemes or syllables, while the language model predicts the next word after obtaining a sequence of phonemes or syllables.
Phonemes are the smallest vocal units divided according to the natural properties of speech, representing the most basic units of language used to differentiate meanings in speech; syllables are vocal units formed by combining phonemes.
The transition state structure of the acoustic model includes both a left-to-right structure and a state-through structure. The state-through structure is mainly used for language identification, while the left-to-right structure is primarily used for voice recognition. Thus, a left-to-right structure acoustic model is selected here.
In this embodiment, the number of transition states in the acoustic model is determined based on the number of phonemes in the training voice data, typically ranging from 4 to 8.
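If the transition-state count simply follows the phoneme count clamped to that 4-8 range, which is an assumption rather than a rule stated in the patent, the selection reduces to:

```python
def transition_state_count(phoneme_count, low=4, high=8):
    """Transition-state count for a training utterance, clamped to the 4-8 range."""
    return max(low, min(high, phoneme_count))
```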
The transition state parameters of the acoustic model refer to the model parameters obtained after the initial acoustic model matches and analyzes the training voice data.
The first transition count refers to the number of transitions determined by the phoneme count of each training voice data.
The final value of the corresponding transition state parameters is determined based on the parameter set assigned either the minimum value, the most frequently used value, or an unchanged value; the minimum value can be manually selected, such as 0.001.
The first intrinsic state parameter refers to the state parameters of the unupdated model, and the second intrinsic state parameter refers to the state parameters of the model updated based on the final values of the corresponding transition state parameters.
When the difference between the first and second intrinsic state parameters meets the preset convergence conditions, such as the difference not exceeding 0.05, it is determined that the convergence condition is satisfied, and the updated model is then considered the optimal acoustic model.
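Using the 0.05 tolerance mentioned as an example, the convergence check can be sketched as an elementwise comparison of the two parameter sets.

```python
def has_converged(first_params, second_params, tol=0.05):
    """True when every intrinsic state parameter changed by at most tol between
    the model before the update and the updated model."""
    return all(abs(a - b) <= tol for a, b in zip(first_params, second_params))
```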
Redundant preprocessing involves removing invalid words and stop words from the text sequence, retaining only the text sequence that contains effective information, which facilitates subsequent text recognition processing.
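Redundancy preprocessing then amounts to token filtering; a minimal sketch with an illustrative stop-word list:

```python
STOP_WORDS = {"the", "a", "an", "um", "uh", "please"}   # illustrative list only

def redundancy_preprocess(tokens, stop_words=STOP_WORDS):
    """Drop stop words and empty tokens, keeping only effective information."""
    return [t for t in tokens if t and t.lower() not in stop_words]
```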
The first text sequence refers to the text sequence obtained when the training voice data is input into the optimal acoustic model; the second text sequence refers to the text sequence obtained when the user's voice input command is input into the optimal acoustic model, and this text sequence includes some numerical information, for example, the first text sequence is "120011," and the second text sequence is "120021."
The analysis capability coefficient of the language model is a metric that measures the language model's ability to predict the next word. A higher analysis capability coefficient indicates stronger predictive ability of the language model.
Combined decoding processing refers to the process of generating voice recognition results from both the acoustic and language models.
The best output sequence refers to the optimal voice recognition result for user voice inputs by the voice recognition model.
The intent set of the current scenario type is a finite set; the AR helmet only responds to user intents within this set, which is a preset set, determined and optimized based on the historical scenario database.
Current voice characteristics refer to the set of features constructed based on the user's voice input rate, volume, and speaking style, used to distinguish the current AR helmet user from other users, preventing the AR helmet from recognizing other users’ intents and producing erroneous responses and feedback, thus affecting user experience.
When multiple user intents exist, the execution order of feedback commands is determined based on the valid input order of the user intents. For example, if the user's intents include both opening music and lowering the volume, the AR helmet prioritizes executing the music opening intent entered first, followed by the volume lowering intent entered later.
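The validity checks and ordering described above might be organized as in the following sketch; the scenario intent sets, the intent-to-feedback mapping, and the tolerance-based comparison of speech rate and volume are illustrative stand-ins for details the patent leaves open.

```python
from collections import deque

SCENARIO_INTENTS = {"music": {"play_music", "lower_volume"}}                 # illustrative
INTENT_FEEDBACK = {"play_music": "start_player", "lower_volume": "volume_down"}

def voice_features_match(features, profile, tol=0.2):
    """Compare speech rate and volume against the current voice characteristics;
    the actual matching rule is not specified in the patent."""
    return (abs(features["rate"] - profile["rate"]) <= tol * profile["rate"]
            and abs(features["volume"] - profile["volume"]) <= tol * profile["volume"])

def feedback_commands(intents, scenario, profile):
    """Filter invalid intents and queue feedback commands in valid input order."""
    queue = deque()
    allowed = SCENARIO_INTENTS.get(scenario, set())
    for intent, features in intents:                 # intents arrive in input order
        if intent not in allowed:
            continue                                 # outside the scenario's intent set
        if not voice_features_match(features, profile):
            continue                                 # likely another speaker's voice
        if intent in INTENT_FEEDBACK:
            queue.append(INTENT_FEEDBACK[intent])    # intent-to-feedback mapping
    return list(queue)
```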
The preset capability threshold is generally set at 0.3.
The beneficial effect of this technical solution is that it matches and analyzes extracted voice features using the voice recognition model to obtain the user's voice intent and determine the feedback command, achieving accurate recognition of user voice as well as precise determination of user intent.
The embodiment of the present invention provides an intelligent voice recognition method for AR helmets, controlling the AR helmet to perform feedback operations based on feedback commands and displaying related outputs to the user, including:
Determining the AR helmet's feedback outputs to the user according to a feedback command-output mapping;
Locking the speaker audio used for feedback operations, and when it is detected that the user has ceased voice input and the speaker audio volume is not below a preset feedback volume, controlling the locked speaker to display the relevant output to the user;
Otherwise, visualizing the feedback output according to the user's scenario type.
In this embodiment, feedback output refers to selecting appropriate outputs based on feedback commands to provide relevant feedback to the user, including audio output and visual display output.
Feedback operation refers to executing relevant actions based on user intent; for example, if a user wishes to play specific music, the AR helmet, upon successfully recognizing the user's intent to play music, finds and plays the music, and simultaneously displays relevant outputs to the user indicating a successful feedback operation.
Speaker audio refers to the feedback voice generated based on feedback commands.
Preset feedback volume is the minimum allowable volume at which the user can hear feedback voices; if the user adjusts the speaker audio volume below the preset feedback volume, then feedback is not provided via speaker audio.
Visual display output refers to converting feedback outputs into text or other visual content displayed in a user-visible location that does not interfere with the user experience.
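A sketch of the resulting output selection, assuming boolean status flags and caller-supplied playback and display callbacks:

```python
def deliver_feedback(output, speaker_volume, preset_feedback_volume,
                     user_stopped_speaking, play_audio, show_visual):
    """Prefer the locked speaker when the user is silent and the speaker volume
    is not below the preset feedback volume; otherwise fall back to a visual display."""
    if user_stopped_speaking and speaker_volume >= preset_feedback_volume:
        play_audio(output)       # feedback via the locked speaker audio
    else:
        show_visual(output)      # text/graphics shown in a non-intrusive location
```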
The beneficial effect of this technical solution is that it controls the AR helmet to perform feedback operations based on feedback commands and displays related outputs to the user, achieving intelligent voice recognition. The feedback operation can successfully provide feedback to the user without affecting the user experience.
The embodiment of the present invention provides an intelligent voice recognition system for AR helmets, as shown in FIG.2, including:
A voice monitoring module: Monitoring the user's voice input using the AR helmet's pre-deployed audio sensors;
A voice collection module: Capturing the voice wake word from the voice input based on the voice wake model, obtaining the user's voice input command;
A voice recognition module: Preprocessing the user's voice input command to extract voice features, and performing matching analysis with a voice recognition model to ascertain the user's voice intent and determine the feedback command;
A voice feedback module: Controlling the AR helmet to perform feedback operations based on the feedback command and displaying related outputs to the user.
The technical solution's beneficial effects include: using the AR helmet’s pre-deployed audio sensors to monitor user voice inputs; capturing voice wake words using a voice wake model to obtain voice input commands; preprocessing these commands to extract voice features and performing matching analysis to determine user voice intent and feedback commands; and controlling the AR helmet to execute feedback operations and display related outputs, thereby enhancing voice recognition accuracy and efficiency, and enabling intelligent determination of user intent.
It is evident that those skilled in the art can make various changes and modifications to this invention without departing from the spirit and scope of the invention. Thus, if these modifications and variations of this invention fall within the scope of the claims and their equivalents, then the invention also intends to include these changes and variations.

Claims (8)

1. An intelligent voice recognition method for AR helmets, wherein comprising the following steps: Step 1: Monitor the user's voice input using the pre-deployed audio sensors of the AR helmet; Step 2: Capture the voice wake word from the voice input based on the voice wake model, and obtain the user’s voice input command; Step 3: Preprocess the user’s voice input command to extract voice features, and perform matching analysis of these features using a voice recognition model to capture the user's voice intent and determine the feedback command; Step 4: Control the AR helmet to perform the feedback operation according to the feedback command and display related outputs to the user.
2. The intelligent voice recognition method for AR helmets according to claim 1, wherein the monitoring of the user's voice input using the pre-deployed audio sensors of the AR helmet comprises: Pre-setting the voice collection area and pre-deploying multiple directional audio sensors within the AR helmet device; When it is detected that the user is using the AR helmet, automatically adjust the reception volume of the directional audio sensors to capture continuous sound signals from each sensor in the voice collection area, and combine the captured signals to be recorded as the user’s voice input.
3. The intelligent voice recognition method for AR helmets according to claim 2, wherein, prior to capturing the voice wake word based on the voice wake model, it further comprises: performing initial recognition processing of the user’s voice input, comprising: Extracting the pitch lines of the input, obtaining the start and end points of each pitch line, dividing the user's voice input based on these points, and inspecting the divided segments to discard segments without human voice input; Using a pre-deployed adaptive filter to apply nonlinear processing to the voice input after discard processing to eliminate the speaker audio of the AR helmet used for feedback information.
4. The intelligent voice recognition method for AR helmets according to claim 1, wherein capturing the voice wake word from the voice input based on the voice wake model, and obtaining the user's voice input command comprises: Capturing the initial segments of the user's voice input using a pre-built voice wake model, and upon detecting the presence of a voice wake word, acquiring the complete voice input data comprising the voice wake word, recorded as the user’s voice input command.
5. The intelligent voice recognition method for AR helmets according to claim 1, wherein preprocessing the user's voice input command to extract voice features comprises: Framing and windowing the user's voice input command for preprocessing, and initially removing noise from the voice input command; Performing endpoint detection on each frame of the voice input command after initial noise removal; Applying noise reduction to frames detected to contain human voice, and extracting voice features from each noise-reduced frame. Extracting voice features from each frame after noise reduction then normalizing these features to capture the voice characteristics of the user's voice input command.
6. An intelligent voice recognition method for AR helmets according to claim 1, wherein the matching analysis of extracted voice features using a voice recognition model to ascertain user voice intent and determine feedback commands, comprising:
Randomly obtaining N first standard signals from a preset voice library and acquiring each of these first standard signals’ weighted signal-to-noise ratio with a randomly added noise signal; Based on the ratio of preset SNR to weighted SNR, amplifying the randomly added noise signal, and synthesizing the corresponding first standard signal with the amplified noise signal to construct training voice data for an acoustic model; Inputting n1 training voice data sequentially into the acoustic model, statistically analyzing each training voice data under the model's transition state structure to determine the number of transition states and parameters; Comparing the first transition count, determined by the phoneme count of each training voice data, with the corresponding number of transition states, and obtaining a parameter set that matches the random added noise signal associated with the respective training voice data; Assigning a minimum value to the corresponding transition state parameters when there is a zero value and the comparison result is outside the preset range; Assigning the highest frequency value to the corresponding transition state parameters when there is a zero value and the comparison result is within the preset range; Maintaining the values of the counted transition state parameters unchanged if there are no zero values and the comparison result is within the preset range; Otherwise, adding a new noise signal to the corresponding first standard signal; Updating the acoustic model based on the final values of the corresponding transition state parameters, obtaining the second intrinsic state parameters of the updated model and the first intrinsic state parameters of the model before the update; Deeming the updated model as the optimal acoustic model if the differences between the first and second intrinsic state parameters meet the preset convergence conditions; Randomly selecting a training voice data input into the optimal acoustic model, obtaining the corresponding first text sequence, and performing redundancy
preprocessing to generate a training text sequence;
Converting the user's voice input command into a corresponding second text sequence and performing redundancy processing to generate the current text sequence;
Inputting both the training and current text sequences into a language model, calculating the analysis capability coefficient of the language model:

P = \frac{\sum_{i1=1}^{c1} f0_{i1}\,(1 + \beta1_{i1}/z1)\,\frac{T1a}{G1}\,\frac{z11 + z12}{z01} + \sum_{i2=1}^{c2} f1_{i2}\,(1 + \beta2_{i2}/z1)\,\frac{T2a}{G2}\,\frac{z11 + z12}{z02}}{z01 + z02}
Wherein, P represents the analysis capability coefficient of the language model; c1 represents the number of times the language model analyzes the training text sequence; c2 represents the number of times the language model analyzes the current text sequence; f0_{i1} represents the number of sequence combinations under the i1-th analysis of the training text sequence; β1_{i1} represents the computational power for the i1-th analysis of the training text sequence; z01 represents the total number of sequences in the training text sequence; z02 represents the total number of sequences in the current text sequence; T1a represents the analysis time by the language model for the training text sequence; G1 represents the estimated analysis time for the training text sequence; f1_{i2} represents the number of sequence combinations under the i2-th analysis of the current text sequence; β2_{i2} represents the computational power for the i2-th analysis of the current text sequence; T2a represents the analysis time by the language model for the current text sequence; G2 represents the estimated analysis time for the current text sequence; z1 represents the number of times the language model is trained simultaneously on the current and training text sequences; z11 represents the number of times the language model analyzes key sequences in the training text sequence; z12 represents the number of times the language model analyzes key sequences in the current text sequence.
When the analysis capability coefficient exceeds a preset capability threshold, the predictive ability of the language model is considered satisfactory; Combining the optimal acoustic model with the satisfactory language model for decoding, yielding the best output sequence for the voice input command, thereby determining user intent, where the post-decoding model is the voice recognition model; Determining the current scenario type based on historical voice commands, and if the user's intent does not belong to the intent set of the current scenario, the input is deemed invalid, and the AR helmet does not generate a corresponding feedback command; If the user's intent belongs to the intent set of the current scenario, gathering each historical voice command's speech rate, volume, and speaking style to construct the current voice characteristics, and if the voice input features of the user's intent match the current voice characteristics, the input is considered valid, and the AR helmet's feedback command is determined according to a user intent-feedback command mapping; If the voice input features of the user's intent do not match the current voice characteristics, the input is deemed invalid, and no corresponding feedback command is generated; When multiple user intents exist, determining the execution order of feedback commands based on the valid input order of user intents.
7. An intelligent voice recognition method for AR helmets according to claim 1, wherein controlling the AR helmet to perform feedback operations based on feedback commands and displaying related outputs to the user, comprising: Determining the AR helmet's feedback outputs to the user according to a feedback command-output mapping; Locking the speaker audio used for feedback operations, and when it is detected that the user has ceased voice input and the speaker audio volume is not below a preset feedback volume, controlling the locked speaker to display the relevant output to the user; Otherwise, visualizing the feedback output according to the user's scenario type.
8. An intelligent voice recognition system for AR helmets, wherein comprising: A voice monitoring module: Monitoring the user's voice input using the AR helmet's pre-deployed audio sensors; A voice collection module: Capturing the voice wake word from the voice input based on the voice wake model, obtaining the user's voice input command; A voice recognition module: Preprocessing the user's voice input command to extract voice features, and performing matching analysis with a voice recognition model to ascertain the user's voice intent and determine the feedback command; A voice feedback module: Controlling the AR helmet to perform feedback operations based on the feedback command and displaying related outputs to the user.
LU507134A 2024-03-05 2024-05-06 Intelligent voice recognition method and system for ar helmets LU507134B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410249879.7A CN118379994A (en) 2024-03-05 2024-03-05 Intelligent voice recognition method and system for AR helmet

Publications (1)

Publication Number Publication Date
LU507134B1 true LU507134B1 (en) 2024-11-06

Family

ID=91907327

Family Applications (1)

Application Number Title Priority Date Filing Date
LU507134A LU507134B1 (en) 2024-03-05 2024-05-06 Intelligent voice recognition method and system for ar helmets

Country Status (2)

Country Link
CN (1) CN118379994A (en)
LU (1) LU507134B1 (en)

Also Published As

Publication number Publication date
CN118379994A (en) 2024-07-23
