
CN110310644A - Wisdom class board interaction method based on speech recognition - Google Patents

Wisdom class board interaction method based on speech recognition Download PDF

Info

Publication number
CN110310644A
CN110310644A
Authority
CN
China
Prior art keywords
user
signal
information
voice
carried out
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910577869.5A
Other languages
Chinese (zh)
Inventor
陈天
蔡瑞琦
丁国柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yundi Technology Co Ltd
Original Assignee
Guangzhou Yundi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yundi Technology Co Ltd filed Critical Guangzhou Yundi Technology Co Ltd
Priority to CN201910577869.5A priority Critical patent/CN110310644A/en
Publication of CN110310644A publication Critical patent/CN110310644A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention provides a speech-recognition-based interaction method for a wisdom class board, comprising: receiving a voice signal from a first user; pre-processing the voice signal to obtain a first digital signal; performing feature extraction on the first digital signal to obtain characteristic parameters; decoding the characteristic parameters to obtain an optimal word model sequence, where the optimal word model sequence constitutes the text information of the voice signal; performing semantic analysis on the text information of the continuous voice signal to obtain a text instruction and a user intent; displaying, on the display interface, the output result corresponding to the text instruction; determining a response message according to the user intent; and converting the response message into continuous speech for output. The method thereby shortens the user's operation path, reduces the user's operational load on the wisdom class board, and supports simultaneous interface display and voice response, greatly enhancing the user experience.

Description

Wisdom class board interaction method based on speech recognition
Technical field
The present invention relates to a processing method, and more particularly to a speech-recognition-based interaction method for a wisdom class board.
Background technique
With the rapid development of networks, network-based applications can help schools publish rich-media information across campus venues and improve the efficiency with which information spreads through the school. The wisdom class board is an application vector in the information-technology environment and an important component of smart-classroom and smart-campus construction. Deployed at the classroom doorway, it displays curriculum schedules and class and school notices, and supports cultural construction, notice publication, card-swipe attendance, information queries, and home-school messaging. However, the user can only exchange information with the wisdom class board through traditional interaction modes, which is inconvenient and greatly diminishes the user experience.
Existing wisdom class boards mainly transmit character codes as the information mechanism between user and device. The user operates traditional input hardware, such as touch-sensitive input devices, remote controls, or body keys, essentially relying on hardware to simulate a mouse and issue precise operating instructions to the software; the computer then processes the instructions and feeds information back to the user in the form of characters or pictures. With these traditional interaction modes the user is in a passive position during human-computer interaction and cannot interact with the machine naturally.
Traditional interaction modes must follow fixed operating steps, so input efficiency is low; they also require the user to learn and memorize how to use them, so the cognitive load is high.
Summary of the invention
The purpose of the present invention is, in view of the drawbacks of the prior art, to provide a speech-recognition-based wisdom class board interaction method, so as to solve the problem that prior-art wisdom class board interaction modes require the user to learn and memorize how to use them, resulting in a high cognitive load.
To solve the above problems, the present invention provides a speech-recognition-based wisdom class board interaction method, the method comprising:
receiving a voice signal from a first user;
pre-processing the voice signal to obtain a first digital signal;
performing feature extraction on the first digital signal to obtain characteristic parameters;
decoding the characteristic parameters to obtain an optimal word model sequence, where the optimal word model sequence constitutes the text information of the voice signal;
performing semantic analysis on the text information of the continuous voice signal to obtain a text instruction and a user intent;
displaying, on the display interface, the output result corresponding to the text instruction;
determining a response message according to the user intent;
converting the response message into continuous speech and outputting it.
In one possible implementation, before the above steps the method further includes:
receiving first voice authentication information from the first user;
performing feature extraction on the first voice authentication information to obtain first feature information;
matching the first feature information against a first template in a reference model library;
after a successful match, processing the first feature information to determine first identity information of the first user, where the first identity information includes the identity ID and identity grade of the first user;
binding the first identity information to the ID of the wisdom class board.
In one possible implementation, pre-processing the voice signal to obtain the first digital signal specifically includes:
sampling and quantizing the voice signal to obtain a raw digital signal;
applying pre-emphasis to the raw digital signal to obtain a pre-emphasized voice signal;
framing the pre-emphasized voice signal to obtain a framed voice signal;
windowing the framed voice signal to obtain the first digital signal.
In one possible implementation, the characteristic parameters include one of: linear predictive cepstral coefficients (LPCC), perceptual linear prediction coefficients (PLP), and mel-frequency cepstrum coefficients (MFCC).
In one possible implementation, when the characteristic parameter is MFCC, performing feature extraction on the first digital signal to obtain the characteristic parameters specifically includes:
converting the first digital signal from a time-domain signal into a frequency-domain signal using the fast Fourier transform (FFT);
convolving the frequency-domain signal with a triangular filter bank distributed on the mel scale;
applying, according to the convolution results, a discrete cosine transform (DCT) to the vector formed by the outputs of the triangular filters in the filter bank;
taking the first N coefficients of the DCT to obtain the characteristic parameters of the first digital signal.
In one possible implementation, decoding the characteristic parameters to obtain the optimal word model sequence specifically includes:
scoring, through an acoustic model, a language model, and a pronunciation dictionary respectively, the similarity between the characteristic parameters and the reference templates in a pre-built reference model library, to obtain a first score corresponding to the acoustic model, a second score corresponding to the language model, and a third score corresponding to the pronunciation dictionary;
performing weighted data fusion on the first score, the second score, and the third score to obtain the optimal word model sequence, so that the optimal word model sequence constitutes the text information of the speech.
In one possible implementation, performing semantic analysis on the text information of the continuous voice signal to obtain the text instruction and the user intent specifically includes:
performing morphological analysis on the text information to divide it into multiple words;
performing syntactic analysis on the multiple words to determine the relationships between them and generate the syntactic structure of the sentence;
obtaining the text instruction according to the syntactic structure;
using machine learning to perform intent analysis on the syntactic structure and determine the user intent corresponding to the syntactic structure.
In one possible implementation, converting the response message into continuous speech and outputting it specifically includes:
converting the response message into continuous speech through a text-to-speech converter and outputting it.
In one possible implementation, after the above steps the method further includes:
receiving second voice authentication information from a second user;
performing feature extraction on the second voice authentication information to obtain second feature information;
matching the second feature information against a second template in the reference model library;
after a successful match, processing the second feature information to determine second identity information of the second user, where the second identity information includes the identity ID and identity grade of the second user;
comparing the identity grade of the second user with the identity grade of the first user;
when the identity grade of the second user is greater than that of the first user, releasing the binding between the wisdom class board ID and the first identity information, and binding the wisdom class board ID to the second identity information.
In one possible implementation, the method also includes:
when the no-operation time reaches a preset time, generating a sleep signal or a screen-lock signal;
entering a sleep state or a screen-lock state according to the sleep signal or the screen-lock signal.
By applying the speech-recognition-based wisdom class board interaction method provided by the present invention, the user's operation path is shortened, the user's operational load on the wisdom class board is reduced, and interface display and voice response can proceed simultaneously, greatly enhancing the user experience.
Detailed description of the invention
Fig. 1 is a flow chart of the speech-recognition-based wisdom class board interaction method provided by an embodiment of the present invention.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the related invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the related invention.
It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another. The technical solution of the present invention is described in further detail below through the drawings and embodiments.
The wisdom class board is generally deployed at the classroom doorway and serves to display curriculum schedules, class and school notices, cultural construction content, notice publication, card-swipe attendance, information queries, and home-school messages. The interactive system of the wisdom class board combines multiple modules to perform speech recognition on the voice signal input by the user, thereby interacting with the user through speech.
Fig. 1 is a flow chart of the speech-recognition-based wisdom class board interaction method provided by an embodiment of the present invention. The method is executed by the wisdom class board and, as shown in Fig. 1, includes the following steps:
Step 101: receive the voice signal of the first user.
The first user interacting with the wisdom class board may be identified by swiping a personal IC card, by face recognition, by voice recognition, or by similar means.
Taking voice recognition as an example, the following explains how the wisdom class board identifies the first user.
The wisdom class board has a speech reception module for receiving the user's voice information; this module may be a microphone or a microphone array, enabling the acquisition of voice information.
Specifically, first, the first voice authentication information of the first user is received; then feature extraction is performed on it to obtain the first feature information; the first feature information is matched against the first template in the reference model library; after a successful match, the first feature information is processed to determine the first identity information of the first user, which includes the first user's identity ID and identity grade; finally, the identity information is bound to the ID of the wisdom class board.
For example, the speech reception module receives the first user's utterance "Hello, I am Xiao Ming." The utterance is pre-processed (sampled, quantized, pre-emphasized, framed, and windowed), and feature extraction is performed. The extracted first feature information is then matched against the first template in the reference model library. The reference model library holds multiple templates, each corresponding to a user, and a wake-up word is preset in the first template; the content of the wake-up word is not limited and may be "hello" and so on. After a successful match the board enters standby mode, and the feature information is processed further to identify the interacting person; the identity information includes the first user's identity ID and identity grade. For example, the reference model library may also hold preset voice information for multiple pre-registered users; matching the extracted characteristic parameters against the characteristic parameters of this preset voice information determines the user's identity.
Grades may be divided according to the identities of students and teachers: students share the same grade, and a teacher's grade is higher than a student's.
Further, the students of the class where the wisdom class board is located may be given a higher grade than other students. For example, if the board belongs to Class 1 of Grade 6, the students of Class 1 of Grade 6 have a higher grade than students of other classes.
Further, class cadres may be given a higher grade than other students; for example, the class monitor's grade is higher than that of the other students in the class. Those skilled in the art may preset each grade as needed; the application does not limit this.
After identity confirmation and binding, the interface leaves the sleep or screen-lock state and enters a waiting-for-command state. Voice operation then handles only the voice information of the current interacting person, i.e., a one-to-one interaction mode, while touch operation is not restricted by identity.
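The patent gives no code for the template-matching and binding flow above. As a minimal sketch under stated assumptions, the following compares an extracted feature vector against each enrolled template by cosine similarity and, on a match, returns the identity record to bind; the library contents, threshold, and matcher are illustrative assumptions, not the actual algorithm.

```python
import math

# Hypothetical reference model library: each enrolled user has a stored
# feature template plus an identity record (ID, grade). All values are
# made up for illustration.
REFERENCE_MODELS = {
    "xiaoming": {"template": [0.9, 0.1, 0.4], "id": "S001", "grade": 1},
    "teacher_li": {"template": [0.2, 0.8, 0.5], "id": "T001", "grade": 3},
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def authenticate(feature_vector, threshold=0.85):
    """Match extracted features against every template; return the
    (identity ID, grade) to bind on success, or None to stay locked."""
    best_user, best_score = None, 0.0
    for user, record in REFERENCE_MODELS.items():
        score = cosine_similarity(feature_vector, record["template"])
        if score > best_score:
            best_user, best_score = user, score
    if best_score >= threshold:
        rec = REFERENCE_MODELS[best_user]
        return rec["id"], rec["grade"]
    return None  # no sufficiently similar template: remain in sleep/lock
```

In practice the similarity measure would operate on MFCC-style feature sequences (e.g., via dynamic time warping or a statistical model) rather than a single fixed-length vector; the structure of "score every template, accept the best above a threshold" is what carries over.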
Step 102: pre-process the voice signal to obtain the first digital signal.
Specifically, step 102 includes:
sampling and quantizing the voice signal to obtain a raw digital signal;
applying pre-emphasis to the raw digital signal to obtain a pre-emphasized voice signal;
framing the pre-emphasized voice signal to obtain a framed voice signal;
windowing the framed voice signal to obtain the first digital signal.
Before pre-emphasis, the voice signal must be sampled and quantized: sampling splits the analog audio waveform, and quantization stores the sampled amplitudes as integer values. The purpose of pre-emphasis is to boost the high-frequency part of the speech, remove the influence of lip radiation, and increase the high-frequency resolution of the speech. Pre-emphasis is generally realized by a first-order FIR high-pass digital filter with pre-emphasis coefficient a, where 0.9 < a < 1.0. If the speech sample value at time n is x(n), the pre-emphasized result is y(n) = x(n) - a*x(n-1); here a = 0.98 is taken.
After pre-emphasis filtering comes framing. Speech is short-time stationary (within 10-30 ms the signal can be considered approximately constant), so the voice signal can be divided into short segments (frames) for processing. Framing is realized by weighting the signal with a movable finite-length window. The frame rate is generally about 33-100 frames per second, depending on the situation. The usual framing method is overlapping segmentation: the overlap between one frame and the next is called the frame shift, and the ratio of frame shift to frame length is generally 0-0.5.
Windowing usually applies a Hamming window or a rectangular window to increase the attenuation of high-frequency components.
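The three pre-processing operations above (pre-emphasis with a = 0.98, overlapping framing, Hamming windowing) can be sketched directly from the formulas in the text; parameter values in the sketch are illustrative.

```python
import math

def preemphasis(samples, a=0.98):
    """y(n) = x(n) - a*x(n-1): boost high frequencies, offset lip radiation."""
    return [samples[0]] + [samples[n] - a * samples[n - 1]
                           for n in range(1, len(samples))]

def frame_signal(samples, frame_len, frame_shift):
    """Overlapping segmentation: speech is ~stationary over 10-30 ms,
    so cut it into short frames, advancing by frame_shift each time."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, frame_shift)]

def hamming(frame):
    """Apply a Hamming window to taper the frame edges."""
    n = len(frame)
    return [frame[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i in range(n)]
```

With a frame shift of half the frame length, the shift-to-length ratio is 0.5, the upper end of the range the text gives.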
Step 103: perform feature extraction on the first digital signal to obtain characteristic parameters.
Specifically, different characteristic parameters can be extracted according to the purpose of the first digital signal. The main characteristic parameters are linear predictive cepstral coefficients (LPCC), perceptual linear prediction coefficients (Perceptual Linear Predictive, PLP), and mel-frequency cepstrum coefficients (Mel Frequency Cepstrum Coefficient, MFCC).
When the characteristic parameter is MFCC, step 103 specifically includes: first, converting the first digital signal from a time-domain signal into a frequency-domain signal using the fast Fourier transform (Fast Fourier Transformation, FFT); then convolving the frequency-domain signal with a triangular filter bank distributed on the mel scale; then, according to the convolution results, applying a discrete cosine transform (Discrete Cosine Transform, DCT) to the vector formed by the outputs of the triangular filters; and finally taking the first N DCT coefficients as the characteristic parameters of the first digital signal.
Step 104: decode the characteristic parameters to obtain the optimal word model sequence; the optimal word model sequence constitutes the text information of the voice signal.
Specifically, step 104 includes:
scoring, through an acoustic model, a language model, and a pronunciation dictionary respectively, the similarity between the characteristic parameters and the reference templates in the pre-built reference model library, to obtain a first score corresponding to the acoustic model, a second score corresponding to the language model, and a third score corresponding to the pronunciation dictionary;
performing weighted data fusion on the first, second, and third scores to obtain the optimal word model sequence, so that the optimal word model sequence constitutes the text information of the speech.
Specifically, the wisdom class board contains a speech recognition decoder and a processor; the pre-processing and feature extraction described above may be carried out by the processor. The processor then sends the characteristic parameters of the speech to the speech recognition decoder, which performs similarity measurement between these characteristic parameters and the reference templates in the reference model library.
The processor may include a signal processing module and a feature extraction module, which perform the pre-processing and the feature extraction respectively.
The acoustic model is a knowledge representation of the variability of acoustics, phonetics, environment, speaker gender, accent, and so on; its purpose is to convert the feature vectors of all frames extracted by MFCC into an ordered phoneme output.
The language model is a knowledge representation of how word sequences are composed; it expresses the probability that a given word sequence occurs, generally using the chain rule to decompose the probability of a sentence into the product of the probabilities of its words.
The pronunciation dictionary contains the mapping between words and all their phonemes; its role is to connect the acoustic model and the language model. It comprises the set of words the wisdom class board can handle, together with their pronunciations.
The weighted fusion algorithm takes a weighted average of redundant multi-source information and uses the fused value as the result; it is a method that operates directly on the data sources.
The acoustic model, language model, and pronunciation dictionary each score the characteristic parameters against the reference templates in the reference model library; the weighted fusion algorithm is then applied in the score domain to give the final decision, i.e., the optimal word model sequence describing the input voice signal, from which the text information of the speech is obtained.
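The score-domain fusion above can be sketched as a weighted average over candidate word sequences; the candidate scores and the weights here are made-up illustrations (the patent does not specify weight values).

```python
def fuse_scores(candidates, weights=(0.5, 0.3, 0.2)):
    """Weighted fusion of (acoustic, language, pronunciation-dictionary)
    scores per candidate sequence; the highest fused score wins and
    becomes the optimal word model sequence."""
    w_am, w_lm, w_dict = weights
    best_seq, best_score = None, float("-inf")
    for sequence, (am, lm, d) in candidates.items():
        fused = w_am * am + w_lm * lm + w_dict * d
        if fused > best_score:
            best_seq, best_score = sequence, fused
    return best_seq, best_score
```

The language-model weight typically rescues acoustically ambiguous candidates: two sequences with identical acoustic scores are separated by how probable their word order is.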
Step 105: perform semantic analysis on the text information of the continuous voice signal to obtain the text instruction and the user intent.
Specifically, step 105 includes: first, performing morphological analysis on the text information to divide it into multiple words; then performing syntactic analysis on the words to determine the relationships between them and generate the syntactic structure of the sentence; then obtaining the text instruction from the syntactic structure; and finally using machine learning to perform intent analysis on the syntactic structure, determining the intent type corresponding to the syntactic structure and, from the intent type, the user intent.
In morphological analysis, a segmenter divides the sentence into words; this mainly involves operations such as word segmentation and part-of-speech tagging. To improve the quality of morphological analysis, entity recognition is also added for items such as fixed place names, personal names, and other proper nouns. To determine the relationships between the words in a sentence, syntactic analysis is used: its input is a word string and its output is the syntactic structure of the sentence. Intent analysis is finally realized with machine learning methods; its main job is to assign the sentence, by classification, to the corresponding intent type and so determine the user's intent, improving the response efficiency of the wisdom class board.
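As a minimal stand-in for the pipeline above, the sketch below tokenizes an utterance and assigns it to an intent type; the keyword rules replace the trained classifier the text describes and are purely illustrative, as are the intent names.

```python
def analyze_text(text):
    """Toy morphological analysis (whitespace tokenization) plus
    rule-based intent classification. A real system would use a
    segmenter, POS tagging, entity recognition, and a trained model."""
    words = text.lower().rstrip("?!.").split()
    intent_rules = {
        "query_schedule": {"class", "classes", "schedule", "timetable"},
        "query_notice": {"notice", "announcement"},
    }
    for intent, keywords in intent_rules.items():
        if keywords & set(words):  # any keyword present selects the intent
            return {"words": words, "intent": intent}
    return {"words": words, "intent": "unknown"}
```

The classification step is what lets the board route "What class is on today?" to a timetable lookup (the text instruction) while also recording the intent used to compose the spoken response.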
Step 106: display, on the display interface, the output result corresponding to the text instruction.
Specifically, the text information produced by semantic analysis is converted into one or more (a series of) text instructions, and the text instruction interface is invoked to operate. For example, the first user says: "What classes are on today?" After processing by the speech recognition decoder and the semantic analysis module, the corresponding output instruction retrieves today's timetable and presents it on the display interface of the wisdom class board.
Step 107: determine the response message according to the user intent.
Step 108: convert the response message into continuous speech and output it.
Specifically, after the user intent is obtained through semantic analysis, the dialog manager in the wisdom class board determines the response message; the text-to-speech converter then turns the response message into continuous speech of high quality and high naturalness for output. The text-to-speech converter mainly converts the response message into audio information.
For example, the first user says: "What classes are on today?" The user intent obtained is: the first user wants to know the whole day's course arrangement. The dialog manager determines the response message, which is text information, and the text-to-speech converter turns it into continuous speech of high quality and high naturalness, for example: "You have a mathematics class today from 8:30 to 9:15 ...".
Further, before or after step 101, the method also includes:
receiving second voice authentication information from a second user;
performing feature extraction on the second voice authentication information to obtain second feature information;
matching the second feature information against the second template in the reference model library;
after a successful match, processing the second feature information to determine the second identity information of the second user, which includes the second user's identity ID and identity grade;
comparing the identity grade of the second user with that of the first user;
when the identity grade of the second user is greater than that of the first user, releasing the binding between the wisdom class board ID and the first identity information, and binding the wisdom class board ID to the second identity information.
Specifically, after waking from the interface, new identity information is obtained and discriminated in real time. Suppose the current interacting person is still interacting with the device and another person joins: the identity of the newcomer is compared by grade. If the grade is lower or the same, the wisdom class board continues to handle the voice information of the current interacting person; if the grade is higher, the board binds the identity information of the higher-ranked person and switches to handling that person's voice information. This facilitates control and management of the wisdom class board by class cadres or teachers.
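The grade-based preemption rule above reduces to a small state machine: a newcomer takes over the session only with a strictly higher grade. The sketch below models just that rule; class and attribute names are illustrative.

```python
class ClassBoardSession:
    """Sketch of the one-to-one binding with grade-based preemption."""

    def __init__(self):
        self.bound_user = None  # (user_id, grade) of the current speaker

    def request_binding(self, user_id, grade):
        """Bind if the board is free or the requester outranks the
        current speaker; otherwise the current binding is kept."""
        if self.bound_user is None or grade > self.bound_user[1]:
            self.bound_user = (user_id, grade)  # release old, bind new
            return True
        return False  # equal or lower grade: current speaker keeps control
```

Using strict inequality means two students of equal grade cannot steal each other's session, matching the "lower or identical grade continues with the current person" behavior described above.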
Further, the method also includes:
when the no-operation time reaches the preset time, generating a sleep signal or a screen-lock signal;
entering a sleep state or a screen-lock state according to the sleep signal or the screen-lock signal.
Specifically, an automatic screen-lock time for no operation is preset; when the user stops interacting with the device for the preset time, the interface re-enters the sleep or screen-lock state and waits for the next wake-up.
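The idle-timeout behavior can be sketched as a monitor that compares the time since the last interaction against the preset limit; the 60-second default and the signal names are illustrative assumptions.

```python
class IdleMonitor:
    """Sketch of the no-operation timer: after `timeout_s` seconds
    without interaction, emit a sleep/screen-lock signal."""

    def __init__(self, timeout_s=60.0):
        self.timeout_s = timeout_s
        self.last_activity = 0.0

    def touch(self, now):
        """Record user interaction at timestamp `now` (seconds)."""
        self.last_activity = now

    def check(self, now):
        """Return the signal the board should act on at time `now`."""
        if now - self.last_activity >= self.timeout_s:
            return "LOCK_SCREEN"  # or a sleep signal, per configuration
        return "ACTIVE"
```

A real device would call `touch` from every input path (voice, touch, card swipe) and poll `check` from its main loop or a timer callback.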
By applying the speech-recognition-based wisdom class board interaction method provided by the present invention, the user's operation path is shortened, the user's operational load on the wisdom class board is reduced, and interface display and voice response can proceed simultaneously, greatly enhancing the user experience.
Those skilled in the art should further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. A skilled professional may use different methods to achieve the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein can be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The specific embodiments described above further explain in detail the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit its protection scope; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A speech-recognition-based wisdom class board interaction method, characterized in that the method comprises:
receiving a voice signal from a first user;
pre-processing the voice signal to obtain a first digital signal;
performing feature extraction on the first digital signal to obtain characteristic parameters;
decoding the characteristic parameters to obtain an optimal word model sequence, where the optimal word model sequence constitutes the text information of the voice signal;
performing semantic analysis on the text information of the continuous voice signal to obtain a text instruction and a user intent;
displaying, on the display interface, the output result corresponding to the text instruction;
determining a response message according to the user intent;
converting the response message into continuous speech and outputting it.
2. The method according to claim 1, characterized in that, before the method, the method further comprises:
receiving first voice authentication information of the first user;
performing feature extraction on the first voice authentication information to obtain first feature information;
matching the first feature information against a first template in a reference model library;
after the matching succeeds, processing the first feature information to determine first identity information of the first user, the first identity information comprising an identity ID and an identity grade of the first user;
binding the identity information to an ID of the smart class board.
3. The method according to claim 1, characterized in that preprocessing the voice signal to obtain the first digital signal specifically comprises:
performing sampling and quantization on the voice signal to obtain an original digital signal;
performing pre-emphasis processing on the original digital signal to obtain a pre-emphasized voice signal;
performing framing processing on the pre-emphasized voice signal to obtain framed voice signals;
performing windowing processing on the framed voice signals to obtain the first digital signal.
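For illustration only (not part of the claims), the preprocessing chain of claim 3 can be sketched as follows; the 16 kHz sample rate, 25 ms frame length, 10 ms hop, pre-emphasis coefficient 0.97, and Hamming window are assumed values that the claim does not fix:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing, and windowing of an already sampled/quantized signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split into overlapping frames (frame_len samples, hop-sample step)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: a Hamming window per frame reduces spectral leakage at frame edges
    return frames * np.hamming(frame_len)

frames = preprocess(np.random.randn(16000))  # 1 s of audio at an assumed 16 kHz
print(frames.shape)  # (98, 400)
```

The windowed frames correspond to the "first digital signal" that claim 5 then transforms to the frequency domain.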
4. The method according to claim 1, characterized in that the characteristic parameters comprise one of: linear predictive cepstral coefficients (LPCC), perceptual linear prediction coefficients (PLP), and Mel-frequency cepstral coefficients (MFCC).
5. The method according to claim 4, characterized in that, when the characteristic parameters are MFCC, performing feature extraction on the first digital signal to obtain the characteristic parameters specifically comprises:
converting the first digital signal from a time-domain signal into a frequency-domain signal using a fast Fourier transform (FFT);
convolving the frequency-domain signal with a bank of triangular filters distributed on the Mel scale;
performing, according to the convolution result, a discrete cosine transform (DCT) on the vector formed by the outputs of the triangular filters in the filter bank;
taking the first N coefficients of the DCT to obtain the characteristic parameters of the first digital signal.
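A toy sketch of the MFCC steps in claim 5 (FFT, Mel-scale triangular filter bank, log, DCT, first N coefficients). The 512-point FFT, 26 filters, and N = 13 are illustrative assumptions; the filter bank is applied as a frequency-domain product, which is how the recited convolution is realized in practice:

```python
import numpy as np

def mfcc_sketch(frames, sample_rate=16000, n_filters=26, n_ceps=13, nfft=512):
    """Toy MFCC over windowed frames of shape (n_frames, frame_len)."""
    # FFT step: power spectrum of each frame, shape (n_frames, nfft//2 + 1)
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2
    # Triangular filters whose centre frequencies are equally spaced on the Mel scale
    mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # Filter-bank outputs (log energies), then a type-II DCT over the filter axis
    feats = np.log(power @ fbank.T + 1e-10)
    n = feats.shape[1]
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * np.arange(n) + 1) / (2 * n))
    return feats @ dct.T  # keep only the first n_ceps coefficients

frames = np.random.randn(98, 400) * np.hamming(400)
print(mfcc_sketch(frames).shape)  # (98, 13)
```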
6. The method according to claim 1, characterized in that decoding the characteristic parameters to obtain the optimal word-model sequence specifically comprises:
scoring the similarity between the characteristic parameters and reference templates in a pre-built reference model library through an acoustic model, a language model, and a pronunciation dictionary respectively, to obtain a first score corresponding to the acoustic model, a second score corresponding to the language model, and a third score corresponding to the pronunciation dictionary;
performing weighted data fusion on the first score, the second score, and the third score to obtain the optimal word-model sequence, so that the optimal word-model sequence constitutes the text information of the voice.
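Per candidate sequence, the weighted data fusion of claim 6 reduces to a weighted sum of the three scores, with the best-scoring hypothesis selected. The weights, score values, and hypothesis structure below are invented for illustration:

```python
def fuse_scores(hypotheses, w_am=1.0, w_lm=0.8, w_dict=0.5):
    """Return the candidate word-model sequence with the best weighted score sum."""
    def total(h):
        # Weighted fusion of acoustic-model, language-model, and dictionary scores
        return w_am * h["am"] + w_lm * h["lm"] + w_dict * h["dict"]
    return max(hypotheses, key=total)

best = fuse_scores([
    {"text": "open the whiteboard",  "am": 0.9, "lm": 0.7, "dict": 0.8},
    {"text": "open the white bored", "am": 0.9, "lm": 0.2, "dict": 0.8},
])
print(best["text"])  # open the whiteboard
```

Here the language-model weight lets the fluent hypothesis win even though the acoustic scores tie.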
7. The method according to claim 1, characterized in that performing semantic analysis on the text information of the continuous voice signal to obtain the text instruction and the user intent specifically comprises:
performing morphological analysis on the text information to segment the text information into a plurality of words;
performing syntactic analysis on the plurality of words to determine the relationships among the plurality of words and generate a syntactic structure of the sentence;
obtaining the text instruction according to the syntactic structure;
performing intent analysis on the syntactic structure using machine learning to determine the user intent corresponding to the syntactic structure.
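Claim 7's pipeline would in practice use trained models; the rule-based stand-in below only illustrates the data flow from word segmentation to a text instruction and an intent label. The verb list and intent names are hypothetical:

```python
def parse_command(text):
    """Rule-based sketch of claim 7: segment words, find a verb + object,
    and map them to a display instruction and an inferred user intent."""
    tokens = text.lower().split()           # stand-in for morphological analysis
    verbs = {"show", "open", "display"}     # hypothetical command verbs
    verb = next((t for t in tokens if t in verbs), None)
    if verb is None:
        return {"instruction": None, "intent": "unknown"}
    obj = tokens[-1]                        # crude "syntactic" object slot
    return {"instruction": f"{verb} {obj}", "intent": f"view_{obj}"}

print(parse_command("please show the timetable"))
# {'instruction': 'show timetable', 'intent': 'view_timetable'}
```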
8. The method according to claim 1, characterized in that converting the response message to generate and output continuous speech specifically comprises:
converting the response message into continuous speech through a text-to-speech converter and outputting the continuous speech.
9. The method according to claim 2, characterized in that, after the method, the method further comprises:
receiving second voice authentication information of a second user;
performing feature extraction on the second voice authentication information to obtain second feature information;
matching the second feature information against a second template in the reference model library;
after the matching succeeds, processing the second feature information to determine second identity information of the second user, the second identity information comprising an identity ID and an identity grade of the second user;
comparing the identity grade of the second user with the identity grade of the first user;
when the identity grade of the second user is higher than the identity grade of the first user, unbinding the ID of the smart class board from the first identity information, and binding the ID of the smart class board to the second identity information.
10. The method according to claim 1, characterized in that the method further comprises:
generating a sleep signal or a screen-lock signal when the no-operation time reaches a preset time;
entering a dormant state or a screen-locked state according to the sleep signal or the screen-lock signal.
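The idle timer of claim 10 can be sketched as a simple watchdog check; the 300 s timeout and the signal name are assumptions, not values fixed by the claim:

```python
import time

def idle_watchdog(last_activity, timeout_s=300, now=None):
    """Return a sleep/lock signal once no-operation time reaches the preset limit."""
    now = time.monotonic() if now is None else now
    # Emit the signal only when the idle interval meets or exceeds the timeout
    return "sleep" if now - last_activity >= timeout_s else None

print(idle_watchdog(0.0, timeout_s=300, now=301.0))  # sleep
print(idle_watchdog(0.0, timeout_s=300, now=10.0))   # None
```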
CN201910577869.5A 2019-06-28 2019-06-28 Wisdom class board exchange method based on speech recognition Pending CN110310644A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910577869.5A CN110310644A (en) 2019-06-28 2019-06-28 Wisdom class board exchange method based on speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910577869.5A CN110310644A (en) 2019-06-28 2019-06-28 Wisdom class board exchange method based on speech recognition

Publications (1)

Publication Number Publication Date
CN110310644A true CN110310644A (en) 2019-10-08

Family

ID=68079256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910577869.5A Pending CN110310644A (en) 2019-06-28 2019-06-28 Wisdom class board exchange method based on speech recognition

Country Status (1)

Country Link
CN (1) CN110310644A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258285A1 (en) * 2005-02-25 2011-10-20 Lightningcast LLC. Inserting Branding Elements
CN105335398A (en) * 2014-07-18 2016-02-17 华为技术有限公司 Service recommendation method and terminal
CN107767875A (en) * 2017-10-17 2018-03-06 深圳市沃特沃德股份有限公司 Sound control method, device and terminal device
CN107895578A (en) * 2017-11-15 2018-04-10 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN108153881A (en) * 2017-12-26 2018-06-12 重庆大争科技有限公司 Teaching monitoring and managing method based on Intelligent campus management
CN108959520A (en) * 2018-06-28 2018-12-07 百度在线网络技术(北京)有限公司 Searching method, device, equipment and storage medium based on artificial intelligence
CN109036391A (en) * 2018-06-26 2018-12-18 华为技术有限公司 Audio recognition method, apparatus and system
CN109741746A (en) * 2019-01-31 2019-05-10 上海元趣信息技术有限公司 Robot personalizes interactive voice algorithm, emotion communication algorithm and robot

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110971875A (en) * 2019-12-04 2020-04-07 广州云蝶科技有限公司 Control method and device combining recording and broadcasting system and IP broadcasting system
CN110971875B (en) * 2019-12-04 2021-02-05 广州云蝶科技有限公司 Control method and device combining recording and broadcasting system and IP broadcasting system
CN111559675A (en) * 2020-05-22 2020-08-21 云知声智能科技股份有限公司 Method for controlling elevator by voice
CN111904806A (en) * 2020-07-30 2020-11-10 云知声智能科技股份有限公司 Blind guiding system

Similar Documents

Publication Publication Date Title
Darabkh et al. An efficient speech recognition system for arm‐disabled students based on isolated words
Ahsiah et al. Tajweed checking system to support recitation
Cao et al. [Retracted] Optimization of Intelligent English Pronunciation Training System Based on Android Platform
CN110310644A (en) Wisdom class board exchange method based on speech recognition
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
Laurinčiukaitė et al. Lithuanian Speech Corpus Liepa for development of human-computer interfaces working in voice recognition and synthesis mode
Wang et al. Research on correction method of spoken pronunciation accuracy of AI virtual English reading
Wang Detecting pronunciation errors in spoken English tests based on multifeature fusion algorithm
CN108364655A (en) Method of speech processing, medium, device and computing device
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
Habbash et al. Recognition of Arabic accents from English spoken speech using deep learning approach
Vaquero et al. E-inclusion technologies for the speech handicapped
CN118038851B (en) A multi-dialect speech recognition method, system, device and medium
Wang [Retracted] Research on Open Oral English Scoring System Based on Neural Network
Han et al. [Retracted] The Modular Design of an English Pronunciation Level Evaluation System Based on Machine Learning
Krug et al. Articulatory synthesis for data augmentation in phoneme recognition
Venkatagiri Speech recognition technology applications in communication disorders
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
Xu et al. Application of multimodal NLP instruction combined with speech recognition in oral english practice
Yue English spoken stress recognition based on natural language processing and endpoint detection algorithm
CN114283788A (en) Pronunciation evaluation method, training method, device and equipment of pronunciation evaluation system
Di Benedetto et al. Lexical Access Model for Italian--Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features
Junli Speech recognition and English corpus vocabulary learning based on endpoint detection algorithm
CN117423260B (en) Auxiliary teaching method based on classroom speech recognition and related equipment
Bao et al. [Retracted] An Auxiliary Teaching System for Spoken English Based on Speech Recognition Technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191008)