CN101859565A - System and method for realizing voice recognition on television

Info

Publication number: CN101859565A
Application number: CN201010198592A
Authority: CN
Inventors: 刘翰林; 赵新科
Original assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Current assignee: Shenzhen Skyworth RGB Electronics Co Ltd
Priority date: 2010-06-11
Filing date: 2010-06-11
Publication date: 2010-10-13

Abstract

The invention relates to the technical field of consumer electronics, and provides a system for realizing voice recognition on a television. The system comprises a voice input system, a data analysis system and a coordinating system for connecting the voice input system and the data analysis system, wherein the voice input system is used for sampling human voice, converting an analog signal into a digital signal and performing data caching to finish the preparation work of initial data; the data analysis system analyzes spectral characteristics according to human voice characteristics and is used for extracting effective voice, removing noise and further analyzing the voice content; and the coordinating system is mainly used for coordinating the work of the voice input system and the data analysis system. Voice recognition technology is introduced into the television; the operation of a remote controller of the television can be simplified by using the voice recognition technology; song order, movie on demand, channel selection and the like by voice can be easily realized; and the functions of the television are more abundant.

Description

A system and method for realizing voice recognition on a television

Technical field

The invention belongs to the technical field of consumer electronics, and in particular relates to a system and method for realizing voice recognition on a television.

Background technology

Speech recognition technology takes a human voice signal as input, converts it into a digital signal, and processes it by computer to recognize the human voice and its meaning and respond accordingly, for the purpose of interaction or communication. In other words, the machine converts the voice signal into corresponding text or commands through a process of recognition and understanding. Voice technology has not yet been widely adopted, however, because three of its key technologies are not mature enough. First, accented speech: the same language is pronounced differently in different regions, and voices vary from person to person, so compatible recognition is difficult to achieve. Second, environmental noise: noise makes recognition harder and hinders accurate data collection and analysis. Third, spoken language: speech often does not follow normal syntactic structure, and non-standard grammar and irregular word order make semantic analysis and understanding difficult. These three problems have long been the stumbling block to commercializing speech recognition. If they can be solved and speech recognition is introduced into the television, it can simplify the operation of the TV remote control and easily provide functions such as ordering songs by voice, movie on demand by voice, and voice channel selection. In a given scene the user only needs to say the name of a song or TV program, or a channel number, to replace remote-control operation, which makes the television's functions richer.

Summary of the invention

The purpose of the embodiments of the invention is to provide a system and method for realizing voice recognition on a television.

The embodiment of the invention is realized as follows: a system for realizing voice recognition on a television comprises a voice input system, a data analysis system, and a coordination system connecting the voice input system and the data analysis system. The voice input system is responsible for sampling the human voice, converting the analog signal into a digital signal, and caching the data, thereby completing the preparation of the initial data. The data analysis system analyzes spectral characteristics according to the characteristics of the human voice, is responsible for extracting effective voice and removing noise, and further analyzes the voice content and translates it into the best-matching intention of the speaker, so that the machine can recognize human speech. The coordination system mainly coordinates the work of the voice input system and the data analysis system.

A method for realizing voice recognition on a television comprises the following steps:

The voice input system samples the sound, converts it into a digital signal, and sends it to the data analysis system, where it is processed by the MCU of the data subsystem;

The data analysis system analyzes spectral characteristics according to the characteristics of the human voice, is responsible for extracting effective voice and removing noise, and further analyzes the voice content and translates it into the best-matching intention of the speaker, so that the machine can recognize human speech;

The coordination system coordinates the work of the voice input system and the data analysis system. It synchronizes the data acquisition of the voice input system with the analysis and processing of the data analysis system, and is divided into two thread modules: a recording thread and a data analysis thread.

Compared with the prior art, the invention introduces speech recognition technology into the television. Speech recognition can simplify the operation of the TV remote control and easily provide functions such as ordering songs by voice, movie on demand by voice, and voice channel selection. In a given scene the user only needs to say the name of a song or TV program, or a channel number, to replace remote-control operation, which makes the television's functions richer.

Description of drawings

Fig. 1 is a block diagram of the principle structure of the invention.

Fig. 2 is a flow chart of the data analysis performed by the data analysis system of the invention.

Fig. 3 is a workflow diagram of the coordination system of the invention.

Fig. 4 is a data processing flow chart of the coordination system of the invention.

Embodiment

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit it.

Fig. 1 shows the system of the invention for realizing voice recognition on a television, which comprises a voice input system, a data analysis system, and a coordination system connecting the voice input system and the data analysis system. The voice input system is responsible for sampling the human voice, converting the analog signal into a digital signal, and caching the data, thereby completing the preparation of the initial data. The data analysis system analyzes spectral characteristics according to the characteristics of the human voice, is responsible for extracting effective voice and removing noise, and further analyzes the voice content and translates it into the best-matching intention of the speaker, so that the machine can recognize human speech. The coordination system mainly coordinates the work of the voice input system and the data analysis system: for example, when to start voice input, when to send raw data to the data analysis system, how to perform recognition, when recognition has succeeded, and what operation to perform after success.

The voice input system mainly consists of a circuit made up of a microphone and an A/D converter. It samples the sound, converts it into a digital signal, and sends it to the data analysis system, where it is processed by the MCU of the data subsystem. The MCU performs feature extraction, cepstral mean subtraction, acoustic-layer recognition, acoustic-layer recognition post-processing, sound-to-text conversion, and word-graph search and pruning on the data obtained from the A/D converter, and then outputs control information to the corresponding control module.

The data analysis system is implemented by a speech recognition engine. Referring to Fig. 2, after the voice signal is converted into speech data, feature extraction is performed first: the speech data at effective frequencies, which best match the characteristics of human vocalization, is extracted according to the features of the speech data. Noise reduction is then performed using cepstral mean subtraction; this method is fast and efficient in real time, and meets the operating conditions of a television. Next come acoustic-layer recognition and recognition post-processing: the content of the speech is analyzed and recognized according to an existing acoustic model, which includes a series of speech parameters such as speaking rate, intonation, and timbre. Once the analysis is complete the spoken content is obtained, and a further sound-to-text conversion yields the text content of the speech. To match the text content to a command more accurately, the available command keywords are then compared with the converted words and phrases, and the best-matching command is taken as the recognition result and sent to the television for further processing. In summary, the data analysis system applies cepstral mean subtraction to denoise the extracted speech data, performs preliminary recognition according to the features of the sound, and then performs sound-to-text conversion to obtain the output result.
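The cepstral mean subtraction step named above can be illustrated with a minimal sketch. The patent does not say whether the mean is taken over a whole utterance or over a sliding window, so this example assumes the simplest variant: the mean of each cepstral coefficient over all frames currently in the buffer is subtracted from every frame. The function name `cepstral_mean_subtract` and the frame-by-frame memory layout are assumptions made for illustration and are not taken from the recognition engine.

```c
#include <assert.h>
#include <stddef.h>

/* Subtract, in place, the per-coefficient mean taken over all frames.
 * Layout (assumed): cep[t * n_coef + k] is coefficient k of frame t. */
void cepstral_mean_subtract(float *cep, size_t n_frames, size_t n_coef)
{
    assert(cep != NULL);
    if (n_frames == 0)
        return;                      /* nothing to normalise */

    for (size_t k = 0; k < n_coef; ++k) {
        double sum = 0.0;
        for (size_t t = 0; t < n_frames; ++t)
            sum += cep[t * n_coef + k];

        float mean = (float)(sum / (double)n_frames);
        for (size_t t = 0; t < n_frames; ++t)
            cep[t * n_coef + k] -= mean;
    }
}
```

A fixed channel response (microphone, room) shows up as a roughly constant offset in the cepstral domain, which is why removing the mean is a cheap form of channel and stationary-noise compensation. It costs two passes over the buffer, consistent with the text's remark that the method is fast.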

The coordination system synchronizes the data acquisition of the voice input system with the analysis and processing of the data analysis system. Referring to Fig. 3, it is divided into two thread modules: a recording thread and a data analysis thread.

The recording thread is started first; on startup it initializes the hardware device, allocates the memory the thread requires, and sets some initial parameters. After initialization it enters the thread loop. The loop first checks whether there is speech data input; if so, the recorded data is stored in this thread's buffer and the loop then checks whether recording has finished; if not, it checks directly whether recording has finished. If recording has not finished, the thread loop continues; if it has, the recording module is closed.

In the data analysis thread, the thread is started first and, after the system resources required for data analysis have been created and allocated, it enters the thread loop. The loop first checks whether the data buffer contains newly updated data; if so, it takes a segment of content from the data analysis area for analysis and then checks whether the analysis has produced a result; if not, it goes directly to checking whether the analysis has produced a result. If a result has been produced, the recording thread is stopped, data analysis is stopped, and the result is saved; otherwise the thread loop continues.

The invention adopts multithreading and buffer-recycling techniques. The two steps of recording and data analysis start at the same time. The recording thread runs independently to acquire audio data, stores the acquired data in the recording buffer blocks, and then appends the data to the data area to be analyzed by means of an interrupt. Note that the recording buffer is a ring of buffer blocks, for which a suitable number and size must be allocated; otherwise a block may be overwritten before its data has been appended to the data area to be analyzed, causing data loss. The thread then checks whether recording needs to stop and, if so, stops recording. The data analysis thread runs independently and processes the content of the buffer to be analyzed. Whenever the buffer holds content, a segment is taken from it for processing and analysis. Each time a segment is taken from the buffer, the data area to be analyzed is moved up by the corresponding size, and newly appended data is placed immediately after the end of the data area to be analyzed. In this way the recorded data is received and processed correctly, and data analysis stops once a recognition result is produced.
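The handling of the data area to be analyzed described above (append at the end, take from the front, move the remainder up) can be sketched as follows. This is an illustrative sketch only: the type `PendingArea`, the functions `pending_append` and `pending_take`, and the capacity `PENDING_CAPACITY` are hypothetical names, and the locking or interrupt masking that a real recording interrupt and analysis thread would need around these calls is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PENDING_CAPACITY 4096   /* assumed capacity, in samples */

typedef struct {
    short  data[PENDING_CAPACITY];
    size_t len;                 /* valid samples, always stored from index 0 */
} PendingArea;

/* Append up to n samples directly after the current end of the area.
 * Returns the number of samples actually accepted. */
size_t pending_append(PendingArea *p, const short *src, size_t n)
{
    assert(p != NULL && src != NULL);
    size_t room = PENDING_CAPACITY - p->len;
    if (n > room)
        n = room;
    memcpy(p->data + p->len, src, n * sizeof(short));
    p->len += n;
    return n;
}

/* Take up to n samples from the front, then move the remaining data up
 * by the same amount so the area again starts at index 0. */
size_t pending_take(PendingArea *p, short *dst, size_t n)
{
    assert(p != NULL && dst != NULL);
    if (n > p->len)
        n = p->len;
    memcpy(dst, p->data, n * sizeof(short));
    memmove(p->data, p->data + n, (p->len - n) * sizeof(short));
    p->len -= n;
    return n;
}
```

Moving the remainder up after every take keeps the append position simple (always `data + len`) at the cost of a copy per segment; a second ring buffer would avoid the copy but is not what the text describes.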

The coordination system of the invention is optimized in its software implementation. The recording data buffer is divided into blocks of a set size. When the acquired data fills a block, an interrupt is generated, and in this interrupt the filled block is moved and appended to the data buffer for speech data processing. Meanwhile the recording module can continue recording, storing the data it obtains in the next block. Because the blocks are used cyclically, the length of a recording is not limited. Since the recording thread and the data analysis thread start running together at the beginning of the whole operation, each data block is passed to the data analysis thread as soon as the recording thread obtains it. The data analysis thread can therefore analyze data before recording has finished, rather than waiting for recording to stop before analysis begins, which improves efficiency. Part of the key code is as follows:

// Create two worker threads
RockCreateThread(ProcGetProcGuid(GUID_EXE_VOICERECOG), AitalkWorkThread,
                 TPRI_LOW);
RockCreateThread(ProcGetProcGuid(GUID_EXE_VOICERECOG),
                 VoiceRecogHighTaskMsgCallBack, TPRI_HIGH);

// Cache the recording data cyclically, block by block
pRxBuf = &gAudioData.PCMdata[gAudioData.RxBufIndex * AUDIO_BUF_LEN];
DmaTransmit(AUDIO_DMACHANNEL,
            (UINT32)RegI2s_RXR,
            (UINT32)pRxBuf,
            (UINT32)gAudioData.nPCMlength,
            (UINT32)DmaI2sRecordCopy,
            (DMACallBack)Voice_RecISR);
if (++gAudioData.RxBufIndex >= AUDIO_BUF_NUM)
    gAudioData.RxBufIndex = 0;

// Append data to the data area to be analyzed
EsrAppendData(g_hESRObj, lpBuf, nSample);

// Take out data for analysis
EsrRunStep(g_hESRObj);

// Check whether a recognition result has been produced
iStatus = EsrGetResultParameterA(g_hESRObj, &pCmdID, &nSame, (ivCStrA)"song title");
if (pCmdID[0] > 0 && pCmdID[0] < nFileNum)
    return TRUE;
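The block-wise cyclic caching in the listing above can be modelled in portable C. The following is a minimal sketch under stated assumptions: a plain callback stands in for the block-full DMA interrupt, and the names `BlockRing`, `ring_write`, `count_handover`, `BLOCK_LEN`, and `BLOCK_NUM` are hypothetical and do not belong to the platform SDK used in the patent. Only the wrap-around of the block index mirrors the `RxBufIndex` logic shown above.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_LEN 4   /* samples per block (assumed for illustration) */
#define BLOCK_NUM 3   /* number of blocks in the ring (assumed) */

/* Called once per filled block; stands in for the block-full interrupt. */
typedef void (*BlockFullFn)(const short *block, size_t n, void *ctx);

typedef struct {
    short       pcm[BLOCK_NUM * BLOCK_LEN];
    size_t      block_index;   /* block currently being filled */
    size_t      fill;          /* samples already written to that block */
    BlockFullFn on_full;
    void       *ctx;
} BlockRing;

void ring_init(BlockRing *r, BlockFullFn on_full, void *ctx)
{
    assert(r != NULL && on_full != NULL);
    r->block_index = 0;
    r->fill = 0;
    r->on_full = on_full;
    r->ctx = ctx;
}

/* Feed n captured samples; hand over each block as soon as it is full. */
void ring_write(BlockRing *r, const short *src, size_t n)
{
    assert(r != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        short *block = &r->pcm[r->block_index * BLOCK_LEN];
        block[r->fill++] = src[i];
        if (r->fill == BLOCK_LEN) {
            r->on_full(block, BLOCK_LEN, r->ctx);
            r->fill = 0;
            if (++r->block_index >= BLOCK_NUM)
                r->block_index = 0;      /* wrap around to the first block */
        }
    }
}

/* Example handler: records how much data has been handed over. */
typedef struct {
    size_t blocks;
    size_t samples;
    short  last;       /* last sample of the most recent block */
} HandoverStats;

void count_handover(const short *block, size_t n, void *ctx)
{
    HandoverStats *s = (HandoverStats *)ctx;
    s->blocks += 1;
    s->samples += n;
    s->last = block[n - 1];
}
```

In a real system the handler would copy the block into the data area to be analyzed before the ring wraps back to it, which is why the text insists on choosing the number and size of blocks carefully.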

In addition, the coordination system is responsible for initializing the recording device, setting the speech parameters and scene, obtaining the speech processing result, and performing the operation that follows. Referring to Fig. 4, the work sequence of the coordination system is divided into two lines: the first is the recording-device operation path and the second is the speech-recognition processing path. Time increases from left to right. According to the processing steps and the corresponding points in time, the data processing flow is divided into three time periods: the recording and speech-processing setup period, the processing period, and the post-recognition period. The two lines must be synchronized across these three periods. In the setup period, the recording device is initialized, while the speech-recognition path creates an instance object, initializes the recognition scene and the lexicon (that is, the predefined command phrases), and issues the start-recognition command. In the processing period, recording is first started on the recording-device path and recording data is then collected; while collection proceeds, the speech-recognition path receives the collected data and attempts to produce a result. If a result is produced, the post-recognition period begins; otherwise collection and analysis continue. In the post-recognition period, recording is first stopped on the recording-device path, while the speech-recognition path first obtains the recognition result and then acts on it (that is, on the command phrase); for example, for "play xxx.mp3" it starts playing the music file xxx.mp3.
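The three time periods and the synchronization of the two paths can be pictured as a small state machine. The sketch below is only one possible reading of Fig. 4 as described in the text: the type `Coordinator`, the enum `Phase`, and the function `coordinator_step` are hypothetical, and the actual device and engine calls are replaced by two flags so that the transitions can be followed in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    PHASE_SETUP,             /* recording and speech-processing setup period */
    PHASE_PROCESSING,        /* processing period */
    PHASE_POST_RECOGNITION,  /* post-recognition period */
    PHASE_DONE
} Phase;

typedef struct {
    Phase phase;
    bool  recording;     /* state of the recording-device path */
    bool  recognizing;   /* state of the speech-recognition path */
    int   command_id;    /* 0 means no recognition result yet */
} Coordinator;

void coordinator_init(Coordinator *c)
{
    assert(c != NULL);
    c->phase = PHASE_SETUP;
    c->recording = false;
    c->recognizing = false;
    c->command_id = 0;
}

/* Advance by one step. result_id is what the recognizer reported for the
 * most recent data (0 = no result). Returns the phase after the step. */
Phase coordinator_step(Coordinator *c, int result_id)
{
    assert(c != NULL);
    switch (c->phase) {
    case PHASE_SETUP:
        /* Device initialized, lexicon and scene loaded: start recognition
         * of the scene, then start recording. */
        c->recognizing = true;
        c->recording = true;
        c->phase = PHASE_PROCESSING;
        break;
    case PHASE_PROCESSING:
        /* Keep collecting and analyzing until a result appears. */
        if (result_id > 0) {
            c->command_id = result_id;
            c->phase = PHASE_POST_RECOGNITION;
        }
        break;
    case PHASE_POST_RECOGNITION:
        /* Stop recording first, then the result can be acted upon. */
        c->recording = false;
        c->recognizing = false;
        c->phase = PHASE_DONE;
        break;
    case PHASE_DONE:
        break;
    }
    return c->phase;
}
```

Keeping both paths in one structure makes the requirement that the two lines stay synchronized explicit: recording can only be active while the phase is between setup and post-recognition.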

The starting and stopping of the recording device on the first route are governed by the recognition process on the second route. Once the recognition process has initialized its object and finished loading the lexicon and scene, recognition of a scene can be started, and recording is started at this point. While recording data is continuously input, the recognition route keeps processing it until recognition is terminated manually or a recognition result is produced, after which recording is stopped. The recognition result can then be obtained and the corresponding operation performed, such as playing music according to the song title. Part of the key code is as follows:

Recording route:

// Initialize the recording device
Codec_SetSampleRate(8000);              // set the sampling rate
CodecGainSet(5);                        // set the gain
SetAudioDataBuf(&gAudioData.pLeft, &gAudioData.pRight, &gAudioData.nPCMlength);  // set up the buffers
PMU_EnterModule(PMU_RECORDADPCM);
Codec_SetMode(Codec_MICAdc);            // select MIC input

// Start recording
I2sStart(I2S_Start_Rx);                 // configure the I2S interface
AudioInputBuffSwitch();                 // configure and start the DMA transfer; associate the buffer with the cache

// Collect recording data
FlushRecData();

// Stop recording
I2sStop();
HDMA_Stop(0);

Recognition route:

// Create the object
tUserSys.pWorkBuffer = WorkBuff;                                // allocate the data-analysis buffer
tUserSys.nWorkBufferBytes = USER_WORKBUFFER_BYTES;              // size of the data-analysis buffer
iStatus = EsrCreate(&g_hESRObj, &tUserSys, &tResPackDesc, 1);   // create the recognition engine

// Load the lexicon and scene
iStatus = EsrSetACP(g_hESRObj, ivESR_CP_GBK);
iStatus = EsrBeginLexiconA(g_hESRObj, (ivCStrA)"song title");
for (i = 0; i < nFileNum; i++)
{
    iStatus = EsrAddLexiconItemW(g_hESRObj,
                                 (ivCStrW)pFileUnit[i].LongFileName, i + 1);
}
iStatus = EsrEndLexicon(g_hESRObj);
iStatus = EsrBeginSceneA(g_hESRObj, (ivCStrA)"play scene");
iStatus = EsrAddSyntaxA(g_hESRObj, (ivCStrA)"open {song title}", 1);
iStatus = EsrEndScene(g_hESRObj);

// Start recognition of a scene
EsrStartA(g_hESRObj, (ivCStrA)"play scene");

// Process the recording data
EsrRunStep(g_hESRObj);

// Obtain the recognition result
iStatus = EsrGetResultParameterA(g_hESRObj, &pCmdID, &nSame, (ivCStrA)"song title");

// Play the music
PlayMusic(pCmdID[0] - 1);

Once the coordination system is complete, the whole process is essentially finished; all that remains is to define how voice recognition is triggered. Problems such as noise, accent recognition, and spoken-language recognition are addressed, and recognition accuracy is improved.

The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A system for realizing voice recognition on a television, characterized in that it comprises a voice input system, a data analysis system, and a coordination system connecting the voice input system and the data analysis system; wherein the voice input system is responsible for sampling the human voice, converting the analog signal into a digital signal, and caching the data, thereby completing the preparation of the initial data; the data analysis system analyzes spectral characteristics according to the characteristics of the human voice, is responsible for extracting effective voice and removing noise, and further analyzes the voice content and translates it into the best-matching intention of the speaker, so that the machine can recognize human speech; and the coordination system mainly coordinates the work of the voice input system and the data analysis system.

2. The system for realizing voice recognition on a television as claimed in claim 1, characterized in that the voice input system includes a microphone and an A/D converter for sampling sound; the microphone and the A/D converter sample the sound, convert it into a digital signal, and send it to the data analysis system, where it is processed by the MCU of the data subsystem.

3. The method for realizing voice recognition on a television as claimed in claim 1 or 2, characterized in that the data analysis system is implemented by a speech recognition engine; the MCU of the data subsystem performs feature extraction, cepstral mean subtraction, acoustic-layer recognition, acoustic-layer recognition post-processing, sound-to-text conversion, and word-graph search and pruning on the data obtained from the A/D converter, and then outputs control information to the control module.

4. The method for realizing voice recognition on a television as claimed in claim 3, characterized in that the coordination system synchronizes the data acquisition of the voice input system with the analysis and processing of the data analysis system, and is divided into two thread modules: a recording thread and a data analysis thread.

5. A method for realizing voice recognition on a television, characterized in that it comprises the following steps:

6. The method for realizing voice recognition on a television as claimed in claim 5, characterized in that the data analysis system is implemented by a speech recognition engine; after the voice signal is converted into speech data, feature extraction is performed first, extracting the speech data at effective frequencies according to the features of the speech data; noise reduction is then performed using cepstral mean subtraction; acoustic-layer recognition and recognition post-processing follow, in which the content of the speech is analyzed and recognized according to an existing acoustic model; once the analysis is complete the spoken content is obtained, and a further sound-to-text conversion yields the text content of the speech.

7. The method for realizing voice recognition on a television as claimed in claim 6, characterized in that, in the recording thread, the coordination system first starts the recording thread, initializes the hardware device, allocates the memory the thread requires, and sets some initial parameters; after initialization it enters the thread loop, checks whether there is speech data input, and stores the recorded data in this thread's buffer.

8. The method for realizing voice recognition on a television as claimed in claim 7, characterized in that, in the data analysis thread, the coordination system first starts the data analysis thread and, after the system resources required for data analysis have been created and allocated, enters the thread loop; it checks whether the data buffer contains newly updated data, takes a segment of content from the data analysis area for analysis, and then checks whether the analysis has produced a result; if a result has been produced, the recording thread is stopped, data analysis is stopped, and the result is saved.

9. The method for realizing voice recognition on a television as claimed in claim 8, characterized in that the recording thread runs independently to acquire audio data, stores the acquired data in the recording buffer blocks, and then appends the data to the data area to be analyzed by means of an interrupt; and the data analysis thread runs independently and processes the content of the buffer to be analyzed; whenever the buffer holds content, a segment is taken from it for processing and analysis; each time a segment is taken from the buffer, the data area to be analyzed is moved up by the corresponding size, and newly appended data is placed immediately after the end of the data area to be analyzed.

10. The method for realizing voice recognition on a television as claimed in claim 9, characterized in that, in the coordination system, the recording data buffer is divided into blocks of a set size; when the acquired data fills a block, an interrupt is generated, and in this interrupt the filled block is moved and appended to the data buffer for speech data processing; meanwhile the recording module can continue recording, storing the data it obtains in the next block, with the blocks used cyclically for storage and appending.