CN109994121A

CN109994121A - System, method and computer storage medium for eliminating audio crosstalk

Info

Publication number: CN109994121A
Application number: CN201711484010.7A
Authority: CN
Inventors: 薛彬; 曹晶皓; 刘礼; 余涛; 田彪
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2019-07-09

Abstract

The invention discloses a system, a method and a computer storage medium for eliminating audio crosstalk. The system comprises: an acoustic sensor for acquiring multiple voice signals; and a speech processing device, coupled with the acoustic sensor, for converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting crosstalk-free digital signals. With the embodiments of the invention, crosstalk signals can be eliminated when sound crosstalk occurs, which can improve the speech recognition rate in subsequent sound processing.

Description

System, method and computer storage medium for eliminating audio crosstalk

Technical field

The present invention relates to the field of acoustics, and in particular to a system, a method and a computer storage medium for eliminating audio crosstalk.

Background

In many application scenarios, such as teleconferences and large-scale events, sound can be acquired by sound sensors and then transmitted. Transmitting the corresponding sound information allows listeners who are far away to hear the speaker.

The acoustic environment is complex: while one sound sensor picks up a sound, another sound sensor may pick up the same sound. In this way, sound crosstalk occurs.

Summary of the invention

Embodiments of the invention provide a system, a method and a computer storage medium for eliminating audio crosstalk, which can eliminate crosstalk signals when sound crosstalk occurs and can thereby improve the speech recognition rate in subsequent sound processing.

A system for eliminating audio crosstalk comprises:

an acoustic sensor for acquiring multiple voice signals; and

a speech processing device, coupled with the acoustic sensor, for converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting crosstalk-free digital signals.

The system further comprises a server side coupled with the speech processing device.

The server side is configured to receive the crosstalk-free digital signals, perform data processing on them, and output the processed crosstalk-free digital signals.

The server side is located locally or in the cloud.

The speech processing device is specifically configured to perform crosstalk removal on the converted digital signals according to a speech parameter, the speech parameter being a parameter determined from the voice signals.

The speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

The speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

A system for eliminating audio crosstalk comprises:

an acoustic sensor for acquiring multiple voice signals;

a speech processing device, coupled with the acoustic sensor, for converting the voice signals into digital signals; and

a server side, coupled with the speech processing device, for receiving the digital signals, performing crosstalk removal on the digital signals to obtain crosstalk-free digital signals, performing data processing on the crosstalk-free digital signals, and outputting the processed crosstalk-free digital signals.

The server side is located locally or in the cloud.

The server side is specifically configured to perform crosstalk removal on the digital signals according to a speech parameter to obtain the crosstalk-free digital signals, the speech parameter being a parameter determined from the voice signals.

A method for eliminating audio crosstalk using the above system comprises:

acquiring, by the acoustic sensor, multiple voice signals; and

converting, by the speech processing device, the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting crosstalk-free digital signals.

The method further comprises:

receiving, by the server side, the crosstalk-free digital signals, performing data processing on them, and outputting the processed crosstalk-free digital signals.

The server side is located locally or in the cloud.

The step in which the speech processing device converts the voice signals into digital signals, performs crosstalk removal on the converted digital signals, and outputs crosstalk-free digital signals comprises:

converting, by the speech processing device, the voice signals into digital signals, performing crosstalk removal on the converted digital signals according to a speech parameter, and outputting crosstalk-free digital signals, the speech parameter being a parameter determined from the voice signals.

The speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

A method for eliminating audio crosstalk using the above system comprises:

acquiring, by the acoustic sensor, multiple voice signals;

converting, by the speech processing device, the voice signals into digital signals; and

receiving, by the server side, the digital signals, performing crosstalk removal on the digital signals to obtain crosstalk-free digital signals, performing data processing on the crosstalk-free digital signals, and outputting the processed crosstalk-free digital signals.

The server side is located locally or in the cloud.

The step in which the server side receives the digital signals, performs crosstalk removal to obtain crosstalk-free digital signals, performs data processing on the crosstalk-free digital signals, and outputs the processed crosstalk-free digital signals comprises:

receiving, by the server side, the digital signals, performing crosstalk removal on the digital signals according to a speech parameter to obtain crosstalk-free digital signals, performing data processing on the crosstalk-free digital signals, and outputting the processed crosstalk-free digital signals, the speech parameter being a parameter determined from the voice signals.

A computer storage medium stores computer program instructions; when the computer program instructions are executed by a processor, the above method for eliminating audio crosstalk is implemented.

As can be seen from the above technical solution, the acoustic sensor feeds the multiple acquired voice signals into the speech processing device. The speech processing device, coupled with the acoustic sensor, converts the voice signals into digital signals, performs crosstalk removal on the converted digital signals, and outputs crosstalk-free digital signals. Therefore, when sound crosstalk occurs, the system for eliminating audio crosstalk can eliminate the crosstalk signals, which in turn can improve the speech recognition rate in subsequent sound processing.

Brief description of the drawings

The invention may be better understood from the following description of specific embodiments taken in conjunction with the accompanying drawings, in which the same or similar reference numerals denote the same or similar features.

Fig. 1 is a schematic diagram of a formal conference scene in an embodiment of the invention;

Fig. 2 is a schematic diagram of the overall structure of the system for eliminating audio crosstalk in a first embodiment of the invention;

Fig. 3 is a schematic structural diagram of the system for eliminating audio crosstalk in a second embodiment of the invention;

Fig. 4 is a schematic structural diagram of the system for eliminating audio crosstalk in a third embodiment of the invention;

Fig. 5 is a schematic structural diagram of the system for eliminating audio crosstalk in a fourth embodiment of the invention;

Fig. 6 is a schematic structural diagram of the system for eliminating audio crosstalk in a fifth embodiment of the invention;

Fig. 7 is a schematic structural diagram of the system for eliminating audio crosstalk in a sixth embodiment of the invention;

Fig. 8 is a schematic flowchart of the method for eliminating audio crosstalk in one embodiment of the invention;

Fig. 9 is a schematic flowchart of the method for eliminating audio crosstalk in another embodiment of the invention;

Fig. 10 is a structural diagram of an exemplary hardware architecture of a computing device for the method and control components for eliminating audio crosstalk according to an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

At a formal conference or a large-scale event, the venue is relatively large, so a speaker needs an acoustic sensor to carry his or her speech, allowing listeners far from the speaker to hear it clearly.

However, the acoustic environment at such a venue is complex: while one speaker is speaking through an acoustic sensor, other speakers may be speaking at the same time. Usually an acoustic sensor is placed in front of each speaker, so each acoustic sensor may acquire the voice signals of multiple speakers.

Referring to Fig. 1, Fig. 1 is a schematic diagram of a formal conference scene in an embodiment of the invention. At the conference, the participants form two groups seated on the two sides of the conference table. The participants on one side include speaker A1, speaker A2 and speaker A3; the participants on the other side include speaker B1, speaker B2 and speaker B3.

In Fig. 1, as an example, the acoustic sensor can be a microphone. A microphone is placed in front of each speaker. In this way, every participant at the conference can clearly hear the speaker through the loudspeakers installed at the venue.

When only one speaker is speaking, multiple microphones acquire that speaker's voice signal simultaneously, so every participant can clearly hear the speech. Since the speech needs to be recorded, as an example, the speaker's voice signal can be recorded and data processing performed on the recording so that the content of the speech is recorded clearly and completely.

When multiple speakers speak, that is, when the number of speakers is greater than one, multiple microphones simultaneously acquire the voice signals of multiple speakers. For each microphone, the acquired voice signals are clearly subject to crosstalk, and the cocktail party problem arises.

The cocktail party problem is as follows: current speech recognition technology can recognize the content of one person's speech with high accuracy, but when two or more people are talking, the speech recognition rate drops sharply.

That is, if several people speak at the same moment while the speech content is being recorded, speech recognition must be performed on the content of multiple speakers. Because more than one speaker is speaking at the same time, audio crosstalk occurs and the speech recognition rate drops significantly.

To eliminate audio crosstalk, a system for eliminating audio crosstalk can be used. Referring to Fig. 2, Fig. 2 is a schematic structural diagram of the system for eliminating audio crosstalk in a first embodiment of the invention.

The system for eliminating audio crosstalk may include multiple acoustic sensors and a speech processing device, with each acoustic sensor coupled to the speech processing device.

An acoustic sensor is a sensor that senses an acoustic quantity and converts it into an output signal. Acoustic sensors may include sound pressure sensors, noise sensors, ultrasonic sensors and microphones. The embodiments of the invention are described taking a microphone as the acoustic sensor.

When crosstalk occurs in the voice signals acquired by the microphones, the speech processing device is needed to eliminate the crosstalk in the voice signals. That is, the speech processing device can be used to eliminate crosstalk in the voice signals. For portability, the speech processing device can be a laptop or another easily carried device.

At the formal conference shown in Fig. 1, the microphone in front of a speaker may acquire the voice signals of multiple speakers. The microphone can feed the acquired voice signals of the multiple speakers into the speech processing device coupled to it.

When the speech processing device receives the voice signal of one speaker from the microphone, it can convert that speaker's voice signal into a single-channel digital signal; when it receives the voice signals of multiple speakers, it can convert them into multi-channel digital signals.

As an example, the speech processing device receives the voice signals of three speakers: speaker A1, speaker A2 and speaker B3. The speech processing device can convert the voice signal of speaker A1 into a first-channel digital signal, the voice signal of speaker A2 into a second-channel digital signal, and the voice signal of speaker B3 into a third-channel digital signal.
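The per-speaker conversion described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the analog signal has already been sampled into floats in the range [-1.0, 1.0] and quantizes each speaker's samples into its own 16-bit channel. The function names and the 16-bit depth are assumptions made for the example.

```python
def quantize_pcm16(analog_samples):
    """Map floats in [-1.0, 1.0] to signed 16-bit integers (assumed bit depth)."""
    pcm = []
    for x in analog_samples:
        x = max(-1.0, min(1.0, x))        # clamp out-of-range input
        pcm.append(int(round(x * 32767)))
    return pcm

def digitize_channels(analog_by_speaker):
    """Produce one digital channel per speaker, keyed by a hypothetical speaker label."""
    return {name: quantize_pcm16(samples) for name, samples in analog_by_speaker.items()}

# Three speakers yield a three-channel digital signal.
analog = {
    "A1": [0.0, 1.0, -1.0],
    "A2": [0.25],
    "B3": [1.5],                          # out of range, will be clamped
}
channels = digitize_channels(analog)
```

Keeping one channel per speaker lets the later crosstalk-removal step treat each channel separately.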

After converting the received voice signals into digital signals, the speech processing device can also perform crosstalk removal on the converted digital signals and output crosstalk-free digital signals.

As an example, the speech processing device receives the voice signals of three speakers and converts each into a digital signal, obtaining a first-channel digital signal, a second-channel digital signal and a third-channel digital signal. Because more than one speaker is speaking at the same time, audio crosstalk occurs. The speech processing device performs crosstalk removal on the first-channel, second-channel and third-channel digital signals respectively and finally outputs crosstalk-free digital signals.

In one embodiment of the invention, the speech processing device can perform crosstalk removal on the converted digital signals according to a speech parameter, the speech parameter being a parameter determined from the voice signals.

Specifically, referring to Fig. 3, microphone A1 is located in front of speaker A1, microphone B1 is located in front of speaker B1, and speaker A1 and speaker B1 speak at the same time.

The microphones acquire the speakers' voice signals: microphone A1 simultaneously acquires the speech of speaker A1 and the speech of speaker B1, and microphone B1 simultaneously acquires the speech of speaker A1 and the speech of speaker B1.

Because microphone A1 is close to speaker A1 and farther from speaker B1, the speaker corresponding to the maximum sound intensity at microphone A1 is speaker A1. Correspondingly, the speaker corresponding to the maximum sound intensity at microphone B1 is speaker B1.

In one case, the sound frequency of the maximum-intensity sound wave acquired by a microphone can be used as the speech parameter. That is, for microphone A1, the speech parameter corresponding to microphone A1 is determined from the voice signal of speaker A1 and can be the sound frequency of speaker A1's voice signal; for microphone B1, the speech parameter corresponding to microphone B1 is determined from the voice signal of speaker B1 and can be the sound frequency of speaker B1's voice signal.

In this way, each microphone retains the voice signal corresponding to the sound frequency of its maximum-intensity sound wave, so the audio crosstalk from the other, non-maximum-intensity sound waves can be eliminated.

In another case, the sound frequency of a non-maximum-intensity sound wave acquired by a microphone can be used as the speech parameter. That is, for microphone A1, the speech parameter corresponding to microphone A1 is determined from the voice signals of the speakers other than speaker A1 and can be the sound frequency of speaker B1's voice signal; for microphone B1, the speech parameter corresponding to microphone B1 is determined from the voice signals other than that of speaker B1 and can be the sound frequency of speaker A1's voice signal.

In this way, the voice signal corresponding to the sound frequency of the non-maximum-intensity sound waves acquired by each microphone is identified, so the audio crosstalk from those non-maximum-intensity sound waves can be eliminated.
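The text does not specify how the speech parameter is applied, so the following is only one possible reading, sketched under strong simplifying assumptions: each speaker is modelled as a single pure tone, the "sound frequency" is a bin of a naive discrete Fourier transform, and crosstalk removal is a frequency mask. All function names are hypothetical. Real speech occupies overlapping frequency bands, so a practical system would need a far more elaborate separation method.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); adequate for short illustrative signals."""
    n = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
            for b in range(n)]

def idft(spectrum):
    """Inverse of dft(); returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[b] * cmath.exp(2j * math.pi * b * k / n) for b in range(n)) / n).real
            for k in range(n)]

def dominant_bin(samples):
    """Frequency bin of the maximum-intensity component, ignoring the DC bin."""
    spectrum = dft(samples)
    return max(range(1, len(samples) // 2), key=lambda b: abs(spectrum[b]))

def filter_bins(samples, center, width, keep):
    """keep=True retains only bins near `center`; keep=False removes those bins."""
    spectrum = dft(samples)
    n = len(spectrum)
    out = []
    for b in range(n):
        # Distance to the bin itself or to its mirrored negative-frequency twin.
        near = min(abs(b - center), abs(b - (n - center))) <= width
        out.append(spectrum[b] if near == keep else 0j)
    return idft(out)

N = 64
voice_a1 = [math.sin(2 * math.pi * 5 * k / N) for k in range(N)]         # speaker A1 (loud, bin 5)
voice_b1 = [0.3 * math.sin(2 * math.pi * 12 * k / N) for k in range(N)]  # crosstalk from B1 (bin 12)
mic_a1 = [a + b for a, b in zip(voice_a1, voice_b1)]

# Case 1: the parameter is the frequency of the maximum-intensity wave; keep it.
peak = dominant_bin(mic_a1)
case1 = filter_bins(mic_a1, peak, width=1, keep=True)

# Case 2: the parameter is the frequency of the non-maximum-intensity wave; remove it.
case2 = filter_bins(mic_a1, 12, width=1, keep=False)
```

In this toy setting both cases recover speaker A1's tone from microphone A1's mixture; they differ only in whether the mask is built from the wanted or the unwanted frequency.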

Referring to Fig. 4, Fig. 4 is a schematic structural diagram of the system for eliminating audio crosstalk in a third embodiment of the invention. Fig. 4 includes a conference phone, speaker 1 and speaker 2. The conference phone includes an acoustic sensor and a speech processing device; that is, the acoustic sensor and the speech processing device are integrated in the conference phone.

The acoustic sensor in the conference phone acquires the voice signals of speaker 1 and speaker 2. The speech processing device in the conference phone converts the acquired voice signals into digital signals and performs crosstalk removal on the converted digital signals to output crosstalk-free digital signals. The speech processing device can perform the crosstalk removal according to a speech parameter; for details, see the embodiment corresponding to Fig. 3.

Referring to Fig. 5, Fig. 5 is a schematic structural diagram of the system for eliminating audio crosstalk in a fourth embodiment of the invention. The system for eliminating audio crosstalk includes a microphone, a speech processing device and a server side.

The speech processing device sends the crosstalk-free digital signals to the server side. The server side can perform data processing on the received crosstalk-free digital signals and output the processed crosstalk-free digital signals. As one example, the server side performs speech recognition on the received crosstalk-free digital signals and can convert them into text. As another example, the server side performs speech analysis on the received crosstalk-free digital signals, determines the actual intent of the speech, and controls other devices or systems to perform corresponding operations according to that intent.

The crosstalk-free digital signals may contain a large amount of data to be processed, which a single server-side computer may struggle to handle. The server side can therefore also be located in the cloud, so that multiple computers process the data simultaneously, which can greatly improve the efficiency of the server side. The computers can be organized as a centralized processing system or as a distributed processing system.

Referring to Fig. 6, Fig. 6 is a schematic structural diagram of the system for eliminating audio crosstalk in a fifth embodiment of the invention. The system for eliminating audio crosstalk includes multiple acoustic sensors, a speech processing device and a server side. The acoustic sensors are coupled with the speech processing device, and the speech processing device is coupled with the server side.

The acoustic sensors acquire multiple voice signals. The description below takes a microphone as the acoustic sensor.

The speech processing device can convert the received voice signals into digital signals. When crosstalk occurs in the voice signals acquired by the microphones, the server side is needed to eliminate the crosstalk. That is, in this embodiment the crosstalk in the voice signals is eliminated by the server side.

In general, the processing capacity of the server side far exceeds that of the speech processing device. The server side can therefore receive the digital signals output by the speech processing device, perform crosstalk removal on them to obtain crosstalk-free digital signals, perform data processing on the crosstalk-free digital signals, and output the processed crosstalk-free digital signals. As an example, the data processing can be speech recognition or speech analysis. The server side can perform the crosstalk removal on the converted digital signals according to a speech parameter. As one example, the speech parameter can be determined from the sound wave of maximum sound intensity acquired by the acoustic sensor. As another example, the speech parameter can be determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

Depending on the amount of data to be processed, when a locally located server side can handle it, the data processing can be performed by the local server side.

When the amount of data to be processed is very large, a server side located in the cloud can be used for data processing. Referring to Fig. 7, Fig. 7 is a schematic structural diagram of the system for eliminating audio crosstalk in a sixth embodiment of the invention, in which the server side is located in the cloud. Multiple computers can then process the data simultaneously, which can greatly improve the efficiency of the server side. The computers can be organized as a centralized processing system or as a distributed processing system.

Referring to Fig. 8, Fig. 8 is a schematic flowchart of the method for eliminating audio crosstalk in one embodiment of the invention, which specifically includes the following steps:

S801: the acoustic sensor acquires multiple voice signals.

The acoustic sensor may specifically be a microphone; that is, the microphone acquires the voice signals of multiple speakers.

S802: the speech processing device converts the voice signals into digital signals, performs crosstalk removal on the converted digital signals, and outputs crosstalk-free digital signals.

Given the considerable advantages of digital signals in storage, transmission and processing, the speech processing device can convert the voice signals into digital signals. To ensure that the digital signals are usable, they can also be filtered.

Crosstalk removal is then performed on the converted digital signals, and crosstalk-free digital signals are output.

In one embodiment of the invention, the server side receives the crosstalk-free digital signals output by the speech processing device, performs data processing on them, and outputs the processed crosstalk-free digital signals.

The server side can perform data processing on the received crosstalk-free digital signals. For example, the data processing may include speech recognition, so that the speech can be converted into text.

The server side can be located locally or in the cloud. When the server side is in the cloud, multiple computers process the data simultaneously to improve the efficiency of the server side.

In one embodiment of the invention, the speech processing device converts the voice signals into digital signals, performs crosstalk removal on the converted digital signals according to a speech parameter, and outputs crosstalk-free digital signals, the speech parameter being a parameter determined from the voice signals.

In one embodiment of the invention, the speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

In one embodiment of the invention, the speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

In addition, in one embodiment of the invention, the speech processing device can also detect voice quality.

The purpose of detecting voice quality is to provide qualified speech data for subsequent automatic speech recognition (ASR). Whether the detected voice signal quality meets the requirements can be indicated by prompt information. For example, if the detected voice signal quality does not meet the requirements, the speaker can be notified by an alarm, such as a red light. On seeing the red light, the speaker can repeat what was just said.

The speech processing device can also receive detection parameters for the voice signals and perform the detection using those parameters. The detection parameters are obtained through research and refinement in actual applications. As an example, the detection parameters may include at least one of the following: a short-term smoothing factor, a long-term smoothing factor, a time window, preset thresholds, and the signal-to-noise ratio (SNR).

The use of the detection parameters with the voice signal is described below. Speech is generally divided into silent segments, unvoiced segments and voiced segments. Voiced sound is generally regarded as a train of skewed triangular pulses whose period is the pitch period, and unvoiced sound is modelled as random white noise. Because a voice signal is a non-stationary process, it cannot be analysed with signal processing techniques designed for stationary signals. However, owing to the nature of speech, its characteristics can be regarded as quasi-stationary over a short time range (for example 10 to 30 ms, or even shorter); that is, the voice signal is short-term stationary. This short-term stationarity can therefore be exploited, for example by using windowing and framing to divide the input voice signal into multiple speech frames.

As an example, framing intercepts the input voice signal with a window function of finite length to form speech frames; the window function sets the sampling points outside the region to be processed to zero, yielding the current speech frame. Although framing can divide the input voice signal into contiguous segments, overlapping segmentation is more commonly used: consecutive frames share a common overlapping portion, referred to here as the frame shift, which gives a smooth transition between frames and preserves continuity.
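Overlapped framing with a window function can be sketched as below. This is a minimal illustration rather than the patented procedure: the Hamming window, the 400-sample frame length and the 160-sample step between frame starts (named `hop` here) are assumptions chosen for the example, so consecutive frames overlap by `frame_len - hop` samples.

```python
import math

def hamming(n):
    """Hamming window of length n (n > 1); one common choice of window function."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def split_frames(samples, frame_len, hop):
    """Cut `samples` into windowed frames whose starts are `hop` samples apart.

    Consecutive frames overlap by frame_len - hop samples; a trailing
    remainder shorter than frame_len is dropped.
    """
    if not 0 < hop <= frame_len:
        raise ValueError("hop must be in (0, frame_len]")
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames

# 1000 samples, 400-sample frames starting every 160 samples:
# frame starts are 0, 160, 320 and 480.
signal = [1.0] * 1000
frames = split_frames(signal, frame_len=400, hop=160)
```

At an assumed 16 kHz sampling rate these values would correspond to 25 ms frames starting every 10 ms, which falls within the 10 to 30 ms quasi-stationary range mentioned above.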

In speech processing, the unvoiced/voiced decision is a very important step, and its accuracy strongly affects subsequent processing. The energy of a voice signal varies markedly over time, and the energy of unvoiced portions is much smaller than that of voiced portions. The short-time energy of a voice signal is an important parameter characterizing its time-domain features, and unvoiced and voiced sound can be distinguished on the basis of it. Furthermore, because the energy of silent segments is much smaller than that of segments containing sound, this property can be used to detect the sound-bearing and silent segments of a voice signal. The short-time energy can also be used to locate the boundary between initials and finals and the boundaries of liaison, where liaison means that there is no gap between words.
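A short-time-energy classifier along these lines might look like the following sketch. The sum-of-squares energy definition is standard, but the two thresholds and the three labels are illustrative assumptions; the text does not give threshold values, and practical detectors usually combine energy with other features such as the zero-crossing rate.

```python
def short_time_energy(frame):
    """Sum of squared samples in one speech frame."""
    return sum(x * x for x in frame)

def classify_frame(frame, silence_threshold, voiced_threshold):
    """Label a frame by energy alone (hypothetical thresholds)."""
    energy = short_time_energy(frame)
    if energy < silence_threshold:
        return "silent"
    if energy < voiced_threshold:
        return "unvoiced"
    return "voiced"

labels = [
    classify_frame([0.0, 0.0, 0.0, 0.0], 0.01, 1.0),
    classify_frame([0.1, -0.1, 0.1, -0.1], 0.01, 1.0),
    classify_frame([1.0, -1.0, 1.0, -1.0], 0.01, 1.0),
]
```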

Long-term energy is the value that represents the long-term trend of the energy. As an example, when there are multiple audio acquisition devices, speech input may reach other acquisition devices because of sound reflection, the placement of external loudspeakers and similar factors, causing misrecognition and affecting the final speech recognition result. By comparing long-term energies, the true source of the speech input can be identified and the misrecognition eliminated.
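The long-term energy comparison can be sketched as follows. As a simplifying assumption, long-term energy is taken here to be the mean of per-frame energies over an observation window, and the channel labels are hypothetical; the text leaves the exact comparison rule open.

```python
def long_term_energy(frame_energies):
    """Mean frame energy over the observation window (assumed definition)."""
    return sum(frame_energies) / len(frame_energies)

def true_input_channel(energies_by_channel):
    """Return the channel whose long-term energy is highest."""
    return max(energies_by_channel,
               key=lambda ch: long_term_energy(energies_by_channel[ch]))

# mic_B1 shows one brief spike (for example a reflection), but mic_A1
# is consistently louder over the whole window.
energies = {
    "mic_A1": [0.9, 0.8, 1.0],
    "mic_B1": [0.2, 1.1, 0.1],
}
source = true_input_channel(energies)
```

Comparing only the second frame would wrongly favour `mic_B1`; comparing the long-term trend selects `mic_A1`.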

In embodiments of the invention, all values of the voice signal need to be taken into account, so different weights are assigned at different moments to bring the predicted energy closer to the actually observed value.

$$p(t) = \alpha\, p(t-1) + (1-\alpha)\, p_x(t) \tag{1}$$

$p(t)$ is the energy of the speech frame at time $t$, $p_x(t)$ is the average energy of the speech frame at time $t$, and $\alpha$ is the smoothing factor, with $0 < \alpha < 1$. Here $p(0) = 0$, and $p(t)$ and $p_x(t)$ can be set according to actual conditions.

To make formula (1) respond sensitively to changes in energy, that is, for $p(t)$ to be the short-time energy of the speech frame at time $t$, $\alpha$ should take a larger value, close to 1; $\alpha$ is then called the short-term smoothing factor.

If the energy is required to represent the long-term trend, that is, for $p(t)$ to be the long-term energy of the speech frame at time $t$, $\alpha$ should take a smaller value, close to 0; $\alpha$ is then called the long-term smoothing factor.
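The recursion in formula (1) can be sketched as below. The sketch implements the formula literally, so the new frame's average energy receives weight 1 − α and the previous estimate receives weight α; if the intended convention instead places α on the new frame's term, the two weights swap. The function name is an assumption made for the example.

```python
def smooth_energy(frame_energies, alpha):
    """Apply p(t) = alpha * p(t-1) + (1 - alpha) * px(t) with p(0) = 0.

    `frame_energies` holds px(1), px(2), ...; the result holds p(1), p(2), ...
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    p = 0.0
    smoothed = []
    for px in frame_energies:
        p = alpha * p + (1.0 - alpha) * px
        smoothed.append(p)
    return smoothed

# A step from silence to constant energy 1.0, smoothed with alpha = 0.5.
trace = smooth_energy([1.0, 1.0], alpha=0.5)
```

Running the same input with two different smoothing factors yields a fast-tracking and a slow-tracking estimate, which is how a short-term and a long-term energy could be maintained side by side.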

During speech processing, a certain voice quality must be guaranteed, so the quality parameter of the voice signal is required to lie within a certain range. When the quality parameter of the voice signal is better or worse than this target range, its influence on speech processing is no longer significant. Clipping therefore reduces the weight, in the calculation of the effective quality parameter, of voice-signal quality parameters that have little influence on speech processing, while ensuring that the resulting effective quality parameter reflects the actual voice signal quality.

To make voice quality detection in the embodiments of the invention more accurate, as an example, a time window and preset thresholds are set in advance. The preset thresholds include a maximum preset threshold and a minimum preset threshold. As an example, the absolute value of the maximum preset threshold can equal the absolute value of the minimum preset threshold.

Within the preset time window, the amplitude values of the speech frames can be clipped based on the preset maximum threshold and the preset minimum threshold, so that the quality parameters of the clipped speech frames lie within the effective working range.

More specifically, the amplitude values of the speech frames are clipped based on the preset maximum threshold and the preset minimum threshold as follows: amplitude values between the preset minimum threshold and the preset maximum threshold are left unchanged, amplitude values greater than the preset maximum threshold are changed to the preset maximum threshold, and amplitude values less than the preset minimum threshold are changed to the preset minimum threshold.
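The clipping rule just described can be sketched in a few lines. The threshold values of ±1.0 are illustrative assumptions that follow the example in which the two thresholds have equal absolute values.

```python
def clip_frame(frame, min_threshold, max_threshold):
    """Clamp each amplitude value into [min_threshold, max_threshold]."""
    if min_threshold > max_threshold:
        raise ValueError("min_threshold must not exceed max_threshold")
    return [min(max(x, min_threshold), max_threshold) for x in frame]

# Values inside the range are unchanged; values outside are replaced
# by the nearest threshold.
clipped = clip_frame([-2.0, -0.5, 0.0, 0.7, 3.0], -1.0, 1.0)
```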

After the quality parameter of a speech frame has been detected, prompt information based on the voice quality needs to be output. A speech frame is typically only milliseconds (ms) long, whereas a speaker's voice signal lasts from seconds (s) up to tens of minutes. Outputting voice quality prompts frame by frame would therefore trouble the speaker. For example, a prompt would be displayed after the speaker has said only a few words; prompts this frequent would repeatedly interrupt the speech.

Therefore, the voice quality of a time period can be determined from the quality parameters of the speech frames within that time period. The time period can be set according to actual conditions, so that prompt information on voice quality is fed back in a timely manner according to the actual duration of the speaker's voice signal.

In one embodiment of the invention, the quality parameters of the speech frames within the time period can be counted to determine the voice quality of the time period. As an example, the time period is set to 60 seconds, and a quality threshold and a qualified-speech-frame ratio are set. The set time period contains 6000 speech frames in total. The quality parameter of each speech frame is compared with the quality threshold, and a speech frame whose quality parameter is greater than the quality threshold is a qualified speech frame. If the proportion of qualified speech frames among the 6000 speech frames is greater than or equal to the qualified-speech-frame ratio, the voice quality of the time period is determined to be qualified; if the proportion is less than the qualified-speech-frame ratio, the voice quality of the time period is determined to be unqualified. The voice acquisition device or the speech processing device can then output prompt information based on the voice quality of the time period.
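The per-period decision in this example can be sketched as below. The 6000-frame count comes from the example above; the quality threshold of 0.5, the qualified ratio of 0.7, the frame quality values, and the function name `period_quality_is_qualified` are hypothetical values chosen only for illustration.

```python
def period_quality_is_qualified(frame_quality, quality_threshold, qualified_ratio):
    """Decide the voice quality of one time period.

    A frame is qualified when its quality parameter is greater than
    quality_threshold. The period is qualified when the share of
    qualified frames is greater than or equal to qualified_ratio.
    """
    if not frame_quality:
        raise ValueError("the time period contains no speech frames")
    qualified = sum(1 for q in frame_quality if q > quality_threshold)
    return qualified / len(frame_quality) >= qualified_ratio


# 60-second period with 6000 frames: 4500 good frames, 1500 poor frames.
frames = [0.8] * 4500 + [0.2] * 1500
result = period_quality_is_qualified(frames, quality_threshold=0.5, qualified_ratio=0.7)
# 4500 / 6000 = 0.75, which is at least 0.7, so the period is qualified.
```

Deciding once per period rather than once per frame is what keeps the prompt from interrupting the speaker every few words.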

The signal-to-noise ratio (SNR) is the ratio of the output signal voltage to the noise voltage output at the same time, and is usually expressed in decibels. Based on the SNR, it can be determined whether the voice signal needs corresponding processing. For example, the SNR can indicate whether the voice signal contains a noise component; if a noise component is present, noise reduction must be applied to the voice signal.
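As a small worked example, a voltage ratio is converted to decibels with 20·log10(V_signal / V_noise). The sketch below applies that standard conversion; the helper `needs_noise_reduction` and its threshold are assumptions added for illustration, since the text does not state how the presence of a noise component is decided.

```python
import math


def snr_db(signal_voltage, noise_voltage):
    """Return the SNR in decibels for a voltage ratio: 20 * log10(Vs / Vn)."""
    if signal_voltage <= 0 or noise_voltage <= 0:
        raise ValueError("voltages must be positive")
    return 20.0 * math.log10(signal_voltage / noise_voltage)


def needs_noise_reduction(snr, threshold_db):
    """Hypothetical rule: treat an SNR below threshold_db as 'noise present'."""
    return snr < threshold_db


# A signal ten times the noise voltage corresponds to about 20 dB.
example_snr = snr_db(signal_voltage=1.0, noise_voltage=0.1)
```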

In one embodiment of the invention, the SNR of the voice signal can be calculated within the time period. The voice quality of the time period is then determined from the SNR of the voice signal and the quality parameters of the speech frames within the time period. As an example, if the SNR of the voice signal is greater than the noise threshold and the quality parameters of the speech frames within the time period do not meet the requirement, the voice quality of the time period is determined to be unqualified. When the speaker repeats the speech, the voice quality can be improved in both respects, SNR and quality parameter.

In addition, after the prompt information on voice quality has been output, if the detected voice signal quality meets the requirement, that is, the voice signal is qualified, the speech frames whose voice quality meets the requirement can be uploaded. As an example, these speech frames can be uploaded to a server. The speech frames of satisfactory voice quality can then be used for processing such as effective voice communication, speech synthesis and speech recognition.

Referring to Fig. 9, Fig. 9 is a schematic flowchart of a method for eliminating audio crosstalk in another embodiment of the present invention, which specifically includes the following steps:

S901: the acoustic sensor acquires multiple voice signals.

S902: the speech processing device converts the voice signals into digital signals.

S903: the server side receives the digital signal, performs crosstalk removal on the digital signal to obtain a non-crosstalk digital signal, performs data processing according to the non-crosstalk digital signal, and outputs the processed non-crosstalk digital signal.

The server side receives the digital signal output by the speech processing device and performs crosstalk removal on it to obtain a non-crosstalk digital signal. It then performs data processing according to the non-crosstalk digital signal and outputs the processed non-crosstalk digital signal.

The server side can perform data processing on the non-crosstalk digital signal. For example, the data processing may include speech recognition, so that voice information can be converted into text information.

In one embodiment of the invention, the server side receives the digital signal output by the speech processing device, performs crosstalk removal on the digital signal according to a speech parameter to obtain a non-crosstalk digital signal, performs data processing according to the non-crosstalk digital signal, and outputs the processed non-crosstalk digital signal. The speech parameter is a parameter determined based on the voice signal.
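The flow of steps S902 and S903 can be sketched as follows. This passage does not specify the crosstalk-removal algorithm, so `remove_crosstalk` below is only a placeholder assumption: for each frame it keeps the channel with the highest energy and zeroes the others, on the reasoning that the microphone nearest the speaker picks up the loudest copy. The 16-bit quantization, the frame length, and all function names are likewise hypothetical.

```python
def to_digital(analog_samples, bits=16):
    """S902 (sketch): quantize samples in [-1.0, 1.0] to signed integers."""
    full_scale = 2 ** (bits - 1) - 1
    return [int(round(max(-1.0, min(1.0, s)) * full_scale)) for s in analog_samples]


def frame_energy(frame):
    """Sum of squared samples of one frame."""
    return sum(s * s for s in frame)


def remove_crosstalk(channels, frame_len):
    """S903 (placeholder, not the source's algorithm).

    For every frame, keep the channel with the highest energy and
    zero the same frame in all other channels.
    """
    n = min(len(c) for c in channels)
    cleaned = [[0] * n for _ in channels]
    for start in range(0, n, frame_len):
        end = min(start + frame_len, n)
        energies = [frame_energy(c[start:end]) for c in channels]
        loudest = energies.index(max(energies))
        cleaned[loudest][start:end] = channels[loudest][start:end]
    return cleaned


# S901 (simulated): sensor A is near the speaker, sensor B picks up a weak copy.
sensor_a = [0.5, -0.5, 0.5, -0.5]
sensor_b = [0.1, -0.1, 0.1, -0.1]
digital = [to_digital(sensor_a), to_digital(sensor_b)]
clean = remove_crosstalk(digital, frame_len=2)
# clean[0] keeps sensor A's samples; clean[1] is all zeros.
```

The cleaned channels would then be passed on to data processing such as speech recognition.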

Fig. 10 is a structural diagram showing an exemplary hardware architecture of a computing device capable of implementing the system and method for eliminating audio crosstalk according to embodiments of the present invention.

As shown in Fig. 10, the computing device 1000 includes an input device 1001, an input interface 1002, a central processing unit 1003, a memory 1004, an output interface 1005 and an output device 1006. The input interface 1002, the central processing unit 1003, the memory 1004 and the output interface 1005 are connected to one another by a bus 1010. The input device 1001 and the output device 1006 are connected to the bus 1010 through the input interface 1002 and the output interface 1005 respectively, and thereby to the other components of the computing device 1000.

Specifically, the input device 1001 receives input information from outside and transmits it to the central processing unit 1003 through the input interface 1002. The central processing unit 1003 processes the input information based on computer-executable instructions stored in the memory 1004 to generate output information, stores the output information temporarily or permanently in the memory 1004, and then transmits the output information to the output device 1006 through the output interface 1005. The output device 1006 outputs the output information outside the computing device 1000 for use by the user.

That is, the computing device shown in Fig. 10 may also be implemented to include: a memory storing computer-executable instructions; and a processor which, when executing the computer-executable instructions, can implement the system and method for eliminating audio crosstalk described in conjunction with Fig. 1 to Fig. 9.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A system for eliminating audio crosstalk, characterized in that the system comprises:

an acoustic sensor, for acquiring multiple voice signals;

a speech processing device, coupled with the acoustic sensor, for converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting a non-crosstalk digital signal.

2. The system for eliminating audio crosstalk according to claim 1, characterized in that the system further comprises: a server side coupled with the speech processing device;

the server side is used for receiving the non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal.

3. The system for eliminating audio crosstalk according to claim 2, characterized in that the server side is located at a local end or in the cloud.

4. The system for eliminating audio crosstalk according to claim 1 or 2, characterized in that the speech processing device is specifically used for performing crosstalk removal on the converted digital signal according to a speech parameter, the speech parameter being a parameter determined based on the voice signal.

5. The system for eliminating audio crosstalk according to claim 4, characterized in that the speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

6. The system for eliminating audio crosstalk according to claim 4, characterized in that the speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

7. A system for eliminating audio crosstalk, characterized in that the system comprises:

an acoustic sensor, for acquiring multiple voice signals;

a server side, coupled with the speech processing device, for receiving the digital signal, performing crosstalk removal on the digital signal to obtain a non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal.

8. The system for eliminating audio crosstalk according to claim 7, characterized in that the server side is located at a local end or in the cloud.

9. The system for eliminating audio crosstalk according to claim 7 or 8, characterized in that the server side is specifically used for performing crosstalk removal on the digital signal according to a speech parameter to obtain the non-crosstalk digital signal, the speech parameter being a parameter determined based on the voice signal.

10. The system for eliminating audio crosstalk according to claim 9, characterized in that the speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

11. The system for eliminating audio crosstalk according to claim 9, characterized in that the speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

12. A method for eliminating audio crosstalk using the system according to any one of claims 1-6, characterized in that the method comprises:

the acoustic sensor acquiring multiple voice signals;

the speech processing device converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting a non-crosstalk digital signal.

13. The method for eliminating audio crosstalk according to claim 12, characterized in that the method further comprises:

the server side receiving the non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal.

14. The method for eliminating audio crosstalk according to claim 13, characterized in that the server side is located at a local end or in the cloud.

15. The method for eliminating audio crosstalk according to claim 12 or 13, characterized in that the speech processing device converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals, and outputting a non-crosstalk digital signal comprises:

the speech processing device converting the voice signals into digital signals, performing crosstalk removal on the converted digital signals according to a speech parameter, and outputting a non-crosstalk digital signal, the speech parameter being a parameter determined based on the voice signal.

16. The method for eliminating audio crosstalk according to claim 15, characterized in that the speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

17. The method for eliminating audio crosstalk according to claim 15, characterized in that the speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

18. A method for eliminating audio crosstalk using the system according to any one of claims 7-11, characterized in that the method comprises:

the acoustic sensor acquiring multiple voice signals;

the server side receiving the digital signal, performing crosstalk removal on the digital signal to obtain a non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal.

19. The method for eliminating audio crosstalk according to claim 18, characterized in that the server side is located at a local end or in the cloud.

20. The method for eliminating audio crosstalk according to claim 18 or 19, characterized in that the server side receiving the digital signal, performing crosstalk removal on the digital signal to obtain a non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal comprises:

the server side receiving the digital signal, performing crosstalk removal on the digital signal according to a speech parameter to obtain a non-crosstalk digital signal, performing data processing according to the non-crosstalk digital signal, and outputting the processed non-crosstalk digital signal, the speech parameter being a parameter determined based on the voice signal.

21. The method for eliminating audio crosstalk according to claim 20, characterized in that the speech parameter is a parameter determined from the sound wave of maximum sound intensity acquired by the acoustic sensor.

22. The method for eliminating audio crosstalk according to claim 20, characterized in that the speech parameter is a parameter determined from a sound wave of non-maximum sound intensity acquired by the acoustic sensor.

23. A computer storage medium, characterized in that computer program instructions are stored in the computer storage medium; when the computer program instructions are executed by a processor, the method for eliminating audio crosstalk according to any one of claims 12-17 is implemented.

24. A computer storage medium, characterized in that computer program instructions are stored in the computer storage medium; when the computer program instructions are executed by a processor, the method for eliminating audio crosstalk according to any one of claims 18-22 is implemented.