CN102056026B - Audio/video synchronization detection method and system, and voice detection method and system

Info

Publication number: CN102056026B
Application number: CN2009102374145A
Authority: CN
Inventors: 陈欣伟; 方力; 沈亮; 高屹; 常静; 侯优优; 阮征
Original assignee: China Mobile Group Design Institute Co Ltd
Current assignee: China Mobile Group Design Institute Co Ltd
Priority date: 2009-11-06
Filing date: 2009-11-06
Publication date: 2013-04-03
Anticipated expiration: 2029-11-06
Also published as: CN102056026A

Abstract

The invention discloses an audio/video synchronization detection method and system, and a voice detection method and system. The audio/video synchronization detection method includes: determining, in the audio/video file played at the target end, the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the video reference data; determining, from these two start play times, the audio/video play time difference when the file is played at the target end; and obtaining the audio/video play time difference when the file is played at the source end, then determining the audio/video synchronization of the file at the target end from the play time differences at the source end and the target end. The invention can improve the accuracy of audio/video synchronization detection.

Description

Audio/video synchronization detection method and system, and voice detection method and system

Technical field

The present invention relates to audio/video detection techniques in the communications field, and in particular to an audio/video synchronization detection method and system and a voice detection method and system.

Background technology

In mobile video services, audio and video do not carry time information during encoding, so obtaining audio/video synchronization information is difficult.

One approach adds time information to the audio data packets and the video data packets after encoding. Once the encoded audio/video file has been transmitted over the network to the receiving end, the receiving end parses the received file, extracts the time information carried in the audio and video data packets, and judges the audio/video synchronization from that time information.

However, this audio/video synchronization detection method has the following problems:

(1) Although audio and video each carry time information after being packetized, the time information in the two packet streams has no defined correspondence. Moreover, audio and video differ in frame length and packet size, so the relative delay between audio and video cannot be determined accurately;

(2) A synchronization result derived from the time information in the audio and video packet headers reflects only the network transmission delay. During actual playback, the audio/video player at the receiving end has a buffer, and the player uses it to resynchronize the decoded audio and video streams. A result based on packet-header time information therefore cannot reflect the effect of the player's synchronization adjustment, so the result obtained in this way is inaccurate.

Summary of the invention

Embodiments of the invention provide an audio/video synchronization detection method and system to solve the low accuracy of existing audio/video synchronization detection.

The technical solution provided by the embodiments of the invention includes:

An audio/video synchronization detection method, comprising the following steps:

Determining, in the audio/video file played at the target end, the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the video reference data;

Determining, from the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the video reference data, the audio/video play time difference of the audio/video file when played at the target end;

Obtaining the audio/video play time difference of the audio/video file when played at the source end, and determining, from the audio/video play time differences at the source end and the target end, the audio/video synchronization of the audio/video file when played at the target end;

Wherein the audio reference data is voice data, and determining the start play time of the audio segment that matches the audio reference data comprises: detecting the voice segments contained in the played audio/video file and their start and end play times; and performing speech recognition on the detected voice segments and the audio reference data to determine the voice segment that matches the audio reference data. Determining the voice segments contained in the played audio/video file and their start and end play times comprises:

Searching the audio signal of the played audio/video file by short-time average magnitude; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, recording that moment as the first current time; and when, after the first current time, the short-time average magnitude is first found to drop below the first amplitude threshold, recording that moment as the second current time;

When, searching forward (toward earlier times) from the first current time and backward (toward later times) from the second current time, an audio signal is found whose short-time average magnitude has dropped to a second amplitude threshold, continuing to search the audio signal in the same direction by short-time average zero-crossing rate; the second amplitude threshold is less than the first amplitude threshold;

When the forward search finds an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold, recording that moment as the third current time and taking it as the start point of the voice segment; when the backward search finds an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold, recording that moment as the fourth current time and taking it as the end point of the voice segment.

An audio/video synchronization detection system, comprising:

An audio identification module, configured to determine, in the audio/video file played at the target end, the start play time of the audio segment that matches the audio reference data;

A video identification module, configured to determine, in the audio/video file played at the target end, the start play time of the video frame that matches the video reference data;

A time difference determination module, configured to determine the audio/video play time difference of the audio/video file when played at the target end, from the start play time, determined by the audio identification module, of the audio segment that matches the audio reference data and the start play time, determined by the video identification module, of the video frame that matches the video reference data;

A synchronization detection module, configured to obtain the audio/video play time difference of the audio/video file when played at the source end, and to determine the audio/video synchronization of the audio/video file when played at the target end from the obtained play time difference and the play time difference determined by the time difference determination module;

Wherein the audio reference data is voice data, and the process by which the audio identification module determines the start play time of the audio segment that matches the audio reference data comprises: detecting the voice segments contained in the played audio/video file and their start and end play times; and performing speech recognition on the detected voice segments and the audio reference data to determine the voice segment that matches the audio reference data. The process by which the audio identification module determines the voice segments contained in the played audio/video file and their start and end play times comprises:

Searching the audio signal of the played audio/video file by short-time average magnitude; when an audio signal whose short-time average magnitude exceeds the first amplitude threshold is found, recording that moment as the first current time; and when, after that moment, the short-time average magnitude is first found to drop below the first amplitude threshold, recording that moment as the second current time;

In the above embodiment of the invention, for the audio/video file played at the target end, the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the video reference data are determined, which gives the audio/video play time difference at the target end. This difference is then compared with the audio/video play time difference of the same file when played at the source end, which determines the audio/video synchronization of the file when played at the target end. Compared with the prior art, the detection does not rely on time information in the audio and video data packets. It works on the audio/video file actually played at the target end, and so takes into account the synchronization adjustment made during audio/video decoding at the target end. The detection result is therefore more accurate. The method is particularly suited to detecting the synchronization of audio and video after network transmission.

Embodiments of the invention also provide a voice detection method and system to solve the low accuracy of voice detection in the prior art.

The technical solution provided by the embodiments of the invention includes:

A voice detection method, comprising the following steps:

Searching the audio signal of the audio under test by short-time average magnitude; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, recording that moment as the first current time; and when, after that moment, the short-time average magnitude is first found to drop below the first amplitude threshold, recording that moment as the second current time;

A voice detection system, comprising:

A first search module, configured to search the audio signal of the audio under test by short-time average magnitude; when an audio signal whose short-time average magnitude exceeds a first amplitude threshold is found, to record that moment as the first current time; and when, after that moment, the short-time average magnitude is first found to drop below the first amplitude threshold, to record that moment as the second current time;

A second search module, configured to continue searching the audio signal in the same direction by short-time average zero-crossing rate when the first search module, searching forward (toward earlier times) from the first current time and backward (toward later times) from the second current time, finds an audio signal whose short-time average magnitude has dropped to a second amplitude threshold; the second amplitude threshold is less than the first amplitude threshold;

A voice segment determination module, configured to record the third current time, and take it as the start point of the voice segment, when the forward search of the second search module finds an audio signal whose short-time average zero-crossing rate has dropped below a zero-crossing rate threshold; and to record the fourth current time, and take it as the end point of the voice segment, when the backward search finds an audio signal whose short-time average zero-crossing rate has dropped below the zero-crossing rate threshold.

In the above embodiment of the invention, voice detection exploits two characteristics: short-time average energy identifies voice segments effectively when background noise is low, and short-time average zero-crossing rate identifies them effectively when background noise is high. Both the short-time average magnitude and the short-time average zero-crossing rate of the voice signal are considered. On top of detection based on short-time average magnitude, the short-time average zero-crossing rate is also examined, and the two features together are used to detect the endpoints of the voice signal, so that the detected voice segment endpoints are more accurate.

Description of drawings

Fig. 1 is a flow chart of audio/video synchronization detection in an embodiment of the invention;

Fig. 2 is a flow chart of audio/video synchronization detection for IP network video telephony in an embodiment of the invention;

Fig. 3 is a schematic diagram of the dynamic path search in the speech recognition process in an embodiment of the invention;

Fig. 4 is a schematic diagram of the audio/video synchronization rating model in an embodiment of the invention;

Fig. 5 is a structural diagram of the audio/video synchronization detection system in an embodiment of the invention;

Fig. 6 is a structural diagram of the voice detection system in an embodiment of the invention.

Embodiment

To address the above problems in the prior art, embodiments of the invention provide an audio/video synchronization detection method and system that perform detection by pattern recognition. At the sending end and at the receiving end, pattern recognition is applied to the played audio/video file and the reference data for that file, and the start play times of the audio frame and the video frame that match the audio reference data and the video reference data are recorded. This gives the audio/video play time difference at the sending end and at the receiving end. Comparing the two differences yields the delay difference, and hence the audio/video synchronization when the file is played at the receiving end.

In embodiments of the invention, audio reference data and video reference data are prepared before detection. They are used during detection to locate the audio reference points and video reference points in the audio/video file, from which the audio/video synchronization parameters are determined. The audio reference data may be audio waveform data and the video reference data may be video image data; both may be stored in advance in a feature library.

Fig. 1 is a flow chart of audio/video synchronization detection in an embodiment of the invention. The flow can be used to evaluate the effect of network transmission on audio/video synchronization, or the effect of different playback ends on it. When evaluating network transmission, the source end is the sending end of the audio/video file and the target end is the receiving end that the file reaches after network transmission. When evaluating different playback ends, the source end may be a playback end with good audio/video synchronization quality, and the target end is the playback end whose synchronization quality is to be evaluated. The flow comprises the following steps:

Step 101: use an audio pattern recognition method to find, in the audio/video file played at the target end, the audio segment that matches the audio reference data, and record the start play time of that audio segment;

Step 102: use a video pattern recognition method to find, in the audio/video file played at the target end, the video frame that matches the video reference data, and record the start play time of that video frame;

Step 103: from the recorded start play time of the audio segment that matches the audio reference data and the recorded start play time of the video frame that matches the video reference data, determine the audio/video play time difference;

Step 104: from the determined audio/video play time difference, and from the audio/video play time difference between the matching audio segment and the matching video frame when the file is played at the source end, determine the audio/video synchronization of the file as played at the target end. For example, determine the amount or degree by which the audio/video synchronization delay at the target end has changed relative to the source end (such as the change, relative to the source end, in how long the audio leads or lags the video at the target end). The synchronization result may further be mapped to a corresponding audio/video synchronization quality grade.
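The comparison in steps 103 and 104 can be written compactly. The symbols below are introduced only for illustration and do not appear in the source; the sign convention (audio start time minus video start time) is likewise an assumption, since the text does not fix the order of subtraction.

```latex
\Delta_{\mathrm{dst}} = t^{\mathrm{dst}}_{a} - t^{\mathrm{dst}}_{v}, \qquad
\Delta_{\mathrm{src}} = t^{\mathrm{src}}_{a} - t^{\mathrm{src}}_{v}, \qquad
D = \Delta_{\mathrm{dst}} - \Delta_{\mathrm{src}}
```

Here t_a and t_v are the start play times of the matching audio segment and video frame at the end named by the superscript. D = 0 would mean the target end preserves the alignment seen at the source end; under the assumed convention, D > 0 would mean the audio lags the video by more at the target end than at the source end.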

In steps 101 and 102 of the above flow, the recorded time may be the current system time of the target end or the time relative to the start of playback of the audio/video file. Steps 101 and 102 have no strict ordering: they may be swapped or executed in parallel.

Usually, audio reference data and video reference data correspond one to one, and to make detection more accurate there are generally several such pairs. With several pairs, the play time differences determined in step 103 of the flow in Fig. 1 also correspond one to one with the pairs. That is, for a given item of audio reference data the start play time of the matching audio segment is determined, for the corresponding item of video reference data the start play time of the matching video frame is determined, and the difference between the two is the audio/video play time difference for that pair. The audio/video play time difference at the source end used in step 104 is obtained per pair in the same way.

The audio/video time difference of the test audio/video file at the sending end can be obtained in advance in this way. Each subsequent detection that uses the file then compares this previously measured sending-end time difference directly with the time difference at the receiving end, to determine the audio/video synchronization of the file after transmission.

In general, to detect synchronization accurately, the audio reference data and video reference data should have distinctive features that are easy to recognize and to pattern-match, and the test audio/video file must contain an audio segment that matches the audio reference data and a video frame that matches the video reference data. Preferably, in the test audio/video file, the start play time of the audio segment that matches the audio reference data and the start play time of the video frame that matches the corresponding video reference data coincide at the sampling-point level, that is, the audio/video time difference is 0. In this case, in step 104 of the flow in Fig. 1, because the audio/video play time difference at the source end is 0, the audio/video synchronization at the target end can be determined directly from the play time difference determined in step 103.

Take audio/video synchronization detection for IP network video telephony as an example. The test audio/video file contains, on the audio side, the pronunciation of the digits 1, 2, 3, 4 and 5 and, on the video side, pictures of 5 different hand gestures shown against a solid-color background; during playback, each time a digit is pronounced, the corresponding gesture is shown on screen. The audio reference data is the audio waveform data of each of the digits 1, 2, 3, 4 and 5, stored in an audio feature library. The video reference data is the video image data of each of the 5 hand gestures against the solid-color background, stored in a video feature library. When the file is played at the sending end, the synchronization time difference between each digit pronunciation and its gesture picture is known. During network transmission the audio and video of the file are transmitted separately, forming a WAV audio file and an AVI video file at the receiving end. Detecting the audio/video synchronization of the file at the receiving end may, as shown in Fig. 2, comprise the following steps:

Obtain the WAV audio file from the audio/video file received at the receiving end (step 201). Determine the endpoints of each voice segment from the audio signal to find the voice segments (step 202). Using an audio pattern recognition method, compare each voice segment with the voice data of each digit pronunciation in the audio feature library to identify the voice segments in which the digits 1, 2, 3, 4 and 5 are pronounced (step 203), and record the start and end play times of these voice segments. The receiving end thus records the times of at least 5 audio segments (more if digit pronunciations are repeated in the WAV audio file) (step 204);

Obtain the AVI video file from the audio/video file received at the receiving end (step 205). Extract every frame image from the AVI video file (step 206). Using a video pattern recognition method, compare each video frame image with the image data of each gesture in the video feature library to identify the video frames showing each gesture, normally keeping only the first frame identified for each (step 207), and record the start play times of these video frames. The receiving end thus records the times of at least 5 video frames (more if gesture pictures are repeated in the AVI video file) (step 208);

Take the difference between the recorded start play time of the pronunciation of digit 1 and the recorded start play time of the video frame showing the corresponding gesture to obtain the audio/video play time difference for digit 1 (all recorded times use the receiving end's system time as the reference). Obtain the audio/video play time differences for the other digits in the same way (step 209);

Compare the audio/video play time differences from step 209 with the known play time differences of the file at the sending end to determine the audio/video delay of the file at the receiving end relative to the sending end (step 210);

From the result of step 210, determine the corresponding audio/video synchronization quality grade or MOS score (step 211).
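Steps 209 and 210 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the millisecond timestamps and the audio-minus-video sign convention are all assumptions made for the example, and the sending-end differences are set to 0 as in the preferred test file described earlier.

```python
# Hypothetical start play times recorded at the receiving end, in milliseconds.
# Keys are the digit labels; none of these values come from the source.
audio_start_ms = {"1": 1000, "2": 3020, "3": 5050, "4": 7010, "5": 9040}
video_start_ms = {"1": 960, "2": 2960, "3": 4960, "4": 6960, "5": 8960}

# Known play time differences at the sending end; 0 assumes a test file whose
# audio and video references are aligned.
source_diff_ms = {"1": 0, "2": 0, "3": 0, "4": 0, "5": 0}


def play_time_differences(audio_start, video_start):
    """Step 209: per reference pair, audio start time minus video start time."""
    return {key: audio_start[key] - video_start[key]
            for key in audio_start if key in video_start}


def relative_delays(target_diff, source_diff):
    """Step 210: change in the A/V time difference relative to the sending end."""
    return {key: target_diff[key] - source_diff[key] for key in target_diff}


target_diff_ms = play_time_differences(audio_start_ms, video_start_ms)
delay_ms = relative_delays(target_diff_ms, source_diff_ms)
mean_delay_ms = sum(delay_ms.values()) / len(delay_ms)

for digit in sorted(delay_ms):
    print(f"digit {digit}: delay {delay_ms[digit]} ms")
print(f"mean delay: {mean_delay_ms} ms")
```

Mapping the delay to a quality grade or MOS score (step 211) is left out of the sketch because this section does not give the thresholds used for that mapping.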

Regarding the choice of audio reference data: subjectively, people are sensitive to a mismatch between the picture content and the start of audio (silence to sound) or the end of audio (sound to silence). Preferably, therefore, the audio references are voice segments (such as the pronunciations of the digits 1 to 5). Accordingly, to determine the audio segment that matches the audio reference data, the endpoint positions of each voice segment in the audio waveform of the file are detected first, and audio pattern recognition is then performed on the detected voice segments and the audio reference data.

To detect the voice segments in an audio file, embodiments of the invention may use a conventional voice segment detection method based on short-time energy or short-time average magnitude. Such a method is in essence a single-threshold detection method. To obtain an endpoint detection method that is more adaptable than the conventional one and extracts audio time information more accurately, embodiments of the invention also improve on the conventional voice detection method and use the improved method for voice detection. The improved method exploits two characteristics: short-time average energy identifies voice segments effectively when background noise is low, and short-time average zero-crossing rate identifies them effectively when background noise is high. It considers both the short-time average magnitude and the short-time average zero-crossing rate of the voice signal: on top of detection based on short-time average magnitude, it also examines the short-time average zero-crossing rate, and uses the two features together to detect the endpoints of the voice signal.

These judgments are possible because the short-time parameters of speech of different natures have different probability density functions, and because adjacent frames should have consistent speech characteristics, that is, the signal does not switch abruptly among voiced sound, unvoiced sound and silence. Typically, the short-time average magnitude is largest for voiced sound and smallest for silence; the short-time average zero-crossing rate is largest for unvoiced sound, intermediate for silence, and smallest for voiced sound.

In the voice detection method used in embodiments of the invention, two amplitude thresholds MH and ML (MH > ML) and a short-time zero-crossing rate threshold Z0 are first set from empirical values. MH should be set fairly high, so that when the short-time average magnitude M of a frame exceeds MH, the frame is certainly not silence and is very likely voiced sound. When M has fallen from a high value to ML, the judgment continues with the short-time average zero-crossing rate: when the short-time average zero-crossing rate of the voice signal falls below the threshold Z0, that point is taken as an endpoint (start or end) of the voice segment.

Statistical analysis of the short-time average magnitude and short-time average zero-crossing rate can be carried out on a large number of speech samples, and the amplitude thresholds MH and ML determined with reference to the short-time average magnitudes of actual samples. The process of determining the amplitude threshold MH from speech samples is:

Window the data in each speech sample and divide it into frames. Based on human physiological characteristics and statistics from a large amount of data, the window length is generally set to 20 ms and the step to half the window length, so that total number of frames = total number of samples / step;

Compute the short-time average magnitude of each frame using the following formula:

M_m = \sum_{n=m}^{N+m-1} \left| s_w(n-m) \right|

Compute the short-time average zero-crossing rate of each frame using the following formula:

Z_m = \frac{1}{2} \sum_{n=m}^{N+m-1} \left| \operatorname{sgn}[s_w(n)] - \operatorname{sgn}[s_w(n-1)] \right|

Traverse all speech frames in each speech sample for statistical analysis to obtain the distributions of short-time average magnitude and short-time average zero-crossing rate of the speech samples;

From these distributions, and with reference to the short-time average magnitude of silent periods, set the threshold MH. MH is set fairly high, to ensure that any part of a speech sample whose short-time average magnitude exceeds MH is a voice segment. Then take three times the short-time average zero-crossing rate of the silent period as the zero-crossing rate threshold Z0 for voice segments.
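The framing and per-frame measurements described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the patent's implementation: the function names are hypothetical, a rectangular window is assumed (the text does not name the window), sgn(0) is taken as +1, the 8 kHz sample rate is an example value, and the zero-crossing sum starts at the second sample of each frame instead of looking back at the previous frame.

```python
import math


def frame_signal(samples, sample_rate, window_ms=20):
    """Split samples into frames of window_ms with a step of half a window.

    Only full frames are kept, so the count is one less than the
    approximate "total samples / step" figure given in the text.
    """
    window = int(sample_rate * window_ms / 1000)
    step = window // 2
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, step)]


def short_time_magnitude(frame):
    """Sum of absolute sample values in one frame (rectangular window assumed)."""
    return sum(abs(x) for x in frame)


def sgn(x):
    """Sign function with sgn(0) taken as +1 (an assumed convention)."""
    return 1 if x >= 0 else -1


def short_time_zcr(frame):
    """Half the summed sign changes between neighbouring samples in one frame."""
    return 0.5 * sum(abs(sgn(frame[n]) - sgn(frame[n - 1]))
                     for n in range(1, len(frame)))


def zero_crossing_threshold(silent_frames):
    """Z0 as three times the mean zero-crossing rate of silent-period frames."""
    mean_zcr = sum(short_time_zcr(f) for f in silent_frames) / len(silent_frames)
    return 3 * mean_zcr


# One second of a 100 Hz tone at an assumed 8 kHz sample rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(tone, sample_rate)
print(len(frames), len(frames[0]))
```

At 8 kHz a 20 ms window is 160 samples and the step is 80 samples, so one second of audio gives roughly 8000 / 80 = 100 frames by the formula in the text. Choosing MH and ML still requires inspecting the magnitude distribution of real speech samples, which the sketch does not attempt.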

Using the amplitude thresholds MH and ML and the short-time average zero-crossing-rate threshold Z0 determined above, the speech detection process of the embodiment of the invention is as follows:

Two time points A1 and A2 are determined in the audio signal under test according to MH: the moment at which the short-time average magnitude M of the speech signal exceeds MH is recorded as A1, and the first moment after A1 at which M drops back to MH is recorded as A2. The span between A1 and A2 can essentially be determined to be a speech segment;

The search continues in the speech signal before A1 and after A2. Searching forward from A1 (toward earlier times), the moment at which M has decreased to ML is recorded as B1; likewise, searching backward from A2 (toward later times), the moment at which M has decreased to ML is recorded as B2. The span between B1 and B2 can still be determined to be a speech segment;

The search then continues forward from B1 and backward from B2. Searching forward from B1, as long as the short-time zero-crossing rate Z of the speech signal remains greater than Z0, the signal is considered to still belong to the speech segment; when Z suddenly drops below Z0, that moment is recorded as C1 and taken as the start point of the speech segment. Likewise, searching backward from B2, as long as Z remains greater than Z0, the signal is considered to still belong to the speech segment; when Z suddenly drops below Z0, that moment is recorded as C2 and taken as the end point of the speech segment;

In the same way, all speech segments in the audio file, together with their start and end points, are detected.
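The three-stage search described above (MH, then ML, then Z0) can be sketched in Python as follows. This is only a minimal illustration operating on per-frame lists of M and Z; the function name detect_segment, the use of frame indices as "moments", and the handling of the signal boundaries are assumptions, not details taken from the patent.

```python
def detect_segment(M, Z, MH, ML, Z0, start=0):
    """Return (C1, C2) frame indices of the first speech segment at or after
    `start`, or None if no frame exceeds MH."""
    n = len(M)
    # A1: first frame whose short-time magnitude exceeds MH.
    a1 = next((i for i in range(start, n) if M[i] > MH), None)
    if a1 is None:
        return None
    # A2: first later frame whose magnitude has fallen back to MH or below.
    a2 = next((i for i in range(a1 + 1, n) if M[i] <= MH), n - 1)
    # B1 / B2: extend outwards until the magnitude has fallen to ML.
    b1 = a1
    while b1 > start and M[b1] > ML:
        b1 -= 1
    b2 = a2
    while b2 < n - 1 and M[b2] > ML:
        b2 += 1
    # C1 / C2: keep extending while the zero-crossing rate stays above Z0.
    c1 = b1
    while c1 > start and Z[c1] > Z0:
        c1 -= 1
    c2 = b2
    while c2 < n - 1 and Z[c2] > Z0:
        c2 += 1
    return c1, c2

# Toy data: a loud core (frames 4 to 6) flanked by weak, high-ZCR frames.
M = [1, 1, 2, 4, 9, 12, 10, 5, 2, 1, 1]
Z = [1, 1, 8, 9, 3, 2, 3, 4, 9, 7, 1]
segment = detect_segment(M, Z, MH=8, ML=3, Z0=5)
```

To find every segment in a file, the function could be called repeatedly with start set to one past the previous C2.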

The reason for adopting this algorithm is that the signal before B1 and after B2 may be an unvoiced consonant. Such segments have relatively weak energy, so short-time average magnitude alone cannot fully distinguish them from silence, but their short-time average zero-crossing rate is markedly higher than that of silence. This parameter can therefore be used to locate the boundary between the two accurately, that is, the true start and end points of the speech.

This algorithm is suited not only to the speech segment detection process in the embodiment of the invention but also to other application scenarios that need to detect speech segments in an audio signal.

After the time information of the speech segments has been obtained, pattern recognition must also be performed on them to determine the speech segment that matches the audio reference data. The embodiment of the invention uses linear prediction cepstral coefficients (LPCC) for audio pattern recognition.

Obtaining the LPCC feature parameters involves four main steps: preprocessing, autocorrelation calculation, solving the linear prediction coefficient (LPC) normal equations with the Durbin algorithm, and the LPCC recursion. In preprocessing, pre-emphasis boosts the high frequencies by passing the speech signal through a first-order FIR filter, compensating for the high-frequency attenuation caused by glottal excitation and by radiation from the mouth and nose; for windowing, the algorithm selects the Hamming window as the window function.
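The four steps named above can be illustrated with a short pure-Python sketch. It follows the textbook forms of pre-emphasis, the Hamming window, the Durbin recursion, and the LPC-to-cepstrum recursion; the pre-emphasis coefficient 0.97, the default order of 12, and all function names are assumptions for the example and are not specified in the patent.

```python
import math

def pre_emphasis(x, alpha=0.97):
    # First-order FIR filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value.
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming(frame):
    # Multiply the frame by a Hamming window (frame must hold at least 2 samples).
    size = len(frame)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)))
            for n, s in enumerate(frame)]

def autocorrelation(x, order):
    # r[k] = sum over n of x[n] * x[n-k], for k = 0..order.
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def durbin(r, order):
    """Solve the LPC normal equations; returns a[1..order] for the
    predictor x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:          # silent or perfectly predictable frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:]

def lpc_to_lpcc(a):
    """Cepstral recursion c[m] = a[m] + sum_{k=1}^{m-1} (k/m) c[k] a[m-k]."""
    c = [0.0] * len(a)
    for m in range(1, len(a) + 1):
        c[m - 1] = a[m - 1] + sum((k / m) * c[k - 1] * a[m - k - 1]
                                  for k in range(1, m))
    return c

def lpcc_features(frame, order=12):
    # Preprocess, autocorrelate, solve for LPC, then convert to LPCC.
    x = hamming(pre_emphasis(frame))
    return lpc_to_lpcc(durbin(autocorrelation(x, order), order))
```

Calling lpcc_features on each frame of a speech segment yields one feature vector per frame, which is the form of input the matching stage works on.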

Once the LPCC feature parameters have been extracted from each frame, the speech signal has been converted into a set of LPCC feature vectors. Speech recognition then consists of pattern matching between this set of features and the speech feature vectors of the reference audio data, in order to find the pattern with the shortest distance.

Speech recognition by pattern matching is usually divided into two stages: a training stage and a recognition stage. Standard templates are formed in the training stage; in the recognition stage, the similarity between the speech feature vectors to be recognized (after transmission attenuation) and the standard template vectors is calculated. In the embodiment of the invention, the standard template formed in the training stage is the feature vector of the audio reference data.

However, because of factors such as attenuation and packet loss during transmission of the audio file, the length of the original speech sequence and that of the transmitted speech sequence may differ. To address this, the embodiment of the invention performs pattern recognition with a recognition algorithm based on dynamic time warping (DTW).

In the DTW method provided by the embodiment of the invention, the distance matrix between the input pattern (the audio feature vectors of each speech segment to be recognized) and the reference pattern (the feature vectors of the audio reference data) is calculated first. An optimal path with the minimum accumulated distance is then found in the distance matrix; this path is the nonlinear relationship between the time scales of the two patterns. The principle of the algorithm is as follows:

Suppose the input pattern to be recognized and the reference pattern are denoted T and R respectively. To compare their similarity, the distortion D[T, R] between them can be calculated; the smaller the distortion, the higher the similarity. To calculate this distortion, the distortions between corresponding frames of T and R are accumulated. Let N and M be the total numbers of frames in T and R respectively, let n and m be arbitrarily chosen frame numbers in T and R respectively, and let D[T(n), R(m)] denote the distortion between these two frame feature vectors. Then:

When N = M (that is, the T and R patterns have the same number of frames), frames T(1) and R(1), T(2) and R(2), ..., T(N) and R(M) are matched directly; the distortions D[T(1), R(1)], D[T(2), R(2)], ..., D[T(N), R(M)] are calculated and summed to obtain the total distortion;

When N ≠ M (that is, the T and R patterns have different numbers of frames), dynamic programming is used to search for a path. Specifically, the frame numbers of T (n = 1 to N) are marked on the horizontal axis of a two-dimensional rectangular coordinate system and the frame numbers of R (m = 1 to M) on its vertical axis, as shown in Figure 3. Each intersection (n, m) in the resulting grid represents the pairing of one frame of T with one frame of R. The path search then reduces to finding a path through some of these grid intersections; the intersections on the path give the frame numbers in T and R between which distortion is calculated.

The path cannot be chosen arbitrarily. The speed of speech may vary, but the order of its parts cannot change, so the selected path must start at the lower-left corner and end at the upper-right corner. In addition, to avoid an undirected search, paths that lean too far toward the n axis or the m axis can be discarded, because the compression and expansion of real speech are always limited. The maximum and minimum average slope of the path at each point can therefore be bounded; usually the maximum slope is set to 2 and the minimum slope to 1/2.

The path cost function defined in this embodiment is d[(n_i, m_i)], the accumulated frame distortion from the starting point (n_0, m_0) to the current point (n_i, m_i). It is computed as follows:

$$d[(n_i, m_i)] = D[T(n_i), R(m_i)] + d[(n_{i-1}, m_{i-1})]$$

$$d[(n_{i-1}, m_{i-1})] = \min\{\, d[(n_i - 1, m_i)],\ d[(n_i - 1, m_i - 1)],\ d[(n_i - 1, m_i - 2)] \,\}$$

From the above formulas, the required distortion value can be obtained. The path cost function defined above is only an example; other path cost algorithms are not excluded.
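The recursion above can be sketched in Python as a small dynamic program. This is an illustration under stated assumptions: the per-frame distortion is taken to be the Euclidean distance between feature vectors (the text does not fix a distance measure), indices start at 0 rather than 1, and the names frame_distortion and dtw_distance are hypothetical.

```python
import math

def frame_distortion(a, b):
    # Assumed per-frame distortion: Euclidean distance between feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(T, R):
    """Accumulated distortion along the best path from (0, 0) to (N-1, M-1),
    where each step advances n by 1 and m by 0, 1 or 2."""
    N, M = len(T), len(R)
    INF = float("inf")
    d = [[INF] * M for _ in range(N)]
    d[0][0] = frame_distortion(T[0], R[0])
    for n in range(1, N):
        for m in range(M):
            # Cheapest of the three allowed predecessors in column n - 1.
            best = min(d[n - 1][m],
                       d[n - 1][m - 1] if m >= 1 else INF,
                       d[n - 1][m - 2] if m >= 2 else INF)
            if best < INF:
                d[n][m] = frame_distortion(T[n], R[m]) + best
    return d[N - 1][M - 1]

# A segment that repeats its first frame still matches the reference exactly.
stretched = dtw_distance([[0], [0], [1], [2]], [[0], [1], [2]])
shifted = dtw_distance([[0], [1]], [[0], [3]])
```

With this step pattern the result is infinite when no admissible path exists (for example when M is greater than 2N - 1). The recursion as written bounds each step's slope at 2 but does not by itself enforce the minimum average slope of 1/2; a fuller implementation would add that constraint. The speech segment with the smallest returned distance would be taken as the match.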

The video pattern recognition method adopted by the embodiment of the invention is an image recognition method: each played video frame is captured, and each captured frame image is compared with the video frame images in the feature library to find the video frame that matches a video frame image in the feature library. This image recognition process has two main stages: video capture and image recognition.

Video capture can be implemented with the AVIFile library of the Windows operating system, as follows:

First, the AVIFile library is initialized, and the AVI file to be checked for synchronization is opened to obtain its file interface address. If the file opens successfully (that is, the video format meets the requirements), the required AVI file information is obtained through the file interface address. This information can include the maximum data rate of the file (bytes per second), number of streams, height (pixels), width (pixels), sample rate (samples per second), file length (frames), file type, and so on. The interface address of the AVI stream can be obtained from the file interface address, and the AVI stream information from the stream interface address. Because the audio and video streams are processed separately, only the video stream information is obtained here; it can include the stream type, frame rate (fps), start frame, end frame, image quality value, a description of the stream type, and so on;

Next, the obtained video stream information is processed: the corresponding decoding function is called to obtain the address of the decompressed data and the memory address of each frame's data (used for saving it as a BMP file). At this point the required image data has been obtained;

Finally, the header for this image data is rewritten and the data is saved as the required BMP file. The BMP file is named after the frame number in the AVI video stream. The frame time can be obtained by multiplying the current frame number by the frame interval, and the frame interval can be found in the structure dedicated to holding the AVI file information. For example, if the file playback rate is 15 fps, the frame interval is 1/15 s, i.e. about 66666 µs, so the playback time of each frame relative to the start frame is easy to obtain.
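The frame-time arithmetic in this example can be written out as a short Python sketch. The function name and the choice of whole microseconds as the unit are assumptions for illustration; at 15 fps one frame interval is 1/15 s, which is about 66666 µs.

```python
def frame_offset_us(frame_number, start_frame, fps):
    """Playback offset of a frame relative to the start frame,
    in whole microseconds (fractions are truncated)."""
    return (frame_number - start_frame) * 1_000_000 // fps

one_frame = frame_offset_us(1, 0, 15)     # one frame interval at 15 fps
thirty_frames = frame_offset_us(30, 0, 15)  # 30 frames at 15 fps is 2 seconds
```

Multiplying before dividing keeps the truncation error from accumulating across frames.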

After the BMP pictures have been captured from the AVI file, the saved BMP files are known to be 24-bit RGB bitmaps, and the next task is to perform image recognition on them. The image recognition process can be as follows: the 24-bit RGB color bitmap is converted into an 8-bit binary image to highlight the features of the target object; pixel counting and a contour tracing algorithm are used to find the area and perimeter of the target object in the image under test; and these are compared with the images in the feature library. This can be divided into the following steps:

Step 1: convert the target image (the captured image) to grayscale and obtain the corresponding gray-value distribution;

Step 2: perform an iterative calculation on the gray-value distribution to compute a threshold;

Step 3: binarize the image according to the threshold (convert it to a black-and-white picture in which white is the background and black is the target object);

Step 4: count pixels in the binarized image to calculate the area (number of pixels) of the target object;

Step 5: perform further image processing to trace the contour of the target object;

Step 6: count pixels to calculate the perimeter of the target object's contour;

Step 7: compare the obtained area and perimeter with the information stored in the feature library for the corresponding image to judge whether this image is the required target image; if so, record its playback time.
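Steps 1 to 6 above can be illustrated with a pure-Python sketch that works on an image given as rows of (R, G, B) tuples. Several details are assumptions, since the patent does not specify them: the luminance weights used for grayscale conversion, the particular iterative threshold rule (midpoint of the two class means), and the use of a boundary-pixel count as a simple stand-in for contour tracing. All function names are hypothetical.

```python
def to_gray(image):
    # Step 1: grayscale conversion with commonly used luminance weights (assumed).
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in image]

def iterative_threshold(gray, eps=0.5):
    # Step 2: move the threshold to the midpoint of the two class means
    # until it changes by less than eps.
    pixels = [p for row in gray for p in row]
    t = (min(pixels) + max(pixels)) / 2.0
    while True:
        low = [p for p in pixels if p <= t]
        high = [p for p in pixels if p > t]
        if not low or not high:
            return t
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        if abs(new_t - t) < eps:
            return new_t
        t = new_t

def area_and_perimeter(image):
    gray = to_gray(image)
    t = iterative_threshold(gray)
    # Step 3: dark pixels (at or below the threshold) are the target object.
    mask = [[p <= t for p in row] for row in gray]
    h, w = len(mask), len(mask[0])

    def is_background(y, x):
        # Pixels outside the image count as background.
        return not (0 <= y < h and 0 <= x < w) or not mask[y][x]

    # Step 4: area is the number of target pixels.
    area = sum(cell for row in mask for cell in row)
    # Steps 5 and 6: count target pixels that touch the background
    # in one of the four axis directions.
    perimeter = sum(
        1 for y in range(h) for x in range(w)
        if mask[y][x] and any(is_background(y + dy, x + dx)
                              for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))))
    return area, perimeter

# 5x5 white image with a 3x3 black square in the middle.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
image = [[BLACK if 1 <= y <= 3 and 1 <= x <= 3 else WHITE for x in range(5)]
         for y in range(5)]
area, perimeter = area_and_perimeter(image)
```

Step 7 would then compare the resulting (area, perimeter) pair with the values stored in the feature library, for instance within a tolerance that would have to be chosen experimentally.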

In the embodiment of the invention, when the audio/video synchronization is evaluated, the degree to which the audio leads or lags the video can be mapped to a corresponding audio/video synchronization grade and MOS score.

The MOS score for audio/video synchronization in the embodiment of the invention follows the scoring algorithm of the ITU-R BT.1359 standard and imitates its piecewise calculation: thresholds for 4 audio/video synchronization quality grades are set according to people's subjective perception of synchronization. The scoring model can be as shown in Figure 4, in which the horizontal axis is the time by which the audio lags the video and the vertical axis is the score; the points A, B, C, A', B', and C' represent the three grade thresholds that divide the scores into 4 grades. Each synchronization quality grade corresponds to a MOS score; the maximum score is 4.0, the minimum score is 1.0, and the floating range is 0.3. The synchronization grades, their thresholds, and the corresponding MOS scores can be as shown in Table 1:

Table 1

To evaluate audio/video synchronization quality more accurately and objectively, the embodiment of the invention sets up multiple monitoring points to detect synchronization and evaluate its quality. When the synchronization quality is evaluated, the synchronization MOS scores of these monitoring points are added together to obtain the overall synchronization MOS score. This overall score can serve as an important indicator and be combined by weighting with the audio MOS and video MOS scores to yield the MOS score for the overall quality of the video service.

Based on the same technical concept as the audio/video synchronization detection in the embodiment of the invention, the embodiment of the invention also provides an audio/video synchronization detection system. As shown in Figure 5, the system comprises an audio identification module 501, a video identification module 502, a time difference determination module 503, and a synchronization detection module 504, wherein:

The audio identification module 501 uses audio pattern recognition to determine, in the audio/video file played at the target end, the start playback time of the audio segment that matches the audio reference data;

The video identification module 502 uses video pattern recognition to determine, in the audio/video file played at the target end, the start playback time of the video frame that matches the video reference data;

The time difference determination module 503 determines the audio/video playback time difference of the audio/video file when played at the target end, according to the start playback time of the audio segment matching the audio reference data, as determined by the audio identification module 501, and the start playback time of the video frame matching the video reference data, as determined by the video identification module 502;

The synchronization detection module 504 obtains the audio/video playback time difference of the audio/video file when played at the source end and, according to this time difference and the one determined by the time difference determination module 503, determines the audio/video synchronization of the audio/video file when played at the target end.

The specific implementation of each function in the above functional modules is similar to the corresponding process in the audio/video synchronization detection flow described above and is not repeated here.
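The calculation carried out by modules 503 and 504 amounts to two subtractions, sketched below in Python. The function names, the millisecond unit, and the sign convention (a positive value means the audio lags the video) are assumptions for the example; mapping the result to a grade or MOS score would additionally require the thresholds of Table 1.

```python
def av_time_difference(audio_start_ms, video_start_ms):
    """Playback time difference between the matched audio segment and
    the matched video frame; positive means the audio starts later."""
    return audio_start_ms - video_start_ms

def sync_drift_ms(source_audio, source_video, target_audio, target_video):
    """Change in the audio/video time difference between source and target."""
    source_diff = av_time_difference(source_audio, source_video)
    target_diff = av_time_difference(target_audio, target_video)
    return target_diff - source_diff

# Source plays both references at 1000 ms; at the target the audio arrives
# 120 ms after the video.
drift = sync_drift_ms(1000, 1000, 5120, 5000)
```

Comparing against the source-end difference means that an offset already present in the original file is not counted as a transmission fault.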

Based on the same technical concept as the speech detection in the embodiment of the invention, the embodiment of the invention also provides a speech detection system. As shown in Figure 6, the system comprises a first search module 601, a second search module 602, and a speech segment determination module 603, wherein:

The first search module 601 receives the input audio signal under test and searches it according to short-time average magnitude. When it finds audio whose short-time average magnitude exceeds the amplitude threshold MH, it searches forward from the current moment; and when, after that moment, it finds audio whose short-time average magnitude first drops below the amplitude threshold MH, it searches backward from the current moment;

The second search module 602 continues to search the audio signal in the same search direction according to short-time average zero-crossing rate once the first search module 601, searching forward or backward, has found audio whose short-time average magnitude has dropped to the amplitude threshold ML;

The speech segment determination module 603 takes the current moment as the start point of the speech segment when the second search module 602, searching forward, finds audio whose short-time average zero-crossing rate has dropped below the zero-crossing-rate threshold Z0, and takes the current moment as the end point of the speech segment when the second search module 602, searching backward, finds such audio.

The system may also comprise a threshold setting module 604 for determining the amplitude threshold MH, the amplitude threshold ML, and the zero-crossing-rate threshold Z0 according to the distributions of short-time average magnitude and short-time average zero-crossing rate of the audio signals in the speech sample data, wherein an audio signal whose short-time average magnitude is above the amplitude threshold MH is a speech signal, and, among signals whose short-time average magnitude is below the amplitude threshold ML, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold Z0 is not a speech signal.

The specific implementation of each function in the above functional modules is similar to the corresponding process in the speech detection flow described above and is not repeated here.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to include them as well.

Claims

1. An audio/video synchronization detection method, characterized by comprising the following steps:

2. The method of claim 1, characterized in that obtaining the audio/video playback time difference of said audio/video file when played at the source end comprises:

determining, in the audio/video file played at the source end, the start playback time of the audio segment matching said audio reference data and the start playback time of the video frame matching said video reference data;

determining the audio/video playback time difference of said audio/video file when played at the source end according to said start playback time of the audio segment matching the audio reference data and said start playback time of the video frame matching the video reference data.

3. The method of claim 1, characterized in that said first amplitude threshold, second amplitude threshold, and zero-crossing-rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing-rate distribution of the audio signals in speech sample data, wherein an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal, and, among signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold is not a speech signal.

4. The method of claim 1, characterized in that the process of determining the speech segment matching said audio reference data comprises:

determining the similarity between each speech segment and said audio reference data by calculating the spatial distance between them, according to the feature vectors of the audio signal of each speech segment and the feature vectors of said audio reference data;

according to the determined similarities, taking the speech segment most similar to said audio reference data as the speech segment matching said audio reference data.

5. The method of claim 4, characterized in that, when the number of audio frames in a speech segment is not equal to the number of audio frames in the audio reference data, the process of calculating the distance between the speech segment and said audio reference data is specifically:

mapping the frame numbers of the audio frames of said speech segment onto the horizontal axis of a two-dimensional rectangular coordinate system and the frame numbers of the audio frames of the audio reference data onto the vertical axis of that coordinate system, and determining a path in the direction from the lower-left corner to the upper-right corner of said coordinate system; determining, according to the coordinate points the path passes through, the frame number of the audio reference data corresponding to each frame number in said speech segment;

according to the determined correspondence of frame numbers, using the feature vectors of the audio signals to calculate the distortion between each audio frame in said speech segment and its corresponding audio frame in the audio reference data, and determining the spatial distance between said speech segment and said audio reference data according to the calculated distortions.

6. The method of claim 5, characterized in that, for said path determined in the direction from the lower-left corner to the upper-right corner of said coordinate system, the slope at each intersection of the frame numbers marked on the vertical and horizontal axes does not exceed a first slope threshold and is not less than a second slope threshold, said first slope threshold being greater than the second slope threshold.

7. The method of claim 1 or 2, characterized in that the process of determining the start playback time of the video frame matching the video reference data comprises:

extracting the video frames contained in the played audio/video file;

determining the video frame matching said video reference data, and its start playback time, by performing image recognition processing on the extracted video frames and said video reference data.

8. The method of claim 1, characterized in that determining the audio/video synchronization of said audio/video file comprises:

determining the change in audio/video delay of said audio/video file when played at the target end relative to when played at the source end;

determining the corresponding audio/video synchronization quality grade or score according to the determined change in audio/video delay.

9. An audio/video synchronization detection system, characterized by comprising:

10. The system of claim 9, characterized in that, when obtaining the audio/video playback time difference of said audio/video file when played at the source end, said synchronization detection module determines, in the audio/video file played at the source end, the start playback time of the audio segment matching said audio reference data and the start playback time of the video frame matching said video reference data; and then determines the audio/video playback time difference of said audio/video file when played at the source end according to said start playback time of the audio segment matching the audio reference data and said start playback time of the video frame matching the video reference data in the audio/video file played at the source end.

11. The system of claim 9, characterized in that the process by which said audio identification module determines the speech segment matching said audio reference data comprises:

12. The system of claim 11, characterized in that, when the number of audio frames in a speech segment is not equal to the number of audio frames in the audio reference data, the process by which said audio identification module calculates the distance between the speech segment and said audio reference data is specifically:

13. The system of claim 10, characterized in that the process by which said video identification module determines the start playback time of the video frame matching the video reference data comprises:

14. The system of claim 9, characterized in that the determination by said synchronization detection module of the audio/video synchronization of said audio/video file comprises:

15. A speech detection method, characterized by comprising the following steps:

16. The method of claim 15, characterized in that said first amplitude threshold, second amplitude threshold, and zero-crossing-rate threshold are determined according to the short-time average magnitude distribution and the short-time average zero-crossing-rate distribution of the audio signals in speech sample data, wherein an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal, and, among signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold is not a speech signal.

17. A speech detection system, characterized by comprising:

18. The system of claim 17, characterized by further comprising:

a threshold setting module for determining said first amplitude threshold, second amplitude threshold, and zero-crossing-rate threshold according to the short-time average magnitude distribution and the short-time average zero-crossing-rate distribution of the audio signals in speech sample data, wherein an audio signal whose short-time average magnitude is above the first amplitude threshold is a speech signal, and, among signals whose short-time average magnitude is below the second amplitude threshold, an audio signal whose short-time average zero-crossing rate is lower than the zero-crossing-rate threshold is not a speech signal.