CN101980197B

CN101980197B - A multi-layer filter audio retrieval method and device based on long-term structural voiceprint

Info

Publication number: CN101980197B
Application number: CN201010524833XA
Authority: CN
Inventors: 刘刚; 王镪; 郭军
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2010-10-29
Filing date: 2010-10-29
Publication date: 2012-10-31
Anticipated expiration: 2030-10-29
Also published as: CN101980197A

Abstract

The embodiment of the invention discloses a sample-based audio retrieval method, namely a multi-layer filtering audio retrieval method based on long-term structural voiceprints, which can retrieve the complete information of an entire audio item from a recorded audio clip. The invention discloses a novel method for generating voiceprints that carry long-term structural information, and improves retrieval performance with a two-layer filtering method. The method comprises the following steps: extracting the voiceprint features of an input clip; processing them with a first-layer filter; calculating the confidence of the result; determining whether second-layer filtering is needed; and performing the second-layer filtering by expanding the query voiceprints. The invention also discloses a multi-layer filtering audio retrieval device based on long-term structural voiceprints. Experiments indicate that, for an audio library containing 10,000 songs, the embodiment reaches an accuracy of up to 99.7% when the query clip lasts 5 seconds and the signal-to-noise ratio is 0 dB.

Description

A multi-layer filtering audio retrieval method and device based on long-term structural voiceprints

Technical field

The invention belongs to the field of computer technology applications. It relates to a method and device for querying an audio database, and in particular to a content-based query-by-example audio retrieval method, that is, retrieving the complete information of an entire audio item from a recorded clip of the original audio.

Background technology

With the rapid development of modern information technology, particularly multimedia and network technology, a large amount of multimedia information can be obtained from the network. Audio files have become one of the objects most frequently searched for by users of search engines (for example Baidu and Google). Traditional audio information retrieval is mainly text-based, but text-based retrieval cannot satisfy users' needs. That is, if a user hears a familiar piece of audio and wants to look up information about the whole piece from a recording of a few seconds, this is still technically difficult.

At present, audio search services on the internet are essentially text search: they match text associated with the audio, such as keywords, and return results. Searching with a recorded audio clip requires content-based query-by-example audio retrieval, and existing audio retrieval technology still cannot satisfy this need. In recent years, content-based audio retrieval has become a research focus, and researchers from various fields have begun to explore this new technical challenge.

In content-based audio retrieval, querying with a recording of a few seconds is one of the most basic forms, namely query-by-example retrieval. The user inputs an audio clip or records a piece of audio through a microphone; the clip may contain various kinds of noise, and the system should still return the correct information about it.

Sample-based audio retrieval can usually be divided into two subproblems: 1) converting the query clip into a representative feature sequence that forms a voiceprint (a voiceprint is a feature sequence that can represent a piece of audio and can be used to build an index); and 2) searching the library for the candidate segments most similar to the feature sequence. Classical audio retrieval methods fall into two types: those based on local feature points and those based on global structural information. Methods based on local feature points generally look for typical feature points in the spectrum. For example, the British company Shazam extracts spectral peaks, combines the feature points into pairs, and uses the pairs as the voiceprint of the clip; at search time a hash index is built to achieve fast search. This approach does not need to keep the global information of the spectrum, its features are representative, and its noise robustness is strong; its drawback is that each voiceprint carries little information, so collisions are severe when the index is built. Methods based on global structural information keep the global information of the entire spectrum. They carry a large amount of information, but their noise robustness is weak and the information is less representative. For example, the method proposed by Philips Research in the Netherlands divides the spectrum between 300 and 2000 Hz into 33 non-overlapping sub-bands; each sub-band is finally represented by 0 or 1, and these 0/1 sequences form the voiceprint. At search time, the voiceprints are likewise used to build a hash table to speed up search.

These methods achieve fairly good results in small-scale applications, but many problems appear when the audio library is massive, such as severe index collisions and long search times. Because the extracted features carry too little information, collisions are severe when the index is built and search takes a long time. If the information per voiceprint is increased by forming voiceprints from pairs of feature points, index collisions decrease but voiceprint stability also decreases, and retrieval accuracy drops. In other words, voiceprint collision rate and stability are in conflict: a low collision rate inevitably reduces voiceprint stability.

Summary of the invention

In view of this, the purpose of the invention is to provide an audio retrieval method based on long-term structural voiceprints and multi-layer filtering that effectively resolves the conflict between voiceprint stability and collision rate. For massive audio databases, the invention can effectively improve the accuracy, efficiency, and noise robustness of audio retrieval.

To achieve this purpose, the invention adopts the following technical solution:

A multi-layer filtering audio retrieval method based on long-term structural voiceprints, characterized in that:

(1) stable features of the audio clip input by the user are extracted, for example spectral peak features;

(2) voiceprints with long-term structural information are generated from the feature points (a voiceprint, also called an audio fingerprint, is a feature sequence that can represent a piece of audio and can be used to build an index);

(3) the query passes through the first-layer filter: all voiceprints are used as search terms and looked up in the hash index to obtain intermediate candidate segments, the similarity of the intermediate results is calculated from the original spectral feature points, and the intermediate results are then sorted by similarity;

(4) the first-ranked candidate of the first-layer filter is given a confidence score; if it exceeds a predetermined threshold, the result is output, otherwise the method proceeds to step (5);

(5) the number of query voiceprints is expanded and the query enters the second-layer filter: more intermediate results are looked up in the index table, their similarity is calculated, and the results of the first- and second-layer filters are then sorted together by similarity;

(6) the audio segment information with the highest similarity is selected and returned to the user.

The audio database being queried is obtained through the following steps:

(1) stable features of the audio in the database are extracted, for example spectral peak features;

(2) voiceprints with long-term structural information are generated;

(3) all database voiceprints are used to build a hash index, in which the key is the voiceprint and the value is the name of the audio file containing the voiceprint together with the position of the voiceprint within that file.
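The index described in step (3) can be sketched as an ordinary hash map. This is a minimal illustration, not the patented implementation: the function names `build_hash_index` and `lookup`, the toy integer voiceprints, and the idea of returning an alignment offset (database position minus query position) are all assumptions made for the example.

```python
from collections import defaultdict

def build_hash_index(voiceprints_by_file):
    """Map each voiceprint to every (file name, position) where it occurs.

    voiceprints_by_file: dict of file name -> list of integer voiceprints,
    where the list index is taken as the position within the file.
    """
    index = defaultdict(list)
    for file_name, voiceprints in voiceprints_by_file.items():
        for position, vp in enumerate(voiceprints):
            index[vp].append((file_name, position))
    return index

def lookup(index, query_voiceprints):
    """Return (file name, offset) for every exact voiceprint hit.

    The offset (an assumption of this sketch) is where the query would
    start inside the file if the hit is a true match.
    """
    hits = []
    for q_pos, vp in enumerate(query_voiceprints):
        for file_name, db_pos in index.get(vp, []):
            hits.append((file_name, db_pos - q_pos))
    return hits
```

Each hit only nominates a candidate segment; the method still ranks candidates by a similarity computed on the original feature points.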

The invention also discloses an audio retrieval device based on long-term structural voiceprints and multi-layer filtering, comprising:

Audio database unit 101, i.e. the audio database that constitutes the query library;

Voiceprint construction unit 102, which extracts feature points and builds voiceprints from multiple feature points carrying long-term information;

Index construction unit 103, which, for the audio files in the audio library, builds a hash table index from all voiceprints, with the voiceprint as the key and the name of the audio file containing the voiceprint and the position within that file as the value;

Input unit 104, whose input is an original audio clip recorded in a complex environment;

Filter units 105 and 108, each comprising three steps: looking up intermediate candidate results in the hash index table, calculating the similarity of the intermediate results, and sorting the results by similarity. Units 105 and 108 differ in their input query voiceprints: the input of unit 105 is the original voiceprints of the query clip, while the input of unit 108 is the fault-tolerant voiceprints produced by query expansion;

Confidence calculation unit 106, which gives a confidence score to the output of the first-layer filter to estimate its reliability;

Query expansion unit 107, which expands the query voiceprints using fault-tolerant query expansion;

Retrieval result output unit 109, which outputs the retrieval result.

In the multi-layer filtering audio retrieval method based on long-term structural voiceprints provided by the invention, the voiceprints with long-term structural information used to build the index carry a large amount of information, so the index collision rate is low. Similarity is calculated on the original peak features, which are highly stable. Second-layer filtering is implemented through query voiceprint expansion with a fault-tolerance mechanism, which improves the stability of the voiceprints and significantly improves query speed and accuracy. With the method of the invention, for an audio database of 10,000 songs, a top-1 hit rate of 99.7% is reached when the query clip is 5 seconds long and the signal-to-noise ratio is 0 dB.

Description of drawings

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below.

Fig. 1 is the device block diagram of the embodiment of the invention.

Fig. 2 is a diagram of the construction of voiceprints with long-term structural information in this method.

Fig. 3 is a schematic diagram of the index-based filtering algorithm.

Fig. 4 is a flowchart of the multi-layer filtering audio retrieval method based on long-term structural voiceprints.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

As shown in Fig. 1, the device block diagram of the embodiment comprises the following:

For the audio data in the database (unit 101), features are extracted, voiceprints are built from multiple feature points carrying long-term structural information (unit 102), and the voiceprints are then used to build the database index (unit 103).

In the retrieval phase, for the input query clip (unit 104), features are extracted and voiceprints with long-term structural information are built (unit 102). The query passes through the first-layer filter (unit 105), which looks up intermediate candidates in the hash index table, calculates their similarity, and sorts the results by similarity. The initial result is then given a confidence score (unit 106), which determines whether the query goes through fault-tolerant query expansion (unit 107) into the second-layer filter (unit 108). Finally the result is output to the user (unit 109).

The multi-layer filtering audio retrieval method based on long-term structural voiceprints provided by the embodiment is explained below with reference to Figs. 2-4:

In content-based audio retrieval, the audio data is first processed to extract audio features. These features must be representative, so that they uniquely identify the audio, and robust to noise, so that they remain unchanged or change only slightly under environmental noise.

The most common audio data today are waveform files, generally in WAV format, and audio files in other formats are easily converted to WAV by software. In this embodiment, therefore, both the audio library and the user's recorded clips use the WAV format.

Both building the database index and querying use voiceprints, which are generated in the same way. The voiceprint generation process is explained first.

Voiceprint generation comprises two parts: feature extraction and voiceprint construction. Feature extraction proceeds as follows: the audio data is divided into overlapping frames, each frame is windowed and transformed to the frequency domain, and finally the spectral peak points are extracted from the frames.
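The feature extraction steps (overlapping frames, windowing, time-frequency transform, peak picking) can be illustrated with a small standard-library sketch. Several choices here are assumptions made for the example and are not specified in the text: the Hann window, the naive DFT, the frame length and hop, and keeping exactly one peak bin per frame. The function name `spectral_peaks` is hypothetical.

```python
import cmath
import math

def spectral_peaks(samples, frame_len=64, hop=32):
    """Return the index of the strongest frequency bin of each frame.

    hop = frame_len / 2 gives an overlap of 1/2 between frames;
    hop = frame_len / 4 would give an overlap of 3/4.
    """
    # Hann window (an assumed choice of window function).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    peaks = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + t] * window[t] for t in range(frame_len)]
        # Naive DFT magnitude for the non-negative frequency bins.
        mags = []
        for k in range(frame_len // 2):
            acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                      for t in range(frame_len))
            mags.append(abs(acc))
        peaks.append(max(range(len(mags)), key=mags.__getitem__))
    return peaks
```

A production system would use an FFT and a more selective peak picker; the naive DFT is used here only to keep the sketch dependency-free.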

For voiceprint construction, a method called anchor expansion is used: a voiceprint is built from multiple feature points (Fig. 2), which adds long-term structural information to the voiceprint. The construction formula is:

\mathrm{hash}(f_i, f_{i+1}, \ldots, f_{i+r-1}) = f_i + f_{i+1} \cdot n + \cdots + f_{i+r-1} \cdot n^{r-1} \qquad [1]

The above is the formula for building a voiceprint from r feature points, where f is an audio feature and n is the upper limit of the feature value range.

The anchor is the main feature point used to build a voiceprint; in formula 1, f_i is the anchor. The distance between feature points and the number of voiceprints formed from each anchor can be adjusted to suit the actual situation. Suppose the feature points are uniformly distributed, the maximum frequency value is n, a voiceprint consists of r feature points, every point serves as an anchor, and each anchor forms m voiceprints; the maximum voiceprint information is then m*n^r. If m = 1, n = 256, and r = 4, the maximum voiceprint information is 32 bits. This is a large amount of information, which greatly speeds up search once the index is built. When m is not equal to 1, m hash tables can be built to speed up search and reduce collisions. Because the database considered by the invention is massive, the collision rate of the voiceprints is given priority, and for each anchor the method adds 3 more points to build the voiceprint. During feature extraction, if the peak in some frequency band persists for a long time, the peak points extracted from several consecutive frames may be identical, so adjacent feature points are strongly correlated. To remove this correlation, feature points are taken at an interval of 2 when the voiceprint is built, using the following formula:

\mathrm{hash}(f_i, f_{i+3}, f_{i+6}, f_{i+9}) = f_i + f_{i+3} \cdot n + f_{i+6} \cdot n^2 + f_{i+9} \cdot n^3 \qquad [2]

In the formula above, f is the relative frequency of a feature point and n is the upper limit of the frequency value range. Voiceprints built this way collide very rarely, but the probability that a voiceprint matches correctly is the product of the probabilities that each of its feature points is correct, so the anchor expansion method inevitably makes the voiceprints less stable. The invention uses a distinctive search strategy to compensate for this weakness.
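Formula [2] amounts to packing four feature values, taken three positions apart, as digits of a base-n number. The sketch below illustrates this under the values given in the text (r = 4, n = 256, two skipped points between consecutive picks); the function names and the default arguments are assumptions for the example.

```python
def voiceprint(features, i, r=4, step=3, n=256):
    """Pack r features starting at anchor i, one every `step` positions.

    With step=3, two feature points are skipped between consecutive
    picks, as in formula [2]. Each feature must lie in range(n).
    """
    return sum(features[i + k * step] * n ** k for k in range(r))

def all_voiceprints(features, r=4, step=3, n=256):
    """Build one voiceprint per anchor, treating every point as an anchor."""
    last_anchor = len(features) - (r - 1) * step
    return [voiceprint(features, i, r, step, n)
            for i in range(max(last_anchor, 0))]
```

With n = 256 and r = 4 every voiceprint fits in an unsigned 32-bit integer, which matches the 32-bit figure quoted in the text.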

Considering both search efficiency and accuracy, the inventors use an optional two-layer filtering retrieval method. As shown in Fig. 4, the retrieval method consists of two filter layers, each comprising three steps: first, candidate segments are looked up by voiceprint; second, an exact similarity is calculated for each candidate segment; third, the candidates are sorted by similarity and the ranking is output. Because voiceprints are unstable, the exact similarity of the second step is calculated for every candidate segment matched by a voiceprint. The similarity is calculated on the original feature points, which are much more stable than voiceprints, and this removes the effect of voiceprint instability. The two layers differ in the number of input voiceprints, and therefore in search speed and accuracy. A confidence is calculated from the output of the first-layer filter; if it is low, the number of voiceprints is increased through voiceprint expansion and the second-layer filter produces a more accurate result. Experimental results show that when the query clip is severely affected by noise, the second-layer filter greatly improves the retrieval accuracy of the whole system.

The key points of the query filtering algorithm are described in detail below.

First, the algorithm of a single filter layer. The search algorithm is the same for both layers. For the audio files in the library, a hash table is built from all voiceprints: the voiceprint is the key, and the name of the audio file containing the voiceprint and the position within that file are the value. In the retrieval phase (Fig. 3), the voiceprints of the query clip are extracted; looking them up in the index yields the matching library voiceprints and their positions, from which the library segments corresponding to the query can be located. All of these segments are candidate segments. Because the voiceprints in this index carry a large amount of information, collisions are rare and lookup is fast. Suppose the library consists of 10,000 songs averaging 5 minutes each, a single feature point takes at most 256 values (8 bits), and a voiceprint consists of 4 feature points, so that a voiceprint carries 32 bits. Each voiceprint then corresponds to about 0.01 candidate segments on average; a 10-second recording yields about 300 voiceprints, so about 3 candidate segments are found. In practice the features are concentrated rather than uniformly distributed, so the number of candidate segments is tens of times larger, but the index still eliminates the vast majority of songs and keeps only a small number of candidate segments. Once the candidate segments are found, they are ranked: the original features that make up the voiceprints are used to calculate the similarity of each candidate segment, which yields accurate song information. The formula is:

S_j = 1 - \frac{\sum_{i=0}^{N} \min\left((q_i - d_i)^2,\ C\right)}{N \cdot C} \qquad [3]

Here S_j is the similarity of the j-th segment, q_i is a feature point of the query clip, d_i is the corresponding feature point of the library segment, N is the total number of features, and C is a fixed constant that limits the influence of noise; it can be set to an integer less than 3. Experiments show that introducing this constant greatly improves the retrieval performance of the system. Because this similarity is calculated on the original feature points, which are much more stable than voiceprints, the similarity obtained is more accurate and the ranked output is more reliable.
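A small sketch of formula [3] shows how the constant C caps the penalty from any single noisy feature point. It assumes the query and the candidate segment are already aligned and of equal length N, and uses C = 2 as one value consistent with "an integer less than 3"; the function name `similarity` is hypothetical.

```python
def similarity(query, candidate, c=2):
    """Clipped squared-difference similarity in the spirit of formula [3].

    Each feature contributes at most `c` to the penalty, so one badly
    corrupted point cannot dominate the score. Returns a value in [0, 1].
    """
    n = len(query)
    total = sum(min((q - d) ** 2, c) for q, d in zip(query, candidate))
    return 1 - total / (n * c)
```

For example, with the query [10, 20, 30, 40] and the candidate [10, 21, 30, 90], the large error on the last point is clipped to 2, giving 1 - (0 + 1 + 0 + 2) / 8 = 0.625; without clipping, that single point would drive the score far below zero.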

The search algorithm rests on one assumption: at least one voiceprint matches exactly. If this assumption holds, the only segments whose similarity needs to be calculated are the library segments matched by the voiceprints of the query clip. To check the validity of this assumption, the probability that at least one voiceprint is correct can be calculated with the following formula:

P = 1 - (1 - q^r)^n \qquad [4]

Here q is the probability that each feature point is correct, r is the number of feature points per voiceprint, and n is the total number of voiceprints extracted. If q = 0.4, r = 4, and the query clip is 10 seconds long, then n ≈ 300 and P is approximately 0.999. If q is very small, P is also very small, but in that case the exact similarity calculation would also struggle to find the correct result, so the algorithm is effective. In practice, r can be chosen according to the frame length, the index size, the stability of the features, and the speed requirement. When the data is massive, speed is given priority and r is set to 4.
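The worked example for formula [4] is easy to check numerically. The sketch below simply evaluates the formula; the function name is hypothetical, and the second input (q = 0.1) is an illustrative value not taken from the text.

```python
def p_at_least_one_match(q, r, n):
    """Formula [4]: probability that at least one of n voiceprints is
    entirely correct, when each of its r feature points is correct
    independently with probability q."""
    return 1 - (1 - q ** r) ** n

# Values from the text: q = 0.4, r = 4, n ~ 300.
p_text = p_at_least_one_match(0.4, 4, 300)
# Illustrative low-quality case: q = 0.1.
p_noisy = p_at_least_one_match(0.1, 4, 300)
```

With q = 0.4 a single voiceprint is correct only about 2.6% of the time (0.4^4 = 0.0256), yet across 300 voiceprints the chance that at least one survives is above 0.999. When q drops to 0.1, the same calculation gives a probability of only a few percent, which is the regime the second-layer filter is meant to address.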

Before deciding whether to enter the second-layer filter, a confidence is calculated for the first-layer result to estimate its reliability. There are many ways to calculate confidence; in this method the confidence of the output is calculated as:

C = \frac{S_1}{S_2} \qquad [5]

Here C is the confidence of the output (distinct from the constant C in formula 3), S_1 is the similarity of the first-ranked candidate, and S_2 is the similarity of the second-ranked candidate. If the confidence of the first-layer output is below a threshold, the query passes through the second-layer filter to obtain a more accurate result.
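The decision rule around formula [5] can be sketched as follows. The threshold value 1.5 is purely illustrative, since the text does not give one, and the handling of edge cases (no candidates, a single candidate, or S_2 = 0) is an assumption of this sketch rather than something the source specifies.

```python
def confidence(ranked_similarities):
    """Formula [5]: ratio of the best similarity to the runner-up.

    ranked_similarities must be sorted in descending order.
    Assumed edge cases: no candidates -> 0.0 (nothing to trust);
    a single candidate or a zero runner-up -> infinity.
    """
    if not ranked_similarities:
        return 0.0
    if len(ranked_similarities) == 1 or ranked_similarities[1] == 0:
        return float("inf")
    return ranked_similarities[0] / ranked_similarities[1]

def needs_second_layer(ranked_similarities, threshold=1.5):
    """True when the first-layer result is not confident enough."""
    return confidence(ranked_similarities) < threshold
```

A clear winner such as similarities 0.75 and 0.25 gives a confidence of 3.0 and is returned directly, while two near-equal candidates give a confidence close to 1 and trigger query expansion.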

If the query clip is severely affected by noise, it may happen that none of the voiceprints built from its feature points matches exactly. For this situation the invention proposes an enhanced search algorithm: voiceprints are formed from r-1 points, and a second index is built from these (r-1)-point voiceprints when the database index is created, for use by the second-layer filter. The second-layer algorithm is the same as the first; only the voiceprint structure and the index differ. If the confidence of the first-layer output is below a threshold, the query passes through the second-layer filter to obtain a more accurate result. Statistics show that the frequency values of erroneous feature points in a query clip generally fluctuate around the original frequency, and that a difference of exactly 1 is far more likely than any other value. The inventors therefore also propose a fault-tolerant query expansion algorithm, in which the second-layer filter shares the same index as the first-layer filter and only the voiceprints of the query clip are expanded. The number of voiceprints is increased by expanding the feature points of the query clip. This reduces the memory requirement, since only one index needs to be built, while still meeting the requirements for speed and accuracy. If every point is expanded to three values, i.e. the original value and the values 1 above and 1 below it, and a voiceprint consists of 4 points, then each original voiceprint yields 3^4 - 1 = 80 new voiceprints, i.e. 80 times the original number. The original voiceprints are not searched again; the similarity results of the first-layer filter and the results of the second-layer filter are simply sorted together, and the final result is output. In practice, only the feature points with low confidence need to be expanded. The feature confidence is calculated as:

F = \frac{\sum_{i=0}^{N-1} E_i}{N} \cdot \lambda \qquad [6]

Here E_i is the energy of a feature point, N is the total number of features, and λ is a coefficient that can be adjusted to control how many features are expanded. Because of the confidence threshold on the first-layer output, a query passes through both filter layers only when the audio clip is severely degraded, and in that case the two-layer filter greatly improves the performance of the whole system. With this query expansion algorithm, good performance is achieved at a small cost in time.
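The fault-tolerant expansion can be illustrated by enumerating every ±1 variant of a four-point voiceprint. This sketch expands all four points, whereas the text notes that in practice only low-confidence points need to be expanded; the function name, the base-n packing order, and the clamping of values to the range [0, n) are assumptions for the example.

```python
from itertools import product

def expand_voiceprint(points, n=256, delta=1):
    """Return the hashes of all variants in which each point moves by at
    most `delta`, excluding the original combination itself.

    points: the feature values that make up one voiceprint.
    Variants falling outside range(n) are dropped.
    """
    options = [[p + d for d in range(-delta, delta + 1) if 0 <= p + d < n]
               for p in points]
    original = tuple(points)
    return [sum(v * n ** k for k, v in enumerate(combo))
            for combo in product(*options)
            if combo != original]
```

For a voiceprint whose four points are all away from the range boundaries, this yields 3^4 - 1 = 80 expanded voiceprints, matching the factor of 80 stated in the text. All of them are looked up in the same hash index used by the first-layer filter.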

Statistics show that, for an original clip framed without overlap, when the framing is offset by exactly half a frame, about 1/4 of the peak points deviate. Feature extraction errors caused by inconsistent frame boundaries are referred to as the boundary effect. Because the boundary effect causes feature extraction errors, the overlap between frames should be as large as possible, i.e. the frame shift should be as small as possible, to reduce its influence. In this method, to limit the total size of the index while reducing the boundary effect as far as possible, the library audio uses an overlap of 1/2 and the query clip uses an overlap of 3/4. Because the overlaps differ, similarity is calculated with the following formula:

S_j = 1 - \frac{\sum_{i=0}^{N} \min\left((q_{2i-1} - d_i)^2,\ (q_{2i} - d_i)^2,\ C\right)}{N \cdot C} \qquad [7]

This formula has the same meaning as formula 3, except that each point in the audio library is compared with two points in the query.
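Formula [7] can be sketched in the same style as formula [3]. Because the query is framed with half the frame shift of the library audio, it has two feature points for every library point, and the better of the two is used. The sketch uses 0-based indexing, pairing library point i with query points 2i and 2i+1; this index offset, the value C = 2, and the function name are assumptions made for the example.

```python
def similarity_mixed_overlap(query, candidate, c=2):
    """Clipped similarity when the query has twice the frame rate.

    query:     features of the query clip (3/4 overlap), length 2 * N
    candidate: features of the library segment (1/2 overlap), length N
    Each library point is compared with its two query neighbours and the
    smaller squared difference, capped at `c`, is accumulated.
    """
    n = len(candidate)
    total = 0
    for i, d in enumerate(candidate):
        q_a, q_b = query[2 * i], query[2 * i + 1]
        total += min((q_a - d) ** 2, (q_b - d) ** 2, c)
    return 1 - total / (n * c)
```

Taking the minimum over two neighbouring query frames means that a peak displaced by the boundary effect in one frame can still be matched by the adjacent frame.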

Accordingly, the voiceprint of the query clip is calculated as follows:

\mathrm{hash}(f_i, f_{i+6}, f_{i+12}, f_{i+18}) = f_i + f_{i+6} \cdot n + f_{i+12} \cdot n^2 + f_{i+18} \cdot n^3 \qquad [8]

The retrieval process of the method is illustrated below using the flowchart in Fig. 4. As shown in the figure, the method mainly comprises the offline database index construction process in the left half and the online query process in the right half.

The overall flow has two parts: i) building the database index and ii) retrieving the query clip. In detail:

1. Offline database index construction:

For every song in the database (module 201), spectral peak feature points are first extracted (module 202), voiceprints are built from these peak points (module 203) (formula 1), and the information-rich voiceprints are used to build the database index (module 204).

2. Online retrieval of the query clip:

Step 1: For the query clip (module 206), spectral peak feature points are extracted (module 207) and voiceprints are generated from the peak points (module 208) (formula 8).

Step 2: The first-layer filter is applied: candidates are looked up (module 209) in the database index (module 205), the candidate similarity is calculated (module 210) (formula 7), and the candidates are sorted by similarity (module 211) to obtain the initial result.

Step 3: The initial result is given a confidence score (module 212) (formula 5). If the score exceeds a predetermined threshold (module 213), the result is output; otherwise the query voiceprints are expanded (module 214) and the method proceeds to step 4.

Step 4: The second-layer filter is applied, when required, to search again with the expanded voiceprints: candidates are looked up (module 215) and their similarity is calculated (module 216) (formula 7), the candidate results of both filter layers are sorted together by similarity (module 217), and a more reliable result is output (module 218).
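The offline and online flows above can be tied together in one compact sketch. It is a simplified illustration under stated assumptions, not the patented implementation: the query and the library use the same frame rate (so formulas [2] and [3] are used instead of [8] and [7]), every query point is expanded by ±1, the confidence threshold of 1.5 and the edge-case handling in `is_confident` are invented for the example, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

N, R, STEP, C = 256, 4, 3, 2  # value range, points per voiceprint, spacing, clip constant

def pack(points):
    """Base-N packing of feature points into one integer voiceprint."""
    return sum(p * N ** k for k, p in enumerate(points))

def anchors(feats):
    """Yield (position, R feature points spaced STEP apart) per anchor."""
    for i in range(len(feats) - (R - 1) * STEP):
        yield i, tuple(feats[i + k * STEP] for k in range(R))

def build_index(library):
    """Offline part: voiceprint -> list of (file name, position)."""
    index = defaultdict(list)
    for name, feats in library.items():
        for i, pts in anchors(feats):
            index[pack(pts)].append((name, i))
    return index

def similarity(query, segment):
    total = sum(min((q - d) ** 2, C) for q, d in zip(query, segment))
    return 1 - total / (len(query) * C)

def run_filter(keys, index, library, query):
    """One filter layer: look up candidates and score each one."""
    scores = {}
    for q_pos, key in keys:
        for name, db_pos in index.get(key, []):
            start = db_pos - q_pos
            if start < 0 or start + len(query) > len(library[name]):
                continue
            segment = library[name][start:start + len(query)]
            scores[(name, start)] = similarity(query, segment)
    return scores

def expanded_keys(query):
    """Fault-tolerant expansion: every +/-1 variant except the original."""
    for q_pos, pts in anchors(query):
        options = [[p + d for d in (-1, 0, 1) if 0 <= p + d < N] for p in pts]
        for combo in product(*options):
            if combo != pts:
                yield q_pos, pack(combo)

def is_confident(ranked, threshold):
    if not ranked:
        return False
    if len(ranked) == 1 or ranked[1] == 0:
        return True
    return ranked[0] / ranked[1] >= threshold

def retrieve(query, index, library, threshold=1.5):
    """Return the best (file name, start position), or None if no candidate."""
    first_keys = [(i, pack(pts)) for i, pts in anchors(query)]
    scores = run_filter(first_keys, index, library, query)
    ranked = sorted(scores.values(), reverse=True)
    if not is_confident(ranked, threshold):
        # Second layer: merge its candidates with the first-layer ones.
        scores.update(run_filter(expanded_keys(query), index, library, query))
    return max(scores, key=scores.get) if scores else None
```

In this sketch a clean excerpt is resolved by the first layer alone, while an excerpt in which every voiceprint contains a point shifted by 1 produces no first-layer hits and is recovered only after expansion, which is the behaviour the two-layer design is meant to provide.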

To verify the effectiveness of the method, the inventors took music retrieval as an example and built an audio library of 10,000 songs. 400 noisy clips of 5 seconds and 10 seconds, cut at random from songs in the library, were tested. The results are shown in the tables below:

Signal-to-noise ratio (dB)    -12      -9       -6       -3       0        3
Accuracy                      46%      82.00%   96.30%   98.20%   99.70%   99.70%
Average search time (s)       0.12     0.13     0.15     0.20     0.25     0.26

Table 1: Test results for 5-second clips

Signal-to-noise ratio (dB)    -12      -9       -6       -3       0        3
Accuracy                      54%      90.00%   97.20%   100.00%  100.00%  100.00%
Average search time (s)       0.19     0.32     0.38     0.45     0.58     0.63

Table 2: Test results for 10-second clips

As the tables show, the method achieves satisfactory query accuracy with average search times of a few hundred milliseconds.

Although the invention has been described above through embodiments, it admits many modifications and variations that do not depart from its spirit, and the appended claims are intended to cover them. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within its scope of protection.

Claims

1. A multi-layer filter audio retrieval method based on long-term structural voiceprint, characterized in that:

(1) Build an audio database, including extracting stable features of the audio database, generating voiceprints with long-term structural information, and using all database voiceprints to build a hash index. The key is the voiceprint, and the value is the name of the audio file where the voiceprint is located and the voiceprint position in the audio file;

(2) extracting the stable feature of the user input audio segment;

(3) Build a voiceprint with long-term structural information;

(4) After the first layer of filters, all voiceprints are used as search items to search the database index to obtain candidate intermediate results, and calculate the similarity of the intermediate results according to the original features, and then sort the intermediate results according to the similarity;

(5) Confidence scoring is performed on the first-ranked candidate result of the first-layer filter, and if it exceeds a predetermined threshold, the result is output, otherwise, go to step 6;

(6) Expand the query voiceprint, enter the second layer filter, search for more intermediate results according to the index table, and calculate the similarity of the intermediate results, and then sort the first and second layer filter results according to the similarity;

(7) According to the sorting result, select the audio segment information with the highest similarity and return it to the user.

2. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

The voiceprint construction method with long-term structural information uses multiple feature points to construct voiceprints. The construction formula is as follows:

\mathrm{hash}(f_i, f_{i+1}, \ldots, f_{i+r-1}) = f_i + f_{i+1} \cdot n + \cdots + f_{i+r-1} \cdot n^{r-1}

In the above formula, r is the number of feature points for constructing the voiceprint, f_i is the i-th audio feature extracted, and n is the upper limit of the value range of feature points.

3. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

A confidence calculation is performed on the output result of the first-layer filter to evaluate the credibility of that output. The formula for calculating the confidence of the output result in this method is as follows:

C = \frac{S_1}{S_2}

C is the confidence of the output result, S_1 is the similarity of the first candidate, and S_2 is the similarity of the second candidate.

4. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

The extended query voiceprint is to float several positions up and down for each feature point of the audio clip input by the user, so that the voiceprint of the input clip is expanded into multiple voiceprints, which are used as the query input for the second retrieval.

5. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

The filter algorithm includes three steps: 1. Find candidate intermediate results according to the database index table; 2. Calculate the similarity of the intermediate results; 3. Rank the intermediate results according to the similarity.

6. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

A selective two-layer filtering algorithm, that is, to score the confidence of the first-ranked candidate result of the first-layer filter. If it does not exceed the predetermined threshold value, it will enter the second-layer filtering through query expansion.

7. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

The audio segment input by the user is a recording segment, and the frame shift during feature extraction of the recording segment is half of the frame shift of the audio data in the database.

8. The multi-layer filter audio retrieval method based on long-term structural voiceprint according to claim 1, characterized in that:

As an alternative algorithm for the second-level filter, the second-level filter can use a more precise index structure.

9. A multi-layer filtering audio retrieval device based on long-term structural voiceprint, comprising:

(1) Offline database index construction module:

Audio database unit, i.e. constitute the audio database of the query library;

Voiceprint construction unit, which extracts audio data feature points and constructs voiceprints with multiple feature points with long-term structural information;

Index construction unit, which, for the audio files in the audio library, uses all the voiceprints to build a hash table index, in which the voiceprint is the key and the name of the audio file where the voiceprint is located and the position within that audio file are the values;

(2) Online query search module:

Input unit, the input is the original audio clip recorded in the complex environment;

Voiceprint construction unit, which extracts feature points and constructs voiceprints with multiple feature points with long-term structural information;

The filter unit includes three steps, which are: look up candidate intermediate results according to the hash index table, calculate the similarity of the intermediate results, and sort the results according to the similarity;

The confidence degree calculation unit performs confidence degree scoring on the output result of the first layer filter, and evaluates the degree of credibility;

The query expansion unit uses a fault-tolerant query expansion to expand the query voiceprint;

The search result output unit outputs the search result.