
CN103035247B - Method and device for operating on audio/video files based on voiceprint information - Google Patents


Info

Publication number
CN103035247B
CN103035247B (application CN201210518118.4A)
Authority
CN
China
Prior art keywords
audio
voiceprint
contact person
video file
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210518118.4A
Other languages
Chinese (zh)
Other versions
CN103035247A (en)
Inventor
杨帆
苏腾荣
李世全
马永健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Beijing Samsung Telecommunications Technology Research Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Samsung Telecommunications Technology Research Co Ltd, Samsung Electronics Co Ltd filed Critical Beijing Samsung Telecommunications Technology Research Co Ltd
Priority to CN201710439537.1A (published as CN107274916B)
Priority to CN201210518118.4A (published as CN103035247B)
Publication of CN103035247A
Application granted
Publication of CN103035247B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/632 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present invention discloses a method for operating on audio/video files based on voiceprint information, comprising the following steps: collecting the voiceprint information of a sound-producing target; and searching audio/video files according to the voiceprint information. The present invention also provides a terminal device. With the technical scheme proposed by the present invention, audio/video files can be classified according to the voiceprint of a particular contact. When the user wants to find the audio/video files that contain a particular contact, the files need not be played back and checked one by one; instead, they can be selected directly, which makes it convenient for the user to find audio/video files containing a specific person's voice. Furthermore, the method provided by the present invention can jump directly to the time node at which a certain contact speaks in the audio/video and start playback, thereby improving the user's search efficiency.

Description

Method and device for operating on audio/video files based on voiceprint information
Technical field
The present invention relates to the field of mobile device communication applications, and more particularly to a method and device for operating on audio and video on a terminal device according to the voiceprint of a particular contact.
Background art
The sound recorder or camera on an existing terminal device makes it convenient for the user to record audio and video files. As the performance of terminal devices improves, storage capacity increases, and the variety of multimedia applications grows, users can easily record or shoot a large number of audio/video files. However, when facing a large number of audio/video files, such as when the user needs to find all the audio/video files in which a certain particular contact is recorded, or to find and play a certain piece of specific information about a particular contact in a certain audio/video file, the files cannot be located quickly and there is no way to search. Only by playing back and checking the files one by one can the required file or segment be obtained.
In view of this, it is desirable to provide a method and terminal device that can quickly search and classify target audio/video files and locate the time points at which a particular contact appears in a file, so as to make it convenient for the user to find the files in which a specific person's voice or image is recorded.
Summary of the invention
In order to solve the above technical problem, the present invention enables the user to quickly find the files in which a specific person's voice or image is recorded.
An object of the present invention is to provide a method for operating on audio/video files based on voiceprint information, comprising the following steps: collecting the voiceprint information of a sound-producing target; and searching audio/video files according to the voiceprint information; wherein all the sounds recorded in the audio/video file are divided into a plurality of voice units, each voice unit contains the voice of only one of the sound-producing targets, and the time points of the sound-producing targets in the audio/video file are recorded.
Another object of the present invention is to provide a terminal device, including: a voiceprint extraction module for collecting the voiceprint information of a sound-producing target; and an execution module for searching audio/video files according to the voiceprint information; wherein all the sounds recorded in the audio/video file are divided into a plurality of voice units, each voice unit contains the voice of only one sound-producing target, and the time points of the sound-producing targets in the audio/video file are recorded.
The method and device provided by the present invention can quickly find the files in which a specific person's voice or image is recorded, thereby improving the user's search efficiency.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from the description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 shows a schematic flow chart according to an embodiment of the present invention;
Fig. 2 shows a schematic diagram of the terminal device interface before audio collection, according to an embodiment of the present invention;
Fig. 3 shows a flow chart of audio collection according to an embodiment of the present invention;
Fig. 4 shows a schematic diagram of the terminal device interface during audio collection, according to an embodiment of the present invention;
Fig. 5 shows a schematic diagram of the interface displayed by the terminal device after recorded video and audio files have been found, marked with the time points at which the voiceprint of a sound-producing target appears and/or ends in a file;
Fig. 6 shows a flow chart of viewing the contact media library through the terminal device, according to an embodiment of the present invention;
Fig. 7 shows a flow chart of recording a contact's voice according to an embodiment of the present invention;
Fig. 8 shows an overall structural diagram according to an embodiment of the present invention;
Fig. 9 shows a structural diagram according to an embodiment of the present invention.
Detailed description
Illustrative embodiments of the present invention will now be described in detail with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the specific embodiments set forth here; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the ideas, concepts, objects, designs, reference schemes, and scope of protection of the invention to those skilled in the art. The terms used in the detailed description of the specific illustrative embodiments shown in the accompanying drawings are not intended to limit the invention. In the drawings, the same reference numbers refer to the same elements.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used here may also include the plural forms. It should be further understood that the word "comprising" used in the specification of the present invention refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when we say an element is "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. In addition, "connected" or "coupled" as used here may include wireless connection or coupling. The wording "and/or" used here includes any unit of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless defined as here, will not be interpreted in an idealized or overly formal sense.
As shown in Fig. 1, the present invention provides a method for operating on audio/video files based on voiceprint information, comprising the following steps: S1, collecting the voiceprint information of a sound-producing target; and S2, searching audio/video files according to the voiceprint information.
For example, step S1 may be realized as follows. When contact X1 calls user Y, the terminal device opens its built-in recorder and records a segment of speech in which contact X1 talks alone (for example, recorded spoken speech with a length of 7-10 seconds), and extracts voiceprint information from it. Then, after the call ends, the terminal device generates a speaker model M1 from the recorded voiceprint information and stores the sample in the media library. Then, the terminal device associates the speaker model with the record of contact X1 in the address book.
For example, step S1 may also be realized as follows. When user Y takes his son X2 to the park, the terminal device opens the "record voiceprint sample" option in the address-book record of son X2 and records the voiceprint information of son X2. Then, after the recording stops, the terminal device generates a speaker model M2 from the recorded voiceprint information and stores the sample in the terminal memory. Then, the terminal device associates the speaker model with the file of contact X2 in the media library. Of course, it should be understood that "media library" is one expression for a stored set of multimedia files; it may also be expressed as a folder, file manager, media manager, video manager, audio manager, and so on. As shown in Fig. 5, when voiceprint information matching speaker models M1 and M2 is later encountered, the terminal device classifies and marks these video and audio files according to the specific objects (for example, "me" and "son"). After classified storage, information such as subject fields, folders, and media libraries of the corresponding categories can be generated.
Step S1 may also be realized through the following steps. Step S11: when a sound-producing target (for example, Zhang San) is selected in the address-book application, a "record voiceprint sample" option is provided on the display screen. Step S12: after the user clicks the option, the terminal device collects voiceprint information, and the speaker model generated from the voiceprint information is stored in the contact media library. Step S13: after the contact media library page is entered, the display screen shows the audio/video files that have been found. Therefore, collecting the voiceprint information of a sound-producing target includes: collecting voiceprint information when a certain sound-producing target is selected; and storing the collected voiceprint information.
Fig. 2 shows a schematic diagram of the terminal device interface before audio collection, according to an embodiment of the present invention. Fig. 3 shows a flow chart of audio collection according to an embodiment of the present invention. The audio collection flow comprises the following steps. Step 101: open the address book and select a particular contact in the phone directory. Step 102: through the "record voiceprint sample" option (as shown in Fig. 2), record the contact's voice (that is, collect the contact's voiceprint information). Step 103: after the recording is completed, model the contact's voice to generate a speaker model, and save the speaker model in the contact information. Therefore, collecting and storing voiceprint information includes: generating a speaker model according to the voiceprint information; and storing the speaker model in a local storage module.
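The enrollment flow above (steps 101-103) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the "speaker model" here is just a two-number log-energy statistic standing in for a real voiceprint model, and the address book is a plain dictionary.

```python
import numpy as np

def extract_voiceprint(samples, frame=400):
    """Toy stand-in for voiceprint modeling: mean and spread of
    per-frame log energy. A real system would use e.g. MFCC + GMM."""
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n], dtype=float).reshape(-1, frame)
    energy = np.log(np.mean(frames ** 2, axis=1) + 1e-9)
    return np.array([energy.mean(), energy.std()])

def enroll_contact(address_book, name, samples):
    """Steps 101-103: select the contact, record a sample, model the
    voice, and save the model in the contact's record."""
    model = extract_voiceprint(samples)
    entry = address_book.setdefault(name, {})
    entry["voiceprint_sample"] = model   # saved in the contact information
    return model

address_book = {}
rng = np.random.default_rng(0)
voice = rng.normal(0.0, 0.5, 16000)      # 1 s of fake audio at 16 kHz
m = enroll_contact(address_book, "Zhang San", voice)
print(address_book["Zhang San"]["voiceprint_sample"].shape)  # (2,)
```

The point of the sketch is only the data flow: record, model, attach the model to the contact record.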
The modeling process according to an embodiment of the present invention is as follows. The technique of identifying a speaker's identity using voiceprint information may be called speaker recognition (Speaker Recognition, SR), and the corresponding model may be called a speaker model (Speaker Model, SM). A speaker recognition system is generally modeled with the UBM-GMM method: a universal background model (Universal Background Model, UBM) is first trained from a large amount of training audio (more than one speaker), and a specific speaker is then modeled on the basis of this UBM by an adaptation method, yielding the speaker model (SM). Both the universal background model and the speaker model are generally constructed with Gaussian mixture models (Gaussian Mixture Model, GMM).
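The UBM-GMM adaptation step can be illustrated with a small numpy sketch. This is a simplified, assumed implementation of classic MAP mean adaptation (only the component means are adapted, with an assumed relevance factor r = 16); the patent's actual scheme combines eigenvoice, CMLLR, and SMAP methods, which are not reproduced here.

```python
import numpy as np

def gmm_posteriors(x, weights, means, variances):
    """Responsibilities of each diagonal-covariance GMM component
    for every feature frame in x (shape: frames x dims)."""
    log_p = (np.log(weights)
             - 0.5 * np.sum(np.log(2 * np.pi * variances)
                            + (x[:, None, :] - means) ** 2 / variances,
                            axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def map_adapt_means(x, weights, means, variances, r=16.0):
    """MAP-adapt only the UBM component means toward one speaker's
    features; components with more data move further from the UBM."""
    post = gmm_posteriors(x, weights, means, variances)    # (frames, K)
    n_k = post.sum(axis=0)                                 # soft counts
    ex_k = post.T @ x / np.maximum(n_k[:, None], 1e-9)     # per-comp. data means
    alpha = (n_k / (n_k + r))[:, None]                     # data/prior balance
    return alpha * ex_k + (1 - alpha) * means

# Toy 2-component, 2-dimensional UBM and 100 frames from one speaker.
ubm_w = np.ones(2) / 2
ubm_mu = np.array([[0.0, 0.0], [3.0, 3.0]])
ubm_var = np.ones((2, 2))
rng = np.random.default_rng(1)
frames = rng.normal([0.5, 0.5], 0.1, size=(100, 2))
adapted = map_adapt_means(frames, ubm_w, ubm_mu, ubm_var)
print(np.round(adapted, 2))  # component 0 moves toward the speaker; component 1 stays near the UBM
```

Components that receive little speaker data keep their background-model means, which is what makes the adapted model usable from only a 7-10 second sample.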
Fig. 4 shows a schematic diagram of the terminal device interface during audio collection, according to an embodiment of the present invention. For example, when the terminal device records a voiceprint sample from the address-book contact interface (as shown in Fig. 4), clicking the "add and record voiceprint sample" button records the contact's voice.
Further, as shown in Fig. 3, the voiceprint recognition flow comprises the following steps. Step 104: determine the audio/video file. Step 105: perform speaker segmentation on the speech in the audio/video file and generate n voice units, each containing the voice of only a single speaker. Step 106: perform contact voiceprint recognition on each segmented voice unit (that is, on the n voice units) and judge whether it matches. Step 107: if the recognition result is a match, the terminal device establishes a database of the correspondence between the contact and this audio/video file. Further, the correspondence database can record the audio/video files in which the contact's voice appears. Further, the correspondence database can also record the time points at which the contact's voice appears in the audio/video file; that is, the time points are mapped to the corresponding positions in the audio/video file.
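Steps 105-107 — taking the segmented voice units, matching each one against the contacts' speaker models, and recording the matches — can be sketched as follows. Everything here is illustrative: the features are 2-D vectors, the distance threshold is invented, and the file name and time points are made up.

```python
import numpy as np

def identify(unit_feature, speaker_models, threshold=1.0):
    """Match one voice unit against every contact's model; return the
    best-matching contact, or None when nothing is close enough."""
    best, best_d = None, threshold
    for name, model in speaker_models.items():
        d = float(np.linalg.norm(unit_feature - model))
        if d < best_d:
            best, best_d = name, d
    return best

def build_relation_db(filename, voice_units, speaker_models):
    """Steps 105-107: for each (start_second, feature) voice unit,
    record matching contacts together with file name and time point."""
    db = []   # rows of (contact, file, time point)
    for start, feat in voice_units:
        who = identify(feat, speaker_models)
        if who is not None:
            db.append((who, filename, start))
    return db

models = {"son": np.array([1.0, 0.0]), "me": np.array([0.0, 1.0])}
units = [(225, np.array([0.9, 0.1])),    # 3'45" — close to "son"
         (1103, np.array([0.1, 0.9])),   # 18'23" — close to "me"
         (2734, np.array([5.0, 5.0]))]   # unknown speaker, dropped
print(build_relation_db("childrens_day.mp4", units, models))
# → [('son', 'childrens_day.mp4', 225), ('me', 'childrens_day.mp4', 1103)]
```

Unmatched units are simply omitted, so the resulting rows are exactly the contact/file/time-point correspondences the description stores in the database.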
Fig. 6 shows a flow chart of viewing the contact media library through the terminal device, according to an embodiment of the present invention. The flow may comprise the following steps. Step 201: open the media library and select the "contact media library" menu. Step 202: start reading the contact and audio/video relationship database. Step 203: after reading is completed, display the contacts together with their corresponding media files and time points.
Fig. 5 shows a schematic diagram of the interface displayed by the terminal device after recorded video and audio files have been found, marked with the time points at which the voiceprint of a sound-producing target appears and/or ends in a file. For example, the media library is opened and the "contact media library" menu is selected, whereupon the interface for viewing the contact media library is presented to the user. The interface provides the items of information read from the contact and audio/video relationship database. Therefore, searching audio/video files according to voiceprint information includes: displaying the audio/video files when the local storage module is opened.
Further, as can be seen from the interface shown in Fig. 5, the media library of this embodiment contains two classes of media files, "son" and "me", wherein the "Children's Day" item in the "son" folder has three time points, namely 3'45", 18'23", and 45'34". These are exactly the time points at which the voice of "son" appears in the "Children's Day" item. For example, the user can select "3'45"", whereupon the terminal device automatically jumps to 3 minutes 45 seconds in the "Children's Day" item and starts playback. Therefore, storing the collected voiceprint information includes: storing by category according to the speaker models. Further, searching audio/video files according to the voiceprint information includes: displaying the audio/video files when the local storage module is opened. Further, the classification includes: displaying the audio/video files classified according to the speaker models. Further, the display includes: displaying the time points at which the sound-producing target appears in the audio/video file. Further, the classification includes: searching the audio/video files by category according to the kind of sound-producing target. Further, the time points include: when a time point in the classified display is selected, playing the audio/video of the sound-producing target contained in the audio/video file.
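The contact and audio/video relationship database behind this interface might look like the following sqlite3 sketch: one row per (contact, file, time point), queried when the "contact media library" page is opened. Table and column names are assumptions for illustration, not the patent's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE contact_media (
                    contact  TEXT,
                    file     TEXT,
                    time_sec INTEGER)""")
rows = [("son", "Children's Day", 3 * 60 + 45),   # 3'45"
        ("son", "Children's Day", 18 * 60 + 23),  # 18'23"
        ("son", "Children's Day", 45 * 60 + 34),  # 45'34"
        ("me", "Meeting notes", 62)]
conn.executemany("INSERT INTO contact_media VALUES (?, ?, ?)", rows)

def media_for(contact):
    """What the contact media library page reads: files and time
    points for one contact, ready for jump-to-time playback."""
    cur = conn.execute(
        "SELECT file, time_sec FROM contact_media "
        "WHERE contact = ? ORDER BY file, time_sec", (contact,))
    return cur.fetchall()

for f, t in media_for("son"):
    print(f"{f}: {t // 60}'{t % 60:02d}\"")
# prints the three "Children's Day" time points 3'45", 18'23", 45'34"
```

Selecting one row gives exactly the (file, time point) pair needed to jump playback to where the contact speaks.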
As shown in Figs. 1 to 6, according to another embodiment of the present invention, when the terminal device classifies audio/video files according to particular contacts, the voiceprints of the user's key contacts must first be modeled and stored in the address book module. The present invention adds a "voiceprint sample" field to each contact record, for storing the contact's voiceprint sample in the terminal device's address book module. The concrete operation is as follows. The user creates or edits an important contact of concern (for example, "child"). Then, a segment of audio of that particular contact ("child") is recorded (for example, normal speech with a length of 7-10 seconds). The terminal device models the voiceprint of the particular contact ("child") according to the sound sample, and saves it in the voiceprint sample field of that contact's address-book record. Then, the user records and saves audio/video files on the terminal device. The present invention can perform voiceprint analysis of important contacts, classify the files according to contact, and mark the time points at which each contact's voice occurs. Using speaker segmentation techniques, the voices of all speakers recorded in an audio/video file are extracted and divided into a plurality of voice units, each containing the voice of only one speaker. Voiceprint recognition is then performed on each voice unit using the speaker models. After voiceprint recognition, a database of contact and audio/video relationships is stored, which records the correspondence between contacts and audio/video files, and the time points at which each contact's voice occurs in each audio/video file.
The voiceprint mentioned in the present invention refers to the sound-wave spectrum of a user's voice, which is a biological characteristic of that voice. By comparing voiceprints, the mobile terminal can find the corresponding targets in the stored multimedia. Therefore, when the sound-producing target is a certain contact in the contact application, the method of collecting the voiceprint information of the sound-producing target includes: when talking with the contact on the phone, recording a segment of the contact's voice that is 7-10 seconds long and contains only that contact's voice, and using this segment to extract the voiceprint information and generate a voiceprint template. Further, when the sound-producing target is a certain contact in the contact application, collecting the voiceprint information of the sound-producing target includes: recording the contact's voiceprint information while talking with the contact. Further, when the sound-producing target is a certain contact in the contact application, collecting the voiceprint information of the sound-producing target includes: the user manually recording the contact's voice to capture the contact's voiceprint information. Further, when the sound-producing target is a certain contact in the contact application, searching the audio/video files includes: when the contact is selected, playing the audio/video mapped to that contact.
Fig. 7 shows a flow chart of recording a contact's voice according to an embodiment of the present invention. The flow includes the following steps. Step 301: open a certain contact in the address book. Step 302: judge whether this is the first recording.
If the judgment result is that it is the first recording, proceed to step 303: start recording. Step 304: save the audio after the recording is completed. Step 305: perform voiceprint modeling on the audio. Step 306: save the voiceprint modeling information. Step 307: recognize the existing audio/video files with this voiceprint information. Step 308: save the recognized files and time points into the contact and audio/video relationship database. Finally, step 309: the voiceprint recording job ends.
If the judgment result is that it is not the first recording, proceed to step 310: prompt the user and determine whether to record again. If re-recording is needed, proceed to step 311: delete the original recording file, then proceed to step 303, and perform steps 303 to 309 in sequence. If re-recording is not needed, no recording is performed and the process ends (step 309).
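The branching in Fig. 7 — keep the existing sample unless the user chooses to re-record, in which case the old recording is deleted first (step 311) — can be sketched as one small function. The file name and the capture callback are hypothetical, standing in for the device recorder.

```python
import os
import tempfile

def record_voiceprint(contact_dir, capture, re_record=False):
    """Steps 302-311: first recording writes a new sample; a later call
    keeps the old sample unless re_record is requested, in which case
    the original recording file is deleted before recording again."""
    path = os.path.join(contact_dir, "voiceprint_sample.wav")
    if os.path.exists(path):          # step 302: not the first recording
        if not re_record:
            return path               # keep the existing sample, end
        os.remove(path)               # step 311: delete original file
    with open(path, "wb") as f:       # steps 303-304: record and save
        f.write(capture())
    return path

d = tempfile.mkdtemp()
p1 = record_voiceprint(d, lambda: b"take-1")                  # first recording
p2 = record_voiceprint(d, lambda: b"take-2")                  # kept: no re-record
p3 = record_voiceprint(d, lambda: b"take-3", re_record=True)  # replaces take-1
print(open(p1, "rb").read())  # b'take-3'
```

In a real device the capture callback would invoke the recorder and steps 305-308 (modeling and database update) would follow each successful recording.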
According to another embodiment of the present invention, a method of classifying and marking video and audio on a terminal device based on voiceprint recognition technology comprises the following steps. The contact's voice is recorded in advance to obtain voiceprint information. Then, speaker segmentation is performed on the audio/video file, dividing it into a plurality of voice units, each containing the voice of only one speaker, and voiceprint recognition is performed on these voice units one by one. Then, the recognition results are saved into the contact and audio/video relationship database. When the contact media library is entered, or when the user performs a "classify by contact" or "search by contact" operation in any media library or file manager of the terminal device, or when the audio and video related to a contact are viewed directly in the contact application, the contact and audio/video relationship database is read and their relationships are displayed. The present invention can display the relationship between contacts and audio/video not only as a menu item in the media library, but also as menus in the contact or file manager.
Further, according to another embodiment of the present invention, in applications such as the terminal device's media library, contact manager, and file manager, "classify by contact" or "search by contact" can be selected to display and search audio and video by category. Further, according to another embodiment of the present invention, the audio/video related to a contact can be viewed directly in the contact application.
Therefore, the method for operating on audio/video files based on voiceprint information provided by the present invention can classify audio/video files according to the voiceprint information of a particular contact. When the user wants to find the audio/video files containing a particular contact, the files need not be played back and checked one by one; instead, the selection is made directly through the information displayed by the media library, contact manager, or file manager, which makes it convenient for the user to find the files containing a specific person's voice or image. Furthermore, the method can jump directly to the time node at which a certain contact speaks in the audio/video and start playback, thereby improving the user's search efficiency.
As shown in Figure 8, the overall scheme of the present invention uses voiceprint technology to identify a speaker's identity; this technology is called speaker recognition (Speaker Recognition, SR), and the corresponding model is called a speaker model (Speaker Model, SM). A speaker recognition system is generally built with the UBM-GMM method: a universal background model (Universal Background Model, UBM) is first trained from a large amount of training audio (from more than one speaker), and a specific speaker is then modeled on top of this UBM by an adaptation method, yielding the speaker model (SM). Both the universal background model and the speaker model are usually constructed as Gaussian mixture models (Gaussian Mixture Model, GMM). As shown in Figure 8, the method provided by the present invention for operating on audio/video files based on voiceprint information may include a modeling process and a recognition process. The modeling process may comprise the following steps. Step 1: obtain training audio; Step 2: silence detection; Step 3: voice segmentation; Step 4: feature extraction; Step 5: adaptation based on the universal background model; Step 6: generate the speaker model; Step 7: Z-norm processing based on impostor audio; Step 8: output the normalized speaker model. The recognition process may comprise the following steps. Step 1: detect the audio to be recognized; Step 2: silence detection; Step 3: voice segmentation; Step 4: feature extraction; Step 5: score calculation based on the normalized speaker model; Step 6: T-norm processing based on impostor audio; Step 7: decision; Step 8: output the recognition result. Here, the normalized speaker model and the impostor models together constitute the speaker model.
According to one embodiment of the present invention, the modeling process of the speaker model can generally be described in the following stages. 1. Feature extraction stage: using voice activity detection (Voice Activity Detection, VAD), effective speech is detected in the input audio, and the input audio is segmented into several speech segments according to the length of the silences between them; the speech features required for speaker recognition are then extracted from each segment. 2. UBM modeling stage: the universal background model (UBM) is computed from a large number of speech features extracted from the training audio. 3. SM modeling stage: using the universal background model and a small amount of speech features of the specific speaker, the model (SM) of that speaker is computed by an adaptation method. 4. SM normalization stage: to strengthen the anti-interference capability of the speaker model, after the speaker model has been built, the speech features of several impostor speakers are often used to perform a normalization operation on it, finally yielding the normalized speaker model (Normalized SM).
According to one embodiment of the present invention, the recognition process of speaker recognition can generally be described in the following stages. 1. Feature extraction stage: identical to the feature extraction stage of the modeling process. 2. Score calculation stage: the score of the input speech features is calculated using the speaker model. 3. Score normalization stage: the score obtained in the previous step is normalized using the normalized speaker model, and the final decision is made.
Furthermore, in the modeling and recognition processes described above, some steps can be implemented in different ways. 1. Voice activity detection in the feature extraction stage: the method used in this application first distinguishes silence from non-silence using the energy and fundamental-frequency information of the input audio, and then uses a support vector machine (Support Vector Machine, SVM) model to distinguish speech from non-speech within the non-silent portions. Once the speech portions have been determined, the input audio can be divided into several speech segments according to the length of the gaps between them. 2. Adaptation method for computing the speaker model from the universal background model: this application combines the eigenvoice (Eigenvoice) method, the constrained maximum likelihood linear regression (Constrained Maximum Likelihood Linear Regression, CMLLR) method, and the structured maximum a posteriori (Structured Maximum A Posterior, SMAP) method. 3. Speaker model normalization: this application uses the Z-Norm method. 4. Score normalization: this application uses the T-Norm method. The combination of Z-Norm and T-Norm is currently the most popular normalization approach in speaker recognition technology; the former is used in the modeling stage and the latter in the recognition stage.
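The Z-Norm and T-Norm steps described above can be sketched as follows. This is an illustrative sketch only, not the patent's implementation: the scoring functions are stand-ins for real GMM/UBM log-likelihood scorers, and the impostor utterances are synthetic placeholders.

```python
import numpy as np

def znorm_params(speaker_score_fn, impostor_utts):
    """Z-Norm (modeling stage): score a set of impostor utterances
    against the speaker model and keep the mean/std of those scores."""
    scores = np.array([speaker_score_fn(u) for u in impostor_utts])
    return scores.mean(), scores.std()

def znorm(raw_score, mu, sigma):
    # Normalize a test score with the speaker-dependent statistics.
    return (raw_score - mu) / sigma

def tnorm(raw_score, test_utt, impostor_score_fns):
    """T-Norm (recognition stage): score the *test* utterance against a
    cohort of impostor models and normalize with the cohort mean/std."""
    cohort = np.array([fn(test_utt) for fn in impostor_score_fns])
    return (raw_score - cohort.mean()) / cohort.std()

# Toy demonstration with stand-in scoring functions and synthetic audio.
rng = np.random.default_rng(0)
impostors = [rng.normal(size=20) for _ in range(5)]
speaker_score = lambda u: float(u.mean())          # placeholder scorer
mu, sigma = znorm_params(speaker_score, impostors)
test = rng.normal(loc=1.0, size=20)
z = znorm(speaker_score(test), mu, sigma)
imp_fns = [lambda u, c=c: float(u.mean() - c) for c in (0.5, 1.5, 2.5)]
t = tnorm(speaker_score(test), test, imp_fns)
```

In a real system, `speaker_score_fn` would be the log-likelihood of the utterance's features under the adapted speaker GMM, and the normalized Z-Norm/T-Norm score would be compared with a decision threshold.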
As shown in Figure 9, another object of the present invention is to provide a terminal device, including: a voiceprint extraction module for collecting the voiceprint information of an audible target; and an execution module for searching audio/video files according to the voiceprint information.
Further, the voiceprint extraction module includes: a voiceprint collection unit for collecting voiceprint information when a certain audible target is selected; and a voiceprint sample generation unit for generating a speaker model according to the voiceprint information.
Further, the device also includes: a storage module for storing the collected voiceprint information.
Further, the storage module is also used to: store the generated voiceprint sample (speaker model).
Further, the voiceprint extraction module includes: a target classification unit that performs classified storage according to the speaker model.
Further, the device also includes: a display that shows the audio/video files when the local storage module is opened.
Further, the display is used to: display the audio/video files by category according to the classification of the audible targets made by the target classification unit.
Further, the display is used to: display the time points at which the audible target appears in the audio/video files.
Further, the target classification unit is also used to: search the audio/video files by category according to the type of the audible target.
Further, the execution module is also used to: when a time point in the classified display is selected, play the audio/video of the audible target contained in the audio/video file.
Further, when the audible target is a contact in a contacts application, the voiceprint extraction module is used to: record the voiceprint information of the contact during a call with the contact.
Further, when the audible target is a contact in a contacts application, the voiceprint extraction module is used to: record the voiceprint information of the contact from a voice recording of the contact made manually by the user.
Further, when the audible target is a contact in a contacts application, the execution module is also used to: play the audio/video mapped to the contact when the contact is selected.
The method and device provided by the present invention can quickly locate files in which a specific person's voice or video has been recorded, thereby improving the user's search efficiency.
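As a rough illustration of the search operation described above — and not the patent's actual implementation — the following sketch scores pre-segmented voice units of each file against a target speaker model and a background model, and returns the time points of the units attributed to the target. The diagonal-Gaussian scorer, the file layout, and the threshold are all hypothetical stand-ins for the GMM-UBM machinery described earlier.

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Diagonal-Gaussian log-likelihood, a stand-in for the GMM/UBM
    scoring used in real speaker-recognition systems."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def search_files(files, speaker_mean, speaker_var, ubm_mean, ubm_var, threshold=0.0):
    """For each file, return the (start, end) time points of voice units
    whose likelihood ratio favors the target speaker over the background.
    `files` maps a file name to a list of (start_sec, end_sec, features)
    voice units, each assumed to contain a single speaker's voice."""
    hits = {}
    for name, units in files.items():
        times = [(s, e) for s, e, feats in units
                 if gaussian_loglik(feats, speaker_mean, speaker_var)
                  - gaussian_loglik(feats, ubm_mean, ubm_var) > threshold]
        if times:
            hits[name] = times
    return hits

# Toy demonstration: one file with one matching and one non-matching unit.
spk_mean, spk_var = np.zeros(3), np.ones(3)
ubm_mean, ubm_var = np.full(3, 5.0), np.ones(3)
files = {"memo.wav": [(0.0, 2.5, np.zeros(3)),      # near the speaker model
                      (2.5, 4.0, np.full(3, 5.0))]} # near the background model
print(search_files(files, spk_mean, spk_var, ubm_mean, ubm_var))
# → {'memo.wav': [(0.0, 2.5)]}
```

The returned time points correspond to the positions displayed by the terminal device, from which playback of the matched audio/video can begin.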
Those skilled in the art will appreciate that the present invention may involve devices for performing one or more of the operations described herein. The devices may be specially designed and manufactured for the required purposes, or may comprise known devices within a general-purpose computer that is selectively activated or reconfigured by a program stored in it. Such a computer program may be stored in a device-readable (e.g., computer-readable) storage medium, or in any type of medium suitable for storing electronic instructions and respectively coupled to a bus, the computer-readable medium including but not limited to any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), random access memory (RAM), read-only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, magnetic cards, or optical cards. A readable medium includes any mechanism for storing or transmitting information in a form readable by a device (e.g., a computer). For example, readable media include random access memory (RAM), read-only memory (ROM), magnetic disk storage media, optical storage media, flash memory devices, and signals propagated in electrical, optical, acoustic, or other forms (such as carrier waves, infrared signals, and digital signals).
Those skilled in the art will appreciate that the present invention has been described with reference to structure diagrams and/or block diagrams and/or flow diagrams of methods, systems, and computer program products according to embodiments of the invention. It should be understood that each block of these structure diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in one or more blocks of the structure diagrams and/or block diagrams and/or flow diagrams.
Those skilled in the art will appreciate that the steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention can be replaced, changed, combined, or deleted. Furthermore, other steps, measures, and schemes in the various operations, methods, and flows discussed in the present invention can also be replaced, changed, rearranged, decomposed, combined, or deleted. Furthermore, prior-art steps, measures, and schemes in the various operations, methods, and flows disclosed in the present invention can also be replaced, changed, rearranged, decomposed, combined, or deleted.
Exemplary embodiments of the invention are disclosed in the drawings and the description. Although specific terms are used, they are used in a generic and descriptive sense only, and not for purposes of limitation. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications can be made without departing from the principles of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention. The protection scope of the invention is defined by the claims.

Claims (23)

1. A method of operating on audio/video files based on voiceprint information, characterized by comprising the following steps:
collecting the voiceprint information of an audible target; and
searching audio/video files according to the voiceprint information, a terminal device displaying the time points at which the labeled voiceprint information of the audible target appears and/or ends in a file;
wherein all sounds recorded in the audio/video file are divided into multiple voice units, each voice unit containing the voice of only one audible target, and the time points of the audible target in the audio/video file are recorded, the positions at which the audio/video appears in the corresponding file being mapped through the time points.
2. The method according to claim 1, characterized in that collecting the voiceprint information of the audible target includes:
collecting voiceprint information when a certain audible target is selected; and
storing the collected voiceprint information.
3. The method according to claim 2, characterized in that collecting and storing the voiceprint information include:
generating a speaker model according to the voiceprint information; and
storing the speaker model in a local storage module.
4. The method according to claim 3, characterized in that storing the collected voiceprint information includes:
performing classified storage according to the speaker model.
5. The method according to claim 3, characterized in that searching audio/video files according to the voiceprint information includes:
displaying the audio/video files when the local storage module is opened.
6. The method according to claim 4, characterized in that the classification includes:
displaying the audio/video files by category according to the speaker model.
7. The method according to claim 6, characterized in that the classification includes:
searching the audio/video files by category according to the type of the audible target.
8. The method according to claim 6, characterized in that the time point includes:
when a time point in the classified display is selected, playing, from that time point, the audio/video of the audible target contained in the audio/video file.
9. The method according to claim 1, characterized in that, when the audible target is a contact in a contacts application, collecting the voiceprint information of the audible target includes:
recording the voiceprint information of the contact during a call with the contact.
10. The method according to claim 1, characterized in that, when the audible target is a contact in a contacts application, collecting the voiceprint information of the audible target includes:
recording the voiceprint information of the contact from a voice recording of the contact made manually by the user.
11. The method according to claim 1, characterized in that, when the audible target is a contact in a contacts application, searching the audio/video files includes:
playing the audio/video mapped to the contact when the contact is selected.
12. A terminal device, characterized by including:
a voiceprint extraction module for collecting the voiceprint information of an audible target;
an execution module for searching audio/video files according to the voiceprint information; and
a display for displaying the time points at which the labeled voiceprint information of the audible target appears and/or ends in a file;
wherein all sounds recorded in the audio/video file are divided into multiple voice units, each voice unit containing the voice of only one audible target, and the time points of the audible target in the audio/video file are recorded, the positions at which the audio/video appears in the corresponding file being mapped through the time points.
13. The terminal device according to claim 12, characterized in that the voiceprint extraction module includes:
a voiceprint collection unit for collecting voiceprint information when a certain audible target is selected; and
a voiceprint sample generation unit for generating a speaker model according to the voiceprint information.
14. The terminal device according to claim 13, characterized by also including:
a storage module for storing the collected voiceprint information.
15. The terminal device according to claim 14, characterized in that the storage module is also used to: store the speaker model.
16. The terminal device according to claim 13 or 15, characterized in that the voiceprint extraction module includes:
a target classification unit that performs classified storage according to the speaker model.
17. The terminal device according to claim 14, characterized in that the display shows the audio/video files when the local storage module is opened.
18. The terminal device according to claim 16, characterized in that the display is used to:
display the audio/video files by category according to the classification of the audible targets made by the target classification unit.
19. The terminal device according to claim 18, characterized in that the target classification unit is also used to:
search the audio/video files by category according to the type of the audible target.
20. The terminal device according to claim 18, characterized in that the execution module is also used to:
when a time point in the classified display is selected, play, from that time point, the audio/video of the audible target contained in the audio/video file.
21. The terminal device according to claim 12, characterized in that, when the audible target is a contact in a contacts application, the voiceprint extraction module is used to:
record the voiceprint information of the contact during a call with the contact.
22. The terminal device according to claim 12, characterized in that, when the audible target is a contact in a contacts application, the voiceprint extraction module is used to:
record the voiceprint information of the contact from a voice recording of the contact made manually by the user.
23. The terminal device according to claim 12, characterized in that, when the audible target is a contact in a contacts application, the execution module is also used to:
play the audio/video mapped to the contact when the contact is selected.
CN201210518118.4A 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information Active CN103035247B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710439537.1A CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information
CN201210518118.4A CN103035247B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210518118.4A CN103035247B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201710439537.1A Division CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Publications (2)

Publication Number Publication Date
CN103035247A CN103035247A (en) 2013-04-10
CN103035247B true CN103035247B (en) 2017-07-07

Family

ID=48022078

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710439537.1A Active CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information
CN201210518118.4A Active CN103035247B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201710439537.1A Active CN107274916B (en) 2012-12-05 2012-12-05 Method and device for operating audio/video file based on voiceprint information

Country Status (1)

Country Link
CN (2) CN107274916B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117665A (en) * 2013-08-14 2019-01-01 华为终端(东莞)有限公司 Realize method for secret protection and device
CN104123115B (en) * 2014-07-28 2017-05-24 联想(北京)有限公司 Audio information processing method and electronic device
CN104243934A (en) * 2014-09-30 2014-12-24 智慧城市信息技术有限公司 Method and device for acquiring surveillance video and method and device for retrieving surveillance video
TWI571120B (en) * 2014-10-06 2017-02-11 財團法人資訊工業策進會 Video capture system and video capture method thereof
CN104268279B (en) * 2014-10-16 2018-04-20 魔方天空科技(北京)有限公司 The querying method and device of corpus data
CN105828179A (en) * 2015-06-24 2016-08-03 维沃移动通信有限公司 Video positioning method and device
CN105022263B (en) * 2015-07-28 2018-03-27 广东欧珀移动通信有限公司 A kind of method and intelligent watch for controlling intelligent watch
CN106548793A (en) * 2015-09-16 2017-03-29 中兴通讯股份有限公司 Storage and the method and apparatus for playing audio file
CN105635452B (en) * 2015-12-28 2019-05-10 努比亚技术有限公司 Mobile terminal and its identification of contacts method
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN106095764A (en) * 2016-03-31 2016-11-09 乐视控股(北京)有限公司 A kind of dynamic picture processing method and system
CN106448683A (en) * 2016-09-30 2017-02-22 珠海市魅族科技有限公司 Method and device for viewing recording in multimedia files
CN107452408B (en) * 2017-07-27 2020-09-25 成都声玩文化传播有限公司 Audio playing method and device
CN108305636B (en) 2017-11-06 2019-11-15 腾讯科技(深圳)有限公司 A kind of audio file processing method and processing device
CN108074574A (en) * 2017-11-29 2018-05-25 维沃移动通信有限公司 Audio-frequency processing method, device and mobile terminal
CN108364663A (en) * 2018-01-02 2018-08-03 山东浪潮商用系统有限公司 A kind of method and module of automatic recording voice
CN108364654B (en) * 2018-01-30 2020-10-13 网易乐得科技有限公司 Voice processing method, medium, device and computing equipment
CN108319371A (en) * 2018-02-11 2018-07-24 广东欧珀移动通信有限公司 Play control method and related product
CN108920619A (en) * 2018-06-28 2018-11-30 Oppo广东移动通信有限公司 File display method and device, storage medium and electronic equipment
CN109446356A (en) * 2018-09-21 2019-03-08 深圳市九洲电器有限公司 A kind of multimedia document retrieval method and device
CN111091844A (en) * 2018-10-23 2020-05-01 北京嘀嘀无限科技发展有限公司 Video processing method and system
CN111462761A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Voiceprint data generation method and device, computer device and storage medium
CN111883139A (en) * 2020-07-24 2020-11-03 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for screening target speech
CN112153461B (en) * 2020-09-25 2022-11-18 北京百度网讯科技有限公司 Method and device for positioning sound production object, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1156871A (en) * 1995-11-17 1997-08-13 雅马哈株式会社 Personal information database system
CN1307589C (en) * 2001-04-17 2007-03-28 皇家菲利浦电子有限公司 Method and apparatus of managing information about a person
CN102238189A (en) * 2011-08-01 2011-11-09 安徽科大讯飞信息科技股份有限公司 Voiceprint password authentication method and system
CN102404278A (en) * 2010-09-08 2012-04-04 盛乐信息技术(上海)有限公司 Song requesting system based on voiceprint recognition and application method thereof

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6345252B1 (en) * 1999-04-09 2002-02-05 International Business Machines Corporation Methods and apparatus for retrieving audio information using content and speaker information
US8606579B2 (en) * 2010-05-24 2013-12-10 Microsoft Corporation Voice print identification for identifying speakers
CN102347060A (en) * 2010-08-04 2012-02-08 鸿富锦精密工业(深圳)有限公司 Electronic recording device and method
CN102655002B (en) * 2011-03-01 2013-11-27 株式会社理光 Audio processing method and audio processing equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1156871A (en) * 1995-11-17 1997-08-13 雅马哈株式会社 Personal information database system
CN1307589C (en) * 2001-04-17 2007-03-28 皇家菲利浦电子有限公司 Method and apparatus of managing information about a person
CN102404278A (en) * 2010-09-08 2012-04-04 盛乐信息技术(上海)有限公司 Song requesting system based on voiceprint recognition and application method thereof
CN102238189A (en) * 2011-08-01 2011-11-09 安徽科大讯飞信息科技股份有限公司 Voiceprint password authentication method and system

Also Published As

Publication number Publication date
CN107274916B (en) 2021-08-20
CN103035247A (en) 2013-04-10
CN107274916A (en) 2017-10-20

Similar Documents

Publication Publication Date Title
CN103035247B (en) Method and device for operating audio/video file based on voiceprint information
US10977299B2 (en) Systems and methods for consolidating recorded content
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
US20160283185A1 (en) Semi-supervised speaker diarization
US9058384B2 (en) System and method for identification of highly-variable vocalizations
CN105845129A (en) Method and system for dividing sentences in audio and automatic caption generation method and system for video files
CN103530432A (en) Conference recorder with speech extracting function and speech extracting method
JP2007519987A (en) Integrated analysis system and method for internal and external audiovisual data
CN106531159B (en) A mobile phone source identification method based on the spectral characteristics of equipment noise floor
Khan et al. A novel audio forensic data-set for digital multimedia forensics
CN113903363B (en) Violation behavior detection method, device, equipment and medium based on artificial intelligence
CN108831456B (en) Method, device and system for marking video through voice recognition
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN107507626A (en) A kind of mobile phone source title method based on voice spectrum fusion feature
CN106302987A (en) A kind of audio frequency recommends method and apparatus
CN109635151A (en) Establish the method, apparatus and computer equipment of audio retrieval index
Pandey et al. Cell-phone identification from audio recordings using PSD of speech-free regions
CN109817223A (en) Phoneme marking method and device based on audio fingerprints
Baskoro et al. Analysis of voice changes in anti forensic activities case study: voice changer with telephone effect
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
US9430800B2 (en) Method and apparatus for trade interaction chain reconstruction
CN118284932A (en) Method and apparatus for performing speaker segmentation clustering on mixed bandwidth speech signals
Li et al. BlackFeather: A framework for background noise forensics
Cornaggia-Urrigshardt et al. SCALA-Speech: An interactive system for finding and analyzing speech content in audio data
Cano et al. Robust sound modelling for song identification in broadcast audio

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant