This application is a divisional application of Chinese patent application No. 201210525735.7, filed on December 7, 2012 and entitled "Method and apparatus for discovering hot videos in real time based on user query logs".
Summary of the Invention
In view of the problems of the prior art, an object of the present invention is to provide a method for discovering hot videos in real time based on user query logs, characterized by comprising the following steps:
S1: inputting the user video query logs collected over a period of time into a word segmentation program, segmenting each user video query log entry to obtain the segmentation result of every user query, and taking the distinct words that make up the segmentation results as atomic words;
S2: counting the number of times each atomic word occurs in the user video query logs within said period of time;
S3: counting, according to the segmentation results obtained in step S1, the number of times any two atomic words appear together in the same user query;
S4: calculating, from the counts obtained in steps S2 and S3 and using the method of pointwise mutual information (PMI), the degree of association between any two atomic words in the user video query logs;
S5: merging any two atomic words whose degree of association calculated in step S4 exceeds a certain threshold into a compound word, and putting the compound word into a compound-word vocabulary;
S6: sorting the compound words in the compound-word vocabulary in descending order according to the number of times the atomic words composing each compound word occur in the user video query logs within said period of time, and finally returning the top-ranked compound words, taken by a certain proportion, as the keywords for discovering hot videos in real time.
Further, the method of the present invention for discovering hot videos in real time based on user query logs is characterized in that the pointwise mutual information (PMI) of step S4 is calculated as follows:
For any two atomic words A and B, their degree of association is expressed as
$$\mathrm{PMI}(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)}$$
where P(A, B) represents the number of times A and B appear together in the same user video query log entry, and P(A) and P(B) respectively represent the number of times A and B occur in the user video query logs within said period of time.
Further, the method of the present invention for discovering hot videos in real time based on user query logs is characterized in that the counts are calculated using maximum likelihood estimation.
In addition, the present invention provides a device for discovering hot videos in real time based on user query logs, characterized by comprising the following modules:
a word segmentation module, for inputting the user video query logs collected over a period of time into a word segmentation program, segmenting each user video query log entry to obtain the segmentation result of every user query, and taking the distinct words that make up the segmentation results as atomic words;
an atomic-word occurrence counting module, for counting the number of times each atomic word occurs in the user video query logs within said period of time;
an atomic-word co-occurrence counting module, for counting, according to the segmentation results obtained by the atomic-word occurrence counting module, the number of times any two atomic words appear together in the same user query;
an association degree calculation module, for calculating, from the counts obtained by the atomic-word occurrence counting module and the atomic-word co-occurrence counting module and using the method of pointwise mutual information (PMI), the degree of association between any two atomic words in the user video query logs;
a compound word generation module, for merging any two atomic words whose degree of association calculated by the association degree calculation module exceeds a certain threshold into a compound word, and putting the compound word into a compound-word vocabulary;
a hot keyword determination module, for sorting the compound words in the compound-word vocabulary in descending order according to the number of times the atomic words composing each compound word occur in the user video query logs within said period of time, and finally returning the top-ranked compound words, taken by a certain proportion, as the keywords for discovering hot videos in real time.
Further, the device of the present invention for discovering hot videos in real time based on user query logs is characterized in that the pointwise mutual information (PMI) in the association degree calculation module is calculated as follows:
For any two atomic words A and B, their degree of association is expressed as
$$\mathrm{PMI}(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)}$$
where P(A, B) represents the number of times A and B appear together in the same user video query log entry, and P(A) and P(B) respectively represent the number of times A and B occur in the user video query logs within said period of time.
Further, the device of the present invention for discovering hot videos in real time based on user query logs is characterized in that the counts are calculated using maximum likelihood estimation.
The present invention applies pointwise mutual information, a concept from information theory, to the analysis of user query logs, and thereby solves the problems of inaccurate word segmentation and of discovering real-time video hot topics that are caused by the continual emergence of new words. The invention not only has a rigorous theoretical basis but is also simple and efficient to implement in engineering, and it effectively avoids the combinatorial explosion that results from a cascading approach (i.e., exhaustively enumerating combinations of any two or more words). The method allows video hot-topic discovery to be fully automated, without manual participation, and ensures relatively high accuracy while greatly improving efficiency.
Detailed Description of the Embodiments
To make the above objects, features and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments:
Because real-time hot topics are characterized by a large search volume within a short time, analyzing the latest user query logs is the most likely way to find new hot words and hot events, and thereby to improve how quickly the search ranking results respond to real-time trends. Fig. 1 is a schematic diagram of the principle of the method of the present invention for discovering hot videos in real time based on user query logs. As shown in Fig. 1, the invention inputs the user query logs of a period of time into a word segmentation program and obtains the segmentation result of every user query; the words extracted here are referred to as atomic words. Then, on this basis, the word frequencies of the atomic words and their co-occurrence counts (i.e., the number of times two words appear together in the same user query) are counted, and, using the calculation method of pointwise mutual information (PMI), two or more atomic words that are semantically closely associated are combined into a compound word, so that a new vocabulary is generated iteratively. Finally, the words in the new vocabulary are sorted by word frequency, and hot words and hot events are found automatically.
Fig. 2 is a flow chart of the method of the present invention for discovering hot videos in real time based on user query logs. As shown in the figure, the method comprises the following steps:
S1: inputting the user video query logs collected over a period of time into a word segmentation program, segmenting each user video query log entry to obtain the segmentation result of every user query, and taking the distinct words that make up the segmentation results as atomic words;
The word segmentation program segments each query by forward maximum matching against an existing vocabulary.
For example, if a user enters the query "also pearl sound of laughing theme song", the word segmentation program returns the result "also pearl | sound of laughing | theme song"; that is, the query contains three atomic words: "also pearl", "sound of laughing" and "theme song".
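The forward maximum matching mentioned above can be sketched as follows. This is a minimal illustration, not the segmentation program used by the invention: the function name `forward_max_match`, the `max_len` parameter, the fallback of emitting an unknown character as its own token, and the skipping of spaces (needed only because the example query is shown here in its English rendering) are all assumptions.

```python
def forward_max_match(query, vocabulary, max_len=None):
    """Greedy left-to-right segmentation: at each position take the longest
    vocabulary entry that starts there (hypothetical helper)."""
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(query):
        if query[i].isspace():
            # Assumption: spaces only separate words and are never tokens.
            i += 1
            continue
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(query), i + max_len), i, -1):
            if query[i:j] in vocabulary:
                match = query[i:j]
                break
        if match is None:
            # Assumption: an out-of-vocabulary character becomes its own token.
            match = query[i]
        tokens.append(match)
        i += len(match)
    return tokens


vocab = {"also pearl", "sound of laughing", "theme song"}
print(forward_max_match("also pearl sound of laughing theme song", vocab))
```

Because the longest candidate is always tried first, a vocabulary that contains both a short word and a longer word beginning with it will yield the longer one.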
S2: counting the number of times each atomic word occurs in the user video query logs within said period of time;
For example, the statistics obtained from one day of user query logs are: "also pearl" occurs 61,661 times, "sound of laughing" occurs 65,564 times, and "theme song" occurs 306,050 times.
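Step S2 can be sketched as a simple counting pass over the segmented queries. The sketch below assumes that each query has already been segmented into a list of atomic words and that a word is counted at most once per query, which matches the definition of freq(A) as the number of queries containing A; the function name `count_atomic_words` is hypothetical.

```python
from collections import Counter


def count_atomic_words(segmented_queries):
    """Return, for each atomic word, the number of queries that contain it."""
    freq = Counter()
    for tokens in segmented_queries:
        # set() ensures a word repeated inside one query is counted once.
        freq.update(set(tokens))
    return freq


toy_queries = [  # made-up data for illustration only
    ["also pearl", "sound of laughing", "theme song"],
    ["also pearl", "sound of laughing"],
    ["theme song"],
]
print(count_atomic_words(toy_queries))
```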
S3: counting, according to the segmentation results obtained in step S1, the number of times any two atomic words appear together in the same user query;
For example, the statistics obtained from one day of user query logs are: "also pearl" and "sound of laughing" co-occur 60,245 times, and "sound of laughing" and "theme song" co-occur 1,505 times.
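The co-occurrence count of step S3 can be sketched in the same style. The sketch assumes segmented queries as input and stores each unordered pair once under an alphabetically sorted key; the function name `count_cooccurrences` and that key convention are illustrative choices, not taken from the source.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(segmented_queries):
    """Return, for each unordered pair of atomic words, the number of
    queries in which both words appear."""
    pair_freq = Counter()
    for tokens in segmented_queries:
        # Sorting the distinct words gives every pair one canonical key.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_freq[(a, b)] += 1
    return pair_freq


toy_queries = [  # made-up data for illustration only
    ["also pearl", "sound of laughing", "theme song"],
    ["also pearl", "sound of laughing"],
]
print(count_cooccurrences(toy_queries)[("also pearl", "sound of laughing")])
```

Enumerating all pairs inside a query is quadratic in the query length, which stays cheap because search queries contain only a few words.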
S4: calculating, from the counts obtained in steps S2 and S3 and using the method of pointwise mutual information (PMI), the degree of association between any two atomic words in the user video query logs;
The method of pointwise mutual information (PMI) is used to characterize the degree of association between two words in the user query logs. The basic idea of the method is described below.
Calculation of pointwise mutual information
PMI is a classical concept in information theory, used to measure the correlation between two random events. We consider PMI equally applicable to calculating the degree of association between two words in video search. Intuitively, if analysis of the user query logs shows that two words co-occur in the same query many times, the two words are very likely to be mergeable into one compound word. The specific calculation of PMI is given below.
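For any two words A and B, their degree of association is expressed as
$$\mathrm{PMI}(A, B) = \log \frac{P(A, B)}{P(A)\,P(B)} \tag{1}$$
where P(A, B) is the probability that A and B co-occur, estimated from their co-occurrence count, and P(A) and P(B) are the probabilities that A and B occur, estimated from their respective occurrence counts.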
From the above formula, if A and B are independent, the value of PMI(A, B) is 0; if A and B are associated (here meaning that they co-occur), then PMI(A, B) > 0, and the higher the degree of association, the larger the PMI value.
When maximum likelihood estimation is used to estimate the parameters from counts, formula (1) is equivalent to
$$\mathrm{PMI}(A, B) = \log \frac{\mathrm{freq}(A, B) \cdot |Q|}{\mathrm{freq}(A) \cdot \mathrm{freq}(B)} \tag{2}$$
where freq(A, B) represents the number of user queries that contain both A and B, freq(A) and freq(B) respectively represent the number of user queries that contain A and B, and |Q| represents the total number of user queries within the period of time.
By calculating PMI, we can assign a numerical value to any pair of words to represent their degree of association and, on this basis, easily compare the associations between words and generate compound words.
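For example, if the total number of user queries counted in one day is 42,567,550, then substituting the counts from the examples of steps S2 and S3 into formula (2) gives:
$$\mathrm{PMI}(\text{"also pearl"}, \text{"sound of laughing"}) = \log \frac{60{,}245 \times 42{,}567{,}550}{61{,}661 \times 65{,}564}$$
$$\mathrm{PMI}(\text{"sound of laughing"}, \text{"theme song"}) = \log \frac{1{,}505 \times 42{,}567{,}550}{65{,}564 \times 306{,}050}$$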
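The count-based form of PMI in formula (2) can be evaluated directly from the example counts given for steps S2 and S3. The sketch below is illustrative only: the function name `pmi` and its parameters are hypothetical, and the base of the logarithm is an assumption, since the text does not state it (the natural logarithm is used by default, and another base can be passed in).

```python
import math


def pmi(freq_ab, freq_a, freq_b, total_queries, log=math.log):
    """Count-based PMI: log(freq(A,B) * |Q| / (freq(A) * freq(B))).
    The logarithm base is an assumption; pass math.log2 for base 2."""
    return log(freq_ab * total_queries / (freq_a * freq_b))


Q = 42_567_550  # total user queries in one day, from the example

# Counts taken from the examples of steps S2 and S3.
pmi_pearl_laugh = pmi(60_245, 61_661, 65_564, Q)
pmi_laugh_theme = pmi(1_505, 65_564, 306_050, Q)
print(pmi_pearl_laugh, pmi_laugh_theme)
```

With either the natural logarithm or base 2, the first pair scores well above the second, which is the ordering the merging step relies on.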
S5: merging any two atomic words whose degree of association calculated in step S4 exceeds a certain threshold into a compound word, and putting the compound word into a compound-word vocabulary;
For example, if the mean PMI value calculated over one day of user query logs, 3.83, is chosen as the threshold, then in the example above "also pearl" and "sound of laughing" have a degree of association higher than the threshold and can therefore be combined into the compound word "also pearl sound of laughing", whereas "sound of laughing" and "theme song" cannot be merged because their degree of association is too low.
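The thresholding of step S5 can be sketched as a filter over a table of PMI scores. The names `mean_threshold` and `merge_above_threshold`, the dictionary layout, and joining the two words with a space are assumptions made for illustration; the PMI scores in the usage example are placeholders, not values from the source.

```python
from statistics import mean


def mean_threshold(pmi_scores):
    """Use the mean of all PMI values as the threshold, as in the example."""
    return mean(pmi_scores.values())


def merge_above_threshold(pmi_scores, threshold):
    """Map each compound word to the pair of atomic words it was built from,
    keeping only pairs whose PMI exceeds the threshold."""
    compound_vocabulary = {}
    for (a, b), score in pmi_scores.items():
        if score > threshold:
            # Assumption: the compound is the two words joined by a space.
            compound_vocabulary[a + " " + b] = (a, b)
    return compound_vocabulary


hypothetical_scores = {  # placeholder PMI values for illustration only
    ("also pearl", "sound of laughing"): 6.0,
    ("sound of laughing", "theme song"): 1.0,
}
print(merge_above_threshold(hypothetical_scores, threshold=3.83))
```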
S6: sorting the compound words in the compound-word vocabulary in descending order according to the number of times the atomic words composing each compound word occur in the user video query logs within said period of time, and finally returning the top-ranked compound words, taken by a certain proportion, as the keywords for discovering hot videos in real time.
For example, by analyzing the user query logs of one day, about 150,000 new hot words are discovered, of which the highest-ranked include "distorting the truth by despicable means" (584,435 times), "happy base camp" (485,773 times) and "Must Be yours" (476,852 times).
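The ranking in step S6 can be sketched as a descending sort followed by a cut at a given proportion. The function name `top_hot_keywords`, the `proportion` parameter, rounding the cut up, and the assumption that each compound word already carries a single occurrence count are illustrative choices; the counts in the usage example are the three quoted above.

```python
import math


def top_hot_keywords(compound_counts, proportion):
    """Sort compound words by count, descending, and keep the leading
    `proportion` of them (rounded up)."""
    ranked = sorted(compound_counts.items(), key=lambda kv: kv[1], reverse=True)
    keep = math.ceil(len(ranked) * proportion)
    return [word for word, _ in ranked[:keep]]


counts = {
    "happy base camp": 485_773,
    "Must Be yours": 476_852,
    "distorting the truth by despicable means": 584_435,
}
print(top_hot_keywords(counts, proportion=0.5))
```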
Fig. 3 is a functional block diagram of the device of the present invention for discovering hot videos in real time based on user query logs. As shown in the figure, the device comprises the following modules:
a word segmentation module 1, for inputting the user video query logs collected over a period of time into a word segmentation program, segmenting each user video query log entry to obtain the segmentation result of every user query, and taking the distinct words that make up the segmentation results as atomic words;
an atomic-word occurrence counting module 2, for counting the number of times each atomic word occurs in the user video query logs within said period of time;
an atomic-word co-occurrence counting module 3, for counting, according to the segmentation results obtained by the atomic-word occurrence counting module, the number of times any two atomic words appear together in the same user query;
an association degree calculation module 4, for calculating, from the counts obtained by the atomic-word occurrence counting module and the atomic-word co-occurrence counting module and using the method of pointwise mutual information (PMI), the degree of association between any two atomic words in the user video query logs;
a compound word generation module 5, for merging any two atomic words whose degree of association calculated by the association degree calculation module exceeds a certain threshold into a compound word, and putting the compound word into a compound-word vocabulary;
a hot keyword determination module 6, for sorting the compound words in the compound-word vocabulary in descending order according to the number of times the atomic words composing each compound word occur in the user video query logs within said period of time, and finally returning the top-ranked compound words, taken by a certain proportion, as the keywords for discovering hot videos in real time.
The present invention applies pointwise mutual information, a concept from information theory, to the analysis of user query logs, and thereby solves the problems of inaccurate word segmentation and of discovering real-time video hot topics that are caused by the continual emergence of new words. The invention not only has a rigorous theoretical basis but is also simple and efficient to implement in engineering, and it effectively avoids the combinatorial explosion that results from a cascading approach (i.e., exhaustively enumerating combinations of any two or more words). The method allows video hot-topic discovery to be fully automated, without manual participation, and ensures relatively high accuracy while greatly improving efficiency. The proposed method was tested on about 50 million user video query log entries collected in one day from a certain video website; after six iterations of PMI calculation, a total of 150,000 compound words were obtained automatically, with an accuracy of more than 85%.
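The iterative generation of compound words described above can be sketched as a loop that counts, scores and merges, then repeats on the re-segmented queries. The text does not spell out how an iteration feeds the next one, so the following is only one plausible reading under stated assumptions: only adjacent word pairs are merged, a merged pair is treated as a single atomic word in the next round, the natural logarithm is used, and the loop stops early when no pair exceeds the threshold. The function name `discover_compounds` and the toy data are hypothetical.

```python
import math
from collections import Counter


def discover_compounds(segmented_queries, threshold, iterations=6):
    """Repeatedly merge adjacent word pairs whose count-based PMI exceeds
    `threshold`; returns compound word -> number of queries containing
    the pair when it was merged."""
    queries = [list(q) for q in segmented_queries]
    total = len(queries)
    compounds = Counter()
    for _ in range(iterations):
        word_freq, pair_freq = Counter(), Counter()
        for tokens in queries:
            word_freq.update(set(tokens))
            # Assumption: only adjacent pairs are merge candidates.
            pair_freq.update(set(zip(tokens, tokens[1:])))
        to_merge = {
            (a, b) for (a, b), n in pair_freq.items()
            if math.log(n * total / (word_freq[a] * word_freq[b])) > threshold
        }
        if not to_merge:
            break
        for a, b in to_merge:
            compounds[a + " " + b] = pair_freq[(a, b)]
        # Re-segment: replace each merged pair by a single token.
        new_queries = []
        for tokens in queries:
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in to_merge:
                    merged.append(tokens[i] + " " + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_queries.append(merged)
        queries = new_queries
    return compounds


toy_queries = [  # made-up data for illustration only
    ["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["c", "d"],
    ["e"], ["f"], ["g"], ["h"],
]
print(discover_compounds(toy_queries, threshold=0.9))
```

Because merged pairs become single tokens, later rounds can join a compound with a further word, which is how compounds of more than two atomic words could arise under this reading.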
The above is a detailed description of the preferred embodiments of the present invention. Those of ordinary skill in the art should appreciate, however, that various improvements, additions and substitutions are possible within the scope and spirit of the invention, for example adjusting the order of interface calls, changing message formats and contents, or implementing the invention in different programming languages (such as C, C++ or Java). All of these fall within the scope of protection defined by the claims of the present invention.