
CN111178048A - Smooth phrase topic model-based topic extraction method and device - Google Patents


Info

Publication number
CN111178048A
CN111178048A (application CN201911421842.3A; granted as CN111178048B)
Authority
CN
China
Prior art keywords
phrase
frequent
phrases
topic
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911421842.3A
Other languages
Chinese (zh)
Other versions
CN111178048B (en)
Inventor
郭佳
张景鹏
徐路
李油
赵小琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weibo Internet Technology China Co Ltd
Original Assignee
Weibo Internet Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weibo Internet Technology China Co Ltd filed Critical Weibo Internet Technology China Co Ltd
Priority to CN201911421842.3A
Publication of CN111178048A
Application granted
Publication of CN111178048B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract



Embodiments of the present invention provide a topic extraction method and device based on a smooth phrase topic model, including: extracting effective words from a data set to be processed to obtain a preprocessed data set; extracting frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set; according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases, and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and analyzing the candidate phrase data set through an SPLDA smooth phrase topic model to obtain topic phrases, from which the corresponding topics are formed. Analyzing the candidate phrase data set through the smooth phrase topic model to obtain topic phrases and forming the corresponding topics from them improves the readability of the topics and expresses the real information of the topics more accurately.


Description

Smooth phrase topic model-based topic extraction method and device
Technical Field
The invention relates to the field of data mining, in particular to a smooth phrase topic model-based topic extraction method and device.
Background
With the rapid development of the internet, social platforms such as microblogs, WeChat and Toutiao have become mainstream media for disseminating information and publishing user opinions. The microblog, by virtue of its open platform, timely information, concise content and wide coverage, attracts more and more users, and has gradually become an important way for netizens to obtain news, communicate with each other, post comments and take part in discussions of social events, as well as an important platform reflecting social public opinion.
Common microblog hot search topics are typically described using manually labeled phrases, as shown in table 1.
TABLE 1 microblog hot search topic
(Table 1 appears as an image in the original document.)
In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:
most existing topic discovery methods perform feature extraction based on the bag-of-words model, which ignores the association information among the words in a phrase and loses part of the effective information. Because such methods represent a topic with isolated words, the topic representation has poor readability, is ambiguous, and cannot accurately reflect the real information of the topic. For example, mining the data of topic 1 yields isolated words such as "sun", "Korea" and "Song Hye-kyo", and it is difficult to obtain a result described by a phrase such as "Descendants of the Sun"; the comprehensiveness of the topic representation also needs to be improved.
Disclosure of Invention
The embodiment of the invention provides a smooth phrase topic model-based topic extraction method and device.
To achieve the above object, in one aspect, an embodiment of the present invention provides a topic extraction method based on a smooth phrase topic model, including:
extracting effective words in a data set to be processed to obtain a preprocessed data set;
extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into a new phrase according to the Gaussian distribution characteristic of the frequency of occurrence of the frequent phrases, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set;
and analyzing the candidate phrase data set through an SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
On the other hand, an embodiment of the present invention provides a topic extraction apparatus based on a smooth phrase topic model, including:
a preprocessing module, configured to extract effective words from a data set to be processed to obtain a preprocessed data set;
a phrase extraction module, configured to extract frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set, and to update the frequent phrase data set through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module, configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, the corresponding topics being formed through the topic phrases.
The technical scheme has the following beneficial effects: frequent phrases are generated with the Apriori association algorithm and, combined with the Gaussian distribution characteristics of the text, high-quality candidate phrases are generated, so that the candidate phrases are obtained with fast convergence. For microblog topics based on the smooth phrase topic model, candidate phrases are mined using the Gaussian distribution characteristics of the text, the candidate phrase data set is analyzed through the SPLDA smooth phrase topic model to obtain topic phrases, and the corresponding topics are formed through the topic phrases.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a topic extraction method based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a topic extraction apparatus based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 3 is a block diagram of topic extraction based on a smooth phrase topic model for an embodiment of the present invention;
FIG. 4 is a schematic diagram of a preprocessing module for an embodiment of the present invention;
fig. 5 is a schematic structural diagram of the SPLDA according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in combination with the embodiment of the present invention, there is provided a smooth phrase topic model-based topic extraction method, including:
s101: extracting effective words in a data set to be processed to obtain a preprocessed data set;
s102: extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into a new phrase according to the Gaussian distribution characteristic of the frequency of occurrence of the frequent phrases, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set;
s103: and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, in step 102, extracting frequent phrases from the preprocessed data set by Apriori association algorithm to form a frequent phrase data set, which specifically includes:
s1021: the preprocessing data set comprises a data set at a text level, and when the occurrence frequency of a word in the data set at the text level is greater than the minimum support degree in an Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated;
s1022: the updating of the frequent phrase data set by the Apriori association algorithm specifically includes:
marking the position of each frequent phrase in the data set at the text level;
detecting whether the text-level data set contains a frequent phrase of a preset length; when it does, retaining the text-level data set, and otherwise deleting the text-level data set; and
for frequent phrases of the same length in a retained text-level data set: according to the position of a frequent phrase, when the phrase adjacent to one side of the frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase data set; and repeating the combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, in step 102, synthesizing adjacent frequent phrases meeting preset requirements in the preprocessed data set into a new phrase, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set, which specifically includes:
s1023: acquiring two adjacent frequent phrases in a data set at a text level, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the data set at the text level, wherein the importance is the probability of the two frequent phrases appearing at the same position in the data set at the text level;
s1024: when the importance degree is not less than a preset first threshold value, adding the second-level phrase to the frequent phrase data set, and deleting the two adjacent frequent phrases;
s1025: repeating the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of any second-level phrase synthesized from two adjacent frequent phrases is smaller than the preset first threshold, to obtain the candidate phrase data set.
Preferably, step 103 specifically includes:
calculating the probability of the candidate phrase under different topics through an SPLDA smooth phrase topic model, and when the probability of the candidate phrase in a certain topic is not less than a second threshold value, taking the candidate phrase as a topic phrase, and forming a corresponding topic through the topic phrase.
Preferably, step 103 further comprises: calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probabilities of the candidate phrase under different topics through the standard deviation of the words.
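As a minimal illustration of the selection rule in step 103, the Python sketch below keeps a candidate phrase as a topic phrase whenever its probability under some topic reaches the second threshold; the function name and the toy probability table are hypothetical, and in practice the probabilities would come from the trained SPLDA model.

```python
# Minimal sketch of the topic-phrase selection rule of step 103.
# Names and the toy probabilities are hypothetical; real values would come
# from the trained SPLDA "topic-phrase" distribution.

def select_topic_phrases(topic_phrase_probs, second_threshold=0.01):
    """topic_phrase_probs: {topic_id: {phrase: probability}}."""
    topics = {}
    for topic_id, dist in topic_phrase_probs.items():
        # A candidate phrase becomes a topic phrase when its probability
        # under this topic is not less than the second threshold.
        phrases = [p for p, prob in dist.items() if prob >= second_threshold]
        if phrases:
            topics[topic_id] = sorted(phrases, key=dist.get, reverse=True)
    return topics

probs = {0: {"descendants of the sun": 0.12, "sun": 0.004},
         1: {"nobel prize": 0.09, "prize": 0.006}}
print(select_topic_phrases(probs))
# {0: ['descendants of the sun'], 1: ['nobel prize']}
```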
As shown in fig. 2, in combination with the embodiment of the present invention, there is also provided a topic extraction apparatus based on a smooth phrase topic model, including:
the preprocessing module 21: configured to extract effective words from a data set to be processed to obtain a preprocessed data set;
the phrase extraction module 22: configured to extract frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set, and to update the frequent phrase data set through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set;
the topic generation module 23: configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, the corresponding topics being formed through the topic phrases.
Preferably, the phrase extraction module 22 includes a frequent phrase mining sub-module 221, and the frequent phrase mining sub-module 221 is specifically configured to:
the preprocessing data set comprises a data set at a text level, and when the occurrence frequency of a word in the data set at the text level is greater than the minimum support degree in an Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated; updating the frequent phrase data set by the Apriori association algorithm specifically includes:
marking the position of each frequent phrase in the text-level data set;
detecting whether the text-level data set contains a frequent phrase of a preset length; when it does, retaining the text-level data set, and otherwise deleting the text-level data set; and
in a retained text-level data set, for frequent phrases of the same length: according to the position of a frequent phrase, when the phrase adjacent to one side of the frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase data set; and repeating the combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, the phrase extraction module 22 includes a candidate phrase generation sub-module 222, which is specifically further configured to:
acquire two adjacent frequent phrases in a text-level data set, combine the two frequent phrases into a second-level phrase, and calculate the importance of the second-level phrase in the text-level data set, the importance being the probability of the two frequent phrases appearing at the same position in the text-level data set;
when the importance degree is not less than a preset first threshold value, adding the second-level phrase to the frequent phrase data set, and deleting the two adjacent frequent phrases;
and repeating the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of any second-level phrase combined from two adjacent frequent phrases is less than the preset first threshold, to obtain the candidate phrase data set.
Preferably, the topic generation module 23 is specifically configured to: calculate the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrase in a certain topic is not less than the second threshold, take the candidate phrase as a topic phrase and form the corresponding topic through the topic phrase.
Preferably, the topic generation module 23 is further configured to: calculate the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correct the probabilities of the candidate phrase under different topics through the standard deviation of the words.
The invention has the following beneficial effects:
Frequent phrases are generated with the Apriori association algorithm, which mines them quickly and effectively by using two important rules; high-quality candidate phrases are then generated by incorporating the Gaussian distribution characteristics of the text.
For microblog topics based on the smooth phrase topic model, candidate phrases are mined using the Gaussian distribution characteristics of the text; the probabilities of the candidate phrases under different topics are calculated through the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a certain topic is not less than the second threshold, the candidate phrase is taken as a topic phrase, and the corresponding topics are formed through the topic phrases.
By incorporating the variance (i.e., the standard deviation) of the "topic-word" probability distribution of the words in a phrase under the same topic, where a smaller variance indicates a higher possibility that the phrase belongs to the topic, the "topic-word" probability distribution is corrected, which improves the convergence speed of sampling and the accuracy of the result. Expressing topics with the generated topic phrases improves readability, reduces ambiguity, and expresses the real information of the topics more accurately.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
Abbreviations and key term definitions appearing in the present invention:
the topic model is as follows: a statistical model for discovering abstract topics in a series of documents in the fields of machine learning and natural language processing.
SPLDA: smoothened Phrase LDA (smooth Phrase topic model).
Smoothing a Gibbs sampling equation by combining the variance of the probability of the terms in the phrase under the same theme, and further correcting the distribution of the theme-phrase to mine the microblog topics.
In order to improve the effect of discovering microblog topics for public opinion monitoring, a topic discovery method (also called a topic extraction method) based on SPLDA (Smoothed Phrase LDA) is provided. Describing microblog topics with phrases helps people grasp the meaning of a topic more accurately and comprehensively, and further assists microblog public opinion monitoring.
The principle of the SPLDA-based topic discovery method is shown in fig. 3. A data set is first input into the preprocessing module, which extracts the effective words in the data set to be processed and obtains the preprocessed data set. The data set is the collected microblog content; it can contain many posts, which may be collected at random from day-to-day data. The preprocessed data set is obtained by hashtag filtering, word segmentation and stop-word removal, which are common steps of natural language processing (NLP).
The text processed by the preprocessing module is denoised and converted into a data form that can be fed to the Gaussian model. The preprocessed text is then input into the phrase mining module, which mines frequent phrases with the Apriori association algorithm and recombines them, using the Gaussian distribution characteristics of the number of occurrences of the frequent phrases in the preprocessed text, to generate candidate phrases. Frequent phrases are the frequent items; "frequent item" is a term in association algorithms that refers to a phrase meeting the support. The recombination of frequent phrases is specifically: if the importance of two adjacent phrases meets the threshold (their co-occurrence count reaches the minimum support), the two phrases are merged into one phrase; for example, "Beijing" and "Tiananmen" are merged into "Beijing Tiananmen". In this way a candidate phrase set is generated.
Finally, the candidate phrase set is input into the SPLDA topic phrase generation model and analyzed with the phrase topic model. By combining the association among the words of a phrase that belongs to a topic with the variance of their probability distributions under the same topic, the "topic-phrase" probability distribution is obtained, and one or more topic phrases with high probability values are selected to represent a topic. For example, among ten thousand microblog posts, the core topics may be several, such as celebrity gossip, the Nobel Prize, and the like.
The details are as follows:
as shown in fig. 4, a schematic diagram of the preprocessing module, a regular-expression filter is first used to remove noise symbols in the microblog data set, such as html tags, emoticons and punctuation marks; html tags are the tags in web page source code, for example <br> and <div>. Traditional Chinese characters are also converted to simplified ones. Then a Chinese word segmentation tool is used to segment the data set and tag parts of speech, and stop words are removed; stop words are words that occur often in the text but carry little meaning. Languages such as English are naturally segmented and need no word segmentation; the microblog is mainly Chinese, and the hot spots of the collected data set can be obtained by analyzing the Chinese data, so separate English preprocessing is omitted. Finally, microblog texts whose body contains fewer than 4 effective words are removed; effective word classes generally include nouns, verbs, adjectives, numerals, time words and the like.
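A minimal sketch of this preprocessing flow might look as follows; it assumes the jieba segmenter is available, and the regular expressions and stop-word list are illustrative, not the patent's exact filters.

```python
# Illustrative preprocessing sketch; assumes the jieba segmenter is installed.
# The regexes and the stop-word list are examples, not the patent's filters.
import re
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在"}        # example stop words
VALID_FLAGS = ("n", "v", "a", "m", "t")      # nouns, verbs, adjectives, numerals, time words

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)             # strip html tags such as <br>, <div>
    text = re.sub(r"#[^#]*#|http\S+", " ", text)     # strip hashtags and urls
    words = []
    for token in pseg.cut(text):                     # segmentation + part-of-speech tagging
        if token.word.strip() and token.word not in STOP_WORDS \
                and token.flag.startswith(VALID_FLAGS):
            words.append(token.word)                 # keep only effective word classes
    return words

def keep(post):
    # Posts whose body yields fewer than 4 effective words are discarded.
    return len(preprocess(post)) >= 4
```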
Phrase mining (the phrase extraction module) consists of two main steps: (1) frequent phrase mining, in which frequent phrases are generated with the Apriori association algorithm and their occurrence counts are recorded; and (2) candidate phrase generation, in which high-quality candidate phrases are generated by incorporating the Gaussian distribution characteristics of the text.
The task of frequent phrase mining is to collect all contiguous word sequences in the corpus (the preprocessed data set) whose counts are greater than the minimum support of the Apriori algorithm; the minimum support is the minimum number of occurrences of a word, for example 3. The algorithm mines frequent phrases quickly and effectively by using two important rules, the "downward closure principle" and the "anti-monotonicity of data", which are two parallel rules. Specifically, the downward closure principle states: "if the phrase P is not a frequent item, then any phrase containing P is also not a frequent item". The anti-monotonicity of data states: "if a document does not contain a frequent phrase of length n, then the document will not contain a frequent phrase of length greater than n".
The flow of the frequent phrase mining algorithm is shown in Table 2. The Apriori algorithm is first used to mine frequent items (frequent phrases) while maintaining an active index set. The specific operation is: phrases satisfying the support are derived with the downward closure principle, i.e., phrases containing an infrequent sub-phrase are removed. The active indexes are the index information of the frequent phrases of length n within the document of a microblog post; at initialization, the active index set holds the positions of all words in the data set that meet the support.
The anti-monotonicity rule is then used to judge whether a document needs further mining. Since the active indexes record the frequent phrases of length n in the document of a microblog post, a document containing no frequent phrase of some length n cannot contain a frequent phrase of length greater than n, and such a document is deleted in the next iteration of the Apriori algorithm. These two pruning techniques for frequent phrase mining exploit the natural sparsity of phrases, speed up convergence, allow the iteration to stop early, and improve the efficiency of phrase mining.
For example: 1. after word segmentation, the positions of the words meeting the support threshold are obtained; 2. suppose document d1 is processed: if the active word w1 (a frequent phrase) is in d1, w1 is combined with its adjacent word, according to the position of the frequent phrase in the text-level data set, to generate the phrase p1 (the adjacent word must itself be a frequent phrase to be combined; otherwise no combination is made, by the downward closure principle that if the phrase P is not a frequent item, any phrase containing P is also not a frequent item); 3. whether p1 meets the support is judged; if it does, p1 is added to the active index set, and after the iteration is completed (for example, when the frequent phrase length is 5), w1 and the word adjacent to w1 are deleted from the frequent phrase data set; 4. the active words in d1 are combined in this manner, and if no word of document d1 remains in the active index set, document d1 is deleted; 5. when a combined new phrase (first-level phrase) has the minimum support, the first-level phrase is added to the frequent phrase data set; 6. the above steps are iterated over the documents of all microblog posts until the active index set no longer changes.
TABLE 2 frequent phrase mining Algorithm
(Table 2 appears as an image in the original document.)
That is, in the Apriori frequent phrase mining algorithm, using the "downward closure principle" and the "anti-monotonicity of data", the corpus (data set) is scanned with a sliding window to generate phrases; the phrases satisfying the support are taken as frequent phrases and counted, and the size of the sliding window is increased by 1 per iteration (the size of the scanning window is the length of the phrase; a scanning window of n means searching for phrases of length n that satisfy the support). In a concrete implementation, the initial value of n is usually 2. At the n-th iteration, a candidate phrase of length n-1 is taken at each active index position and counted with a hash-based counter (HashMap), as shown in line 7 of Table 2. If a phrase of length n-1 does not satisfy the Apriori minimum support, the phrase is not added to the phrase candidate set, and at the same time the start position index of the phrase is removed from the active index set before the next iteration.
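A compact Python sketch of this sliding-window mining, under simplifying assumptions (documents as lists of words, a single minimum support count; a simplification of Table 2 rather than its exact pseudocode), might look as follows.

```python
# Sliding-window frequent phrase mining with the two pruning rules.
# A simplification of Table 2: docs are lists of words, min_support is a count.
from collections import Counter

def mine_frequent_phrases(docs, min_support):
    """Return {phrase_tuple: count} for all frequent phrases."""
    # Length-1 pass: every word meeting min_support is frequent, and its
    # positions form the initial active index set of each document.
    counts = Counter(w for doc in docs for w in doc)
    frequent = {(w,): c for w, c in counts.items() if c >= min_support}
    active = {d: [i for i, w in enumerate(doc) if counts[w] >= min_support]
              for d, doc in enumerate(docs)}

    n = 2
    while active:
        cand = Counter()
        for d, idxs in active.items():
            for i in idxs:
                if i + n <= len(docs[d]):
                    cand[tuple(docs[d][i:i + n])] += 1   # sliding window of size n
        new = {p: c for p, c in cand.items() if c >= min_support}
        if not new:                                      # no frequent phrase of length n
            break
        frequent.update(new)
        # Downward closure: keep only start positions whose length-n phrase is frequent.
        active = {d: [i for i in idxs if tuple(docs[d][i:i + n]) in new]
                  for d, idxs in active.items()}
        # Anti-monotonicity: drop documents with no frequent phrase of length n.
        active = {d: idxs for d, idxs in active.items() if idxs}
        n += 1
    return frequent
```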
The candidate phrase generation sub-module uses the result of the frequent phrase mining module to recombine the frequent phrases and generate candidate phrases. Assuming that the corpus (the data set processed by the Apriori algorithm) is generated by a series of independent Bernoulli trials, whether a phrase P occurs at a specific position in the corpus follows a Bernoulli distribution, the number of its occurrences follows a binomial distribution, and the occurrence events are independent of one another, i.e., the expected number of occurrences of each phrase follows its own binomial distribution. Since the number L of phrases in the corpus is quite large, the binomial distribution can reasonably be approximated by a Gaussian distribution (by the central limit theorem). Suppose the frequent phrases $P_1$ and $P_2$ are known. The method uses the importance index $sig(P_1, P_2)$ as the basis for deciding whether to merge the phrases $P_1$ and $P_2$, where $sig(P_1, P_2)$ measures the probability that the known frequent phrases $P_1$ and $P_2$ occur together, i.e., the probability of merging $P_1$ and $P_2$ into the same phrase (the co-occurrence of the two phrases). The index is derived as follows. Let $f(P)$ denote the number of occurrences of the phrase P in the corpus; its probability distribution is shown in formula (1).
$$h_0(f(P)) = N\big(L\,p(P),\; L\,p(P)(1 - p(P))\big) \approx N\big(L\,p(P),\; L\,p(P)\big) \qquad (1)$$
where $p(P)$ is the success probability of the Bernoulli trial for the phrase P; the empirical probability of the phrase P occurring in the corpus can be estimated as $p(P) \approx f(P)/L$.
Suppose the phrase $P_0$ is composed of the phrase $P_1$ and the phrase $P_2$ (phrases include both single words and multiword phrases), and $P_1$ and $P_2$ are independent of each other. The expected number of occurrences of the phrase $P_0$ is then:

$$\mu_0(P_1, P_2) = L\,p(P_1)\,p(P_2) \qquad (2)$$
From formulas (1) and (2), the variance of the distribution $h_0(f(P_0))$ is:

$$\sigma^2\big(f(P_0)\big) \approx L\,p(P_1)\,p(P_2)\,\big(1 - p(P_1)\,p(P_2)\big) \approx L\,p(P_1)\,p(P_2) \qquad (3)$$
the method uses the significance index $sig(P_1, P_2)$ to measure the probability that the phrases $P_1$ and $P_2$ occur together; it is calculated as in formula (4):

$$sig(P_1, P_2) = \frac{f(P_0) - \mu_0(P_1, P_2)}{\sqrt{f(P_0)}} \qquad (4)$$
The binomial-to-Gaussian approximation involves two formulas, the expectation $L\,p(P)$ and the variance $L\,p(P)(1 - p(P))$; both aim to obtain the Gaussian distribution of the events in which the phrase $P_0 = P_1 P_2$ occurs in the corpus.
According to the formula (1), a Gaussian distribution (which is a probability density function) of the phrases P1 and P2 appearing in the document at the same time is obtained, f (P1P2) is normalized to obtain sig (P1, P2), and the larger the value of the sig (P1, P2), the larger the probability that the phrases P1 and P2 can be combined into one phrase is.
The specific process of generating the candidate phrases is shown in Table 3; the candidate phrase generation rule recombines frequent phrases to generate longer phrases and improve phrase readability. After the data set passes through the Apriori frequent phrase mining module, the document of each microblog post consists of frequent phrases. First, for each document, a bottom-up agglomeration method is adopted: each frequent phrase is combined with the words adjacent to it on the left and right to generate a new phrase. Second, the importance sig of each new phrase is calculated; phrases meeting the threshold are added to the candidate phrase set, namely a MaxHeap hash container (the key of the container is the phrase and the value is its importance), and the phrases at the corresponding positions in the document are updated, i.e., the hash container is used to store the active index set. For example: if P1 and P2 in the document are "Beijing" and "Tiananmen" and sig(P1, P2) satisfies the threshold, P1 and P2 are merged in the document to generate the phrase "Beijing Tiananmen", and P1 and P2 are deleted from d1; if the threshold is not met, no change is made to P1 and P2. Next, the most important phrase Best is selected from the MaxHeap hash container (rows 2-4 in Table 3); Best is combined with the word or phrase adjacent to it in the document to generate a new phrase, new phrases meeting the threshold are added to the candidate phrase set, and Best is removed from the MaxHeap. Finally, the process iterates until the sig value of the Best phrase is smaller than the threshold or the MaxHeap container is empty.
TABLE 3 candidate phrase Generation procedure
(Table 3 appears as an image in the original document.)
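One way to realize the merging loop of Table 3 in Python is sketched below; heapq is a min-heap, so importance values are negated to emulate the MaxHeap container, and the significance function from the previous sketch is assumed. A production version would update the heap incrementally and write merged phrases back to the frequent phrase data set rather than rebuilding the heap each round.

```python
# Sketch of the Table 3 merging loop for one document; sig(p1, p2) is assumed
# to be the significance function above, threshold the first threshold.
import heapq

def generate_candidate_phrases(doc, sig, threshold):
    """doc: list of frequent phrases (tuples of words) covering one document."""
    doc = list(doc)
    while len(doc) > 1:
        # Rebuild the pair heap (heapq is a min-heap, so negate the importance).
        heap = [(-sig(doc[i], doc[i + 1]), i) for i in range(len(doc) - 1)]
        heapq.heapify(heap)
        best_neg_sig, i = heapq.heappop(heap)
        if -best_neg_sig < threshold:        # Best merge is insignificant: stop.
            break
        doc[i:i + 2] = [doc[i] + doc[i + 1]] # Merge the two adjacent phrases.
    return doc
```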
After the data set passes through the preprocessing and phrase mining modules, the content of each document (the body of each microblog post) has been divided into several candidate phrases, and the document is converted from a bag-of-words form into a bag-of-phrases form. From the phrase mining module it is known that the words within a generated phrase have a strong association, so in the phrase topic model here it can reasonably be assumed that all the words in a phrase share one topic.
In 2003, Blei proposed the LDA (Latent Dirichlet Allocation) model, introducing Dirichlet prior distributions on the basis of the pLSA model. In the LDA model, the text d is assumed to follow a multinomial distribution $\theta_d$ over all K topics, and each topic k follows a multinomial distribution $\varphi_k$ over the set of words, with Dirichlet priors $\theta_d \sim \mathrm{Dir}(\alpha)$ and $\varphi_k \sim \mathrm{Dir}(\beta)$.
The SPLDA method is based on LDA and replaces the bag-of-words model in LDA with a bag-of-phrases model, as shown in fig. 5. The SPLDA model is based on the assumptions that documents are independent and that the words in a phrase belong to the same topic. Document independence means that if there are 10,000 microblog posts in the data set, each post is independent and they do not influence each other. As an example of the shared-topic assumption, if the phrase P0 "Peking University score line" belongs to the topic "Gaokao" (the college entrance examination), then P1 "Peking University" and P2 "score line" both belong to the topic "Gaokao" as well. The SPLDA generative process for a document is as follows:
(1) For each topic k (a corpus may contain multiple microblog documents corresponding to multiple topics):
a) generate the "topic-word" distribution $\varphi_k \sim \mathrm{Dir}(\beta)$;
b) update the "topic-phrase" distribution when the topics of the words in a phrase are the same.
(2) For each document d:
a) generate the document-topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$;
b) for the i-th word in the text:
i. generate a topic assignment $z_{d,i} \sim \mathrm{Multi}(\theta_d)$;
ii. generate the word $w_{d,i} \sim \mathrm{Multi}(\varphi_{z_{d,i}})$.
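The generative story above can be simulated directly, as in the sketch below; the hyperparameters are illustrative, and the shared-topic assumption is enforced by drawing a single topic per phrase rather than per word.

```python
# Simulation of the SPLDA generative process (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                       # number of topics, vocabulary size
alpha, beta = 0.1, 0.01              # Dirichlet hyperparameters (illustrative)

# (1) For each topic k: "topic-word" distribution phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(phrase_lengths):
    """phrase_lengths: number of words in each candidate phrase of the document."""
    theta = rng.dirichlet(np.full(K, alpha))   # (2a) theta_d ~ Dir(alpha)
    doc = []
    for g_len in phrase_lengths:
        z = rng.choice(K, p=theta)             # one topic per phrase: C_{d,g} = z
        words = rng.choice(V, size=g_len, p=phi[z])  # all words share topic z
        doc.append((int(z), words.tolist()))
    return doc

print(generate_document([2, 3, 1]))            # three phrases of lengths 2, 3, 1
```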
The variables involved in the SPLDA model and their meanings are shown in table 4.
TABLE 4 SPLDA model-related parameters
(Table 4 appears as an image in the original document.)
In Table 4, the vocabulary is the list generated from all the distinct words in the data set. For example, all distinct words in the data set are stored in a file, one word per line, and the line number is the position of the word in the vocabulary. $N_d$ is the number of tokens in document d, a token being one word; $G_d$ is the number of candidate phrases in document d, a candidate phrase containing one or more words.
In the SPLDA model, a random variable $C_{d,g}$ is used to represent the topic of the g-th phrase of document d. The joint distribution of the parameters Z, W, $\Phi$, $\Theta$ in SPLDA is:

$$P_{SPLDA}(Z, W, \Phi, \Theta) = \frac{1}{C}\, P_{LDA}(Z, W, \Phi, \Theta) \prod_{d}\prod_{g} f(C_{d,g}) \qquad (5)$$

where Z are the topic assignments, W are the words, $\Phi$ is the set of per-topic "topic-word" distributions $\varphi_k$, and $\Theta$ is the set of per-document topic distributions $\theta_d \sim \mathrm{Dir}(\alpha)$.
Wherein P isLDA(Z, W, phi, theta) is a parameter joint distribution function in the LDA model, see formula (6), and C is a normalization constant for ensuring that the left side of formula (6) is a reasonable probability distribution. Function f (C)d,g) Is used to constrain the words in the phrase to belong to the same topic, see formula (7).
Figure BDA0002352587010000124
Suppose the phrase $P_0$ has the same "topic-phrase" probability under topic $z_j$ and topic $z_i$, i.e., $p(C_{d,g} = z_j) = p(C_{d,g} = z_i)$, while the words $\{w_{d,g,i}\}$ contained in $P_0$ have a small spread in their "topic-word" probabilities under topic $z_i$ and a large spread under topic $z_j$. In this case, assigning the topic $z_i$ to the phrase $P_0$ is more consistent with the assumption that the words in a phrase belong to the same topic. However, when the parameters of the bag-of-phrases topic model are optimized with the conventional Gibbs sampling method, the choice between topics $z_j$ and $z_i$ for the phrase $P_0$ cannot be distinguished, leading to an inaccurate "topic-phrase" distribution. To address this problem, the method uses a statistical property of the probability distribution of the words $\{w_{d,g,i}\}$ under the same topic (namely, the standard deviation) to improve the Gibbs sampling method and thereby correct the "topic-phrase" probability distribution. $C_{d,g}$ has K possible values; $C_{d,g} = k$ denotes that every word in the phrase is assigned topic k, where K is the number of topics and k is a specific topic. The parameter optimization equations are as follows:
(Formulas (8) and (9) appear only as images in the original; they give the smoothed Gibbs sampling updates for $p(C_{d,g} = k)$, incorporating the standard-deviation correction $\mathrm{Var}/\tanh(\mathrm{Var})$ described below.)
In formulas (8) and (9), Var (VarSqrt) is the standard deviation of the probability distribution of the words in the phrase under topic k. In equation (8), as Var increases, the value of $\mathrm{Var}/\tanh(\mathrm{Var})$ grows, penalizing $p(C_{d,g} = k)$: the probability that the phrase is assigned topic k becomes smaller. As Var decreases, $\mathrm{Var}/\tanh(\mathrm{Var})$ shrinks toward 1 (its limit as Var approaches 0), mitigating the penalty on $p(C_{d,g} = k)$: the probability of the phrase taking topic k becomes relatively larger. Through formula (8), the method incorporates the difference among the probability distributions of the words in a phrase under the same topic into the training of the topic model, thereby correcting the "topic-phrase" probability distribution.
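Expressed in code, the correction is compact; the sketch below reflects one reading of the image-only formulas (8)-(9), in which the unsmoothed score for assigning topic k to a phrase is divided by Var/tanh(Var), with Var the standard deviation of the words' "topic-word" probabilities under topic k.

```python
# One reading of the Var/tanh(Var) smoothing of formulas (8)-(9).
import math

def smoothing_factor(word_probs):
    """word_probs: 'topic-word' probabilities of the phrase's words under topic k."""
    mean = sum(word_probs) / len(word_probs)
    var = math.sqrt(sum((p - mean) ** 2 for p in word_probs) / len(word_probs))
    if var == 0.0:
        return 1.0                   # lim_{Var -> 0} Var / tanh(Var) = 1
    return var / math.tanh(var)      # grows with Var, penalizing dispersed words

def smoothed_score(raw_score, word_probs):
    # Dividing by the factor lowers p(C_{d,g} = k) when the words' probabilities
    # disagree under topic k, and leaves it nearly unchanged when they agree.
    return raw_score / smoothing_factor(word_probs)

print(round(smoothing_factor([0.010, 0.011, 0.009]), 4))  # ~1.0 (agreement)
print(round(smoothing_factor([0.90, 0.05, 0.85]), 4))     # ~1.05 (penalized)
```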
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store program code in the form of instructions or data structures and that can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium; thus, software transmitted from a website, server, or other remote source via coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave, is included. Disks (disk) and discs (disc) include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1.一种基于平滑短语主题模型的主题提取方法,其特征在于,包括:1. a topic extraction method based on smooth phrase topic model, is characterized in that, comprises: 提取待处理数据集内的有效词,得到预处理数据集;Extract valid words in the data set to be processed to obtain a preprocessing data set; 通过Apriori关联算法自预处理数据集中提取出频繁短语,形成频繁短语数据集,并通过Apriori关联算法更新频繁短语数据集;根据频繁短语出现频率的高斯分布特性,将预处理数据集中符合预设要求的相邻的频繁短语组合成新的短语,并将新的短语加入到频繁短语数据集,形成候选短语数据集;The frequent phrases are extracted from the preprocessed data set by the Apriori association algorithm to form the frequent phrase data set, and the frequent phrase data set is updated by the Apriori association algorithm; according to the Gaussian distribution characteristics of the frequency of frequent phrases, the preprocessed data set meets the preset requirements. The adjacent frequent phrases are combined into new phrases, and the new phrases are added to the frequent phrase dataset to form a candidate phrase dataset; 通过SPLDA平滑短语主题模型对候选短语数据集进行分析,得到主题短语,通过主题短语形成相应的话题。The candidate phrase data set is analyzed by the SPLDA smooth phrase topic model, and the topic phrase is obtained, and the corresponding topic is formed by the topic phrase. 2.根据权利要求1所述的基于平滑短语主题模型的主题提取方法,其特征在于,2. the topic extraction method based on smooth phrase topic model according to claim 1, is characterized in that, 所述通过Apriori关联算法自预处理数据集中提取出频繁短语,形成频繁短语数据集,具体包括:The frequent phrases are extracted from the preprocessing data set through the Apriori association algorithm to form a frequent phrase data set, which specifically includes: 所述预处理数据集包括文本级别的数据集,当所述文本级别的数据集中某个词出现的次数大于Apriori算法中的最小支持度,则设定该词为频繁短语,生成频繁短语数据集;The preprocessing data set includes a text-level data set. When the number of occurrences of a certain word in the text-level data set is greater than the minimum support degree in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated. ; 所述通过Apriori关联算法更新频繁短语数据集,具体包括:The updating of the frequent phrase data set through the Apriori association algorithm specifically includes: 并标记每个频繁短语在所述文本级别的数据集中的所在位置;and mark the location of each frequent phrase in the text-level dataset; 检测文本级别的数据集中是否包含预设长度的频繁短语,当包含预设长度的频繁短语时则保留该文本级别的数据集;否则删除该文本级别的数据集;以及,Detecting whether the text-level dataset contains frequent phrases of a preset length, and when the frequent phrases of the preset length are included, the text-level dataset is retained; otherwise, the text-level dataset is deleted; and, 在保留的文本级别的数据集中,针对同一长度的频繁短语,根据频繁短语所在位置,当与该频繁短语一侧相邻的短语也为频繁短语时,将频繁短语与该相邻的短语合成为第一级短语,当第一级短语达到最小支持度时,将该第一级短语添加到频繁短语数据集内,并将该第一级短语对应的两个相邻的频繁短语从频繁短语数据集中删除;重复循环将频繁短语与相邻的短语合成第一级短语直到第一级短语不满足最小支持度,完成对频繁短语数据集的更新。In the reserved text-level data set, for frequent phrases of the same length, according to the location of the frequent phrase, when the phrase adjacent to the frequent phrase is also a frequent phrase, the frequent phrase and the adjacent phrase are synthesized as The first-level phrase, when the first-level phrase reaches the minimum support, the first-level phrase is added to the frequent phrase data set, and the two adjacent frequent phrases corresponding to the first-level phrase are removed from the frequent phrase data set. Centralized deletion; repeated cycles synthesizing frequent phrases and adjacent phrases into first-level phrases until the first-level phrases do not meet the minimum support degree, completing the update of the frequent phrase data set. 3.根据权利要求2所述的基于平滑短语主题模型的主题提取方法,其特征在于,将预处理数据集中符合预设要求的相邻的频繁短语合成新的短语,并将新的短语加入到频繁短语数据集,形成候选短语数据集,具体包括:3. 
The topic extraction method based on a smooth phrase topic model according to claim 2, wherein the adjacent frequent phrases that meet the preset requirements in the preprocessing data set are synthesized into new phrases, and the new phrases are added to the Frequent phrase datasets to form candidate phrase datasets, including: 获取文本级别的数据集中两个相邻的频繁短语并将该两个频繁短语合为第二级短语,计算该第二级短语在文本级别的数据集中的重要度,所述重要度为该两个频繁短语在文本级别的数据集中相同位置出现的概率;Obtain two adjacent frequent phrases in the text-level dataset and combine the two frequent phrases into second-level phrases, and calculate the importance of the second-level phrase in the text-level dataset, where the importance is the two The probability that a frequent phrase appears in the same position in the text-level dataset; 当重要度不小于预设的第一阈值时,将该第二级短语添加到频繁短语数据集,并删除该两个相邻的频繁短语;When the importance is not less than the preset first threshold, add the second-level phrase to the frequent phrase data set, and delete the two adjacent frequent phrases; 循环将两个相邻的频繁短语合为一个第二级短语的操作,直到任何两个相邻的频繁短语合成的第二级短语的重要度小于预设的第一阈值,得到候选短语数据集。The operation of combining two adjacent frequent phrases into a second-level phrase is repeated until the importance of the second-level phrase synthesized by any two adjacent frequent phrases is less than the preset first threshold, and the candidate phrase dataset is obtained. . 4.根据权利要求1所述的基于平滑短语主题模型的主题提取方法,其特征在于,通过SPLDA平滑短语主题模型对候选短语数据集进行分析,得到主题短语,通过主题短语形成相应的话题,具体包括:4. the subject extraction method based on smooth phrase subject model according to claim 1, is characterized in that, candidate phrase data set is analyzed by SPLDA smooth phrase subject model, obtain subject phrase, form corresponding topic by subject phrase, concretely. include: 通过SPLDA平滑短语主题模型计算候选短语在不同主题下的概率,当该候选短语在某主题中的概率不小于第二阈值时,将该候选短语作为主题短语,通过该主题短语形成相应的话题。The probability of candidate phrases under different topics is calculated by the SPLDA smooth phrase topic model. When the probability of the candidate phrase in a topic is not less than the second threshold, the candidate phrase is regarded as the topic phrase, and the corresponding topic is formed through the topic phrase. 5.根据权利要求4所述的基于平滑短语主题模型的主题提取方法,其特征在于,还包括:计算候选短语中的词在主题下的概率分布的标准差,通过词的标准差修正该候选短语在不同主题下的概率。5. The topic extraction method based on a smooth phrase topic model according to claim 4, further comprising: calculating the standard deviation of the probability distribution of the word in the candidate phrase under the topic, and correcting the candidate by the standard deviation of the word Probabilities of phrases under different topics. 6.一种基于平滑短语主题模型的主题提取装置,其特征在于,包括:6. 
4. The topic extraction method based on a smooth phrase topic model according to claim 1, characterized in that analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases and forming the corresponding topics from the topic phrases specifically comprises:
calculating the probability of each candidate phrase under the different topics with the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a topic is not less than a second threshold, taking that candidate phrase as a topic phrase and forming the corresponding topic from the topic phrase.

5. The topic extraction method based on a smooth phrase topic model according to claim 4, characterized in that it further comprises: calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probability of the candidate phrase under the different topics by the standard deviation of the words.
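SPLDA is the patent's own model, so no public API is assumed here. The sketch below, not part of the claims, only illustrates the thresholding of claim 4 and the standard-deviation correction of claim 5 against already-estimated per-topic word distributions phi, a stand-in input; the geometric-mean phrase probability, the damping rule, and SECOND_THRESHOLD are illustrative assumptions, not the patented formulas.

```python
# phi[k] maps each word to its probability under topic k, a stand-in for
# the distributions an SPLDA-style model would estimate. The scoring and
# correction rules below are assumptions, not the patented formulas.
import math
import statistics

SECOND_THRESHOLD = 0.05  # hypothetical second threshold of claim 4

def phrase_probability(phi_k, phrase):
    """Geometric mean of the word probabilities under one topic, an
    assumed way to lift word probabilities to a phrase probability."""
    probs = [phi_k.get(w, 1e-12) for w in phrase]
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

def corrected_probability(phi_k, phrase):
    """Claim-5-style correction: the larger the spread of the word
    probabilities under the topic, the more the phrase is damped."""
    probs = [phi_k.get(w, 1e-12) for w in phrase]
    spread = statistics.pstdev(probs)
    return phrase_probability(phi_k, phrase) * (1.0 - min(spread, 1.0))

def topic_phrases(phi, candidates):
    """Claim 4: a candidate becomes a topic phrase of every topic where
    its corrected probability is not less than the second threshold."""
    result = {}
    for phrase in candidates:
        for k, phi_k in phi.items():
            if corrected_probability(phi_k, phrase) >= SECOND_THRESHOLD:
                result.setdefault(k, []).append(phrase)
    return result
```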
6. A topic extraction device based on a smooth phrase topic model, characterized in that it comprises:
a preprocessing module for extracting the valid words in a data set to be processed, to obtain a preprocessed data set;
a phrase extraction module for extracting frequent phrases from the preprocessed data set with the Apriori association algorithm to form a frequent phrase data set and updating the frequent phrase data set with the Apriori association algorithm, and for combining, according to the Gaussian distribution of the occurrence frequencies of frequent phrases, adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module for analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases and forming the corresponding topics from the topic phrases.

7. The topic extraction device based on a smooth phrase topic model according to claim 6, characterized in that the phrase extraction module comprises a frequent phrase mining submodule specifically configured so that:
the preprocessed data set comprises text-level data sets; when the number of occurrences of a word in a text-level data set is greater than the minimum support of the Apriori algorithm, that word is set as a frequent phrase and the frequent phrase data set is generated;
and updating the frequent phrase data set with the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level data set;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and deleting it otherwise; and
in the retained text-level data sets, for frequent phrases of the same length and according to their positions, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combining the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases from which it was formed; repeating this combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, which completes the update of the frequent phrase data set.

8. The topic extraction device based on a smooth phrase topic model according to claim 7, characterized in that the phrase extraction module comprises a candidate phrase generation submodule specifically configured for:
obtaining two adjacent frequent phrases in a text-level data set, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the text-level data set, the importance being the probability that the two frequent phrases appear in the same position in the text-level data set;
when the importance is not less than the preset first threshold, adding the second-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases; and
repeating the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase formed from any two adjacent frequent phrases is less than the preset first threshold, which yields the candidate phrase data set.

9. The topic extraction device based on a smooth phrase topic model according to claim 6, characterized in that the topic generation module is specifically configured for:
calculating the probability of each candidate phrase under the different topics with the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a topic is not less than the second threshold, taking that candidate phrase as a topic phrase and forming the corresponding topic from the topic phrase.

10. The topic extraction device based on a smooth phrase topic model according to claim 9, characterized in that the topic generation module is further specifically configured for:
calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probability of the candidate phrase under the different topics by the standard deviation of the words.
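Read as software, claims 6 to 10 describe three cooperating modules. The class below is a hypothetical wiring of such modules, with every name invented for illustration, showing how the device of claim 6 mirrors the method of claims 1 to 5.

```python
# Hypothetical composition of the three modules of claim 6; the callables
# are stand-ins for the preprocessing, phrase extraction, and topic
# generation modules, not the patented implementation.
from typing import Callable, Dict, List, Tuple

class TopicExtractionDevice:
    def __init__(
        self,
        preprocess: Callable[[List[str]], List[List[str]]],
        extract_phrases: Callable[[List[List[str]]], List[Tuple[str, ...]]],
        generate_topics: Callable[[List[List[str]], List[Tuple[str, ...]]], Dict],
    ):
        self.preprocess = preprocess            # claim 6: preprocessing module
        self.extract_phrases = extract_phrases  # claim 6: phrase extraction module
        self.generate_topics = generate_topics  # claim 6: topic generation module

    def run(self, raw_documents: List[str]) -> Dict:
        docs = self.preprocess(raw_documents)
        candidates = self.extract_phrases(docs)
        return self.generate_topics(docs, candidates)
```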
CN201911421842.3A 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model Active CN111178048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911421842.3A CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911421842.3A CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Publications (2)

Publication Number Publication Date
CN111178048A (en) 2020-05-19
CN111178048B (en) 2023-08-01

Family

ID=70654319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911421842.3A Active CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Country Status (1)

Country Link
CN (1) CN111178048B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065655A1 (en) * 2001-09-28 2003-04-03 International Business Machines Corporation Method and apparatus for detecting query-driven topical events using textual phrases on foils as indication of topic
US20180357684A1 (en) * 2017-01-12 2018-12-13 Hefei University Of Technology Method for identifying prefereed region of product, apparatus and storage medium thereof
US20180211287A1 (en) * 2017-01-24 2018-07-26 International Business Machines Corporation Digital content generation based on user feedback
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
余琴琴; 彭敦陆; 刘丛: "Feature phrase extraction model based on frequent word sets in large-scale word sequences" (大规模词序列中基于频繁词集的特征短语抽取模型), Journal of Chinese Computer Systems (小型微型计算机系统), no. 05 *
杨; 张德生: "Topic key-phrase extraction techniques for Chinese text" (中文文本的主题关键短语提取技术), Computer Science (计算机科学), no. 2 *
熊才伟; 曹亚男: "Research on mining Weibo users' interests from their posts" (基于发文内容的微博用户兴趣挖掘方法研究), vol. 35, no. 06, p. 1620 *
肖波: "Research on trusted association rule mining algorithms" (可信关联规则挖掘算法研究), no. 05, pp. 27-28 *

Also Published As

Publication number Publication date
CN111178048B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Chan et al. Sentiment analysis in financial texts
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN109241274B (en) Text clustering method and device
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
US20140032207A1 (en) Information Classification Based on Product Recognition
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN104899230A (en) Public opinion hotspot automatic monitoring system
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN102591988A (en) Short text classification method based on semantic graphs
CN110750635A (en) A method for legal recommendation based on joint deep learning model
CN106407195B (en) Method and system for deduplication of web pages
CN111177375A (en) Electronic document classification method and device
CN112528653B (en) Short text entity recognition method and system
US9754023B2 (en) Stochastic document clustering using rare features
CN117454220A (en) Data hierarchical classification method, device, equipment and storage medium
Ma et al. The impact of weighting schemes and stemming process on topic modeling of arabic long and short texts
CN113408660B (en) Book clustering method, device, equipment and storage medium
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN116029280A (en) A document key information extraction method, device, computing device and storage medium
CN113157857B (en) News-oriented hot topic detection method, device and equipment
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN103886097A (en) Chinese microblog viewpoint sentence recognition feature extraction method based on self-adaption lifting algorithm
CN111178048B (en) Topic extraction method and device based on smooth phrase topic model
Ait Addi et al. Supervised classifiers and keyword extraction methods for text classification in Arabic

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant