
CN111178048A - Smooth phrase topic model-based topic extraction method and device - Google Patents


Info

Publication number
CN111178048A
CN111178048A (application CN201911421842.3A; granted as CN111178048B)
Authority
CN
China
Prior art keywords
phrase
frequent
phrases
topic
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911421842.3A
Other languages
Chinese (zh)
Other versions
CN111178048B (en)
Inventor
郭佳
张景鹏
徐路
李油
赵小琦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weibo Internet Technology China Co Ltd
Original Assignee
Weibo Internet Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weibo Internet Technology China Co Ltd filed Critical Weibo Internet Technology China Co Ltd
Priority to CN201911421842.3A
Publication of CN111178048A
Application granted
Publication of CN111178048B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract



Embodiments of the present invention provide a topic extraction method and device based on a smooth phrase topic model, including: extracting effective words from a data set to be processed to obtain a preprocessed data set; extracting frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set; according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, combining adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases, and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and analyzing the candidate phrase data set through an SPLDA smooth phrase topic model to obtain topic phrases, from which the corresponding topics are formed. Analyzing the candidate phrase data set through the smooth phrase topic model to obtain topic phrases and forming the corresponding topics from them improves the readability of the topics and expresses the real information of the topics more accurately.


Description

Smooth phrase topic model-based topic extraction method and device
Technical Field
The invention relates to the field of data mining, in particular to a smooth phrase topic model-based topic extraction method and device.
Background
With the rapid development of the internet, social platforms such as microblogs, WeChat and Toutiao have become mainstream media for disseminating information and publishing user opinions. The microblog, by virtue of its open platform, timely information, concise content and wide coverage, attracts more and more users, and has gradually become an important way for netizens to obtain news, communicate with each other, post comments and take part in discussions of social events, as well as an important platform reflecting social public opinion.
Common microblog hot search topics are typically described using manually labeled phrases, as shown in table 1.
TABLE 1 microblog hot search topic
(Table 1 appears as an image in the original document.)
In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:
most existing topic discovery methods perform feature extraction based on the bag-of-words model, which ignores the association information among the words in a phrase and loses part of the effective information. Because such methods represent a topic with isolated words, the topic representation has poor readability, is ambiguous, and cannot accurately reflect the real information of the topic. For example, mining the data of topic 1 yields isolated words such as "sun", "Korea" and "Song Hye-kyo", and it is difficult to obtain a result described by a phrase such as "Descendants of the Sun"; the comprehensiveness of the topic representation also needs to be improved.
Disclosure of Invention
The embodiment of the invention provides a smooth phrase topic model-based topic extraction method and device.
To achieve the above object, in one aspect, an embodiment of the present invention provides a topic extraction method based on a smooth phrase topic model, including:
extracting effective words in a data set to be processed to obtain a preprocessed data set;
extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into a new phrase according to the Gaussian distribution characteristic of the frequency of occurrence of the frequent phrases, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set;
and analyzing the candidate phrase data set through an SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
On the other hand, an embodiment of the present invention provides a topic extraction apparatus based on a smooth phrase topic model, including:
a preprocessing module, configured to extract effective words from a data set to be processed to obtain a preprocessed data set;
a phrase extraction module, configured to extract frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set, and to update the frequent phrase data set through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module, configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, the corresponding topics being formed through the topic phrases.
The technical scheme has the following beneficial effects: frequent phrases are generated with the Apriori association algorithm and, combined with the Gaussian distribution characteristics of the text, high-quality candidate phrases are generated, so that the candidate phrases are obtained with fast convergence. For microblog topics based on the smooth phrase topic model, candidate phrases are mined using the Gaussian distribution characteristics of the text, the candidate phrase data set is analyzed through the SPLDA smooth phrase topic model to obtain topic phrases, and the corresponding topics are formed through the topic phrases.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a topic extraction method based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a topic extraction apparatus based on a smooth phrase topic model according to an embodiment of the present invention;
FIG. 3 is a block diagram of topic extraction based on a smooth phrase topic model for an embodiment of the present invention;
FIG. 4 is a schematic diagram of a preprocessing module for an embodiment of the present invention;
fig. 5 is a schematic structural diagram of the SPLDA according to the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, in combination with the embodiment of the present invention, there is provided a smooth phrase topic model-based topic extraction method, including:
s101: extracting effective words in a data set to be processed to obtain a preprocessed data set;
s102: extracting frequent phrases from the preprocessing data set through an Apriori association algorithm to form a frequent phrase data set, and updating the frequent phrase data set through the Apriori association algorithm; combining adjacent frequent phrases meeting preset requirements in the preprocessing data set into a new phrase according to the Gaussian distribution characteristic of the frequency of occurrence of the frequent phrases, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set;
s103: and analyzing the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, and forming corresponding topics through the topic phrases.
Preferably, in step 102, extracting frequent phrases from the preprocessed data set by Apriori association algorithm to form a frequent phrase data set, which specifically includes:
s1021: the preprocessing data set comprises a data set at a text level, and when the occurrence frequency of a word in the data set at the text level is greater than the minimum support degree in an Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated;
s1022: the updating of the frequent phrase data set by the Apriori association algorithm specifically includes:
marking the position of each frequent phrase in the data set at the text level;
detecting whether the text-level data set contains a frequent phrase of a preset length; when it does, retaining the text-level data set, and otherwise deleting the text-level data set; and
for frequent phrases of the same length in a retained text-level data set: according to the position of a frequent phrase, when the phrase adjacent to one side of the frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase data set; and repeating the combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, in step 102, synthesizing adjacent frequent phrases meeting preset requirements in the preprocessed data set into a new phrase, and adding the new phrase into the frequent phrase data set to form a candidate phrase data set, which specifically includes:
s1023: acquiring two adjacent frequent phrases in a data set at a text level, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the data set at the text level, wherein the importance is the probability of the two frequent phrases appearing at the same position in the data set at the text level;
s1024: when the importance degree is not less than a preset first threshold value, adding the second-level phrase to the frequent phrase data set, and deleting the two adjacent frequent phrases;
s1025: repeating the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of any second-level phrase synthesized from two adjacent frequent phrases is smaller than the preset first threshold, to obtain the candidate phrase data set.
Preferably, step 103 specifically includes:
calculating the probability of the candidate phrase under different topics through an SPLDA smooth phrase topic model, and when the probability of the candidate phrase in a certain topic is not less than a second threshold value, taking the candidate phrase as a topic phrase, and forming a corresponding topic through the topic phrase.
Preferably, step 103 further comprises: calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probabilities of the candidate phrase under different topics through the standard deviation of the words.
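As a minimal illustration of the selection rule in step 103, the Python sketch below keeps a candidate phrase as a topic phrase whenever its probability under some topic reaches the second threshold; the function name and the toy probability table are hypothetical, and in practice the probabilities would come from the trained SPLDA model.

```python
# Minimal sketch of the topic-phrase selection rule of step 103.
# Names and the toy probabilities are hypothetical; real values would come
# from the trained SPLDA "topic-phrase" distribution.

def select_topic_phrases(topic_phrase_probs, second_threshold=0.01):
    """topic_phrase_probs: {topic_id: {phrase: probability}}."""
    topics = {}
    for topic_id, dist in topic_phrase_probs.items():
        # A candidate phrase becomes a topic phrase when its probability
        # under this topic is not less than the second threshold.
        phrases = [p for p, prob in dist.items() if prob >= second_threshold]
        if phrases:
            topics[topic_id] = sorted(phrases, key=dist.get, reverse=True)
    return topics

probs = {0: {"descendants of the sun": 0.12, "sun": 0.004},
         1: {"nobel prize": 0.09, "prize": 0.006}}
print(select_topic_phrases(probs))
# {0: ['descendants of the sun'], 1: ['nobel prize']}
```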
As shown in fig. 2, in combination with the embodiment of the present invention, there is also provided a topic extraction apparatus based on a smooth phrase topic model, including:
the preprocessing module 21: configured to extract effective words from a data set to be processed to obtain a preprocessed data set;
the phrase extraction module 22: configured to extract frequent phrases from the preprocessed data set through an Apriori association algorithm to form a frequent phrase data set, and to update the frequent phrase data set through the Apriori association algorithm; and, according to the Gaussian distribution characteristics of the occurrence frequency of the frequent phrases, to combine adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and add the new phrases to the frequent phrase data set to form a candidate phrase data set;
the topic generation module 23: configured to analyze the candidate phrase data set through the SPLDA smooth phrase topic model to obtain topic phrases, the corresponding topics being formed through the topic phrases.
Preferably, the phrase extraction module 22 includes a frequent phrase mining sub-module 221, and the frequent phrase mining sub-module 221 is specifically configured to:
the preprocessing data set comprises a data set at a text level, and when the occurrence frequency of a word in the data set at the text level is greater than the minimum support degree in an Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated; updating the frequent phrase data set by the Apriori association algorithm specifically includes:
marking the position of each frequent phrase in the text-level data set;
detecting whether the text-level data set contains a frequent phrase of a preset length; when it does, retaining the text-level data set, and otherwise deleting the text-level data set; and
in a retained text-level data set, for frequent phrases of the same length: according to the position of a frequent phrase, when the phrase adjacent to one side of the frequent phrase is also a frequent phrase, synthesizing the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases corresponding to the first-level phrase from the frequent phrase data set; and repeating the combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, completing the update of the frequent phrase data set.
Preferably, the phrase extraction module 22 includes a candidate phrase generation sub-module 222, which is specifically further configured to:
acquire two adjacent frequent phrases in a text-level data set, combine the two frequent phrases into a second-level phrase, and calculate the importance of the second-level phrase in the text-level data set, the importance being the probability of the two frequent phrases appearing at the same position in the text-level data set;
when the importance degree is not less than a preset first threshold value, adding the second-level phrase to the frequent phrase data set, and deleting the two adjacent frequent phrases;
and repeating the operation of combining two adjacent frequent phrases into a second-level phrase until the importance of any second-level phrase combined from two adjacent frequent phrases is less than the preset first threshold, to obtain the candidate phrase data set.
Preferably, the topic generation module 23 is specifically configured to: calculate the probabilities of a candidate phrase under different topics through the SPLDA smooth phrase topic model, and when the probability of the candidate phrase in a certain topic is not less than the second threshold, take the candidate phrase as a topic phrase and form the corresponding topic through the topic phrase.
Preferably, the topic generation module 23 is further configured to: calculate the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correct the probabilities of the candidate phrase under different topics through the standard deviation of the words.
The invention has the following beneficial effects:
Frequent phrases are generated with the Apriori association algorithm, which mines them quickly and effectively by using two important rules; high-quality candidate phrases are then generated by incorporating the Gaussian distribution characteristics of the text.
For microblog topics based on the smooth phrase topic model, candidate phrases are mined using the Gaussian distribution characteristics of the text; the probabilities of the candidate phrases under different topics are calculated through the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a certain topic is not less than the second threshold, the candidate phrase is taken as a topic phrase, and the corresponding topics are formed through the topic phrases.
By incorporating the variance (i.e., the standard deviation) of the "topic-word" probability distribution of the words in a phrase under the same topic, where a smaller variance indicates a higher possibility that the phrase belongs to the topic, the "topic-word" probability distribution is corrected, which improves the convergence speed of sampling and the accuracy of the result. Expressing topics with the generated topic phrases improves readability, reduces ambiguity, and expresses the real information of the topics more accurately.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
Abbreviations and key term definitions appearing in the present invention:
the topic model is as follows: a statistical model for discovering abstract topics in a series of documents in the fields of machine learning and natural language processing.
SPLDA: smoothened Phrase LDA (smooth Phrase topic model).
Smoothing a Gibbs sampling equation by combining the variance of the probability of the terms in the phrase under the same theme, and further correcting the distribution of the theme-phrase to mine the microblog topics.
In order to improve the effect of discovering microblog topics for public opinion monitoring, a topic discovery method (also called a topic extraction method) based on SPLDA (Smoothed Phrase LDA) is provided. Describing microblog topics with phrases helps people grasp the meaning of a topic more accurately and comprehensively, and further assists microblog public opinion monitoring.
The principle of the SPLDA-based topic discovery method is shown in fig. 3. A data set is first input into the preprocessing module, which extracts the effective words in the data set to be processed and obtains the preprocessed data set. The data set is the collected microblog content; it can contain many posts, which may be collected at random from day-to-day data. The preprocessed data set is obtained by hashtag filtering, word segmentation and stop-word removal, which are common steps of natural language processing (NLP).
The text processed by the preprocessing module is denoised and converted into a data form that can be fed to the Gaussian model. The preprocessed text is then input into the phrase mining module, which mines frequent phrases with the Apriori association algorithm and recombines them, using the Gaussian distribution characteristics of the number of occurrences of the frequent phrases in the preprocessed text, to generate candidate phrases. Frequent phrases are the frequent items; "frequent item" is a term in association algorithms that refers to a phrase meeting the support. The recombination of frequent phrases is specifically: if the importance of two adjacent phrases meets the threshold (their co-occurrence count reaches the minimum support), the two phrases are merged into one phrase; for example, "Beijing" and "Tiananmen" are merged into "Beijing Tiananmen". In this way a candidate phrase set is generated.
Finally, the candidate phrase set is input into the SPLDA topic phrase generation model and analyzed with the phrase topic model. By combining the association among the words of a phrase that belongs to a topic with the variance of their probability distributions under the same topic, the "topic-phrase" probability distribution is obtained, and one or more topic phrases with high probability values are selected to represent a topic. For example, among ten thousand microblog posts, the core topics may be several, such as celebrity gossip, the Nobel Prize, and the like.
The details are as follows:
as shown in fig. 4, a schematic diagram of the preprocessing module, a regular-expression filter is first used to remove noise symbols in the microblog data set, such as html tags, emoticons and punctuation marks; html tags are the tags in web page source code, for example <br> and <div>. Traditional Chinese characters are also converted to simplified ones. Then a Chinese word segmentation tool is used to segment the data set and tag parts of speech, and stop words are removed; stop words are words that occur often in the text but carry little meaning. Languages such as English are naturally segmented and need no word segmentation; the microblog is mainly Chinese, and the hot spots of the collected data set can be obtained by analyzing the Chinese data, so separate English preprocessing is omitted. Finally, microblog texts whose body contains fewer than 4 effective words are removed; effective word classes generally include nouns, verbs, adjectives, numerals, time words and the like.
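A minimal sketch of this preprocessing flow might look as follows; it assumes the jieba segmenter is available, and the regular expressions and stop-word list are illustrative, not the patent's exact filters.

```python
# Illustrative preprocessing sketch; assumes the jieba segmenter is installed.
# The regexes and the stop-word list are examples, not the patent's filters.
import re
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在"}        # example stop words
VALID_FLAGS = ("n", "v", "a", "m", "t")      # nouns, verbs, adjectives, numerals, time words

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)             # strip html tags such as <br>, <div>
    text = re.sub(r"#[^#]*#|http\S+", " ", text)     # strip hashtags and urls
    words = []
    for token in pseg.cut(text):                     # segmentation + part-of-speech tagging
        if token.word.strip() and token.word not in STOP_WORDS \
                and token.flag.startswith(VALID_FLAGS):
            words.append(token.word)                 # keep only effective word classes
    return words

def keep(post):
    # Posts whose body yields fewer than 4 effective words are discarded.
    return len(preprocess(post)) >= 4
```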
Phrase mining (the phrase extraction module) consists of two main steps: (1) frequent phrase mining, in which frequent phrases are generated with the Apriori association algorithm and their occurrence counts are recorded; and (2) candidate phrase generation, in which high-quality candidate phrases are generated by incorporating the Gaussian distribution characteristics of the text.
The task of frequent phrase mining is to collect all contiguous word sequences in the corpus (the preprocessed data set) whose counts are greater than the minimum support of the Apriori algorithm; the minimum support is the minimum number of occurrences of a word, for example 3. The algorithm mines frequent phrases quickly and effectively by using two important rules, the "downward closure principle" and the "anti-monotonicity of data", which are two parallel rules. Specifically, the downward closure principle states: "if the phrase P is not a frequent item, then any phrase containing P is also not a frequent item". The anti-monotonicity of data states: "if a document does not contain a frequent phrase of length n, then the document will not contain a frequent phrase of length greater than n".
The flow of the frequent phrase mining algorithm is shown in Table 2. The Apriori algorithm is first used to mine frequent items (frequent phrases) while maintaining an active index set. The specific operation is: phrases satisfying the support are derived with the downward closure principle, i.e., phrases containing an infrequent sub-phrase are removed. The active indexes are the index information of the frequent phrases of length n within the document of a microblog post; at initialization, the active index set holds the positions of all words in the data set that meet the support.
The anti-monotonicity rule is then used to judge whether a document needs further mining. Since the active indexes record the frequent phrases of length n in the document of a microblog post, a document containing no frequent phrase of some length n cannot contain a frequent phrase of length greater than n, and such a document is deleted in the next iteration of the Apriori algorithm. These two pruning techniques for frequent phrase mining exploit the natural sparsity of phrases, speed up convergence, allow the iteration to stop early, and improve the efficiency of phrase mining.
For example: 1. after word segmentation, the positions of the words meeting the support threshold are obtained; 2. suppose document d1 is processed: if the active word w1 (a frequent phrase) is in d1, w1 is combined with its adjacent word, according to the position of the frequent phrase in the text-level data set, to generate the phrase p1 (the adjacent word must itself be a frequent phrase to be combined; otherwise no combination is made, by the downward closure principle that if the phrase P is not a frequent item, any phrase containing P is also not a frequent item); 3. whether p1 meets the support is judged; if it does, p1 is added to the active index set, and after the iteration is completed (for example, when the frequent phrase length is 5), w1 and the word adjacent to w1 are deleted from the frequent phrase data set; 4. the active words in d1 are combined in this manner, and if no word of document d1 remains in the active index set, document d1 is deleted; 5. when a combined new phrase (first-level phrase) has the minimum support, the first-level phrase is added to the frequent phrase data set; 6. the above steps are iterated over the documents of all microblog posts until the active index set no longer changes.
TABLE 2 frequent phrase mining Algorithm
(Table 2 appears as an image in the original document.)
That is, in the Apriori frequent phrase mining algorithm, using the "downward closure principle" and the "anti-monotonicity of data", the corpus (data set) is scanned with a sliding window to generate phrases; the phrases satisfying the support are taken as frequent phrases and counted, and the size of the sliding window is increased by 1 per iteration (the size of the scanning window is the length of the phrase; a scanning window of n means searching for phrases of length n that satisfy the support). In a concrete implementation, the initial value of n is usually 2. At the n-th iteration, a candidate phrase of length n-1 is taken at each active index position and counted with a hash-based counter (HashMap), as shown in line 7 of Table 2. If a phrase of length n-1 does not satisfy the Apriori minimum support, the phrase is not added to the phrase candidate set, and at the same time the start position index of the phrase is removed from the active index set before the next iteration.
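A compact Python sketch of this sliding-window mining, under simplifying assumptions (documents as lists of words, a single minimum support count; a simplification of Table 2 rather than its exact pseudocode), might look as follows.

```python
# Sliding-window frequent phrase mining with the two pruning rules.
# A simplification of Table 2: docs are lists of words, min_support is a count.
from collections import Counter

def mine_frequent_phrases(docs, min_support):
    """Return {phrase_tuple: count} for all frequent phrases."""
    # Length-1 pass: every word meeting min_support is frequent, and its
    # positions form the initial active index set of each document.
    counts = Counter(w for doc in docs for w in doc)
    frequent = {(w,): c for w, c in counts.items() if c >= min_support}
    active = {d: [i for i, w in enumerate(doc) if counts[w] >= min_support]
              for d, doc in enumerate(docs)}

    n = 2
    while active:
        cand = Counter()
        for d, idxs in active.items():
            for i in idxs:
                if i + n <= len(docs[d]):
                    cand[tuple(docs[d][i:i + n])] += 1   # sliding window of size n
        new = {p: c for p, c in cand.items() if c >= min_support}
        if not new:                                      # no frequent phrase of length n
            break
        frequent.update(new)
        # Downward closure: keep only start positions whose length-n phrase is frequent.
        active = {d: [i for i in idxs if tuple(docs[d][i:i + n]) in new]
                  for d, idxs in active.items()}
        # Anti-monotonicity: drop documents with no frequent phrase of length n.
        active = {d: idxs for d, idxs in active.items() if idxs}
        n += 1
    return frequent
```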
The candidate phrase generation sub-module uses the result of the frequent phrase mining module to recombine the frequent phrases and generate candidate phrases. Assuming that the corpus (the data set processed by the Apriori algorithm) is generated by a series of independent Bernoulli trials, whether a phrase P occurs at a specific position in the corpus follows a Bernoulli distribution, the number of its occurrences follows a binomial distribution, and the occurrence events are independent of one another, i.e., the expected number of occurrences of each phrase follows its own binomial distribution. Since the number L of phrases in the corpus is quite large, the binomial distribution can reasonably be approximated by a Gaussian distribution (by the central limit theorem). Suppose the frequent phrases $P_1$ and $P_2$ are known. The method uses the importance index $sig(P_1, P_2)$ as the basis for deciding whether to merge the phrases $P_1$ and $P_2$, where $sig(P_1, P_2)$ measures the probability that the known frequent phrases $P_1$ and $P_2$ occur together, i.e., the probability of merging $P_1$ and $P_2$ into the same phrase (the co-occurrence of the two phrases). The index is derived as follows. Let $f(P)$ denote the number of occurrences of the phrase P in the corpus; its probability distribution is shown in formula (1).
$$h_0(f(P)) = N\big(L\,p(P),\; L\,p(P)(1 - p(P))\big) \approx N\big(L\,p(P),\; L\,p(P)\big) \qquad (1)$$
where $p(P)$ is the success probability of the Bernoulli trial for the phrase P; the empirical probability of the phrase P occurring in the corpus can be estimated as $p(P) \approx f(P)/L$.
Suppose the phrase $P_0$ is composed of the phrase $P_1$ and the phrase $P_2$ (phrases include both single words and multiword phrases), and $P_1$ and $P_2$ are independent of each other. The expected number of occurrences of the phrase $P_0$ is then:

$$\mu_0(P_1, P_2) = L\,p(P_1)\,p(P_2) \qquad (2)$$
From formulas (1) and (2), the variance of the distribution $h_0(f(P_0))$ is:

$$\sigma^2\big(f(P_0)\big) \approx L\,p(P_1)\,p(P_2)\,\big(1 - p(P_1)\,p(P_2)\big) \approx L\,p(P_1)\,p(P_2) \qquad (3)$$
the method uses the significance index $sig(P_1, P_2)$ to measure the probability that the phrases $P_1$ and $P_2$ occur together; it is calculated as in formula (4):

$$sig(P_1, P_2) = \frac{f(P_0) - \mu_0(P_1, P_2)}{\sqrt{f(P_0)}} \qquad (4)$$
The binomial-to-Gaussian approximation involves two formulas, the expectation $L\,p(P)$ and the variance $L\,p(P)(1 - p(P))$; both aim to obtain the Gaussian distribution of the events in which the phrase $P_0 = P_1 P_2$ occurs in the corpus.
According to the formula (1), a Gaussian distribution (which is a probability density function) of the phrases P1 and P2 appearing in the document at the same time is obtained, f (P1P2) is normalized to obtain sig (P1, P2), and the larger the value of the sig (P1, P2), the larger the probability that the phrases P1 and P2 can be combined into one phrase is.
The specific process of generating the candidate phrases is shown in Table 3; the candidate phrase generation rule recombines frequent phrases to generate longer phrases and improve phrase readability. After the data set passes through the Apriori frequent phrase mining module, the document of each microblog post consists of frequent phrases. First, for each document, a bottom-up agglomeration method is adopted: each frequent phrase is combined with the words adjacent to it on the left and right to generate a new phrase. Second, the importance sig of each new phrase is calculated; phrases meeting the threshold are added to the candidate phrase set, namely a MaxHeap hash container (the key of the container is the phrase and the value is its importance), and the phrases at the corresponding positions in the document are updated, i.e., the hash container is used to store the active index set. For example: if P1 and P2 in the document are "Beijing" and "Tiananmen" and sig(P1, P2) satisfies the threshold, P1 and P2 are merged in the document to generate the phrase "Beijing Tiananmen", and P1 and P2 are deleted from d1; if the threshold is not met, no change is made to P1 and P2. Next, the most important phrase Best is selected from the MaxHeap hash container (rows 2-4 in Table 3); Best is combined with the word or phrase adjacent to it in the document to generate a new phrase, new phrases meeting the threshold are added to the candidate phrase set, and Best is removed from the MaxHeap. Finally, the process iterates until the sig value of the Best phrase is smaller than the threshold or the MaxHeap container is empty.
TABLE 3 candidate phrase Generation procedure
(Table 3 appears as an image in the original document.)
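One way to realize the merging loop of Table 3 in Python is sketched below; heapq is a min-heap, so importance values are negated to emulate the MaxHeap container, and the significance function from the previous sketch is assumed. A production version would update the heap incrementally and write merged phrases back to the frequent phrase data set rather than rebuilding the heap each round.

```python
# Sketch of the Table 3 merging loop for one document; sig(p1, p2) is assumed
# to be the significance function above, threshold the first threshold.
import heapq

def generate_candidate_phrases(doc, sig, threshold):
    """doc: list of frequent phrases (tuples of words) covering one document."""
    doc = list(doc)
    while len(doc) > 1:
        # Rebuild the pair heap (heapq is a min-heap, so negate the importance).
        heap = [(-sig(doc[i], doc[i + 1]), i) for i in range(len(doc) - 1)]
        heapq.heapify(heap)
        best_neg_sig, i = heapq.heappop(heap)
        if -best_neg_sig < threshold:        # Best merge is insignificant: stop.
            break
        doc[i:i + 2] = [doc[i] + doc[i + 1]] # Merge the two adjacent phrases.
    return doc
```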
After the data set passes through the preprocessing and phrase mining modules, the content of each document (the body of each microblog post) has been divided into several candidate phrases, and the document is converted from a bag-of-words form into a bag-of-phrases form. From the phrase mining module it is known that the words within a generated phrase have a strong association, so in the phrase topic model here it can reasonably be assumed that all the words in a phrase share one topic.
In 2003, Blei proposed the LDA (Latent Dirichlet Allocation) model, introducing Dirichlet prior distributions on the basis of the pLSA model. In the LDA model, the text d is assumed to follow a multinomial distribution $\theta_d$ over all K topics, and each topic k follows a multinomial distribution $\varphi_k$ over the set of words, with Dirichlet priors $\theta_d \sim \mathrm{Dir}(\alpha)$ and $\varphi_k \sim \mathrm{Dir}(\beta)$.
The SPLDA method is based on LDA and replaces the bag-of-words model in LDA with a bag-of-phrases model, as shown in fig. 5. The SPLDA model is based on the assumptions that documents are independent and that the words in a phrase belong to the same topic. Document independence means that if there are 10,000 microblog posts in the data set, each post is independent and they do not influence each other. As an example of the shared-topic assumption, if the phrase P0 "Peking University score line" belongs to the topic "Gaokao" (the college entrance examination), then P1 "Peking University" and P2 "score line" both belong to the topic "Gaokao" as well. The SPLDA generative process for a document is as follows:
(1) For each topic k (a corpus may contain multiple microblog documents corresponding to multiple topics):
a) generate the "topic-word" distribution $\varphi_k \sim \mathrm{Dir}(\beta)$;
b) update the "topic-phrase" distribution when the topics of the words in a phrase are the same.
(2) For each document d:
a) generate the document-topic distribution $\theta_d \sim \mathrm{Dir}(\alpha)$;
b) for the i-th word in the text:
i. generate a topic assignment $z_{d,i} \sim \mathrm{Multi}(\theta_d)$;
ii. generate the word $w_{d,i} \sim \mathrm{Multi}(\varphi_{z_{d,i}})$.
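The generative story above can be simulated directly, as in the sketch below; the hyperparameters are illustrative, and the shared-topic assumption is enforced by drawing a single topic per phrase rather than per word.

```python
# Simulation of the SPLDA generative process (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
K, V = 5, 1000                       # number of topics, vocabulary size
alpha, beta = 0.1, 0.01              # Dirichlet hyperparameters (illustrative)

# (1) For each topic k: "topic-word" distribution phi_k ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(phrase_lengths):
    """phrase_lengths: number of words in each candidate phrase of the document."""
    theta = rng.dirichlet(np.full(K, alpha))   # (2a) theta_d ~ Dir(alpha)
    doc = []
    for g_len in phrase_lengths:
        z = rng.choice(K, p=theta)             # one topic per phrase: C_{d,g} = z
        words = rng.choice(V, size=g_len, p=phi[z])  # all words share topic z
        doc.append((int(z), words.tolist()))
    return doc

print(generate_document([2, 3, 1]))            # three phrases of lengths 2, 3, 1
```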
The variables involved in the SPLDA model and their meanings are shown in table 4.
TABLE 4 SPLDA model-related parameters
(Table 4 appears as an image in the original document.)
In Table 4, the vocabulary is the list generated from all the distinct words in the data set. For example, all distinct words in the data set are stored in a file, one word per line, and the line number is the position of the word in the vocabulary. $N_d$ is the number of tokens in document d, a token being one word; $G_d$ is the number of candidate phrases in document d, a candidate phrase containing one or more words.
In the SPLDA model, a random variable $C_{d,g}$ is used to represent the topic of the g-th phrase of document d. The joint distribution of the parameters Z, W, $\Phi$, $\Theta$ in SPLDA is:

$$P_{SPLDA}(Z, W, \Phi, \Theta) = \frac{1}{C}\, P_{LDA}(Z, W, \Phi, \Theta) \prod_{d}\prod_{g} f(C_{d,g}) \qquad (5)$$

where Z are the topic assignments, W are the words, $\Phi$ is the set of per-topic "topic-word" distributions $\varphi_k$, and $\Theta$ is the set of per-document topic distributions $\theta_d \sim \mathrm{Dir}(\alpha)$.
Wherein P isLDA(Z, W, phi, theta) is a parameter joint distribution function in the LDA model, see formula (6), and C is a normalization constant for ensuring that the left side of formula (6) is a reasonable probability distribution. Function f (C)d,g) Is used to constrain the words in the phrase to belong to the same topic, see formula (7).
Figure BDA0002352587010000124
Suppose the phrase $P_0$ has the same "topic-phrase" probability under topic $z_j$ and topic $z_i$, i.e., $p(C_{d,g} = z_j) = p(C_{d,g} = z_i)$, while the words $\{w_{d,g,i}\}$ contained in $P_0$ have a small spread in their "topic-word" probabilities under topic $z_i$ and a large spread under topic $z_j$. In this case, assigning the topic $z_i$ to the phrase $P_0$ is more consistent with the assumption that the words in a phrase belong to the same topic. However, when the parameters of the bag-of-phrases topic model are optimized with the conventional Gibbs sampling method, the choice between topics $z_j$ and $z_i$ for the phrase $P_0$ cannot be distinguished, leading to an inaccurate "topic-phrase" distribution. To address this problem, the method uses a statistical property of the probability distribution of the words $\{w_{d,g,i}\}$ under the same topic (namely, the standard deviation) to improve the Gibbs sampling method and thereby correct the "topic-phrase" probability distribution. $C_{d,g}$ has K possible values; $C_{d,g} = k$ denotes that every word in the phrase is assigned topic k, where K is the number of topics and k is a specific topic. The parameter optimization equations are as follows:
(Formulas (8) and (9) appear only as images in the original; they give the smoothed Gibbs sampling updates for $p(C_{d,g} = k)$, incorporating the standard-deviation correction $\mathrm{Var}/\tanh(\mathrm{Var})$ described below.)
In formulas (8) and (9), Var (VarSqrt) is the standard deviation of the probability distribution of the words in the phrase under topic k. In equation (8), as Var increases, the value of $\mathrm{Var}/\tanh(\mathrm{Var})$ grows, penalizing $p(C_{d,g} = k)$: the probability that the phrase is assigned topic k becomes smaller. As Var decreases, $\mathrm{Var}/\tanh(\mathrm{Var})$ shrinks toward 1 (its limit as Var approaches 0), mitigating the penalty on $p(C_{d,g} = k)$: the probability of the phrase taking topic k becomes relatively larger. Through formula (8), the method incorporates the difference among the probability distributions of the words in a phrase under the same topic into the training of the topic model, thereby correcting the "topic-phrase" probability distribution.
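Expressed in code, the correction is compact; the sketch below reflects one reading of the image-only formulas (8)-(9), in which the unsmoothed score for assigning topic k to a phrase is divided by Var/tanh(Var), with Var the standard deviation of the words' "topic-word" probabilities under topic k.

```python
# One reading of the Var/tanh(Var) smoothing of formulas (8)-(9).
import math

def smoothing_factor(word_probs):
    """word_probs: 'topic-word' probabilities of the phrase's words under topic k."""
    mean = sum(word_probs) / len(word_probs)
    var = math.sqrt(sum((p - mean) ** 2 for p in word_probs) / len(word_probs))
    if var == 0.0:
        return 1.0                   # lim_{Var -> 0} Var / tanh(Var) = 1
    return var / math.tanh(var)      # grows with Var, penalizing dispersed words

def smoothed_score(raw_score, word_probs):
    # Dividing by the factor lowers p(C_{d,g} = k) when the words' probabilities
    # disagree under topic k, and leaves it nearly unchanged when they agree.
    return raw_score / smoothing_factor(word_probs)

print(round(smoothing_factor([0.010, 0.011, 0.009]), 4))  # ~1.0 (agreement)
print(round(smoothing_factor([0.90, 0.05, 0.85]), 4))     # ~1.05 (penalized)
```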
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store program code in the form of instructions or data structures and that can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium; thus, software transmitted from a website, server, or other remote source via coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave, is included. Disks (disk) and discs (disc) include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1.一种基于平滑短语主题模型的主题提取方法,其特征在于,包括:1. a topic extraction method based on smooth phrase topic model, is characterized in that, comprises: 提取待处理数据集内的有效词,得到预处理数据集;Extract valid words in the data set to be processed to obtain a preprocessing data set; 通过Apriori关联算法自预处理数据集中提取出频繁短语,形成频繁短语数据集,并通过Apriori关联算法更新频繁短语数据集;根据频繁短语出现频率的高斯分布特性,将预处理数据集中符合预设要求的相邻的频繁短语组合成新的短语,并将新的短语加入到频繁短语数据集,形成候选短语数据集;The frequent phrases are extracted from the preprocessed data set by the Apriori association algorithm to form the frequent phrase data set, and the frequent phrase data set is updated by the Apriori association algorithm; according to the Gaussian distribution characteristics of the frequency of frequent phrases, the preprocessed data set meets the preset requirements. The adjacent frequent phrases are combined into new phrases, and the new phrases are added to the frequent phrase dataset to form a candidate phrase dataset; 通过SPLDA平滑短语主题模型对候选短语数据集进行分析,得到主题短语,通过主题短语形成相应的话题。The candidate phrase data set is analyzed by the SPLDA smooth phrase topic model, and the topic phrase is obtained, and the corresponding topic is formed by the topic phrase. 2.根据权利要求1所述的基于平滑短语主题模型的主题提取方法,其特征在于,2. the topic extraction method based on smooth phrase topic model according to claim 1, is characterized in that, 所述通过Apriori关联算法自预处理数据集中提取出频繁短语,形成频繁短语数据集,具体包括:The frequent phrases are extracted from the preprocessing data set through the Apriori association algorithm to form a frequent phrase data set, which specifically includes: 所述预处理数据集包括文本级别的数据集,当所述文本级别的数据集中某个词出现的次数大于Apriori算法中的最小支持度,则设定该词为频繁短语,生成频繁短语数据集;The preprocessing data set includes a text-level data set. When the number of occurrences of a certain word in the text-level data set is greater than the minimum support degree in the Apriori algorithm, the word is set as a frequent phrase, and a frequent phrase data set is generated. ; 所述通过Apriori关联算法更新频繁短语数据集,具体包括:The updating of the frequent phrase data set through the Apriori association algorithm specifically includes: 并标记每个频繁短语在所述文本级别的数据集中的所在位置;and mark the location of each frequent phrase in the text-level dataset; 检测文本级别的数据集中是否包含预设长度的频繁短语,当包含预设长度的频繁短语时则保留该文本级别的数据集;否则删除该文本级别的数据集;以及,Detecting whether the text-level dataset contains frequent phrases of a preset length, and when the frequent phrases of the preset length are included, the text-level dataset is retained; otherwise, the text-level dataset is deleted; and, 在保留的文本级别的数据集中,针对同一长度的频繁短语,根据频繁短语所在位置,当与该频繁短语一侧相邻的短语也为频繁短语时,将频繁短语与该相邻的短语合成为第一级短语,当第一级短语达到最小支持度时,将该第一级短语添加到频繁短语数据集内,并将该第一级短语对应的两个相邻的频繁短语从频繁短语数据集中删除;重复循环将频繁短语与相邻的短语合成第一级短语直到第一级短语不满足最小支持度,完成对频繁短语数据集的更新。In the reserved text-level data set, for frequent phrases of the same length, according to the location of the frequent phrase, when the phrase adjacent to the frequent phrase is also a frequent phrase, the frequent phrase and the adjacent phrase are synthesized as The first-level phrase, when the first-level phrase reaches the minimum support, the first-level phrase is added to the frequent phrase data set, and the two adjacent frequent phrases corresponding to the first-level phrase are removed from the frequent phrase data set. Centralized deletion; repeated cycles synthesizing frequent phrases and adjacent phrases into first-level phrases until the first-level phrases do not meet the minimum support degree, completing the update of the frequent phrase data set. 3.根据权利要求2所述的基于平滑短语主题模型的主题提取方法,其特征在于,将预处理数据集中符合预设要求的相邻的频繁短语合成新的短语,并将新的短语加入到频繁短语数据集,形成候选短语数据集,具体包括:3. 
The topic extraction method based on a smooth phrase topic model according to claim 2, wherein the adjacent frequent phrases that meet the preset requirements in the preprocessing data set are synthesized into new phrases, and the new phrases are added to the Frequent phrase datasets to form candidate phrase datasets, including: 获取文本级别的数据集中两个相邻的频繁短语并将该两个频繁短语合为第二级短语,计算该第二级短语在文本级别的数据集中的重要度,所述重要度为该两个频繁短语在文本级别的数据集中相同位置出现的概率;Obtain two adjacent frequent phrases in the text-level dataset and combine the two frequent phrases into second-level phrases, and calculate the importance of the second-level phrase in the text-level dataset, where the importance is the two The probability that a frequent phrase appears in the same position in the text-level dataset; 当重要度不小于预设的第一阈值时,将该第二级短语添加到频繁短语数据集,并删除该两个相邻的频繁短语;When the importance is not less than the preset first threshold, add the second-level phrase to the frequent phrase data set, and delete the two adjacent frequent phrases; 循环将两个相邻的频繁短语合为一个第二级短语的操作,直到任何两个相邻的频繁短语合成的第二级短语的重要度小于预设的第一阈值,得到候选短语数据集。The operation of combining two adjacent frequent phrases into a second-level phrase is repeated until the importance of the second-level phrase synthesized by any two adjacent frequent phrases is less than the preset first threshold, and the candidate phrase dataset is obtained. . 4.根据权利要求1所述的基于平滑短语主题模型的主题提取方法,其特征在于,通过SPLDA平滑短语主题模型对候选短语数据集进行分析,得到主题短语,通过主题短语形成相应的话题,具体包括:4. the subject extraction method based on smooth phrase subject model according to claim 1, is characterized in that, candidate phrase data set is analyzed by SPLDA smooth phrase subject model, obtain subject phrase, form corresponding topic by subject phrase, concretely. include: 通过SPLDA平滑短语主题模型计算候选短语在不同主题下的概率,当该候选短语在某主题中的概率不小于第二阈值时,将该候选短语作为主题短语,通过该主题短语形成相应的话题。The probability of candidate phrases under different topics is calculated by the SPLDA smooth phrase topic model. When the probability of the candidate phrase in a topic is not less than the second threshold, the candidate phrase is regarded as the topic phrase, and the corresponding topic is formed through the topic phrase. 5.根据权利要求4所述的基于平滑短语主题模型的主题提取方法,其特征在于,还包括:计算候选短语中的词在主题下的概率分布的标准差,通过词的标准差修正该候选短语在不同主题下的概率。5. The topic extraction method based on a smooth phrase topic model according to claim 4, further comprising: calculating the standard deviation of the probability distribution of the word in the candidate phrase under the topic, and correcting the candidate by the standard deviation of the word Probabilities of phrases under different topics. 6.一种基于平滑短语主题模型的主题提取装置,其特征在于,包括:6. 
4. The topic extraction method based on a smooth phrase topic model according to claim 1, characterized in that analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases and forming the corresponding topics from the topic phrases specifically comprises:
calculating the probability of each candidate phrase under the different topics with the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a topic is not less than a second threshold, taking that candidate phrase as a topic phrase and forming the corresponding topic from the topic phrase.

5. The topic extraction method based on a smooth phrase topic model according to claim 4, characterized in that it further comprises: calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probability of the candidate phrase under the different topics by the standard deviation of the words.
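SPLDA is the patent's own model, so no public API is assumed here. The sketch below, not part of the claims, only illustrates the thresholding of claim 4 and the standard-deviation correction of claim 5 against already-estimated per-topic word distributions phi, a stand-in input; the geometric-mean phrase probability, the damping rule, and SECOND_THRESHOLD are illustrative assumptions, not the patented formulas.

```python
# phi[k] maps each word to its probability under topic k, a stand-in for
# the distributions an SPLDA-style model would estimate. The scoring and
# correction rules below are assumptions, not the patented formulas.
import math
import statistics

SECOND_THRESHOLD = 0.05  # hypothetical second threshold of claim 4

def phrase_probability(phi_k, phrase):
    """Geometric mean of the word probabilities under one topic, an
    assumed way to lift word probabilities to a phrase probability."""
    probs = [phi_k.get(w, 1e-12) for w in phrase]
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

def corrected_probability(phi_k, phrase):
    """Claim-5-style correction: the larger the spread of the word
    probabilities under the topic, the more the phrase is damped."""
    probs = [phi_k.get(w, 1e-12) for w in phrase]
    spread = statistics.pstdev(probs)
    return phrase_probability(phi_k, phrase) * (1.0 - min(spread, 1.0))

def topic_phrases(phi, candidates):
    """Claim 4: a candidate becomes a topic phrase of every topic where
    its corrected probability is not less than the second threshold."""
    result = {}
    for phrase in candidates:
        for k, phi_k in phi.items():
            if corrected_probability(phi_k, phrase) >= SECOND_THRESHOLD:
                result.setdefault(k, []).append(phrase)
    return result
```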
6. A topic extraction device based on a smooth phrase topic model, characterized in that it comprises:
a preprocessing module for extracting the valid words in a data set to be processed, to obtain a preprocessed data set;
a phrase extraction module for extracting frequent phrases from the preprocessed data set with the Apriori association algorithm to form a frequent phrase data set and updating the frequent phrase data set with the Apriori association algorithm, and for combining, according to the Gaussian distribution of the occurrence frequencies of frequent phrases, adjacent frequent phrases in the preprocessed data set that meet preset requirements into new phrases and adding the new phrases to the frequent phrase data set to form a candidate phrase data set; and
a topic generation module for analyzing the candidate phrase data set with the SPLDA smooth phrase topic model to obtain topic phrases and forming the corresponding topics from the topic phrases.

7. The topic extraction device based on a smooth phrase topic model according to claim 6, characterized in that the phrase extraction module comprises a frequent phrase mining submodule specifically configured so that:
the preprocessed data set comprises text-level data sets; when the number of occurrences of a word in a text-level data set is greater than the minimum support of the Apriori algorithm, that word is set as a frequent phrase and the frequent phrase data set is generated;
and updating the frequent phrase data set with the Apriori association algorithm specifically comprises:
marking the position of each frequent phrase in the text-level data set;
detecting whether a text-level data set contains frequent phrases of a preset length, retaining the text-level data set when it does, and deleting it otherwise; and
in the retained text-level data sets, for frequent phrases of the same length and according to their positions, when the phrase adjacent to one side of a frequent phrase is also a frequent phrase, combining the frequent phrase and the adjacent phrase into a first-level phrase; when the first-level phrase reaches the minimum support, adding the first-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases from which it was formed; repeating this combination of frequent phrases with adjacent phrases into first-level phrases until no first-level phrase meets the minimum support, which completes the update of the frequent phrase data set.

8. The topic extraction device based on a smooth phrase topic model according to claim 7, characterized in that the phrase extraction module comprises a candidate phrase generation submodule specifically configured for:
obtaining two adjacent frequent phrases in a text-level data set, combining the two frequent phrases into a second-level phrase, and calculating the importance of the second-level phrase in the text-level data set, the importance being the probability that the two frequent phrases appear in the same position in the text-level data set;
when the importance is not less than the preset first threshold, adding the second-level phrase to the frequent phrase data set and deleting the two adjacent frequent phrases; and
repeating the operation of combining two adjacent frequent phrases into one second-level phrase until the importance of the second-level phrase formed from any two adjacent frequent phrases is less than the preset first threshold, which yields the candidate phrase data set.

9. The topic extraction device based on a smooth phrase topic model according to claim 6, characterized in that the topic generation module is specifically configured for:
calculating the probability of each candidate phrase under the different topics with the SPLDA smooth phrase topic model; when the probability of a candidate phrase in a topic is not less than the second threshold, taking that candidate phrase as a topic phrase and forming the corresponding topic from the topic phrase.

10. The topic extraction device based on a smooth phrase topic model according to claim 9, characterized in that the topic generation module is further specifically configured for:
calculating the standard deviation of the probability distribution of the words in a candidate phrase under a topic, and correcting the probability of the candidate phrase under the different topics by the standard deviation of the words.
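Read as software, claims 6 to 10 describe three cooperating modules. The class below is a hypothetical wiring of such modules, with every name invented for illustration, showing how the device of claim 6 mirrors the method of claims 1 to 5.

```python
# Hypothetical composition of the three modules of claim 6; the callables
# are stand-ins for the preprocessing, phrase extraction, and topic
# generation modules, not the patented implementation.
from typing import Callable, Dict, List, Tuple

class TopicExtractionDevice:
    def __init__(
        self,
        preprocess: Callable[[List[str]], List[List[str]]],
        extract_phrases: Callable[[List[List[str]]], List[Tuple[str, ...]]],
        generate_topics: Callable[[List[List[str]], List[Tuple[str, ...]]], Dict],
    ):
        self.preprocess = preprocess            # claim 6: preprocessing module
        self.extract_phrases = extract_phrases  # claim 6: phrase extraction module
        self.generate_topics = generate_topics  # claim 6: topic generation module

    def run(self, raw_documents: List[str]) -> Dict:
        docs = self.preprocess(raw_documents)
        candidates = self.extract_phrases(docs)
        return self.generate_topics(docs, candidates)
```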
CN201911421842.3A 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model Active CN111178048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911421842.3A CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911421842.3A CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Publications (2)

Publication Number Publication Date
CN111178048A (en) 2020-05-19
CN111178048B (en) 2023-08-01

Family

ID=70654319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911421842.3A Active CN111178048B (en) 2019-12-31 2019-12-31 Topic extraction method and device based on smooth phrase topic model

Country Status (1)

Country Link
CN (1) CN111178048B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030065655A1 (en) * 2001-09-28 2003-04-03 International Business Machines Corporation Method and apparatus for detecting query-driven topical events using textual phrases on foils as indication of topic
US20180357684A1 (en) * 2017-01-12 2018-12-13 Hefei University Of Technology Method for identifying prefereed region of product, apparatus and storage medium thereof
US20180211287A1 (en) * 2017-01-24 2018-07-26 International Business Machines Corporation Digital content generation based on user feedback
CN108399162A (en) * 2018-03-21 2018-08-14 北京理工大学 The topic of phrase-based bag topic model finds method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
余琴琴; 彭敦陆; 刘丛: "Feature phrase extraction model based on frequent word sets in large-scale word sequences" (大规模词序列中基于频繁词集的特征短语抽取模型), Journal of Chinese Computer Systems (小型微型计算机系统), no. 05 *
杨; 张德生: "Topic key-phrase extraction techniques for Chinese text" (中文文本的主题关键短语提取技术), Computer Science (计算机科学), no. 2 *
熊才伟; 曹亚男: "Research on mining Weibo users' interests from their posts" (基于发文内容的微博用户兴趣挖掘方法研究), vol. 35, no. 06, p. 1620 *
肖波: "Research on trusted association rule mining algorithms" (可信关联规则挖掘算法研究), no. 05, pp. 27-28 *

Also Published As

Publication number Publication date
CN111178048B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Chan et al. Sentiment analysis in financial texts
CN112347778B (en) Keyword extraction method, keyword extraction device, terminal equipment and storage medium
CN109241274B (en) Text clustering method and device
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
US20140032207A1 (en) Information Classification Based on Product Recognition
CN109783787A (en) A kind of generation method of structured document, device and storage medium
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN104899230A (en) Public opinion hotspot automatic monitoring system
CN116050397B (en) Method, system, equipment and storage medium for generating long text abstract
CN102591988A (en) Short text classification method based on semantic graphs
CN110750635A (en) A method for legal recommendation based on joint deep learning model
CN106407195B (en) Method and system for deduplication of web pages
CN111177375A (en) Electronic document classification method and device
CN112528653B (en) Short text entity recognition method and system
US9754023B2 (en) Stochastic document clustering using rare features
CN117454220A (en) Data hierarchical classification method, device, equipment and storage medium
Ma et al. The impact of weighting schemes and stemming process on topic modeling of arabic long and short texts
CN113408660B (en) Book clustering method, device, equipment and storage medium
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN116029280A (en) A document key information extraction method, device, computing device and storage medium
CN113157857B (en) News-oriented hot topic detection method, device and equipment
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN103886097A (en) Chinese microblog viewpoint sentence recognition feature extraction method based on self-adaption lifting algorithm
CN111178048B (en) Topic extraction method and device based on smooth phrase topic model
Ait Addi et al. Supervised classifiers and keyword extraction methods for text classification in Arabic

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant