
CN113064990A - A method and system for hot spot event recognition based on multi-level clustering - Google Patents

A method and system for hot spot event recognition based on multi-level clustering

Info

Publication number
CN113064990A
Authority
CN
China
Prior art keywords
event
cluster
word
text
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110003161.6A
Other languages
Chinese (zh)
Inventor
林越峰
鲁继东
苗仲辰
王晨宇
倪梦珺
江航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Financial Futures Information Technology Co ltd
Original Assignee
Shanghai Financial Futures Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Financial Futures Information Technology Co ltd
Priority to CN202110003161.6A
Publication of CN113064990A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for identifying hot spot events based on multi-level clustering, which can accurately identify hot spot events in real time and provide feature words that represent the hot spot events, so that hot spot public opinion is described accurately and users can read hot spots more efficiently. The technical scheme is: preprocessing the text and dividing the text content into a plurality of phrases; performing text vectorization on the phrase-segmented text to form a vectorized event set; aggregating the vectorized event set with an unsupervised clustering algorithm to form hot spot event clusters; vectorizing each event cluster with a deep learning algorithm and aggregating again with an unsupervised clustering algorithm; and generating topic cluster descriptions with a new word discovery algorithm.

Description

Hot event identification method and system based on multi-level clustering
Technical Field
The invention relates to an automatic identification technology of hot topics, in particular to a method and a system for automatically identifying hot event topics based on a multilevel text clustering algorithm.
Background
In recent years, with the rapid development of the internet, social networks such as microblogs and WeChat have emerged, allowing information to spread rapidly and its volume to grow explosively, so that the text a user browses is both excessive and scattered. In the financial field, moreover, public sentiment is closely tied to market trends. An automatic information extraction tool is therefore urgently needed to help people quickly find valuable information in massive volumes of news, extract news hot spots, group texts that report similar content, and understand the associations and hierarchical relationships among news items.
The conventional solution is to manually specify the hierarchical relationships among news items, provide labeled data to train a machine learning model, and then use the trained model for text classification. This approach consumes a great deal of labor: in the financial field in particular, obtaining labeled data often requires many financial professionals to take part in annotation, which is expensive and also lengthens the product development cycle.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
The invention aims to solve the problems and provides a hot event identification method and system based on multi-level clustering, which can accurately identify hot events in real time and provide characteristic words capable of representing the hot events to accurately describe hot public sentiments, so that the efficiency of reading hot sentiments by a user can be improved.
The technical scheme of the invention is as follows: the invention discloses a hot event identification method based on multi-level clustering, which comprises the following steps:
step 1: preprocessing a text, and dividing the text content into a plurality of phrases;
step 2: performing text vectorization on the phrase-segmented text to form a vectorized event set;
step 3: aggregating the vectorized event set with an unsupervised clustering algorithm to form hot spot event clusters;
step 4: vectorizing each event cluster with a deep learning algorithm and aggregating again with an unsupervised clustering algorithm.
According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 1 further includes:
step 1-1: importing a professional lexicon and a stop word list to assist a Chinese word segmentation module;
step 1-2: identifying the main institutions and names of people appearing in the text using named entity recognition;
step 1-3: segmenting the text into a plurality of phrases with the Chinese word segmentation module.
According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 2 further includes:
step 2-1: calculating the number of times each word appears in the text, namely the term frequency, and normalizing it;
step 2-2: calculating the inverse document frequency;
step 2-3: vectorizing each news item in the text with the term frequency-inverse document frequency (TF-IDF) algorithm.
According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 3 further includes:
step 3-1: inputting a news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold $\theta$;
step 3-2: taking one news item as the initial cluster center, and calculating its content similarity with the other news items;
step 3-3: comparing the calculated content similarities with the minimum threshold $\theta$; if all of them are smaller than $\theta$, a new cluster is created with $d_1$ as its cluster center, otherwise $d_1$ is assigned to the cluster with the highest similarity;
step 3-4: aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters.
According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 4 further includes:
step 4-1: treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to compute the probabilities of the two words adjacent to the current word $w_i$ and selects from the dictionary the word with the highest probability as output, the event cluster vector $u_j$ obtained in the previous iteration also being input into the skip-gram algorithm;
step 4-2: taking the difference between the words computed by $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ with the back-propagation algorithm, and then updating the corresponding event cluster vector $u_j$;
step 4-3: repeating steps 4-1 to 4-2 until the value of $u_j$ stabilizes or all text in the event cluster has been trained on;
step 4-4: collecting the vectorization results of all event clusters, using them as the input of a single-pass algorithm for a second round of clustering, and defining the results as topic clusters.
According to an embodiment of the hot event identification method based on multi-level clustering of the present invention, the method further comprises:
step 5: generating topic cluster descriptions with a new word discovery algorithm.
According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 5 further includes:
step 5-1: gathering all news in each topic cluster together, using the segmented result as input through a Chinese word segmentation module, and respectively calculating three indexes of word frequency, polymerization degree and freedom degree;
step 5-2: and taking the product of word frequency, polymerization degree and freedom degree as a sequencing index, and generating a representative word as topic description.
The invention also discloses a hot spot event recognition system based on multi-level clustering, comprising:
the phrase segmentation module is configured to preprocess the text and segment the text content into a plurality of phrases;
the vectorization module is configured to perform text vectorization processing on the text subjected to the phrase segmentation to form a vectorized event set;
the event cluster acquisition module adopts an unsupervised clustering algorithm to aggregate the event sets of the vector quantization to form an event cluster of the hot spot;
and the aggregation module is used for vectorizing each event cluster by adopting a deep learning algorithm and aggregating by using an unsupervised clustering algorithm again.
According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the phrase segmentation module is further configured to process the following: importing a professional lexicon and a stop word list to assist a Chinese word segmentation module; identifying the main institutions and names of people appearing in the text using named entity recognition; segmenting the text into a plurality of phrases with the Chinese word segmentation module.
According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the vectorization module is further configured to process the following: calculating the number of times each word appears in the text, namely the term frequency, and normalizing it; calculating the inverse document frequency; and vectorizing each news item in the text with the term frequency-inverse document frequency (TF-IDF) algorithm.
According to an embodiment of the hot spot event recognition system based on multi-level clustering of the present invention, the event cluster acquisition module is further configured to process the following: inputting a news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold $\theta$; taking one news item as the initial cluster center, and calculating its content similarity with the other news items; comparing the calculated content similarities with the minimum threshold $\theta$, and if all of them are smaller than $\theta$, creating a new cluster with $d_1$ as its cluster center, otherwise assigning $d_1$ to the cluster with the highest similarity; and aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters.
According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the aggregation module is further configured to process the following: treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to compute the probabilities of the two words adjacent to the current word $w_i$ and selects from the dictionary the word with the highest probability as output, the event cluster vector $u_j$ obtained in the previous iteration also being input into the skip-gram algorithm; taking the difference between the words computed by $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ with the back-propagation algorithm, and then updating the corresponding event cluster vector $u_j$; repeating the above two steps until the value of $u_j$ stabilizes or all text in the event cluster has been trained on; and collecting the vectorization results of all event clusters, using them as the input of a single-pass algorithm for a second round of clustering, and defining the results as topic clusters.
According to an embodiment of the hot spot event identification system based on multi-level clustering of the present invention, the system further includes:
and the topic cluster description generation module generates topic cluster description by using a new word discovery algorithm.
According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the topic cluster description generation module is further configured to process the following: gathering all news in each topic cluster together, using the segmented result as input through a Chinese word segmentation module, and respectively calculating three indexes of word frequency, polymerization degree and freedom degree; and taking the product of word frequency, polymerization degree and freedom degree as a sequencing index, and generating a representative word as topic description.
Compared with the prior art, the invention has the following beneficial effects:
First, the overall architecture of the method is original. Conventional technical pipelines cannot handle multi-level text clustering without labeled data and manual intervention; the method is the first to combine deep learning with traditional TF-IDF vectorization for text representation, laying a foundation for multi-level text clustering.
Secondly, aiming at the fields with strong specialization and few labels (such as the financial field), the invention adopts the way of a financial professional lexicon and an entity recognition algorithm to increase the effectiveness of Chinese word segmentation and improve the effect of a news hotspot discovery algorithm.
Thirdly, compared with the existing hot spot discovery technology, the method can accurately identify the characteristic words representing the events through hot word discovery, form accurate description of the hot spot public sentiment and improve the efficiency of reading the hot spots by the user.
Fourthly, the method can intelligently identify recent hot words through topic description, automatically improve algorithm effect and enhance the real-time property of hot spot discovery.
Drawings
The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.
FIG. 1 is a flowchart illustrating a hot spot event identification method based on multi-level clustering according to an embodiment of the present invention.
Fig. 2 shows a refined flow diagram of a partial step in the embodiment of the method shown in fig. 1.
Fig. 3 shows a refined flow chart of a partial step in the embodiment of the method shown in fig. 1.
FIG. 4 is a schematic diagram of an embodiment of a hot spot event recognition system based on multi-level clustering according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.
Fig. 1 shows a flow of an embodiment of a hot event identification method based on multi-level clustering according to the present invention. Referring to fig. 1, the implementation steps of the method of the embodiment are described in detail below, and the embodiment takes the identification of the hot event of the news text in the financial field as an example, and the invention can be extended to other similar application fields.
Step 1: The text is preprocessed, and the text content is divided into a plurality of phrases.
In this embodiment, the preprocessing is performed on news text from the financial field, and the specific preprocessing steps are as follows:
Step 1-1: A professional lexicon (such as a financial lexicon) and a stop word list are imported to assist the Chinese word segmentation module.
Step 1-2: The main institutions and names of people that appear in the text are identified using named entity recognition, for example a named entity recognition technique based on the pre-trained language model BERT using large-scale annotated financial samples.
Step 1-3: The Chinese word segmentation module segments the news text into a plurality of phrases.
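The lexicon-assisted segmentation of steps 1-1 and 1-3 can be illustrated with a minimal sketch. It assumes forward maximum matching as the segmentation strategy, which the source does not specify, and the lexicon and stop word entries are hypothetical toy data. Named entity recognition (step 1-2) is not covered here.

```python
# Hypothetical toy resources; a real system would load a domain lexicon
# and a stop word list from files (step 1-1).
LEXICON = {"央行", "下调", "存款准备金率", "股指期货", "上涨"}
STOP_WORDS = {"的", "了", "和"}


def segment(text, lexicon=LEXICON, stop_words=STOP_WORDS):
    """Forward maximum matching: at each position take the longest
    lexicon entry, falling back to a single character (assumed strategy)."""
    max_len = max(len(word) for word in lexicon)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                break
        i += len(piece)
        # Drop stop words and whitespace from the output phrases.
        if piece not in stop_words and not piece.isspace():
            tokens.append(piece)
    return tokens


print(segment("央行下调了存款准备金率"))
```

Adding domain terms to the lexicon keeps multi-character financial terms such as 存款准备金率 intact instead of splitting them into single characters.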
Step 2: Text vectorization is performed on the phrase-segmented text.
The specific processing procedure of this step is as follows.
Step 2-1: The number of times each word appears in the news text, namely the term frequency, is calculated and normalized:
$$tf_i = \frac{f_{ij}}{N_j}$$
where $f_{ij}$ denotes the number of occurrences of word $i$ in news item $d_j$, $N_j$ denotes the total number of words in that news item, and $tf_i$ denotes the frequency with which word $i$ appears in the news item.
Step 2-2: The inverse document frequency is calculated:
$$idf_i = \log \frac{N}{N_i}$$
where $N_i$ denotes the number of news items containing word $i$, $idf_i$ denotes the inverse document frequency, and $N$ denotes the total number of news items in the news set. The inverse document frequency is obtained by dividing the total number of news items by the number of news items containing the word and taking the logarithm of the quotient.
Step 2-3: Each news item is vectorized with the TF-IDF (term frequency-inverse document frequency) algorithm as $d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$, where $t_i$ is a feature item of the text, $w_i$ is the weight of that feature item, and $d$ is the vectorized news item. A TF-IDF model is first trained on a large-scale corpus, and each news item is then vectorized with that model.
In addition, the method of vectorizing news according to the present invention is not limited to the TF-IDF method of the present embodiment, and other vectorizing methods may be used instead.
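Step 2 can be sketched as follows. This is only an illustration under assumptions: the function name `tf_idf` is hypothetical, the term frequency is normalized by the word count of the news item, and the inverse document frequency is computed on the input set itself rather than on a separately trained large-scale corpus as the embodiment describes.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per news item."""
    n_docs = len(docs)  # N: total number of news items
    # N_i: number of news items that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # f_ij: occurrences of term i in this item
        total = len(doc)       # N_j: word count of this item (assumed normalizer)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


news = [["央行", "下调", "利率"], ["央行", "上调", "利率"], ["股指", "上涨"]]
for vector in tf_idf(news):
    print(vector)
```

A term that occurs in every news item receives weight zero, which is why words shared across the whole set do not drive the clustering in step 3.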
Step 3: The vectorized news set is aggregated with an unsupervised clustering algorithm (such as the single-pass clustering algorithm) to form hot spot news clusters.
The specific processing procedure of this step is as follows, please refer to fig. 2.
Step 3-1: A news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold $\theta$ are input.
Step 3-2: One news item is taken as the initial cluster center, and its content similarity with the other news items is calculated.
In the present embodiment, news item $d_1$ is used as the initial cluster center, and the content similarity between each remaining news item and $d_1$ is calculated with the cosine similarity algorithm:
$$sim(d, T) = \cos(d, T) = a$$
In the above formula, $T$ represents the whole news set and $a$ represents the cosine similarity value. The value is computed by multiplying the weights $w_i$ of the same feature item $t_i$, over all $n$ feature items, in the different news items.
Step 3-3: The calculated content similarities are compared with the minimum threshold $\theta$. If all of them are smaller than $\theta$, a new cluster is created with $d_1$ as its cluster center; otherwise $d_1$ is assigned to the cluster with the highest similarity.
Step 3-4: The news set is aggregated into a plurality of event clusters according to the clustering result, the category numbers of the event clusters are output, and each cluster is defined as an event cluster of news items with similar report content.
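The single-pass procedure of step 3 can be sketched as below. Several details are assumptions not stated in the source: news items are processed in input order, each cluster center stays fixed at the vector of its first member, and the cosine similarity includes the usual normalization by vector length.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def single_pass(vectors, theta):
    """Return one cluster number per input vector."""
    centers, labels = [], []
    for vec in vectors:
        sims = [cosine(vec, center) for center in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < theta:
            # Every similarity is below theta: open a new cluster.
            centers.append(vec)
            labels.append(len(centers) - 1)
        else:
            # Otherwise join the most similar existing cluster.
            labels.append(best)
    return labels


items = [{"央行": 1.0}, {"央行": 1.0, "利率": 0.1}, {"股指": 1.0}]
print(single_pass(items, theta=0.5))
```

Because the number of clusters is never fixed in advance, the threshold θ alone controls how fine-grained the event clusters are.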
Step 4: Each event cluster is vectorized with a deep learning algorithm (such as the skip-gram algorithm) and aggregated again with an unsupervised clustering algorithm (such as the single-pass algorithm).
The specific processing procedure of this step is as follows.
Step 4-1: Each event cluster is treated as one long text, segmented into words, and input into the skip-gram algorithm. The skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ (in this model, $w_i$ denotes the current word, $w_{i+1}$ and $w_{i-1}$ denote the two words adjacent to the current word, and $u_j$ denotes the event cluster vector obtained in the previous iteration, which is randomly generated the first time) to compute the probabilities of the two words adjacent to the current word $w_i$, and the word with the highest probability in the dictionary is selected as the output. The event cluster vector $u_j$ obtained in the previous iteration is input into the skip-gram algorithm at the same time.
Step 4-2: The difference between the words computed by $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words gives a loss term, which is propagated back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ with the back-propagation algorithm; the corresponding event cluster vector $u_j$ is then updated.
Step 4-3: Steps 4-1 to 4-2 are repeated until the value of $u_j$ stabilizes or all text in the event cluster has been trained on.
Step 4-4: The vectorization results of all event clusters are collected and used as the input of the single-pass algorithm for a second round of clustering, and the results are defined as topic clusters.
In addition, the invention is not limited to the secondary clustering to form topic clusters in the embodiment, and multi-layer clustering can be performed by using the same method. The neural network structure used for vectorization in this step may be replaced with another network structure.
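Steps 4-1 to 4-3 can be illustrated with a small pure-Python sketch. It is an assumption-laden toy rather than the patented network: the joint model is factored into two separate neighbor predictions, the hidden state is taken as the sum of the current word's embedding and $u_j$, the loss is softmax cross-entropy, and training runs for a fixed number of epochs instead of testing $u_j$ for stability. The name `train_cluster_vector` and all hyperparameters are hypothetical.

```python
import math
import random


def train_cluster_vector(tokens, dim=8, epochs=50, lr=0.1, seed=0):
    """Learn one vector u_j for an event cluster's token stream."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}

    def rand_vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    w_in = [rand_vec() for _ in vocab]    # input embedding per word
    w_out = [rand_vec() for _ in vocab]   # output embedding per word
    u = rand_vec()                        # cluster vector, random at first
    losses = []
    for _ in range(epochs):
        total = 0.0
        for i, word in enumerate(tokens):
            c = index[word]
            for j in (i - 1, i + 1):      # the two adjacent positions
                if not 0 <= j < len(tokens):
                    continue
                target = index[tokens[j]]
                h = [a + b for a, b in zip(w_in[c], u)]  # condition on w_i, u_j
                scores = [sum(x * y for x, y in zip(h, out)) for out in w_out]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                z = sum(exps)
                probs = [e / z for e in exps]
                total -= math.log(probs[target])
                # Cross-entropy gradient w.r.t. scores: probs - one_hot(target)
                err = [p - (k == target) for k, p in enumerate(probs)]
                grad_h = [sum(e * out[d] for e, out in zip(err, w_out))
                          for d in range(dim)]
                for e, out in zip(err, w_out):
                    for d in range(dim):
                        out[d] -= lr * e * h[d]
                for d in range(dim):
                    w_in[c][d] -= lr * grad_h[d]
                    u[d] -= lr * grad_h[d]   # back-propagate into u_j
        losses.append(total)
    return u, losses


u_j, history = train_cluster_vector("央行 下调 利率 央行 下调 利率".split())
print(len(u_j), history[0] > history[-1])
```

The vectors returned for all event clusters would then be fed to the same single-pass procedure used in step 3 to obtain topic clusters (step 4-4).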
Preferably, the method of this embodiment further includes step 5: generating topic cluster descriptions with a new word discovery algorithm.
The specific processing procedure of this step is as follows; see also Fig. 3.
Step 5-1: All news items in each topic cluster are gathered together and passed through the Chinese word segmentation module; the segmentation result is used as input to compute three indices: word frequency, degree of polymerization and degree of freedom. As shown in Fig. 3, they are computed as follows:
(1) Calculating the word frequency: regular expressions are used to match candidate words of one, two, three, four and five Chinese characters, and the frequency of each is counted.
(2) Calculating the degree of polymerization: let the word be S. First compute the probability P(S) of the word occurring, then try every possible binary split of S into a left part sl and a right part sr, computing P(sl) and P(sr); for example, a two-character word has one such split and a three-character word has two. Over all binary splits, take the minimum of P(S)/(P(sl) × P(sr)); its logarithm serves as the measure of the degree of polymerization. The degree of polymerization is computed for every candidate word.
(3) Calculating the degree of freedom: suppose a word appears N times in total and n distinct Chinese characters appear immediately to its left, occurring N1, N2, ..., Nn times respectively, so that N = N1 + N2 + ... + Nn. The probability of each character appearing to the left of the word can then be computed, and the left-adjacent entropy follows from the entropy formula. The smaller the entropy, the lower the degree of freedom; the smaller of the left-adjacent and right-adjacent entropies of a word is taken as its final degree of freedom.
Step 5-2: The product of word frequency, degree of polymerization and degree of freedom is used as the ranking index, and the representative words it yields serve as the topic description.
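The three indices and their product can be sketched as follows. This is a simplified illustration under assumptions: candidates are all substrings of two to `max_len` characters of the raw text (the embodiment matches one to five characters with regular expressions on segmented news), probabilities are relative n-gram frequencies, and natural logarithms are used. The function name `score_candidates` and the sample text are hypothetical.

```python
import math
from collections import Counter, defaultdict


def score_candidates(text, max_len=3):
    """Score each candidate as frequency * polymerization * freedom."""
    grams = Counter(text[i:i + n] for n in range(1, max_len + 1)
                    for i in range(len(text) - n + 1))
    totals = {n: max(len(text) - n + 1, 1) for n in range(1, max_len + 1)}

    def prob(s):
        return grams[s] / totals[len(s)]

    # Count the characters seen immediately left and right of each candidate.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            if i > 0:
                left[s][text[i - 1]] += 1
            if i + n < len(text):
                right[s][text[i + n]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if not total:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    scores = {}
    for s, freq in grams.items():
        if len(s) < 2:
            continue
        # Minimum over all binary splits of log(P(S) / (P(sl) * P(sr))).
        polymerization = min(math.log(prob(s) / (prob(s[:k]) * prob(s[k:])))
                             for k in range(1, len(s)))
        # Smaller of the left-adjacent and right-adjacent entropies.
        freedom = min(entropy(left[s]), entropy(right[s]))
        scores[s] = freq * polymerization * freedom
    return scores


sample = "吃葡萄不吐葡萄皮不吃葡萄倒吐葡萄皮"
ranked = score_candidates(sample)
print(max(ranked, key=ranked.get))
```

Multiplying the three indices means a candidate must be frequent, internally cohesive and free to appear in varied contexts at the same time; a zero in any one index removes it from the ranking.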
FIG. 4 illustrates the principle of an embodiment of the hot spot event identification system based on multi-level clustering according to the present invention. Referring to fig. 4, the system of the present embodiment includes: the system comprises a phrase segmentation module, a vectorization module, an event cluster acquisition module and an aggregation module. Preferably, the system further comprises a topic cluster description generation module.
The phrase segmentation module is configured to preprocess the text and segment the text content into a plurality of phrases.
The phrase segmentation module is further configured to process the following:
a special word bank (such as a financial special word bank) and a stop word list are imported and used for assisting a Chinese word segmentation module;
identifying major institutions and names appearing in the text using named entity identification techniques, such as named entity identification techniques based on a language pre-training model BERT using large-scale financial annotation samples;
a Chinese word segmentation module is adopted to segment the text into a plurality of phrases.
The vectorization module is configured to perform text vectorization processing on the phrase-segmented text to form a vectorized event set.
The vectorization module is further configured to process the following:
calculating the number of times each word appears in the text, namely the term frequency, and normalizing it:
$$tf_i = \frac{f_{ij}}{N_j}$$
where $f_{ij}$ denotes the number of occurrences of word $i$ in news item $d_j$, $N_j$ denotes the total number of words in that news item, and $tf_i$ denotes the frequency with which word $i$ appears in the news item;
calculating the inverse document frequency:
$$idf_i = \log \frac{N}{N_i}$$
where $N_i$ denotes the number of news items containing word $i$, $idf_i$ denotes the inverse document frequency, and $N$ denotes the total number of news items in the news set; the inverse document frequency is obtained by dividing the total number of news items by the number of news items containing the word and taking the logarithm of the quotient;
vectorizing each news item in the text with the term frequency-inverse document frequency (TF-IDF) algorithm as $d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$, where $t_i$ is a feature item of the text, $w_i$ is the weight of that feature item, and $d$ is the vectorized news item; a TF-IDF model is first trained on a large-scale corpus, and each news item is then vectorized with that model.
The event cluster acquisition module is configured to aggregate the quantified event sets by adopting an unsupervised clustering algorithm to form event clusters of hot spots.
The event cluster acquisition module is further configured to process the following:
inputting the news set $D = \{d_1, d_2, \ldots, d_n\}$ to be processed and a minimum threshold $\theta$;
taking one piece of news as the initial cluster center and calculating its content similarity with each other piece of news: with news $d_1$ as the initial cluster center, the content similarity between each remaining piece of news and $d_1$ is calculated by the cosine similarity algorithm:
$sim(d, T) = \cos(d, T) = a$
in the above formula, $T$ represents the whole news set, and $a$ represents the cosine similarity value;
comparing the calculated content similarities with the minimum threshold $\theta$; if all the content similarities are smaller than $\theta$, adding a new cluster with $d_1$ as its cluster center, otherwise assigning $d_1$ to the cluster with the greatest similarity;
aggregating the news set into a plurality of event clusters according to the clustering result, outputting the class number of each event cluster, and defining each cluster as an event cluster with similar reported content.
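The threshold-based clustering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `cosine` and `single_pass` are hypothetical, vectors are assumed to be sparse {term: weight} dictionaries, and each cluster is represented by its first member because the text does not say whether cluster centers are updated.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass(vectors, theta):
    """Visit the vectors once, in order, and return a cluster label for each."""
    centers = []   # assumption: a cluster is represented by its first member
    labels = []
    for vec in vectors:
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < theta:
            centers.append(vec)              # every similarity is below theta: new cluster
            labels.append(len(centers) - 1)
        else:
            labels.append(best)              # join the most similar cluster
    return labels

# Hypothetical toy vectors: two news items about a rate cut, two about a typhoon.
news = [
    {"rate": 1.0, "cut": 1.0},
    {"rate": 1.0, "cut": 0.8, "bank": 0.1},
    {"typhoon": 1.0, "landfall": 1.0},
    {"typhoon": 0.9, "warning": 0.3},
]
labels = single_pass(news, theta=0.5)
```

With this input the first two items share one label and the last two share another; lowering the threshold makes clusters coarser, and raising it splits them.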
The aggregation module is configured to conduct vectorization processing on each event cluster by adopting a deep learning algorithm and conduct aggregation by using an unsupervised clustering algorithm again.
The aggregation module is further configured to process the following:
taking each event cluster as a long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probability model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ (in which $w_i$ represents the current word, $w_{i+1}$ and $w_{i-1}$ represent the two words adjacent to the current word, and $u_j$ represents the event cluster vector obtained in the previous iteration, randomly generated the first time) to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as output, while the event cluster vector $u_j$ obtained in the previous iteration is input into the skip-gram algorithm;
computing the difference between the word obtained through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent word to obtain a loss term, passing the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the event cluster vector value of the corresponding $u_j$;
repeating the above two steps until the vector value of $u_j$ becomes stable or all text under the event cluster has been trained;
collecting the vectorization results of all event clusters together as the input of a single-pass algorithm, performing a second clustering, and defining the results as topic clusters.
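The training loop for the event cluster vectors can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than details from the text: the word vector and the cluster vector $u_j$ are combined by addition, the "difference" between predicted and real neighbor is realized as a softmax cross-entropy loss, training runs for a fixed number of epochs instead of checking convergence, and the function name `train_cluster_vectors` and all hyperparameters are hypothetical.

```python
import math
import random

def train_cluster_vectors(clusters, dim=8, epochs=50, lr=0.1, seed=0):
    """clusters: list of token lists, one long text per event cluster.
    Returns (cluster vectors u_j, mean loss per epoch)."""
    rng = random.Random(seed)
    vocab = {w: k for k, w in enumerate(sorted({w for c in clusters for w in c}))}

    def rand_vec():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    w_in = [rand_vec() for _ in vocab]    # input vector of each word w_i
    w_out = [rand_vec() for _ in vocab]   # output weights, one row per dictionary word
    u = [rand_vec() for _ in clusters]    # event cluster vectors, random at first
    history = []
    for _ in range(epochs):
        total, count = 0.0, 0
        for j, tokens in enumerate(clusters):
            ids = [vocab[w] for w in tokens]
            for i, cur in enumerate(ids):
                for n in (i - 1, i + 1):              # the two adjacent positions
                    if not 0 <= n < len(ids):
                        continue
                    target = ids[n]
                    h = [a + b for a, b in zip(w_in[cur], u[j])]
                    scores = [sum(x * y for x, y in zip(row, h)) for row in w_out]
                    top = max(scores)
                    exps = [math.exp(s - top) for s in scores]
                    z = sum(exps)
                    probs = [e / z for e in exps]     # softmax over the dictionary
                    total -= math.log(probs[target])
                    count += 1
                    grad = probs[:]                   # d(loss)/d(scores) = probs - one_hot
                    grad[target] -= 1.0
                    grad_h = [sum(g * row[d] for g, row in zip(grad, w_out))
                              for d in range(dim)]
                    for g, row in zip(grad, w_out):   # back-propagate to output weights
                        for d in range(dim):
                            row[d] -= lr * g * h[d]
                    for d in range(dim):              # update word vector and u_j
                        w_in[cur][d] -= lr * grad_h[d]
                        u[j][d] -= lr * grad_h[d]
        history.append(total / count)
    return u, history

# Hypothetical toy input: two event clusters, each already merged and segmented.
clusters = [
    "central bank cuts rate central bank cuts reserve ratio".split(),
    "typhoon makes landfall typhoon warning issued coast".split(),
]
cluster_vectors, history = train_cluster_vectors(clusters)
```

The returned `cluster_vectors` are what would be passed to the second single-pass clustering. A production system would more likely rely on an existing paragraph-vector implementation with negative sampling than on a full softmax, which scales poorly with dictionary size.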
The topic cluster description generation module is configured to generate a topic cluster description using a new word discovery algorithm.
The topic cluster description generation module is further configured to process the following:
gathering all news in each topic cluster together, passing it through the Chinese word segmentation module, taking the segmented result as input, and calculating three indexes, namely word frequency, degree of aggregation and degree of freedom, in the following way:
(1) Calculating the word frequency: regular expressions are used to match words of one, two, three, four and five Chinese characters, and the frequency of each is counted.
(2) Calculating the degree of aggregation: assuming the word is S, first calculate the probability P(S) that the word occurs, then try every possible binary split of S into a left part sl and a right part sr and calculate P(sl) and P(sr); for example, a two-character word has one binary split and a three-character word has two. Then, over all binary splits, take the minimum value of P(S)/(P(sl) × P(sr)); its logarithm serves as the measure of the degree of aggregation. The degree of aggregation is calculated in this way for all candidate words.
(3) Calculating the degree of freedom: assuming a word occurs N times in total and n distinct Chinese characters appear immediately to its left, occurring N1, N2, ..., Nn times respectively, so that N = N1 + N2 + ... + Nn, the probability of each character appearing on the left of the word can be calculated, and the left-adjacent entropy is obtained from the entropy formula. The smaller the entropy, the lower the degree of freedom; the smaller of the left-adjacent entropy and the right-adjacent entropy of a word is taken as its final degree of freedom;
taking the product of word frequency, degree of aggregation and degree of freedom as the ranking index, and generating representative words as the topic description.
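The three indexes and their product can be sketched as below. This is a simplified illustration under stated assumptions: candidates are taken from a sliding window over a raw character string instead of regular-expression matches over segmented text, probabilities are estimated as count divided by text length, candidates are limited to two or three characters, and the function name `score_candidates` and the toy string are hypothetical.

```python
import math
from collections import Counter

def score_candidates(text, max_len=3):
    """Score each substring of 2..max_len characters by
    frequency * degree of aggregation * degree of freedom."""
    total = len(text)
    counts = Counter(text[i:i + n] for n in range(1, max_len + 1)
                     for i in range(total - n + 1))

    def prob(s):
        return counts[s] / total          # assumption: probability = count / text length

    def entropy(neighbors):
        n = sum(neighbors.values())
        return -sum(c / n * math.log(c / n) for c in neighbors.values()) if n else 0.0

    left, right = {}, {}                  # neighboring characters of each candidate
    for n in range(2, max_len + 1):
        for i in range(total - n + 1):
            s = text[i:i + n]
            if i > 0:
                left.setdefault(s, Counter())[text[i - 1]] += 1
            if i + n < total:
                right.setdefault(s, Counter())[text[i + n]] += 1

    scores = {}
    for s, freq in counts.items():
        if len(s) < 2:
            continue
        # minimum of P(S) / (P(sl) * P(sr)) over all binary splits, then the logarithm
        aggregation = math.log(min(prob(s) / (prob(s[:k]) * prob(s[k:]))
                                   for k in range(1, len(s))))
        # the smaller of the left-adjacent and right-adjacent entropies
        freedom = min(entropy(left.get(s, {})), entropy(right.get(s, {})))
        scores[s] = freq * aggregation * freedom
    return scores

# Hypothetical toy text: "央行" occurs three times with varied neighbors.
scores = score_candidates("甲央行乙丙央行丁戊央行己")
best = max(scores, key=scores.get)
```

In this toy text every other candidate occurs once, so its neighbor entropy, and therefore its score, is zero, which leaves "央行" as the top-ranked description word.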
While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk (disk) and disc (disc), as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk and blu-ray disc where disks (disks) usually reproduce data magnetically, while discs (discs) reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A hot spot event identification method based on multi-level clustering, characterized in that the method comprises:
Step 1: preprocessing the text and segmenting the text content into a plurality of phrases;
Step 2: performing text vectorization on the phrase-segmented text to form a vectorized event set;
Step 3: aggregating the vectorized event set using an unsupervised clustering algorithm to form hot spot event clusters;
Step 4: vectorizing each event cluster using a deep learning algorithm and aggregating again using an unsupervised clustering algorithm.

2. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that step 1 further comprises:
Step 1-1: importing a professional-term lexicon and a stop-word list to assist the Chinese word segmentation module;
Step 1-2: using named entity recognition to identify the main institutions and personal names appearing in the text;
Step 1-3: using the Chinese word segmentation module to segment the text into a plurality of phrases.

3. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that step 2 further comprises:
Step 2-1: calculating the number of times each word occurs in the text, namely the word frequency, and normalizing it;
Step 2-2: calculating the inverse document frequency;
Step 2-3: vectorizing each piece of news in the text using the term frequency-inverse document frequency algorithm.

4. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that step 3 further comprises:
Step 3-1: inputting the news set $D = \{d_1, d_2, \ldots, d_n\}$ to be processed and a minimum threshold $\theta$;
Step 3-2: taking one piece of news as the initial cluster center and calculating its content similarity with each other piece of news;
Step 3-3: comparing the calculated content similarities with the minimum threshold $\theta$; if all the content similarities are smaller than $\theta$, adding a new cluster with $d_1$ as its cluster center, otherwise assigning $d_1$ to the cluster with the greatest similarity;
Step 3-4: aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the class numbers of the event clusters.

5. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that step 4 further comprises:
Step 4-1: taking each event cluster as a long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probability model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as output, while the event cluster vector $u_j$ obtained in the previous iteration is input into the skip-gram algorithm;
Step 4-2: computing the difference between the word obtained through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent word to obtain a loss term, passing the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the event cluster vector value of the corresponding $u_j$;
Step 4-3: repeating steps 4-1 and 4-2 until the vector value of $u_j$ becomes stable or all text under the event cluster has been trained;
Step 4-4: collecting the vectorization results of all event clusters together as the input of a single-pass algorithm, performing a second clustering, and defining the results as topic clusters.

6. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that the method further comprises:
Step 5: generating a topic cluster description using a new word discovery algorithm.

7. The hot spot event identification method based on multi-level clustering according to claim 6, characterized in that step 5 further comprises:
Step 5-1: gathering all news in each topic cluster together, passing it through the Chinese word segmentation module, taking the segmented result as input, and calculating three indexes, namely word frequency, degree of aggregation and degree of freedom;
Step 5-2: taking the product of word frequency, degree of aggregation and degree of freedom as the ranking index, and generating representative words as the topic description.

8. A hot spot event identification system based on multi-level clustering, characterized in that the system comprises:
a phrase segmentation module, configured to preprocess the text and segment the text content into a plurality of phrases;
a vectorization module, configured to perform text vectorization on the phrase-segmented text to form a vectorized event set;
an event cluster acquisition module, which aggregates the vectorized event set using an unsupervised clustering algorithm to form hot spot event clusters;
an aggregation module, which vectorizes each event cluster using a deep learning algorithm and aggregates again using an unsupervised clustering algorithm.

9. The hot spot event identification system based on multi-level clustering according to claim 8, characterized in that the phrase segmentation module is further configured to process the following: importing a professional-term lexicon and a stop-word list to assist the Chinese word segmentation module; using named entity recognition to identify the main institutions and personal names appearing in the text; using the Chinese word segmentation module to segment the text into a plurality of phrases.

10. The hot spot event identification system based on multi-level clustering according to claim 8, characterized in that the vectorization module is further configured to process the following: calculating the number of times each word occurs in the text, namely the word frequency, and normalizing it; calculating the inverse document frequency; vectorizing each piece of news in the text using the term frequency-inverse document frequency algorithm.

11. The hot spot event identification system based on multi-level clustering according to claim 8, characterized in that the event cluster acquisition module is further configured to process the following: inputting the news set $D = \{d_1, d_2, \ldots, d_n\}$ to be processed and a minimum threshold $\theta$; taking one piece of news as the initial cluster center and calculating its content similarity with each other piece of news; comparing the calculated content similarities with the minimum threshold $\theta$, and if all the content similarities are smaller than $\theta$, adding a new cluster with $d_1$ as its cluster center, otherwise assigning $d_1$ to the cluster with the greatest similarity; aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the class numbers of the event clusters.

12. The hot spot event identification system based on multi-level clustering according to claim 8, characterized in that the aggregation module is further configured to process the following: taking each event cluster as a long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probability model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as output, while the event cluster vector $u_j$ obtained in the previous iteration is input into the skip-gram algorithm; computing the difference between the word obtained through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent word to obtain a loss term, passing the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the event cluster vector value of the corresponding $u_j$; repeating the above two steps until the vector value of $u_j$ becomes stable or all text under the event cluster has been trained; collecting the vectorization results of all event clusters together as the input of a single-pass algorithm, performing a second clustering, and defining the results as topic clusters.

13. The hot spot event identification system based on multi-level clustering according to claim 8, characterized in that the system further comprises:
a topic cluster description generation module, which generates a topic cluster description using a new word discovery algorithm.

14. The hot spot event identification system based on multi-level clustering according to claim 13, characterized in that the topic cluster description generation module is further configured to process the following: gathering all news in each topic cluster together, passing it through the Chinese word segmentation module, taking the segmented result as input, and calculating three indexes, namely word frequency, degree of aggregation and degree of freedom; taking the product of word frequency, degree of aggregation and degree of freedom as the ranking index, and generating representative words as the topic description.
CN202110003161.6A 2021-01-04 2021-01-04 A method and system for hot spot event recognition based on multi-level clustering Pending CN113064990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110003161.6A CN113064990A (en) 2021-01-04 2021-01-04 A method and system for hot spot event recognition based on multi-level clustering

Publications (1)

Publication Number Publication Date
CN113064990A true CN113064990A (en) 2021-07-02

Family

ID=76558555

Country Status (1)

Country Link
CN (1) CN113064990A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210702
