CN113064990A

CN113064990A - A method and system for hot spot event recognition based on multi-level clustering

Info

Publication number: CN113064990A
Application number: CN202110003161.6A
Authority: CN
Inventors: 林越峰; 鲁继东; 苗仲辰; 王晨宇; 倪梦珺; 江航
Original assignee: Shanghai Financial Futures Information Technology Co ltd
Current assignee: Shanghai Financial Futures Information Technology Co ltd
Priority date: 2021-01-04
Filing date: 2021-01-04
Publication date: 2021-07-02

Abstract

The invention discloses a method and system for identifying hotspot events based on multi-level clustering, which can accurately identify hotspot events in real time and provide feature words that represent the hotspot events, so that hotspot public opinion is described accurately and users can read hotspots more efficiently. The technical scheme is: preprocessing the text and segmenting the text content into a plurality of phrases; performing text vectorization processing on the phrase-segmented text to form a vectorized event set; aggregating the vectorized event set with an unsupervised clustering algorithm to form hot event clusters; vectorizing each event cluster with a deep learning algorithm and then aggregating again with an unsupervised clustering algorithm; and generating topic cluster descriptions with a new word discovery algorithm.

Description

Hot event identification method and system based on multi-level clustering

Technical Field

The invention relates to an automatic identification technology of hot topics, in particular to a method and a system for automatically identifying hot event topics based on a multilevel text clustering algorithm.

Background

In recent years, with the rapid development of the internet, social networks such as microblogs and WeChat have emerged, so that information spreads rapidly and the amount of information grows explosively; as a result, the text information browsed by a user is too abundant and too scattered. In addition, in the financial field, public sentiment and market trends are closely related. An automatic information extraction tool is therefore urgently needed to help people quickly find valuable information in massive volumes of news, extract news hotspots, gather texts with similar reports together, and understand the association and hierarchical relationships among news.

Generally, solving this problem requires manually specifying the hierarchical relationship between news, providing labeled data for training a machine learning model, and then using the trained model for text classification. However, this approach consumes a large amount of labor. In the financial field in particular, obtaining labeled data often requires many financial professionals to take part in labeling, which is expensive and also prolongs the product development cycle, so the overall cost is very high.

Disclosure of Invention

The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.

The invention aims to solve the problems and provides a hot event identification method and system based on multi-level clustering, which can accurately identify hot events in real time and provide characteristic words capable of representing the hot events to accurately describe hot public sentiments, so that the efficiency of reading hot sentiments by a user can be improved.

The technical scheme of the invention is as follows: the invention discloses a hot event identification method based on multi-level clustering, which comprises the following steps:

step 1: preprocessing a text, and segmenting the text content into a plurality of phrases;

step 2: performing text vectorization processing on the phrase-segmented text to form a vectorized event set;

step 3: aggregating the vectorized event set by adopting an unsupervised clustering algorithm to form hot event clusters;

step 4: performing vectorization processing on each event cluster by adopting a deep learning algorithm, and aggregating again by using an unsupervised clustering algorithm.

According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 1 further includes:

step 1-1: importing a professional lexicon and a stop word list to assist a Chinese word segmentation module;

step 1-2: identifying the main organizations and person names appearing in the text by using named entity recognition technology;

step 1-3: segmenting the text into a plurality of phrases by using the Chinese word segmentation module.

According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 2 further includes:

step 2-1: calculating the number of times each word appears in the text, namely the term frequency, and normalizing it;

step 2-2: calculating the inverse document frequency;

step 2-3: vectorizing each piece of news in the text by adopting the term frequency-inverse document frequency (TF-IDF) algorithm.

According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 3 further includes:

step 3-1: inputting a news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold θ;

step 3-2: taking one piece of news as the initial cluster center, and calculating the content similarity between it and the other news;

step 3-3: comparing the calculated content similarities with the minimum threshold θ; if all of the content similarities are smaller than θ, creating a new cluster with d₁ as its cluster center, and otherwise assigning d₁ to the cluster with the greatest similarity;

step 3-4: aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters.

According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 4 further includes:

step 4-1: treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as the output, the event cluster vector $u_j$ obtained in the previous iteration also being input into the skip-gram algorithm;

step 4-2: taking the difference between the words calculated through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the value of the corresponding event cluster vector $u_j$;

step 4-3: repeating steps 4-1 to 4-2 until the value of $u_j$ becomes stable or the whole context of the event cluster has been trained;

step 4-4: collecting the vectorization results of all event clusters, using them as the input of a single-pass algorithm for secondary clustering, and defining the results as topic clusters.

According to an embodiment of the hot event identification method based on multi-level clustering of the present invention, the method further comprises:

step 5: generating topic cluster descriptions by using a new word discovery algorithm.

According to an embodiment of the hot spot event identification method based on multi-level clustering of the present invention, step 5 further includes:

step 5-1: gathering all news in each topic cluster together, passing them through the Chinese word segmentation module, taking the segmentation result as input, and calculating three indexes respectively: word frequency, degree of aggregation and degree of freedom;

step 5-2: taking the product of the word frequency, the degree of aggregation and the degree of freedom as a ranking index, and generating representative words as the topic description.
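The data flow of steps 1 to 4 can be sketched as a small Python skeleton. This is only an illustration under stated assumptions: the function name `multi_level_clustering` and the four callables passed in are hypothetical placeholders, not names from the invention, and cluster labels are assumed to be consecutive integers starting at 0.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def multi_level_clustering(
    texts: Sequence[str],
    segment: Callable[[str], List[str]],                    # step 1 (hypothetical)
    vectorize: Callable[[List[List[str]]], List[Vector]],   # step 2 (hypothetical)
    cluster: Callable[[List[Vector]], List[int]],           # steps 3 and 4 (hypothetical)
    embed_cluster: Callable[[List[str]], Vector],           # step 4 (hypothetical)
) -> Tuple[List[int], List[int]]:
    """Return (event label per text, topic label per event cluster)."""
    tokens = [segment(t) for t in texts]      # step 1: phrase segmentation
    vectors = vectorize(tokens)               # step 2: text vectorization
    event_labels = cluster(vectors)           # step 3: first-level clustering

    # Step 4: treat every event cluster as one long text, vectorize it,
    # then cluster the event-cluster vectors a second time.
    n_events = max(event_labels) + 1 if event_labels else 0
    long_texts: List[List[str]] = [[] for _ in range(n_events)]
    for toks, label in zip(tokens, event_labels):
        long_texts[label].extend(toks)
    cluster_vectors = [embed_cluster(lt) for lt in long_texts]
    topic_labels = cluster(cluster_vectors)
    return event_labels, topic_labels
```

Passing the clustering routine in as a parameter reflects that the same unsupervised algorithm is applied at both levels; any of the callables could be swapped without changing the flow.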

The invention also discloses a hot event identification system based on multi-level clustering, which comprises:

the phrase segmentation module is configured to preprocess the text and segment the text content into a plurality of phrases;

the vectorization module is configured to perform text vectorization processing on the text subjected to the phrase segmentation to form a vectorized event set;

the event cluster acquisition module is configured to aggregate the vectorized event set by adopting an unsupervised clustering algorithm to form hot event clusters;

the aggregation module is configured to vectorize each event cluster by adopting a deep learning algorithm and to aggregate again by using an unsupervised clustering algorithm.

According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the phrase segmentation module is further configured to process the following: importing a professional lexicon and a stop word list to assist a Chinese word segmentation module; identifying the main organizations and person names appearing in the text by using named entity recognition technology; and segmenting the text into a plurality of phrases by using the Chinese word segmentation module.

According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the vectorization module is further configured to process the following: calculating the number of times each word appears in the text, namely the term frequency, and normalizing it; calculating the inverse document frequency; and vectorizing each piece of news in the text by adopting the term frequency-inverse document frequency (TF-IDF) algorithm.

According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the event cluster acquisition module is further configured to process the following: inputting a news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold θ; taking one piece of news as the initial cluster center, and calculating the content similarity between it and the other news; comparing the calculated content similarities with the minimum threshold θ, and if all of the content similarities are smaller than θ, creating a new cluster with d₁ as its cluster center, and otherwise assigning d₁ to the cluster with the greatest similarity; and aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters.

According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the aggregation module is further configured to process the following: treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as the output, the event cluster vector $u_j$ obtained in the previous iteration also being input into the skip-gram algorithm; taking the difference between the words calculated through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the value of the corresponding event cluster vector $u_j$; repeating the above two steps until the value of $u_j$ becomes stable or the whole context of the event cluster has been trained; and collecting the vectorization results of all event clusters, using them as the input of a single-pass algorithm for secondary clustering, and defining the results as topic clusters.

According to an embodiment of the hot spot event identification system based on multi-level clustering of the present invention, the system further includes:

and the topic cluster description generation module generates topic cluster description by using a new word discovery algorithm.

According to an embodiment of the hot event identification system based on multi-level clustering of the present invention, the topic cluster description generation module is further configured to process the following: gathering all news in each topic cluster together, passing them through the Chinese word segmentation module, taking the segmentation result as input, and calculating three indexes respectively: word frequency, degree of aggregation and degree of freedom; and taking the product of the word frequency, the degree of aggregation and the degree of freedom as a ranking index, and generating representative words as the topic description.

Compared with the prior art, the invention has the following beneficial effects:

First, the overall architecture of the method is original. Conventional technical pipelines cannot solve the multi-level text clustering problem without labeled data and manual intervention. The invention is the first to solve the text representation problem by combining deep learning with conventional TF-IDF vectorization, which lays the foundation for multi-level text clustering.

Second, for highly specialized fields with few labels (such as the financial field), the invention uses a financial professional lexicon together with an entity recognition algorithm to make Chinese word segmentation more effective and to improve the results of the news hotspot discovery algorithm.

Third, compared with existing hotspot discovery technology, the method can accurately identify the feature words that represent an event through hot word discovery, form an accurate description of hotspot public sentiment, and improve the efficiency with which users read hotspots.

Fourth, through topic description the method can intelligently identify recent hot words, automatically improve the effect of the algorithm, and enhance the real-time performance of hotspot discovery.

Drawings

The above features and advantages of the present disclosure will be better understood upon reading the detailed description of embodiments of the disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar relative characteristics or features may have the same or similar reference numerals.

FIG. 1 is a flowchart illustrating a hot spot event identification method based on multi-level clustering according to an embodiment of the present invention.

Fig. 2 shows a refined flow diagram of a partial step in the embodiment of the method shown in fig. 1.

Fig. 3 shows a refined flow chart of a partial step in the embodiment of the method shown in fig. 1.

FIG. 4 is a schematic diagram of an embodiment of a hot spot event recognition system based on multi-level clustering according to the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and specific embodiments. It is noted that the aspects described below in connection with the figures and the specific embodiments are only exemplary and should not be construed as imposing any limitation on the scope of the present invention.

Fig. 1 shows a flow of an embodiment of a hot event identification method based on multi-level clustering according to the present invention. Referring to fig. 1, the implementation steps of the method of the embodiment are described in detail below, and the embodiment takes the identification of the hot event of the news text in the financial field as an example, and the invention can be extended to other similar application fields.

Step 1: preprocessing the text, and segmenting the text content into a plurality of phrases.

In this embodiment, preprocessing is performed on news text related to the financial field, and the specific processing steps are as follows:

Step 1-1: importing a professional lexicon (such as a financial professional lexicon) and a stop word list to assist the Chinese word segmentation module.

Step 1-2: identifying the main institutions and person names appearing in the text by using named entity recognition technology, for example a named entity recognition technique based on the pre-trained language model BERT and trained on large-scale annotated financial samples.

Step 1-3: segmenting the news text into a plurality of phrases by using the Chinese word segmentation module.
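How a professional lexicon and a stop word list assist segmentation can be illustrated with a minimal sketch. It uses forward maximum matching, a simple dictionary-based technique chosen here only for illustration; the embodiment does not state which segmentation algorithm its module uses. The function name `segment`, the sample lexicon and the sample sentence are all hypothetical. Entities found by named entity recognition could be added to the lexicon in the same way.

```python
from typing import List, Set


def segment(text: str, lexicon: Set[str], stop_words: Set[str],
            max_len: int = 5) -> List[str]:
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                break
        i += size
        # Drop whitespace and stop words from the output.
        if piece.strip() and piece not in stop_words:
            tokens.append(piece)
    return tokens


# Hypothetical financial lexicon and stop word list.
lexicon = {"中金所", "上市", "股指期货"}
stop_words = {"的", "了"}
print(segment("中金所上市了股指期货", lexicon, stop_words))
# prints ['中金所', '上市', '股指期货']
```

Without the lexicon entries, the sentence would fall apart into single characters, which is why domain terms matter for the later vectorization step.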

Step 2: performing text vectorization processing on the phrase-segmented text.

The specific processing procedure of this step is as follows.

Step 2-1: calculating the number of times each word appears in the news text, namely the term frequency, and normalizing it:

$$tf_i = \frac{f_{ij}}{N_j}$$

where $f_{ij}$ denotes the number of times word i appears in news $d_j$, $N_j$ denotes the total number of words in that news, and $tf_i$ denotes the frequency with which word i appears in the news.

Step 2-2: calculating the inverse document frequency:

$$idf_i = \log \frac{N}{N_i}$$

where $N_i$ denotes the number of news items containing word i, N denotes the total number of news items in the news set, and $idf_i$ denotes the inverse document frequency, obtained by dividing the total number of news items by the number of news items containing the word and taking the logarithm of the quotient.

Step 2-3: vectorizing each piece of news by adopting the TF-IDF (term frequency-inverse document frequency) algorithm as $d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$, where $t_i$ is a feature item of the text, $w_i$ is the weight of that feature item, and d denotes the vectorized news. A TF-IDF model is first trained on a large-scale corpus, and each piece of news is then vectorized with this model.

In addition, the method of vectorizing news according to the present invention is not limited to the TF-IDF method of the present embodiment, and other vectorizing methods may be used instead.
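The TF-IDF vectorization of step 2 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `tf_idf` is hypothetical, the term frequency is normalized by the number of words in each news item, and the inverse document frequency is computed from the same small set of news rather than from a model trained beforehand on a large-scale corpus.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Map each segmented news item to {feature item t_i: weight w_i}."""
    n_docs = len(docs)                                    # N
    # N_i: number of news items that contain word i.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)                             # f_ij
        total = len(doc)                                  # N_j
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["期货", "上涨", "期货"], ["股票", "上涨"]]
for vec in tf_idf(docs):
    print(vec)
```

A word that occurs in every news item, such as "上涨" here, receives weight 0, so it does not influence the similarity computed in the clustering step.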

Step 3: aggregating the vectorized news set by adopting an unsupervised clustering algorithm (such as a single-pass clustering algorithm) to form hot news clusters.

The specific processing procedure of this step is as follows; please also refer to fig. 2.

Step 3-1: inputting a news set $D = \{d_1, d_2, \ldots, d_n\}$ and a minimum threshold θ.

Step 3-2: taking one piece of news as the initial cluster center, and calculating the content similarity between it and the other news.

In the present embodiment, news d₁ is taken as the initial cluster center, and the content similarity between each of the remaining news items and news d₁ is calculated with the cosine similarity algorithm:

sim(d, T) = cos(d, T) = a

In the above formula, T represents the whole news set and a represents the cosine similarity value; the calculation multiplies, over all n feature items, the weights $w_i$ that the same feature item $t_i$ takes in the different news items d.

Step 3-3: comparing the calculated content similarities with the minimum threshold θ; if all of the content similarities are smaller than θ, creating a new cluster with d₁ as its cluster center, and otherwise assigning d₁ to the cluster with the greatest similarity.

Step 3-4: aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters, each cluster being defined as an event cluster whose reports have similar content.
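Steps 3-1 to 3-4 can be sketched as a short single-pass routine. Several details are assumptions made only for this illustration: vectors are sparse dictionaries of feature weights, the similarity is the standard normalized cosine, each cluster is represented by the first news item assigned to it, and the names `cosine` and `single_pass` are hypothetical.

```python
import math
from typing import Dict, List

Vec = Dict[str, float]


def cosine(a: Vec, b: Vec) -> float:
    """Standard cosine similarity over the shared feature items."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(vectors: List[Vec], theta: float) -> List[int]:
    """Assign each news vector a cluster number in one pass over the data."""
    centers: List[Vec] = []
    labels: List[int] = []
    for vec in vectors:
        sims = [cosine(vec, center) for center in centers]
        if not sims or max(sims) < theta:
            # All similarities are below the threshold: open a new cluster.
            centers.append(vec)
            labels.append(len(centers) - 1)
        else:
            # Otherwise join the most similar existing cluster.
            labels.append(sims.index(max(sims)))
    return labels


news = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 0.9}, {"c": 1.0}]
print(single_pass(news, theta=0.8))  # prints [0, 0, 1]
```

Because every item is compared only against existing cluster centers, the routine needs no preset number of clusters, which suits a stream of incoming news.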

Step 4: vectorizing each event cluster by adopting a deep learning algorithm (such as a skip-gram algorithm), and aggregating again by using an unsupervised clustering algorithm (such as a single-pass algorithm).

The specific processing procedure of this step is as follows.

Step 4-1: treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm. The skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ (in this model, the parameter $w_i$ represents the current word, the parameters $w_{i+1}, w_{i-1}$ represent the two words adjacent to the current word, and the parameter $u_j$ represents the event cluster vector obtained in the previous iteration, which is randomly generated the first time) to calculate the probabilities of the two words adjacent to the current word $w_i$, and the word with the highest probability in the dictionary is selected as the output. At the same time, the event cluster vector $u_j$ obtained in the previous iteration is input into the skip-gram algorithm.

Step 4-2: taking the difference between the words calculated through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the value of the corresponding event cluster vector $u_j$.

Step 4-3: repeating steps 4-1 to 4-2 until the value of $u_j$ becomes stable or the whole context of the event cluster has been trained.

In addition, the invention is not limited to the secondary clustering that forms topic clusters in this embodiment; clustering over more levels can be performed with the same method. The neural network structure used for vectorization in this step may also be replaced with another network structure.
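The training loop of steps 4-1 to 4-3 can be sketched in plain Python. The embodiment does not disclose the network details, so the following choices are assumptions made only for illustration: the current word vector and the event cluster vector $u_j$ are added together, a full softmax over the dictionary models $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$, the loss is the cross-entropy summed over the two real neighbors, and training runs for a fixed number of epochs instead of testing convergence. The name `train_cluster_vector` and all hyperparameters are hypothetical.

```python
import math
import random
from typing import List, Tuple


def train_cluster_vector(tokens: List[str], dim: int = 8, epochs: int = 50,
                         lr: float = 0.1, seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (event cluster vector u_j, mean loss per epoch) for one event cluster."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    index = {word: k for k, word in enumerate(vocab)}
    ids = [index[word] for word in tokens]

    def init() -> List[float]:
        return [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    emb = [init() for _ in vocab]   # input vectors of the current word w_i
    out = [init() for _ in vocab]   # output weights, one row per dictionary word
    u = init()                      # u_j, randomly generated the first time
    history: List[float] = []
    for _ in range(epochs):
        total = 0.0
        for i in range(1, len(ids) - 1):
            targets = (ids[i - 1], ids[i + 1])          # real adjacent words
            h = [e + c for e, c in zip(emb[ids[i]], u)]
            scores = [sum(o * x for o, x in zip(row, h)) for row in out]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            total -= sum(math.log(probs[t]) for t in targets)
            # Gradient of the summed cross-entropy with respect to the scores.
            grad = [len(targets) * p for p in probs]
            for t in targets:
                grad[t] -= 1.0
            grad_h = [sum(grad[k] * out[k][d] for k in range(len(vocab)))
                      for d in range(dim)]
            # Back-propagate: update output weights, word vector and u_j.
            for k, row in enumerate(out):
                for d in range(dim):
                    row[d] -= lr * grad[k] * h[d]
            for d in range(dim):
                emb[ids[i]][d] -= lr * grad_h[d]
                u[d] -= lr * grad_h[d]
        history.append(total / max(1, len(ids) - 2))
    return u, history


tokens = ["央行", "降息", "股市", "上涨", "央行", "降息", "债市", "上涨"] * 3
u_j, losses = train_cluster_vector(tokens)
print(len(u_j), losses[0] > losses[-1])
```

The returned vector plays the role of the event cluster representation that is passed on to the secondary clustering; a production system would more likely rely on an existing embedding library than on a hand-written loop like this one.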

Preferably, the method of this embodiment further includes step 5: generating topic cluster descriptions by using a new word discovery algorithm.

The specific processing procedure of this step is as follows; please also refer to fig. 3.

Step 5-1: gathering all news in each topic cluster together, passing them through the Chinese word segmentation module, taking the segmentation result as input, and calculating three indexes respectively: word frequency, degree of aggregation and degree of freedom. As shown in fig. 3, the specific calculation is as follows:

(1) Calculating the word frequency: regular expressions are used to match words of one, two, three, four and five Chinese characters, and the frequency of each is calculated.

(2) Calculating the degree of aggregation: let the word be S. First calculate the probability P(S) that the word occurs, then try every possible binary split of S into a left part sl and a right part sr, and calculate P(sl) and P(sr); for example, a two-character word has one binary split and a three-character word has two. Over all binary splits, take the minimum of P(S) / (P(sl) × P(sr)); its logarithm serves as the measure of the degree of aggregation. The degree of aggregation is calculated in this way for every candidate word.

(3) Calculating the degree of freedom: suppose a word appears N times in total and n distinct Chinese characters appear immediately to its left, occurring N1, N2, ..., Nn times respectively, so that N = N1 + N2 + ... + Nn. The probability of each character appearing to the left of the word can then be calculated, and the left-adjacent entropy follows from the entropy formula. The smaller the entropy, the lower the degree of freedom; the smaller of the left-adjacent entropy and the right-adjacent entropy of a word is taken as its final degree of freedom.
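The three indexes described above can be combined in a compact sketch. The following is an illustration under stated assumptions rather than the embodiment's implementation: candidates are enumerated by string slicing instead of regular expressions, P(S) is estimated as the count of S divided by the number of characters in the text, natural logarithms are used, and candidates are limited to three characters to keep the example small. The name `score_candidates` is hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict


def score_candidates(text: str, max_len: int = 3) -> Dict[str, float]:
    """Score multi-character candidates by frequency x aggregation x freedom."""
    total = len(text)
    # (1) Word frequency of every substring of 1..max_len characters.
    freq = Counter(text[i:i + n] for n in range(1, max_len + 1)
                   for i in range(total - n + 1))
    # Characters seen immediately to the left and right of each candidate.
    left, right = defaultdict(Counter), defaultdict(Counter)
    for n in range(2, max_len + 1):
        for i in range(total - n + 1):
            word = text[i:i + n]
            if i > 0:
                left[word][text[i - 1]] += 1
            if i + n < total:
                right[word][text[i + n]] += 1

    def prob(s: str) -> float:
        return freq[s] / total

    def entropy(counter: Counter) -> float:
        n = sum(counter.values())
        return -sum(c / n * math.log(c / n) for c in counter.values()) if n else 0.0

    scores = {}
    for word, count in freq.items():
        if len(word) < 2:
            continue
        # (2) Degree of aggregation: log of the minimum ratio over all binary splits.
        aggregation = math.log(min(prob(word) / (prob(word[:k]) * prob(word[k:]))
                                   for k in range(1, len(word))))
        # (3) Degree of freedom: the smaller of the left and right entropies.
        freedom = min(entropy(left[word]), entropy(right[word]))
        scores[word] = count * aggregation * freedom
    return scores


scores = score_candidates("甲股指乙股指丙股指丁")
print(max(scores, key=scores.get))  # prints 股指
```

In the sample text only "股指" occurs with varied neighbors on both sides, so it is the only candidate with a positive score; every other candidate has a degree of freedom of zero.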

FIG. 4 illustrates the principle of an embodiment of the hot spot event identification system based on multi-level clustering according to the present invention. Referring to fig. 4, the system of the present embodiment includes: the system comprises a phrase segmentation module, a vectorization module, an event cluster acquisition module and an aggregation module. Preferably, the system further comprises a topic cluster description generation module.

The phrase segmentation module is configured to preprocess the text and segment the text content into a plurality of phrases.

The phrase segmentation module is further configured to process the following:

importing a professional lexicon (such as a financial professional lexicon) and a stop word list to assist the Chinese word segmentation module;

identifying the main institutions and person names appearing in the text by using named entity recognition technology, for example a named entity recognition technique based on the pre-trained language model BERT and trained on large-scale annotated financial samples;

segmenting the text into a plurality of phrases by using the Chinese word segmentation module.

The vectorization module is configured to perform text vectorization processing on the phrase-segmented text to form a vectorized event set.

The vectorization module is further configured to process the following:

calculating the number of times each word appears in the text, namely the term frequency, and normalizing it:

$$tf_i = \frac{f_{ij}}{N_j}$$

where $f_{ij}$ denotes the number of times word i appears in news $d_j$, $N_j$ denotes the total number of words in that news, and $tf_i$ denotes the frequency with which word i appears in the news;

calculating the inverse document frequency:

$$idf_i = \log \frac{N}{N_i}$$

where $N_i$ denotes the number of news items containing word i, N denotes the total number of news items in the news set, and $idf_i$ denotes the inverse document frequency, obtained by dividing the total number of news items by the number of news items containing the word and taking the logarithm of the quotient;

vectorizing each piece of news in the text by adopting the term frequency-inverse document frequency (TF-IDF) algorithm as $d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$, where $t_i$ is a feature item of the text, $w_i$ is the weight of that feature item, and d denotes the vectorized news. A TF-IDF model is first trained on a large-scale corpus, and each piece of news is then vectorized with this model.

The event cluster acquisition module is configured to aggregate the vectorized event set by adopting an unsupervised clustering algorithm to form hot event clusters.

The event cluster acquisition module is further configured to process the following:

inputting the news set to be processed, $D = \{d_1, d_2, \ldots, d_n\}$, and a minimum threshold θ;

taking one piece of news as the initial cluster center and calculating the content similarity between it and the other news; in particular, taking news d₁ as the initial cluster center, and calculating the content similarity between each of the remaining news items and news d₁ with the cosine similarity algorithm:

sim(d, T) = cos(d, T) = a

in the above formula, T represents the whole news set, and a represents the cosine similarity value;

comparing the calculated content similarities with the minimum threshold θ; if all of the content similarities are smaller than θ, creating a new cluster with d₁ as its cluster center, and otherwise assigning d₁ to the cluster with the greatest similarity;

aggregating the news set into a plurality of event clusters according to the clustering result, and outputting the category numbers of the event clusters, each cluster being defined as an event cluster whose reports have similar content.
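As a worked illustration of the threshold comparison, consider two news vectors with invented weights over three shared feature items, $d_a = (1, 2, 0)$ and $d_b = (2, 1, 1)$, and an invented threshold θ = 0.5. Assuming the standard normalized form of cosine similarity, which the text above does not spell out:

```latex
\[
\cos(d_a, d_b)
  = \frac{\sum_{i=1}^{n} w_{a,i}\, w_{b,i}}
         {\sqrt{\sum_{i=1}^{n} w_{a,i}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{b,i}^{2}}}
  = \frac{1 \cdot 2 + 2 \cdot 1 + 0 \cdot 1}{\sqrt{5}\,\sqrt{6}}
  = \frac{4}{\sqrt{30}} \approx 0.730
\]
```

Since 0.730 is not smaller than θ = 0.5, the second news item would join the cluster of the first one instead of opening a new cluster.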

The aggregation module is configured to conduct vectorization processing on each event cluster by adopting a deep learning algorithm and conduct aggregation by using an unsupervised clustering algorithm again.

The aggregation module is further configured to process the following:

treating each event cluster as one long text, performing word segmentation, and inputting the result into a skip-gram algorithm, wherein the skip-gram algorithm uses the probabilistic model $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ (in this model, the parameter $w_i$ represents the current word, the parameters $w_{i+1}, w_{i-1}$ represent the two words adjacent to the current word, and the parameter $u_j$ represents the event cluster vector obtained in the previous iteration, which is randomly generated the first time) to calculate the probabilities of the two words adjacent to the current word $w_i$ and selects the word with the highest probability in the dictionary as the output, the event cluster vector $u_j$ obtained in the previous iteration also being input into the skip-gram algorithm;

taking the difference between the words calculated through $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ and the real adjacent words to obtain a loss term, propagating the loss term back to $p(w_{i+1}, w_{i-1} \mid w_i, u_j)$ through the back-propagation algorithm, and then updating the value of the corresponding event cluster vector $u_j$;

repeating the above two steps until the value of $u_j$ becomes stable or the whole context of the event cluster has been trained;

collecting the vectorization results of all event clusters, using them as the input of a single-pass algorithm for secondary clustering, and defining the results as topic clusters.

The topic cluster description generation module is configured to generate a topic cluster description using a new word discovery algorithm.

The topic cluster description generation module is further configured to process the following:

gathering all news in each topic cluster together, passing them through the Chinese word segmentation module, taking the segmentation result as input, and calculating three indexes respectively: word frequency, degree of aggregation and degree of freedom, the specific calculation being as follows:

(3) Calculating the degree of freedom: suppose a word appears N times in total and n distinct Chinese characters appear immediately to its left, occurring N1, N2, ..., Nn times respectively, so that N = N1 + N2 + ... + Nn. The probability of each character appearing to the left of the word can then be calculated, and the left-adjacent entropy follows from the entropy formula. The smaller the entropy, the lower the degree of freedom; the smaller of the left-adjacent entropy and the right-adjacent entropy of a word is taken as its final degree of freedom;

taking the product of the word frequency, the degree of aggregation and the degree of freedom as a ranking index, and generating representative words as the topic description.
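As a worked illustration of the degree of freedom that enters the ranking index, suppose (with invented counts) that a candidate word appears N = 4 times, that three different characters appear to its left with counts 2, 1 and 1, and that the same character appears to its right all 4 times. Using the Shannon entropy with natural logarithms, which is an assumption about the entropy formula referred to above:

```latex
\[
H_{\text{left}}
  = -\sum_{k=1}^{3} \frac{N_k}{N} \ln \frac{N_k}{N}
  = -\left( \tfrac{1}{2}\ln\tfrac{1}{2} + 2 \cdot \tfrac{1}{4}\ln\tfrac{1}{4} \right)
  \approx 1.040,
\qquad
H_{\text{right}} = -\,1 \cdot \ln 1 = 0
\]
\[
\text{degree of freedom} = \min(H_{\text{left}}, H_{\text{right}}) = 0
\]
```

Although the left context is varied, the fixed right neighbor suggests that the candidate is only a fragment of a longer word, so its degree of freedom, and therefore its ranking index, is zero.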

While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it is to be understood and appreciated that the methodologies are not limited by the order of acts, as some acts may, in accordance with one or more embodiments, occur in different orders and/or concurrently with other acts from that shown and described herein or not shown and described herein, as would be understood by one skilled in the art.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A hot spot event identification method based on multi-level clustering, characterized in that the method comprises:

Step 1: Preprocess the text and divide the text content into multiple phrases;

Step 2: Perform text vectorization processing on the text segmented by phrases to form a vectorized event set;

Step 3: Use an unsupervised clustering algorithm to aggregate the vectorized event sets to form hot event clusters;

Step 4: Vectorize each event cluster using a deep learning algorithm and aggregate again using an unsupervised clustering algorithm.

2. The method for identifying hotspot events based on multi-level clustering according to claim 1, wherein step 1 further comprises:

Step 1-1: Import a professional thesaurus and a stop word list to assist the Chinese word segmentation module;

Step 1-2: Use named entity recognition technology to identify the main institutions and names appearing in the text;

Step 1-3: Use the Chinese word segmentation module to segment the text into multiple phrases.

3. The method for identifying hotspot events based on multi-level clustering according to claim 1, wherein step 2 further comprises:

Step 2-1: Calculate the number of times each word appears in the text (the word frequency) and normalize it;

Step 2-2: Calculate the inverse document frequency;

Step 2-3: Vectorize each news item in the text using the word frequency-inverse document frequency (TF-IDF) algorithm.
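As an illustration only, steps 2-1 to 2-3 can be sketched in Python as follows. The sketch assumes that the word frequency is normalized by the length of the news item and that the inverse document frequency is ln(number of news items / number of news items containing the word); the claim does not fix these formulas, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (one per news item).

    Returns (vocabulary, vectors), where vectors[i][k] is the TF-IDF weight
    of vocabulary[k] in news item i.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of news items that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    # Assumed IDF definition: ln(N / df); no smoothing is applied.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty news items
        # Assumed normalization: raw count divided by news item length.
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vocab, vectors


news = [["bank", "cuts", "rates"], ["bank", "raises", "rates"]]
vocab, vectors = tfidf_vectors(news)
```

Words that occur in every news item (here "bank" and "rates") receive a weight of zero under this IDF definition, so only the distinguishing words contribute to the content similarity computed in the clustering step.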

4. The hot spot event identification method based on multi-level clustering according to claim 1, characterized in that step 3 further comprises:

Step 3-1: Input the news set to be processed D = {d₁, d₂, ..., dₙ} and the minimum threshold θ;

Step 3-2: Take one news item as the initial cluster center and calculate its content similarity with the other news items;

Step 3-3: Compare the calculated content similarities with the minimum threshold θ. If all the content similarities are less than the minimum threshold θ, add a new cluster with d₁ as the cluster center; otherwise, assign d₁ to the cluster with the largest similarity;

Step 3-4: According to the clustering results, aggregate the news set into multiple event clusters and output the category number of each event cluster.
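Steps 3-1 to 3-4 can be sketched in Python as a single-pass clustering loop. The sketch assumes cosine similarity as the content similarity measure and keeps the first news item of each cluster as its fixed cluster center; neither choice is fixed by the claim, and the names `cosine` and `single_pass` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def single_pass(vectors, theta):
    """Return one cluster number per input vector, in input order."""
    centers = []  # vector of the news item that opened each cluster
    labels = []
    for vec in vectors:
        sims = [cosine(vec, center) for center in centers]
        if not sims or max(sims) < theta:
            # Every similarity is below the threshold: open a new cluster.
            centers.append(vec)
            labels.append(len(centers) - 1)
        else:
            # Otherwise join the cluster with the largest similarity.
            labels.append(sims.index(max(sims)))
    return labels


event_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = single_pass(event_vectors, theta=0.8)
```

Because each news item is compared only against the existing cluster centers, the data is scanned once and the number of clusters does not have to be chosen in advance; the threshold θ alone controls how fine-grained the event clusters are.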

5. The method for identifying hotspot events based on multi-level clustering according to claim 1, wherein step 4 further comprises:

Step 4-1: Treat each event cluster as a long text and, after word segmentation, input it into the skip-gram algorithm. Using the probability model p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ), the skip-gram algorithm calculates the probability of the two words adjacent to the current word wᵢ and selects the word with the highest probability in the dictionary as the output; the event cluster vector uⱼ obtained from the previous iteration is also input into the skip-gram algorithm;

Step 4-2: Take the difference between the words predicted by p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ) and the real adjacent words to obtain the loss term, pass the loss term back to p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ) through the back-propagation algorithm, and then update the value of the event cluster vector uⱼ;

Step 4-3: Repeat steps 4-1 to 4-2 until the value of the vector uⱼ stabilizes or training on the text under the event cluster is completed;

Step 4-4: Collect the vectorized results of all event clusters as the input of the single-pass algorithm, perform the second clustering, and define the results as topic clusters.
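The training loop of steps 4-1 to 4-3 can be sketched in pure Python as follows. This is an illustrative sketch under several assumptions that the claim does not fix: the event cluster vector uⱼ is added to the input vector of the current word, the probability model is a full softmax over the dictionary, and the loss term is the cross-entropy against the real adjacent words. The function name `train_cluster_vectors` and the parameters `dim`, `epochs`, and `lr` are hypothetical.

```python
import math
import random


def train_cluster_vectors(clusters, dim=8, epochs=30, lr=0.1, seed=0):
    """clusters: list of token lists, one segmented long text per event cluster.

    Returns (cluster_vectors, loss_per_epoch).
    """
    rng = random.Random(seed)
    vocab = sorted({w for tokens in clusters for w in tokens})
    index = {w: k for k, w in enumerate(vocab)}

    def init():
        return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

    w_in = [init() for _ in vocab]    # input vector of each word
    w_out = [init() for _ in vocab]   # output vector of each word
    u = [init() for _ in clusters]    # event cluster vectors u_j

    losses = []
    for _ in range(epochs):
        total = 0.0
        for j, tokens in enumerate(clusters):
            ids = [index[w] for w in tokens]
            for i, cur in enumerate(ids):
                # Real adjacent words w_{i-1} and w_{i+1}, where they exist.
                targets = [ids[i + d] for d in (-1, 1) if 0 <= i + d < len(ids)]
                if not targets:
                    continue
                # Step 4-1: combine current word and cluster vector, then softmax.
                h = [a + b for a, b in zip(w_in[cur], u[j])]
                scores = [sum(o * x for o, x in zip(row, h)) for row in w_out]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                z = sum(exps)
                probs = [e / z for e in exps]
                # Step 4-2: cross-entropy loss and its gradient on the scores.
                total -= sum(math.log(probs[t]) for t in targets)
                err = [len(targets) * p for p in probs]
                for t in targets:
                    err[t] -= 1.0
                grad_h = [sum(err[k] * w_out[k][c] for k in range(len(vocab)))
                          for c in range(dim)]
                for k, row in enumerate(w_out):
                    for c in range(dim):
                        row[c] -= lr * err[k] * h[c]
                for c in range(dim):
                    w_in[cur][c] -= lr * grad_h[c]
                    u[j][c] -= lr * grad_h[c]   # update the cluster vector u_j
        losses.append(total)
    return u, losses


texts = [["bank", "raises", "rates", "bank", "raises", "rates"],
         ["team", "wins", "cup", "team", "wins", "cup"]]
cluster_vectors, losses = train_cluster_vectors(texts)
```

The returned cluster vectors could then be passed to a single-pass clustering routine for the second clustering of step 4-4. A fixed number of epochs stands in here for the stopping condition of step 4-3; checking the change in uⱼ between epochs would be a closer match to "until the value of the vector stabilizes".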

6. The method for identifying hotspot events based on multi-level clustering according to claim 1, wherein the method further comprises:

Step 5: Use the new word discovery algorithm to generate topic cluster descriptions.

7. The method for identifying hotspot events based on multi-level clustering according to claim 6, wherein step 5 further comprises:

Step 5-1: Collect all the news in each topic cluster, pass it through the Chinese word segmentation module, take the word segmentation result as input, and calculate three indicators: word frequency, degree of aggregation, and degree of freedom;

Step 5-2: Use the product of word frequency, degree of aggregation, and degree of freedom as the ranking index, and generate representative words as topic descriptions.

8. A hot spot event identification system based on multi-level clustering, characterized in that the system comprises:

A phrase segmentation module, configured to preprocess the text and segment the text content into multiple phrases;

A vectorization module, configured to perform text vectorization processing on the text segmented by phrases to form a vectorized event set;

An event cluster acquisition module, configured to use an unsupervised clustering algorithm to aggregate the vectorized event sets to form hot event clusters;

An aggregation module, configured to use a deep learning algorithm to vectorize each event cluster and then aggregate again using an unsupervised clustering algorithm.

9. The hot-spot event recognition system based on multi-level clustering according to claim 8, wherein the phrase segmentation module is further configured to: import a professional thesaurus and a stop word list to assist the Chinese word segmentation module; use named entity recognition technology to identify the main institutions and names appearing in the text; and use the Chinese word segmentation module to segment the text into multiple phrases.

10. The hot-spot event recognition system based on multi-level clustering according to claim 8, wherein the vectorization module is further configured to: calculate the number of times each word appears in the text (the word frequency) and normalize it; calculate the inverse document frequency; and vectorize each news item in the text using the word frequency-inverse document frequency (TF-IDF) algorithm.

11. The hot-spot event identification system based on multi-level clustering according to claim 8, wherein the event cluster acquisition module is further configured to: input the news set to be processed D = {d₁, d₂, ..., dₙ} and the minimum threshold θ; take one news item as the initial cluster center and calculate its content similarity with the other news items; compare the calculated content similarities with the minimum threshold θ, and if all the content similarities are less than the minimum threshold θ, add a new cluster with d₁ as the cluster center, otherwise assign d₁ to the cluster with the largest similarity; and, according to the clustering results, aggregate the news set into multiple event clusters and output the category number of each event cluster.

12. The hot-spot event recognition system based on multi-level clustering according to claim 8, wherein the aggregation module is further configured to: treat each event cluster as a long text and, after word segmentation, input it into the skip-gram algorithm, wherein the skip-gram algorithm uses the probability model p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ) to calculate the probability of the two words adjacent to the current word wᵢ and selects the word with the highest probability in the dictionary as the output, and the event cluster vector uⱼ obtained from the previous iteration is also input into the skip-gram algorithm; take the difference between the words predicted by p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ) and the real adjacent words to obtain the loss term, pass the loss term back to p(wᵢ₊₁, wᵢ₋₁ | wᵢ, uⱼ) through the back-propagation algorithm, and then update the value of the event cluster vector uⱼ; repeat the above two steps until the value of the vector uⱼ stabilizes or training on the text under the event cluster is completed; and collect the vectorized results of all event clusters as the input of the single-pass algorithm, perform the second clustering, and define the results as topic clusters.

13. The hot spot event identification system based on multi-level clustering according to claim 8, wherein the system further comprises:

A topic cluster description generation module, configured to use a new word discovery algorithm to generate topic cluster descriptions.

14. The hot-spot event identification system based on multi-level clustering according to claim 13, wherein the topic cluster description generation module is further configured to: collect all the news in each topic cluster, pass it through the Chinese word segmentation module, take the word segmentation result as input, and calculate three indicators: word frequency, degree of aggregation, and degree of freedom; and use the product of word frequency, degree of aggregation, and degree of freedom as the ranking index to generate representative words as topic descriptions.