CN103544255B

CN103544255B - Text semantic relativity based network public opinion information analysis method

Info

Publication number: CN103544255B
Application number: CN201310482522.5A
Authority: CN
Inventors: 陶宇炜; 谢爱娟; 熊长江; 王娟琳
Original assignee: Changzhou University
Current assignee: Changzhou Hualong Network Technology Co ltd
Priority date: 2013-10-15
Filing date: 2013-10-15
Publication date: 2017-01-11
Anticipated expiration: 2033-10-15
Also published as: CN103544255A

Abstract

The present invention relates to a network public opinion information analysis system based on text semantic correlation, comprising the following modules: a network public opinion information collection module, which collects content-rich public opinion information of various kinds from web pages; a public opinion information extraction module and a public opinion information preprocessing module, which preliminarily filter and segment the collected public opinion information, extract the meta information of the text, establish the feature semantic network graph of the text, and perform weighted calculation and feature extraction to serve public opinion information mining; a public opinion information mining module, which classifies texts using an improved text clustering analysis method based on semantic similarity; and a public opinion information analysis module, which performs OLAP multi-dimensional statistics on the mined public opinion data, analyzes public opinion evaluation indicators, and generates decision support related to public opinion information. The invention solves the problem of incomplete semantic information of words in the text, and efficiently realizes cluster analysis and hot topic discovery for dynamic data in a large-scale network environment.

Description

Analysis method of network public opinion information based on text semantic correlation

Technical field

The invention relates to the technical field of network information, and in particular to a method for analyzing network public opinion information based on text semantic correlation.

Background art

In today's society the Internet has penetrated people's daily life, and communication tools such as microblogs, forums and blogs have become important channels through which people obtain information, express opinions and disseminate information. Through network platforms, public opinion information spreads rapidly and attracts wide attention; its speed, reach and influence far exceed those of traditional media. The anonymous interactivity of cyberspace and its freedom from constraints of time and place make network public opinion a powerful force of social opinion that has a real impact on social development and stability. Positive network public opinion acts as "positive energy" that promotes social development; negative network public opinion harms social stability and can trigger a public opinion crisis. Strengthening the monitoring, analysis and management of network public opinion information is therefore of great practical significance for stabilizing social order and building a harmonious society. Monitoring network public opinion information in time, judging and deciding correctly, responding quickly, and taking effective measures to resolve public opinion crises have become the key and difficult issues of network public opinion management.

Summary of the invention

In view of the characteristics of network public opinion information described in the background art above and the problems to be solved in managing it, the present invention provides a method for analyzing network public opinion information based on text semantic correlation.

The technical solution adopted by the present invention to solve its technical problem is a method for analyzing network public opinion information based on text semantic correlation. The method uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database, and includes the following steps:

a. The network public opinion information collection module collects public opinion information of various kinds from web pages and stores it in the public opinion information database;

b. The public opinion information extraction module and the public opinion information preprocessing module preliminarily filter and segment the public opinion information collected in step a and extract the content information contained in the text, providing data services for public opinion information mining;

c. On the basis of step b, the public opinion information mining module applies an improved text clustering analysis method based on semantic similarity, generates category description information, and selects the text information contained in the cluster analysis results; it counts category features with the TFIDF word frequency feature calculation method based on feature statistics to obtain category feature words, selects nouns as candidate category feature words, sorts the candidates by weight, takes the candidates with the larger weights as category keywords, and uses the semantic relationships between category keywords to form the classification results; it identifies and establishes new network public opinion topics, and detects and tracks content related to existing topics;

d. Finally, the public opinion information analysis module performs OLAP multi-dimensional statistical analysis on the data mined in step c, and analyzes public opinion evaluation indicators such as the degree of attention paid to a topic's content and the emotional tendency of a topic.

In step a, the public opinion information collection module collects from network public opinion information sources. Unlike a general web crawler, it not only crawls web pages but also formats the page content and extracts the topic and content of the public opinion; the resulting data is saved in txt or html files and stored in the public opinion information database. To avoid being blocked, the module combines three techniques: time-shared access, periodic replacement of the IP address, and single sign-on through a simulated browser. The specific steps performed by the module are: starting from the URLs of predefined topic-related web pages, obtain the text information in each page and extract new URLs from the current page into a queue, until the public opinion information meeting the conditions has been collected and the URL queue is empty; the collected web page text information is then stored in the public opinion information database, classified by field, for the public opinion information extraction module to call.

In step b, the public opinion information extraction module removes irrelevant content from the web pages (noise data such as advertisements, navigation information, pictures and copyright notices), extracts the meta information of the body text that is useful for public opinion analysis, and reconstructs the text so that topic-representative information is gathered together. The public opinion information preprocessing module takes the collected sources after extraction and performs Chinese word segmentation, stop word filtering, named entity recognition, part-of-speech tagging, syntactic parsing and feature word extraction, and builds a forward index and an inverted index. It then builds the text feature semantic network graph: the entities E contained in the text are the nodes of the graph, the semantic relationship between two entities is a directed edge, the semantic relationships between entities combined with word frequency information give the node weights, and the weight of a directed edge expresses the importance of the entity relationship in the text. The entities E include thing entities NE, event entities VE and event relationship entities RE. Finally, the word frequency and document frequency of the text are counted and feature words are extracted, the words that reflect the characteristics of the text being selected to represent it.
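The graph structure described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `SemanticGraph`, the method names and the rule that each mention adds 1 to a node weight are assumptions made for illustration. Repeated entities and repeated relations are merged by summing their weights.

```python
from collections import defaultdict


class SemanticGraph:
    """Directed graph: nodes are entities, edges are semantic relations."""

    def __init__(self):
        self.node_weight = defaultdict(float)  # entity -> accumulated weight
        self.edge_weight = defaultdict(float)  # (source, target) -> accumulated weight

    def add_relation(self, source, target, weight=1.0):
        # Assumption: every mention adds 1 to the node weight,
        # a stand-in for the word frequency information.
        self.node_weight[source] += 1.0
        self.node_weight[target] += 1.0
        # Repeated edges between the same pair are merged by summing weights.
        self.edge_weight[(source, target)] += weight

    def top_entities(self, n):
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(self.node_weight.items(), key=lambda kv: (-kv[1], kv[0]))
        return [entity for entity, _ in ranked[:n]]


graph = SemanticGraph()
graph.add_relation("flood", "city")
graph.add_relation("flood", "rescue")
graph.add_relation("flood", "city", weight=2.0)
print(graph.top_entities(1))
```

The heaviest nodes and edges can then serve as a compact description of what a text is about.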

Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on domestic research results in Chinese word segmentation, the method uses the word segmentation, part-of-speech tagging and named entity recognition functions of the ICTCLAS Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences, to segment the text content of the public opinion information and extract words with a length greater than two. After segmentation, stop words that do not help a computer understand the text are filtered out, and words with parts of speech such as nouns, verbs, nominal adjectives and verbal adjectives are retained, giving a set of candidate feature words; this effectively reduces the index size, increases retrieval efficiency and improves accuracy. A forward index and an inverted index are built over the segmented text documents to support interactive user queries. After segmentation, part-of-speech tagging and stop word removal, the feature semantic network graph of the text is built, information such as word frequency and document frequency is counted, and weighted calculation and feature extraction are then performed.

In step c, after the text set has been preprocessed (Chinese word segmentation, stop word filtering and analysis of structured tag information), the public opinion information mining module takes the text data set generated by the information extraction module and, using the text semantic feature description structure built from the text feature semantic network graph, calculates the semantic similarity between texts with a similarity evaluation method, constructs a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity. Category description information is generated from the cluster analysis results, and the text information contained in those results is selected. Category features are counted with the TFIDF word frequency feature calculation method based on feature statistics to obtain candidate category feature words; nouns are selected as candidates, the candidates are sorted by weight, category keywords are chosen according to the weight values, and the semantic relationships between category keywords are used to form the classification results. The mining results are built into a knowledge base, which can also be configured to support text mining functions such as public opinion topic discovery and public opinion tendency analysis at the same time.
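The keyword selection in step c can be sketched as follows. This is only an illustration under stated assumptions: the function name `category_keywords`, the tag `"n"` for nouns, and the plain `tf * log(N / df)` weighting are choices made here, since the text names TFIDF but does not give a formula.

```python
import math
from collections import Counter


def category_keywords(clusters, top_n=3, noun_tags=("n",)):
    """clusters: {label: [doc, ...]}, where doc is a list of (word, pos_tag)."""
    all_docs = [doc for docs in clusters.values() for doc in docs]
    n_docs = len(all_docs)

    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for doc in all_docs:
        df.update({word for word, _ in doc})

    result = {}
    for label, docs in clusters.items():
        # Term frequency inside the cluster, nouns only.
        tf = Counter(word for doc in docs for word, tag in doc if tag in noun_tags)
        scores = {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
        # Sort candidates by weight, largest first; ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [word for word, _ in ranked[:top_n]]
    return result
```

A word that appears in every document gets weight zero under this formula, which is the usual reason for preferring TFIDF over raw counts when describing a category.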

In step d, the public opinion information analysis module performs OLAP multi-dimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indicators such as topic attention, content sensitivity, degree of diffusion and publication influence, supporting the relevant departments in following public opinion trends in time, releasing public opinion information at the right moment, and making correct decisions.
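As a rough illustration of multi-dimensional statistics, the sketch below counts records along any chosen combination of dimensions, which is the basic roll-up operation of OLAP. The record fields (`topic`, `source`, `sentiment`) and the function name `roll_up` are hypothetical; the text does not specify the schema or the OLAP engine.

```python
from collections import Counter


def roll_up(records, dimensions):
    """Count records grouped by the given dimension names."""
    counts = Counter()
    for record in records:
        counts[tuple(record[d] for d in dimensions)] += 1
    return dict(counts)


records = [
    {"topic": "flood", "source": "forum", "sentiment": "neg"},
    {"topic": "flood", "source": "weibo", "sentiment": "neg"},
    {"topic": "vote", "source": "forum", "sentiment": "pos"},
]
print(roll_up(records, ["topic"]))
print(roll_up(records, ["topic", "sentiment"]))
```

Counting by topic alone gives a crude attention measure; adding the sentiment dimension breaks the same counts down by emotional tendency.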

Compared with the prior art, the present invention has the following beneficial effects:

1. Current network public opinion information is massive, dynamic, incomplete and diverse in form, yet existing analysis methods often ignore the relationships within the text content of public opinion information, which makes their results inaccurate. The present invention builds a text feature semantic network graph model of the public opinion text, introducing the semantic associations between words and the links between contexts into the text description structure; combined with the improved text clustering algorithm based on semantic similarity, it mines and analyzes the contextually and semantically related content in public opinion texts.

2. By building the text feature semantic network graph of a public opinion text, the contextual relationships between its words are turned into a directed graph structure composed of feature items and weights. This preserves the structure of the words' contextual information while strengthening their contextual semantics, describes the implicit semantic information and topic features of the text well, and solves the problem of missing semantic information for the words in the text.

3. The improved text clustering algorithm based on semantic similarity is suited to cluster analysis of dynamic data and discovery of hot public opinion topics in a large-scale network environment. By calculating the semantic similarity of texts and constructing a text semantic similarity matrix, it mines the contextually and semantically related content in public opinion texts in depth and detects and tracks new topic events in time. It represents the topic of a class by multiple centers within the class and takes the maximum of the similarities between a text and each center as the similarity between the text and the class, which effectively improves the operating efficiency of the system; the benefit for cluster analysis becomes more evident as the number of texts grows.
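The multi-center rule in item 3 can be sketched in Python. Several points here are assumptions and not taken from the text: cosine similarity over bag-of-words vectors stands in for the semantic similarity measure, and the `threshold` below which a text opens a new class is a hypothetical parameter.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def class_similarity(text_vec, centers):
    # Similarity to a class is the maximum similarity over its centers.
    return max(cosine(text_vec, center) for center in centers)


def assign(text_vec, classes, threshold=0.5):
    """classes: list of classes, each a non-empty list of center vectors.

    Returns the index of the best class, or starts a new class when no
    class reaches the threshold (assumed rule for new-topic detection).
    """
    best_index, best_score = None, threshold
    for index, centers in enumerate(classes):
        score = class_similarity(text_vec, centers)
        if score >= best_score:
            best_index, best_score = index, score
    if best_index is None:
        classes.append([text_vec])
        return len(classes) - 1
    return best_index
```

Comparing a text with a few centers per class, and not with every member, keeps the cost per text proportional to the number of centers, which is one plausible reading of the efficiency claim.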

Brief description of the drawings

Fig. 1 is a flow chart of the method for analyzing network public opinion information based on text semantic correlation according to an embodiment of the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawing and specific embodiments. The embodiments of the present invention are not limited thereto.

As shown in Fig. 1, the method of the present invention uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database. The processing flow is as follows:

(1) Public opinion information collection

Network public opinion information sources are collected. Unlike a general web crawler, the collector not only crawls web pages but also formats the page content and extracts the useful public opinion information, such as the topic and content; the resulting data is saved in txt or html files and written to the raw public opinion information database. The specific steps are as follows. Following a preset collection strategy, the module starts from the URLs of several seed pages and sends requests conforming to the HTTP protocol (using the GET method) through the relevant ports; the remote server returns an HTML document according to the content of the request. The collection module gathers all the information in the returned document, saves it first to a cache and then to the database, and obtains the text information of the page. While doing so, it keeps extracting newly found hyperlink URLs from the current page for visiting and discards URLs already visited. This cycle repeats until the web page text matching the search strategy has been collected and the queue of unvisited URLs is empty. The collected web page text information is stored in the database, classified by field, for the public opinion information extraction module to call.
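The crawl loop just described (seed URLs, a queue of unvisited links, a record of visited links) can be sketched as below. To keep the example runnable without a network, the page fetcher is passed in as a function; `fetch`, `accept` and `max_pages` are illustrative parameters, and the link pattern is a deliberately simple regular expression, not a full HTML parser.

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, accept=lambda url, html: True, max_pages=100):
    """Breadth-first crawl; fetch(url) returns the page HTML or None."""
    queue = deque(seed_urls)   # URLs waiting to be visited
    visited = set()            # URLs already visited
    pages = {}                 # url -> HTML of accepted pages
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        if accept(url, html):  # collection strategy, e.g. topic filter
            pages[url] = html
        for link in LINK_RE.findall(html):
            if link not in visited:
                queue.append(link)
    return pages


# A tiny in-memory "web" stands in for real HTTP GET requests.
web = {
    "a": '<a href="b">b</a><a href="c">c</a>',
    "b": '<a href="a">a</a>',
    "c": "",
}
print(sorted(crawl(["a"], web.get)))
```

In a real deployment `fetch` would issue the HTTP GET request and apply the anti-blocking measures; those parts are left out here.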

The network public opinion information collection module usually adopts an anti-blocking strategy that combines several techniques: time-shared access, periodic replacement of the IP address, and single sign-on through a simulated browser. Many sites, such as forums, blogs and microblogs, can be accessed only after a user logs in, and simulating a browser is the easier strategy to implement here. The WebBrowser control provided by Microsoft's .NET development tool Visual Studio 2008 is used to call the API of the Microsoft IE browser, and SSO single sign-on is used to simulate submitting a user name and password. Once the login information has finished loading, the page jumps to the corresponding URL, keywords are submitted for retrieval, and the source of the required web pages is obtained.

The collected web page text information has two parts: Web content information, and Web structure and usage record information. Web content information covers textual content such as news headlines, body text and comments; Web structure and usage record information covers statistics such as click counts, page views and comment counts.

(2) Public opinion information extraction

The collected web pages contain noise data such as advertisements, navigation information, pictures and copyright notices, whereas public opinion analysis really needs only the meta information of the body text. This irrelevant content is removed and the meta information of the body text that is useful for public opinion analysis is extracted, to serve the subsequent mining and analysis of the text. The specific process is as follows:

(2-1) First, the Tidy tool is used to normalize the HTML markup of the page; the html parser tool is then used to build an HTML tree with the HTML tags as its nodes. This representation makes the HTML code easier to manage and manipulate and allows better structured mining of the code.

(2-2) Relevant information such as the title, keywords, body text, length, update time and URL is extracted from the collected public opinion sources. The title is the text between the <TITLE> and </TITLE> tags; keywords are contained in the META tags in the head of the HTML file and can be extracted from the META tag information; time information can be extracted by pattern matching and page analysis.

(2-3) The specific steps of body text extraction are: select appropriate keywords and obtain the URLs of the relevant web pages; access the server at each URL to obtain the HTML source of the page; delete the useless markup lines from the page source and keep the main content; replace the paragraph markers in the HTML code (such as </p> and <br>) with special symbols (such as *[/p]* and *[/br]*), replace carriage returns and line feeds with line separators, and store the result line by line so that the format of the page content is preserved; extract the text lying between the HTML tags (delimited by "<" and ">") on each line; replace the special symbols (such as *[/p]* and *[/br]*) with carriage returns so that the original paragraphs of the body are kept; remove the HTML escape sequences (such as &quot; and &lt;) from the resulting string and, with the help of regular expressions, match and extract the final body text.
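Steps (2-2) and (2-3) can be approximated with the Python standard library alone, as in the sketch below. It is a simplification and not the Tidy and html parser pipeline named in the text: `extract_meta` and `extract_body` are illustrative names, raw line breaks are collapsed to spaces instead of being kept as line separators, and escape sequences are decoded to their characters with `html.unescape`.

```python
import html
import re
from html.parser import HTMLParser

PARA_MARK = "*[/p]*"
BREAK_MARK = "*[/br]*"


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the keywords META tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(source):
    parser = MetaExtractor()
    parser.feed(source)
    return parser.title.strip(), parser.keywords


def extract_body(source):
    # Drop script and style blocks, which carry no body text.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", source)
    # Raw line breaks are layout only here; collapse them to spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # Protect paragraph boundaries with placeholder symbols.
    text = re.sub(r"(?i)</p\s*>", PARA_MARK, text)
    text = re.sub(r"(?i)<br\s*/?>", BREAK_MARK, text)
    # Remove the remaining tags, keeping the text between them.
    text = re.sub(r"<[^>]+>", "", text)
    # Restore paragraph boundaries, then decode escapes such as &quot;.
    text = text.replace(PARA_MARK, "\n").replace(BREAK_MARK, "\n")
    text = html.unescape(text)
    lines = [re.sub(r"\s+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)
```

Decoding escapes only after the tags are stripped matters: a decoded "<" in the body text would otherwise be mistaken for the start of a tag.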

After the title, keywords, body text, length, update time, URL and other relevant information have been extracted from the collected sources, the public opinion information extraction module also reconstructs the text information.

Text reconstruction analyzes the forms in which public opinion information appears, such as online news, forum posts and microblog posts, together with the structural features of the text. It groups the information that is representative of the topic into a "subject block" and the remaining information into a "content block", in order to improve the results of cluster analysis.

For web news, the title and first paragraph of the article form the "subject block", and the remaining descriptive text and the comments form the "content block".

For forum posts, the title and the opening post of the thread form the "subject block". The replies and follow-up posts are cleaned by removing posts that contain no Chinese characters and posts that use only common evaluation words, and a number of the remaining posts are selected to form the "content block".
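The two reconstruction rules can be sketched as follows. The function names, the dictionary layout, the `max_replies` cap and the choice to treat a reply as a "common evaluation word" post only when the whole reply equals a stock phrase are all assumptions made for illustration.

```python
import re

# Matches any character in the basic CJK Unified Ideographs range.
HAN = re.compile(r"[\u4e00-\u9fff]")


def rebuild_news(title, paragraphs, comments):
    """Title plus first paragraph form the subject block; the rest is content."""
    return {
        "subject": [title] + paragraphs[:1],
        "content": paragraphs[1:] + comments,
    }


def rebuild_forum(title, main_post, replies, stock_phrases, max_replies=20):
    """Title plus opening post form the subject block; cleaned replies are content."""
    kept = [
        reply for reply in replies
        if HAN.search(reply) and reply.strip() not in stock_phrases
    ]
    return {"subject": [title, main_post], "content": kept[:max_replies]}
```

For example, replies such as "+1" or ":)" are dropped because they contain no Chinese characters, and a reply consisting only of a stock phrase like "顶" is dropped by the phrase list.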

(3) Public opinion information preprocessing

After extraction, the public opinion information is preprocessed by Chinese word segmentation, named entity recognition, part-of-speech tagging, syntactic parsing and feature word extraction, and the results are saved to the database. Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on domestic research results in Chinese word segmentation, the method uses ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences, for word segmentation and part-of-speech tagging, and extracts words with a length greater than two. ICTCLAS provides word segmentation, part-of-speech tagging and new word recognition for Chinese text; it uses a role model approach for named entity recognition; and it lets users define custom dictionaries as needed. It offers both high segmentation accuracy and good segmentation results. The original publication gives the implementation code at this point; the listing is not reproduced in this text.

After segmentation, stop words that do not help a computer understand the text are filtered out, and words with parts of speech such as nouns, verbs, nominal adjectives and verbal adjectives are retained, giving a set of candidate feature words. This avoids clutter in the text, effectively reduces the index size, increases retrieval efficiency and improves retrieval accuracy.
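The filtering step can be sketched on the output of any segmenter that yields (word, part-of-speech) pairs. The sketch does not call ICTCLAS; the tag names in `KEEP_TAGS` are illustrative and modeled on common Chinese tag sets, and reading the length rule as "at least two characters" is an assumption, since the text does not state the unit of length.

```python
# Assumed tag names: n = noun, v = verb, an = nominal adjective, vn = verbal noun.
KEEP_TAGS = {"n", "v", "an", "vn"}


def candidate_features(tagged_words, stop_words, min_length=2, keep_tags=KEEP_TAGS):
    """Keep words that are long enough, not stop words, and of a wanted part of speech."""
    return [
        word
        for word, tag in tagged_words
        if len(word) >= min_length and word not in stop_words and tag in keep_tags
    ]


tagged = [("洪水", "n"), ("的", "u"), ("救援", "vn"), ("人", "n"), ("发生", "v"), ("我们", "r")]
print(candidate_features(tagged, stop_words={"的", "我们"}))
```

Filtering by part of speech before indexing is what shrinks the index: function words and pronouns never reach it.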

A forward index and an inverted index are built over the segmented text to support interactive user queries. For the forward index, the top N words by word frequency are selected to represent each text, stored in a hash table as <file name, keyword group>. Once the forward index is built, the keywords are looked up across the texts to find every file name containing a given keyword; these file names form a file name group, which yields the inverted index, stored in a hash table as <keyword, file name group>.
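The two hash tables can be illustrated with plain Python dictionaries. This is a minimal sketch of the data structures only, not of the Lucene-based implementation; the function names and the alphabetical tie-break are choices made here.

```python
from collections import Counter, defaultdict


def build_forward_index(docs, top_n=3):
    """docs: {file_name: [word, ...]} -> {file_name: [keyword, ...]}"""
    forward = {}
    for name, words in docs.items():
        # Top N words by frequency; ties are broken alphabetically.
        ranked = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
        forward[name] = [word for word, _ in ranked[:top_n]]
    return forward


def build_inverted_index(forward):
    """{file_name: [keyword, ...]} -> {keyword: [file_name, ...]}"""
    inverted = defaultdict(list)
    for name, keywords in forward.items():
        for keyword in keywords:
            inverted[keyword].append(name)
    return dict(inverted)


docs = {
    "a.txt": ["flood", "flood", "rescue", "city"],
    "b.txt": ["city", "city", "vote"],
}
forward = build_forward_index(docs, top_n=2)
print(forward)
print(build_inverted_index(forward))
```

A query for a keyword is then a single dictionary lookup in the inverted index, which returns the files to show the user.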

Index construction and the index retrieval service are implemented on Lucene, the Apache open source project, which provides a complete query engine, index engine and text analysis engine. Hadoop is used to store and manage the massive index files.

The index is built as follows:

1. Create the index writer object IndexWriter. A lexical analyzer must be supplied when the object is created, and different analyzers use different lexicons. ThesaurusAnalyzer is chosen because it can extract content summaries;

2. Create a Document object for each result set taken from the database;

3. Create a Field object for each data element in the result set and add it to the Document object;

4. Write the Document object to the index.

Index retrieval proceeds as follows: first create a query parser, which requires parameters such as the Field object name and the corresponding lexical analyzer; then obtain a query object from the query parser and the keywords; finally use the query object to obtain the result set, which consists of Document objects.

After word segmentation, part-of-speech tagging, and stop word removal, the feature semantic network graph of the text is built, statistics such as the word frequency and text frequency are collected, and weighting and feature extraction are then performed.

The text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relations. The entities E contained in the text (thing entities NE, event entities VE, and event relation entities RE) are the nodes of the graph, and the semantic relation between two entities is a directed edge. The semantic relations between entities, combined with word frequency information, give the node weights, and the weight of a directed edge indicates how important the entity relation is in the text. The graph is built by introducing node weights and by concept-based merging and simplification, which extracts the core semantics of the text: the words represented by nodes are merged and their node weights added, then the directed edges are merged and their weights added, yielding a text feature semantic network graph that describes the semantic information and topic features of the text. The concepts are defined as follows:

C1: A thing entity NE is defined as NE(id, concept, property, power), where id is the entity identifier, concept is the entity concept, property is the entity attribute, and power is the weight.

C2: An event entity VE is defined as VE(id, concept, property, power, isN, subT, objT1, objT2). In addition to the data items of NE, isN indicates whether the event is negated, subT is the header of the subject entity, and objT1 and objT2 are the headers of object entities 1 and 2.

C3: An event relation entity RE is defined as RE(id, concept, property, power, isN, subT, objT). An RE can be fully described with a single pair of subject and object entities.
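The three entity records C1 to C3 map naturally onto small data classes. The sketch below is only an illustration: the field types, the default values, and the choice to store subT, objT1, objT2, and objT as the ids of other entities are assumptions, since the source names the fields but does not specify their representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NE:
    """Thing entity: NE(id, concept, property, power)."""
    id: int
    concept: str
    property: str
    power: float = 1.0  # weight; the default of 1.0 is an assumption


@dataclass
class VE(NE):
    """Event entity: adds negation flag, one subject and two objects."""
    isN: bool = False
    subT: Optional[int] = None   # assumed: id of the subject entity
    objT1: Optional[int] = None  # assumed: id of object entity 1
    objT2: Optional[int] = None  # assumed: id of object entity 2


@dataclass
class RE(NE):
    """Event relation entity: one subject and object pair is enough."""
    isN: bool = False
    subT: Optional[int] = None
    objT: Optional[int] = None


# Hypothetical sentence: "the factory pollutes the river".
factory = NE(id=1, concept="factory", property="local")
river = NE(id=2, concept="river", property="polluted")
pollute = VE(id=3, concept="pollute", property="action",
             subT=factory.id, objT1=river.id)
```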

The text feature semantic network graph model is analyzed in the following steps:

S1: The text is analyzed sentence by sentence, and a feature semantic network graph is built for each sentence. For each sentence, the NEs it produces are identified, and each NE and its attribute information are recorded in the entity information table.

S2: After the NEs have been analyzed, the VEs are analyzed, and the concept, attributes, subject, and object of each VE are registered. VEs with the same subject and object are treated as the same VE; otherwise they are given different ids.

S3: The REs are analyzed next. REs must be distinguished from NEs and VEs, and the concept, attributes, subject, and object of each RE are registered in the entity information table.

S4: When the analysis is complete, the entity information table of the sentence is obtained. The table describes the relations between entities and is used to construct the entity relation graph, in which different kinds of connecting lines visualize the relations between NE and VE, between RE and NE or VE, and between an entity E and an attribute T.

S5: Starting from the feature semantic network graph of the first sentence, the graphs of the subsequent sentences are merged into it, nodes first and directed edges second.

S6: When nodes are merged, nodes whose words are identical or whose semantic similarity satisfies the threshold condition are merged and their weights added; otherwise the node is kept.

S7: Directed edge merging combines the directed edges that exist between the merged nodes and adds their weights.

S8: The weights of the edges adjacent to a newly merged node are updated to the weight of that node, which strengthens the semantic relations between nodes.

S9: Once the feature semantic network graph of all merged sentences has been output, the feature semantic network graph of the whole text is complete.
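The merging in steps S5 to S7 can be sketched in Python as follows. The graph representation (a pair of dictionaries), the function name `merge_graphs`, the `similar` callback, and the threshold value 0.8 are all assumptions introduced for illustration. Step S8 is left out because the source does not say which weight applies when both endpoints of an edge were merged.

```python
def merge_graphs(base, other, similar=None, threshold=0.8):
    """Merge sentence graph `other` into a copy of `base`.

    A graph is a pair (nodes, edges): nodes maps word -> weight and
    edges maps (source word, target word) -> weight.
    """
    nodes, edges = dict(base[0]), dict(base[1])
    rename = {}  # node name in `other` -> node name in the merged graph
    for word, weight in other[0].items():
        target = word if word in nodes else None
        if target is None and similar is not None:
            # S6: look for an existing node that is similar enough
            target = next(
                (n for n in nodes if similar(word, n) >= threshold), None
            )
        if target is None:
            nodes[word] = weight      # S6: no match, keep the node
            target = word
        else:
            nodes[target] += weight   # S6: merge, add node weights
        rename[word] = target
    for (src, dst), weight in other[1].items():
        key = (rename.get(src, src), rename.get(dst, dst))
        edges[key] = edges.get(key, 0) + weight  # S7: add edge weights
    return nodes, edges


# Hypothetical sentence graphs; "plant" is treated as similar to "factory".
SYNONYMS = {frozenset({"factory", "plant"})}


def similar(a, b):
    return 1.0 if frozenset({a, b}) in SYNONYMS else 0.0


first = ({"factory": 2, "river": 1}, {("factory", "river"): 1})
second = ({"plant": 1, "river": 1, "fish": 1},
          {("plant", "river"): 2, ("river", "fish"): 1})
nodes, edges = merge_graphs(first, second, similar)
```

Folding each subsequent sentence graph into the running result with this function corresponds to the loop described in S5 and S9.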

The next step assigns weights to part-of-speech features so that the text is marked accurately. Based on the characteristics of Chinese parts of speech and the elements of a complete event description (time, place, person, and event content), and using the Chinese part-of-speech tag set of the Chinese Academy of Sciences, text feature weights are assigned as follows: 3 for the title, 2 for subtitles and keywords, 1.5 for the abstract, and 1.3 for the first and last sentences of a paragraph.

After preprocessing, the title, body, and replies of each text carry different tags. When weights are computed, the tag of each keyword is read to assign the position weight of the word.
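The position weights listed above amount to a lookup from tag to weight. In the sketch below the tag names, the default weight of 1.0 for untagged body text, and the idea of summing weights over a word's occurrences are assumptions; the source gives only the weight values themselves.

```python
# Weight values from the description; the tag names are hypothetical.
POSITION_WEIGHTS = {
    "title": 3.0,
    "subtitle": 2.0,
    "keyword": 2.0,
    "abstract": 1.5,
    "paragraph_first_sentence": 1.3,
    "paragraph_last_sentence": 1.3,
}
DEFAULT_WEIGHT = 1.0  # assumption: no value is given for plain body text


def position_weight(tag):
    """Return the position weight for the tag attached to a word."""
    return POSITION_WEIGHTS.get(tag, DEFAULT_WEIGHT)


def weighted_frequency(tags):
    """Sum position weights over the tags of every occurrence of a word."""
    return sum(position_weight(tag) for tag in tags)


# A word seen once in the title, once in the body, once in the abstract.
score = weighted_frequency(["title", "body", "abstract"])
```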

(4) Public opinion information mining

The public opinion information mining module works on the text data set produced by the information extraction module after the text set has been preprocessed (Chinese word segmentation, stop word filtering, and structured tag information analysis). Using the text semantic feature description structure built from the text feature semantic network graph, it computes the semantic similarity between texts with the similarity evaluation method, builds a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity. Category description information is generated from the clustering results, and the text information contained in the clustering results is selected. Category features are counted with the TFIDF word frequency feature calculation method based on feature statistics to obtain candidate category feature words; nouns are selected as candidate category feature words and sorted by weight, the weight values determine which candidates become category keywords, and the semantic relations between category keywords are used to form the classification results. The mining results are built into a knowledge base, which can also be configured to support text mining functions such as public opinion topic discovery and public opinion tendency analysis at the same time.

First, the similarity between texts is defined and computed. It measures how closely the topics discussed in two texts are related, and Sim(D₁,D₂) denotes the similarity between text D₁ and text D₂. The similarity takes values between 0 and 1 and increases with how similar D₁ and D₂ are. The greater the similarity between texts, the more closely their topics are related. The semantic similarity between texts is evaluated as follows:

Let the texts obtained after the public opinion information extraction and preprocessing of step b be D₁(t₁₁,t₁₂,t₁₃,…,t_1m) and D₂(t₂₁,t₂₂,t₂₃,…,t_2m). The similarity between every keyword t_1i of text D₁ and every keyword t_2j of text D₂ is computed, forming the following similarity matrix:

$M(D_1, D_2) = \begin{pmatrix} Sim_{11} & \cdots & Sim_{1m} \\ \vdots & \ddots & \vdots \\ Sim_{m1} & \cdots & Sim_{mm} \end{pmatrix}$

Sim_ij (1≤i, j≤m) is the similarity between keyword t_1i of text D₁ and keyword t_2j of text D₂; M(D₁,D₂) is the similarity matrix between text D₁ and text D₂; i is the number of keywords in text D₁; m is the number of keywords in text D₂.

The word similarity is computed as S(T₁,T₂)=Max(i=1,2,…,n; j=1,2,…,m) S(y_1i,y_2j); that is, the similarity of two words is the maximum similarity over all pairs of their senses (a word may carry several senses).
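The word similarity formula takes the maximum over every pairing of senses. A minimal Python sketch follows; the function names and the toy sense similarity table are assumptions, since the source does not specify how the similarity of two individual senses is obtained.

```python
def word_similarity(senses1, senses2, sense_similarity):
    """Return the maximum similarity over all sense pairs of two words."""
    return max(
        (sense_similarity(a, b) for a in senses1 for b in senses2),
        default=0.0,  # assumption: a word without senses has similarity 0
    )


# Hypothetical similarity values between individual senses.
SENSE_TABLE = {
    frozenset({"fruit", "plant"}): 0.6,
    frozenset({"company", "plant"}): 0.1,
}


def toy_sense_similarity(a, b):
    if a == b:
        return 1.0
    return SENSE_TABLE.get(frozenset({a, b}), 0.0)


# A word with two senses compared with a word that has one sense.
score = word_similarity(["fruit", "company"], ["plant"], toy_sense_similarity)
```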

The similarity matrix M is traversed to find the keyword pair with the largest similarity value Sim, and the corresponding row and column are deleted. The traversal of M then continues to find the next keyword pair with the largest similarity value, and this is repeated until M is a zero matrix. Finally, the resulting sequence of maximum-similarity keyword pairs is used to obtain the semantic similarity of texts D₁ and D₂, computed as follows:

$Sim(D_1, D_2) = \frac{1}{m} \sum_{k=1}^{m} Sim_{k\text{-}max}(t_{1i}, t_{2j})$

Here, max denotes the maximum value of the similarity Sim; i is the number of keywords in text D₁; j is the number of keywords in text D₂.
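The greedy selection of maximum-similarity keyword pairs can be sketched in Python as follows. The function name, the toy keyword similarity, and the choice to take m as the keyword count of D₂ are assumptions; rows and columns are tracked as sets instead of being physically deleted from the matrix.

```python
def text_similarity(keywords1, keywords2, sim):
    """Greedy pairing: repeatedly take the largest remaining matrix entry."""
    matrix = [[sim(a, b) for b in keywords2] for a in keywords1]
    rows = set(range(len(keywords1)))
    cols = set(range(len(keywords2)))
    picked = []
    while rows and cols:
        value, i, j = max((matrix[i][j], i, j) for i in rows for j in cols)
        if value <= 0:
            break  # what is left of the matrix is all zeros
        picked.append(value)
        rows.remove(i)  # "delete" the row and column of the chosen pair
        cols.remove(j)
    m = len(keywords2)  # assumption: m is the keyword count of D2
    return sum(picked) / m if m else 0.0


# Hypothetical keyword similarities.
PAIRS = {
    frozenset({"rain", "storm"}): 0.75,
    frozenset({"city", "town"}): 0.5,
}


def toy_sim(a, b):
    return 1.0 if a == b else PAIRS.get(frozenset({a, b}), 0.0)


score = text_similarity(["flood", "rain", "city"],
                        ["flood", "storm", "town"], toy_sim)
```

In this example the pairs are picked in the order 1.0, 0.75, 0.5, so the result is their sum divided by m = 3.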

The improved text clustering analysis method based on semantic similarity is described as follows:

1. After all collected texts have been preprocessed, the TFIDF weighting method is used to weight the features of all category keywords, and the m best feature keywords are extracted to form the original keyword-based feature vector Di*.

2. The keywords in the original keyword-based feature vector Di* are preprocessed against the knowledge base: the words in the knowledge base that match the keywords are found and substituted for them, forming the new feature vector D_i, D_i=(T₁,T₂,…,T_i), i=1,2,3,…,m.

3. The m feature vectors D_i of the n texts are formed, the semantic similarity between the collected texts is computed with the text semantic similarity formula to form the similarity matrix M of the text set, and the average similarity MA of all feature vectors is obtained. The formulas are as follows:

$M = \begin{pmatrix} S_{11} & \cdots & S_{1n} \\ \vdots & \ddots & \vdots \\ S_{n1} & \cdots & S_{nn} \end{pmatrix}, \quad MA = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} S_{ij} - n \sum_{i=1}^{n} S_{ii}}{n(n-1)}$; where n is the number of texts;

4. Three similarity thresholds are set: a duplication threshold of 0.9, a topic center threshold of 0.5, and a new topic threshold of 0.3;

5. The text is compared with the central topic. If the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is regarded as having the same topic and the same content; if the similarity is less than the new topic threshold 0.3, a new class is created for the text; if the similarity is in the range 0 to 0.5, the text is a core content text that discusses a different aspect of the same topic and is marked as the second center, and so on, forming a hierarchical clustering result with multiple centers.

6. For the multi-center topic representation, the maximum of the similarities between the text and each center in a class is taken as the similarity between the text and that class.
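The threshold logic of steps 4 to 6 can be sketched in Python as follows. Several points are assumptions rather than statements of the method: the data layout of a cluster, the function and label names, the reading of the second-center band as 0.3 to 0.5 (the text says 0 to 0.5, but values below 0.3 already create a new class), and the treatment of similarities between 0.5 and 0.9 as ordinary membership, which the source does not spell out.

```python
DUPLICATE, CENTER, NEW_TOPIC = 0.9, 0.5, 0.3  # thresholds from step 4


def assign(text, clusters, similarity):
    """Place `text` into `clusters` and return a label for what happened.

    Each cluster is a dict {"centers": [...], "members": [...]}.
    """
    best_cluster, best_sim = None, 0.0
    for cluster in clusters:
        # Step 6: similarity to a class is the maximum over its centers.
        s = max(similarity(text, c) for c in cluster["centers"])
        if best_cluster is None or s > best_sim:
            best_cluster, best_sim = cluster, s
    if best_cluster is None or best_sim < NEW_TOPIC:
        clusters.append({"centers": [text], "members": [text]})
        return "new_topic"
    best_cluster["members"].append(text)
    if best_sim > DUPLICATE:
        return "duplicate"      # same topic, same content
    if best_sim <= CENTER:
        best_cluster["centers"].append(text)
        return "new_center"     # same topic, different aspect (assumed band)
    return "member"             # assumed handling of the 0.5 to 0.9 band


# Toy texts represented as numbers; similarity falls with distance.
def toy_similarity(a, b):
    return 1.0 - abs(a - b)


clusters = []
labels = [assign(x, clusters, toy_similarity)
          for x in (0.0, 0.05, 0.3, 0.6, 0.65, 2.0)]
```

Because each incoming text is compared only with the centers of each class, not with every member, the number of similarity computations grows with the number of centers rather than with the number of texts.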

The improved text clustering algorithm based on semantic similarity is suited to cluster analysis of dynamic data and to the discovery of hot public opinion topics in large-scale network environments. It detects new events promptly and detects and tracks new public opinion topics. Representing a public opinion topic with multiple centers within a class effectively improves the running efficiency of the system, and the effect becomes more pronounced as the number of texts grows.

(5) Public opinion information analysis

The public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indicators such as the attention paid to the content of a public opinion topic and the sentiment tendency of the topic, helping the relevant departments to follow public opinion trends promptly, release public opinion information at the right time, and make correct decisions.

The public opinion topics produced by collection, processing, and mining analysis are expressed as T=(T₁,T₂,…,T_n), where T_i is a text of the public opinion topic. The attention of a public opinion topic text is expressed as T_i=αN_p+βN_r, where α and β are weights, N_p is the number of clicks on the topic text, and N_r is the number of comments. In the attention measure of the public opinion topic, Np_i is the number of clicks on the i-th topic text and Nr_i is the number of comments on the i-th topic text. Since N_p>N_r, α is set to 0.02 and β to 0.98 on the basis of statistics.
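The per-text attention T_i=αN_p+βN_r is a weighted sum and can be sketched directly in Python. The function name and the sample click and comment counts are assumptions; only the weights 0.02 and 0.98 come from the description.

```python
ALPHA, BETA = 0.02, 0.98  # weights given in the description


def text_attention(clicks, comments, alpha=ALPHA, beta=BETA):
    """Attention of one topic text: alpha * clicks + beta * comments."""
    return alpha * clicks + beta * comments


# Hypothetical (clicks, comments) counts for two topic texts.
texts = {"post_a": (1000, 10), "post_b": (200, 60)}
attention = {name: text_attention(*counts) for name, counts in texts.items()}
ranked = sorted(attention, key=attention.get, reverse=True)
```

With these weights a text with few clicks but many comments (post_b) ranks above a text with many clicks and few comments (post_a), which reflects the small value chosen for α.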

The sentiment tendency of a public opinion topic is described on the basis of the cluster analysis data of the topic texts. A threshold is set first; a text shows polarity (positive or negative) only when its tendency measure exceeds the threshold. If the tendency measure is positive, the text is a positive comment; otherwise it is a negative comment.
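The threshold rule can be sketched in Python as follows. Comparing the magnitude of the tendency measure with the threshold, the "neutral" label for texts without polarity, and the sample threshold value are assumptions; the source does not state how the tendency measure itself is computed.

```python
def polarity(score, threshold):
    """Classify a text by its tendency measure `score`."""
    if abs(score) <= threshold:
        return "neutral"  # assumed label: the text shows no polarity
    return "positive" if score > 0 else "negative"


# Hypothetical tendency measures with an assumed threshold of 0.2.
results = [polarity(s, threshold=0.2) for s in (0.7, -0.5, 0.1)]
```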

After collection, preprocessing, information extraction, mining, and analysis, detailed data on public opinion topics are obtained and processed according to the established public opinion indicator evaluation system, and the results support decision making.

Claims

1. A method for analyzing network public opinion information based on text semantic correlation, characterized in that the method uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module, and a public opinion information database, and comprises the following steps:

a. The network public opinion information collection module collects various public opinion information from web pages and stores them in the public opinion information database;

b. The public opinion information extraction module and the public opinion information preprocessing module preliminarily filter and segment the public opinion information collected in step a, extract the content information contained in the text, and provide data services for public opinion information mining;

c. On the basis of step b, the public opinion information mining module uses an improved text clustering analysis method based on semantic similarity to generate category description information and select the text information contained in the clustering results; it counts category features with the TFIDF word frequency feature calculation method based on feature statistics to obtain category feature words, selects nouns as candidate category feature words, sorts them by weight, takes the candidates with the larger weight values as category keywords, and uses the semantic relations between category keywords to form the classification results; it identifies and establishes new network public opinion topics and detects and tracks content related to existing public opinion topics;

d. Finally, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the public opinion data mined in step c and analyzes public opinion evaluation indicators such as the attention paid to the content of a public opinion topic and the sentiment tendency of the topic;

In step a, the public opinion information collection module collects network public opinion information sources: it not only crawls web pages but also formats their content, extracts the topic and content of public opinion, stores the obtained data as txt or html files, and saves them in the public opinion information database; the network public opinion information collection module uses time-shared access, periodic IP address changes, and single sign-on through a simulated browser to avoid being blocked.

2. The method for analyzing network public opinion information based on text semantic correlation according to claim 1, characterized in that the public opinion information collection module starts from the URLs of web pages related to a predefined topic, obtains the text information in each web page, and extracts new URLs from the current page and puts them into the queue, until the public opinion information satisfying the conditions has been collected and the URL queue is empty; the collected web page text information is stored in the public opinion information database, classified by field, for the public opinion information extraction module to call.

3. The method for analyzing network public opinion information based on text semantic correlation according to claim 1, characterized in that, in step b, the public opinion information extraction module removes the irrelevant content of the web page, extracts the meta information of the text portion that is useful for public opinion analysis, reconstructs the text, and gathers the information representative of the topic; the public opinion information preprocessing module takes the public opinion information sources extracted by the public opinion information extraction module and performs Chinese word segmentation, stop word filtering, named entity recognition, part-of-speech tagging, syntactic analysis, and feature word extraction, and builds a forward index and an inverted index; it builds the text feature semantic network graph, in which the entities E contained in the text are the nodes of the graph, the semantic relation between two entities is a directed edge of the graph, the semantic relations between entities combined with word frequency information give the node weights, and the weight of a directed edge indicates the importance of the entity relation in the text, the entities E comprising thing entities NE, event entities VE, and event relation entities RE; it counts the word frequency and text frequency information of the text, then performs feature word extraction and selects words that reflect the features of the text to represent the text.

4. The method for analyzing network public opinion information based on text semantic correlation according to claim 3, characterized in that, in step c, after the text set has been preprocessed, including Chinese word segmentation, stop word filtering, and structured tag information analysis, the public opinion information mining module takes the text data set generated by the information extraction module and, according to the text semantic feature description structure built from the text feature semantic network graph, computes the semantic similarity between texts with the similarity evaluation method, builds a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity; category description information is generated from the clustering results and the text information contained in the clustering results is selected; category features are counted with the TFIDF word frequency feature calculation method based on feature statistics to obtain candidate category feature words, nouns are selected as candidate category feature words and sorted by weight, the weight values determine which candidates become category keywords, and the semantic relations between category keywords are used to form the classification results; the mining results are built into a knowledge base.

5. The method for analyzing network public opinion information based on text semantic correlation according to claim 3 or 4, characterized in that the text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relations; the words represented by network nodes are merged and the node weights added, then the directed edges are merged and their weights added, to build a text feature semantic network graph that describes the semantic information and topic features of the text.

6. The method for analyzing network public opinion information based on text semantic correlation according to claim 4, characterized in that the semantic similarity between texts is evaluated as follows:

Let the texts obtained after the public opinion information extraction and preprocessing of step b be D₁(t₁₁,t₁₂,t₁₃,…,t_1m) and D₂(t₂₁,t₂₂,t₂₃,…,t_2m); the similarity between every keyword t_1i of text D₁ and every keyword t_2j of text D₂ is computed, forming the following similarity matrix:

$M(D_1, D_2) = \begin{pmatrix} Sim_{11} & \cdots & Sim_{1m} \\ \vdots & \ddots & \vdots \\ Sim_{m1} & \cdots & Sim_{mm} \end{pmatrix}$

Sim_ij (1≤i, j≤m) is the similarity between keyword t_1i of text D₁ and keyword t_2j of text D₂; M(D₁,D₂) is the similarity matrix between text D₁ and text D₂; i is the number of keywords in text D₁; m is the number of keywords in text D₂;

The word similarity is computed as S(T₁,T₂)=Max(i=1,2,…,n; j=1,2,…,m) S(y_1i,y_2j); that is, the similarity of two words is the maximum similarity over all pairs of their senses, a sense being one of the several meanings a word may carry;

The similarity matrix M is traversed to find the keyword pair with the largest similarity value Sim, and the corresponding row and column are deleted; the traversal of M then continues to find the next keyword pair with the largest Sim value, and this is repeated until M is a zero matrix; finally, the resulting sequence of maximum-similarity keyword pairs is used to obtain the semantic similarity of texts D₁ and D₂, computed as follows:

$Sim(D_1, D_2) = \frac{1}{m} \sum_{k=1}^{m} Sim_{k\text{-}max}(t_{1i}, t_{2j})$

Here, max denotes the maximum value of the similarity Sim; i is the number of keywords in text D₁; j is the number of keywords in text D₂.

7. The method for analyzing network public opinion information based on text semantic correlation according to claim 6, characterized in that the improved text clustering analysis method based on semantic similarity is as follows:

1) First, after preprocessing all the collected texts, use the TFIDF weighting method to carry out feature weighting on all category keywords, and extract m optimal feature keywords to form the original keyword-based feature vector Di*;

2) Preprocessing the keywords in the original keyword-based feature vector Di* according to the knowledge base: find the words matching the keywords in the knowledge base and replace them to form new feature vectors D _i , D _i =(T ₁ ,T ₂ ,...,T _i ), i=1,2,3,...,m;

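3) Form the m feature vectors D_i of the n texts, compute the semantic similarity between the collected texts with the text semantic similarity formula to form the similarity matrix M of the text set, and obtain the average similarity MA of all feature vectors; the formulas are as follows:

$M = \begin{pmatrix} S_{11} & \cdots & S_{1n} \\ \vdots & \ddots & \vdots \\ S_{n1} & \cdots & S_{nn} \end{pmatrix}, \quad MA = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} S_{ij} - n \sum_{i=1}^{n} S_{ii}}{n(n-1)}$; where n is the number of texts;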

4) Set three similarity thresholds, one repetition threshold is 0.9, one topic center threshold is 0.5, and one new topic threshold is 0.3;

5) Compare the text with the central topic: if the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is regarded as having the same topic and the same content; if the similarity is less than the new topic threshold 0.3, a new class is created for the text; if the similarity is in the range 0 to 0.5, the text is a core content text that discusses a different aspect of the same topic and is marked as the second center, and so on, forming a hierarchical clustering result with multiple centers;

6) For the topic representation method of multiple centers, select the maximum value of the similarity between the text and each center in the class as the similarity of the text in this class.