
CN103544255B - Text semantic relativity based network public opinion information analysis method - Google Patents

Text semantic relativity based network public opinion information analysis method

Info

Publication number
CN103544255B
Authority
CN
China
Prior art keywords
text
public opinion
opinion information
information
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310482522.5A
Other languages
Chinese (zh)
Other versions
CN103544255A (en
Inventor
陶宇炜
谢爱娟
熊长江
王娟琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Hualong Network Technology Co ltd
Original Assignee
Changzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou University
Priority to CN201310482522.5A
Publication of CN103544255A
Application granted
Publication of CN103544255B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a network public opinion information analysis system based on text semantic correlation, comprising the following modules. A network public opinion information collection module collects the wide variety of public opinion information contained in web pages. A public opinion information extraction module and a public opinion information preprocessing module perform preliminary filtering and segmentation of the collected information, extract the meta-information of the body text, build a feature semantic network graph of each text, and carry out weighting and feature extraction in support of public opinion mining. A public opinion information mining module classifies texts using an improved text clustering method based on semantic similarity. A public opinion information analysis module runs OLAP multidimensional statistics over the mined data and analyzes public opinion evaluation indicators to support related decision-making. The invention addresses the problem of incomplete semantic information for words in a text and efficiently performs cluster analysis and hot-topic discovery over dynamic data in a large-scale network environment.

Description

Network public opinion information analysis method based on text semantic correlation

Technical field

The invention relates to the field of network information technology, and in particular to a network public opinion information analysis method based on text semantic correlation.

Background

The Internet now pervades daily life, and communication tools such as microblogs, forums, and blogs have become important channels through which people obtain information, express opinions, and spread information. Carried by online platforms, public opinion information spreads rapidly and draws wide attention; its speed, reach, and influence far exceed those of traditional media. The anonymity and interactivity of cyberspace and its freedom from constraints of time and place make network public opinion a powerful social force that affects social development and stability. Positive network public opinion acts as "positive energy" that promotes social development; negative network public opinion undermines social stability and can trigger public opinion crises. Strengthening the monitoring, analysis, and management of network public opinion information therefore has real practical significance for maintaining social order and building a harmonious society. Monitoring network public opinion in time, judging and deciding correctly, responding quickly, and taking effective measures to defuse public opinion crises have become the central and most difficult tasks in network public opinion management.

Summary of the invention

In view of the characteristics of network public opinion information described in the background above and the problems to be solved in managing it, the invention provides a network public opinion information analysis method based on text semantic correlation.

The technical solution adopted by the invention is a network public opinion information analysis method based on text semantic correlation. The method uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module, and a public opinion information database, and comprises the following steps:

a. The network public opinion information collection module collects public opinion information from web pages and stores it in the public opinion information database;

b. The public opinion information extraction module and the public opinion information preprocessing module perform preliminary filtering and segmentation of the information collected in step a and extract the content information contained in the text, providing data for public opinion mining;

c. Building on step b, the public opinion information mining module applies an improved text clustering method based on semantic similarity, generates category description information, and selects the text information contained in the clustering results. It computes category features with the statistics-based TF-IDF term-weighting method to obtain category feature words, selects nouns as candidate category feature words, ranks the candidates by weight, takes the candidates with the larger weights as category keywords, and uses the semantic relations among the category keywords to form the classification result. It identifies and establishes new network public opinion topics and detects and tracks content related to existing topics;

d. Finally, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and analyzes public opinion evaluation indicators such as the degree of attention to a topic's content and the sentiment orientation of a topic.

In step a, the public opinion information collection module collects from network public opinion information sources. Unlike a general-purpose web crawler, it must not only crawl web pages but also format their content and extract the topic and content of the public opinion; the resulting data are saved as txt or html files and stored in the public opinion information database. To avoid being blocked, the collection module combines three techniques: time-shared access, periodic IP address changes, and single sign-on through a simulated browser. The collection module proceeds as follows: starting from the URLs of predefined topic-related web pages, it obtains the text information in each page and extracts new URLs from the current page into a queue, continuing until the qualifying public opinion information has been collected and the URL queue is empty. The collected web page text is stored in the public opinion information database by field, for use by the public opinion information extraction module.

In step b, the public opinion information extraction module removes irrelevant content from the web page, such as advertisements, navigation information, images, copyright notices, and other noise data, extracts the meta-information of the body text that is useful for public opinion analysis, and reconstructs the text so that topic-representative information is grouped together. The public opinion information preprocessing module takes the output of the extraction module and performs Chinese word segmentation, stop-word filtering, named entity recognition, part-of-speech tagging, syntactic parsing, and feature-word extraction, and builds a forward index and an inverted index. It also builds a text feature semantic network graph: the entities E contained in the text are the nodes of the graph, the semantic relation between two entities is a directed edge, the semantic relations among entities combined with word-frequency information give the node weights, and the weight of a directed edge expresses how important that entity relation is in the text. The entities E comprise thing entities NE, event entities VE, and event-relation entities RE. The module then counts term frequency and document frequency and extracts feature words, selecting the words that best reflect the text's characteristics to represent it.
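The text feature semantic network graph described above can be pictured with a small data structure. The following is only a sketch under stated assumptions: the class and function names are hypothetical, node weight is approximated by an occurrence count, and an edge is drawn from each entity to the next entity in the same sentence. The description does not fix either weighting rule, so both are placeholders.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    """Directed graph with weighted entity nodes and weighted relation edges."""
    node_weight: Counter = field(default_factory=Counter)
    node_type: dict = field(default_factory=dict)  # entity -> "NE" | "VE" | "RE"
    edge_weight: dict = field(default_factory=lambda: defaultdict(float))

    def add_entity(self, name, kind):
        self.node_type[name] = kind
        # Assumption: node weight is the number of occurrences of the entity.
        self.node_weight[name] += 1

    def add_relation(self, src, dst, weight=1.0):
        # Repeated relations accumulate weight on the same directed edge.
        self.edge_weight[(src, dst)] += weight


def build_graph(sentences):
    """sentences: list of sentences, each a list of (entity, kind) in text order."""
    graph = SemanticGraph()
    for sentence in sentences:
        for name, kind in sentence:
            graph.add_entity(name, kind)
        # Assumption: adjacent entities in a sentence are semantically related.
        for (src, _), (dst, _) in zip(sentence, sentence[1:]):
            graph.add_relation(src, dst)
    return graph
```

A real system would replace the adjacency rule with relations produced by the syntactic parser and would fold the relation information into the node weights as the description requires.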

Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on Chinese research in word segmentation, the method uses the word segmentation, part-of-speech tagging, and named entity recognition functions of the ICTCLAS Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences, to segment the public opinion text and extract words with a length greater than two. After segmentation, stop words that do not help a computer understand the text are filtered out, and words tagged as nouns, verbs, nominal adjectives, and verbal adjectives are retained, giving a candidate feature-word set; this reduces the size of the index, improves retrieval efficiency, and raises accuracy. A forward index and an inverted index are built over the segmented documents to support interactive user queries. After segmentation, part-of-speech tagging, and stop-word removal, the feature semantic network graph of the text is built, term frequency and document frequency are counted, and weighting and feature extraction follow.

In step c, after the text set has been preprocessed (Chinese word segmentation, stop-word filtering, and analysis of structured tag information), the public opinion information mining module takes the text data set produced by the extraction module and, using the text semantic feature description structure built from the text feature semantic network graph, computes the semantic similarity between texts with a similarity measure, builds a similarity matrix, and produces clusters with the improved text clustering method based on semantic similarity. From the clustering results it generates category description information and selects the text information contained in the results. It computes category features with the statistics-based TF-IDF term-weighting method to obtain candidate category feature words, selects nouns as candidates, ranks them by weight, determines category keywords from the weight values, and uses the semantic relations among the category keywords to form the classification result. The mining results are built into a knowledge base, which can also be configured to support text mining functions such as public opinion topic discovery and sentiment orientation analysis.
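The keyword-selection part of step c (TF-IDF weighting, nouns only, ranking by weight) can be sketched as follows. This is an illustration rather than the patented computation: the function name, the input layout, the rule that a noun is any tag starting with "n", and the plain `tf * log(N / df)` formula computed over clusters are all assumptions.

```python
import math
from collections import Counter


def category_keywords(clusters, top_k=3):
    """clusters: {label: [[(word, pos), ...], ...]} -> {label: [keyword, ...]}."""
    # Keep nouns only. Assumption: part-of-speech tags starting with "n" are nouns.
    term_counts = {
        label: Counter(w for doc in docs for w, pos in doc if pos.startswith("n"))
        for label, docs in clusters.items()
    }
    n_clusters = len(term_counts)
    # Cluster frequency: in how many clusters each noun appears.
    cluster_freq = Counter(w for counts in term_counts.values() for w in counts)

    keywords = {}
    for label, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            w: (c / total) * math.log(n_clusters / cluster_freq[w])
            for w, c in counts.items()
        }
        # Rank by descending weight; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_k]
    return keywords
```

Treating each cluster as one "document" means a noun that occurs in every cluster receives weight zero, which matches the intent of picking words that distinguish one category from the others.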

In step d, the public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes evaluation indicators such as topic attention, content sensitivity, degree of propagation and diffusion, and publication influence, helping the relevant departments follow public opinion trends in time, release public opinion information at the right moment, and make sound decisions.

Compared with the prior art, the invention has the following beneficial effects:

1. Network public opinion information today is massive, dynamic, incomplete, and varied in form, yet existing analysis methods often ignore the relations within the content of public opinion texts, which makes their results inaccurate. The invention builds a text feature semantic network graph model of each public opinion text, introducing semantic associations between words and links to their context into the text description structure; combined with the improved text clustering algorithm based on semantic similarity, it mines content in public opinion texts that is semantically related through context.

2. Building the text feature semantic network graph turns the contextual relations between words in a public opinion text into a directed graph of feature items and weights. This preserves the contextual structure of the words while reinforcing their contextual meaning, describes the implicit semantic information and topic features of the text better, and addresses the problem of missing word-level semantic information.

3. The improved text clustering algorithm based on semantic similarity suits cluster analysis of dynamic data and discovery of hot public opinion topics in a large-scale network environment. By computing semantic similarity between texts and building a text semantic similarity matrix, it mines contextually related content in depth and detects and tracks new topic events in time. Each class is represented by multiple centers, and the maximum similarity between a text and any center of a class is taken as the text's similarity to that class; this improves the system's running efficiency, and the benefit becomes more pronounced as the number of texts grows.
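The multi-center rule in item 3 (a text's similarity to a class is the maximum of its similarities to that class's centers) can be illustrated with a small incremental clustering loop. Everything beyond that rule is an assumption made for the sketch: cosine similarity over term-weight dictionaries stands in for the semantic similarity measure, the single-pass assignment with a fixed threshold is a common choice for streaming texts rather than the method claimed here, and the policy for adding centers is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_similarity(text, centers, sim=cosine):
    """Similarity of a text to a class: the maximum over the class's centers."""
    return max(sim(text, center) for center in centers)


def single_pass(texts, threshold=0.5, max_centers=3, sim=cosine):
    """Assign each text to the most similar class, or open a new class."""
    clusters = []  # each cluster: {"centers": [...], "members": [text indices]}
    for index, text in enumerate(texts):
        best, best_score = None, threshold
        for cluster in clusters:
            score = cluster_similarity(text, cluster["centers"], sim)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"centers": [text], "members": [index]})
        else:
            best["members"].append(index)
            # Assumption: early members become additional centers up to a cap.
            if len(best["centers"]) < max_centers:
                best["centers"].append(text)
    return clusters
```

Because new texts are compared only with a bounded number of centers per class rather than with every stored text, the cost per incoming text grows with the number of classes, not with the number of texts.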

Brief description of the drawings

Fig. 1 is a flow chart of the network public opinion information analysis method based on text semantic correlation according to an embodiment of the invention.

Detailed description

The invention is further described below with reference to the drawing and a specific embodiment; embodiments of the invention are not limited to this one.

As shown in Fig. 1, the method uses a network public opinion information analysis system comprising a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module, and a public opinion information database. The processing flow is:

(1) Public opinion information collection

Network public opinion information sources are collected. Unlike a general-purpose web crawler, the collector must not only crawl web pages but also format their content and extract the useful public opinion information, such as the topic and content; the resulting data are saved as txt or html files and written to the raw public opinion information database. The specific steps are as follows. Following a preset collection strategy and starting from the URLs of multiple seed pages, the module sends HTTP requests (using the GET method) through the relevant ports, and the remote server returns an HTML document according to the request. The collection module gathers all the information in the returned document, saves it first to a cache and then to the database, and obtains the text information in the page. While doing so it keeps extracting newly found hyperlink URLs from the current page to visit, discarding URLs that have already been visited, and repeats until the page text matching the search strategy has been collected and the queue of unvisited URLs is empty. The collected page text is stored in the database by field for use by the public opinion information extraction module.
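The queue-driven collection loop described above can be sketched with the Python standard library. The names `crawl`, `fetch`, `accept`, and `limit` are hypothetical; `fetch` is passed in so that the anti-blocking measures and the actual HTTP GET can be supplied separately, and `limit` is an added safety bound that the description does not mention.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, accept=lambda url, html: True, limit=100):
    """Breadth-first crawl from seed URLs until the queue of unvisited URLs is empty.

    fetch(url) returns the page HTML as a string, or None on failure.
    accept(url, html) decides whether a page matches the collection strategy.
    """
    queue, visited, pages = deque(seeds), set(), {}
    while queue and len(visited) < limit:
        url = queue.popleft()
        if url in visited:
            continue  # discard URLs that were already visited
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        if accept(url, html):
            pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links against the page URL
            if link not in visited:
                queue.append(link)
    return pages
```

In practice `fetch` could wrap `urllib.request.urlopen` with the rate limiting and login handling the module requires, and the returned pages would be written to the database by field rather than held in a dictionary.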

The network public opinion information collection module typically uses an anti-blocking strategy that combines time-shared access, periodic IP address changes, and single sign-on through a simulated browser. Many sites, such as forums, blogs, and microblogs, can be accessed only after user login, and simulating a browser is the easier approach here. The Web Browser control provided by Microsoft's .NET development tool Visual Studio 2008 exposes API calls to Microsoft's IE browser; SSO single sign-on is used to simulate submitting a user name and password. Once the login information has loaded, the page jumps to the corresponding URL, keywords are submitted as a search, and the source of the required page is obtained.

The collected web page text information has two parts: Web content information, and Web structure and usage-record information. Web content information covers textual content such as news headlines, body text, and comments; Web structure and usage-record information covers statistics such as click counts, page views, and comment counts.

(2) Public opinion information extraction

Collected pages contain noise such as advertisements, navigation information, images, and copyright notices, whereas public opinion analysis needs only the meta-information of the body text. This irrelevant content is removed and the useful body-text meta-information is extracted to serve subsequent mining and analysis. The procedure is as follows:

(2-1) First, the Tidy tool is used to normalize the HTML markup of the page; then the html parser tool builds an HTML tree with the HTML tags as nodes. This representation makes the HTML code easier to manage and manipulate and better suited to structured mining.

(2-2) Information such as title, keywords, body text, length, update time, and URL is extracted from the collected sources. The title is taken from the text between the <TITLE> and </TITLE> tags; keywords are contained in the META tags in the head of the HTML file and are extracted from them; time information is extracted through pattern matching and page analysis.

(2-3) Body text extraction proceeds as follows: choose appropriate keywords and obtain the URLs of relevant pages; fetch the HTML source of each page from the server at that URL; delete useless markup lines from the source, keeping the main content; replace paragraph markers in the HTML (such as </p> and <br>) with special symbols (such as *[/p]* and *[/br]*), replace carriage returns and line feeds with a line separator, and store the content line by line so that the page's formatting is preserved; extract the text between the HTML markers "<" and ">" on each line; replace the special symbols (such as *[/p]* and *[/br]*) with carriage returns to restore the original paragraphs; remove HTML escape sequences (such as &quot and &lt) from the resulting string; and, with the help of regular expressions, match and extract the final body text.
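Steps (2-2) and (2-3) can be sketched with the standard library alone. The names `MetaParser` and `extract_body` are hypothetical, a single placeholder stands in for both special symbols, the keyword list is assumed to be comma-separated, and escape sequences are decoded with `html.unescape` rather than deleted, which is an interpretation of the step and not something the text specifies.

```python
import html
import re
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects the <title> text and the content of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.title, self.keywords, self._in_title = "", [], False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Assumption: keywords are separated by commas.
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


PARAGRAPH_MARK = "*[/p]*"  # placeholder that survives tag stripping


def extract_body(source):
    """Reduce an HTML page to plain text, one line per original paragraph."""
    # Drop the head, scripts, and styles, which never belong to the body text.
    text = re.sub(r"(?is)<(script|style|head)\b.*?</\1\s*>", "", source)
    # Mark paragraph boundaries before the tags are removed.
    text = re.sub(r"(?i)</p\s*>|<br\s*/?>", PARAGRAPH_MARK, text)
    # Line breaks in the source carry no meaning; collapse them to spaces.
    text = re.sub(r"[\r\n]+", " ", text)
    # Strip all remaining tags, then decode entities such as &quot; and &lt;.
    text = html.unescape(re.sub(r"<[^>]+>", "", text))
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in text.split(PARAGRAPH_MARK)]
    return "\n".join(p for p in paragraphs if p)
```

Regular expressions are adequate for this kind of coarse cleanup on normalized markup; for pages that have not been passed through a normalizer such as Tidy, a tree-based parser is the safer choice.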

After title, keywords, body text, length, update time, URL, and related information have been extracted from the collected sources, the public opinion information extraction module also reconstructs the text.

Text reconstruction analyzes the forms in which public opinion information appears (online news, forum posts, microblog posts) and the structural features of the text, grouping the information that represents the topic into a "subject block" and the rest into a "content block", which improves clustering.

For online news, the headline and first paragraph form the "subject block"; the remaining descriptive text and the comments form the "content block".

For forum posts, the title and the original post form the "subject block". Replies and follow-ups are cleaned by removing posts that contain no Chinese characters and posts that use common evaluative words, and a number of the remaining posts are selected to form the "content block".
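The two reconstruction rules above can be written down as a short sketch. The function names, the sample set of generic replies, the reading of "uses common evaluative words" as "consists only of such a word", and the choice of keeping the first `max_replies` surviving replies are all assumptions made for illustration.

```python
import re

HAS_CHINESE = re.compile(r"[\u4e00-\u9fff]")
# Hypothetical sample of generic evaluative replies that carry no topic content.
GENERIC_REPLIES = {"顶", "好帖", "支持", "沙发"}


def rebuild_news(title, paragraphs, comments=()):
    """Subject block: headline and first paragraph. Content block: the rest."""
    return {
        "subject": [title] + paragraphs[:1],
        "content": paragraphs[1:] + list(comments),
    }


def rebuild_forum(title, main_post, replies, max_replies=20):
    """Subject block: title and original post. Content block: cleaned replies."""
    kept = [
        reply for reply in replies
        if HAS_CHINESE.search(reply) and reply.strip() not in GENERIC_REPLIES
    ]
    # Assumption: "a number of posts" means the first max_replies that survive.
    return {"subject": [title, main_post], "content": kept[:max_replies]}
```

Keeping the two blocks separate lets later stages weight subject-block terms more heavily than content-block terms if the clustering benefits from it.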

(3) Public opinion information preprocessing

After extraction, the public opinion information undergoes preprocessing: Chinese word segmentation, named entity recognition, part-of-speech tagging, syntactic parsing, and feature-word extraction, with the results saved to the database. Text analysis of network public opinion information, such as text mining and natural language processing, must begin with word segmentation. Drawing on Chinese research in word segmentation, the method uses ICTCLAS, the Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences, for segmentation and part-of-speech tagging, and extracts words with a length greater than two. ICTCLAS provides word segmentation, part-of-speech tagging, and new-word recognition for Chinese text; it performs named entity recognition with a role model; and it lets users define custom dictionaries as needed, giving high segmentation accuracy and good segmentation results. Its implementation code is as follows:

After segmentation, stop words that do not help a computer understand the text are filtered out, and words tagged as nouns, verbs, nominal adjectives, and verbal adjectives are retained, giving a candidate feature-word set. This avoids cluttered text, reduces the size of the index, improves retrieval efficiency, and raises retrieval accuracy.

A forward index and an inverted index are built over the segmented text to support interactive user queries. For the forward index, the words are sorted by word frequency and the top N words are chosen to represent the text; the index is stored as a hash table of the form <file name, keyword group>. Once the forward index exists, each keyword is searched for across the texts to find every file name that contains it; these file names form a file name group, which yields the inverted index, stored as a hash table of the form <keyword, file name group>.
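
The two hash tables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_forward_index` and `build_inverted_index`, the sample file names, and the choice to derive the inverted index from the keyword groups of the forward index are hypothetical and are not taken from the patent's implementation.

```python
from collections import Counter, defaultdict


def build_forward_index(docs, top_n):
    """Map each file name to its top_n most frequent words (the keyword group)."""
    forward = {}
    for file_name, words in docs.items():
        counts = Counter(words)
        forward[file_name] = [word for word, _ in counts.most_common(top_n)]
    return forward


def build_inverted_index(forward):
    """Map each keyword to the file names whose keyword group contains it.

    Assumption: the inverted index is derived from the forward index only.
    """
    inverted = defaultdict(list)
    for file_name in sorted(forward):
        for word in forward[file_name]:
            inverted[word].append(file_name)
    return dict(inverted)


# Hypothetical, already segmented documents.
docs = {
    "news1.txt": ["flood", "flood", "flood", "rescue", "rescue", "city"],
    "post7.txt": ["rescue", "rescue", "donation"],
}
forward = build_forward_index(docs, top_n=2)
inverted = build_inverted_index(forward)
```

With N = 2, the word "city" is dropped from the forward index of news1.txt, so it does not appear in the inverted index either; a query for "rescue" returns both files.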

Index construction and index retrieval are implemented on Lucene, the Apache open-source project, which provides a complete query engine, indexing engine, and text analysis engine. Hadoop is used to store and manage the massive index files.

The index is built as follows:

1. Create the index writer object, IndexWriter. A lexical analyzer must be supplied when the object is created, and different analyzers use different lexicons. ThesaurusAnalyzer is chosen because it can extract content summaries;

2. Create a Document object for each result set retrieved from the database;

3. Create a Field object for each data element in the result set and add it to the Document object;

4. Write the Document object.

Index retrieval proceeds as follows: first create a query parser, which takes parameters such as the Field name and the corresponding lexical analyzer; then obtain a query object from the query parser and the keywords; finally use the query object to obtain the retrieval result set, which consists of Document objects.

After segmentation, part-of-speech tagging, and stop-word removal, the feature semantic network graph of the text is built, statistics such as word frequency and document frequency are collected, and weighting and feature extraction are then performed.

The text feature semantic network graph is a directed graph that expresses public opinion information through entities and their semantic relationships. The entities E contained in the text (thing entities NE, event entities VE, and event relationship entities RE) are the nodes of the graph, and the semantic relationship between two entities is a directed edge. The semantic relationships between entities, combined with word frequency information, give the node weights; the weight of a directed edge indicates how important the entity relationship is in the text. The graph is constructed by introducing node weights and by concept-based merging and simplification, so as to extract the core semantics of the text: the words represented by nodes are merged and their node weights added, then the directed edges are merged and their weights added. The resulting graph describes the semantic information and topic features of the text. The concepts are defined as follows:

C1: A thing entity NE is defined as NE(id, concept, property, power), where id is the entity identifier, concept is the entity concept, property is the entity attribute, and power is the weight.

C2: An event entity VE is defined as VE(id, concept, property, power, isN, subT, objT1, objT2). In addition to the data items shared with NE, isN indicates whether the event is negated, subT is the header of the subject entity, and objT1 and objT2 are the headers of object entities 1 and 2.

C3: An event relationship entity RE is defined as RE(id, concept, property, power, isN, subT, objT). An RE is fully described by a single subject-object pair.
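
As an illustration, the three entity records C1 to C3 could be written as Python dataclasses. The field names follow the definitions above; the field types, the default values, and the idea of storing subT, objT1, objT2, and objT as ids of entries in the entity information table are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NE:
    """Thing entity: NE(id, concept, property, power)."""
    id: int
    concept: str
    property: str
    power: float


@dataclass
class VE(NE):
    """Event entity: the NE items plus negation flag, subject, and two objects."""
    isN: bool = False
    subT: Optional[int] = None   # assumed: id of the subject entity
    objT1: Optional[int] = None  # assumed: id of object entity 1
    objT2: Optional[int] = None  # assumed: id of object entity 2


@dataclass
class RE(NE):
    """Event relationship entity: one subject-object pair."""
    isN: bool = False
    subT: Optional[int] = None   # assumed: id of the subject entity
    objT: Optional[int] = None   # assumed: id of the object entity


# Hypothetical records for a sentence such as "the rescue team reached the city".
team = NE(id=1, concept="rescue team", property="organization", power=1.0)
city = NE(id=2, concept="city", property="place", power=1.0)
reach = VE(id=3, concept="reach", property="action", power=1.0,
           subT=team.id, objT1=city.id)
```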

The text feature semantic network graph model is analyzed in the following steps:

S1: The text is analyzed sentence by sentence, and a feature semantic network graph is built for each sentence. For each sentence, determine which NEs it produces and record each NE and its attribute information in the entity information table.

S2: After the NEs have been analyzed, analyze the VEs and register the concept, attributes, subject, and object of each VE. VEs with the same subject and object are treated as the same VE; otherwise they receive different ids.

S3: Next, analyze the REs, taking care to distinguish them from NEs and VEs, and register the concept, attributes, subject, and object of each RE in the entity information table.

S4: When the analysis is complete, the entity information table of the sentence is obtained. The table describes the relationships between entities and is used to construct the entity relationship graph, in which different kinds of connecting lines visualize the relationships between NE and VE, between RE and NE or VE, and between an entity E and its attributes T.

S5: Starting from the feature semantic network graph built for the first sentence, merge in the graphs of the subsequent sentences, first merging nodes and then merging directed edges.

S6: When merging nodes, nodes whose words are identical or whose semantic similarity satisfies the threshold condition are merged and their weights are added; otherwise the node is kept as a separate node.

S7: Directed edge merging combines the directed edges that exist between the merged nodes and adds their weights.

S8: Update the weights of the edges adjacent to each newly merged node to the weight of that node, strengthening the semantic relationships between nodes.

S9: Once the graphs of all sentences have been merged and output, the feature semantic network graph of the whole text is complete.
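
Steps S5 to S8 can be sketched as a merge of two small weighted graphs. This is a simplified illustration, not the patent's implementation: the dictionary layout, the function name `merge_graphs`, the optional `similar` callback, and the default threshold of 0.8 are hypothetical. Where an edge touches two merged nodes, the sketch assumes the larger node weight is used in S8, a case the description leaves open.

```python
def merge_graphs(base, other, similar=None, threshold=0.8):
    """Merge `other` into a copy of `base`; graphs are {"nodes": {...}, "edges": {...}}."""
    nodes = dict(base["nodes"])   # word -> node weight
    edges = dict(base["edges"])   # (source word, target word) -> edge weight
    mapping = {}                  # node name in `other` -> node name after merging
    merged = set()                # nodes that absorbed a node from `other`

    # S6: merge identical or sufficiently similar nodes, adding their weights.
    for word, weight in other["nodes"].items():
        target = word if word in nodes else None
        if target is None and similar is not None:
            target = next(
                (known for known in nodes if similar(word, known) >= threshold),
                None,
            )
        if target is None:
            nodes[word] = weight          # no match: keep as a separate node
            mapping[word] = word
        else:
            nodes[target] += weight
            mapping[word] = target
            merged.add(target)

    # S7: merge directed edges between the (renamed) nodes, adding their weights.
    for (src, dst), weight in other["edges"].items():
        key = (mapping[src], mapping[dst])
        edges[key] = edges.get(key, 0.0) + weight

    # S8: edges adjacent to a newly merged node take that node's weight.
    for key in list(edges):
        touched = [nodes[n] for n in key if n in merged]
        if touched:
            edges[key] = max(touched)     # assumption for edges with two merged ends

    return {"nodes": nodes, "edges": edges}


first = {"nodes": {"flood": 2.0, "city": 1.0}, "edges": {("flood", "city"): 1.0}}
second = {"nodes": {"flood": 1.0, "rescue": 1.0}, "edges": {("rescue", "flood"): 1.0}}
text_graph = merge_graphs(first, second)
```

Here the node "flood" occurs in both sentence graphs, so its weight becomes 3.0 and both edges adjacent to it are updated to that weight.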

The next step assigns weights to part-of-speech features so that the text is marked accurately. Based on the characteristics of Chinese parts of speech and the elements of a complete event description (time, place, person, and event content), and with reference to the Chinese part-of-speech tag set of the Chinese Academy of Sciences, text feature weights are assigned as follows: 3 for the title, 2 for subtitles and keywords, 1.5 for the abstract, and 1.3 for the first and last sentences of a paragraph.

After preprocessing, the title, body, and replies of each text are given different tags. When weights are computed, the tag of each keyword is read and used to assign the word's position weight.
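
The position weights listed above can be applied with a simple lookup. The sketch below is illustrative only: the tag names, the default weight of 1.0 for untagged body text, and the choice to sum the weights over all occurrences of a word are assumptions and are not stated in the description.

```python
# Position weights taken from the description; the tag names are hypothetical.
POSITION_WEIGHTS = {
    "title": 3.0,
    "subtitle": 2.0,
    "keyword": 2.0,
    "abstract": 1.5,
    "paragraph_first": 1.3,
    "paragraph_last": 1.3,
}
DEFAULT_WEIGHT = 1.0  # assumption: ordinary body text counts once


def weighted_frequency(tagged_words):
    """Sum the position weight of every occurrence of each word."""
    scores = {}
    for word, tag in tagged_words:
        weight = POSITION_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        scores[word] = scores.get(word, 0.0) + weight
    return scores


scores = weighted_frequency([
    ("flood", "title"),
    ("flood", "body"),
    ("rescue", "abstract"),
])
```

In this example "flood" appears once in the title and once in the body, giving 3 + 1 = 4, while "rescue" appears once in the abstract and scores 1.5.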

(4) Public opinion information mining

The public opinion information mining module operates on the text data set produced by the information extraction module after the text collection has been preprocessed (Chinese word segmentation, stop-word filtering, and analysis of structured tag information). Using the semantic feature description structure built from the text feature semantic network graph, it computes the semantic similarity between texts with the similarity evaluation method, builds a similarity matrix, and generates clustering results with the improved text clustering analysis method based on semantic similarity. Category description information is generated from the clustering results, and the text information contained in the clustering results is extracted. Category features are then counted with the TFIDF word-frequency feature calculation method based on feature statistics to obtain candidate category feature words: nouns are selected as candidates, the candidates are sorted by weight, and category keywords are chosen according to the weight values. The semantic relationships between category keywords are used to form the classification results. The mining results are used to build a knowledge base, which can also be configured to support text mining functions such as public opinion topic discovery and public opinion tendency analysis at the same time.

First, the similarity between texts is defined and computed. It is the degree to which the topics discussed in the texts are related, and $Sim(D_1, D_2)$ denotes the similarity between text $D_1$ and text $D_2$. The similarity takes values between 0 and 1 and increases with how similar $D_1$ and $D_2$ are: the greater the similarity, the more closely related the topics of the texts. The semantic similarity between texts is evaluated as follows:

Let the texts obtained after the public opinion information extraction and preprocessing of step b be $D_1(t_{11}, t_{12}, t_{13}, \ldots, t_{1m})$ and $D_2(t_{21}, t_{22}, t_{23}, \ldots, t_{2m})$. Compute the similarity between every keyword $t_{1i}$ of text $D_1$ and every keyword $t_{2j}$ of text $D_2$ to form the following similarity matrix:

$$M(D_1, D_2) = \begin{bmatrix} Sim_{11} & \cdots & Sim_{1m} \\ \vdots & \ddots & \vdots \\ Sim_{m1} & \cdots & Sim_{mm} \end{bmatrix}$$

$Sim_{ij}$ ($1 \le i, j \le m$) is the similarity between keyword $t_{1i}$ of text $D_1$ and keyword $t_{2j}$ of text $D_2$; $M(D_1, D_2)$ is the similarity matrix between text $D_1$ and text $D_2$; $i$ indexes the keywords of text $D_1$ and $j$ indexes the keywords of text $D_2$, each text having $m$ keywords;

The word similarity is computed as $S(T_1, T_2) = \max_{i=1,\ldots,n;\ j=1,\ldots,m} S(y_{1i}, y_{2j})$; that is, the similarity of two words is the maximum similarity over all pairs of their senses (a word may carry several senses).

Traverse the similarity matrix M, find the keyword pair with the largest similarity value Sim, and delete the corresponding row and column. Then traverse M again to find the remaining pair with the largest similarity, and repeat until M is a zero matrix. Finally, the semantic similarity of texts $D_1$ and $D_2$ is computed from the resulting sequence of maximum-similarity keyword pairs:

$$Sim(D_1, D_2) = \frac{1}{m} \sum_{k=1}^{m} Sim_{k\text{-}max}(t_{1i}, t_{2j})$$

Here max denotes the maximum value of the similarity Sim; $i$ is the keyword index in text $D_1$; $j$ is the keyword index in text $D_2$.
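
The greedy pairing procedure above can be sketched as follows. The sketch assumes a square m × m matrix of precomputed keyword similarities and stops when every row and column has been used, which stands in for the "zero matrix" stopping condition; the function name `text_similarity` is hypothetical.

```python
def text_similarity(matrix):
    """Greedy text similarity: repeatedly take the largest remaining entry,
    drop its row and column, and average the m selected values."""
    m = len(matrix)
    if m == 0:
        return 0.0
    rows = list(range(m))
    cols = list(range(m))
    total = 0.0
    while rows:
        # Largest similarity among the rows and columns not yet deleted.
        i, j = max(
            ((r, c) for r in rows for c in cols),
            key=lambda rc: matrix[rc[0]][rc[1]],
        )
        total += matrix[i][j]
        rows.remove(i)
        cols.remove(j)
    return total / m


# Hypothetical keyword similarities between two texts with two keywords each.
M = [
    [0.5, 0.9],
    [0.4, 0.8],
]
sim_d1_d2 = text_similarity(M)
```

For this matrix the first pass selects 0.9 (row 1, column 2); deleting that row and column leaves only 0.4, so the result is (0.9 + 0.4) / 2 = 0.65. The selection is greedy, so it does not necessarily find the pairing with the largest total.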

The improved text clustering analysis method based on semantic similarity is described as follows:

1. After all collected texts have been preprocessed, the TFIDF weighting method is used to weight the features of all category keywords, and the m best feature keywords are extracted to form the original keyword-based feature vector Di*.

2. The keywords in the original keyword-based feature vector Di* are preprocessed against the knowledge base: vocabulary in the knowledge base that matches a keyword is found and substituted for it, forming a new feature vector $D_i = (T_1, T_2, \ldots, T_i)$, $i = 1, 2, 3, \ldots, m$.

3. Form the m feature vectors $D_i$ of the n texts, use the text semantic similarity formula to compute the semantic similarity between the collected texts, form the similarity matrix M of the text set, and compute the average similarity MA of all feature vectors. The formulas are as follows:

$$M = \begin{bmatrix} S_{11} & \cdots & S_{1n} \\ \vdots & \ddots & \vdots \\ S_{n1} & \cdots & S_{nn} \end{bmatrix}, \qquad MA = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} S_{ij} - n \sum_{i=1}^{n} S_{ii}}{n (n - 1)}$$

where n is the number of texts;

4. Set three similarity thresholds: a duplication threshold of 0.9, a topic center threshold of 0.5, and a new topic threshold of 0.3;

5. Compare each text with the central topic. If the similarity between the text and the initial center of the central topic is greater than the duplication threshold 0.9, the text is considered to have the same topic and the same content; if the similarity is less than the new topic threshold 0.3, a new class must be created for the text; if the similarity is in the range 0 to 0.5, the text is a core content text that discusses a different aspect of the same topic and is marked as the second center. Proceeding in this way yields a hierarchical clustering result with multiple centers.

6. For the multi-center topic representation, the maximum of the similarities between the text and each center in a class is taken as the similarity between the text and that class.
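
Steps 4 to 6 can be illustrated with a single-pass sketch. It rests on several assumptions that the description does not settle: because the "0 to 0.5" range overlaps the new-topic condition, the sketch treats a similarity from 0.3 to 0.5 as "additional center"; a similarity above 0.5 and up to 0.9 is treated as an ordinary member of the class; and a text is compared with the best-matching existing class. The function name `cluster`, the dictionary layout, and the lookup-table similarity are hypothetical.

```python
DUPLICATE, CENTER, NEW_TOPIC = 0.9, 0.5, 0.3  # thresholds from step 4


def cluster(texts, sim):
    """Single-pass multi-center clustering; `sim(a, b)` returns a value in [0, 1]."""
    clusters = []
    for text in texts:
        best, best_sim = None, 0.0
        for c in clusters:
            # Step 6: similarity to a class is the maximum over its centers.
            s = max(sim(text, center) for center in c["centers"])
            if best is None or s > best_sim:
                best, best_sim = c, s
        if best is None or best_sim < NEW_TOPIC:
            # Step 5: too dissimilar to every class, so start a new class.
            clusters.append({"centers": [text], "members": [text], "duplicates": []})
            continue
        if best_sim > DUPLICATE:
            best["duplicates"].append(text)   # same topic, same content
        elif best_sim <= CENTER:
            best["centers"].append(text)      # assumed range 0.3 to 0.5: new center
        best["members"].append(text)
    return clusters


# Hypothetical pairwise similarities between four texts.
pairs = {("a", "b"): 0.95, ("a", "c"): 0.4, ("a", "d"): 0.1, ("c", "d"): 0.2}
lookup = lambda x, y: pairs.get((x, y), pairs.get((y, x), 0.0))
result = cluster(["a", "b", "c", "d"], lookup)
```

In this run, "b" is a duplicate of "a", "c" becomes a second center of the first class, and "d" is below 0.3 against both centers and therefore opens a second class.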

The improved text clustering algorithm based on semantic similarity is suited to clustering dynamic data and discovering hot public opinion topics in a large-scale network environment. It detects new events promptly and detects and tracks new public opinion topics. Representing a public opinion topic by multiple centers within a class effectively improves the system's operating efficiency, and the benefit becomes more pronounced as the number of texts grows.

(5) Analysis of public opinion information

The public opinion information analysis module performs OLAP multidimensional statistical analysis on the data mined in step c and stored in the public opinion information database. It analyzes public opinion evaluation indicators such as the attention paid to the content of a public opinion topic and the sentiment tendency of the topic, helping the relevant departments keep track of public opinion developments, release public opinion information at the right time, and make sound decisions.

The public opinion topic produced by collection, processing, and mining analysis is expressed as $T = (T_1, T_2, \ldots, T_n)$, where $T_i$ denotes a text of the public opinion topic. The attention of a public opinion topic text is expressed as $T_i = \alpha N_p + \beta N_r$. In the attention measure formula for the public opinion topic, $\alpha$ and $\beta$ are weights, $N_p$ is the number of clicks on a topic text, and $N_r$ is the number of comments; Np_i is the number of clicks on the i-th topic text, and Nr_i is the number of comments on the i-th topic text. Because $N_p > N_r$, statistics give $\alpha = 0.02$ and $\beta = 0.98$.
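
The per-text attention value $T_i = \alpha N_p + \beta N_r$ is a weighted sum and can be computed directly. The sketch below covers only this per-text value, with the weights given above; the function name and the sample counts are hypothetical, and the topic-level measure is not implemented because its formula is not reproduced in the text.

```python
ALPHA, BETA = 0.02, 0.98  # weights for clicks and comments given in the description


def text_attention(clicks, comments, alpha=ALPHA, beta=BETA):
    """Attention of one topic text: alpha * clicks + beta * comments."""
    return alpha * clicks + beta * comments


# Hypothetical (clicks, comments) counts for three texts of one topic.
counts = {"text_1": (1000, 50), "text_2": (5000, 10), "text_3": (200, 80)}
attention = {name: text_attention(np_, nr) for name, (np_, nr) in counts.items()}
ranking = sorted(attention, key=attention.get, reverse=True)
```

With these counts the values are about 69.0, 109.8, and 82.4, so text_2 ranks first. The small weight on clicks keeps a heavily viewed but rarely discussed text from dominating by clicks alone.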

The sentiment tendency of a public opinion topic is described on the basis of the cluster analysis data of the topic texts. A threshold is set first; a text shows polarity (positive or negative) only when its tendency measure exceeds the threshold. If the tendency measure of the text is positive, the text is a positive comment; otherwise it is a negative comment.
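
The threshold rule can be written as a small function. How the tendency measure itself is computed is not specified here, so the sketch takes it as an input. It also assumes that the magnitude of the measure is compared with the threshold and that texts below the threshold are labeled "neutral"; both the label and the function name `polarity` are hypothetical.

```python
def polarity(score, threshold):
    """Classify a text by its tendency measure.

    Assumption: a text shows polarity only when |score| exceeds the threshold.
    """
    if abs(score) <= threshold:
        return "neutral"   # hypothetical label for texts without clear polarity
    return "positive" if score > 0 else "negative"


labels = [polarity(s, threshold=0.2) for s in (0.6, -0.5, 0.1)]
```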

After collection, preprocessing, information extraction, mining, and analysis, detailed data on public opinion topics is available. It is processed according to the established public opinion indicator evaluation system, and the results support decision-making.

Claims (7)

1.基于文本语义相关的网络舆情信息分析方法,其特征在于:采用包括网络舆情信息采集模块、舆情信息萃取模块、舆情信息预处理模块、舆情信息挖掘模块、舆情信息分析模块和包含舆情信息数据库的网络舆情信息分析系统,并包括如下步骤:1. A method for analyzing network public opinion information based on text semantic correlation, characterized in that: the method comprises a network public opinion information collection module, a public opinion information extraction module, a public opinion information preprocessing module, a public opinion information mining module, a public opinion information analysis module and a public opinion information database The network public opinion information analysis system includes the following steps: a.网络舆情信息采集模块从网页中采集各种舆情信息,并存储到舆情信息数据库中;a. The network public opinion information collection module collects various public opinion information from web pages and stores them in the public opinion information database; b.舆情信息萃取模块和舆情信息预处理模块将步骤a采集的舆情信息进行初步过滤和切分,抽取文本所包含的内容信息,为舆情信息挖掘提供数据服务;b. The public opinion information extraction module and the public opinion information preprocessing module preliminarily filter and segment the public opinion information collected in step a, extract the content information contained in the text, and provide data services for public opinion information mining; c.在步骤b基础上,舆情信息挖掘模块采用基于语义相似度的改进文本聚类分析方法,生成类别描述信息,筛选出聚类分析结果中包含的文本信息;利用基于特征统计的TFIDF词频特征计算方法统计类别特征,获取类别特征词,选择名词作为候选类别特征词,按照候选特征词权重排序,以权重值较大的候选特征词作为类别关键词,利用类别关键词之间的语义关系,形成分类结果;识别和建立新的网络舆情主题,检测、跟踪已有舆情主题的相关内容;c. 
On the basis of step b, the public opinion information mining module adopts an improved text clustering analysis method based on semantic similarity to generate category description information, and screen out the text information contained in the clustering analysis results; use TFIDF word frequency features based on feature statistics The calculation method counts category features, obtains category feature words, selects nouns as candidate category feature words, sorts according to the weight of candidate feature words, uses the candidate feature words with larger weight values as category keywords, and utilizes the semantic relationship between category keywords, Form classification results; identify and establish new online public opinion topics, detect and track relevant content of existing public opinion topics; d.最后,舆情信息分析模块把舆情信息经过步骤c挖掘的数据进行OLAP多维统计分析,分析舆情主题内容关注度、舆情主题情感倾向等舆情评测指标;d. Finally, the public opinion information analysis module performs OLAP multi-dimensional statistical analysis on the data mined by the public opinion information through step c, and analyzes the public opinion evaluation indicators such as the degree of attention to the content of the public opinion topic, the emotional tendency of the public opinion topic; 在步骤a中,所述舆情信息采集模块,是对网络舆情信息源进行采集,不仅要完成网页的爬取,而且要将网页内容进行格式化处理,提取舆情的主题和内容,所得数据存入txt格式或html格式文件,并存储到舆情信息数据库;网络舆情信息采集模块采用分时访问、定时更换IP地址和模拟浏览器进行单点登录三种技术结合进行防屏蔽。In step a, the public opinion information collection module is to collect network public opinion information sources, not only to complete the crawling of web pages, but also to format the content of the web pages, extract the theme and content of public opinion, and store the obtained data in txt or html format files, and store them in the public opinion information database; the network public opinion information collection module adopts time-sharing access, regular IP address replacement and simulated browser for single sign-on to prevent shielding. 
2.根据权利要求1所述的基于文本语义相关的网络舆情信息分析方法,其特征是,所述舆情信息采集模块执行的具体步骤为,从预先定义的主题相关网页的URL开始,获取网页中的文本信息,并从当前网页中抽取新的URL放入队列中,直到满足条件的舆情信息采集完毕,URL队列为空为止;将采集到的网页文本信息按照字段分类存储到舆情信息数据库中,提供舆情信息萃取模块调用。2. the network public opinion information analysis method based on text semantics correlation according to claim 1, it is characterized in that, the concrete step that described public opinion information collection module executes is, starts from the URL of the relevant webpage of pre-defined topic, obtains in the webpage text information, and extract new URLs from the current webpage and put them into the queue until the collection of public opinion information that satisfies the conditions is completed and the URL queue is empty; the collected webpage text information is stored in the public opinion information database according to the field classification, Provide public opinion information extraction module calls. 3.根据权利要求1所述的基于文本语义相关的网络舆情信息分析方法,其特征是,在步骤b中,所述舆情信息萃取模块,是清除网页中的无关内容,提取对舆情分析有用的正文部分的元信息,对文本进行重构,将具有主题代表性的信息聚集在一起;所述舆情信息预处理模块,是对采集的舆情信息源经过所述舆情信息萃取模块萃取后,进行中文分词处理、过滤停用词、命名实体识别、词性标注、语法解析和特征词提取,建立正序索引和倒排索引;建立文本特征语义网络图,以文本中包含的实体E作为图的节点,两个实体之间的语义关系作为图的有向边,实体之间的语义关系结合词频信息作为节点的权重,有向边的权重表示实体关系在文本中的重要程度,所述实体E包括事物实体NE、事件实体VE、事件关系实体RE;统计文本的词频和文本频率信息,然后进行特征词抽取,选取体现文本特征的词表示该文本。3. 
the network public opinion information analysis method based on text semantic correlation according to claim 1, it is characterized in that, in step b, described public opinion information extraction module, is to remove the irrelevant content in the webpage, extracts useful to public opinion analysis The meta-information in the text part reconstructs the text and gathers the representative information of the theme together; the public opinion information preprocessing module extracts the collected public opinion information sources through the public opinion information extraction module, and performs Chinese Word segmentation processing, filtering stop words, named entity recognition, part-of-speech tagging, grammatical analysis and feature word extraction, establishment of forward index and inverted index; establishment of text feature semantic network graph, with entity E contained in the text as the node of the graph, The semantic relationship between two entities is used as the directed edge of the graph, and the semantic relationship between the entities is combined with the word frequency information as the weight of the node. The weight of the directed edge indicates the importance of the entity relationship in the text. The entity E includes things Entity NE, event entity VE, and event relationship entity RE; count the word frequency and text frequency information of the text, and then perform feature word extraction, and select words that reflect text features to represent the text. 4.根据权利要求3所述的基于文本语义相关的网络舆情信息分析方法,其特征是,在步骤c中,所述舆情信息挖掘模块,是在对文本集进行预处理,包括中文分词处理、停用词过滤和结构化标签信息分析后,将信息萃取模块生成的文本数据集,根据文本特征语义网络图构建的文本语义特征描述结构,利用相似度评价方法计算文本之间的语义相似度,构建相似度矩阵,采用基于语义相似度的改进文本聚类分析方法生成聚类结果;聚类分析结果生成类别描述信息,筛选出聚类分析结果中包含的文本信息;利用基于特征统计的TFIDF词频特征计算方法统计类别特征,获取候选类别特征词,选择名词作为候选类别特征词,按照候选特征词权重排序,以权重值确定候选特征词作为类别关键词,利用类别关键词之间的语义关系,形成分类结果;将挖掘结果构建知识库。4. 
the network public opinion information analysis method based on text semantic correlation according to claim 3, it is characterized in that, in step c, described public opinion information mining module is carrying out preprocessing to text collection, comprises Chinese word segmentation process, After stop word filtering and structured label information analysis, the text dataset generated by the information extraction module is used to calculate the semantic similarity between texts using the similarity evaluation method based on the text semantic feature description structure constructed by the text feature semantic network graph. Construct a similarity matrix, use the improved text clustering analysis method based on semantic similarity to generate clustering results; generate category description information from the clustering analysis results, and filter out the text information contained in the clustering analysis results; use TFIDF word frequency based on feature statistics The feature calculation method counts category features, obtains candidate category feature words, selects nouns as candidate category feature words, sorts candidate feature words according to their weight, determines candidate feature words as category keywords with weight values, and utilizes the semantic relationship between category keywords, Form classification results; build knowledge base with mining results. 5.根据权利要求3或4所述的基于文本语义相关的网络舆情信息分析方法,其特征是,文本特征语义网络图是利用实体及其语义关系来表达舆情信息的有向图,通过网络节点表示的词语合并,节点权值相加;再合并有向边,有向边权值相加,构建文本特征语义网络图,描述文本中的语义信息和主题特征。5. 
according to claim 3 or 4 described network public opinion information analysis methods based on text semantic correlation, it is characterized in that, text feature semantic network graph is the directed graph that utilizes entity and its semantic relation to express public opinion information, through network node The expressed words are merged, and the node weights are added; then the directed edges are merged, and the weights of the directed edges are added to construct a text feature semantic network graph to describe the semantic information and topic features in the text. 6.根据权利要求4所述的基于文本语义相关的网络舆情信息分析方法,其特征是,文本之间的语义相似度评价方法为:6. the network public opinion information analysis method based on text semantic correlation according to claim 4, it is characterized in that, the semantic similarity evaluation method between texts is: 设经过步骤b的舆情信息萃取和预处理后的文本为D1(t11,t12,t13,…,t1m),D2(t21,t22,t23,…,t2m),计算文本D1中所有关键词t1i与文本D2中所有关键词t2i的相似度,形成相似度矩阵如下:Let the text after public opinion information extraction and preprocessing in step b be D 1 (t 11 ,t 12 ,t 13 ,…,t 1m ), D 2 (t 21 ,t 22 ,t 23 ,…,t 2m ) , calculate the similarity between all keywords t 1i in text D 1 and all keywords t 2i in text D 2 , and form a similarity matrix as follows: Mm (( DD. 11 ,, DD. 
$$M(D_1, D_2) = \begin{pmatrix} Sim_{11} & \cdots & Sim_{1m} \\ \vdots & \ddots & \vdots \\ Sim_{m1} & \cdots & Sim_{mm} \end{pmatrix}$$

where $Sim_{ij}$ ($1 \le i, j \le m$) denotes the similarity between keyword $t_{1i}$ of text $D_1$ and keyword $t_{2j}$ of text $D_2$; $M(D_1, D_2)$ denotes the similarity matrix between text $D_1$ and text $D_2$; i is the number of keywords in text $D_1$; m is the number of keywords in text $D_2$;

the word similarity is calculated as

$$S(T_1, T_2) = \max_{i=1,2,\dots,n;\ j=1,2,\dots,m} S(y_{1i}, y_{2j}),$$

that is, the similarity of two words is the maximum of the similarities over all pairs of their senses, where a sense is one of the multiple meanings that a word can have;

the similarity matrix M is traversed to find the keyword pair with the largest similarity value Sim, and the corresponding row and column are deleted; the traversal of M is then repeated to find the next keyword pair with the largest Sim value, and this cycle continues until M is a zero matrix; finally, the resulting sequence of maximum-similarity keyword pairs is used to obtain the semantic similarity of texts $D_1$ and $D_2$, calculated as follows:

$$Sim(D_1, D_2) = \frac{1}{m} \sum_{k=1}^{m} Sim_{k\text{-}max}(t_{1i}, t_{2j})$$

where max denotes the maximum value of the similarity Sim; i is the number of keywords in text $D_1$; j is the number of keywords in text $D_2$.

7. The network public opinion information analysis method based on text semantic correlation according to claim 6, characterized in that the improved text clustering analysis method based on semantic similarity is:

1) first, after all collected texts have been preprocessed, the TF-IDF weighting method is used to weight the features of the keywords of all categories, and the m optimal feature keywords are extracted to form the original keyword-based feature vector $D_i^*$;

2) the keywords in the original keyword-based feature vector $D_i^*$ are preprocessed according to the knowledge base: the words in the knowledge base that match the keywords are found and substituted for them, forming a new feature vector $D_i$, $D_i = (T_1, T_2, \dots, T_i)$, $i = 1, 2, 3, \dots, m$;

3) the m feature vectors $D_i$ of the n texts are formed, the text semantic similarity formula is used to calculate the semantic similarity between the collected texts, forming the similarity matrix M of the text set, and the average similarity MA of all feature vectors is obtained; the calculation formula is as follows: [formula missing in the source text] where n is the number of texts;

4) three similarity thresholds are set: a repetition threshold of 0.9, a topic-center threshold of 0.5, and a new-topic threshold of 0.3;

5) the text is compared with the central topic: if the similarity between the text and the initial center of the central topic is greater than the repetition threshold 0.9, the text is considered to have the same topic and the same content; if the similarity is less than the new-topic threshold 0.3, a new class must be created for the text; if the similarity is in the range 0 to 0.5, the text is a core-content text discussing a different aspect of the same topic and is marked as the second center, and so on, forming a hierarchical clustering result with multiple centers;

6) for the multi-center topic representation, the maximum of the similarities between the text and each center within a class is taken as the similarity of the text to that class.
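The procedures recited in claims 6 and 7 can be illustrated with a minimal Python sketch. It is not part of the claims and is not the patentees' implementation. The function names (`word_similarity`, `greedy_text_similarity`, `classify_against_topic`) and the label strings are hypothetical. Two points are assumptions rather than statements of the claims: the keyword similarity matrix is treated as m × m and the sum is divided by its row count m, and, because the recited band "0 to 0.5" overlaps the new-topic threshold 0.3, the sketch reads the second-center band as 0.3 to 0.5 and treats values between 0.5 and 0.9 as ordinary members of the topic.

```python
from typing import Callable, Sequence

# Thresholds recited in step 4) of claim 7.
REPEAT_T, CENTER_T, NEW_T = 0.9, 0.5, 0.3


def word_similarity(senses_a: Sequence[str], senses_b: Sequence[str],
                    sense_sim: Callable[[str, str], float]) -> float:
    """Word similarity = maximum similarity over all sense pairs (claim 6)."""
    return max((sense_sim(a, b) for a in senses_a for b in senses_b),
               default=0.0)


def greedy_text_similarity(matrix: Sequence[Sequence[float]]) -> float:
    """Repeatedly take the largest remaining entry, drop its row and column,
    and average the selected values over m (assumed: m = number of rows)."""
    m = len(matrix)
    if m == 0:
        return 0.0
    free_rows = set(range(m))
    free_cols = set(range(len(matrix[0])))
    total = 0.0
    while free_rows and free_cols:
        # Largest similarity among rows/columns not yet deleted.
        best, bi, bj = max((matrix[i][j], i, j)
                           for i in free_rows for j in free_cols)
        if best <= 0.0:  # remaining matrix is all zeros: stop
            break
        total += best
        free_rows.remove(bi)
        free_cols.remove(bj)
    return total / m


def classify_against_topic(center_sims: Sequence[float]) -> str:
    """Apply the three thresholds to the similarity between a text and a topic.
    Step 6): with several centers, the maximum similarity is used."""
    sim = max(center_sims)
    if sim > REPEAT_T:
        return "duplicate"    # same topic, same content
    if sim < NEW_T:
        return "new_topic"    # a new class is created
    if sim <= CENTER_T:
        return "new_center"   # ASSUMPTION: band read as 0.3 to 0.5
    return "same_topic"       # ASSUMPTION: 0.5 to 0.9 is not recited


if __name__ == "__main__":
    keyword_sims = [[0.9, 0.2],
                    [0.3, 0.8]]
    print(greedy_text_similarity(keyword_sims))
    print(classify_against_topic([0.2, 0.7]))
```

The greedy selection mirrors the recited "find the maximum, delete its row and column, repeat" loop; it is not guaranteed to give the best overall one-to-one keyword pairing, which an assignment solver would.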
CN201310482522.5A 2013-10-15 2013-10-15 Text semantic relativity based network public opinion information analysis method Active CN103544255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310482522.5A CN103544255B (en) 2013-10-15 2013-10-15 Text semantic relativity based network public opinion information analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310482522.5A CN103544255B (en) 2013-10-15 2013-10-15 Text semantic relativity based network public opinion information analysis method

Publications (2)

Publication Number Publication Date
CN103544255A CN103544255A (en) 2014-01-29
CN103544255B true CN103544255B (en) 2017-01-11

Family

ID=49967707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310482522.5A Active CN103544255B (en) 2013-10-15 2013-10-15 Text semantic relativity based network public opinion information analysis method

Country Status (1)

Country Link
CN (1) CN103544255B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446409A (en) * 2018-09-19 2019-03-08 杭州安恒信息技术股份有限公司 A kind of recognition methods of the target object of doubtful multiple level marketing behavior

Families Citing this family (156)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902659B (en) * 2014-03-04 2017-06-27 深圳市至高通信技术发展有限公司 A kind of the analysis of public opinion method and corresponding device
CN103886051A (en) * 2014-03-13 2014-06-25 电子科技大学 Comment analysis method based on entities and features
CN103927545B (en) * 2014-03-14 2017-10-17 小米科技有限责任公司 Clustering method and relevant apparatus
CN104915359B (en) * 2014-03-14 2019-05-28 华为技术有限公司 Theme label recommended method and device
CN103902674B (en) * 2014-03-19 2017-10-27 百度在线网络技术(北京)有限公司 The acquisition method and device of the comment data of particular topic
CN103838886A (en) * 2014-03-31 2014-06-04 辽宁四维科技发展有限公司 Text content classification method based on representative word knowledge base
CN103841216A (en) * 2014-04-01 2014-06-04 深圳市科盾科技有限公司 Network public opinion monitoring system based on cloud platform
CN104199829B (en) * 2014-07-25 2017-07-04 中国科学院自动化研究所 Affection data sorting technique and system
CN104346425B (en) * 2014-07-28 2017-10-31 中国科学院计算技术研究所 A kind of method and system of the internet public feelings index system of stratification
CN104217718B (en) * 2014-09-03 2017-05-17 陈飞 Method and system for voice recognition based on environmental parameter and group trend data
JP6605022B2 (en) * 2014-09-03 2019-11-13 ザ ダン アンド ブラッドストリート コーポレーション Systems and processes for analyzing, selecting, and capturing sources of unstructured data by experience attributes
CN104268194A (en) * 2014-09-19 2015-01-07 国家电网公司 Method for dynamically generating public opinion brief report
CN105574047A (en) * 2014-10-17 2016-05-11 任子行网络技术股份有限公司 Website main page feature analysis based Chinese website sorting method and system
CN104504150B (en) * 2015-01-09 2017-09-29 成都布林特信息技术有限公司 News public sentiment monitoring system
CN105992194B (en) * 2015-01-30 2019-10-29 阿里巴巴集团控股有限公司 The acquisition methods and device of network data content
CN104699763B (en) * 2015-02-11 2017-10-17 中国科学院新疆理化技术研究所 The text similarity gauging system of multiple features fusion
CN106156041B (en) * 2015-03-26 2019-05-28 科大讯飞股份有限公司 Hot information finds method and system
CN106156192A (en) * 2015-04-21 2016-11-23 北大方正集团有限公司 Public sentiment data clustering method and public sentiment data clustering system
CN104820629B (en) * 2015-05-14 2018-01-30 中国电子科技集团公司第五十四研究所 A kind of intelligent public sentiment accident emergent treatment system and method
CN106294358A (en) * 2015-05-14 2017-01-04 北京大学 The search method of a kind of information and system
CN104899339A (en) * 2015-07-01 2015-09-09 北京奇虎科技有限公司 Method and system for classifying POI (Point of Interest) information
CN104915453A (en) * 2015-07-01 2015-09-16 北京奇虎科技有限公司 Method, device and system for classifying POI information
CN105183803A (en) * 2015-08-25 2015-12-23 天津大学 Personalized search method and search apparatus thereof in social network platform
CN105183478B (en) * 2015-09-11 2018-11-23 中山大学 A kind of webpage reconstructing method and its device based on color transfer
CN106528581B (en) * 2015-09-15 2019-05-07 阿里巴巴集团控股有限公司 Method for text detection and device
CN106649367B (en) * 2015-10-30 2020-03-03 北京国双科技有限公司 Method and device for detecting keyword popularization degree
EP3283984A4 (en) * 2015-11-03 2018-04-04 Hewlett-Packard Enterprise Development LP Relevance optimized representative content associated with a data storage system
CN105279277A (en) * 2015-11-12 2016-01-27 百度在线网络技术(北京)有限公司 Knowledge data processing method and device
CN105389389B (en) * 2015-12-10 2018-09-25 安徽博约信息科技股份有限公司 A kind of network public-opinion propagation situation medium control analysis method
CN105677802A (en) * 2015-12-31 2016-06-15 宁波公众信息产业有限公司 Internet information analysis system
CN105447202A (en) * 2015-12-31 2016-03-30 宁波公众信息产业有限公司 Internet information collecting system
CN105677873B (en) * 2016-01-11 2019-03-26 中国电子科技集团公司第十研究所 Text Intelligence association cluster based on model of the domain knowledge collects processing method
CN105740238B (en) * 2016-03-04 2019-02-01 北京理工大学 A kind of event relation intensity map construction method merging sentence justice information
CN105956070A (en) * 2016-04-28 2016-09-21 优品财富管理有限公司 Method and system for integrating repetitive records
CN105956069A (en) * 2016-04-28 2016-09-21 优品财富管理有限公司 Network information collection and analysis method and network information collection and analysis system
CN106126558B (en) * 2016-06-16 2019-09-20 东软集团股份有限公司 A kind of public sentiment monitoring method and device
CN106294542B (en) * 2016-07-25 2018-03-30 北京市信访矛盾分析研究中心 A kind of letters and calls data mining methods of marking and system
CN106294619A (en) * 2016-08-01 2017-01-04 上海交通大学 Public sentiment intelligent supervision method
EP3533066A1 (en) * 2016-10-25 2019-09-04 Koninklijke Philips N.V. Knowledge graph-based clinical diagnosis assistant
CN106651696B (en) * 2016-11-16 2020-10-27 福建天泉教育科技有限公司 Approximate question pushing method and system
CN106776724B (en) * 2016-11-16 2020-09-08 福建天泉教育科技有限公司 Question classification method and system
CN106599054B (en) * 2016-11-16 2019-12-24 福建天泉教育科技有限公司 Method and system for classifying and pushing questions
CN108090040B (en) * 2016-11-23 2021-08-17 北京国双科技有限公司 A text information classification method and system
CN107045524B (en) * 2016-12-30 2019-12-27 中央民族大学 Method and system for classifying network text public sentiments
CN107016068A (en) * 2017-03-21 2017-08-04 深圳前海乘方互联网金融服务有限公司 Knowledge mapping construction method and device
CN107918633B (en) * 2017-03-23 2021-07-02 广州思涵信息科技有限公司 Sensitive public opinion content identification method and early warning system based on semantic analysis technology
CN107145516B (en) * 2017-04-07 2021-03-19 北京捷通华声科技股份有限公司 Text clustering method and system
CN107066585B (en) * 2017-04-17 2019-10-01 济南大学 A kind of probability topic calculates and matched public sentiment monitoring method and system
CN107093021A (en) * 2017-04-21 2017-08-25 深圳市创艺工业技术有限公司 Electricity power engineering goods and materials contract is honoured an agreement sincere public sentiment monitoring system
CN107085608A (en) * 2017-04-21 2017-08-22 上海喆之信息科技有限公司 A kind of effective network hotspot monitoring system
CN107038156A (en) * 2017-04-28 2017-08-11 北京清博大数据科技有限公司 A kind of hot spot of public opinions Forecasting Methodology based on big data
CN107291808A (en) * 2017-05-16 2017-10-24 南京邮电大学 It is a kind of that big data sorting technique is manufactured based on semantic cloud
CN107220236A (en) * 2017-05-23 2017-09-29 武汉朱雀闻天科技有限公司 It is a kind of to determine the doubtful naked method and device for borrowing student
CN107315778A (en) * 2017-05-31 2017-11-03 温州市鹿城区中津先进科技研究院 A kind of natural language the analysis of public opinion method based on big data sentiment analysis
CN107292743A (en) * 2017-06-07 2017-10-24 前海梧桐(深圳)数据有限公司 The intelligent decision making method and its system invested and financed for enterprise
CN107231570A (en) * 2017-06-13 2017-10-03 中国传媒大学 News data content characteristic obtains system and application system
CN107291697A (en) * 2017-06-29 2017-10-24 浙江图讯科技股份有限公司 A kind of semantic analysis, electronic equipment, storage medium and its diagnostic system
CN107358344B (en) * 2017-06-29 2021-09-03 浙江图讯科技股份有限公司 Enterprise hidden danger management method and management system thereof, electronic equipment and storage medium
CN107276854B (en) * 2017-07-27 2021-11-09 浩鲸云计算科技股份有限公司 MOLAP statistical analysis method under big data
CN107527289B (en) * 2017-08-25 2021-08-06 上海优扬新媒信息技术有限公司 Investment portfolio industry configuration method, device, server and storage medium
CN107491438A (en) * 2017-08-25 2017-12-19 前海梧桐(深圳)数据有限公司 Business decision elements recognition method and its system based on natural language
CN107679084B (en) * 2017-08-31 2021-09-28 平安科技(深圳)有限公司 Clustering label generation method, electronic device and computer readable storage medium
CN107679977A (en) * 2017-09-06 2018-02-09 广东中标数据科技股份有限公司 A kind of tax administration platform and implementation method based on semantic analysis
CN107918644B (en) * 2017-10-31 2020-12-08 北京锐思爱特咨询股份有限公司 News topic analysis method and implementation system in reputation management framework
CN107908694A (en) * 2017-11-01 2018-04-13 平安科技(深圳)有限公司 Public sentiment clustering method, application server and the computer-readable recording medium of internet news
CN108052527A (en) * 2017-11-08 2018-05-18 中国传媒大学 Method is recommended in film bridge piecewise analysis based on label system
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108197638B (en) * 2017-12-12 2020-03-20 阿里巴巴集团控股有限公司 Method and device for classifying sample to be evaluated
CN110019720B (en) * 2017-12-19 2022-02-08 阿里巴巴(中国)有限公司 Comment content acquisition method and system
CN108062306A (en) * 2017-12-29 2018-05-22 国信优易数据有限公司 A kind of index system establishment system and method for business environment evaluation
CN108363784A (en) * 2018-01-20 2018-08-03 西北工业大学 A kind of public sentiment trend estimate method based on text machine learning
CN108595466B (en) * 2018-02-09 2022-05-10 中山大学 A kind of Internet information filtering and Internet user information and network post structure analysis method
CN108287922B (en) * 2018-02-28 2022-03-08 福州大学 Text data viewpoint abstract mining method fusing topic attributes and emotional information
CN108536762A (en) * 2018-03-21 2018-09-14 上海蔚界信息科技有限公司 A kind of high-volume text data automatically analyzes scheme
CN108681977B (en) * 2018-03-27 2022-05-31 成都律云科技有限公司 Lawyer information processing method and system
CN108550380A (en) * 2018-04-12 2018-09-18 北京深度智耀科技有限公司 A kind of drug safety information monitoring method and device based on public network
CN108628994A (en) * 2018-04-28 2018-10-09 广东亿迅科技有限公司 A kind of public sentiment data processing system
CN108932291B (en) * 2018-05-23 2022-08-23 福建亿榕信息技术有限公司 Power grid public opinion evaluation method, storage medium and computer
CN108804594A (en) * 2018-05-28 2018-11-13 国家计算机网络与信息安全管理中心 A kind of construction method and device of news content full-text search engine
CN110633373B (en) * 2018-06-20 2023-06-09 上海财经大学 Automobile public opinion analysis method based on knowledge graph and deep learning
CN110727794A (en) * 2018-06-28 2020-01-24 上海传漾广告有限公司 System and method for collecting and analyzing network semantics and summarizing and analyzing content
CN109145085B (en) * 2018-07-18 2020-11-27 北京市农林科学院 Method and system for calculating semantic similarity
CN109376237B (en) * 2018-09-04 2024-05-28 中国平安人寿保险股份有限公司 Client stability prediction method, device, computer equipment and storage medium
CN109408808B (en) * 2018-09-12 2023-08-22 中国传媒大学 Evaluation method and evaluation system for literature works
CN109214008A (en) * 2018-09-28 2019-01-15 珠海中科先进技术研究院有限公司 A kind of sentiment analysis method and system based on keyword extraction
CN109299271B (en) * 2018-10-30 2022-04-05 腾讯科技(深圳)有限公司 Training sample generation method, text data method, public opinion event classification method and related equipment
CN109558586B (en) * 2018-11-02 2023-04-18 中国科学院自动化研究所 Self-evidence scoring method, equipment and storage medium for statement of information
CN109582953B (en) * 2018-11-02 2023-04-07 中国科学院自动化研究所 Data support scoring method and equipment for information and storage medium
CN109635074B (en) * 2018-11-13 2024-05-07 平安科技(深圳)有限公司 Entity relationship analysis method and terminal equipment based on public opinion information
CN109189934B (en) * 2018-11-13 2024-07-19 平安科技(深圳)有限公司 Public opinion recommendation method, public opinion recommendation device, computer equipment and storage medium
CN109635107A (en) * 2018-11-19 2019-04-16 北京亚鸿世纪科技发展有限公司 The method and device of semantic intellectual analysis and the event scenarios reduction of multi-data source
CN109526027B (en) * 2018-11-27 2022-07-01 中国移动通信集团福建有限公司 A cell capacity optimization method, device, equipment and computer storage medium
CN109766438B (en) * 2018-12-12 2024-07-16 平安科技(深圳)有限公司 Resume information extraction method, resume information extraction device, computer equipment and storage medium
CN110046292B (en) * 2018-12-13 2024-04-23 创新先进技术有限公司 Public opinion data processing method, device, equipment and storage medium
CN111435594A (en) * 2019-01-14 2020-07-21 珠海格力电器股份有限公司 Method and device for acquiring cooking parameters of cooking appliance and cooking appliance
CN109977995A (en) * 2019-02-11 2019-07-05 平安科技(深圳)有限公司 Text template recognition methods, device and computer readable storage medium
CN110110156A (en) * 2019-04-04 2019-08-09 平安科技(深圳)有限公司 Industry public sentiment monitoring method, device, computer equipment and storage medium
CN110134844A (en) * 2019-04-04 2019-08-16 平安科技(深圳)有限公司 Subdivision field public sentiment monitoring method, device, computer equipment and storage medium
CN110188196B (en) * 2019-04-29 2021-10-08 同济大学 A Random Forest-based Incremental Dimensionality Reduction Method for Text
CN110222172B (en) * 2019-05-15 2021-03-16 北京邮电大学 Multi-source network public opinion theme mining method based on improved hierarchical clustering
CN110119416A (en) * 2019-05-16 2019-08-13 重庆八戒传媒有限公司 A kind of service data analysis system and method
CN110188168B (en) * 2019-05-24 2021-09-03 北京邮电大学 Semantic relation recognition method and device
CN112182203A (en) * 2019-07-04 2021-01-05 北京航天长峰科技工业集团有限公司 Warning situation text data analysis system
CN110348539B (en) * 2019-07-19 2021-05-07 知者信息技术服务成都有限公司 Short text relevance judging method
CN112348421A (en) * 2019-08-08 2021-02-09 北京国双科技有限公司 Data processing method and device
CN110472055B (en) * 2019-08-21 2021-09-14 北京百度网讯科技有限公司 Method and device for marking data
CN110532492A (en) * 2019-08-27 2019-12-03 东北大学 A kind of forum data management classification system and method
CN112541105A (en) * 2019-09-20 2021-03-23 福建师范大学地理研究所 Keyword generation method, public opinion monitoring method, device, equipment and medium
CN110705288A (en) * 2019-09-29 2020-01-17 武汉海昌信息技术有限公司 Big data-based public opinion analysis system
CN110852090B (en) * 2019-11-07 2024-03-19 中科天玑数据科技股份有限公司 Mechanism characteristic vocabulary expansion system and method for public opinion crawling
CN110990389A (en) * 2019-11-29 2020-04-10 上海易点时空网络有限公司 Method and device for simplifying question bank and computer readable storage medium
CN110991190B (en) * 2019-11-29 2021-06-29 华中科技大学 A document topic enhancement system, text sentiment prediction system and method
CN110968668B (en) * 2019-11-29 2023-03-14 中国农业科学院农业信息研究所 Method and device for calculating similarity of network public sentiment topics based on hyper-network
CN111144575B (en) * 2019-12-05 2022-08-12 支付宝(杭州)信息技术有限公司 Public opinion early warning model training method, early warning method, device, equipment and medium
CN111158973B (en) * 2019-12-05 2021-06-18 北京大学 A method for monitoring the dynamic evolution of web applications
CN111160019B (en) * 2019-12-30 2023-08-15 中国联合网络通信集团有限公司 Public opinion monitoring method, device and system
CN111241077B (en) * 2020-01-03 2023-06-09 四川新网银行股份有限公司 Identification method of financial fraud based on internet data
CN111259635A (en) * 2020-01-09 2020-06-09 智业软件股份有限公司 Method and system for completing and predicting medical record written text
CN111291186B (en) * 2020-01-21 2024-01-09 北京捷通华声科技股份有限公司 Context mining method and device based on clustering algorithm and electronic equipment
CN111291162B (en) * 2020-02-26 2024-04-09 深圳前海微众银行股份有限公司 Quality inspection example sentence mining method, device, equipment and computer readable storage medium
CN111401074A (en) * 2020-04-03 2020-07-10 山东爱城市网信息技术有限公司 Short text emotion tendency analysis method, system and device based on Hadoop
CN111563190B (en) * 2020-04-07 2023-03-14 中国电子科技集团公司第二十九研究所 Multi-dimensional analysis and supervision method and system for user behaviors of regional network
CN111797333B (en) * 2020-06-04 2021-04-20 南京擎盾信息科技有限公司 Public opinion spreading task display method and device
CN111708886A (en) * 2020-06-11 2020-09-25 国网天津市电力公司 A data-driven public opinion analysis terminal and public opinion text analysis method
CN111914096B (en) * 2020-07-06 2024-02-02 同济大学 Public opinion knowledge graph-based public transportation passenger satisfaction evaluation method and system
CN111831922B (en) * 2020-07-14 2021-02-05 深圳市众创达企业咨询策划有限公司 Recommendation system and method based on internet information
CN111914141B (en) * 2020-07-30 2023-01-10 广州城市信息研究所有限公司 Public opinion knowledge base construction method and public opinion knowledge base
CN112084298B (en) * 2020-07-31 2025-02-18 北京明略昭辉科技有限公司 Public opinion topic processing method and device based on fast BTM
CN112214576B (en) * 2020-09-10 2024-02-06 深圳价值在线信息科技股份有限公司 Public opinion analysis method, public opinion analysis device, terminal equipment and computer readable storage medium
CN112184323A (en) * 2020-10-13 2021-01-05 上海风秩科技有限公司 Evaluation label generation method and device, storage medium and electronic equipment
CN112528197B (en) * 2020-11-20 2023-07-07 四川新网银行股份有限公司 System and method for monitoring network public opinion in real time based on artificial intelligence
CN112464653A (en) * 2020-12-03 2021-03-09 合肥天源迪科信息技术有限公司 Real-time event identification and matching method based on communication short message
CN114722801B (en) * 2020-12-22 2024-12-20 航天信息股份有限公司 Government data classification storage method and related device
CN112650848A (en) * 2020-12-30 2021-04-13 交控科技股份有限公司 Urban railway public opinion information analysis method based on text semantic related passenger evaluation
CN113282702B (en) * 2021-03-16 2023-12-19 广东医通软件有限公司 Intelligent retrieval method and retrieval system
CN113032653A (en) * 2021-04-02 2021-06-25 盐城师范学院 Big data-based public opinion monitoring platform
CN113822038B (en) * 2021-06-03 2024-06-25 腾讯科技(深圳)有限公司 Abstract generation method and related device
CN113468333B (en) * 2021-09-02 2021-11-19 华东交通大学 Event detection method and system fusing hierarchical category information
CN113836307B (en) * 2021-10-15 2024-02-20 国网北京市电力公司 A method, system, device and storage medium for hotspot discovery of power supply service work orders
CN114281994B (en) * 2021-12-27 2022-06-03 盐城工学院 A text clustering integration method and system based on three-layer weighted model
CN114297372B (en) * 2021-12-31 2025-02-07 思必驰科技股份有限公司 Personalized note generation method and system
CN114386422B (en) * 2022-01-14 2023-09-15 淮安市创新创业科技服务中心 Intelligent auxiliary decision-making method and device based on public opinion extraction on enterprise pollution
CN114491207A (en) * 2022-01-18 2022-05-13 平安普惠企业管理有限公司 Public opinion analysis methods and related products
CN114579757B (en) * 2022-02-23 2025-02-07 阿里巴巴(中国)有限公司 A text processing method and device based on knowledge graph assistance
CN114692593B (en) * 2022-03-21 2023-04-07 中国刑事警察学院 Network information safety monitoring and early warning method
CN114385890B (en) * 2022-03-22 2022-05-20 深圳市世纪联想广告有限公司 Internet public opinion monitoring system
CN114462393A (en) * 2022-04-12 2022-05-10 安徽数智建造研究院有限公司 Webpage text information extraction method and device, terminal equipment and storage medium
CN115082947B (en) * 2022-07-12 2023-08-15 江苏楚淮软件科技开发有限公司 Paper letter quick collecting, sorting and reading system
CN115757793B (en) * 2022-11-29 2023-09-05 海南达润丰企业管理合伙企业(有限合伙) Topic analysis early warning method and system based on artificial intelligence and cloud platform
CN116521858B (en) * 2023-04-20 2024-04-30 浙江浙里信征信有限公司 Context semantic sequence comparison method based on dynamic clustering and visualization
CN117786249B (en) * 2023-12-27 2024-11-15 北京新华多媒体数据有限公司 Network real-time hot topic mining analysis and public opinion extraction system
CN117743376B (en) * 2024-02-19 2024-05-03 蓝色火焰科技成都有限公司 Big data mining method, device and storage medium for digital financial service
CN117910467B (en) * 2024-03-15 2024-05-10 成都启英泰伦科技有限公司 Word segmentation processing method in offline voice recognition process
CN118520174B (en) * 2024-07-19 2024-09-27 西安银信博锐信息科技有限公司 Customer behavior feature extraction method based on data analysis
CN118656495B (en) * 2024-08-20 2024-11-22 湖南数据产业集团有限公司 Public opinion publishing traceability method, device, equipment and storage medium thereof
CN118656496B (en) * 2024-08-21 2024-11-22 舟谱数据技术南京有限公司 Retrieval data management method and system based on NLP

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101529418A (en) * 2006-01-19 2009-09-09 维里德克斯有限责任公司 Systems and methods for acquiring analyzing mining data and information
CN101788988A (en) * 2009-01-22 2010-07-28 蔡亮华 Information extraction method
CN102708096A (en) * 2012-05-29 2012-10-03 代松 Network intelligence public sentiment monitoring system based on semantics and work method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874581B2 (en) * 2010-07-29 2014-10-28 Microsoft Corporation Employing topic models for semantic class mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
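Research on Text Clustering Algorithms Based on Semantic Similarity (基于语义相似度的文本聚类算法的研究); Sun Shuang (孙爽); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》); 2008-01-15 (No. 01); I140-15 *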

Also Published As

Publication number Publication date
CN103544255A (en) 2014-01-29

Similar Documents

Publication Publication Date Title
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN108763333B (en) Social media-based event map construction method
CN103226578B (en) A Method for Website Identification and Webpage Segmentation in the Medical Field
CN101661513B (en) Detection method of network focus and public sentiment
US9262532B2 (en) Ranking entity facets using user-click feedback
Xie et al. A novel text mining approach for scholar information extraction from web content in Chinese
Shen et al. A probabilistic model for linking named entities in web text with heterogeneous information networks
CN102254014B (en) Adaptive information extraction method for webpage characteristics
CN103023714B (en) The liveness of topic Network Based and cluster topology analytical system and method
CN103365924B (en) A kind of method of internet information search, device and terminal
TWI695277B (en) Automatic website data collection method
CN113378565B (en) Event analysis method, device, device and storage medium for multi-source data fusion
CN111708740A (en) Cloud platform-based massive search query log calculation and analysis system
CN102110140A (en) Network-based method for analyzing opinion information in discrete text
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN101593204A (en) A Sentiment Analysis System Based on News Comment Webpage
CN107180045A (en) A kind of internet text contains the abstracting method of geographical entity relation
CN107918644B (en) News topic analysis method and implementation system in reputation management framework
CN102890702A (en) Internet forum-oriented opinion leader mining method
CN102609427A (en) Public opinion vertical search analysis system and method
Sleeman et al. Entity type recognition for heterogeneous semantic graphs
CN109284432A (en) Network public opinion analysis system based on big data platform
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
CN115017302A (en) A public opinion monitoring method and public opinion monitoring system
CN108595466B (en) A kind of Internet information filtering and Internet user information and network post structure analysis method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220425

Address after: 213000 room 1505, No. 9-1, Taihu East Road, Xinbei District, Changzhou City, Jiangsu Province

Patentee after: CHANGZHOU HUALONG NETWORK TECHNOLOGY CO.,LTD.

Address before: No. 1 Gehu Road, Wujin District, Changzhou City, Jiangsu Province, 213164

Patentee before: CHANGZHOU University

TR01 Transfer of patent right