CN101149739A

CN101149739A - An internet-oriented meaningful string mining method and system

Info

Publication number: CN101149739A
Application number: CNA2007101207555A
Authority: CN
Inventors: 张华平; 贺敏; 黄玉兰; 龚才春
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2007-08-24
Filing date: 2007-08-24
Publication date: 2008-03-26

Abstract

The invention discloses an Internet-oriented method and system for mining meaningful strings. The method comprises the following steps: step A, discovering repeated strings; step B, filtering the strings by context adjacency analysis; step C, filtering the strings by language model analysis. The method can effectively extract meaningful strings from web pages or large-scale text data.

Description

An internet-oriented meaningful string mining method and system

Technical field

The invention relates to the fields of information retrieval and operating systems, and in particular to an Internet-oriented method and system for mining meaningful strings.

Background

The Internet holds a vast amount of information, but its sheer volume makes it hard for Web users to obtain what is useful. Faced with an ocean of information that is updated day and night, users often feel overwhelmed. They do not know how to find the information they actually want, let alone how to grasp the key information in the mass and keep up with important current news. With new information emerging at every moment, nobody can keep an eye and an ear on everything at once. People therefore urgently need strong support from natural language processing technology to cope with the growing problem of information overload.

Extracting useful key information from massive network information has become a major problem and an urgent need in the era of the information explosion. Solving it has broad application prospects. For individuals, it makes it easier to discover and organize important current information and can serve as an entry point for handling massive amounts of information. For enterprises, it helps track the latest developments in relevant fields, the direction of strategic partners, and the latest moves of competitors, which supports strategy formulation. For governments, it gives a view of important current social events, trends, and the direction of public opinion, and so becomes an information window on social conditions that supports decision-making.

Against this background, the extraction of useful information from web text has become an important problem and a direction worth studying in depth.

Summary of the invention

The object of the present invention is to provide an Internet-oriented method and system for mining meaningful strings that can effectively extract meaningful strings from web pages or large-scale text data.

To achieve this object, an Internet-oriented method for mining meaningful strings is provided, comprising the following steps:

Step A, discovering repeated strings;

Step B, filtering the strings by context adjacency analysis;

Step C, filtering the strings by language model analysis.

Step A comprises the following step:

Step A1, processing the web page corpus into formatted plain text files, classifying the text files, recording the strings that occur repeatedly in the text together with their frequencies, and filtering out strings whose number of occurrences is below a given threshold.

Step B comprises the following step:

Step B1, calculating the context adjacency features of each repeated string, judging whether these features reach the set thresholds, and filtering out the text strings that do not reach them.

Step C comprises the following step:

Step C1, scanning the adjacent character pairs of the text string character by character, looking up the coupling degree of each adjacent character pair, filtering the text string according to the coupling degree, and then filtering further according to the positional word-formation probability of the text string to obtain the meaningful strings.

Step A1 comprises the following steps:

Step A11, processing the web page corpus into formatted plain text files, and then converting the Chinese characters into corresponding IDs;

Step A12, building an index on the processed ID sequence and, starting from the index information of each single character, extending to obtain all repeated strings; after the newly generated repeated strings are written to a file, extension continues to obtain longer strings, iterating until a separator symbol is encountered or the length reaches a specified threshold, at which point extension stops;

Step A13, recording the adjacent-word information and the document information of each string, with each type of information stored separately in its own file.

Step B1 comprises the following steps:

Step B11, calculating the context adjacency features of each repeated string and judging whether these features reach the set thresholds;

Step B12, if the thresholds are reached, proceeding to step C;

Step B13, if a feature does not reach its threshold, filtering the string out.

Step C1 comprises the following steps:

Step C11, annotating part of the training corpus and generating a dictionary of coupling degrees of adjacent characters and a dictionary of single-character positional word-formation probabilities;

Step C12, scanning the adjacent character pairs character by character and looking up the coupling degree of each adjacent character pair;

Step C13, when the coupling degree of an adjacent character pair is below the set threshold, the pair does not form part of a word, and the string is filtered out as a garbage string;

Step C14, for the strings not filtered out by the character-pair scan, looking up the single-character positional word-formation probabilities and judging whether the head or the tail of the string contains a common function character;

Step C15, if it is a function character, filtering the string out;

Step C16, determining the strings that have still not been filtered out to be meaningful strings.

To achieve the object of the invention, an Internet-oriented system for mining meaningful strings is also provided, comprising:

a repeated string discovery module, configured to process the web page corpus into formatted plain text files, classify the text files, record the strings that occur repeatedly in the text together with their frequencies, and filter out strings whose number of occurrences is below a given threshold;

a context adjacency analysis module, configured to calculate the context adjacency features of each repeated string, judge whether these features reach the set thresholds, and filter out the text strings that do not reach them;

a statistical language model analysis module, configured to scan the adjacent character pairs of the text string character by character, look up the coupling degree of each adjacent character pair, and filter the text string according to the coupling degree to obtain the meaningful strings.

The statistical language model analysis module is further configured, after scanning the adjacent character pairs, to filter the strings further according to the positional word-formation probability of the text string to obtain the meaningful strings.

The context adjacency feature is one of, or a combination of more than one of, the adjacency set, adjacency type, adjacency entropy, adjacency pair set, adjacency pair type, and adjacency pair entropy.

The strings that occur repeatedly in the text and their frequencies are obtained by repeated string discovery using the suffix tree algorithm, the sequitur algorithm, the n-gram incremental distribution algorithm, or an improved n-gram incremental distribution algorithm.

The beneficial effects of the present invention are as follows. The Internet-oriented method and system for mining meaningful strings pass the text to be recognized through three stages (repeated string discovery, context adjacency analysis, and statistical language model analysis) to mine meaningful strings. The invention performs word segmentation during preprocessing, which further reduces the time complexity of repeated string discovery and also greatly improves the precision and recall of the extraction results. The space complexity of repeated string discovery is O(N), where N is the size of the corpus, so plain text data comparable in size to main memory can be analyzed, about 10 times the scale that the traditional suffix tree method can handle. Different features can be used in adjacency analysis according to the needs of the application; adjacency entropy tends to find strings that are distributed fairly evenly over many pragmatic contexts, and such strings are widely distributed and usually general-purpose. Finally, the two-character coupling degree is used to measure how tightly two characters combine and is combined with a stop-character judgment, which is more flexible and intelligent.

Brief description of the drawings

Fig. 1 is a schematic diagram of the process of the Internet-oriented meaningful string mining method of the present invention;

Fig. 2 is a flow chart of the process in Fig. 1 of extracting meaningful strings from repeated strings;

Fig. 3 is a flow chart of the string-head and string-tail judgment process of the present invention;

Fig. 4 is a schematic diagram of the Internet-oriented meaningful string mining system of the present invention.

Detailed description

To make the object, technical solution, and advantages of the present invention clearer, the Internet-oriented method and system for mining meaningful strings are described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The present invention defines a meaningful string as a string that carries useful information on the Internet and is used in a variety of contexts. The most important property of a meaningful string is semantic completeness. The invention analyzes strings from statistical, structural, pragmatic, and semantic perspectives and proposes a general-purpose method and system for mining meaningful strings.

The invention divides the mining process into three stages: repeated string discovery, context adjacency analysis, and language model analysis. The whole process is shown in Fig. 1 and comprises the following steps:

Step S100, in the repeated string discovery stage, the web page corpus is processed into formatted plain text files, the text files are classified, the strings that occur repeatedly in the text are recorded together with their frequencies, and strings whose number of occurrences is below a given threshold are filtered out.

Step S200, in the context adjacency analysis stage, the context adjacency features of each repeated string are calculated and compared with the set thresholds, and the text strings that do not reach the thresholds are filtered out.

Step S300, in the statistical language model analysis stage, the adjacent character pairs of the text string are scanned character by character, the coupling degree of each adjacent character pair is looked up, the text string is filtered according to the coupling degree, and it is then filtered further according to its positional word-formation probability to obtain the meaningful strings.

This last stage relies on two main criteria. First, the invention calculates how tightly two adjacent characters in a string are bound together; if the tightness is below a given threshold, the string is deleted.

Second, the invention tests the probability that a character of a word appears in its current position (the position being the beginning or the end of the word); if the probability is below a given threshold, the word is deleted.

The following describes step S100 in detail, that is, the process of converting the web page corpus into formatted plain text files, classifying the text files, recording the strings that occur repeatedly in the text together with their frequencies, and filtering out strings whose number of occurrences is below a given threshold.

The web page corpus is processed into formatted plain text files and then preprocessed, which includes word segmentation and converting the Chinese characters into corresponding IDs. Segmentation uses the fast maximum-matching method. Experiments show that, with a segmentation dictionary containing more than 60,000 core words and with no unknown-word recognition during segmentation, including the maximum-matching segmentation step gives clearly better results than performing no segmentation.
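As an illustration of this preprocessing step, the following is a minimal sketch of forward maximum matching followed by ID conversion. The function names, the `max_len` limit, and the toy lexicon are assumptions made for the example; the patent does not specify them. Characters not covered by the lexicon fall back to single-character tokens, which matches the statement that no unknown-word recognition is performed.

```python
def max_match_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_ids(tokens, vocab):
    """Map each token to an integer ID, assigning a new ID on first sight."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


# Toy lexicon (hypothetical); the patent's dictionary is far larger.
lexicon = {"广东", "防控", "形势"}
tokens = max_match_segment("广东的防控形势", lexicon)
vocab = {}
ids = to_ids(tokens, vocab)
print(tokens)  # ['广东', '的', '防控', '形势']
print(ids)     # [0, 1, 2, 3]
```

Working on integer IDs rather than raw text keeps the later index compact and makes element comparison cheap.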

An index is built on the processed ID sequence. Starting from the index information of each single character, strings are extended to obtain all repeated strings. After the newly generated repeated strings are written to a file, extension continues to obtain longer strings, and the process iterates until a separator symbol is encountered or the length reaches a specified threshold, at which point extension stops. At the same time, the adjacent-word information and the document information of each string are recorded, with each type of information stored separately in its own file.

Mature repeated string discovery algorithms that have been applied to Chinese text include the suffix tree algorithm, the sequitur algorithm, and the n-gram incremental distribution algorithm. Any of them can be used to count repeated strings. The embodiment of the present invention uses an improved n-gram incremental distribution algorithm, as follows.

The method of the invention has lower time complexity than the plain n-gram incremental algorithm. Because the index records the address of every occurrence of each string, extension can jump directly to the next character using the address and the string length, and frequency counting is limited to the string currently being extended. There is no need to traverse the whole corpus for a global comparison.
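The extension procedure can be sketched as follows. This is an illustrative reading of the description, not the patent's actual implementation: `find_repeats`, `min_freq`, and `max_len` are hypothetical names, and the index is an in-memory dictionary rather than the files the patent mentions. The key idea it shows is that each n-gram keeps the start positions of its occurrences, so the next element is found at `pos + length` without rescanning the corpus.

```python
from collections import defaultdict


def find_repeats(ids, separators, min_freq=2, max_len=5):
    """Return {n-gram tuple: [start positions]} for n-grams seen >= min_freq times."""
    # Index of single elements: element -> start positions.
    current = defaultdict(list)
    for pos, tok in enumerate(ids):
        if tok not in separators:
            current[(tok,)].append(pos)
    current = {g: p for g, p in current.items() if len(p) >= min_freq}

    repeats = dict(current)
    length = 1
    while current and length < max_len:
        extended = defaultdict(list)
        for gram, positions in current.items():
            for pos in positions:
                nxt = pos + length  # jump straight to the next element
                # Stop extending at the end of the text or at a separator.
                if nxt < len(ids) and ids[nxt] not in separators:
                    extended[gram + (ids[nxt],)].append(pos)
        # Only n-grams that are still frequent are extended again.
        current = {g: p for g, p in extended.items() if len(p) >= min_freq}
        repeats.update(current)
        length += 1
    return repeats


# Characters stand in for IDs here; "x" plays the role of a separator symbol.
repeats = find_repeats(list("abcabcxab"), separators={"x"})
print(repeats[("a", "b", "c")])  # [0, 3]
```

Because only n-grams that already met the frequency threshold are extended, the counting work in each round is confined to the occurrences of the current candidates.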

The adjacent-word information and document information are recorded during discovery because the later analysis of meaningful strings needs the document information and adjacency pair information of each string. If these statistics were gathered after repeated string discovery, the whole corpus would have to be traversed several more times, which adds time overhead. During discovery, the address of every occurrence of each string is already known, so this information can be collected with almost no increase in time complexity.

Experiments confirm that meaningful string mining works better if the text is segmented into words before repeated strings are searched for.

The following describes step S200 in detail, that is, the process of calculating the context adjacency features of each repeated string, judging whether these features reach the set thresholds, and filtering out the text strings that do not reach them.

To describe how flexible the context of a string S is, the invention introduces a series of context adjacency features: the adjacency set, adjacency type, and adjacency entropy, and the adjacency pair set, adjacency pair type, and adjacency pair entropy.

Adjacency set: divided into the left adjacency set L_NB and the right adjacency set R_NB, which are the sets of characters or words that appear immediately to the left or right of string S in real text.

Adjacency type: divided into the left adjacency type V_L and the right adjacency type V_R, which are the numbers of character or word elements in the left and right adjacency sets respectively. They reflect how many kinds of preceding and following context string S has.

Adjacency entropy: the information entropy of an adjacency set of string S. String S has a left adjacency entropy and a right adjacency entropy.

Correspondingly, the adjacency pair set, adjacency pair type, and adjacency pair entropy are also defined.

Adjacency pair set: for each occurrence of string S, its left and right adjacent elements form an adjacency pair <L_i, R_i>; all adjacency pairs of string S form the adjacency pair set PNB.

Adjacency pair type: the number of elements in the adjacency pair set PNB is called the adjacency pair type VP.

Adjacency pair entropy: the information entropy of the adjacency pair set.

All of these context adjacency features can be used to measure the context of a string.

As shown in Fig. 2, context adjacency analysis calculates the context adjacency features of each repeated string, including the adjacency set, adjacency type, and adjacency entropy, and the adjacency pair set, adjacency pair type, and adjacency pair entropy, and judges whether these features reach the set thresholds. If they do, the string is flexible in its linguistic use and passes to the statistical language model analysis stage.

The adjacency set, adjacency type, adjacency pair set, and adjacency pair type of a repeated string are obtained by counting over the corpus.

The entropies (adjacency entropy and adjacency pair entropy) are obtained by calculation.

The entropy is calculated as follows.

Suppose each element l_i of an adjacency set (for example the left adjacency set L_NB) has an occurrence frequency n_i in the real text, and the sum of the frequencies is N. The entropy is then:

$E_{L} = - \sum_{i=1}^{|V_{L}|} \frac{n_{i}}{N} \log \frac{n_{i}}{N}$

For example, the new word "禽流感" (bird flu) has been in frequent use since 2000 and appears in the following sentences:

钟南山透露禽流感病毒尚未明显变异。 (Zhong Nanshan revealed that the bird flu virus has not yet mutated significantly.)

广东的防控禽流感形势趋缓。 (The bird flu prevention and control situation in Guangdong has eased.)

有7人感染禽流感事件。 (An incident in which 7 people were infected with bird flu.)

发现一宗禽流感疑似病例。 (One suspected case of bird flu was found.)

颁布5条禁令防控禽流感。 (Five bans were issued to prevent and control bird flu.)

If the word is taken as the granularity of adjacency analysis, the context adjacency features of the string "禽流感" in these sentences are:

Left adjacency set: L_NB = {透露 (revealed), 防控 (prevention and control), 感染 (infected), 一宗 (one case)}

Right adjacency set: R_NB = {病毒 (virus), 形势 (situation), 事件 (incident), 疑似 (suspected), EOS}

Left adjacency type: V_L = 4

Right adjacency type: V_R = 5

Left adjacency entropy (base-10 logarithms; 防控 occurs twice, the other three elements once each): $E_{L} = - (\frac{1}{5} \log \frac{1}{5} + \frac{2}{5} \log \frac{2}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5}) = 0.579$

Right adjacency entropy: $E_{R} = - (\frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5}) = 0.699$

Adjacency pair set: PNB = {<透露, 病毒>, <防控, 形势>, <感染, 事件>, <一宗, 疑似>, <防控, EOS>}

Adjacency pair type: VP = 5

Adjacency pair entropy: $E_{P} = - (\frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5} + \frac{1}{5} \log \frac{1}{5}) = 0.699$
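The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the five sentences are segmented by hand for the example (the patent does not give the segmentation), sentence-final punctuation is dropped so that a string at the end of a sentence has the right neighbour `EOS`, a `BOS` marker is assumed for the symmetric case, and logarithms are taken to base 10. With the leading minus sign of the entropy formula, the computed entropies are positive.

```python
import math
from collections import Counter


def entropy(counter):
    """Base-10 entropy of a frequency table: -sum (n_i/N) * log10(n_i/N)."""
    total = sum(counter.values())
    return -sum(n / total * math.log10(n / total) for n in counter.values())


def adjacency_features(sentences, target):
    """Count left neighbours, right neighbours and neighbour pairs of `target`."""
    left, right, pairs = Counter(), Counter(), Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            l = tokens[i - 1] if i > 0 else "BOS"
            r = tokens[i + 1] if i + 1 < len(tokens) else "EOS"
            left[l] += 1
            right[r] += 1
            pairs[(l, r)] += 1
    return left, right, pairs


# Hand-segmented versions of the five example sentences (assumed segmentation).
sentences = [
    ["钟南山", "透露", "禽流感", "病毒", "尚未", "明显", "变异"],
    ["广东", "的", "防控", "禽流感", "形势", "趋缓"],
    ["有", "7", "人", "感染", "禽流感", "事件"],
    ["发现", "一宗", "禽流感", "疑似", "病例"],
    ["颁布", "5", "条", "禁令", "防控", "禽流感"],
]
left, right, pairs = adjacency_features(sentences, "禽流感")
print(len(left), len(right), len(pairs))  # 4 5 5
print(round(entropy(left), 3), round(entropy(right), 3), round(entropy(pairs), 3))
# 0.579 0.699 0.699
```

The adjacency types are simply the sizes of the three counters, so one pass over the occurrences yields every feature named in this section.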

If a feature does not reach its threshold, the string is a garbage string and is filtered out. The thresholds are obtained by training on a training corpus.
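A minimal sketch of this threshold test is given below. The feature names and threshold values are hypothetical and chosen only for illustration; in the patent the thresholds come from a training corpus, and the set of features used depends on the application.

```python
def passes_adjacency_filter(features, thresholds):
    """Return True when every thresholded feature reaches its minimum value."""
    return all(features.get(name, 0.0) >= minimum
               for name, minimum in thresholds.items())


# Hypothetical thresholds; the patent learns them from a training corpus.
thresholds = {"left_types": 3, "right_types": 3, "left_entropy": 0.5}
candidate = {"left_types": 4, "right_types": 5, "left_entropy": 0.579}
garbage = {"left_types": 1, "right_types": 5, "left_entropy": 0.0}

kept = [f for f in (candidate, garbage) if passes_adjacency_filter(f, thresholds)]
print(len(kept))  # 1
```

A string that always follows the same left neighbour, like `garbage` above, is likely a fragment of a longer unit, which is why a low adjacency type or entropy marks it for removal.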

A corpus is language material that has actually occurred in real language use; it is a basic resource that carries linguistic knowledge in machine-readable form. A raw corpus must be processed (analyzed and annotated) before it becomes a useful resource.

Corpus training is prior art, for example training on a corpus with a Hidden Markov Model (HMM). It is not the inventive point of the present invention and is therefore not described in detail here.

Experiments confirm that using words as the unit of the adjacent elements gives higher accuracy than using characters.

The following describes step S300 in detail, that is, the process of scanning the adjacent character pairs of the text string character by character, looking up the coupling degree of each pair, filtering the text string according to the coupling degree, and then filtering further according to the word-formation probability of the text string to obtain the meaningful strings.

To describe how tightly two consecutive characters in a word are bound together, the invention defines the coupling degree of an adjacent character pair. It is defined as follows: scan all consecutive character pairs that occur in the segmented training corpus, and count for each pair the total number of occurrences and the number of times the pair occurs as a substring of a single word. The ratio of the latter to the former is the coupling degree of the adjacent character pair, denoted Coup. For example, the character pair "过目" occurs 16 times in the statistics used here: 12 times inside words such as "过目不忘" and "一一过目", and 4 times across a word boundary in contexts such as "超过目前". Therefore Coup(<过, 目>) = 12/(12+4) = 0.75.

The higher the Coup value, the more tightly the character pair is bound; the lower the value, the less likely the pair is to occur inside a single word. Coupling degrees are computed from the training corpus.
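The definition can be turned into a small counting routine. The sketch below assumes the training corpus is available as lists of already segmented words; the function name and the toy corpus are illustrative. The toy corpus is chosen so that the pair <过, 目> occurs 3 times inside a word and once across a word boundary, giving the same ratio as the 12/16 example.

```python
from collections import Counter


def coupling_dictionary(segmented_sentences):
    """Coup(<a, b>) = (# times ab occurs inside one word) / (# times ab occurs at all)."""
    total, inside = Counter(), Counter()
    for words in segmented_sentences:
        # Flatten the sentence into characters, remembering each character's word.
        chars, word_index = [], []
        for w_idx, word in enumerate(words):
            for ch in word:
                chars.append(ch)
                word_index.append(w_idx)
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            total[pair] += 1
            if word_index[i] == word_index[i + 1]:
                inside[pair] += 1
    return {pair: inside[pair] / total[pair] for pair in total}


corpus = [["过目不忘"]] * 3 + [["超过", "目前"]]
coup = coupling_dictionary(corpus)
print(coup[("过", "目")])  # 0.75
```

Pairs that only ever straddle a word boundary get a coupling degree of 0, and pairs that only ever occur inside words get 1.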

In addition, the invention introduces the positional word-formation probability, which expresses the probability that a given Chinese character appears at a given position in a word (word-initial, word-final, and so on). For example, the character "阿" has a high word-initial probability but a very low word-final probability; if "阿" appears at the end of a word, the word can generally be regarded as a garbage string. Positional word-formation probabilities are also computed from the training corpus.
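The patent does not give an explicit estimator for this probability, so the following is only one plausible sketch. It assumes a segmented training corpus and estimates, for each character, the fraction of its occurrences that start or end a multi-character word; single-character words are counted in the denominator only, so that standalone function characters such as "的" receive low head and tail probabilities. The function name and the toy corpus are hypothetical.

```python
from collections import Counter


def position_probabilities(segmented_sentences):
    """Return (p_head, p_tail): chance that a character starts / ends a multi-character word."""
    seen, head, tail = Counter(), Counter(), Counter()
    for words in segmented_sentences:
        for word in words:
            seen.update(word)  # count every character occurrence
            if len(word) > 1:
                head[word[0]] += 1
                tail[word[-1]] += 1
    p_head = {ch: head[ch] / n for ch, n in seen.items()}
    p_tail = {ch: tail[ch] / n for ch, n in seen.items()}
    return p_head, p_tail


corpus = [["阿姨", "的", "阿胶"], ["东阿", "的", "阿姨"]]
p_head, p_tail = position_probabilities(corpus)
print(p_head["阿"], p_tail["阿"])  # 0.75 0.25
print(p_head["的"])                # 0.0
```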

Before language model analysis, part of the training corpus should be annotated manually to generate the dictionary of coupling degrees of adjacent characters (for example a two-character coupling dictionary) and the dictionary of single-character positional word-formation probabilities.

As shown in Fig. 3, the adjacent character pairs are first scanned character by character and the coupling degree of each pair (for example the two-character coupling degree) is looked up. If it is below the set threshold, the pair does not form part of any word, and the string should be deleted as a garbage string.

Strings that survive the character-pair scan enter the next filtering step, which looks up the single-character positional word-formation probabilities. The probability of the first character is looked up first; if it is below a given threshold, the character should not appear at the beginning of a word, and the string is filtered out.

For strings that have still not been deleted, the positional word-formation probability of the last character is looked up. Together with the head check, this determines whether the head or tail of the string contains a common function character; if it does, the string is filtered out. In other words, if the positional word-formation probability is below the set threshold, the character should not appear at the end of a word, and the string is filtered out.

Preferably, the first character pair of the string is also extracted and its two-character coupling degree is checked. If it exceeds a given threshold, the pair is considered tightly bound and sufficient to form the head of a word, and the positional word-formation probability of the first character is no longer checked. This avoids applying the garbage-head dictionary too rigidly. For example, the character pair "的士" (taxi) does form a word. Judged only by the positional word-formation probability of its first character "的", it might be filtered out; but if the coupling degree of the pair is checked first and found to be high, the string is kept.

After this step, the strings that have not been filtered out are determined to be meaningful strings. These meaningful strings are output and the process ends.
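The filtering sequence of Fig. 3 can be sketched as a single predicate. This is an illustrative reading, not the patent's implementation: the dictionaries are plain Python `dict` objects, all threshold values are hypothetical placeholders (the patent trains them), unseen pairs and characters are assumed to score 0, and the input string is assumed to be non-empty.

```python
def is_meaningful(chars, coup, p_head, p_tail,
                  coup_min=0.5, head_min=0.1, tail_min=0.1, strong_pair=0.9):
    """Apply the coupling scan, then the head and tail position checks."""
    pairs = list(zip(chars, chars[1:]))
    # 1. Every adjacent character pair must be coupled tightly enough.
    if any(coup.get(p, 0.0) < coup_min for p in pairs):
        return False
    # 2. Head check, skipped when the first pair is itself strongly coupled.
    first_pair_strong = bool(pairs) and coup.get(pairs[0], 0.0) > strong_pair
    if not first_pair_strong and p_head.get(chars[0], 0.0) < head_min:
        return False
    # 3. Tail check on the last character.
    return p_tail.get(chars[-1], 0.0) >= tail_min


# Toy dictionaries with made-up values for illustration only.
coup = {("的", "士"): 0.95, ("的", "人"): 0.6}
p_head = {"的": 0.01}
p_tail = {"士": 0.5, "人": 0.5}

print(is_meaningful("的士", coup, p_head, p_tail))  # True
print(is_meaningful("的人", coup, p_head, p_tail))  # False
```

The second call fails the head check because "的" rarely starts a word and the pair is not coupled strongly enough to override that, whereas "的士" is kept by the first-pair exception.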

All thresholds in this process are obtained by training on the training corpus.

The test data include raw web pages collected from 9 domestic news websites such as Sina and NetEase between April 19, 2006 and June 14, 2006. In total there are more than 310,000 web pages, 12 GB in size; after the body text is extracted, the final text is 470 MB. On these news pages, the meaningful string mining method of the invention extracts meaningful strings with a precision of up to 70.55%.

Corresponding to the Internet-oriented meaningful string mining method, the present invention also provides an Internet-oriented meaningful string mining system 400, as shown in FIG. 4, which comprises:

A repeated string discovery module 410, which processes the web page corpus into formatted plain text files, classifies the text files, records the strings that occur repeatedly in the text together with their occurrence frequencies, and filters out the strings whose occurrence count is below a certain threshold.

A context adjacency analysis module 420, which computes the context adjacency features of each repeated string, determines whether these features reach the set thresholds, and filters out the text strings that do not reach them.

A statistical language model analysis module 430, which scans the adjacent character pairs of each text string character by character, looks up the coupling degree of each adjacent pair, filters the text strings according to the coupling degree, and then filters further according to the positional word-formation probability of the text strings to obtain the meaningful strings.
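To make the division of work among the three modules concrete, the following is a minimal sketch of such a pipeline. It rests on several assumptions: repeated strings are found by naive substring counting rather than by the indexed expansion of the described method, adjacency entropy is the only context adjacency feature used, separators and document classification are ignored, the positional word-formation probability stage is omitted, and every function name, threshold, and coupling value is hypothetical.

```python
import math
from collections import Counter


def repeated_strings(text, max_len=4, min_freq=2):
    """Stand-in for module 410: count substrings of 2..max_len characters."""
    counts = Counter(
        text[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_freq}


def adjacency_entropy(text, s):
    """Stand-in for module 420: entropy of the characters left and right of s."""
    left, right = Counter(), Counter()
    start = text.find(s)
    while start != -1:
        left[text[start - 1] if start > 0 else "^"] += 1   # "^" marks text start
        end = start + len(s)
        right[text[end] if end < len(text) else "$"] += 1  # "$" marks text end
        start = text.find(s, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)


def min_coupling(s, coupling, default=0.0):
    """Stand-in for module 430: weakest coupling among adjacent character pairs."""
    return min(coupling.get(pair, default) for pair in zip(s, s[1:]))


def mine(text, coupling, min_freq=2, min_entropy=0.5, min_coupling_thr=0.3):
    """Run the three filters in order and return the surviving strings."""
    result = []
    for s in repeated_strings(text, min_freq=min_freq):
        left_h, right_h = adjacency_entropy(text, s)
        if min(left_h, right_h) < min_entropy:
            continue  # context too uniform: likely a fragment of a longer string
        if min_coupling(s, coupling) < min_coupling_thr:
            continue  # some adjacent pair is too loosely coupled
        result.append(s)
    return sorted(result)


if __name__ == "__main__":
    pairs = {("x", "a"): 0.9, ("a", "b"): 0.9, ("b", "c"): 0.9}
    print(mine("xabcyxabcz", pairs))
```

In the toy input, fragments such as "ab" or "abc" always have the same neighbor on one side, so their adjacency entropy on that side is zero and they are dropped, while the full string "xabc" has varied context on both sides and is kept.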

The Internet-oriented meaningful string mining system 400 of the present invention operates by the same process as the Internet-oriented meaningful string mining method; therefore, the system is not described again in the embodiments of the present invention.

Specific embodiments of the present invention have been described and illustrated above. These embodiments should be regarded as merely exemplary and are not intended to limit the present invention, which should be interpreted according to the appended claims.

Claims

1. An Internet-oriented meaningful string mining method, characterized by comprising the following steps:

step A, discovering repeated character strings;

step B, filtering the character strings through context adjacency analysis;

step C, filtering the character strings through language model analysis.

2. The Internet-oriented meaningful string mining method according to claim 1, wherein step A comprises the following step:

step A1, processing the web page corpus to obtain formatted plain text files, classifying the text files, recording the character strings that occur repeatedly in the text together with their occurrence frequencies, and filtering out the character strings whose occurrence count is less than a certain threshold.

3. The Internet-oriented meaningful string mining method according to claim 2, wherein step B comprises the following step:

step B1, calculating the context adjacency features of each repeated string, judging whether the features reach a set threshold, and filtering out the text strings that do not reach the threshold according to the judgment result.

4. The Internet-oriented meaningful string mining method according to claim 3, wherein step C comprises the following step:

step C1, scanning the adjacent character pairs of the text string character by character, looking up the coupling degree of the adjacent character pairs, filtering the text strings according to the coupling degree, and further filtering according to the positional word-formation probability of the text strings to obtain the meaningful strings.

5. The Internet-oriented meaningful string mining method according to claim 2, wherein step A1 comprises the following steps:

step A11, processing the web page corpus to obtain a formatted plain text file, and then converting the Chinese characters into corresponding IDs;

step A12, building an index over the processed ID sequence, expanding from the index information of each single character to obtain all repeated strings, writing the newly generated repeated strings to a file and continuing to expand them into longer strings, and iterating repeatedly until a separator symbol appears or the length reaches a specified threshold, at which point expansion stops;

step A13, recording the adjacent character information and the document information of each string, and storing each type of information in a separate file.

6. The Internet-oriented meaningful string mining method according to claim 3, wherein step B1 comprises the following steps:

step B11, calculating the context adjacency features of each repeated string, and judging whether the features reach a set threshold;

step B12, if the threshold is reached, proceeding to step C;

step B13, if the features do not reach the threshold, filtering out the repeated string.

7. The Internet-oriented meaningful string mining method according to claim 4, wherein step C1 comprises the following steps:

step C11, annotating a part of the training corpus to generate a dictionary of adjacent-character coupling degrees and a dictionary of single-character positional word-formation probabilities;

step C12, scanning the adjacent character pairs character by character, and looking up the coupling degree of the adjacent character pairs;

step C13, when the coupling degree of an adjacent character pair is smaller than a set threshold, determining that the adjacent character pair does not form part of a word, and filtering out the string as a garbage string;

step C14, for the character strings not filtered out on the basis of adjacent character pairs, looking up the positional word-formation probability of the single characters, and judging whether the head and the tail of the character string contain common function characters;

step C15, if a function character is present, filtering out the character string;

step C16, determining the character strings that have not been filtered out to be meaningful strings.

8. The Internet-oriented meaningful string mining method according to claim 4, wherein step C1 comprises the following steps:

step C11', annotating a part of the training corpus to generate a dictionary of adjacent-character coupling degrees and a dictionary of single-character positional word-formation probabilities;

step C12', extracting the first character pair of the character string and judging its two-character coupling degree, and if the coupling degree is greater than a threshold, considering that the character pair is tightly bound and forms the head of a word, and no longer judging the single-character positional word-formation probability of the first character.

9. An Internet-oriented meaningful string mining system, comprising:

a repeated string discovery module, used for processing the web page corpus to obtain formatted plain text files, classifying the text files, recording the character strings that occur repeatedly in the text together with their occurrence frequencies, and filtering out the character strings whose occurrence count is less than a certain threshold;

a context adjacency analysis module, used for calculating the context adjacency features of each repeated string, judging whether the features reach a set threshold, and filtering out the text strings that do not reach the threshold according to the judgment result;

and a statistical language model analysis module, used for scanning the adjacent character pairs of the text string character by character, looking up the coupling degree of the adjacent character pairs, and filtering the text strings according to the coupling degree to obtain the meaningful strings.

10. The system of claim 9, wherein the statistical language model analysis module is further configured to, after scanning the adjacent character pairs, filter the text strings according to their positional word-formation probability to obtain the meaningful strings.

11. The Internet-oriented meaningful string mining system according to claim 9 or 10, wherein the context adjacency feature is one or more of an adjacency set, an adjacency category, an adjacency entropy, an adjacency pair set, an adjacency pair category, and an adjacency pair entropy.

12. The Internet-oriented meaningful string mining system according to claim 9 or 10, wherein the character strings that occur repeatedly in the text and their occurrence frequencies are recorded by performing repeated string discovery with a suffix tree algorithm, a Sequitur algorithm, an n-ary incremental distribution algorithm, or a modified n-ary incremental distribution algorithm.