CN109783819B - Regular expression generation method and system - Google Patents
- Publication number
- CN109783819B (CN201910046964.2A)
- Authority
- CN
- China
- Prior art keywords
- corpus information
- sentence
- regular expression
- current corpus
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Landscapes
- Machine Translation (AREA)
Abstract
The invention belongs to the field of data processing and discloses a method and system for generating regular expressions. The method comprises: obtaining current corpus information; performing grammatical analysis on the current corpus information and extracting its sentence main body; obtaining the semantic slots of the words in the sentence main body; and generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information. The invention generates regular expressions automatically from the sentence structure and the parts of speech of the words, so nobody has to write them by hand from rules inferred from the meaning of the sentence, which saves labor and improves efficiency.
Description
Technical field
The invention belongs to the field of data processing, and in particular relates to a method and system for generating regular expressions.
Background
With the rapid development of network technology, a large volume of information is produced every day and has to be processed. Regular expressions are traditionally written by hand, following the steps "review the corpus → identify the keywords in the corpus → compile the lexicon → write the regular expression"; that is, a person writes them from rules inferred from the meaning of the sentences. The process is complicated, and reviewing the corpus and writing expressions by hand is slow. Relying entirely on hand-written regular expressions cannot keep up, promptly and accurately, with the large volume of data added each day, and writing them by hand also places high demands on staff.
A method is therefore urgently needed by which a system automatically writes the regular expressions corresponding to a corpus from the corpus information.
Summary of the invention
The object of the invention is to provide a method and system that generate regular expressions automatically, which saves labor and improves efficiency.
The technical solution provided by the invention is as follows:
In one aspect, a method for generating regular expressions is provided, comprising:
obtaining current corpus information;
performing grammatical analysis on the current corpus information and extracting the sentence main body of the current corpus information;
obtaining the semantic slots of the words in the sentence main body;
generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, performing grammatical analysis on the current corpus information and extracting its sentence main body specifically comprises:
segmenting the current corpus information into words, obtaining the words in the current corpus information and their parts of speech;
analysing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
extracting the sentence main body of the current corpus information according to the sentence structure.
Further preferably, generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the sentence main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, generating the regular expression.
Further preferably, generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information specifically comprises:
replacing the words of the sentence main body in the current corpus information with the corresponding semantic slots;
ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but has the same semantics.
Further preferably, generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information further comprises:
after the regular expression is generated, adding a connective to it, generating another regular expression with the same semantics.
In another aspect, a system for generating regular expressions is provided, comprising:
a corpus information acquisition module, configured to obtain current corpus information;
a sentence main body extraction module, configured to perform grammatical analysis on the current corpus information and extract the sentence main body of the current corpus information;
a semantic slot acquisition module, configured to obtain the semantic slots of the words in the sentence main body;
a regular expression generation module, configured to generate a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Further preferably, the sentence main body extraction module comprises:
a word segmentation unit, configured to segment the current corpus information into words, obtaining the words in the current corpus information and their parts of speech;
a sentence pattern analysis unit, configured to analyse the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
a sentence main body extraction unit, configured to extract the sentence main body of the current corpus information according to the sentence structure.
Further preferably, the regular expression generation module comprises:
a replacement unit, configured to replace the words of the sentence main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, generating the regular expression.
Further preferably, the regular expression generation module comprises:
a replacement unit, configured to replace the words of the sentence main body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit, configured to order the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but has the same semantics.
Further preferably, the regular expression generation unit is further configured to add a connective to the regular expression after it is generated, generating another regular expression with the same semantics.
Compared with the prior art, the method and system for generating regular expressions provided by the invention have the following beneficial effects:
1. After obtaining corpus information, the invention first analyses its sentence pattern and extracts the sentence main body, such as the subject, predicate and object. It then converts the words of the sentence main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. The regular expression is generated automatically from the sentence structure and the parts of speech of the words, so nobody has to write it by hand from rules inferred from the meaning of the sentence, which saves labor and improves efficiency.
2. In a preferred embodiment of the invention, permuting and combining the match items of a regular expression allows several regular expressions with the same semantics to be generated from a single piece of corpus information, which improves the efficiency of regular expression generation.
Brief description of the drawings
Preferred embodiments are described below in a clear and easily understood manner with reference to the drawings, to further explain the above characteristics, technical features, advantages and implementation of the method and system for generating regular expressions.
Figure 1 is a flow chart of a first embodiment of the regular expression generation method of the invention;
Figure 2 is a flow chart of a second embodiment of the regular expression generation method of the invention;
Figure 3 is a flow chart of a third embodiment of the regular expression generation method of the invention;
Figure 4 is a flow chart of a fourth embodiment of the regular expression generation method of the invention;
Figure 5 is a flow chart of a fifth embodiment of the regular expression generation method of the invention;
Figure 6 is a flow chart of a sixth embodiment of the regular expression generation method of the invention;
Figure 7 is a schematic structural block diagram of the regular expression generation system of the invention.
Description of reference numerals
100. corpus information acquisition module; 200. sentence main body extraction module;
210. word segmentation unit; 220. sentence pattern analysis unit;
230. sentence main body extraction unit; 300. semantic slot acquisition module;
400. regular expression generation module; 410. replacement unit;
420. regular expression generation unit.
Detailed description
To explain the embodiments of the invention and the technical solutions of the prior art more clearly, specific embodiments of the invention are described below with reference to the drawings. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings, and other embodiments, from them without creative effort.
To keep the drawings concise, each figure shows only the parts relevant to the invention schematically, and these do not represent the actual structure of a product. In addition, to keep the drawings concise and easy to understand, where several components in a figure have the same structure or function, only one of them is drawn or labelled. In this document, "one" does not only mean "exactly one"; it can also mean "more than one".
According to a first embodiment of the invention, as shown in Figure 1, a method for generating regular expressions comprises:
S100: obtaining current corpus information;
S200: performing grammatical analysis on the current corpus information and extracting the sentence main body of the current corpus information;
S300: obtaining the semantic slots of the words in the sentence main body;
S400: generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, the invention obtains a large amount of corpus information and generates a large number of regular expressions from it. A regular expression is a pattern used to describe or match a set of strings that conform to a given syntactic rule. This embodiment takes a single piece of corpus information as an example to explain how its regular expression is generated.
Corpus information may be text, such as a sentence typed by a user or a sentence from a book; it may also be speech input by a user or recorded audio. This embodiment is described using the obtained current corpus information as the example.
After the current corpus information is obtained, it is analysed grammatically and its sentence main body is extracted, for example by extracting the subject, predicate, object and attributive of the current corpus information. For example, if the current corpus information is "鲸鱼为什么会喷水" ("Why do whales spout water?"), the extracted sentence main body is "鲸鱼喷水" ("whales spout water"), where "鲸鱼" (whale) is the subject and "喷水" (spout water) is the predicate.
After the sentence main body is extracted, each of its words is converted into the corresponding semantic slot according to the word's part of speech. A semantic slot may be the set of all words with the same part of speech as the word, or the set of words with the same meaning as the word. For example, in the sentence main body "鲸鱼喷水", "鲸鱼" is a noun and "喷水" is a verb, so the semantic slot for "鲸鱼" may be the noun lexicon (名词库) and the semantic slot for "喷水" may be the verb lexicon (动词库).
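The two kinds of semantic slot described above can be sketched as two small lookups. This is only an illustration: the tag names `n` and `v`, the function names, and the synonym entries are assumptions made up for the example and do not come from the patent.

```python
# Hypothetical part-of-speech tags mapped to lexicon-style slot names.
POS_TO_SLOT = {"n": "名词库", "v": "动词库"}

# Made-up synonym sets, used only to illustrate the "same meaning" kind of slot.
SYNONYMS = {"鲸鱼": ["鲸鱼", "鲸"]}


def pos_slot(pos):
    """Return the name of the lexicon that holds every word with this part of speech."""
    return POS_TO_SLOT[pos]


def synonym_slot(word):
    """Return the words treated as having the same meaning as `word`."""
    return SYNONYMS.get(word, [word])


print(pos_slot("n"), pos_slot("v"))
print(synonym_slot("鲸鱼"))
```

A part-of-speech slot generalises widely (any noun can fill it), while a synonym slot keeps the expression close to the original sentence; which one fits depends on how strict the match should be.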
Once the sentence main body and the semantic slots of its words are available, the regular expression for the current corpus information can be generated from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
For example, the current corpus information is "鲸鱼为什么会喷水" and the extracted sentence main body is "鲸鱼喷水". The semantic slot for "鲸鱼" is the noun lexicon, the semantic slot for "喷水" is the verb lexicon, and the remaining non-main-body part is "为什么会" ("why ... can"). The regular expression generated from this information is "##名词库##[为什么][会]##动词库二##".
After obtaining corpus information, the invention first analyses its sentence pattern and extracts the sentence main body, such as the subject, predicate and object. It then converts the words of the sentence main body into the corresponding semantic slots, and finally generates a regular expression from those semantic slots and the remaining non-main-body part of the corpus information. The regular expression is generated automatically from the sentence structure and the parts of speech of the words, so nobody has to write it by hand from rules inferred from the meaning of the sentence, which saves labor and improves efficiency.
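The text does not define the syntax of the generated expression, so the following is a sketch under an assumed reading: `##` separates segments, a segment that names a known lexicon is a slot, and bracketed items such as `[为什么]` are literal words. Under that assumption the expression from the example can be turned into an executable Python pattern. The function name `compile_template` and the lexicon contents are hypothetical; the slot name 动词库二 is copied verbatim from the printed expression.

```python
import re

BRACKETED = re.compile(r"\[(.+?)\]")


def compile_template(template, lexicons):
    """Translate a ##slot##[literal] template into a compiled Python regex.

    Assumed reading (not defined in the source): '##' separates segments,
    a segment naming a lexicon is a slot, anything else is literal text.
    """
    parts = []
    for segment in template.split("##"):
        segment = segment.strip()
        if not segment:
            continue
        if segment in lexicons:
            # Longest words first so that longer entries win in the alternation.
            words = sorted(lexicons[segment], key=len, reverse=True)
            parts.append("(?:" + "|".join(re.escape(w) for w in words) + ")")
        else:
            literals = BRACKETED.findall(segment) or [segment]
            parts.extend(re.escape(token) for token in literals)
    return re.compile("".join(parts))


# Made-up lexicon contents for illustration only.
lexicons = {"名词库": ["鲸鱼", "海豚"], "动词库二": ["喷水", "游泳"]}
pattern = compile_template("##名词库##[为什么][会]##动词库二##", lexicons)
print(bool(pattern.fullmatch("海豚为什么会游泳")))
```

With these lexicons the compiled pattern accepts any "noun + 为什么 + 会 + verb" sentence built from the listed words and rejects sentences that lack the literal part.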
According to a second embodiment of the invention, as shown in Figure 2, a method for generating regular expressions comprises:
S100: obtaining current corpus information;
S210: segmenting the current corpus information into words, obtaining the words in the current corpus information and their parts of speech;
S220: analysing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
S230: extracting the sentence main body of the current corpus information according to the sentence structure;
S300: obtaining the semantic slots of the words in the sentence main body;
S400: generating a regular expression from the sentence main body, the semantic slots, and the remaining non-main-body part of the current corpus information.
Specifically, the sentence main body in the first embodiment above may be extracted as follows: first segment the current corpus information to obtain the parts of speech of its words; then derive the sentence structure of the current corpus information from grammar rules and those parts of speech; finally extract the sentence main body according to that sentence structure.
Segmenting the current corpus information means dividing it into individual characters or words. For example, "不知道你在说什么" ("I don't know what you are saying") is divided into "不知道, 你, 在, 说什么", and "鲸鱼为什么会喷水" is divided into "鲸鱼, 为什么, 会, 喷水".
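The patent does not say which segmentation algorithm is used. As a stand-in, the sketch below uses forward maximum matching against a small vocabulary, a simple dictionary-based technique chosen here only for illustration; the function name `segment` and the vocabularies are assumptions.

```python
def segment(text, vocabulary):
    """Split `text` greedily, always taking the longest vocabulary word
    that starts at the current position (forward maximum matching).
    Characters not covered by the vocabulary become single-character tokens."""
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


print(segment("鲸鱼为什么会喷水", {"鲸鱼", "为什么", "会", "喷水"}))
print(segment("不知道你在说什么", {"不知道", "你", "在", "说什么"}))
```

A production system would more likely use a statistical segmenter that also returns part-of-speech tags; greedy matching is shown only because it is short and deterministic.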
After segmentation, the resulting words are analysed to obtain their parts of speech. For example, segmenting "鲸鱼为什么会喷水" yields "鲸鱼" (noun), "为什么" (pronoun), "会" (auxiliary verb) and "喷水" (verb). The sentence pattern of the current corpus information is then analysed according to the grammar rules and these parts of speech, which gives the sentence structure "subject + adverbial + predicate" for "鲸鱼为什么会喷水". Analysing the corpus information against this structure shows that "鲸鱼喷水" is a subject-predicate structure, while "为什么喷水" and "会喷水" are both adverbial-head structures. The main structure of "鲸鱼为什么会喷水" is therefore the subject-predicate structure "鲸鱼喷水", and the sentence main body extracted from "鲸鱼为什么会喷水" is "鲸鱼喷水".
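The analysis above can be imitated with a toy lookup. A real implementation would need a tagger and a parser; here the part-of-speech table, the tag names, and the one-to-one mapping from tag to grammatical role are all assumptions chosen so that the example sentence works.

```python
# Hypothetical part-of-speech table for the example sentence.
POS = {"鲸鱼": "n", "为什么": "r", "会": "aux", "喷水": "v"}

# Toy rule: each tag maps to exactly one grammatical role.
ROLE_BY_POS = {"n": "subject", "r": "adverbial", "aux": "adverbial", "v": "predicate"}

# Roles that belong to the sentence main body in this sketch.
BODY_ROLES = {"subject", "predicate"}


def analyse(tokens):
    """Tag each token, then split the tokens into main body and remainder."""
    tagged = [(tok, POS[tok], ROLE_BY_POS[POS[tok]]) for tok in tokens]
    body = [tok for tok, _, role in tagged if role in BODY_ROLES]
    rest = [tok for tok, _, role in tagged if role not in BODY_ROLES]
    return tagged, body, rest


tagged, body, rest = analyse(["鲸鱼", "为什么", "会", "喷水"])
print(body, rest)
```

The split keeps the original word order in both lists, which matters later when the slots and the remaining words are put back into sentence order.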
According to a third embodiment of the invention, as shown in Figure 3, a method for generating regular expressions comprises:
S100: obtaining current corpus information;
S210: segmenting the current corpus information into words, obtaining the words in the current corpus information and their parts of speech;
S220: analysing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
S230: extracting the sentence main body of the current corpus information according to the sentence structure;
S300: obtaining the semantic slots of the words in the sentence main body;
S410: replacing the words of the sentence main body in the current corpus information with the corresponding semantic slots;
S420: ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, generating the regular expression.
Specifically, after the sentence main body of the current corpus information has been extracted by the method of the embodiments above and the semantic slots of its words have been obtained, the non-main-body part of the segmented corpus information is retained, the words of the sentence main body are replaced with the corresponding semantic slots, and the semantic slots and the non-main-body part are ordered according to the corpus information's own sentence structure. This yields the regular expression for the current corpus information.
For example, the current corpus information is "鲸鱼为什么会喷水" and the sentence main body is "鲸鱼喷水". The semantic slot for the word "鲸鱼" is the noun lexicon and the semantic slot for the word "喷水" is the verb lexicon; the non-main-body part remaining after segmentation is "为什么" and "会". Ordering the non-main-body words "为什么" and "会" together with the semantic slots "名词库" (noun lexicon) and "动词库" (verb lexicon) according to the sentence structure of the current corpus information gives "名词库", "为什么", "会", "动词库". These four items are the match items of the regular expression, and inserting the regular expression symbols between them produces the regular expression for the current corpus information: "##名词库##[为什么][会]##动词库二##".
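Steps S410 and S420 can be sketched as one pass over the tagged tokens. The rendering convention (`##` around and between segments, brackets around retained words) is inferred from the printed expressions and is an assumption, as are the tag names and the function name `build_template`. The sketch names the verb slot 动词库, as in the prose, whereas the printed expression shows 动词库二.

```python
# Hypothetical mapping from part-of-speech tag to slot name.
POS_TO_SLOT = {"n": "名词库", "v": "动词库"}


def build_template(tagged, body):
    """Replace main-body words by their slots and keep the other words as
    bracketed literals, preserving the order of the original sentence."""
    segments = []
    prev_literal = False
    for word, pos in tagged:
        if word in body:
            segments.append(POS_TO_SLOT[pos])
            prev_literal = False
        elif prev_literal:
            # Consecutive retained words share one segment: [a][b]
            segments[-1] += f"[{word}]"
        else:
            segments.append(f"[{word}]")
            prev_literal = True
    return "##" + "##".join(segments) + "##"


tagged = [("鲸鱼", "n"), ("为什么", "r"), ("会", "aux"), ("喷水", "v")]
print(build_template(tagged, {"鲸鱼", "喷水"}))
```

Because the tokens are visited in their original order, the resulting template follows the sentence structure of the corpus information without a separate sorting step.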
According to a fourth embodiment of the invention, as shown in Figure 4, a method for generating regular expressions comprises:
S100: obtaining current corpus information;
S210: segmenting the current corpus information into words, obtaining the words in the current corpus information and their parts of speech;
S220: analysing the sentence pattern of the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
S230: extracting the sentence main body of the current corpus information according to the sentence structure;
S300: obtaining the semantic slots of the words in the sentence main body;
S410: replacing the words of the sentence main body in the current corpus information with the corresponding semantic slots;
S430: ordering the remaining non-main-body part of the segmented current corpus information and the semantic slots according to grammatical structure, generating at least one regular expression that differs in order but has the same semantics.
Specifically, this embodiment differs from the third embodiment in the final ordering step. The sentence main body is extracted by the method of the first or second embodiment, the semantic slots of its words are obtained, the non-main-body part of the segmented corpus information is retained, and the words of the sentence main body are replaced with the corresponding semantic slots, as before. The semantic slots and the non-main-body part are then ordered according to grammatical structure, rather than only according to the corpus information's own sentence structure, generating at least one regular expression that differs in order but has the same semantics.
For example, the current corpus information is "鲸鱼为什么会喷水" and the sentence main body is "鲸鱼喷水". The semantic slot for the word "鲸鱼" is the noun lexicon and the semantic slot for the word "喷水" is the verb lexicon; the non-main-body part remaining after segmentation is "为什么" and "会". Provided the semantics stay the same, ordering the non-main-body words "为什么" and "会" and the semantic slots "名词库" and "动词库" according to grammatical structure gives two orderings: "名词库", "为什么", "会", "动词库", and "为什么", "名词库", "会", "动词库".
The regular expression obtained from the ordering "名词库", "为什么", "会", "动词库" is "##名词库##[为什么][会]##动词库二##". The regular expression obtained from the ordering "为什么", "名词库", "会", "动词库" is "##[为什么] ##名词库##[会]##动词库二##". By permuting and combining the match items of a regular expression, this embodiment generates several regular expressions with the same semantics from a single piece of corpus information, which improves the efficiency of regular expression generation.
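One way to picture the reordering step is to enumerate all permutations of the match items and keep only those whose part-of-speech sequence is on a list of patterns considered grammatical. The patent does not say how admissible orderings are decided, so the whitelist `ALLOWED`, the tag names, and the function name `orderings` are assumptions; the whitelist simply encodes the two orderings given in the example.

```python
from itertools import permutations

# Hypothetical whitelist of part-of-speech sequences treated as grammatical
# and meaning-preserving; it encodes the two orderings from the example.
ALLOWED = {("n", "r", "aux", "v"), ("r", "n", "aux", "v")}


def orderings(items):
    """Return every ordering of (text, pos) items whose tag sequence is allowed."""
    result = []
    for perm in permutations(items):
        if tuple(pos for _, pos in perm) in ALLOWED:
            result.append([text for text, _ in perm])
    return result


items = [("名词库", "n"), ("为什么", "r"), ("会", "aux"), ("动词库", "v")]
for order in orderings(items):
    print(order)
```

Enumerating permutations grows factorially with the number of match items, so a rule-driven generator would be preferable for long sentences; the brute-force form is used here only because the example has four items.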
According to a fifth embodiment of the present invention, as shown in Figure 5, a regular expression generation method includes:
S100: obtain the current corpus information;
S210: perform word segmentation on the current corpus information to obtain its words and their parts of speech;
S220: perform sentence-pattern analysis on the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
S230: extract the sentence body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the sentence body;
S410: replace the words of the sentence body in the current corpus information with the corresponding semantic slots;
S420: order the non-body parts remaining in the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, generating a regular expression;
S440: after the regular expression is generated, add a connective to it to generate another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but share the same meaning. To cover this case without changing the intent of the current corpus information, a connective (such as 把 or 被) is added to the generated regular expression, and the matching items of the generated regular expression are then rearranged to produce another regular expression with the same semantics.
For example, the current corpus information is “给老师词典” ("give the teacher a dictionary"). “给” (give) is a verb, so its semantic slot is the verb library (动词库); “老师” (teacher) is a noun, so its semantic slot is the noun library (名词库); “词典” (dictionary) is a noun, so its semantic slot is also the noun library. The generated regular expression is “##动词库##名词库##名词库##”. Adding the connective “把” to the current corpus information “给老师词典” turns it into “把词典给老师”. Accordingly, adding the connective “把” to the generated regular expression “##动词库##名词库##名词库##” produces another regular expression with the same semantics, “把##名词库##动词库##名词库##”.
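Step S440 for this example can be sketched as a small transformation on the list of matching items. The item representation, the function names, and the fixed "verb, indirect object, direct object" input shape are assumptions chosen to fit this one example; the source does not describe how the rearrangement is implemented or how other connectives such as 被 are handled.

```python
def render(items):
    """Join items into a template: slots as ##name##, connectives as bare text."""
    out = "".join(f"##{t}##" if k == "slot" else t for k, t in items)
    return out.replace("####", "##")  # adjacent slots share one delimiter


def ba_variant(items):
    """Rearrange [verb, indirect object, direct object] into the 把 pattern:
    把 + direct object + verb + indirect object (hypothetical helper)."""
    verb, indirect_obj, direct_obj = items
    return [("lit", "把"), direct_obj, verb, indirect_obj]


# 给 (verb) 老师 (noun) 词典 (noun), already replaced by their slots.
base = [("slot", "动词库"), ("slot", "名词库"), ("slot", "名词库")]

print(render(base))              # template for 给老师词典
print(render(ba_variant(base)))  # template for 把词典给老师
```

Because both objects map to the same slot name here, the rearrangement is only visible in the position of the verb slot and the leading 把.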
According to a sixth embodiment of the present invention, as shown in Figure 6, a regular expression generation method includes:
S100: obtain the current corpus information;
S210: perform word segmentation on the current corpus information to obtain its words and their parts of speech;
S220: perform sentence-pattern analysis on the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
S230: extract the sentence body of the current corpus information according to the sentence structure;
S300: obtain the semantic slots of the words in the sentence body;
S410: replace the words of the sentence body in the current corpus information with the corresponding semantic slots;
S430: order the non-body parts remaining in the segmented current corpus information and the semantic slots according to the grammatical structure, generating at least one regular expression whose ordering differs but whose semantics are the same;
S440: after the regular expression is generated, add a connective to it to generate another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but share the same meaning. To cover this case without changing the intent of the current corpus information, a connective (such as 把 or 被) is added to the generated regular expression, and the matching items of the generated regular expression are then rearranged to produce another regular expression with the same semantics.
For example, the current corpus information is “给老师词典” ("give the teacher a dictionary"). “给” (give) is a verb, so its semantic slot is the verb library (动词库); “老师” (teacher) is a noun, so its semantic slot is the noun library (名词库); “词典” (dictionary) is a noun, so its semantic slot is also the noun library. The generated regular expression is “##动词库##名词库##名词库##”. Adding the connective “把” to the current corpus information “给老师词典” turns it into “把词典给老师”. Accordingly, adding the connective “把” to the generated regular expression “##动词库##名词库##名词库##” produces another regular expression with the same semantics, “把##名词库##动词库##名词库##”.
According to a seventh embodiment of the present invention, as shown in Figure 7, a regular expression generation system includes:
a corpus information acquisition module 100, configured to obtain the current corpus information;
a sentence body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract its sentence body;
a semantic slot acquisition module 300, configured to obtain the semantic slots of the words in the sentence body;
a regular expression generation module 400, configured to generate a regular expression from the sentence body, the semantic slots, and the non-body parts remaining in the current corpus information.
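The four modules could be wired together roughly as follows. This is only a sketch: the class name `RegexGenerationSystem`, the callable signatures, and the stub implementations passed in at the bottom are all assumptions, since the source names the modules and their responsibilities but not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); tag names are hypothetical


@dataclass
class RegexGenerationSystem:
    acquire: Callable[[], str]                                       # module 100
    extract_body: Callable[[str], Tuple[List[Token], List[Token]]]   # module 200
    get_slot: Callable[[Token], str]                                 # module 300
    generate: Callable[[List[Token], Dict[str, str]], str]           # module 400

    def run(self) -> str:
        text = self.acquire()
        # Module 200 returns every token plus the subset forming the sentence body.
        tokens, body = self.extract_body(text)
        slots = {word: self.get_slot((word, pos)) for word, pos in body}
        return self.generate(tokens, slots)


# Stub modules for the whale example; real modules would segment and parse.
system = RegexGenerationSystem(
    acquire=lambda: "鲸鱼为什么会喷水",
    extract_body=lambda text: (
        [("鲸鱼", "n"), ("为什么", "r"), ("会", "aux"), ("喷水", "v")],
        [("鲸鱼", "n"), ("喷水", "v")],
    ),
    get_slot=lambda tok: {"n": "名词库", "v": "动词库"}[tok[1]],
    generate=lambda tokens, slots: "".join(
        f"##{slots[w]}##" if w in slots else f"[{w}]" for w, _ in tokens
    ),
)
print(system.run())
```

Passing the modules in as callables keeps each one replaceable, which matches the way the later embodiments swap in different generation behaviour for unit 420.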
Specifically, the present invention obtains a large amount of corpus information and generates a large number of regular expressions from it. A regular expression describes or matches a set of strings that conform to a given syntactic rule. This embodiment takes a single piece of corpus information as an example to explain how its regular expression is generated.
Corpus information may be text, such as a sentence typed by a user or a sentence from a book, or it may be speech input by a user, recorded audio, and the like. This embodiment is described using the obtained current corpus information as the example.
Once the current corpus information is obtained, it is parsed and its sentence body is extracted, for example its subject, predicate, object, and attributive. For example, if the current corpus information is “鲸鱼为什么会喷水” ("Why do whales spout water?"), the extracted sentence body is “鲸鱼喷水” ("whales spout water"), where “鲸鱼” (whale) is the subject and “喷水” (spout water) is the predicate.
After the sentence body is extracted, each of its words is converted into a semantic slot according to the word's part of speech. A semantic slot may be the set of all words with that part of speech, or the set of words with the same meaning as the word. For example, in the sentence body “鲸鱼喷水”, “鲸鱼” is a noun and “喷水” is a verb, so the semantic slot for “鲸鱼” may be the noun library (名词库) and the semantic slot for “喷水” may be the verb library (动词库).
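The part-of-speech variant of slot assignment amounts to a table lookup, sketched below. The tag names `"n"` and `"v"` and the names `SLOT_BY_POS` and `slots_for` are assumptions; the synonym-based variant mentioned above would need a separate synonym resource and is not shown.

```python
# Hypothetical mapping from part-of-speech tag to slot (library) name.
SLOT_BY_POS = {"n": "名词库", "v": "动词库"}


def slots_for(body):
    """Map each word of the sentence body to its slot name.

    `body` is a list of (word, pos) pairs; words whose tag has no
    library are simply left out of the result."""
    return {word: SLOT_BY_POS[pos] for word, pos in body if pos in SLOT_BY_POS}


print(slots_for([("鲸鱼", "n"), ("喷水", "v")]))
```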
With the sentence body and the semantic slots of its words in hand, the regular expression for the current corpus information can be generated from the sentence body, the semantic slots, and the non-body parts remaining in the current corpus information.
For example, the current corpus information is “鲸鱼为什么会喷水”, the extracted sentence body is “鲸鱼喷水”, the semantic slot for “鲸鱼” is the noun library, the semantic slot for “喷水” is the verb library, and the remaining non-body part is “为什么会” ("why can"). The regular expression generated from this information is “##名词库##[为什么][会]##动词库二##”.
After obtaining corpus information, the present invention first analyzes its sentence pattern and extracts the sentence body, such as the subject, predicate and object; it then converts the words of the sentence body into the corresponding semantic slots; finally it generates a regular expression from those semantic slots and the non-body parts remaining in the corpus information. Because the regular expression is generated automatically from the sentence structure and the parts of speech of the words, nobody has to write it by hand from rules inferred from the sentence's meaning, which saves labor and is more efficient.
Preferably, the sentence body extraction module 200 includes:
a word segmentation unit 210, configured to perform word segmentation on the current corpus information to obtain its words and their parts of speech;
a sentence-pattern analysis unit 220, configured to perform sentence-pattern analysis on the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
a sentence body extraction unit 230, configured to extract the sentence body of the current corpus information according to the sentence structure.
Specifically, in the first embodiment above, the sentence body of the current corpus information may be extracted as follows: first segment the current corpus information to obtain the parts of speech of its words; then derive its sentence structure from grammar rules and those parts of speech; finally extract its sentence body according to that sentence structure.
Word segmentation means splitting the current corpus information into individual characters or words. For example, “不知道你在说什么” ("I don't know what you are saying") is split into “不知道,你,在,说什么”, and “鲸鱼为什么会喷水” is split into “鲸鱼,为什么,会,喷水”.
After segmentation, the resulting words are analyzed to obtain their parts of speech. For example, segmenting “鲸鱼为什么会喷水” yields “鲸鱼” (noun), “为什么” (pronoun), “会” (auxiliary verb), and “喷水” (verb). Sentence-pattern analysis is then performed according to grammar rules and these parts of speech, which gives the sentence structure "subject + adverbial + predicate" (主+状+谓) for “鲸鱼为什么会喷水”. Analyzing the corpus information against this structure shows that “鲸鱼喷水” is a subject-predicate structure, while “为什么喷水” and “会喷水” are adverbial-head (状中) structures. The main structure of “鲸鱼为什么会喷水” is therefore the subject-predicate structure “鲸鱼喷水”, and that is the sentence body extracted from it.
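The segmentation and body-extraction steps can be imitated on the whale example with a toy lexicon. This is a deliberately crude stand-in and rests on several assumptions: the four-entry `LEXICON`, its tag names (`n`, `r`, `aux`, `v`, with `x` for unknown characters), forward maximum matching as the segmentation strategy, and the rule "nouns and verbs form the body", which replaces the real sentence-pattern analysis and would fail on most other sentences.

```python
# Hypothetical lexicon: word -> part-of-speech tag.
LEXICON = {"鲸鱼": "n", "为什么": "r", "会": "aux", "喷水": "v"}
MAX_LEN = max(map(len, LEXICON))


def segment(text):
    """Forward maximum matching: at each position take the longest lexicon
    word, falling back to a single character tagged 'x'."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in LEXICON or size == 1:
                tokens.append((word, LEXICON.get(word, "x")))
                i += size
                break
    return tokens


def split_body(tokens):
    """Toy rule standing in for parsing: nouns and verbs form the sentence
    body, everything else is a non-body part."""
    body = [t for t in tokens if t[1] in ("n", "v")]
    rest = [t for t in tokens if t[1] not in ("n", "v")]
    return body, rest


tokens = segment("鲸鱼为什么会喷水")
body, rest = split_body(tokens)
print([w for w, _ in tokens])
print("".join(w for w, _ in body), [w for w, _ in rest])
```

A production system would use a proper Chinese segmenter, tagger, and parser in place of both functions; only the shape of the data passed between the units is meant to carry over.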
Preferably, the regular expression generation module 400 includes:
a replacement unit 410, configured to replace the words of the sentence body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit 420, configured to order the non-body parts remaining in the segmented current corpus information and the semantic slots according to the sentence structure of the current corpus information, generating a regular expression.
Specifically, after the sentence body of the current corpus information has been extracted by the method of the above embodiment and the semantic slots of its words have been obtained, the non-body parts of the segmented current corpus information are retained, the words of the sentence body are replaced with their semantic slots, and the semantic slots and non-body parts are then ordered according to the sentence structure of the current corpus information itself to generate the corresponding regular expression.
For example, the current corpus information is “鲸鱼为什么会喷水” and the sentence body is “鲸鱼喷水”. The semantic slot for the word “鲸鱼” in the sentence body is the noun library (名词库), the semantic slot for the word “喷水” is the verb library (动词库), and the non-body parts remaining in the segmented corpus information are “为什么” and “会”. Ordering the non-body parts “为什么” and “会” together with the semantic slots 名词库 and 动词库 according to the sentence structure of the current corpus information gives 名词库, 为什么, 会, 动词库. These four are the matching items of the regular expression; adding regular expression symbols between the matching items generates the regular expression for the current corpus information, “##名词库##[为什么][会]##动词库二##”.
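Replacement and ordering in the original word order reduce to one pass over the segmented words, as in this sketch. The function name `build_template`, the delimiter convention (slots as `##name##`, retained words as `[word]`), and the plain slot name 动词库 are assumptions for illustration, so the output follows the shape of the expression above rather than reproducing it exactly.

```python
def build_template(words, slot_by_word):
    """Walk the segmented words in their original order; body words become
    ##slot## markers and non-body words are kept as [word]."""
    parts = []
    for word in words:
        if word in slot_by_word:
            parts.append(f"##{slot_by_word[word]}##")
        else:
            parts.append(f"[{word}]")
    # Adjacent slots share one ## delimiter.
    return "".join(parts).replace("####", "##")


words = ["鲸鱼", "为什么", "会", "喷水"]
slots = {"鲸鱼": "名词库", "喷水": "动词库"}
print(build_template(words, slots))
```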
Preferably, the regular expression generation unit 420 is further configured to add a connective to the generated regular expression after generating it, producing another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but share the same meaning. To cover this case without changing the intent of the current corpus information, a connective (such as 把 or 被) is added to the generated regular expression, and the matching items of the generated regular expression are then rearranged to produce another regular expression with the same semantics.
For example, the current corpus information is “给老师词典” ("give the teacher a dictionary"). “给” (give) is a verb, so its semantic slot is the verb library (动词库); “老师” (teacher) is a noun, so its semantic slot is the noun library (名词库); “词典” (dictionary) is a noun, so its semantic slot is also the noun library. The generated regular expression is “##动词库##名词库##名词库##”. Adding the connective “把” to the current corpus information “给老师词典” turns it into “把词典给老师”. Accordingly, adding the connective “把” to the generated regular expression “##动词库##名词库##名词库##” produces another regular expression with the same semantics, “把##名词库##动词库##名词库##”.
According to an eighth embodiment of the present invention, as shown in Figure 7, a regular expression generation system includes:
a corpus information acquisition module 100, configured to obtain the current corpus information;
a sentence body extraction module 200, configured to perform syntactic analysis on the current corpus information and extract its sentence body;
a semantic slot acquisition module 300, configured to obtain the semantic slots of the words in the sentence body;
a regular expression generation module 400, configured to generate a regular expression from the sentence body, the semantic slots, and the non-body parts remaining in the current corpus information.
The sentence body extraction module 200 includes:
a word segmentation unit 210, configured to perform word segmentation on the current corpus information to obtain its words and their parts of speech;
a sentence-pattern analysis unit 220, configured to perform sentence-pattern analysis on the current corpus information according to grammar rules and the parts of speech of its words, obtaining the corresponding sentence structure;
a sentence body extraction unit 230, configured to extract the sentence body of the current corpus information according to the sentence structure.
The regular expression generation module 400 includes:
a replacement unit 410, configured to replace the words of the sentence body in the current corpus information with the corresponding semantic slots;
a regular expression generation unit 420, configured to order the non-body parts remaining in the segmented current corpus information and the semantic slots according to the grammatical structure, generating at least one regular expression whose ordering differs but whose semantics are the same.
Specifically, after the current corpus information is obtained, it is first segmented to obtain the parts of speech of its words; its sentence structure is then derived from grammar rules and those parts of speech; finally its sentence body is extracted according to that sentence structure.
Word segmentation means splitting the current corpus information into individual characters or words. For example, “不知道你在说什么” ("I don't know what you are saying") is split into “不知道,你,在,说什么”, and “鲸鱼为什么会喷水” ("Why do whales spout water?") is split into “鲸鱼,为什么,会,喷水”.
After segmentation, the resulting words are analyzed to obtain their parts of speech. For example, segmenting “鲸鱼为什么会喷水” yields “鲸鱼” (noun), “为什么” (pronoun), “会” (auxiliary verb), and “喷水” (verb). Sentence-pattern analysis is then performed according to grammar rules and these parts of speech, which gives the sentence structure "subject + adverbial + predicate" (主+状+谓) for “鲸鱼为什么会喷水”. Analyzing the corpus information against this structure shows that “鲸鱼喷水” is a subject-predicate structure, while “为什么喷水” and “会喷水” are adverbial-head (状中) structures. The main structure of “鲸鱼为什么会喷水” is therefore the subject-predicate structure “鲸鱼喷水”, and that is the sentence body extracted from it.
After the sentence body is extracted, each of its words is converted into a semantic slot according to the word's part of speech. A semantic slot may be the set of all words with that part of speech, or the set of words with the same meaning as the word. For example, in the sentence body “鲸鱼喷水”, “鲸鱼” is a noun and “喷水” is a verb, so the semantic slot for “鲸鱼” may be the noun library (名词库) and the semantic slot for “喷水” may be the verb library (动词库).
After the sentence body of the current corpus information has been extracted and the semantic slots of its words have been obtained, the non-body parts of the segmented current corpus information are retained, the words of the sentence body are replaced with their semantic slots, and the semantic slots and non-body parts are then ordered according to the grammatical structure, generating at least one regular expression whose ordering differs but whose semantics are the same.
For example, the current corpus information is “鲸鱼为什么会喷水” and the sentence body is “鲸鱼喷水”. The semantic slot for the word “鲸鱼” in the sentence body is the noun library, the semantic slot for the word “喷水” is the verb library, and the non-body parts remaining in the segmented corpus information are “为什么” (why) and “会” (can). Provided the semantics stay the same, ordering the non-body parts “为什么” and “会” together with the semantic slots 名词库 and 动词库 according to the grammatical structure yields two sequences: 名词库, 为什么, 会, 动词库; and 为什么, 名词库, 会, 动词库.
The ordering 名词库, 为什么, 会, 动词库 gives the regular expression “##名词库##[为什么][会]##动词库二##”, and the ordering 为什么, 名词库, 会, 动词库 gives the regular expression “##[为什么] ##名词库##[会]##动词库二##”. By permuting the matching items of a regular expression in this way, this embodiment generates several semantically equivalent regular expressions from a single piece of corpus information, which improves the efficiency of regular expression generation.
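To show how a slot template of this kind might be used for matching, the sketch below expands it into a pattern for Python's standard `re` module. The source does not describe this step, so the whole block is an assumption: the `to_regex` helper, the made-up word lists in `LIBRARIES`, the plain slot name 动词库, and the choice to treat bracketed words as required literals (the source does not say whether brackets mark optional words).

```python
import re

# Made-up slot contents, for illustration only.
LIBRARIES = {"名词库": ["鲸鱼", "海豚"], "动词库": ["喷水", "游泳"]}


def to_regex(template, libraries):
    """Expand a ##slot##[word] template into a compiled `re` pattern.

    Segments between ## delimiters that name a library become an
    alternation over its words; bracketed words become literals."""
    parts = []
    for seg in template.split("##"):
        if seg in libraries:
            alternatives = "|".join(re.escape(w) for w in libraries[seg])
            parts.append(f"(?:{alternatives})")
        else:
            parts.extend(re.escape(w) for w in re.findall(r"\[([^\]]+)\]", seg))
    return re.compile("".join(parts))


pattern = to_regex("##名词库##[为什么][会]##动词库##", LIBRARIES)
print(bool(pattern.fullmatch("海豚为什么会游泳")))
```

Each reordered template would be compiled the same way, so one corpus sentence yields several patterns that accept differently ordered but equivalent questions.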
Preferably, the regular expression generation unit 420 is further configured to add a connective to the generated regular expression after generating it, producing another regular expression with the same semantics.
Specifically, Chinese grammar also has sentence patterns, such as active and passive sentences, that differ in form but share the same meaning. To cover this case without changing the intent of the current corpus information, a connective (such as 把 or 被) is added to the generated regular expression, and the matching items of the generated regular expression are then rearranged to produce another regular expression with the same semantics.
For example, the current corpus information is “给老师词典” ("give the teacher a dictionary"). “给” (give) is a verb, so its semantic slot is the verb library (动词库); “老师” (teacher) is a noun, so its semantic slot is the noun library (名词库); “词典” (dictionary) is a noun, so its semantic slot is also the noun library. The generated regular expression is “##动词库##名词库##名词库##”. Adding the connective “把” to the current corpus information “给老师词典” turns it into “把词典给老师”. Accordingly, adding the connective “把” to the generated regular expression “##动词库##名词库##名词库##” produces another regular expression with the same semantics, “把##名词库##动词库##名词库##”.
It should be noted that the above embodiments may be freely combined as needed. The above are only preferred embodiments of the present invention. Those of ordinary skill in the art may make improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910046964.2A CN109783819B (en) | 2019-01-18 | 2019-01-18 | Regular expression generation method and system |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910046964.2A CN109783819B (en) | 2019-01-18 | 2019-01-18 | Regular expression generation method and system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109783819A CN109783819A (en) | 2019-05-21 |
| CN109783819B true CN109783819B (en) | 2023-10-20 |
Family
ID=66501654
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910046964.2A Active CN109783819B (en) | 2019-01-18 | 2019-01-18 | Regular expression generation method and system |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109783819B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110909160A (en) * | 2019-10-11 | 2020-03-24 | 平安科技(深圳)有限公司 | Regular expression generation method, server and computer-readable storage medium |
| CN111159384B (en) * | 2019-12-31 | 2022-07-08 | 思必驰科技股份有限公司 | Rule-based sentence generation method and device |
| CN111428469B (en) * | 2020-02-27 | 2023-06-16 | 宋继华 | Interactive labeling method and system for sentence-oriented structure graphic analysis |
| CN112115313B (en) * | 2020-09-08 | 2023-07-28 | 北京百度网讯科技有限公司 | Generation of regular expressions, data extraction method, device, equipment and medium |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101582075A (en) * | 2009-06-24 | 2009-11-18 | 大连海事大学 | Web information extraction system |
| CN105095186A (en) * | 2015-07-28 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Semantic analysis method and device |
| CN105512105A (en) * | 2015-12-07 | 2016-04-20 | 百度在线网络技术(北京)有限公司 | Semantic parsing method and device |
| CN107315737A (en) * | 2017-07-04 | 2017-11-03 | 北京奇艺世纪科技有限公司 | A kind of semantic logic processing method and system |
| CN107369443A (en) * | 2017-06-29 | 2017-11-21 | 北京百度网讯科技有限公司 | Dialogue management method and device based on artificial intelligence |
| CN107608949A (en) * | 2017-10-16 | 2018-01-19 | 北京神州泰岳软件股份有限公司 | A kind of Text Information Extraction method and device based on semantic model |
| CN107766560A (en) * | 2017-11-03 | 2018-03-06 | 广州杰赛科技股份有限公司 | The evaluation method and system of customer service flow |
| CN108563790A (en) * | 2018-04-28 | 2018-09-21 | 科大讯飞股份有限公司 | A kind of semantic understanding method and device, equipment, computer-readable medium |
| CN109063035A (en) * | 2018-07-16 | 2018-12-21 | 哈尔滨工业大学 | A kind of man-machine more wheel dialogue methods towards trip field |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109783819A (en) | 2019-05-21 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| KR102844617B1 (en) | System and method for performing semantic search using a natural language understanding (NLU) framework | |
| US11645547B2 (en) | Human-machine interactive method and device based on artificial intelligence | |
| CN108874937B (en) | A sentiment classification method based on part-of-speech combination and feature selection | |
| Candito et al. | Benchmarking of statistical dependency parsers for French | |
| CN112417846B (en) | Text automatic generation method and device, electronic equipment and storage medium | |
| CN109783819B (en) | Regular expression generation method and system | |
| WO2021179701A1 (en) | Multilingual speech recognition method and apparatus, and electronic device | |
| CN108595696A (en) | A kind of human-computer interaction intelligent answering method and system based on cloud platform | |
| US20110040553A1 (en) | Natural language processing | |
| CN109271492A (en) | Automatic generation method and system of corpus regular expression | |
| CN110502744A (en) | A Text Emotion Recognition Method and Device for Evaluation of Historical Parks | |
| CN108536673B (en) | News event extraction method and device | |
| CN112632272B (en) | Microblog sentiment classification method and system based on syntactic analysis | |
| CN110569510A (en) | method for identifying named entity of user request data | |
| JP2011123565A (en) | Faq candidate extracting system and faq candidate extracting program | |
| CN110309513B (en) | Text dependency analysis method and device | |
| CN101499056A (en) | Backward reference sentence pattern language analysis method | |
| CN114444469A (en) | Processing device based on 95598 customer service data resources | |
| CN120012771A (en) | A multilingual universal part-of-speech recognition method and system based on large language model | |
| CN102930042A (en) | Tendency text automatic classification system and achieving method of the same | |
| CN1208901A (en) | The Method of Automatic Analysis and Processing of Chinese Polyphonic Characters | |
| Ratnam et al. | Phonogram-based automatic typo correction in malayalam social media comments | |
| CN109783820B (en) | A semantic analysis method and system | |
| Tsai et al. | Applying an NVEF Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem | |
| CN112395889A (en) | Machine-synchronized translation |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||