CN111104801B

CN111104801B - Text segmentation method, system, equipment and media based on website domain name

Info

Publication number: CN111104801B
Application number: CN201911367979.5A
Authority: CN
Inventors: 杜韬; 李依谦; 曲守宁; 朱连江; 王信堂; 王希普
Original assignee: University of Jinan
Current assignee: University of Jinan
Priority date: 2019-12-26
Filing date: 2019-12-26
Publication date: 2023-09-26
Anticipated expiration: 2039-12-26
Also published as: CN111104801A

Abstract

The invention discloses a text segmentation method, system, device and medium based on website domain names. The method comprises: collecting a number of website domain names; performing word segmentation on each domain name; applying text formatting to the segmented words; analyzing the formatted words to obtain each word's part of speech; performing lemmatization according to the part of speech; storing the lemmatized results in a word library; and matching the domain name to be segmented against the word library using a bidirectional maximum matching algorithm. If the match succeeds, a text vectorization result is obtained; if it fails, the domain name to be segmented is cleaned and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.

Description

Text segmentation method, system, equipment and media based on website domain name

Technical field

The present disclosure relates to the technical field of natural language processing, and in particular to a text segmentation method, system, device and medium based on website domain names.

Background

The statements in this section merely provide background related to the present disclosure and do not necessarily constitute prior art. The present disclosure is premised on not tracking user behavior and not obtaining private user data.

In recent years, the Internet has become one of the most important infrastructures of human society and is having an increasingly broad and profound impact on people's economic and social activities. For a user, jumps between different websites can be regarded as that user's behavioral trajectory. Among the huge volume of online behavior data this generates, the website domain name is the most representative item: it carries the name and nature of the web pages the user browses, and can fully reflect the user's preferences among websites and the associations between the corresponding websites.

A website domain name consists mainly of English letters, Arabic numerals and some special characters such as "_", "@" and "/". Its purpose is to make the addresses of a group of servers (websites, e-mail, FTP, etc.) easier to remember and communicate.

In the course of realizing the present disclosure, the inventors found the following technical problems in the prior art:

First: website domain names are extremely short, and existing word segmentation techniques cannot extract keywords from them effectively.

Second: website domain names are irregular, unstructured text, which makes it harder both to extract concise, understandable knowledge from them and to vectorize the text later.

Third: when companies, organizations or individuals set up their own domain names, they name them according to personal habit, so abbreviations, misspellings and inconsistent languages are common.

Fourth: web mining on existing website domain names has excessive time and space complexity and easily leads to the curse of dimensionality.

These problems prevent data analysts from quickly obtaining information about the nature of a web page from its domain name, which reduces the accuracy and efficiency of user online behavior analysis.

Summary of the invention

To address the deficiencies of the prior art, the present disclosure provides a text segmentation method, system, device and medium based on website domain names. It can perform text parsing on any existing website domain name and extract the keywords it contains with high accuracy.

In a first aspect, the present disclosure provides a text segmentation method based on website domain names.

The text segmentation method based on website domain names includes:

data collection: collecting a number of website domain names, and performing word segmentation on each domain name;

performing text formatting on the segmented words, and analyzing the formatted words to obtain each word's part of speech;

performing lemmatization according to the part of speech, and storing the lemmatized results in a word library;

matching the domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, obtaining a text vectorization result; if the match fails, cleaning the domain name to be segmented and matching the cleaned result against the word library again using the bidirectional maximum matching algorithm.

In a second aspect, the present disclosure also provides a text segmentation system based on website domain names.

The text segmentation system based on website domain names includes:

a data collection module configured to: collect a number of website domain names, and perform word segmentation on each domain name;

a text formatting module configured to: perform text formatting on the segmented words, and analyze the formatted words to obtain each word's part of speech;

a lemmatization module configured to: perform lemmatization according to the part of speech, and store the lemmatized results in a word library;

a matching output module configured to: match the domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, obtain a text vectorization result; if the match fails, clean the domain name to be segmented and match the cleaned result against the word library again using the bidirectional maximum matching algorithm.

In a third aspect, the present disclosure also provides an electronic device including a memory, a processor, and computer instructions stored in the memory and run on the processor. When the computer instructions are run by the processor, the steps of the method of the first aspect are carried out.

In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions. When the computer instructions are executed by a processor, the steps of the method of the first aspect are carried out.

Compared with the prior art, the beneficial effects of the present disclosure are:

The method removes, more quickly, the redundant and meaningless identifiers that companies, organizations or individuals introduce when naming their websites; it corrects misspelled domain names with higher accuracy; and, by combining a personalized word library with an official dictionary, it segments the main information in a domain name more efficiently and in a more targeted way. This provides reliable preparation for vectorizing website domain names in subsequent online behavior analysis. Where patterns must be found in the behavioral trajectories of a very large number of users, the traditional approach loads each recorded URL and then classifies it manually by the nature of the web page. The present disclosure improves on this with a method that consumes very little time and space: it does not load web pages and is not affected by network bandwidth. It analyzes the text of the domain name to obtain the nature of the web page in real time, which improves the timeliness of user online behavior analysis and reduces its research cost.

Description of the drawings

The accompanying drawings, which form a part of this application, provide a further understanding of it. The illustrative embodiments and their descriptions are used to explain the application and do not unduly limit it.

Figure 1 is a flow chart of the method of the first embodiment;

Figure 2 shows one randomly chosen raw record after data collection in the first embodiment;

Figure 3 shows one record after processing by the word segmentation technique for very short text based on website domain names in the first embodiment.

Detailed description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present application. Unless otherwise specified, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which this application belongs.

It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the present application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprises" and/or "includes", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Embodiment 1: this embodiment provides a text segmentation method based on website domain names.

As shown in Figure 1, the text segmentation method based on website domain names includes:

S1: data collection: collecting a number of website domain names, and performing word segmentation on each domain name;

S2: performing text formatting on the segmented words, and analyzing the formatted words to obtain each word's part of speech;

S3: performing lemmatization according to the part of speech, and storing the lemmatized results in a word library;

S4: matching the domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, obtaining a text vectorization result; if the match fails, cleaning the domain name to be segmented and matching the cleaned result against the word library again using the bidirectional maximum matching algorithm.

As one or more embodiments, in S1, the data collection step of collecting a number of website domain names specifically includes:

collecting a number of website domain names, removing the configured sensitive words from each domain name, and storing the resulting domain names, organized by time, in data set S.

As one or more embodiments, after the step of collecting a number of website domain names and before the step of performing word segmentation on each domain name, the method further includes a data preprocessing step, which includes:

S101: for each website domain name in data set S, deleting or filling in missing values;

S102: extracting the website domain names, user by user, into a column vector.

It should be understood that the data preprocessing step, performed after collecting the domain names and before word segmentation, includes:

Data set S is preprocessed and denoised. For missing values, if an attribute has only a very small number of missing values, the affected records can simply be deleted; if an attribute has a moderate number of missing values, they can be filled in by same-class mean imputation.

A text splitting operation is then performed on the data. The raw data, shown in Figure 2, contains information such as the server and the user terminal. For user online behavior analysis, the fields must be distinguished by the markers between them, and the domain names of the websites browsed are extracted, user by user, into column vector L₁.
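The per-user extraction can be sketched as follows. This is only an illustration: the record layout `user|server|domain`, the `|` separator, and the function name `extract_domains_by_user` are assumptions, since the text does not specify the raw format shown in Figure 2.

```python
from collections import defaultdict


def extract_domains_by_user(raw_records, sep="|"):
    """Group the domain field of each raw record by user (hypothetical layout)."""
    per_user = defaultdict(list)
    for line in raw_records:
        fields = line.strip().split(sep)
        if len(fields) != 3:
            continue  # skip malformed records
        user_id, _server, domain = fields
        per_user[user_id].append(domain)
    return dict(per_user)


# Hypothetical raw records: user id, server, browsed domain.
raw = [
    "u1|srv-a|dldir1.example.com",
    "u2|srv-b|news.example.org",
    "u1|srv-a|checkresupdate.example.net",
]
L1 = extract_domains_by_user(raw)
```

Each value in `L1` plays the role of one user's column vector of browsed domain names.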

As one or more embodiments, in S1, performing word segmentation on each website domain name specifically includes:

performing word segmentation on each website domain name using the jieba word segmentation tool.

It should be understood that, in S1, performing word segmentation on each website domain name specifically includes:

Efficient word-graph scanning is implemented on a Trie structure to generate a directed acyclic graph (DAG) of all possible Chinese and English word formations in a sentence; dynamic programming is used to find the maximum-probability path, giving the best segmentation based on word frequency. The domain-name column vector L₁ is fed into jieba's full-mode segmentation, symbols are removed, every string in each record that can be regarded as a word is scanned out, and the results are stored in column vector L₂.
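To make the DAG and dynamic-programming idea concrete, here is a minimal pure-Python sketch. It does not call jieba; it imitates the idea with a toy frequency table (`FREQ`, an assumption for illustration) and a plain dictionary lookup instead of a Trie. With jieba installed, `jieba.cut(text, cut_all=True)` would be the library's full-mode call.

```python
import math

# Toy word-frequency table (an assumption for illustration only).
FREQ = {"check": 50, "res": 5, "update": 40, "up": 30, "date": 35, "he": 20}
TOTAL = sum(FREQ.values())


def build_dag(text, freq):
    """Map each start index to the end indices of known words starting there.

    If no known word starts at an index, fall back to a single character.
    """
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in freq]
        dag[i] = ends or [i + 1]
    return dag


def best_path(text, freq, total):
    """Pick the maximum-probability segmentation by dynamic programming."""
    dag = build_dag(text, freq)
    n = len(text)
    # route[i] = (best log-probability of text[i:], end index of its first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(text[i:j], 1)) - math.log(total) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j])
        i = j
    return words


def full_mode(text, freq):
    """List every substring that is a known word, in scan order."""
    dag = build_dag(text, freq)
    return [text[i:j] for i in range(len(text)) for j in dag[i] if text[i:j] in freq]
```

`best_path` corresponds to the maximum-probability path, while `full_mode` corresponds to scanning out every string that can be regarded as a word, which is what gets stored in L₂.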

As one or more embodiments, in S2, performing text formatting on the segmented words specifically includes:

performing text formatting on the segmented words by deleting marker symbols and the configured useless characters.

It should be understood that, in S2, performing text formatting on the segmented words specifically includes:

A text formatting operation is applied to column vector L₂ so that marker symbols and useless characters are removed completely. The result is recorded with one website domain name as the unit, the word strings it contains are kept as sub-records, and everything is stored in data set S₁.
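A minimal sketch of this formatting step is given below. Treating digits and single leftover letters as "useless characters", and lower-casing the tokens, are assumptions made for the example; the text does not define the exact character set to remove.

```python
import re


def format_tokens(tokens):
    """Strip non-letter characters from each token and drop what is left empty."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^A-Za-z]", "", tok).lower()  # keep letters only
        if len(tok) > 1:  # drop empty strings and single-letter leftovers
            cleaned.append(tok)
    return cleaned


# Hypothetical L2 content: one domain-like record with its raw tokens.
L2 = {"wrd_template_HEAD_06281609": ["wrd", "_", "template", "_", "HEAD", "_", "06281609"]}

# S1 keeps one entry per record, with the cleaned word strings as sub-records.
S1 = {record: format_tokens(tokens) for record, tokens in L2.items()}
```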

As one or more embodiments, in S2, analyzing the formatted words to obtain each word's part of speech specifically includes:

obtaining the part of speech of the current word from the suffix information in the word.

It should be understood that, in S2, analyzing the formatted words to obtain each word's part of speech specifically includes:

A regular-expression tagger is used. A tagset is defined to map tags to unified symbols, and information such as the suffixes of English words is used to infer each word's part of speech. The sub-records in data set S₁ are matched against the patterns in order; a word that matches none of them is tagged with the most probable part of speech. Finally, the results are recorded with one website domain name as the unit and each English word with its part of speech as a sub-record, and stored in data set S₂.
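The suffix-based tagging can be sketched in plain Python as below. The pattern list, the Penn-Treebank-style tag names and the default tag `NN` are assumptions chosen for illustration; NLTK's `RegexpTagger` provides the same ordered-pattern behavior as a library class.

```python
import re

# Ordered suffix rules (illustrative): the first matching pattern wins.
SUFFIX_RULES = [
    (re.compile(r".*ing"), "VBG"),  # gerund / present participle
    (re.compile(r".*ed"), "VBD"),   # past tense
    (re.compile(r".*ly"), "RB"),    # adverb
    (re.compile(r".*s"), "NNS"),    # plural noun
]
DEFAULT_TAG = "NN"  # assumed most probable tag when nothing matches


def tag_word(word):
    """Return the tag of the first rule whose pattern matches the whole word."""
    for pattern, tag in SUFFIX_RULES:
        if pattern.fullmatch(word):
            return tag
    return DEFAULT_TAG


def tag_records(s1):
    """Turn {record: [word, ...]} into {record: [(word, tag), ...]}."""
    return {rec: [(w, tag_word(w)) for w in words] for rec, words in s1.items()}


S1 = {"example-domain": ["loading", "updated", "downloads", "template"]}
S2 = tag_records(S1)
```

Suffix rules are coarse, so some words will be mis-tagged; the default tag only guarantees that every sub-record receives some part of speech.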

As one or more embodiments, in S3, performing lemmatization according to the part of speech specifically includes:

according to the part of speech, calling the WordNet function to perform lemmatization, so that the various inflected forms of a word are reduced to the same form, and generating dictionary D₁.

It should be understood that, in S3, performing lemmatization according to the part of speech specifically includes:

The English words and their parts of speech are extracted from each sub-record of data set S₂, and the WordNet function is called to perform lemmatization, normalizing all inflected forms of a word to a single form. The results are recorded with one website domain name as the unit and stored in data set S₃.
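The sketch below illustrates part-of-speech-driven lemmatization. It is a stand-in, not the WordNet call itself: it strips a tag-specific suffix and accepts the result only if it appears in a small vocabulary (`VOCAB`, an assumption), so that it runs without external data. With NLTK and its WordNet corpus installed, `WordNetLemmatizer().lemmatize(word, pos)` would be the natural replacement.

```python
# Small vocabulary of base forms (an assumption for illustration only).
VOCAB = {"load", "update", "download", "template", "check"}

# Suffixes to try for each tag (illustrative, not exhaustive).
SUFFIXES_BY_TAG = {"VBG": ["ing"], "VBD": ["ed"], "NNS": ["es", "s"]}


def lemmatize(word, tag, vocab=VOCAB):
    """Strip a tag-specific suffix if that yields a known base form."""
    for suffix in SUFFIXES_BY_TAG.get(tag, []):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for candidate in (stem, stem + "e"):  # "updat" -> "update"
                if candidate in vocab:
                    return candidate
    return word  # unknown form: leave the word unchanged


S2 = {
    "example-domain": [
        ("loading", "VBG"),
        ("updated", "VBD"),
        ("downloads", "NNS"),
        ("template", "NN"),
    ]
}
# S3 keeps one entry per record; D1 collects every lemma seen.
S3 = {rec: [lemmatize(w, t) for w, t in pairs] for rec, pairs in S2.items()}
D1 = {lemma for lemmas in S3.values() for lemma in lemmas}
```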

As one or more embodiments, in S3, storing the lemmatized results in the word library specifically includes:

The user builds a personalized word library D₂, and operations on D₂ are carried out in NLTK using the StanfordNLP toolkit. The union of the personalized word library D₂ and dictionary D₁ is taken to generate word library D₃, that is, D₃ = D₁ ∪ D₂.

As one or more embodiments, in S4, matching the domain name to be segmented against the word library using the bidirectional maximum matching algorithm specifically includes:

matching the domain name to be segmented against word library D₃ using the forward maximum matching algorithm, and recording the matching result R₁;

matching the domain name to be segmented against word library D₃ using the reverse maximum matching algorithm, and recording the matching result R₂;

if R₁ equals R₂, selecting R₁ as the final segmentation result of the domain name to be segmented.

Further, if R₁ does not equal R₂, then of the forward result R₁ and the reverse result R₂, the one whose individual English words contain more characters is selected as the final result R₃ of the bidirectional maximum matching algorithm for the domain name to be matched.

It should be understood that, in S4, matching the domain name to be segmented against the word library using the bidirectional maximum matching algorithm specifically includes:

First, the forward maximum matching algorithm is applied to the domain name and the candidates are compared against word library D₃:

if a candidate is an English word, it is recorded; otherwise one character is added and the comparison continues from left to right, terminating when only a single character remains.

If the string cannot be segmented, it is treated as an out-of-vocabulary (unregistered) string. The processed domain name is then matched, as a unit, against word library D₃ again; if the record matches correctly, the result R₁ of the forward maximum matching algorithm for this domain name is recorded.

Next, the reverse maximum matching algorithm is applied to S₃ and the candidates are compared against word library D₃:

if a candidate is an English word, it is recorded; otherwise one character is removed and the comparison continues from right to left, terminating when only a single character remains.

If the string cannot be segmented, it is treated as an out-of-vocabulary (unregistered) string. The processed domain name is then matched, as a unit, against word library D₃ again; if the record matches correctly, the result R₂ of the reverse maximum matching algorithm for this domain name is recorded.

If R₁ equals R₂, the forward result R₁ is selected as the final result R₃ of the bidirectional maximum matching algorithm for this domain name;

if R₁ does not equal R₂, then of the forward result R₁ and the reverse result R₂, the one whose individual English words contain more characters is selected as the final result R₃ of the bidirectional maximum matching algorithm for the domain name to be matched.

The final result R₃ is stored in data set S₄.
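The bidirectional matching described above can be sketched as follows. This is the common textbook form, which tries the longest candidate first and shortens it one character at a time, so its inner loop is an assumption rather than a literal transcription of the wording above. The single-character fallback for unmatched text and the tie-break key (fewer pieces, then fewer single-character pieces, as a proxy for "more characters per word") are likewise assumptions.

```python
def forward_max_match(text, vocab):
    """Scan left to right, always taking the longest known word."""
    max_len = max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single char is the fallback
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab):
    """Scan right to left, always taking the longest known word."""
    max_len = max(map(len, vocab))
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_max_match(text, vocab):
    """Return R1 if both directions agree, else the result with longer words."""
    r1 = forward_max_match(text, vocab)
    r2 = backward_max_match(text, vocab)
    if r1 == r2:
        return r1

    def key(result):
        return (len(result), sum(1 for w in result if len(w) == 1))

    return r1 if key(r1) <= key(r2) else r2


D3 = {"news", "new", "site", "check", "res", "update"}  # toy word library
R3 = bidirectional_max_match("newsite", D3)
```

For `"newsite"` the forward pass greedily takes `news` and leaves `i`, `t`, `e` unmatched, while the reverse pass finds `new` and `site`; the tie-break therefore selects the reverse result.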

As one or more embodiments, in S4, if the match fails, the domain name to be segmented is cleaned and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm. The specific steps include:

If the domain name to be segmented cannot be matched correctly, the superfluous strings are cleaned away and the bidirectional maximum matching algorithm is run again. This repeats until every string of the domain name matches word library D₃ and has been stored in data set S₄, at which point the process terminates. The resulting data set S₄ is the segmentation result for the domain name.
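A rough sketch of this clean-and-rematch loop is shown below. The cleaning rule used here (drop every piece that is not in the word library, then rejoin and rematch) and the `max_rounds` bound are assumptions; the text does not say how superfluous strings are identified. The matcher is a compact bidirectional maximum matcher with a single-character fallback, also an assumption.

```python
def max_match(text, vocab, reverse=False):
    """Longest-first matching from the left (or from the right if reverse)."""
    longest = max(map(len, vocab))
    pieces = []
    while text:
        for size in range(min(longest, len(text)), 0, -1):
            piece = text[-size:] if reverse else text[:size]
            if size == 1 or piece in vocab:  # single char is the fallback
                pieces.append(piece)
                text = text[:-size] if reverse else text[size:]
                break
    return pieces[::-1] if reverse else pieces


def bidirectional(text, vocab):
    """Prefer the direction that yields fewer (hence longer) pieces."""
    r1 = max_match(text, vocab)
    r2 = max_match(text, vocab, reverse=True)
    return r1 if len(r1) <= len(r2) else r2


def segment_with_cleaning(domain, vocab, max_rounds=10):
    """Rematch after removing unmatched pieces until everything is known."""
    text = domain
    pieces = []
    for _ in range(max_rounds):
        pieces = bidirectional(text, vocab)
        if all(p in vocab for p in pieces):
            return pieces
        # Cleaning step (assumed rule): keep only pieces found in the library.
        text = "".join(p for p in pieces if p in vocab)
    return [p for p in pieces if p in vocab]


D3 = {"check", "update", "up", "date"}  # toy word library
S4 = {"x9checkupdate7": segment_with_cleaning("x9checkupdate7", D3)}
```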

Figure 2 illustrates the problems that can occur in website domain names. Some samples are noise, such as dldir1; they have no real meaning and must be cleaned away. Some are concatenated words, such as checkresupdate; for samples that run several words together and also contain abbreviations or misspellings, the useful words must be picked out, the meaningless ones discarded, and the abbreviated or misspelled words restored to their most probable form.

Some samples mix identifiers and names, such as 80002486_fa55fa1d3a4b43bab792c6a8ff463f72.zip and wrd_template_HEAD_06281609. For these, the identifiers must be deleted, the meaningful words extracted, and inflections such as tense and passive voice undone. The file suffix should be given a higher weight, because it is highly discriminative when judging the nature of the resource.
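The treatment of mixed identifier-and-name samples might look like the sketch below. The weights, the rule for recognizing identifiers (all-digit parts and long hexadecimal strings containing a digit), and the function name `weighted_tokens` are assumptions for illustration; the text only states that identifiers are deleted and that the file suffix receives a higher weight.

```python
import re

SUFFIX_WEIGHT = 2.0  # hypothetical higher weight for the file suffix
WORD_WEIGHT = 1.0    # hypothetical weight for ordinary words

# Long hex-like identifier: at least 8 hex characters and at least one digit.
HEX_ID = re.compile(r"^(?=.*\d)[0-9a-f]{8,}$", re.IGNORECASE)


def weighted_tokens(name):
    """Drop identifier-like parts and weight the file suffix more heavily."""
    stem, dot, suffix = name.rpartition(".")
    if not dot:  # no file suffix present
        stem, suffix = name, ""
    weights = {}
    for part in re.split(r"[^A-Za-z0-9]+", stem):
        if not part or part.isdigit() or HEX_ID.match(part):
            continue  # empty piece or identifier: discard
        weights[part.lower()] = WORD_WEIGHT
    if suffix:
        weights[suffix.lower()] = SUFFIX_WEIGHT
    return weights


zip_sample = weighted_tokens("80002486_fa55fa1d3a4b43bab792c6a8ff463f72.zip")
template_sample = weighted_tokens("wrd_template_HEAD_06281609")
```

Expanding an abbreviation such as `wrd` is outside this sketch; it would be handled by the dictionary-based steps described earlier.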

Figure 3 shows one record after processing by the word segmentation technique for very short text based on website domain names.

Table 1 Case 1

Table 2 Case 2

Table 3 Case 3

Table 4 Case 4

Table 5 Case 5

Embodiment 2: this embodiment also provides a text segmentation system based on website domain names.

Embodiment 3: this embodiment also provides an electronic device including a memory, a processor, and computer instructions stored in the memory and run on the processor. When the computer instructions are run by the processor, the steps of the method of Embodiment 1 are carried out.

Embodiment 4: this embodiment also provides a computer-readable storage medium for storing computer instructions. When the computer instructions are executed by a processor, the steps of the method of Embodiment 1 are carried out.

The above are only preferred embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. The text word segmentation method based on the website domain name is characterized by comprising the following steps:

data collection: collecting a plurality of website domain names; and performing word segmentation on each website domain name, wherein the domain names of the websites browsed are extracted, with each user as a unit, into a column vector L₁; efficient word-graph scanning is implemented on a Trie structure to generate a directed acyclic graph of all possible Chinese and English word formations in a sentence; dynamic programming is used to find the maximum-probability path and thereby the best segmentation based on word frequency; and the domain-name column vector L₁ is input into the jieba full-mode word segmentation model, symbols are removed, all character strings in each record that can be regarded as words are scanned out, and the strings are stored in a column vector L₂;

performing text formatting on the segmented words, and analyzing the formatted words to obtain each word's part of speech, wherein a text formatting operation is performed on the column vector L₂, marker symbols and useless characters are thoroughly deleted, and the result is recorded with one website domain name as a unit, the word strings it contains being stored as sub-records in a data set S₁;

a regular-expression tagger is used, a tagset is defined to map tags to unified symbols, suffix information in English words is used to infer each word's part of speech, and the sub-records in the data set S₁ are matched in order, a word that matches none of the patterns being tagged with the most probable part of speech; the results are recorded with one website domain name as a unit, each English word and its part of speech being stored as a sub-record in a data set S₂;

performing lemmatization according to the part of speech, and storing the lemmatized results in a word library, specifically: according to the part of speech, a WordNet function is called to perform lemmatization so that the various inflected forms of a word are reduced to the same form, generating a dictionary D₁;

a user builds a personalized word library D₂, and operations on the word library D₂ are carried out in NLTK using the StanfordNLP toolkit; the union of the personalized word library D₂ and the dictionary D₁ is taken to generate a word library D₃, D₃ = D₁ ∪ D₂;

matching the website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, obtaining a text vectorization result; if the match fails, cleaning the domain name to be segmented and matching the cleaned result against the word library again using the bidirectional maximum matching algorithm, specifically: the domain name to be segmented is matched against the word library D₃ using a forward maximum matching algorithm, and the matching result R₁ is recorded; the domain name to be segmented is matched against the word library D₃ using a reverse maximum matching algorithm, and the matching result R₂ is recorded;

if the matching result R₁ is equal to the matching result R₂, the matching result R₁ is selected as the final segmentation result of the domain name to be segmented;

if the matching result R₁ is not equal to the matching result R₂, then of the result R₁ of the forward maximum matching algorithm and the result R₂ of the reverse maximum matching algorithm, the one whose individual English words contain more characters is selected as the final result R₃ of the bidirectional maximum matching algorithm for the domain name to be matched.

2. The method of claim 1, wherein data collection is performed to collect a plurality of web site names; the method comprises the following specific steps:

and collecting a plurality of website domain names, removing set sensitive words from each website domain name, and storing the website domain names with the sensitive words removed according to time units into a data set S.

3. The method of claim 1, wherein after the step of collecting a plurality of web site domain names, before the step of word segmentation for each web site domain name, further comprises: a data preprocessing step; the data preprocessing step comprises the following steps:

S101: deleting or filling in the missing values of each website domain name in the data set S;

S102: extracting the website domain names into a column vector, taking each user as a unit.
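Steps S101 and S102 can be illustrated with a short sketch. The input shape (a list of `(user_id, domain)` pairs), the function name `preprocess` and the `fill_value` parameter are assumptions made for illustration; the claim itself only states that missing values are deleted or filled in and that domain names are extracted per user.

```python
def preprocess(rows, fill_value=None):
    """Handle missing values (S101) and collect domain names per user (S102).

    rows: list of (user_id, domain) pairs; domain may be None or blank.
    fill_value: if None, records with a missing domain are deleted;
                otherwise the missing domain is replaced by this value.
    """
    column = {}
    for user_id, domain in rows:
        if domain is None or not domain.strip():
            if fill_value is None:
                continue  # S101: delete the record with a missing value
            domain = fill_value  # S101: fill in the missing value
        # S102: append to the per-user column vector
        column.setdefault(user_id, []).append(domain.strip())
    return column
```

Calling `preprocess(rows)` drops incomplete records, while `preprocess(rows, fill_value="unknown")` keeps them with a placeholder.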

4. The method of claim 1, wherein performing word segmentation on each website domain name specifically comprises: performing word segmentation on each website domain name using the jieba word segmentation tool.

5. A text word segmentation system based on website domain names, characterized by comprising:

a data acquisition module configured to: collect a plurality of website domain names; and perform word segmentation on each website domain name, wherein the website domain names browsed by each user are extracted into a column vector L1, taking each user as a unit; efficient word-graph scanning is implemented based on a Trie structure, generating a directed acyclic graph of all possible Chinese and English word formations in a sentence; dynamic programming is used to search for the maximum-probability path and find the most probable segmentation combination based on word frequency; and the website domain name column vector L1 is input into the jieba full-mode segmentation model, symbols are removed, all character strings that can be regarded as words in each record are scanned out, and these character strings are stored in a column vector L2;

a text formatting module configured to: perform text formatting on the words obtained by word segmentation; and analyze the part of speech of the words obtained after text formatting, wherein a text formatting operation is performed on the column vector L2, punctuation symbols and useless characters are completely deleted, and each website domain name is taken as one record, with the word strings it contains stored as sub-records in a data set S₁;

wherein a regular expression tagger is adopted in which the tagset is converted into unified symbols, suffix information in English words is used to infer the part of speech of each word, and the sub-records in the data set S₁ are matched in sequence; when a sub-record matches none of the patterns, it is tagged with the most probable part of speech; finally, each website domain name is taken as one record, with each English word and its corresponding part of speech stored as a sub-record in a data set S₂;

a lemmatization module configured to: perform lemmatization according to the part of speech of each word; and store the lemmatization result in a word library, specifically: according to the part of speech of each word, a WordNet function is called to perform the lemmatization operation, reducing the various inflected forms of a word to the same form and generating a dictionary D₁;

a match output module configured to: match the website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the matching succeeds, obtain a text vectorization result; if the matching fails, clean the website domain name to be segmented and match the cleaned result against the word library again using the bidirectional maximum matching algorithm, which specifically comprises: matching the website domain name to be segmented against the word library D₁ using a forward maximum matching algorithm, and recording the matching result R₁; and matching the website domain name to be segmented against the word library D₁ using a reverse maximum matching algorithm, and recording the matching result R₂.
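The suffix-based regular expression tagging described for the text formatting module of claim 5 can be sketched as follows. The patterns, the Penn Treebank style tags and the choice of `NN` as the most probable default tag are all assumptions made for illustration; the claim does not list the actual rules or tagset, and the function names are hypothetical.

```python
import re

# Ordered (pattern, tag) rules; the first matching suffix pattern wins.
SUFFIX_RULES = [
    (re.compile(r".+ing$"), "VBG"),  # gerund or present participle
    (re.compile(r".+ed$"), "VBD"),   # past tense
    (re.compile(r".+ly$"), "RB"),    # adverb
    (re.compile(r".+s$"), "NNS"),    # plural noun
]
DEFAULT_TAG = "NN"  # assumption: noun is the most probable part of speech


def tag_word(word):
    """Infer a part-of-speech tag from the word's suffix."""
    lowered = word.lower()
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(lowered):
            return tag
    return DEFAULT_TAG  # no rule matched


def tag_records(s1):
    """Map {domain: [word, ...]} (data set S1) to {domain: [(word, tag), ...]} (data set S2)."""
    return {
        domain: [(word, tag_word(word)) for word in words]
        for domain, words in s1.items()
    }
```

Each output record keeps the website domain name as the key and stores the words with their inferred tags as sub-records, mirroring the structure the claim gives for S₂.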

6. An electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method of any one of claims 1-4.

7. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1-4.