CN111104801B - Text segmentation method, system, equipment and media based on website domain name - Google Patents
- Publication number
- CN111104801B
- Authority
- CN
- China
- Prior art keywords
- word
- domain name
- website domain
- result
- matching
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text segmentation method, system, device and medium based on website domain names. The method comprises: collecting a plurality of website domain names; performing word segmentation on each website domain name; applying text formatting to the segmented words; determining the part of speech of each formatted word; lemmatizing each word according to its part of speech; storing the lemmatized results in a word library; and matching a website domain name to be segmented against the word library using a bidirectional maximum matching algorithm. If the match succeeds, a text vectorization result is obtained; if the match fails, the website domain name to be segmented is cleaned and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.
Description
Technical field
The present disclosure relates to the technical field of natural language processing, and in particular to a text segmentation method, system, device and medium based on website domain names.
Background
The statements in this section merely describe background related to the present disclosure and do not necessarily constitute prior art. The present disclosure presupposes that user behavior is not tracked and that private user information is not collected.
In recent years, the Internet has become one of the most important infrastructures of human society, with an increasingly broad and deep influence on economic and social activity. For a user, the jumps between different websites can be regarded as that user's behavioral trajectory. Among the massive volume of online behavior data this produces, the website domain name is the most representative item: it carries the name and nature of the pages the user browses, and it reflects both the user's preferences among websites and the associations between those websites.
A website domain name consists mainly of English letters, Arabic numerals and special characters such as "_", "@" and "/". Its purpose is to make the address of a group of servers (website, e-mail, FTP and so on) easy to remember and communicate.
In the course of realizing the present disclosure, the inventors found the following technical problems in the prior art:
First, website domain names are extremely short, and existing word segmentation techniques cannot extract keywords from them effectively.
Second, website domain names are irregular, unstructured text, which makes it harder both to extract concise, understandable knowledge from them and to vectorize the text afterwards.
Third, companies, organizations and individuals name their website domains according to their own habits, so abbreviations, misspellings and mixed languages are common.
Fourth, web mining over existing website domain names has excessive time and space complexity and easily runs into the curse of dimensionality.
As a result, data analysts cannot quickly learn the nature of a web page from its domain name, which reduces the accuracy and efficiency of user online behavior analysis.
Summary of the invention
To address the deficiencies of the prior art, the present disclosure provides a text segmentation method, system, device and medium based on website domain names. It can parse the text of any existing website domain name and extract the keywords it contains with relatively high accuracy.
In a first aspect, the present disclosure provides a text segmentation method based on website domain names.
The text segmentation method based on website domain names comprises:
data collection: collecting a plurality of website domain names, and performing word segmentation on each website domain name;
applying text formatting to the segmented words, and determining the part of speech of each formatted word;
lemmatizing each word according to its part of speech, and storing the lemmatized results in a word library;
matching a website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, a text vectorization result is obtained; if the match fails, the website domain name to be segmented is cleaned, and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.
In a second aspect, the present disclosure further provides a text segmentation system based on website domain names.
The text segmentation system based on website domain names comprises:
a data collection module configured to collect a plurality of website domain names and to perform word segmentation on each website domain name;
a text formatting module configured to apply text formatting to the segmented words and to determine the part of speech of each formatted word;
a lemmatization module configured to lemmatize each word according to its part of speech and to store the lemmatized results in a word library;
a matching output module configured to match a website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, a text vectorization result is obtained; if the match fails, the website domain name to be segmented is cleaned, and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.
In a third aspect, the present disclosure further provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the method of the first aspect are performed.
In a fourth aspect, the present disclosure further provides a computer-readable storage medium for storing computer instructions; when the computer instructions are executed by a processor, the steps of the method of the first aspect are performed.
Compared with the prior art, the beneficial effects of the present disclosure are as follows:
The method more quickly removes redundant or meaningless identifiers that companies, organizations or individuals introduce when naming their websites; corrects misspelled domain names with higher accuracy; and, by combining a personalized lexicon with an official dictionary, splits out the main information in a domain name more efficiently and in a more targeted way. This provides a reliable basis for vectorizing website domain names in subsequent online behavior analysis. Where patterns must be found in the behavioral trajectories of a very large number of users, the disclosure improves on the traditional approach, in which each URL record is loaded and then manually classified by the nature of the page. The disclosed method takes very little time and space, does not need to load web pages, and is not affected by network bandwidth. By analyzing the text of the domain name it obtains the nature of a web page in real time, which improves the timeliness of user online behavior analysis and lowers its research cost.
Description of the drawings
The accompanying drawings, which form part of this application, are provided for further understanding of the application. The illustrative embodiments and their descriptions are used to explain the application and do not unduly limit it.
Figure 1 is a flow chart of the method of the first embodiment;
Figure 2 shows one randomly selected raw record after data collection in the first embodiment;
Figure 3 shows one record after processing by the word segmentation technique for very short, domain-name-based text in the first embodiment.
Detailed description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which this application belongs.
It should also be noted that the terminology used here is only for describing specific embodiments and is not intended to limit the exemplary embodiments of this application. As used here, the singular forms are intended to include the plural forms unless the context clearly indicates otherwise. In addition, when the terms "comprise" and/or "include" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.
Embodiment 1: this embodiment provides a text segmentation method based on website domain names.
As shown in Figure 1, the text segmentation method based on website domain names comprises:
S1: data collection: collecting a plurality of website domain names, and performing word segmentation on each website domain name;
S2: applying text formatting to the segmented words, and determining the part of speech of each formatted word;
S3: lemmatizing each word according to its part of speech, and storing the lemmatized results in a word library;
S4: matching a website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, a text vectorization result is obtained; if the match fails, the website domain name to be segmented is cleaned, and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.
In one or more embodiments, the data collection in S1, in which a plurality of website domain names are collected, specifically comprises:
collecting a plurality of website domain names, removing the configured sensitive words from each website domain name, and storing the resulting domain names, organized by time unit, in data set S.
In one or more embodiments, a data preprocessing step is performed after the plurality of website domain names are collected and before each website domain name is segmented. The data preprocessing step comprises:
S101: deleting or filling in missing values for each website domain name in data set S;
S102: extracting the website domain names into a column vector, user by user.
It should be understood that the data preprocessing step, performed after collection and before segmentation, proceeds as follows:
Data set S is preprocessed and denoised. For missing values, if an attribute contains only a very small number of them, the affected records can simply be deleted; if an attribute contains a larger share of missing values, they can be filled in by same-class mean imputation.
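The missing-value rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the 5% cut-off between "very few" and "some" missing values, the function name `handle_missing`, and the field names are all hypothetical, and "same-class mean" is read as the mean of the observed values that share a class label.

```python
from statistics import mean

def handle_missing(rows, attr, class_key, drop_ratio=0.05):
    """Drop rows when few values are missing, else fill with the same-class mean.

    drop_ratio is an assumed threshold; the source text gives no number.
    """
    missing = [r for r in rows if r.get(attr) is None]
    if not missing:
        return list(rows)
    if len(missing) / len(rows) <= drop_ratio:
        # Very few gaps: deleting the affected rows is the cheaper option.
        return [r for r in rows if r.get(attr) is not None]

    # Collect observed values per class, then average them.
    by_class = {}
    for r in rows:
        if r.get(attr) is not None:
            by_class.setdefault(r[class_key], []).append(r[attr])
    means = {label: mean(values) for label, values in by_class.items()}

    filled = []
    for r in rows:
        if r.get(attr) is None:
            if r[class_key] not in means:
                continue  # no observed value in this class, so drop the row
            r = {**r, attr: means[r[class_key]]}  # copy, do not mutate input
        filled.append(r)
    return filled
```

The function returns a new list and leaves the input rows untouched, so the raw data set S stays available for comparison.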
The data is then split into fields. The raw data, shown in Figure 2, contains server information, user terminal information and so on. For user online behavior analysis, the fields are told apart by the markers between them, and for each user the domain names of the browsed websites are extracted into a column vector L1.
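A sketch of the per-user extraction might look like this. Figure 2 is not reproduced in the text, so the record layout used here (user id, server, domain separated by "|") is purely an assumption, as is the function name `domains_by_user`.

```python
from collections import defaultdict

def domains_by_user(log_lines, sep="|"):
    """Group the domain field of each log line by user id.

    Assumed (hypothetical) layout per line: user_id | server | domain
    """
    per_user = defaultdict(list)
    for line in log_lines:
        fields = [f.strip() for f in line.split(sep)]
        if len(fields) < 3:
            continue  # malformed record, skip it
        user_id, _server, domain = fields[:3]
        if domain:
            per_user[user_id].append(domain)
    return dict(per_user)
```

Each value in the returned dictionary plays the role of the column vector L1 for one user.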
In one or more embodiments, the word segmentation of each website domain name in S1 specifically comprises:
segmenting each website domain name with the jieba word segmentation tool.
It should be understood that the word segmentation of each website domain name in S1 proceeds as follows:
Efficient word-graph scanning is implemented on a Trie structure, producing a directed acyclic graph (DAG) of all possible word formations, Chinese and English, in the sentence. Dynamic programming is used to find the maximum-probability path, that is, the best segmentation based on word frequency. The domain-name column vector L1 is fed into jieba's full mode, symbols are removed, every string in each record that can be regarded as a word is scanned out, and the results are stored in column vector L2.
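To make the DAG and dynamic-programming idea concrete, here is a small standard-library sketch. It does not call jieba and is not its actual code; the frequency table `FREQ` and all function names are hypothetical. In jieba, full mode is generally described as listing every dictionary word found in the DAG, while the maximum-probability path belongs to the default precise mode, so the sketch shows both outputs side by side.

```python
import math

# Hypothetical word counts standing in for a real frequency dictionary.
FREQ = {"check": 50, "res": 5, "update": 40, "up": 30, "date": 35, "eck": 1}

def build_dag(text, freq):
    """dag[i] lists every end index j such that text[i:j+1] is a dictionary word."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to a single character
    return dag

def all_words(text, freq):
    """Full-mode style output: every dictionary word found anywhere in the text."""
    dag = build_dag(text, freq)
    return [text[i:j + 1]
            for i in range(len(text))
            for j in dag[i]
            if text[i:j + 1] in freq]

def best_path(text, freq):
    """Right-to-left dynamic programming for the max log-probability split."""
    dag = build_dag(text, freq)
    log_total = math.log(sum(freq.values()))
    n = len(text)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        # Unknown single characters are given a count of 1.
        route[i] = max(
            (math.log(freq.get(text[i:j + 1], 1)) - log_total + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words
```

With this toy table, `all_words("checkresupdate", FREQ)` lists the overlapping candidates, and `best_path` picks the single split with the highest total log probability.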
In one or more embodiments, the text formatting of the segmented words in S2 specifically comprises:
applying text formatting to the segmented words, deleting marker symbols and the configured useless characters.
It should be understood that the text formatting of the segmented words in S2 proceeds as follows:
A text formatting operation is applied to column vector L2 to remove marker symbols and useless characters completely. Each website domain name forms one record, the word strings it contains form its sub-records, and the result is stored in data set S1.
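The formatting step could be sketched as below. The set of "useless" tokens and the decision to strip digits along with symbols are assumptions made for illustration; the source only says that marker symbols and configured useless characters are deleted. The names `format_tokens` and `build_s1` are hypothetical.

```python
import re

# Hypothetical noise list; the source does not enumerate the useless strings.
DEFAULT_USELESS = frozenset({"www", "http", "https", "com", "cn"})

def format_tokens(tokens, useless=DEFAULT_USELESS):
    """Lower-case tokens, strip every non-letter, drop empties and noise words."""
    cleaned = []
    for tok in tokens:
        word = re.sub(r"[^a-z]", "", tok.lower())
        if word and word not in useless:
            cleaned.append(word)
    return cleaned

def build_s1(l2):
    """l2 maps each domain to its raw token list; one record is kept per domain."""
    return {domain: format_tokens(tokens) for domain, tokens in l2.items()}
```

Keying the result by domain mirrors the record and sub-record layout described for data set S1.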
In one or more embodiments, determining the part of speech of each formatted word in S2 specifically comprises:
deriving the part of speech of the current word from its suffix.
It should be understood that determining the part of speech of each formatted word in S2 proceeds as follows:
A regular-expression tagger is used. A tagset is defined so that tags map to uniform symbols, and information such as the suffix of an English word is used to infer its part of speech. The sub-records in data set S1 are matched against the patterns in order; a word that matches none of them is given the most probable part of speech. Finally, each website domain name forms one record, each English word together with its part of speech forms a sub-record, and the result is stored in data set S2.
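A suffix-driven tagger of this kind can be sketched with the standard `re` module. The pattern list, the Penn-Treebank-style tag names, and the choice of `NN` as the fallback are assumptions; the source does not list its tagset. The behavior (ordered patterns, first match wins, default tag otherwise) follows the description above.

```python
import re

# Ordered (pattern, tag) pairs; the first pattern that matches decides the tag.
# The concrete patterns and tags here are illustrative assumptions.
PATTERNS = [
    (r".*ing$", "VBG"),                   # gerunds
    (r".*ed$", "VBD"),                    # simple past
    (r".*es$", "VBZ"),                    # 3rd person singular present
    (r".*ly$", "RB"),                     # adverbs
    (r".*(able|ible|ful|ous)$", "JJ"),    # adjectives
    (r".*s$", "NNS"),                     # plural nouns
    (r"^\d+$", "CD"),                     # cardinal numbers
]
DEFAULT_TAG = "NN"  # assumed "most probable" tag when nothing matches

def tag_word(word):
    """Return the tag of the first matching pattern, else the default tag."""
    for pattern, tag in PATTERNS:
        if re.match(pattern, word):
            return tag
    return DEFAULT_TAG

def tag_records(s1):
    """Turn {domain: [word, ...]} into {domain: [(word, tag), ...]}."""
    return {domain: [(w, tag_word(w)) for w in words] for domain, words in s1.items()}
```

Because the patterns are tried in order, a word such as "files" is caught by the `es` rule before the plural-noun rule is reached; reordering the list changes that trade-off.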
In one or more embodiments, the lemmatization according to part of speech in S3 specifically comprises:
calling the WordNet function according to the part of speech of each word to lemmatize it, so that the inflected variants of a word are all reduced to the same form, and generating dictionary D1.
It should be understood that the lemmatization according to part of speech in S3 proceeds as follows:
The English words and their parts of speech are extracted from the sub-records of data set S2, and the WordNet function is called to lemmatize them, normalizing the inflected variants of each word to a single form. The results are recorded with one website domain name per record and stored in data set S3.
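The sketch below illustrates part-of-speech-aware lemmatization without requiring the WordNet corpus. It is a toy stand-in, not WordNet: it maps Penn-style tags to the single-letter classes that a WordNet lemmatizer typically expects, then tries suffix-detachment rules and accepts a candidate only if it appears in a known vocabulary. The rule tables, the tiny irregular-form table, and all names are assumptions for illustration.

```python
def wordnet_pos(tag):
    """Map a Penn-Treebank-style tag to a single-letter part-of-speech class."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun

# Illustrative tables only; a real lemmatizer carries far larger ones.
IRREGULAR = {("v", "ran"): "run", ("n", "mice"): "mouse"}
SUFFIX_RULES = {
    "n": [("ies", "y"), ("ches", "ch"), ("shes", "sh"), ("xes", "x"),
          ("ses", "s"), ("s", "")],
    "v": [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"), ("ed", ""),
          ("ing", "e"), ("ing", ""), ("s", "")],
}

def toy_lemmatize(word, pos, vocabulary):
    """Detach a suffix and keep the candidate only if the vocabulary knows it."""
    if (pos, word) in IRREGULAR:
        return IRREGULAR[(pos, word)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word  # unknown forms are returned unchanged
```

Checking each candidate against a vocabulary keeps the rules from producing non-words such as "updat"; strings the vocabulary does not cover, for example "dldir", pass through unchanged.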
In one or more embodiments, storing the lemmatized results in the word library in S3 specifically comprises:
The user builds a personalized lexicon D2 and operates on it using the StanfordNLP toolkit within NLTK. The union of the personalized lexicon D2 and dictionary D1 is taken to generate lexicon D3, that is, D3 = D1 ∪ D2.
In one or more embodiments, matching the website domain name to be segmented against the word library with the bidirectional maximum matching algorithm in S4 specifically comprises:
matching the website domain name to be segmented against lexicon D3 using the forward maximum matching algorithm, and recording the matching result R1;
matching the website domain name to be segmented against lexicon D3 using the reverse maximum matching algorithm, and recording the matching result R2;
if R1 equals R2, selecting R1 as the final segmentation result for the website domain name to be segmented.
Further, if R1 does not equal R2, whichever of the forward result R1 and the reverse result R2 has individual English words with more characters is selected as the final result R3 of the bidirectional maximum matching algorithm for the domain name.
It should be understood that matching the website domain name to be segmented against the word library with the bidirectional maximum matching algorithm in S4 proceeds as follows:
First, the forward maximum matching algorithm is applied and the candidates are compared with lexicon D3:
if a candidate is an English word, it is recorded; otherwise one character is added and the comparison continues from left to right, terminating when only one character remains;
if the string cannot be segmented, it is treated as an out-of-vocabulary string. The processed website domain name is then taken as a unit and matched against lexicon D3 again; if the record matches correctly, the result R1 of the forward maximum matching algorithm for this domain name is recorded.
Next, the reverse maximum matching algorithm is applied to S3 and the candidates are compared with lexicon D3:
if a candidate is an English word, it is recorded; otherwise one character is removed and the comparison continues from right to left, terminating when only one character remains;
if the string cannot be segmented, it is treated as an out-of-vocabulary string. The processed website domain name is then taken as a unit and matched against lexicon D3 again; if the record matches correctly, the result R2 of the reverse maximum matching algorithm for this domain name is recorded.
If R1 equals R2, the forward result R1 is selected as the final result R3 of the bidirectional maximum matching algorithm for this record;
if R1 does not equal R2, whichever of the forward result R1 and the reverse result R2 has individual English words with more characters is selected as the final result R3 of the bidirectional maximum matching algorithm for the domain name.
The final result R3 is stored in data set S4.
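The matching procedure above can be sketched as follows. Two points are assumptions rather than statements of the source: the sketch uses the conventional formulation of maximum matching, which starts from the longest window and shrinks it, and it reads the tie-break "individual words with more characters" as "the result with fewer pieces", since both results cover the same string. Characters that match nothing are emitted as single-character pieces. All names are hypothetical.

```python
def forward_max_match(text, vocab, max_len=None):
    """Scan left to right, always taking the longest dictionary word available."""
    max_len = max_len or max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always pass
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len=None):
    """Scan right to left, always taking the longest dictionary word available."""
    max_len = max_len or max(map(len, vocab))
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # restore reading order

def bidirectional_max_match(text, vocab):
    """Return R1 when both directions agree, else the result with fewer pieces."""
    r1 = forward_max_match(text, vocab)
    r2 = backward_max_match(text, vocab)
    if r1 == r2:
        return r1
    return r1 if len(r1) <= len(r2) else r2
```

For example, with the vocabulary `{"check", "res", "update", "up", "date"}` both directions split "checkresupdate" into check, res, update, so R1 is returned directly. When the piece counts tie, the sketch falls back to the forward result.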
In one or more embodiments, cleaning the website domain name to be segmented when the match fails in S4, and matching the cleaned result against the word library again with the bidirectional maximum matching algorithm, specifically comprises:
If the website domain name to be segmented cannot be matched correctly, the superfluous strings are cleaned out and the bidirectional maximum matching algorithm is run again. This repeats until every string of the domain name matches lexicon D3 and the result has been stored in data set S4. The resulting data set S4 is the segmentation result for the website domain name.
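The clean-and-retry loop can be illustrated with a compact sketch. For brevity it swaps in a forward-only longest-match segmenter instead of the full bidirectional algorithm, and it takes "cleaning" to mean discarding the characters that matched nothing; both choices, and the function names, are assumptions.

```python
def segment(text, vocab):
    """Greedy longest match; returns (words, unmatched single characters)."""
    max_len = max(map(len, vocab))
    words, unmatched, i = [], [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in vocab:
                words.append(text[i:i + size])
                i += size
                break
        else:
            # No dictionary word starts here: set the character aside.
            unmatched.append(text[i])
            i += 1
    return words, unmatched

def clean_and_match(domain, vocab):
    """Repeat matching, dropping unmatched characters, until everything matches."""
    text = domain.lower()
    while True:
        words, unmatched = segment(text, vocab)
        if not unmatched:
            return words
        # Cleaning step: keep only what matched and try again. The text gets
        # strictly shorter on every retry, so the loop always terminates.
        text = "".join(words)
```

Re-matching after cleaning can change the outcome: in "up9date" the first pass finds "up" and "date" around the stray digit, and the second pass on "update" finds the single longer word.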
Figure 2 illustrates the problems that can occur in website domain names. Some contain noise items such as dldir1; samples of this kind carry no real meaning and must be cleaned out. Some concatenate several words, such as checkresupdate; for samples that run several words together and also mix in abbreviations or misspellings, the useful words must be picked out, the meaningless ones removed, and the abbreviated or misspelled words restored with maximum probability.
Some mix identifiers with names, such as 80002486_fa55fa1d3a4b43bab792c6a8ff463f72.zip and wrd_template_HEAD_06281609. For samples of this kind, the identifiers must be deleted, the meaningful words extracted, and inflections such as tense and passive voice restored. The file suffix should be given a higher weight, because it is highly discriminative when judging the nature of the resource.
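The handling of identifier-heavy names can be sketched as below, using the two examples from the text. The list of recognized suffixes, the weight values, and the rule for spotting identifiers (all-digit pieces, or hexadecimal runs of 16 or more characters) are assumptions chosen for illustration; the source states only that identifiers are removed and the file suffix is weighted more heavily.

```python
import re

# Hypothetical configuration; the source gives no concrete list or weights.
KNOWN_SUFFIXES = {"zip", "exe", "apk", "jpg", "png", "html", "js", "css"}
SUFFIX_WEIGHT = 2.0
WORD_WEIGHT = 1.0

def is_identifier(piece):
    """Treat pure digit runs and long hexadecimal runs as meaningless ids."""
    if piece.isdigit():
        return True
    return len(piece) >= 16 and re.fullmatch(r"[0-9a-f]+", piece) is not None

def weighted_tokens(name):
    """Split a name, drop identifiers, and give a known file suffix more weight."""
    lowered = name.lower()
    stem, dot, suffix = lowered.rpartition(".")
    if not dot or suffix not in KNOWN_SUFFIXES:
        stem, suffix = lowered, ""
    tokens = [(piece, WORD_WEIGHT)
              for piece in re.split(r"[^0-9a-z]+", stem)
              if piece and not is_identifier(piece)]
    if suffix:
        tokens.append((suffix, SUFFIX_WEIGHT))
    return tokens
```

The surviving pieces (for example "wrd", "template", "head") would then go through the tagging, lemmatization and matching steps described earlier.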
Figure 3 shows one record after processing by the word segmentation technique for very short, domain-name-based text.
Table 1: Case 1
Table 2: Case 2
Table 3: Case 3
Table 4: Case 4
Table 5: Case 5
Embodiment 2: this embodiment further provides a text segmentation system based on website domain names.
The text segmentation system based on website domain names comprises:
a data collection module configured to collect a plurality of website domain names and to perform word segmentation on each website domain name;
a text formatting module configured to apply text formatting to the segmented words and to determine the part of speech of each formatted word;
a lemmatization module configured to lemmatize each word according to its part of speech and to store the lemmatized results in a word library;
a matching output module configured to match a website domain name to be segmented against the word library using a bidirectional maximum matching algorithm; if the match succeeds, a text vectorization result is obtained; if the match fails, the website domain name to be segmented is cleaned, and the cleaned result is matched against the word library again using the bidirectional maximum matching algorithm.
Embodiment 3: this embodiment further provides an electronic device comprising a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of the method of Embodiment 1 are performed.
Embodiment 4: this embodiment further provides a computer-readable storage medium for storing computer instructions; when the computer instructions are executed by a processor, the steps of the method of Embodiment 1 are performed.
The above are only preferred embodiments of this application and are not intended to limit it. A person skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911367979.5A CN111104801B (en) | 2019-12-26 | 2019-12-26 | Text segmentation method, system, equipment and media based on website domain name |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911367979.5A CN111104801B (en) | 2019-12-26 | 2019-12-26 | Text segmentation method, system, equipment and media based on website domain name |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111104801A (en) | 2020-05-05 |
CN111104801B (en) | 2023-09-26 |
Family
ID=70424414
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911367979.5A Active CN111104801B (en) | 2019-12-26 | 2019-12-26 | Text segmentation method, system, equipment and media based on website domain name |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111104801B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112992376A (en) * | 2021-03-04 | 2021-06-18 | 山东大学 | Disease name matching method and system based on weight adjustment |
CN113095050A (en) * | 2021-04-19 | 2021-07-09 | 广东电网有限责任公司 | Intelligent ticketing method, system, equipment and storage medium |
CN113645240B (en) * | 2021-08-11 | 2023-05-23 | 积至(海南)信息技术有限公司 | Malicious domain name community mining method based on graph structure |
CN113806477A (en) * | 2021-08-26 | 2021-12-17 | 广东广信通信服务有限公司 | Automatic text labeling method, device, terminal and storage medium |
CN116579344B (en) * | 2023-07-12 | 2023-10-20 | 吉奥时空信息技术股份有限公司 | Case main body extraction method |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101901249A (en) * | 2009-05-26 | 2010-12-01 | 复旦大学 | A Text-Based Query Expansion and Ranking Method in Image Retrieval |
CN105975454A (en) * | 2016-04-21 | 2016-09-28 | 广州精点计算机科技有限公司 | Chinese word segmentation method and device of webpage text |
CN108228710A (en) * | 2017-11-30 | 2018-06-29 | 中国科学院信息工程研究所 | Word segmentation method and device for URLs |
CN108509419A (en) * | 2018-03-21 | 2018-09-07 | 山东中医药大学 | Word segmentation and part-of-speech tagging method and system for ancient TCM book documents |
CN109271626A (en) * | 2018-08-31 | 2019-01-25 | 北京工业大学 | Text semantic analysis method |
CN109344263A (en) * | 2018-08-01 | 2019-02-15 | 昆明理工大学 | An address matching method |
CN110457466A (en) * | 2019-06-28 | 2019-11-15 | 谭浩 | Method for generating an interview report, computer-readable storage medium and terminal device |
Non-Patent Citations (1)
Title |
---|
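Dang Qianna (党倩娜). Data Preprocessing and Text Word Segmentation. In: Research on Weak-Signal Monitoring Mechanisms for Emerging Technologies (《新兴技术弱信号监测机制研究》). 2018, pp. 89-92. * |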
Also Published As
Publication number | Publication date |
---|---|
CN111104801A (en) | 2020-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111104801B (en) | Text segmentation method, system, equipment and media based on website domain name | |
Nayak et al. | Survey on pre-processing techniques for text mining | |
Iqbal et al. | Mining writeprints from anonymous e-mails for forensic investigation | |
Li et al. | Twiner: named entity recognition in targeted twitter stream | |
Urvoy et al. | Tracking web spam with html style similarities | |
US9507867B2 (en) | Discovery engine | |
WO2015149533A1 (en) | Method and device for word segmentation processing on basis of webpage content classification | |
CN106126619A (en) | Video retrieval method and system based on video content | |
CN112559747A (en) | Event classification processing method and device, electronic equipment and storage medium | |
TW201826145A (en) | Method and system for knowledge extraction from Chinese corpus useful for extracting knowledge from source corpuses mainly written in Chinese | |
CN105956192A (en) | Method and system for acquiring shortened form of organization name based on website homepage information | |
CN112069312A (en) | Text classification method based on entity recognition and electronic device | |
CN112445862B (en) | Internet of things equipment data set construction method and device, electronic equipment and storage medium | |
CN110705285B (en) | Government affair text subject word library construction method, device, server and readable storage medium | |
CN104346382A (en) | Text Analysis System and Method Using Linguistic Queries | |
CN108595466B (en) | Method for Internet information filtering and for structural analysis of Internet user information and network posts | |
Babbar et al. | Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text | |
CN117786249B (en) | Network real-time hot topic mining analysis and public opinion extraction system | |
CN113934910A (en) | Automatic optimization and updating theme library construction method and hot event real-time updating method | |
CN113157857A (en) | Hot topic detection method, device and equipment for news | |
Al-Sultany et al. | Enriching tweets for topic modeling via linking to the wikipedia | |
CN118535792A (en) | Chinese corpus acquisition method and system based on Common Crawl data | |
Yang et al. | Post-level spam detection for social bookmarking web sites | |
Zhang et al. | Event-based summarization for scientific literature in chinese | |
Ramachandran et al. | Document clustering using keyword extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||