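WO2015149533A1 - Method and apparatus for word segmentation processing based on web content classification - Google Patents
Method and apparatus for word segmentation processing based on web content classification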
- Publication number
- WO2015149533A1 (application PCT/CN2014/093396, CN2014093396W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- participle
- category
- word segmentation
- text information
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Definitions
- The present invention relates to the technical field of search, and in particular to a method for word segmentation processing based on web content categories and an apparatus for word segmentation processing based on web content categories.
- Users often need to enter key information to obtain related information, for example entering keywords in a search engine to find web pages, or entering keywords in a forum to find posts.
- Word segmentation is the basis of information processing and information retrieval; all such work is performed after segmentation. Segmentation errors therefore propagate into subsequent processing and are difficult to eliminate. For this reason, the pursuit of segmentation accuracy is a continuous process. The inherent characteristics of the Chinese language (no clear definition of a word, no separators between words, and a constant stream of new words and proper nouns) make it difficult to achieve 100% segmentation accuracy.
- The main method used by current word segmentation systems is statistical word segmentation.
- Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to constitute a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible it is that they form a word.
- The frequency of each combination of adjacent co-occurring characters in the corpus can be counted and their mutual information calculated: define the mutual information of two characters and calculate the adjacent co-occurrence probability of two Chinese characters X and Y.
- The mutual information reflects how tightly the Chinese characters are bound to each other. When the tightness is above a certain threshold, the character group may be considered to constitute a word.
- This method only needs to count the frequency of character groups in the corpus, but it has limitations. It often extracts common character groups that co-occur frequently but are not words, such as "这一" ("this"), "之一" ("one of"), "有的" ("some"), "我的" ("my"), and "许多的" ("many"); its recognition accuracy for commonly used words is poor; and its time and space overhead is large.
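The statistical approach described above can be sketched as follows. The source does not give a formula, so this sketch assumes the standard pointwise mutual information, log2(P(XY) / (P(X) P(Y))); the function names and the threshold parameter are illustrative, not taken from the patent.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Return {(x, y): PMI} for every adjacent character pair in the corpus.

    corpus is an iterable of strings. PMI is an assumed formula:
    log2(P(xy) / (P(x) * P(y))).
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / total_pairs
        p_x = char_counts[x] / total_chars
        p_y = char_counts[y] / total_chars
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def candidate_words(corpus, threshold):
    """Character pairs whose mutual information exceeds the threshold."""
    scores = adjacent_mutual_information(corpus)
    return sorted(x + y for (x, y), score in scores.items() if score > threshold)
```

Because the score depends only on co-occurrence counts, frequent function-word pairs can clear the threshold as easily as real words, which is the limitation the text points out.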
- On the one hand, incorrect segmentation results cause the related information obtained later to differ greatly from what was originally expected, which makes for a very poor user experience and wastes device system resources.
- On the other hand, a user who still needs the related information will enter key information again to search, and the device must again search, compare, and filter massive amounts of information to obtain information related to the search keywords. This makes the user's operation more cumbersome, consumes the user's time, greatly increases the burden on the device, and consumes more device resources.
- In view of the above problems, the present invention is proposed in order to provide a method for word segmentation processing based on web content categories, and a corresponding apparatus, that overcome the above problems or at least partially solve or alleviate them.
- A method for word segmentation processing based on web content categories, including:
- Extracting text information of web page content in the search resources;
- Dividing the category to which the text information belongs according to the web page content category;
- Performing word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
- An apparatus for word segmentation processing based on web content categories, including:
- An extraction module adapted to extract text information of web page content in the search resources;
- A dividing module adapted to divide the category to which the text information belongs according to the web page content category;
- A word segmentation module adapted to perform word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
- A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for word segmentation processing based on web content categories described above.
- A computer-readable medium in which the computer program described above is stored.
- The embodiment of the invention assigns the text information of web page content in the search resources to categories and segments the text information using the word segmentation dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy in each category, and achieves optimal handling of local segmentation. The improved accuracy comes closer to the user's intent, which improves the user experience and in turn reduces operations such as re-entering input and searching again; this makes operation simpler, reduces the device's responses to user operations, and reduces the consumption of device system resources.
- The embodiment of the invention assigns the text information of web page content in the search resources to categories, segments the text information using the word segmentation dictionary of its category, and then builds an inverted index from the first participles obtained by segmentation. This avoids the uniformity and one-sidedness of an inverted index built on global text information and improves the accuracy of the inverted index within each category, which in turn improves the running efficiency of the index and reduces indexing time. In addition, the text information of web pages in the search resources includes new, unusual, and distinctive text that conforms to the language characteristics of the category; drawing on the knowledge of other people and of the community collected in the search resources makes up for the shortcomings of self-made definitions and manual work, and greatly reduces manual operating costs.
- FIG. 1 is a flow chart of the steps of a method embodiment for word segmentation processing based on web content categories, in accordance with one embodiment of the present invention;
- FIG. 2 is a structural block diagram of an apparatus embodiment for word segmentation processing based on web content categories, in accordance with one embodiment of the present invention;
- FIG. 3 schematically shows a block diagram of a computing device for performing the method according to the invention; and
- FIG. 4 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
- Referring to FIG. 1, a flow chart of the steps of a method embodiment for word segmentation processing based on web content categories according to an embodiment of the present invention is shown. The method may include the following steps:
- Step 101: Extract text information of web page content in the search resources.
- The processing flow of a search engine can generally be divided into two parts: the first handles front-end user requests, and the second produces data at the back end.
- The front-end user request process can include:
- 1. The user enters keywords;
- 2. Query analysis: the search engine segments the keywords;
- 3. Retrieval: based on the segmentation results, the relevant set of web pages is found in the index prepared in advance;
- 4. Ranking: the candidate web pages are ranked along dimensions such as content relevance and timeliness;
- 5. Display: the ranked web pages are displayed.
- The back-end data production process can include:
- 1. Web crawling: the crawler follows the links between web pages to fetch pages from the Internet and save them;
- 2. Index production: the crawled and saved web pages are analyzed, the page title and page text are segmented, and an inverted index is built from the segmentation results for front-end retrieval.
- The web pages fetched by the crawler can be saved in a web page database to form a large body of search resources, and the web page content can include a large amount of text information.
- In the embodiment of the present invention, the text information of web page content in the search resources may therefore be extracted from the web page database.
- Step 102: Divide the category to which the text information belongs according to the web page content category.
- In one case, the web page category may be obtained from the URL of the web page, and the category of the text information may then be divided according to the web page category.
- For example, URLs of web pages in the animation field generally have a domain name marked "comic", and URLs of web pages in the sports field generally have a domain name marked "sports", such as comic.XXX.com and sports.XXX.com. When a domain name marked "comic" or "sports" is detected in the web page URL, the web page category can be identified as the animation field or the sports field, and the text information can then be assigned to the animation field or the sports field.
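The URL-based case can be sketched as below. This is a minimal illustration that assumes the category marker appears as one label of the host name; the mapping table and function name are hypothetical and would be configured per deployment.

```python
from urllib.parse import urlparse

# Hypothetical mapping from a host-name label to a category.
SUBDOMAIN_CATEGORIES = {"comic": "animation", "sports": "sports"}


def category_from_url(url):
    """Return the category implied by the URL's host name, or None."""
    host = urlparse(url).hostname or ""  # hostname is lower-cased by urlparse
    for label in host.split("."):
        if label in SUBDOMAIN_CATEGORIES:
            return SUBDOMAIN_CATEGORIES[label]
    return None
```

Calling `category_from_url("http://comic.XXX.com/page")` would return `"animation"` under this mapping, and a URL with no recognised label returns `None` so that another signal (tags, title, breadcrumbs) can be tried.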
- In another case, a visited web page usually carries tag information; the web page category can be obtained from the tag information, and the category of the text information is then divided according to the web page category. For example, if a web page carries tags such as video, movie, and comedy movie, the web page category can be identified as the movie field, and the text information is then assigned to the movie field.
- In yet another case, specific words in the web content title (topic) can be analyzed to determine the web page category. For example, a title containing specific words such as basketball, football, NBA, or World Cup indicates the sports field.
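The tag-based and title-based cases might look like the following sketch. The keyword tables reuse the examples from the text; their structure, the overlap-counting rule for tags, and the function names are assumptions made for illustration.

```python
# Hypothetical keyword tables built from the examples in the text.
TAG_CATEGORIES = {"movie": {"video", "movie", "comedy movie"}}
TITLE_WORDS = {"sports": {"basketball", "football", "NBA", "World Cup"}}


def category_from_tags(tags):
    """Pick the category whose tag set overlaps most with the page tags."""
    tag_set = set(tags)
    best, best_hits = None, 0
    for category, keywords in TAG_CATEGORIES.items():
        hits = len(tag_set & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def category_from_title(title):
    """Return the first category with a specific word present in the title."""
    for category, keywords in TITLE_WORDS.items():
        if any(word in title for word in keywords):
            return category
    return None
```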
- In a further case, the web page category may be obtained from the web page navigation (such as a breadcrumb trail), and the category of the text information may then be divided according to the web page category.
- Breadcrumb navigation usually appears horizontally at the top of the page, usually below the title or header.
- Breadcrumb navigation provides the user with links back to each previous page (these links also form the path to the current page), usually the parent pages of the current page in the hierarchy.
- Breadcrumb navigation gives the user a path back to the home page or portal page of the website. The levels are usually separated by a greater-than sign (>), although some designs use other symbols (such as >>). For example, "Home > Category page > Sub-category page" or "Home>> Category page>> Sub-category page".
- For example, if the breadcrumb navigation of a web page is “XX Portal>Sports>Chinese Football>Zhongchao”, the web page falls under the categories sports, Chinese football, and Chinese Super League (Zhongchao), and whichever of these matches the classification used in a practical application of the embodiment of the present invention may be used.
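A breadcrumb trail in either separator style can be split into candidate categories with a short helper. This is a sketch under the assumption that the first level is the site root and carries no category information; the function name is illustrative.

```python
import re


def breadcrumb_categories(trail):
    """Split a breadcrumb trail on '>' or '>>' and drop the site root."""
    parts = [part.strip() for part in re.split(r">+", trail)]
    parts = [part for part in parts if part]
    return parts[1:]  # assumption: the first level is the home or portal page
```

For "Home > Category page > Sub-category page" this yields the two category levels, from which the application can pick the level that matches its own classification scheme.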
- The embodiment of the present invention can divide categories at whatever granularity is actually needed. For example, for the sports field, in addition to a category for the entire field, next-level categories such as basketball and football can be divided.
- These can be divided further into sub-categories such as NBA (National Basketball Association), CBA (Chinese Basketball Association), La Liga (Spanish football league), and the Chinese Super League (China Football Super League); the embodiment of the present invention does not limit this.
- The word segmentation accuracy of the embodiment of the present invention is accordingly higher.
- Step 103: Perform word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
- Each category can correspond to a specific word segmentation dictionary, so that the text information is segmented on the basis of the language characteristics of its category.
- The word segmentation dictionary can be generated in the following manner:
- Sub-step S11: acquiring a first training document;
- The first training document may be text information of a web page in the search resources.
- Sub-step S12: dividing the category to which the first training document belongs;
- The category of the first training document may be divided according to the web page category.
- Sub-step S13: performing word segmentation processing on the first training document corresponding to the category to obtain second participles;
- A general word segmentation dictionary can be used to segment the first training document.
- The general word segmentation dictionary is a general-purpose dictionary. It contains no technical terms from a particular field (such as angelica, grass, and the like in the field of Chinese herbal medicine) and mainly includes general terms whose frequency of occurrence is higher than a preset threshold.
- The general word segmentation dictionary may include generic words and words with a definite meaning.
- Generic words can include adjectives, conjunctions, and verbs of general meaning, such as happiness, but, participation, and the like. Words with a definite meaning are words that can express a definite range of meaning, usually nouns and verbs.
- Sub-step S14: counting the word frequency and the first co-occurrence rate of the second participles corresponding to the category;
- An N-Gram model can be trained on the second participles.
- The first co-occurrence rate may be the probability that two or more second participles appear together.
- The first co-occurrence rate may include a ratio of a first word frequency to a second word frequency;
- The first word frequency is the number of times the current second participle appears after the target second participle, where the target second participle includes one or more second participles appearing before the current second participle;
- The second word frequency is the total word frequency of the target second participle.
- If the appearance of a word depends only on the one word that appears before it, the model is called a bigram; if it depends only on the two words that appear before it, the model is called a trigram.
- In practice, bigrams and trigrams are used most. N-Gram models of order four or higher are used less, because training them requires a much larger corpus, the data are severely sparse, the time complexity is high, and the accuracy gain is small.
- Total word frequency of each second participle:
Second participle | I | want | to | eat | Chinese | food | lunch
Total word frequency | 3437 | 1215 | 3256 | 938 | 213 | 1506 | 459
- Table 2: Frequency with which the current second participle appears after the target second participle
- For example, the value 1087 in the third row and third column indicates that the current second participle "want" appears 1087 times after the target second participle "I".
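With the figures quoted above, the first co-occurrence rate of "want" after "I" is 1087 / 3437, roughly 0.316. The bigram case of the ratio defined in sub-step S14 can be sketched as follows; the function name and the input layout (one list of participles per document) are assumptions for illustration.

```python
from collections import Counter


def cooccurrence_rates(segmented_documents):
    """Map (target, current) to count(target followed by current) / count(target).

    segmented_documents is an iterable of participle lists.
    """
    word_freq = Counter()   # total word frequency of each participle
    pair_freq = Counter()   # frequency of each adjacent (target, current) pair
    for participles in segmented_documents:
        word_freq.update(participles)
        pair_freq.update(zip(participles, participles[1:]))
    return {
        (target, current): count / word_freq[target]
        for (target, current), count in pair_freq.items()
    }
```

A trigram variant would key the pair counter on the two preceding participles instead of one, at the cost of the sparsity the text mentions.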
- Sub-step S15: using the second participles and their first co-occurrence rates to generate the word segmentation dictionary corresponding to the category.
- Commonly used dictionary query methods include hash query, TRIE tree (also known as word search tree or key tree) query, binary query, sequential query, and the like.
- Several query methods can be used together. For example, a hash query can be combined with a binary query, and in a dictionary built on the TRIE mechanism a TRIE tree query can be combined with a binary query. Combining multiple query methods can improve query efficiency.
- The word segmentation dictionary may be generated according to a query mechanism formed by one or more query methods such as hash query, TRIE tree query, binary query, and sequential query, so that the corresponding queries can be performed on it.
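As one illustration of a dictionary organised for TRIE tree queries, the sketch below stores entries in nested Python dicts (which are themselves hash tables, so each step down the tree is a hash lookup). The class and method names are illustrative; the patent does not prescribe a concrete layout.

```python
class TrieDictionary:
    """Minimal TRIE-backed word dictionary (illustrative sketch)."""

    _END = ""  # marker key; cannot collide with a one-character key

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

    def longest_prefix(self, text, start=0):
        """Longest dictionary entry that begins at text[start]."""
        node = self.root
        best_end = start
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self._END in node:
                best_end = i + 1
        return text[start:best_end]
```

The `longest_prefix` query is convenient for maximum-matching segmentation because it finds the longest entry in a single pass instead of probing every candidate length.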
- The word segmentation dictionary can be updated in the following manner:
- Sub-step S21: acquiring a second training document;
- The second training document may be text information of a web page in the search resources.
- Sub-step S22: dividing the category to which the second training document belongs;
- The category of the second training document may be divided according to the web page category.
- Sub-step S23: performing word segmentation processing on the second training document according to the word segmentation dictionary corresponding to the category, to obtain third participles;
- The second training document may be segmented using the word segmentation dictionary corresponding to the category to which the second training document belongs.
- To segment with the dictionary, a given character string in the second training document is matched against the dictionary according to a chosen strategy, such as the forward maximum matching method (MM), the reverse maximum matching method (RMM), or the bidirectional scanning method. A substring is cut from the string. If the substring matches an entry in the word segmentation dictionary, it is taken to be a third participle, a segmentation mark is inserted, and the remaining part of the string is segmented in the same way until nothing remains. Otherwise the substring is not a third participle, and a different substring is cut from the string for the next match.
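The forward maximum matching method (MM) named above can be sketched as below. This is a generic textbook formulation, not code from the patent; the fallback of emitting a single character when nothing matches is an assumption, as are the function and parameter names.

```python
def forward_maximum_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary entry.

    dictionary is any container supporting `in` (for example a set).
    Characters that start no entry are emitted as single-character participles.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    participles = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                participles.append(piece)
                i += size
                break
    return participles
```

The reverse method (RMM) scans from the end of the string instead, and the bidirectional method runs both and compares the two results.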
- Sub-step S24: counting the word frequency and the second co-occurrence rate of the third participles corresponding to the category;
- An N-Gram model can be trained on the third participles.
- The second co-occurrence rate may be the probability that two or more third participles appear together.
- The second co-occurrence rate may include a ratio of a third word frequency to a fourth word frequency;
- The third word frequency is the number of times the current third participle appears after the target third participle, where the target third participle includes one or more third participles appearing before the current third participle;
- The fourth word frequency is the total word frequency of the target third participle.
- Sub-step S25: updating the word segmentation dictionary corresponding to the category using the third participles and their second co-occurrence rates.
- The word segmentation dictionary may be updated according to a query mechanism formed by one or more query methods such as hash query, TRIE tree query, binary query, and sequential query.
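One way to picture the update in sub-steps S24 and S25 is to keep running counts per category and fold newly segmented documents into them. The class below is a hypothetical sketch for the bigram case; the patent does not specify how the dictionary stores its statistics.

```python
from collections import Counter


class CategoryDictionary:
    """Running word and pair counts for one category (illustrative sketch)."""

    def __init__(self):
        self.word_freq = Counter()  # total word frequency of each participle
        self.pair_freq = Counter()  # frequency of (target, current) pairs

    def update(self, segmented_documents):
        """Fold newly segmented training documents into the counts."""
        for participles in segmented_documents:
            self.word_freq.update(participles)
            self.pair_freq.update(zip(participles, participles[1:]))

    def cooccurrence_rate(self, target, current):
        """count(target followed by current) / count(target), or 0.0 if unseen."""
        total = self.word_freq[target]
        return self.pair_freq[(target, current)] / total if total else 0.0
```

Because the rates are derived from the counts on demand, each batch of second training documents only needs one call to `update`.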
- The word segmentation dictionary corresponding to the category to which the text information belongs may be used directly to perform word segmentation processing on the text information.
- Alternatively, step 103 may include the following sub-steps:
- Sub-step S31: performing word segmentation processing on the text information according to both the word segmentation dictionary corresponding to the category to which the text information belongs and the general dictionary;
- Sub-step S32: taking the participle with the highest word frequency obtained after the word segmentation processing as the first participle obtained by the word segmentation processing.
- That is, the word segmentation dictionary corresponding to the category to which the text information belongs and the general dictionary (general word segmentation dictionary) can both be used to perform word segmentation processing on the text information.
- Step 103 may also include the following sub-steps:
- Sub-step S41: when the text information belongs to a plurality of categories, performing word segmentation processing on the text information according to the word segmentation dictionary corresponding to each of the categories;
- Sub-step S42: taking the participle with the highest word frequency obtained after the word segmentation processing as the first participle obtained by the word segmentation processing.
- The text information may be assigned to multiple categories, that is, it may be cross-domain.
- For example, text information about an aircraft may be classified in the field of mechanics or in the field of aviation.
- In that case the text information can be segmented with the word segmentation dictionary of each of its categories, and the result with the highest word frequency is finally taken.
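The text does not spell out how "highest word frequency" is compared across categories. The sketch below adopts one possible reading, stated here as an assumption: each category's dictionary yields a candidate segmentation, and the candidate whose participles have the highest mean word frequency in that category is kept. All names and the scoring rule are hypothetical.

```python
def pick_by_word_frequency(candidates, frequencies):
    """Choose among per-category segmentations of the same text.

    candidates:  {category: [participle, ...]}
    frequencies: {category: {participle: word frequency}}
    Scoring by mean frequency is an assumed interpretation, not the patent's rule.
    """
    def score(category):
        participles = candidates[category]
        freq = frequencies.get(category, {})
        return sum(freq.get(p, 0) for p in participles) / max(len(participles), 1)

    best = max(candidates, key=score)
    return best, candidates[best]
```

Other readings, such as comparing frequencies participle by participle, would fit the same interface with a different `score` function.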
- The main method used by current word segmentation systems is statistical word segmentation. Simply put, when determining the segmentation points, it relies mainly on information such as word frequency and the transition probability between candidate words. Because it is statistical, it necessarily serves the majority at the expense of the minority; that is, it pursues the global statistical optimum rather than the local optimum, so the accuracy of local segmentation is very low.
- The embodiment of the invention assigns the text information of web page content in the search resources to categories and segments the text information using the word segmentation dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy in each category, and achieves optimal handling of local segmentation. The improved accuracy comes closer to the user's intent, which improves the user experience and in turn reduces operations such as re-entering input and searching again; this makes operation simpler, reduces the device's responses to user operations, and reduces the consumption of device system resources.
- Step 104: For the category, use the first participles obtained by the word segmentation processing to establish an inverted index.
- The inverted index arises from the practical need to find records by the value of an attribute.
- Each entry in such an index table includes an attribute value and the address of each record having that attribute value. Because the attribute value is not determined by the record, but rather the position of the record is determined by the attribute value, it is called an inverted index.
- A file with an inverted index is called an inverted index file, or simply an inverted file.
- The index objects of an inverted file are the words in a document or a collection of documents (such as web pages). It stores the locations of those words within a document or a group of documents, and it is one of the most commonly used indexing mechanisms for documents or document collections.
- Step 104 may include the following sub-steps:
- Sub-step S51: for the category, recording the position at which each first participle corresponding to the category appears;
- Sub-step S52: recording the first participle and its corresponding position in the inverted index.
- The position of the first participle may include the web page in which it appears, or the web page in which it appears together with its position within that page.
- T1 “it is what it is”
- T3 “it is a banana”
- banana: {(2, 3)} indicates that "banana" appears in the text information of the third web page (T3), where it is the fourth word (address 3).
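Sub-steps S51 and S52 can be sketched as follows, using zero-based page and word addresses as in the banana example. The texts of T1 and T3 come from the example above; the text used for T2 is a placeholder invented for this sketch, and the function name is illustrative.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each participle to a list of (page address, word address) pairs.

    pages is a list of participle lists, one per web page in a category.
    """
    index = defaultdict(list)
    for page_id, participles in enumerate(pages):
        for position, participle in enumerate(participles):
            index[participle].append((page_id, position))
    return dict(index)


pages = [
    "it is what it is".split(),  # T1
    "what is it".split(),        # T2: hypothetical text, not given in the source
    "it is a banana".split(),    # T3
]
index = build_inverted_index(pages)
```

Building one such index per category, from the first participles of that category only, is what keeps each index specific to its category.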
- The embodiment of the invention assigns the text information of web page content in the search resources to categories, segments the text information using the word segmentation dictionary of its category, and then builds an inverted index from the first participles obtained by segmentation. This avoids the uniformity and one-sidedness of an inverted index built on global text information and improves the accuracy of the inverted index within each category, which in turn improves the running efficiency of the index and reduces indexing time. In addition, the text information of web pages in the search resources includes new, unusual, and distinctive text that conforms to the language characteristics of the category; drawing on the knowledge of other people and of the community collected in the search resources makes up for the shortcomings of self-made definitions and manual work, and greatly reduces manual operating costs.
- Referring to FIG. 2, a structural block diagram of an apparatus embodiment for word segmentation processing based on web content categories according to an embodiment of the present invention is shown. The apparatus may include the following modules:
- The extraction module 201 is adapted to extract text information of web page content in the search resources;
- The dividing module 202 is adapted to divide the category to which the text information belongs according to the web page content category;
- The word segmentation module 203 is adapted to perform word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
- An establishing module is adapted to establish, for the category, an inverted index using the first participles obtained by the word segmentation processing.
- The establishing module may further be adapted to:
- Record the first participle and its corresponding position in the inverted index.
- The word segmentation dictionary can be generated in the following manner:
- Generating the word segmentation dictionary corresponding to the category.
- The first co-occurrence rate may include a ratio of a first word frequency to a second word frequency;
- The first word frequency is the number of times the current second participle appears after the target second participle, where the target second participle includes one or more second participles appearing before the current second participle;
- The second word frequency is the total word frequency of the target second participle.
- The word segmentation dictionary can be updated in the following manner:
- The second co-occurrence rate may include a ratio of a third word frequency to a fourth word frequency;
- The third word frequency is the number of times the current third participle appears after the target third participle, where the target third participle includes one or more third participles appearing before the current third participle;
- The fourth word frequency is the total word frequency of the target third participle.
- The word segmentation module 203 can also be adapted to:
- Take the participle with the highest word frequency obtained after the word segmentation processing as the first participle obtained by the word segmentation processing.
- The word segmentation module 203 can also be adapted to:
- When the text information belongs to a plurality of categories, perform word segmentation processing on the text information according to the word segmentation dictionary corresponding to each of the categories;
- Take the participle with the highest word frequency obtained after the word segmentation processing as the first participle obtained by the word segmentation processing.
- For the apparatus embodiment, the description is relatively brief; for the relevant parts, refer to the description of the method embodiment.
- The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for word segmentation processing based on web content classification according to embodiments of the present invention.
- The invention can also be implemented as a device or apparatus program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein.
- Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
- For example, FIG. 3 illustrates a computing device, such as a retrieval server, that can perform word segmentation processing based on web content classification in accordance with the present invention.
- The computing device conventionally includes a processor 310 and a computer program product or computer-readable medium in the form of a memory 320.
- The memory 320 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM.
- The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above.
- For example, the storage space 330 for program code may include individual pieces of program code 331, each implementing one of the steps of the above method.
- The program code can be read from or written to one or more computer program products.
- These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards, or floppy disks.
- Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
- the storage unit may have storage segments, storage spaces, and the like that are similarly arranged to memory 320 in the computing device of FIG.
- The program code can, for example, be compressed in an appropriate form.
- Typically, the storage unit includes computer-readable code 331', that is, code that can be read by a processor such as the processor 310 and that, when executed by a computing device, causes the computing device to perform each step of the method described above.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
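The invention discloses a method and an apparatus for word segmentation processing based on web content classification. The method includes: extracting text information of web page content in the search resources; dividing the category to which the text information belongs according to the web page content category; and performing word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs. The embodiment of the invention assigns the text information of web page content in the search resources to categories and segments the text information using the word segmentation dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy in each category, and achieves optimal handling of local segmentation. The improved accuracy comes closer to the user's intent, which improves the user experience and in turn reduces operations such as re-entering input and searching again; this makes operation simpler, reduces the device's responses to user operations, and reduces the consumption of device system resources.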
Description
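The present invention relates to the technical field of search, and in particular to a method for word segmentation processing based on web content categories and an apparatus for word segmentation processing based on web content categories.
With the rapid development of the Internet, network applications are becoming more diverse and the amount of information online is increasing sharply.
In many situations, users need to enter key information to obtain related information, for example entering keywords in a search engine to find web pages, or entering keywords in a forum to find posts.
Word segmentation is the basis of information processing and information retrieval; all such work is performed after segmentation. Segmentation errors therefore propagate into subsequent processing and are difficult to eliminate. For this reason, the pursuit of segmentation accuracy is a continuous process. The inherent characteristics of the Chinese language (no clear definition of a word, no separators between words, and a constant stream of new words and proper nouns) make it difficult to achieve 100% segmentation accuracy.
The main method used by current word segmentation systems is statistical word segmentation. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to constitute a word. The frequency or probability with which characters co-occur adjacently therefore reflects fairly well how credible it is that they form a word. The frequency of each combination of adjacent co-occurring characters in the corpus can be counted and their mutual information calculated: define the mutual information of two characters and calculate the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the Chinese characters are bound to each other. When the tightness is above a certain threshold, the character group may be considered to constitute a word. This method only needs to count the frequency of character groups in the corpus, but it has limitations. It often extracts common character groups that co-occur frequently but are not words, such as "这一" ("this"), "之一" ("one of"), "有的" ("some"), "我的" ("my"), and "许多的" ("many"); its recognition accuracy for commonly used words is poor; and its time and space overhead is large.
On the one hand, incorrect segmentation results cause the related information obtained later to differ greatly from what was originally expected, which makes for a very poor user experience and wastes device system resources. On the other hand, a user who still needs the related information will enter key information again to search, and the device must again search, compare, and filter massive amounts of information to obtain information related to the search keywords. This makes the user's operation more cumbersome, consumes the user's time, greatly increases the burden on the device, and consumes more device resources.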
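Summary of the Invention
In view of the above problems, the present invention is proposed in order to provide a method for word segmentation processing based on web content categories, and a corresponding apparatus, that overcome the above problems or at least partially solve or alleviate them.
According to one aspect of the invention, a method for word segmentation processing based on web content categories is provided, including:
extracting text information of web page content in the search resources;
dividing the category to which the text information belongs according to the web page content category;
performing word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
According to another aspect of the invention, an apparatus for word segmentation processing based on web content categories is provided, including:
an extraction module adapted to extract text information of web page content in the search resources;
a dividing module adapted to divide the category to which the text information belongs according to the web page content category;
a word segmentation module adapted to perform word segmentation processing on the text information according to the word segmentation dictionary corresponding to the category to which the text information belongs.
According to a further aspect of the invention, a computer program is provided, comprising computer-readable code which, when run on a computing device, causes the computing device to perform the method for word segmentation processing based on web content categories described above.
According to yet another aspect of the invention, a computer-readable medium is provided in which the computer program described above is stored.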
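The beneficial effects of the invention are as follows:
Embodiments of the invention classify the text information of web content in search resources and segment it with the dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy within each category, and achieves locally optimal segmentation. The improved accuracy brings results closer to the user's intent and improves the user experience, which in turn reduces repeated input and searching by the user, simplifies operation, reduces the responses the device must make to user operations, and reduces consumption of device system resources.
Embodiments of the invention classify the text information of web content in search resources, segment it with the dictionary of its category, and then build an inverted index from the first segmented words obtained. This avoids the uniformity and one-sidedness of an inverted index built on global text information, improves the accuracy of the inverted index within each category, and in turn improves the efficiency of index operations and reduces indexing time. Moreover, the text information of web pages in search resources includes new, unusual and distinctive text that matches the language characteristics of the category; drawing on the wisdom of others and of the community gathered in search resources compensates for the shortcomings of self-defined, manually maintained vocabularies and greatly reduces manual operating costs.
The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings: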
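FIG. 1 schematically shows a flow chart of the steps of an embodiment of a method for word segmentation processing based on a web content category according to one embodiment of the invention;
FIG. 2 schematically shows a structural block diagram of an embodiment of an apparatus for word segmentation processing based on a web content category according to one embodiment of the invention;
FIG. 3 schematically shows a block diagram of a computing device for performing the method according to the invention; and
FIG. 4 schematically shows a storage unit for holding or carrying program code that implements the method according to the invention.
The invention is further described below with reference to the drawings and specific embodiments.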
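Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for word segmentation processing based on a web content category according to one embodiment of the invention is shown. The method may include the following steps:
Step 101: extract text information of web content from search resources;
The processing flow of a search engine can generally be divided into two parts: the first is front-end handling of user requests, and the second is back-end data preparation.
I. Front-end handling of a user request may include:
1. The user enters a keyword;
2. Query analysis: the search engine segments the keyword;
3. Retrieval: based on the segmentation result, the set of relevant web pages is found in a previously built index;
4. Ranking: the candidate web pages are ranked along dimensions such as content relevance and timeliness;
5. Display: the ranked web pages are displayed.
II. Back-end data preparation may include:
1. Web crawling: a crawler follows the links between web pages, fetches pages from the Internet and saves them;
2. Index building: the saved pages are analysed, page titles and page text are segmented, and an inverted index is built from the segmentation results for use by front-end retrieval.
The pages fetched by the crawler can be stored in a web page database to form a large body of search resources, and the web content can include a large amount of text information. In embodiments of the invention, the text information of web content in the search resources can therefore be extracted from the web page database.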
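Step 102: determine the category to which the text information belongs according to the web content category;
In one case, the web page category can be obtained from the URL of the page, and the text information is then assigned to a category according to the web page category. For example, URLs of pages in the animation and comics field usually have a domain name containing the marker "comic", and URLs of pages in the sports field usually have a domain name containing the marker "sports", such as comic.XXX.com and sports.XXX.com. When a domain name marked "comic" or "sports" is detected in the URL, the page can be recognised as belonging to the animation and comics field or the sports field, and the text information can be assigned accordingly.
In another case, the visited page usually carries tag information; the web page category can be obtained from the tags, and the text information is then assigned to a category according to the web page category. For example, if a page carries tags such as video, film and comedy film, the page can be recognised as belonging to the film field, and the text information is assigned to the film field.
In yet another case, specific words in the title (topic) of the web content can be analysed to determine the web page category. For example, a title containing specific words such as basketball, football, NBA or World Cup indicates the sports field.
In a further case, the web page category can be obtained from the page navigation (for example a breadcrumb trail), and the text information is then assigned to a category according to the web page category. A breadcrumb trail usually appears horizontally near the top of the page, generally below the title or header. It gives the user links back to earlier pages (these links are also the path by which the current page can be reached), which in a hierarchical structure are usually the parent pages of the current page. It provides a path back to the site home page or entry page, usually separated by the greater-than sign (>), although some designs use other symbols (such as >>), for example "Home>Category page>Subcategory page" or "Home>>Category page>>Subcategory page". The site's own classification of the page can be read from the breadcrumb trail. For example, if the breadcrumb trail of a page is "XX portal>Sports>Chinese football>Chinese Super League", the corresponding page categories are Sports, Chinese football and Chinese Super League, and the one that matches the categories used in the actual application of the embodiment is selected.
It should be noted that embodiments of the invention may define category levels as needed. For the sports field, for example, a category may cover the whole sports field; next-level categories such as basketball and football may also be defined; and still lower-level categories may be defined, such as NBA (National Basketball Association), CBA (Chinese Basketball Association), La Liga (the Spanish top football division) and the Chinese Super League. Embodiments of the invention place no restriction on this.
The finer the category levels, the lower the probability that the text information collected for different categories overlaps and the more precise each category becomes, so the accuracy of word segmentation in embodiments of the invention is correspondingly higher.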
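The URL-based and breadcrumb-based cases of step 102 can be illustrated with a minimal Python sketch. The marker table `DOMAIN_MARKERS`, the function names and the rule of dropping the first breadcrumb entry are assumptions made for illustration; the source does not prescribe an implementation.

```python
from urllib.parse import urlparse

# Hypothetical mapping from host-name markers to categories (assumption).
DOMAIN_MARKERS = {"comic": "animation", "sports": "sports"}

def category_from_url(url):
    """Return a category if a host-name label is a known marker, else None."""
    host = urlparse(url).hostname or ""
    for label in host.split("."):
        if label in DOMAIN_MARKERS:
            return DOMAIN_MARKERS[label]
    return None

def categories_from_breadcrumb(trail):
    """Split a breadcrumb trail on '>' or '>>' and drop the site root."""
    parts = [p.strip() for p in trail.replace(">>", ">").split(">")]
    parts = [p for p in parts if p]
    return parts[1:]  # the first entry is assumed to be the home page or portal

print(category_from_url("http://sports.XXX.com/news/1"))
print(categories_from_breadcrumb("XX portal>Sports>Chinese football>Chinese Super League"))
```

A real classifier would combine these signals with the tag and title cues described above and map the result onto whatever category levels the application defines.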
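Step 103: segment the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.
In a specific implementation, each category can have its own segmentation dictionary, so that the text information is segmented in a way that fits the language characteristics of that category.
In a preferred embodiment of the invention, the segmentation dictionary may be generated as follows:
Sub-step S11: obtain a first training document;
In embodiments of the invention, the first training document may be the text information of web pages in the search resources.
Sub-step S12: determine the category to which the first training document belongs;
In a specific implementation, when the first training document is the text information of a web page in the search resources, its category can be determined according to the category of that web page.
It should be noted that, because sub-steps S11 and S12 are applied in essentially the same way as steps 101 and 102, they are described briefly; for related details see the description of steps 101 and 102, which is not repeated here.
Sub-step S13: segment the first training document corresponding to the category to obtain second segmented words;
In embodiments of the invention, the first training document can be segmented using a general segmentation dictionary. A general segmentation dictionary is a dictionary for the general domain. It does not cover the technical terms of any specific field, such as 当归 (Chinese angelica) and 草乌 (a medicinal aconite root) in Chinese herbal medicine, and mainly contains general entries whose frequency of occurrence is above a preset threshold. Specifically, a general segmentation dictionary may include general words and definite-meaning words. General words may include adjectives, conjunctions and some verbs of general meaning, for example 高兴 (happy), 但是 (but) and 参加 (participate). Definite-meaning words are words that can express a certain range of meaning, usually nouns and verbs.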
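Sub-step S14: count the word frequency and the first co-occurrence rate of the second segmented words corresponding to the category;
In a preferred example of an embodiment of the invention, an N-Gram model can be trained on the second segmented words.
In this example, the first co-occurrence rate may be the probability that two or more second segmented words appear together. Specifically, the first co-occurrence rate may be the ratio of a first frequency count to a second frequency count;
where the first frequency count is the number of times the current second segmented word appears after the target second segmented word, the target second segmented word being one or more second segmented words that appear before the current second segmented word;
and the second frequency count is the total frequency count of the target second segmented word.
The N-Gram model is a language model commonly used in large-vocabulary continuous speech recognition. It is based on the Markov assumption that the occurrence of a word depends only on a limited number of words (one or a few) that precede it. For a sentence T, assume that T consists of the word sequence W1, W2, W3, …, Wn. The joint probability of T, formed by concatenating W1, W2, W3, …, Wn, is P(T) = P(W1 W2 W3 … Wn) = P(W1) P(W2|W1) P(W3|W1 W2) … P(Wn|W1 W2 … Wn-1).
If the occurrence of a word depends only on the one word before it, the model is called a bigram, that is, P(T) = P(W1 W2 W3 … Wn) = P(W1) P(W2|W1) P(W3|W1 W2) … P(Wn|W1 W2 … Wn-1) ≈ P(W1) P(W2|W1) P(W3|W2) … P(Wn|Wn-1).
If the occurrence of a word depends only on the two words before it, the model is called a trigram. Practical applications of the N-Gram model mainly use bigrams and trigrams. Models of order four and above are used less, because training a four-gram model requires a much larger corpus, suffers from severe data sparsity and has high time complexity, while the gain in accuracy is small.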
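The first co-occurrence rate of sub-step S14 (first frequency count divided by second frequency count) coincides with the usual maximum-likelihood estimate of a bigram probability. As a sketch, writing C(·) for a count in the category corpus (a notation assumed here, not used in the original text):

```latex
P(W_i \mid W_{i-1}) \approx \frac{C(W_{i-1} W_i)}{C(W_{i-1})},
\qquad
P(T) \approx P(W_1) \prod_{i=2}^{n} P(W_i \mid W_{i-1})
```

Here C(W_{i-1} W_i) plays the role of the first frequency count and C(W_{i-1}) the role of the second frequency count.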
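The following takes the text information "I want to eat Chinese food lunch" as an example:
Segmenting the text information "I want to eat Chinese food lunch" of a given category yields the second segmented words "I", "want", "to", "eat", "Chinese", "food" and "lunch". These second segmented words and their frequency counts are shown in Table 1 and Table 2.
Table 1: Total frequency counts of the second segmented words
Second segmented word | Total frequency count |
---|---|
I | 3437 |
want | 1215 |
to | 3256 |
eat | 938 |
Chinese | 213 |
food | 1506 |
lunch | 459 |
Table 2: Frequency counts of the current second segmented word (column) appearing after the target second segmented word (row)
 | I | want | to | eat | Chinese | food | lunch |
---|---|---|---|---|---|---|---|
I | 8 | 1087 | 0 | 13 | 0 | 0 | 0 |
want | 3 | 0 | 786 | 0 | 6 | 8 | 6 |
to | 3 | 0 | 10 | 860 | 3 | 0 | 12 |
eat | 0 | 0 | 2 | 0 | 19 | 2 | 52 |
Chinese | 2 | 0 | 0 | 0 | 0 | 120 | 1 |
food | 19 | 0 | 17 | 0 | 0 | 0 | 0 |
lunch | 4 | 0 | 0 | 0 | 0 | 1 | 0 |
For example, the value 1087 in the second row, third column means that the current second segmented word "want" appears after the target second segmented word "I" 1087 times.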
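As a worked illustration, the co-occurrence rate defined in sub-step S14 can be computed directly from the counts in Table 1 and Table 2. The following Python sketch copies the table values; the names `TOTAL`, `FOLLOW`, `cooccurrence_rate` and `chain_probability` are hypothetical and chosen for this example only.

```python
WORDS = ["I", "want", "to", "eat", "Chinese", "food", "lunch"]

# Table 1: total frequency count of each second segmented word.
TOTAL = dict(zip(WORDS, [3437, 1215, 3256, 938, 213, 1506, 459]))

# Table 2: FOLLOW[target][j] is how often WORDS[j] appears right after target.
FOLLOW = {
    "I":       [8, 1087, 0, 13, 0, 0, 0],
    "want":    [3, 0, 786, 0, 6, 8, 6],
    "to":      [3, 0, 10, 860, 3, 0, 12],
    "eat":     [0, 0, 2, 0, 19, 2, 52],
    "Chinese": [2, 0, 0, 0, 0, 120, 1],
    "food":    [19, 0, 17, 0, 0, 0, 0],
    "lunch":   [4, 0, 0, 0, 0, 1, 0],
}

def cooccurrence_rate(target, current):
    """First frequency count divided by second frequency count."""
    return FOLLOW[target][WORDS.index(current)] / TOTAL[target]

def chain_probability(words):
    """Product of bigram co-occurrence rates along a word sequence (P(W1) omitted)."""
    p = 1.0
    for target, current in zip(words, words[1:]):
        p *= cooccurrence_rate(target, current)
    return p

print(round(cooccurrence_rate("I", "want"), 3))  # 1087 / 3437
print(chain_probability(["I", "want", "to", "eat", "Chinese", "food"]))
```

The rate for "want" after "I" is 1087/3437, roughly 0.316, which is the kind of value the dictionary of sub-step S15 would store alongside the word.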
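Sub-step S15: generate the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.
In dictionary-based segmentation, commonly used lookup methods include hash lookup, TRIE tree lookup (a TRIE is also called a word search tree or key tree), binary search and sequential search. Several lookup methods can be combined in practice. For example, a hash-based segmentation dictionary can combine hash lookup with binary search, and a TRIE-based segmentation dictionary can combine TRIE tree lookup with binary search. Combining lookup methods in this way can improve lookup efficiency.
In embodiments of the invention, the segmentation dictionary can therefore be generated according to a lookup mechanism formed from one or more of hash lookup, TRIE tree lookup, binary search, sequential search and similar methods, so that the dictionary supports that lookup mechanism.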
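To make the TRIE lookup mentioned above concrete, here is a minimal Python sketch of a trie that stores dictionary entries and finds the longest entry starting at a given position. The class and method names and the frequency values are assumptions for illustration; the source does not specify a data layout.

```python
class Trie:
    """A minimal character trie; `freq` is set only where a dictionary entry ends."""

    def __init__(self):
        self.children = {}
        self.freq = None

    def insert(self, word, freq):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.freq = freq

    def longest_match(self, text, start=0):
        """Length of the longest dictionary entry that begins at text[start]."""
        node, best = self, 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.freq is not None:
                best = i - start + 1
        return best

trie = Trie()
for word, freq in [("人", 50), ("人参", 900), ("当归", 800)]:  # hypothetical counts
    trie.insert(word, freq)

print(trie.longest_match("人参与当归"))
print(trie.longest_match("人参与当归", 3))
```

A trie lets the segmenter test all prefixes at one position in a single pass, which is why it pairs naturally with maximum-matching strategies.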
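In a preferred embodiment of the invention, the segmentation dictionary may be updated as follows:
Sub-step S21: obtain a second training document;
In embodiments of the invention, the second training document may be the text information of web pages in the search resources.
Sub-step S22: determine the category to which the second training document belongs;
In a specific implementation, when the second training document is the text information of a web page in the search resources, its category can be determined according to the category of that web page.
It should be noted that, because sub-steps S21 and S22 are applied in essentially the same way as steps 101 and 102, they are described briefly; for related details see the description of steps 101 and 102, which is not repeated here.
Sub-step S23: segment the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words;
In embodiments of the invention, the second training document can be segmented using the segmentation dictionary corresponding to the category to which it belongs. For a character string to be segmented in the given second training document, a substring is cut from the string according to some fixed principle, for example forward maximum matching (MM), reverse maximum matching (RMM) or bidirectional scanning. If the substring matches an entry in the segmentation dictionary, it is taken to be a third segmented word, a segmentation mark is inserted, and the remainder is segmented in the same way until nothing remains. Otherwise the substring is not a third segmented word, and a new substring is cut from the string for the next match.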
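The forward maximum matching (MM) strategy named in sub-step S23 can be sketched in a few lines of Python. The function name, the `max_len` parameter and the fallback to a single character when nothing matches are assumptions of this sketch, as are the two toy dictionaries.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a set of dictionary entries."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unmatched single character is emitted as its own token.
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result

medical_dict = {"人参", "与", "当归"}   # toy category dictionary
general_dict = {"人", "参与", "当归"}   # toy general dictionary

print(forward_max_match("人参与当归", medical_dict))
print(forward_max_match("人参与当归", general_dict))
```

With these toy dictionaries the same string is cut as 人参 / 与 / 当归 by the category dictionary and as 人 / 参与 / 当归 by the general one, which is the ambiguity that the category-specific dictionary is meant to resolve.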
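Sub-step S24: count the word frequency and the second co-occurrence rate of the third segmented words corresponding to the category;
In a preferred example of an embodiment of the invention, an N-Gram model can be trained on the third segmented words.
In this example, the second co-occurrence rate may be the probability that two or more third segmented words appear together. Specifically, the second co-occurrence rate may be the ratio of a third frequency count to a fourth frequency count;
where the third frequency count is the number of times the current third segmented word appears after the target third segmented word, the target third segmented word being one or more third segmented words that appear before the current third segmented word;
and the fourth frequency count is the total frequency count of the target third segmented word.
Sub-step S25: update the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.
In embodiments of the invention, the segmentation dictionary can be updated according to a lookup mechanism formed from one or more of hash lookup, TRIE tree lookup, binary search, sequential search and similar methods.
In a preferred embodiment of the invention, the text information can be segmented directly using the segmentation dictionary corresponding to the category to which it belongs.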
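In a preferred embodiment of the invention, step 103 may include the following sub-steps:
Sub-step S31: segment the text information according to both the segmentation dictionary corresponding to the category to which it belongs and a general dictionary;
Sub-step S32: take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
In embodiments of the invention, the segmentation dictionary of the category to which the text information belongs and the general dictionary (the general segmentation dictionary) can be used together to segment the text information.
Take the text information "人参与当归" as an example. In the overall corpus the frequencies of "人" (person) and "参与" (participate) are bound to be higher than those of "人参" (ginseng) and "与" (and), so segmentation based on the general dictionary cuts this text into "人", "参与", "当归", which is clearly wrong. The text "人参与当归" appears frequently in health-care documents; if it is segmented with the dictionary of the health-care category to which it belongs, it is cut into "人参", "与", "当归". Comparing frequencies shows that the relative frequency of "人参" in the category is significantly higher than its relative frequency in the overall corpus, so "人参", "与", "当归" is finally chosen as the segmentation of "人参与当归".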
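The comparison in sub-steps S31 and S32 can be sketched as follows. The source only states that the segmentation with the highest word frequency is chosen; scoring each candidate by the product of the relative frequencies of its words is one assumed way to make that comparison, and every count below is invented for illustration.

```python
import math

def relative_score(words, freq, corpus_size):
    """Product of relative word frequencies (0 if any word is unknown)."""
    return math.prod(freq.get(w, 0) / corpus_size for w in words)

def pick_segmentation(candidates):
    """candidates: list of (words, frequency table, corpus size) tuples."""
    return max(candidates, key=lambda c: relative_score(*c))[0]

# Hypothetical counts for a general corpus and a health-care corpus.
general = (["人", "参与", "当归"],
           {"人": 20000, "参与": 5000, "当归": 50}, 1_000_000)
medical = (["人参", "与", "当归"],
           {"人参": 900, "与": 3000, "当归": 800}, 100_000)

best = pick_segmentation([general, medical])
print(best)
```

Using relative rather than absolute frequencies keeps a small category corpus from being outweighed by the sheer size of the general corpus, which matches the "relative word frequency" comparison in the example above.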
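In yet another preferred embodiment of the invention, step 103 may include the following sub-steps:
Sub-step S41: when the text information belongs to several categories, segment the text information separately according to the segmentation dictionary corresponding to each category;
Sub-step S42: take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
In embodiments of the invention, text information may be assigned to several categories, that is, it belongs to a cross-domain area. For example, text information about aircraft may be classified under the mechanical field and also under the aviation field.
In such cross-domain cases, the text information can be segmented separately with the segmentation dictionary of each category it belongs to, and the result with the highest word frequency is taken as the segmentation result.
The main method used by current segmentation systems is statistical segmentation. Put simply, the decision about where to cut relies mainly on information such as word frequencies and transition probabilities between candidate words. Because it is statistical, it necessarily serves the majority at the expense of the minority. In other words, it pursues an optimum in the global statistical sense rather than in each local context, so segmentation accuracy in local contexts is low.
Embodiments of the invention classify the text information of web content in search resources and segment it with the dictionary of its category. This adapts better to the language characteristics of different categories, improves segmentation accuracy within each category, and achieves locally optimal segmentation. The improved accuracy brings results closer to the user's intent and improves the user experience, which in turn reduces repeated input and searching by the user, simplifies operation, reduces the responses the device must make to user operations, and reduces consumption of device system resources.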
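In a preferred embodiment of the invention, the method may further include the following step:
Step 104: for the category, build an inverted index from the first segmented words obtained by segmentation.
The inverted index arises from the practical need to find records by the value of an attribute. Each entry in such an index table contains an attribute value and the addresses of the records that have that value. Because the attribute value determines the position of the record, rather than the record determining the attribute value, it is called an inverted index. A file with an inverted index is called an inverted index file, or inverted file for short.
In an inverted file (inverted index), the indexed objects are the words in a document or a collection of documents (for example web pages). It stores the positions of these words within a document or a group of documents and is a commonly used indexing mechanism for documents and document collections.
In a preferred embodiment of the invention, step 104 may include the following sub-steps:
Sub-step S51: for the category, record the positions at which the first segmented words corresponding to the category occur;
Sub-step S52: record the first segmented words and their corresponding positions in the inverted index.
In a specific implementation, the position of a first segmented word may be the web page in which it occurs, or that web page together with the position within the page.
Taking English as an example, the following is the text information in the web pages to be indexed:
T1="it is what it is";
T2="what is it";
T3="it is a banana";
The following is the inverted index, in which each pair gives the zero-based page number and the zero-based word position:
"a": {(2,2)}
"banana": {(2,3)}
"is": {(0,1),(0,4),(1,1),(2,1)}
"it": {(0,0),(0,3),(1,2),(2,0)}
"what": {(0,2),(1,0)}
Here "banana": {(2,3)} means that "banana" occurs in the text information of the third web page (T3), where it is the fourth word (address 3).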
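Sub-steps S51 and S52 can be illustrated with a short Python sketch that builds a positional inverted index for T1 to T3. The function name is hypothetical, and whitespace splitting stands in for the category-specific segmentation that the method would actually apply to Chinese text.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to a list of (document number, word position) pairs, both zero-based."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: whitespace splitting replaces dictionary-based segmentation here.
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return dict(index)

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)

print(index["banana"])
print(index["what"])
```

To follow step 104 more closely, one such index would be kept per category, for example in a dictionary keyed by category name.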
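Embodiments of the invention classify the text information of web content in search resources, segment it with the dictionary of its category, and then build an inverted index from the first segmented words obtained. This avoids the uniformity and one-sidedness of an inverted index built on global text information, improves the accuracy of the inverted index within each category, and in turn improves the efficiency of index operations and reduces indexing time. Moreover, the text information of web pages in search resources includes new, unusual and distinctive text that matches the language characteristics of the category; drawing on the wisdom of others and of the community gathered in search resources compensates for the shortcomings of self-defined, manually maintained vocabularies and greatly reduces manual operating costs.
For simplicity, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that embodiments of the invention are not limited by the order of actions described, because according to embodiments of the invention some steps may be performed in a different order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by embodiments of the invention.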
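Referring to FIG. 2, a structural block diagram of an embodiment of an apparatus for word segmentation processing based on a web content category according to one embodiment of the invention is shown. The apparatus may include the following modules:
an extraction module 201, adapted to extract text information of web content from search resources;
a classification module 202, adapted to determine the category to which the text information belongs according to the web content category;
a segmentation module 203, adapted to segment the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.
In a preferred embodiment of the invention, the apparatus may further include the following module:
a building module, adapted to build, for the category, an inverted index from the first segmented words obtained by segmentation.
In a preferred embodiment of the invention, the building module may be further adapted to:
record, for the category, the positions at which the first segmented words corresponding to the category occur;
record the first segmented words and their corresponding positions in the inverted index.
In a preferred embodiment of the invention, the segmentation dictionary may be generated as follows:
obtaining a first training document;
determining the category to which the first training document belongs;
segmenting the first training document corresponding to the category to obtain second segmented words;
counting the word frequency and the first co-occurrence rate of the second segmented words corresponding to the category;
generating the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.
In a preferred embodiment of the invention, the first co-occurrence rate may be the ratio of a first frequency count to a second frequency count;
where the first frequency count is the number of times the current second segmented word appears after the target second segmented word, the target second segmented word being one or more second segmented words that appear before the current second segmented word;
and the second frequency count is the total frequency count of the target second segmented word.
In a preferred embodiment of the invention, the segmentation dictionary may be updated as follows:
obtaining a second training document;
determining the category to which the second training document belongs;
segmenting the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words;
counting the word frequency and the second co-occurrence rate of the third segmented words corresponding to the category;
updating the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.
In a preferred embodiment of the invention, the second co-occurrence rate may be the ratio of a third frequency count to a fourth frequency count;
where the third frequency count is the number of times the current third segmented word appears after the target third segmented word, the target third segmented word being one or more third segmented words that appear before the current third segmented word;
and the fourth frequency count is the total frequency count of the target third segmented word.
In a preferred embodiment of the invention, the segmentation module 203 may be further adapted to:
segment the text information according to both the segmentation dictionary corresponding to the category to which it belongs and a general dictionary;
take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
In a preferred embodiment of the invention, the segmentation module 203 may be further adapted to:
when the text information belongs to several categories, segment the text information separately according to the segmentation dictionary corresponding to each category;
take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
Because the apparatus embodiment is essentially similar to the method embodiment, it is described briefly; for related details see the description of the method embodiment.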
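The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for word segmentation processing based on web content classification according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.
For example, FIG. 3 shows a computing device, such as a retrieval server, that can implement word segmentation processing based on web content classification according to the invention. The computing device conventionally includes a processor 310 and a computer program product or computer readable medium in the form of a memory 320. The memory 320 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk or ROM. The memory 320 has a storage space 330 for program code 331 for performing any of the method steps described above. For example, the storage space 330 for program code may include individual program codes 331 for implementing the various steps of the above method. The program code can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CD), memory cards or floppy disks. Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 4. The storage unit may have storage segments, storage spaces and the like arranged similarly to the memory 320 in the computing device of FIG. 3. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer readable code 331', that is, code that can be read by a processor such as 310, which, when run by a computing device, causes the computing device to perform each step of the method described above.
References in this document to "one embodiment", "an embodiment" or "one or more embodiments" mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
Numerous specific details are set out in the specification provided here. It will be understood, however, that embodiments of the invention can be practised without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this specification.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference symbol placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
It should also be noted that the language used in this specification has been chosen mainly for readability and instructional purposes, not to explain or delimit the subject matter of the invention. Many modifications and variations will therefore be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With regard to the scope of the invention, the disclosure is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.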
Claims (20)
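- A method for word segmentation processing based on a web content category, comprising the steps of: extracting text information of web content from search resources; determining the category to which the text information belongs according to the web content category; and segmenting the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.
- The method of claim 1, further comprising the step of: for the category, building an inverted index from the first segmented words obtained by segmentation.
- The method of any one of claims 1 to 2, wherein the step of building, for the category, an inverted index from the first segmented words obtained by segmentation comprises: for the category, recording the positions at which the first segmented words corresponding to the category occur; and recording the first segmented words and their corresponding positions in the inverted index.
- The method of claim 1, wherein the segmentation dictionary is generated by: obtaining a first training document; determining the category to which the first training document belongs; segmenting the first training document corresponding to the category to obtain second segmented words; counting the word frequency and the first co-occurrence rate of the second segmented words corresponding to the category; and generating the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.
- The method of claim 4, wherein the first co-occurrence rate comprises the ratio of a first frequency count to a second frequency count; wherein the first frequency count comprises the number of times the current second segmented word appears after the target second segmented word; the target second segmented word comprises one or more second segmented words that appear before the current second segmented word; and the second frequency count comprises the total frequency count of the target second segmented word.
- The method of claim 1 or 4, wherein the segmentation dictionary is updated by: obtaining a second training document; determining the category to which the second training document belongs; segmenting the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words; counting the word frequency and the second co-occurrence rate of the third segmented words corresponding to the category; and updating the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.
- The method of claim 6, wherein the second co-occurrence rate comprises the ratio of a third frequency count to a fourth frequency count; wherein the third frequency count comprises the number of times the current third segmented word appears after the target third segmented word; the target third segmented word comprises one or more third segmented words that appear before the current third segmented word; and the fourth frequency count comprises the total frequency count of the target third segmented word.
- The method of claim 1, wherein the step of segmenting the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises: segmenting the text information according to both the segmentation dictionary corresponding to the category to which the text information belongs and a general dictionary; and taking the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
- The method of claim 1, wherein the step of segmenting the text information according to the segmentation dictionary corresponding to the category to which the text information belongs comprises: when the text information belongs to several categories, segmenting the text information separately according to the segmentation dictionary corresponding to each category; and taking the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
- An apparatus for word segmentation processing based on a web content category, comprising: an extraction module, adapted to extract text information of web content from search resources; a classification module, adapted to determine the category to which the text information belongs according to the web content category; and a segmentation module, adapted to segment the text information according to the segmentation dictionary corresponding to the category to which the text information belongs.
- The apparatus of claim 10, further comprising: a building module, adapted to build, for the category, an inverted index from the first segmented words obtained by segmentation.
- The apparatus of any one of claims 10 to 11, wherein the building module is further adapted to: for the category, record the positions at which the first segmented words corresponding to the category occur; and record the first segmented words and their corresponding positions in the inverted index.
- The apparatus of claim 10, wherein the segmentation dictionary is generated by: obtaining a first training document; determining the category to which the first training document belongs; segmenting the first training document corresponding to the category to obtain second segmented words; counting the word frequency and the first co-occurrence rate of the second segmented words corresponding to the category; and generating the segmentation dictionary corresponding to the category from the second segmented words and their first co-occurrence rates.
- The apparatus of claim 13, wherein the first co-occurrence rate comprises the ratio of a first frequency count to a second frequency count; wherein the first frequency count comprises the number of times the current second segmented word appears after the target second segmented word; the target second segmented word comprises one or more second segmented words that appear before the current second segmented word; and the second frequency count comprises the total frequency count of the target second segmented word.
- The apparatus of claim 10 or 13, wherein the segmentation dictionary is updated by: obtaining a second training document; determining the category to which the second training document belongs; segmenting the text information according to the segmentation dictionary corresponding to the category to obtain third segmented words; counting the word frequency and the second co-occurrence rate of the third segmented words corresponding to the category; and updating the segmentation dictionary corresponding to the category with the third segmented words and their second co-occurrence rates.
- The apparatus of claim 15, wherein the second co-occurrence rate comprises the ratio of a third frequency count to a fourth frequency count; wherein the third frequency count comprises the number of times the current third segmented word appears after the target third segmented word; the target third segmented word comprises one or more third segmented words that appear before the current third segmented word; and the fourth frequency count comprises the total frequency count of the target third segmented word.
- The apparatus of claim 10, wherein the segmentation module is further adapted to: segment the text information according to both the segmentation dictionary corresponding to the category to which the text information belongs and a general dictionary; and take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
- The apparatus of claim 10, wherein the segmentation module is further adapted to: when the text information belongs to several categories, segment the text information separately according to the segmentation dictionary corresponding to each category; and take the segmented words with the highest word frequency after segmentation as the first segmented words obtained by segmentation.
- A computer program comprising computer readable code which, when run on a computing device, causes the computing device to perform the method for word segmentation processing based on a web content category according to any one of claims 1-9.
- A computer readable medium in which the computer program of claim 19 is stored.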
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410126465.1 | 2014-03-31 | ||
CN201410126465.1A CN104008126A (zh) | 2014-03-31 | 2014-03-31 | Method and apparatus for word segmentation processing based on web content classification |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015149533A1 (zh) | 2015-10-08 |
Family
ID=51368783
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/093396 WO2015149533A1 (zh) | Method and apparatus for word segmentation processing based on web content classification | 2014-03-31 | 2014-12-09 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN104008126A (zh) |
WO (1) | WO2015149533A1 (zh) |
-
2014
- 2014-03-31 CN CN201410126465.1A patent/CN104008126A/zh active Pending
- 2014-12-09 WO PCT/CN2014/093396 patent/WO2015149533A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN104008126A (zh) | 2014-08-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14888187; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | |
122 | Ep: pct application non-entry in european phase | Ref document number: 14888187; Country of ref document: EP; Kind code of ref document: A1 |