US20060020588A1 - Constructing and maintaining a personalized category tree, displaying documents by category and personalized categorization system - Google Patents
- Publication number
- US20060020588A1 (application US 11/188,194)
- Authority
- US
- United States
- Prior art keywords
- category
- node
- personalized
- documents
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Definitions
- the present invention relates to data processing technologies, more specifically, to the technology of constructing a personalized category tree and categorizing documents by utilizing the personalized category tree.
- a category tree is defined by a domain expert, and each category node in the category tree is defined with a training set of manually labeled documents.
- a corresponding categorizer is then constructed by utilizing the set of training documents.
- the documents to be categorized are automatically categorized with the categorizer.
- the precision of the traditional categorization method depends on the number and quality of training samples available in the training set.
- the distribution of training samples among various categories of a category tree is often not even, with some category nodes only having a few training samples.
- the most common (commonly used) category is “earn”, having 2,877 training documents, but 82% of the categories have fewer than 100 instances, and 33% of the categories have fewer than 10 instances.
- the test results with the above methods showed that their performance is a function of the training-set category frequency. For categories with fewer than 10 training documents, the macro-averaged F measure is below 0.2, while for categories with a training-set frequency above 2,000, the macro-averaged F measure can reach 0.9 or more. This shows that statistical methods do not work well when the training set is small.
- the user defines a sub-tree and wants to put documents related to IBM products into it, i.e. documents related to IBM PC into the category “PC” and documents related to IBM Server into the category “Server”. Over time, however, the user may also want to collect some documents about DELL PC in the category “PC”. This operation introduces a semantic inconsistency into the personalized category tree. Traditional categorization methods cannot place the semantically inconsistent DELL PC documents in the category “PC”, and thus cannot realize such a personalized category tree.
- a user may wish to create an arbitrary personalized category tree, similar to his file folder hierarchy, and to freely map onto it a semantic structure that meets his needs, without being limited by traditional semantic consistency. The user may also wish to avoid manually specifying a large number of training samples, which is lengthy and time- and energy-consuming, thereby realizing personalized document categorization that satisfies personal needs.
- the present invention provides methods, systems and apparatus for constructing and maintaining a personalized category tree, and for displaying documents by category using the personalized category tree and a personalized categorization system. This enables a user to perform personalized document categorization by defining a personalized category tree that satisfies his personal needs, without having to manually label a large set of training documents or consider the problem of semantic inconsistency.
- the personalized category tree is a category tree that includes at least one category node, the step of creating independently each of the at least one category node comprising: defining a label for the category node; and specifying at least one keyword for the category node.
- the label for the category is a keyword for the category.
- the personalized category tree is a tree that includes at least one category node, the step of creating each of the at least one category node comprising: searching for documents using at least one keyword; selecting at least one document from the search results; defining a label for the category node; specifying the keyword used in the searching step as the keyword for the category node; and specifying the selected at least one document as the feature document for the category node.
- a personalized categorization system comprising: a category tree editor for creating and modifying a personalized category tree.
- FIG. 1 is a flowchart of a method for constructing a personalized category tree according to one embodiment of the present invention
- FIG. 2 is a flowchart of a method for constructing each node in a personalized category tree according to another embodiment of the present invention
- FIG. 3 is a flowchart of a method for maintaining a personalized category tree according to one embodiment of the present invention
- FIG. 4 is a flowchart illustrating detailed steps of performing topic tracking for a category node to retrieve relevant documents in a method for maintaining a personalized category tree according to one embodiment of the present invention
- FIG. 5 is a diagram for illustrating document length normalization for a document
- FIG. 6 is a flowchart of a method for displaying documents by category using a personalized category tree according to one embodiment of the present invention
- FIGS. 7A-7C show the results of displaying documents under different display modes in a method of displaying documents by category using a personalized category tree
- FIG. 8 is a block diagram of a personalized categorization system according to one embodiment of the present invention.
- the present invention provides methods, systems and apparatus for constructing and maintaining a personalized category tree, and for displaying documents by category using the personalized category tree and a personalized categorization system. This enables a user to perform personalized document categorization by defining a personalized category tree that satisfies his personal needs, without having to manually label a large set of training documents or consider the problem of semantic inconsistency.
- the personalized category tree is a category tree that includes at least one category node.
- the step of creating independently each of the at least one category node comprising: defining a label for the category node; and specifying at least one keyword for the category node.
- the label for the category is a keyword for the category.
- in a method for constructing a personalized category tree, the personalized category tree is a tree that includes at least one category node, the step of creating each of the at least one category node comprising: searching for documents using at least one keyword; selecting at least one document from the search results; defining a label for the category node; specifying the keyword used in the searching step as the keyword for the category node; and specifying the selected at least one document as the feature document for the category node.
- in a method for maintaining a personalized category tree, the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label and at least one keyword, the method comprising: for each of the at least one category node, searching for documents using the at least one keyword included in the category node; selecting at least one document from the search results as the feature document for the category node; and performing topic tracking to add documents relevant to the category node based on the at least one feature document.
- the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label, at least one keyword and at least one feature document, the method comprising: for each of the at least one category node, performing topic tracking to add documents relevant to the category node based on the at least one feature document.
- the personalized category tree is a tree that includes at least one category node, each of the at least one category node includes a label and relevant documents belonging to the category node, the method comprising: selecting a category node in the personalized category tree; and displaying the relevant documents belonging to that category node.
- a personalized categorization system includes: a category tree editor for creating and modifying a personalized category tree, wherein the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label and at least one keyword; and a category node editor for configuring a category node in the personalized category tree.
- a category tree is constructed by a domain expert, and a large set of training documents is selected for each category in the category tree; the training sets are then used to perform semantic identification of new documents in order to assign them to the respective categories in the category tree.
- Such a category tree complies with the semantic consistency requirement, not allowing a user to introduce semantic inconsistencies.
- A traditional category tree organizes documents and the relationships between them in the form of a tree in which a parent node and a child node form a containment relationship. There is a strict semantic qualification relationship between them: they depend on each other during training, and the qualification of the child node includes the semantic qualification of the parent node, that is, the parent node includes all the documents belonging to the child node.
- the parent node and the child node are qualified separately and independently, their semantics are relatively independent, and a user's needs for browsing and searching for documents are satisfied through different views and through document organization, customization and filtering. That is, in a category tree according to the present invention, while the form of the path organization of a parent node and a child node suggests a parent-child relationship, the qualifications and contents of the parent node and the child node are independent of each other.
- FIG. 1 is a flowchart of a method for constructing a personalized category tree according to one embodiment of the present invention.
- a personalized category tree of the present invention allows semantic inconsistency, so when constructing it there is no need to consider consistency between a child node and a parent node; each category node is created with the same steps.
- At Step 105, initialization is performed to create a category tree that contains only the root node.
- At Step 110, a category node is added into the personalized category tree.
- a label is defined for that new category node.
- the label should be able to represent category features of the node, similar to the name of a file folder.
- At Step 120 at least one keyword is specified for that new category node.
- the label for the category may serve as a keyword for the category.
- the keywords for a category node are used to describe the topic content of that category node, and as described later, the keywords can be utilized to find documents relevant to the category node and the feature documents for the category node.
- an information source is specified for that new category node.
- the information source is used to indicate the source of relevant documents of the category node, and may be, for example, a URL, a path, an IP address or a computer name.
- a category node can be specified with either one information source or multiple information sources, and multiple category nodes can share one information source.
- that category node inherits the information source of its parent node by default.
- a feature document is a document that is highly relevant to the category node and can best represent the content of the category, equivalent to training samples in the traditional categorization methods.
- the difference from the traditional methods is that the number of feature documents in the present invention can be much smaller than the number of training samples in the traditional categorization methods (for example, a user may only need to select 3 to 5 samples), thereby saving the user time in specifying the feature documents.
- At Step 135, it is determined whether the task of constructing the personalized category tree has been accomplished. If a new category node still needs to be added, the method returns to Step 110 and repeats Steps 110 to 130 to add the new node to the personalized category tree.
- If the determination in Step 135 is that the constructing task has been accomplished, the method ends at Step 140.
- each category node can be created simply and in the same way, so this can be done conveniently even by an ordinary user who is not a language expert.
- a user need not specify a large number of training samples, thereby reducing the workload.
- Step 125 and/or Step 130 can be omitted, that is, information source and feature documents are not specified for each node.
- the same information source can be specified for the whole category tree or a child node can use the information source of its parent node, and the feature documents can be selected during the process of maintaining the personalized category tree as described later.
- the information source is not specified; instead, the information sources that the user can access or is authorized to access are regarded as the information source for the node. The feature documents may likewise not be selected; instead, the documents that the user frequently accesses are used as the feature documents, or only the keywords are used to perform categorization.
- the user's workload in constructing a personalized category tree can be further reduced.
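The node-creation loop of FIG. 1 can be illustrated with a small data structure. The following is a minimal sketch, not the patent's implementation: the class name `CategoryNode`, the helper `add_node`, the field names and the example path are all hypothetical. It assumes that a node without its own information source falls back to the nearest ancestor that has one, and that the label serves as the keyword when none is given.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CategoryNode:
    """Hypothetical container for one category node (names are illustrative)."""
    label: str
    keywords: List[str] = field(default_factory=list)
    sources: List[str] = field(default_factory=list)        # URLs, paths, host names
    feature_docs: List[str] = field(default_factory=list)   # optional, may stay empty
    parent: Optional["CategoryNode"] = field(default=None, repr=False, compare=False)
    children: List["CategoryNode"] = field(default_factory=list, repr=False, compare=False)

    def effective_sources(self) -> List[str]:
        """Return this node's sources, else those of the nearest ancestor that has any."""
        node = self
        while node is not None:
            if node.sources:
                return node.sources
            node = node.parent
        return []


def add_node(parent, label, keywords=None, sources=None, feature_docs=None):
    """Steps 110-130: add a node; the label doubles as a keyword if none is given."""
    node = CategoryNode(
        label=label,
        keywords=list(keywords) if keywords else [label],
        sources=list(sources or []),            # Step 125 may be omitted
        feature_docs=list(feature_docs or []),  # Step 130 may be omitted
        parent=parent,
    )
    parent.children.append(node)
    return node


root = CategoryNode(label="root", sources=["/home/user/docs"])  # Step 105
ibm = add_node(root, "IBM")
# Keywords of a child need not be consistent with its parent's topic.
pc = add_node(ibm, "PC", keywords=["IBM PC", "DELL PC"])
print(pc.effective_sources())
```

Because every node is created by the same call regardless of its position, nothing in the sketch checks a child against its parent, which mirrors the absence of semantic constraints described above.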
- FIG. 2 is a flowchart of a method for constructing each node in a personalized category tree according to another embodiment of the present invention.
- the method for constructing a personalized category tree in the present embodiment differs from the method for constructing a personalized category tree in the above embodiment in that the process of creating each category node is different.
- the creation of each category node in a personalized category tree in the present embodiment is accomplished simultaneously when a user retrieves documents.
- the user searches for documents from the information source by utilizing one or more keywords.
- the user can utilize the keyword(s) to find documents that include the keyword(s) in a local or network path, or, for example, the user can enter keywords in a search engine to query relevant documents.
- At Step 210 at least one document is selected from the search results of the previous step. Specifically, the user can select one or more desired documents by browsing the abstract or the body text of each document found.
- a category node is added into the personalized category tree.
- the user can add the category node at any desired location in the personalized category tree.
- At Step 220, a label is defined for the category node to label the category.
- At Step 225, the keywords used for the search in Step 205 are specified as the keywords for that category node.
- At Step 230, the documents selected in Step 210 are specified as the feature documents for that category node.
- an information source is specified for that category node.
- the information source may be a path used to find the documents in the previous Step 205 , or may be the URL or path of the documents found in case that the user has performed the query through a search engine.
- multiple information sources can also be specified for the category node, for example, when the documents found are from different locations.
- when a new node is added, the child node can inherit the attributes of its parent node, such as the information source and the keywords, and common attributes, such as the information source, can be set for the whole category tree.
- because each category node in a category tree is created separately according to the respective needs while the category tree is constructed, the category nodes are equal to and independent of each other. The personalized category tree constructed with the above embodiments therefore has no semantic constraints between the category nodes, thereby allowing semantic inconsistencies.
- the process of constructing a category tree is greatly simplified, saving a great deal of manpower and precious time.
- because information sources can be specified for each category node individually, and one category node can have multiple information sources, it is even more convenient for a user to manage documents using a personalized category tree.
- the work of adding a new category node into a personalized category tree can be accomplished while a user retrieves documents, thereby combining retrieval with the creation of the personalized category tree and further simplifying the user's work.
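The search-driven node creation of FIG. 2 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the in-memory `corpus` dictionary stands in for an information source, the functions `search` and `create_node_from_search` are hypothetical, the tree is a plain dictionary keyed by label, and taking the first hit stands in for the user's manual selection in Step 210.

```python
def search(corpus, keywords):
    """Step 205: return ids of documents containing any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [doc_id for doc_id, text in corpus.items()
            if any(k in text.lower() for k in lowered)]


def create_node_from_search(tree, parent_label, label, keywords, selected, source):
    """Record the label (Step 220), search keywords (Step 225),
    chosen documents (Step 230) and the information source for a new node."""
    node = {
        "label": label,
        "keywords": list(keywords),
        "feature_docs": list(selected),
        "sources": [source],
        "children": [],
    }
    tree[label] = node
    tree[parent_label]["children"].append(label)
    return node


# Hypothetical information source: document id -> text.
corpus = {
    "a.txt": "IBM ThinkPad PC review",
    "b.txt": "Server rack pricing",
    "c.txt": "DELL PC and ThinkPad comparison",
}
tree = {"root": {"label": "root", "keywords": [], "feature_docs": [],
                 "sources": [], "children": []}}

hits = search(corpus, ["thinkpad"])   # Step 205
selected = hits[:1]                   # Step 210: stands in for the user's choice
node = create_node_from_search(tree, "root", "PC", ["thinkpad"], selected, "corpus")
print(hits, node["feature_docs"])
```

The point of the sketch is that the query already carries everything a node needs: the keywords, the feature documents and the place the documents came from.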
- FIG. 3 is a flowchart of a method for maintaining a personalized category tree according to one embodiment of the present invention. It should be pointed out that the present embodiment applies to a personalized category tree that is generated by the above described method for constructing a personalized category tree, and includes at least one category node, each category node including a label, at least one keyword and an information source used to indicate the source of relevant documents of the category node.
- a category node is selected from the personalized category tree. Because a parent node and a child node in a personalized category tree of the present invention are relatively independent and have no strict semantic constraint relationship, the category nodes can be selected for processing one by one in any order, such as depth-first, breadth-first or another order, when maintaining the personalized category tree.
- the keywords are used to search for relevant documents from the information source specified for the category node.
- At Step 315, at least one document is selected from the search results of the previous step as the feature document for the category node.
- At Step 320, topic tracking is performed on the documents in the information source specified for the category node, according to the at least one feature document, to add relevant documents of the category node.
- At Step 325, it is determined whether the maintenance work for the personalized category tree has been accomplished. If any other node in the personalized category tree needs to be maintained, the method proceeds to Step 330.
- At Step 330, the next category node that needs to be maintained in the personalized category tree is selected, and the method returns to Step 310 and repeats Steps 310 to 325.
- If the determination at Step 325 is that all nodes have been processed, the method ends at Step 335.
- if feature documents have already been specified for the category nodes in the personalized category tree, Step 310 and Step 315 can be omitted during maintenance of the node, and the topic tracking can be performed directly according to the specified feature documents.
- if an information source has not been specified for each category node in the personalized category tree, document finding and/or topic tracking can be performed on documents in a common information source during maintenance of the node.
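The maintenance loop of FIG. 3 can be sketched as a traversal that visits nodes in any order. The code below is a minimal sketch under assumptions: nodes are plain dictionaries, and `search`, `select_features` and `track_topic` are hypothetical callables supplied by the caller (here replaced by toy lambdas). Breadth-first order is chosen only for illustration.

```python
from collections import deque


def maintain_tree(root, search, select_features, track_topic):
    """Visit every category node below the root and refresh its relevant documents."""
    queue = deque(root["children"])
    while queue:                                   # Step 325: any nodes left?
        node = queue.popleft()                     # Step 330: breadth-first here; any order works
        if not node["feature_docs"]:               # Steps 310-315 are skipped when features exist
            hits = search(node["keywords"], node["sources"])
            node["feature_docs"] = select_features(hits)
        # Step 320: topic tracking adds documents relevant to the node
        node["relevant_docs"] = track_topic(node["feature_docs"], node["sources"])
        queue.extend(node["children"])
    return root


def make_node(label, keywords, feature_docs=(), children=()):
    """Hypothetical helper that builds a node dictionary."""
    return {"label": label, "keywords": list(keywords), "sources": ["corpus"],
            "feature_docs": list(feature_docs), "relevant_docs": [],
            "children": list(children)}


server = make_node("Server", ["server"], feature_docs=["s1"])
pc = make_node("PC", ["pc"])
root = make_node("root", [], children=[pc, server])

maintain_tree(
    root,
    search=lambda kws, srcs: [f"{k}-hit" for k in kws],        # toy search
    select_features=lambda hits: hits[:3],                     # toy selection
    track_topic=lambda feats, srcs: [f"related-to-{f}" for f in feats],
)
print(pc["feature_docs"], server["relevant_docs"])
```

Since no node's processing depends on its parent's result, swapping `popleft()` for `pop()` would give a depth-first order without changing the outcome.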
- FIG. 4 shows the detailed steps of performing topic tracking for a category node to retrieve relevant documents in the method for maintaining a personalized category tree according to one embodiment of the present invention.
- At least one keyword is extracted from the feature documents of the category node. Specifically, this can be done using, for example, the tf (term frequency) method or the tf×idf (term frequency × inverse document frequency) method.
- the tf method ranks and calculates the weight according to the number of times each keyword appears in the document.
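As an illustration of the tf method, the sketch below counts word occurrences in a feature document and returns the most frequent ones. The function name `top_keywords_tf`, the tokenization rule and the tiny stop-word list are assumptions made for this example; the source does not specify them.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})


def top_keywords_tf(text, k=5):
    """Rank words by how often they appear in the text and return the top k."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


feature_doc = "The server is a rack server. The server and the PC share a PC bus."
print(top_keywords_tf(feature_doc, k=2))
```

Here "server" appears three times and "pc" twice, so those two words are returned in that order.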
- At Step 410, a document is selected from the information source specified for the category node.
- At Step 415, the at least one keyword extracted from the feature documents and the keywords included in the category node are used to perform document length normalization for the document in the information source of the category node.
- length normalization is performed on the documents in the information source of the category node by using the keywords extracted from the feature documents as well as the keywords specified for the category node, in order to solve the above problem.
- FIG. 5 is a diagram for illustrating document length normalization for a document.
- document length normalization for the document treats each of the keywords as a seed.
- the surrounding texts that include the seed from the document are extracted, and here, the basic unit of the surrounding texts extracted is a paragraph that includes the seed in the document.
- the extracted surrounding texts are combined as the length normalized structure of the document. Thus the parts in the text that are irrelevant to the desired topic are excluded.
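The seed-based length normalization described above can be sketched in a few lines. This is a simplified illustration: the function name `length_normalize` is hypothetical, paragraphs are assumed to be separated by blank lines, and seeds are matched as case-insensitive substrings (so a short seed may also match inside a longer word).

```python
def length_normalize(document, seeds):
    """Keep only the paragraphs that contain at least one seed keyword."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    lowered = [s.lower() for s in seeds]
    kept = [p for p in paragraphs if any(s in p.lower() for s in lowered)]
    return "\n\n".join(kept)


doc = (
    "IBM announced a new ThinkPad PC today.\n\n"
    "The weather in Armonk was sunny.\n\n"
    "The PC ships with a faster processor."
)
normalized = length_normalize(doc, ["PC"])
print(normalized)
```

The middle paragraph contains no seed and is dropped, so the normalized document holds only text near the topic keywords.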
- At Step 420, the degree of topic relevance between the length normalized document and the length normalized feature documents in the category node is calculated.
- $d_1$ represents the first document
- $d_2$ represents the second document
- $t_w^i$ is the adjusted term frequency of word $w$ in document $i$
- $t_w^i = \dfrac{\bar{t}_w^i}{\alpha + \bar{t}_w^i}$
- $\bar{t}_w^i$ is the term frequency of word $w$ in document $i$
- $\alpha$ is an adjustment coefficient for adjusting the difference between the maximum and the minimum value of term frequency
- $\Omega$ is the feature document set included in the node
- $\omega(w, \Omega) = \mathrm{idf}_0(w) + \Delta\omega(w, \Omega)$
- $\Delta\omega(w, \Omega) = \lambda_0 \cdot \dfrac{2\, n_{w,\Omega}}{n_w + n_\Omega}$, wherein $n_w$ is the total number of the documents that include word $w$, $n_\Omega$ is the total number of the feature documents included in the category node, and $n_{w,\Omega}$ is the total number of documents that include word $w$ in the document set $\Omega$; $\lambda_0$ is an adjustable proportional coefficient for adjusting the degree of importance of the item $\Delta\omega(w, \Omega)$.
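The quantities defined above can be tried out numerically. The sketch below implements the adjusted term frequency and the feature-set-dependent word weight as they are defined in the text. How these are combined into a single relevance score is not reproduced in the text here, so the `relevance` function assumes a sum over the words shared by the two documents of the product of both adjusted term frequencies and the word weight; that combination, all function names and all default parameter values are assumptions for illustration only.

```python
from collections import Counter


def adjusted_tf(raw_tf, alpha=1.0):
    """Dampen a raw term frequency: raw / (alpha + raw), always below 1."""
    return raw_tf / (alpha + raw_tf)


def word_weight(word, idf0, n_w, feature_docs, lambda0=1.0):
    """Base idf plus a boost for words that occur in many feature documents.

    n_w is the number of documents (overall) that include the word;
    feature_docs is a list of token lists for the node's feature documents.
    """
    n_feat = len(feature_docs)
    n_w_feat = sum(1 for doc in feature_docs if word in doc)
    return idf0 + lambda0 * 2 * n_w_feat / (n_w + n_feat)


def relevance(doc1, doc2, weight_of, alpha=1.0):
    """ASSUMED combination: sum over shared words of tf1 * tf2 * weight."""
    tf1, tf2 = Counter(doc1), Counter(doc2)
    return sum(
        adjusted_tf(tf1[w], alpha) * adjusted_tf(tf2[w], alpha) * weight_of(w)
        for w in tf1.keys() & tf2.keys()
    )


features = [["ibm", "pc"], ["pc", "server"]]
w_pc = word_weight("pc", idf0=1.0, n_w=3, feature_docs=features)
score = relevance(["pc", "pc", "ibm"], ["pc", "dell"], weight_of=lambda w: 1.0)
print(round(w_pc, 3), round(score, 3))
```

In the example, "pc" occurs in both feature documents, so its weight rises from the base value 1.0 by 2·2/(3+2) = 0.8. The thresholds mentioned in the text are percentages, which suggests a normalized score; this sketch does not normalize.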
- At Step 425, it is determined whether the degree of topic relevance between the document and the feature documents in the category node is greater than a first specified threshold.
- the first specified threshold may be, for example, 40%. If the determination is “Yes”, the method proceeds to Step 430, adding the document to the node as a relevant document; otherwise, the method proceeds to Step 445.
- Step 435 is performed after Step 430, determining whether the degree of topic relevance between the document and the feature documents in the category node is greater than a second specified threshold, which is greater than the first specified threshold and may be, for example, 60%. If the determination is “Yes”, the method proceeds to Step 440, adding the document as a feature document of the category node; otherwise, the method proceeds to Step 445.
- At Step 445, it is determined whether all the documents in the information source of the category node have been processed. If there are still documents to be processed, the method proceeds to Step 450, selecting the next document in the information source and returning to Step 415 to process that document as described above; otherwise, the method ends at Step 455.
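The two-threshold decision of Steps 425 to 450 can be sketched as a single loop. This is a minimal illustration: `track_topic` and the `score` callable are hypothetical, the default thresholds are the example values from the text (40% and 60%), and the score is assumed to be a number between 0 and 1.

```python
def track_topic(candidates, feature_docs, score, t_relevant=0.40, t_feature=0.60):
    """Return (relevant documents, extended feature documents) for one node."""
    relevant = []
    features = list(feature_docs)
    for doc in candidates:                 # Steps 445/450: iterate over the source
        s = score(doc, features)           # Step 420: degree of topic relevance
        if s > t_relevant:                 # Step 425
            relevant.append(doc)           # Step 430
            if s > t_feature:              # Step 435
                features.append(doc)       # Step 440: grows the feature set
    return relevant, features


# Toy scores standing in for a real relevance computation.
toy_scores = {"a": 0.7, "b": 0.5, "c": 0.1}
relevant, features = track_topic(
    ["a", "b", "c"], ["f"], score=lambda doc, feats: toy_scores[doc]
)
print(relevant, features)
```

Because `features` is extended inside the loop, documents scored later are compared against a larger feature set, which is how the feature documents of a node can grow over time.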
- document length normalization may be skipped for the document to be processed, in which case Steps 405 and 415 can be omitted.
- the number of feature documents of a category node can be extended continuously, thereby automatically adjusting the topic tracking and gradually increasing the accuracy of document categorization as well.
- the method of the present embodiment can also be used as a complementary method when there are relatively few training samples in a category node in a traditional categorization method.
- FIG. 6 is a flowchart of the method for displaying documents by category with a personalized category tree according to one embodiment of the present invention.
- the personalized category tree may be, for example, a personalized category tree generated by the above described method for constructing a personalized category tree and maintained by the above described method for maintaining a personalized category tree.
- the personalized category tree includes at least one category node and each category node includes a label, keywords, feature documents and relevant documents belonging to the category node.
- At Step 605, a category node in the personalized category tree is selected.
- a display mode is selected, that is, the user selects the mode for displaying a document with an input device.
- the modes for displaying documents include: Common view, Lower view, Upper view and Limited view.
- the relevant documents in a selected category node are displayed to a user in “Common view”.
- in “Common view”, only the relevant documents belonging to the selected category node will be displayed; in “Lower view”, the relevant documents belonging to the selected category node and its child node(s) will be displayed, as shown in FIG. 7B; in “Upper view”, the relevant documents belonging to the selected category node and its parent node will be displayed, as shown in FIG. 7A; in “Limited view”, the relevant documents belonging to the child node(s) of the category node will be excluded, as shown in FIG. 7C.
- the above mentioned display modes can be used in combination to display the relevant documents.
- the documents can be displayed by category with strict semantics as in a traditional category tree.
- At Step 615, it is determined whether the user has selected “Lower view”. If “Yes”, Step 625 is performed, displaying the relevant documents belonging to the selected category node and its child node(s).
- At Step 620, it is determined whether the user has selected “Upper view”. If “Yes”, Step 630 is performed, displaying the relevant documents belonging to the selected category node and its parent node.
- At Step 635, it is determined whether the user has selected “Limited view”. If “Yes”, Step 640 is performed, excluding the relevant documents belonging to the child node(s) of the category node from the list of the displayed documents.
- The method ends at Step 645.
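The four display modes can be sketched as one function that assembles the document list for a selected node. This is an illustration under assumptions: the tree is a dictionary keyed by label, `documents_to_display` is a hypothetical name, "child node(s)" is read as direct children only, and duplicates are removed while preserving order.

```python
def documents_to_display(tree, label, mode="common"):
    """Return the documents shown for the selected node in the given view."""
    node = tree[label]
    docs = list(node["docs"])
    child_docs = [d for child in node["children"] for d in tree[child]["docs"]]
    if mode == "lower":                        # node plus its child node(s)
        docs = list(dict.fromkeys(docs + child_docs))
    elif mode == "upper" and node["parent"] is not None:   # node plus its parent
        docs = list(dict.fromkeys(docs + tree[node["parent"]]["docs"]))
    elif mode == "limited":                    # exclude the children's documents
        docs = [d for d in docs if d not in child_docs]
    return docs


tree = {
    "IBM": {"parent": None, "children": ["PC"],
            "docs": ["ibm-overview", "thinkpad-review"]},
    "PC": {"parent": "IBM", "children": [],
           "docs": ["thinkpad-review", "dell-pc-news"]},
}
for mode in ("common", "lower", "limited"):
    print(mode, documents_to_display(tree, "IBM", mode))
print("upper", documents_to_display(tree, "PC", "upper"))
```

In this toy tree the "PC" node holds a DELL document that does not fit under "IBM"; the views change only what is listed, not where documents are stored.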
- the above steps can be performed repeatedly, thereby allowing the user to continually select category nodes to display documents by category.
- the abstract information of the selected documents in that list can also be displayed.
- the document list also displays the documents in the order of the degree of relevance between the relevant documents and the feature documents in the category node.
- the method of displaying documents by category with a personalized category tree of the present embodiment may use the above described personalized category tree to display the relevant documents by category. By using the multiple display modes provided in the present embodiment, the relevant documents can be organized in multiple ways for display; further, the inconsistency in the personalized category tree can also be remedied.
- FIG. 8 is a block diagram of a personalized categorization system according to an embodiment of the present invention.
- the personalized categorization system 800 of the present embodiment comprises: a category tree editor 801 , a category node editor 802 , a crawler 803 , a personalized categorizer 804 , a category display means 806 and a category tree storage means 807 .
- the category tree editor 801 is used to create and modify a personalized category tree, for example by adding a category node, deleting a category node or modifying the tree structure.
- the category node editor 802 is used to configure the category nodes in the personalized category tree, for example by defining a label, keywords, feature documents and an information source for a node.
- the category node editor can have a node inherit the settings of its parent node by default.
- the crawler 803 is used to obtain documents from specified information sources.
- the crawler 803 may be a network crawler known in prior art.
- the crawler 803 can get documents from the information source specified for each category node.
- the personalized categorizer 804 is used to categorize the documents obtained by the crawler 803 into the personalized category tree. According to the present embodiment, the personalized categorizer 804 further comprises: a keyword extraction unit 8042 , a document length normalization unit 8044 and a relevance calculation unit 8046 .
- the keyword extraction unit 8042 is used to extract keywords from the specified feature documents.
- the document length normalization unit 8044 is used to perform length normalization on the documents based on the keywords.
- the relevance calculation unit 8046 is used to calculate the degree of topic relevance between the documents processed and the set of feature documents, for example, by using the above described Okapi algorithm.
- the personalized categorizer 804 can determine, based on the degree of topic relevance, whether a document should be categorized into the node; it can also determine, based on that degree of topic relevance, whether a relevant document should be added as a feature document for the node.
- the category display means 806 is used to display the relevant documents by category by utilizing the personalized category tree.
- the category display means 806 can display the relevant documents in the various display modes described above.
- the category tree storage means 807 is used to store the personalized category tree, including: for example, the attribute information in each category node and its relevant documents, feature documents, etc.
- the personalized categorization system of the present invention and its components may be implemented in the form of hardware and software, and may be combined with other devices as needed. For example, the system may be implemented on various devices with information processing capabilities, such as a personal computer, a server, a notebook computer, a handheld computer or a PDA, and its components can be physically separated from and operationally interconnected to each other to function.
- the present invention can be realized in hardware, software, or a combination of hardware and software.
- a visualization tool according to the present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system—or other apparatus adapted for carrying out the methods and/or functions described herein—is suitable.
- a typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
- the present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
- Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
- the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above.
- the computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention.
- the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above.
- the computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention.
- the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention.
Abstract
Description
- The present invention relates to data processing technologies, more specifically, to the technology of constructing a personalized category tree and categorizing documents by utilizing the personalized category tree.
- Both enterprises and individuals face the problem of categorizing and storing the information documents they own. Especially for enterprises that own a great many information documents, and for individuals who need to process various documents, storing these documents in an orderly manner according to their categories clearly improves working efficiency. Many statistical categorization methods have been successfully applied to real-world document categorization, such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree, Naive Bayesian, etc. With these statistical methods, the precision and recall of document categorization can exceed 85%.
- With traditional document categorization technologies, before categorizing documents, a category tree is defined by a domain expert, and each category node in the category tree is defined with a training set of manually labeled documents. A corresponding categorizer is then constructed by utilizing the set of training documents. And finally, the documents to be categorized are automatically categorized with the categorizer. However, the precision of the traditional categorization method depends on the number and quality of training samples available in the training set.
- In the article “A re-examination of text categorization methods”, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99, pp 42-49), 1999, by Yiming Yang and Xin Liu, five statistical categorization methods were tested: SVM (Support Vector Machine), KNN (K-Nearest Neighbor), LLSF (Linear Least-Squares Fit), NN (Neural Network) and NB (Naive Bayesian). As recorded in the article, the tests with Reuters-21578 showed that for categories containing more training samples (more than 300 training samples), the precision and recall of the above methods are good, while for categories containing fewer training samples (fewer than 10 training samples), the precision and recall of the above methods are quite poor.
- In practice, the distribution of training samples among the categories of a category tree is often uneven, with some category nodes having only a few training samples. According to the statistics in the article, in the ApteMod version the most common category is “earn”, with 2,877 training documents, but 82% of the categories have fewer than 100 instances, and 33% of the categories have fewer than 10 instances. As recorded in the article, the test results showed that the performance of the above methods is a function of the training-set category frequency. For categories with fewer than 10 training documents, the macro-averaged F measure is below 0.2, while for categories with a training-set frequency of more than 2000, the macro-averaged F measure can reach 0.9 or more. From this we can see that, with a small training set, statistical methods do not work well.
- Furthermore, all the above algorithms are based on a pre-defined and well-structured category tree, in which each category has been manually configured with tens or hundreds of training samples. However, regardless of the sophistication of the pre-defined category tree, it is highly unlikely that any particular category tree defined by an expert can fully satisfy the degree of detail required by a user. In most cases, an ordinary user would treat a category tree like the file folder hierarchy on his hard disk, and hope to be able to manage the category tree in the same customized and personalized manner as a file folder. Therefore, a general application system should allow a user to arbitrarily define his personalized category tree, and in such a category tree, the user should also be allowed to introduce inconsistency in semantics. For example, at first, the user defines a sub-tree:
and wants to put documents related to IBM products into this sub-tree, i.e. put documents related to IBM PC into the category “PC” and documents related to IBM Server into the category “Server”. But, with the passage of time, the user may want to collect some documents about DELL PC into the category “PC”. This operation, however, introduces semantic inconsistency into the personalized category tree. Traditional categorization methods cannot place the semantically inconsistent documents about DELL PC into the category “PC”, and thus cannot realize such a personalized category tree.
- Therefore, a user may wish to create, arbitrarily, a personalized category tree similar to his file folder hierarchy, and to map freely onto it a semantic structure that meets his needs, without being limited by traditional semantic consistency. At the same time, the user may wish to avoid manually specifying a large number of training samples, which is lengthy and time- and energy-consuming, thereby realizing personalized document categorization that satisfies personal needs.
- Therefore, in order to solve the above mentioned problems in the prior art, the present invention provides methods, systems and apparatus for constructing and maintaining a personalized category tree, and for displaying documents by category using the personalized category tree and a personalized categorization system. This enables a user to perform personalized document categorization by defining a personalized category tree that satisfies his personal needs, without having to manually label a large set of training documents or to consider the problem of semantic inconsistency.
- According to one aspect of the present invention, there is provided a method for constructing a personalized category tree, the personalized category tree is a category tree that includes at least one category node, the step of creating independently each of the at least one category node comprising: defining a label for the category node; and specifying at least one keyword for the category node. By default, the label for the category is a keyword for the category.
- According to another aspect of the present invention, there is provided a method for constructing a personalized category tree, the personalized category tree is a tree that includes at least one category node, the step of creating each of the at least one category node comprising: searching for documents using at least one keyword; selecting at least one document from the search results; defining a label for the category node; specifying the keyword used in the searching step as the keyword for the category node; and specifying the selected at least one document as the feature document for the category node.
- According to another aspect of the present invention, there are provided methods for maintaining a personalized category tree.
- According to another aspect of the present invention, there is provided a method for displaying documents by category using a personalized category tree.
- According to another aspect of the present invention, there is provided a personalized categorization system, comprising: a category tree editor for creating and modifying a personalized category tree.
- The features, advantages and purposes of the present invention will be better understood from the following description of the detailed implementations of the present invention taken in conjunction with the accompanying drawings, in which:
- FIG. 1 is a flowchart of a method for constructing a personalized category tree according to one embodiment of the present invention;
- FIG. 2 is a flowchart of a method for constructing each node in a personalized category tree according to another embodiment of the present invention;
- FIG. 3 is a flowchart of a method for maintaining a personalized category tree according to one embodiment of the present invention;
- FIG. 4 is a flowchart illustrating detailed steps of performing topic tracking for a category node to retrieve relevant documents in a method for maintaining a personalized category tree according to one embodiment of the present invention;
- FIG. 5 is a diagram for illustrating document length normalization for a document;
- FIG. 6 is a flowchart of a method for displaying documents by category using a personalized category tree according to one embodiment of the present invention;
- FIGS. 7A-7C show the results of displaying documents under different display modes in a method of displaying documents by category using a personalized category tree;
- FIG. 8 is a block diagram of a personalized categorization system according to one embodiment of the present invention.
- The present invention provides methods, systems and apparatus for constructing and maintaining a personalized category tree, and for displaying documents by category using the personalized category tree and a personalized categorization system. This enables a user to perform personalized document categorization by defining a personalized category tree that satisfies his personal needs, without having to manually label a large set of training documents or to consider the problem of semantic inconsistency.
- In an example embodiment of the present invention, there is provided a method for constructing a personalized category tree. The personalized category tree is a category tree that includes at least one category node. The step of creating independently each of the at least one category node comprising: defining a label for the category node; and specifying at least one keyword for the category node. By default, the label for the category is a keyword for the category.
- In an example embodiment of the present invention, there is provided a method for constructing a personalized category tree, the personalized category tree is a tree that includes at least one category node, the step of creating each of the at least one category node comprising: searching for documents using at least one keyword; selecting at least one document from the search results; defining a label for the category node; specifying the keyword used in the searching step as the keyword for the category node; and specifying the selected at least one document as the feature document for the category node.
- In another example embodiment of the present invention, there is provided a method for maintaining a personalized category tree, the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label and at least one keyword, the method comprising: for each of the at least one category node, searching for documents using the at least one keyword included in the category node; selecting at least one document from the search results as the feature document for the category node; and performing topic tracking to add documents relevant to the category node based on the at least one feature document.
- In another example embodiment of the present invention, there is provided a method for maintaining a personalized category tree, the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label, at least one keyword and at least one feature document, the method comprising: for each of the at least one category node, performing topic tracking to add documents relevant to the category node based on the at least one feature document.
- In another example embodiment of the present invention, there is provided a method for displaying documents by category using personalized category tree, the personalized category tree is a tree that includes at least one category node, each of the at least one category node includes a label and relevant documents belonging to the category node, the method comprising: selecting a category node in the personalized category tree; and displaying the relevant documents belonging to that category node.
- In another example embodiment of the present invention, there is provided a personalized categorization system. The system includes: a category tree editor for creating and modifying a personalized category tree, wherein the personalized category tree is a category tree that includes at least one category node, each of the at least one category node includes a label and at least one keyword; and a category node editor for configuring a category node in the personalized category tree.
- Next, various preferred embodiments of the present invention will be described in detail in conjunction with the accompanying drawings. As mentioned above, in traditional document categorization methods, a category tree is constructed by a domain expert, and a large set of training documents is selected for each category in the category tree; the training sets are then used to perform semantic identification on new documents in order to assign them to the respective categories in the category tree. Such a category tree complies with the semantic consistency requirement and does not allow a user to introduce semantic inconsistencies. Further, such an expert-defined category tree makes it very difficult for a user to create his own personalized categories, because the user has to select a large number of training documents for each newly defined category. This is difficult for an ordinary user who is not a language expert, and the accuracy of categorization cannot be assured. Therefore, if a user can construct a personalized category tree according to his own needs, and utilize that tree to automatically categorize and manage documents with no or few training samples, the user will be spared a great deal of burdensome document management work.
- A traditional category tree organizes documents and the relationships between them in the form of a tree in which a parent node and a child node form a containment relationship. There is a strict semantic qualification relationship between them: they depend on each other during training, and the qualification of the child node includes the semantic qualification of the parent node, that is, the parent node includes all the documents belonging to the child node. This ensures semantic consistency in the category tree. In the present invention, by contrast, the parent node and the child node are separately and independently qualified, their semantics are relatively independent, and a user's needs for browsing and searching for documents are satisfied through different views and through document organization, customization and filtering. That is, in a category tree according to the present invention, while the path organization of a parent node and a child node suggests a parent-child relationship, the qualifications and contents of the parent node and the child node are independent of each other.
- Method for Constructing a Personalized Category Tree
- According to one aspect of the present invention, there is provided a method for constructing a personalized category tree. Below, the method will be described in detail in conjunction with the accompanying drawings.
- FIG. 1 is a flowchart of a method for constructing a personalized category tree according to one embodiment of the present invention. A personalized category tree of the present invention allows semantic inconsistency, so when constructing a personalized category tree there is no need to consider the consistency problem between a child node and a parent node, and each category node is created with the same steps.
- As shown in FIG. 1, the method for constructing a personalized category tree of the present embodiment starts at Step 105, where initialization is performed to create a category tree that contains only the root node.
- Next at Step 110, a category node is added into the personalized category tree.
- Next at Step 115, a label is defined for that new category node. The label should be able to represent the category features of the node, similar to the name of a file folder.
- Next at Step 120, at least one keyword is specified for that new category node. Preferably, the label for the category may serve as a keyword for the category. The keywords for a category node are used to describe the topic content of that category node, and as described later, the keywords can be utilized to find documents relevant to the category node and the feature documents for the category node.
- Next at Step 125, an information source is specified for that new category node. The information source is used to indicate the source of relevant documents of the category node, and may be, for example, a URL, a path, an IP address or a computer name, etc. Notably, a category node can be specified with either one information source or multiple information sources, and multiple category nodes can share one information source. When a new category node has not been specified with an information source, that category node inherits the information source of its parent node by default.
- Next at Step 130, at least one feature document is specified for the new category node. In the present invention, a feature document is a document that is highly relevant to the category node and can best represent the content of the category, equivalent to the training samples in the traditional categorization methods. The difference from the traditional methods is that the number of feature documents in the present invention can be much smaller than the number of training samples in the traditional categorization methods (for example, a user may only need to select 3 to 5 samples), thereby saving the time the user spends specifying the feature documents.
- Next at Step 135, it is determined whether the task of constructing the personalized category tree has been accomplished. If a new category node still needs to be added, the method returns to Step 110 and repeats the above described Steps 110 to 130 to add the new node to the personalized category tree.
- If the determination in Step 135 is that the constructing task has been accomplished, then the method ends at Step 140.
- In employing the method of the present embodiment for constructing a personalized category tree, since there is no need to consider the consistency between a parent node and a child node, each category node can be created simply and equally, so it can be done conveniently even by an ordinary user who is not a language expert. In addition, according to the present implementation, a user need not specify a great number of training samples, thereby reducing the workload.
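The node structure implied by Steps 105 to 135 can be sketched as a small data type. This is a minimal illustration, not the patent's implementation: the class name `CategoryNode`, its field names, and the `file:///docs` source are hypothetical. Two behaviours are modelled on the text: the label serves as the keyword when no keyword is given (an assumption about how the "by default" wording should be read), and a node without its own information source falls back to the nearest ancestor that has one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CategoryNode:
    """One node of a personalized category tree (all names are illustrative)."""
    label: str                                              # Step 115
    keywords: List[str] = field(default_factory=list)       # Step 120
    info_sources: List[str] = field(default_factory=list)   # Step 125
    feature_docs: List[str] = field(default_factory=list)   # Step 130
    relevant_docs: List[str] = field(default_factory=list)
    parent: Optional["CategoryNode"] = field(default=None, repr=False, compare=False)
    children: List["CategoryNode"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed reading of Step 120: with no explicit keyword, use the label.
        if not self.keywords:
            self.keywords = [self.label]

    def effective_sources(self) -> List[str]:
        # Step 125 default: inherit the information source of the parent node.
        node: Optional[CategoryNode] = self
        while node is not None:
            if node.info_sources:
                return node.info_sources
            node = node.parent
        return []

    def add_child(self, label: str, **attrs) -> "CategoryNode":
        # Step 110: every node is created the same way, independent of its parent.
        child = CategoryNode(label, parent=self, **attrs)
        self.children.append(child)
        return child


# Step 105: a tree that contains only the root node.
root = CategoryNode("root", info_sources=["file:///docs"])
ibm = root.add_child("IBM")
pc = ibm.add_child("PC", feature_docs=["ibm_pc_review.txt"])
```

Nothing in `add_child` checks the child against its parent, which mirrors the point made above: node definitions are independent, so a "PC" node under "IBM" may later hold DELL PC documents.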
- Further, according to a variation of the present embodiment, Step 125 and/or Step 130 can be omitted, that is, an information source and feature documents are not specified for each node. In this case, the same information source can be specified for the whole category tree, or a child node can use the information source of its parent node, and the feature documents can be selected during the process of maintaining the personalized category tree as described later. Alternatively, the information source is not specified, and the information sources that the user can access or is authorized to access are regarded as the information source for the node; the feature documents may also not be selected, in which case the documents that the user frequently accesses are used as the feature documents, or only the keywords are used to perform categorization. Thus, the user's workload in constructing a personalized category tree can be further reduced.
- FIG. 2 is a flowchart of a method for constructing each node in a personalized category tree according to another embodiment of the present invention. The method for constructing a personalized category tree in the present embodiment differs from the method in the above embodiment in the process of creating each category node. In the present embodiment, each category node in a personalized category tree is created at the same time as a user retrieves documents.
- As shown in FIG. 2, first at Step 205, the user searches for documents from the information source by utilizing one or more keywords. Specifically, the user can utilize the keyword(s) to find documents that include the keyword(s) in a local or network path, or, for example, the user can enter keywords in a search engine to query relevant documents.
- Next at Step 210, at least one document is selected from the search results of the previous step. Specifically, the user can select one or more desired documents by browsing the abstract or the body text of each document found.
- Next at Step 215, a category node is added into the personalized category tree. The user can add the category node at any desired location in the personalized category tree.
- Next at Step 220, a label is defined for the category node to label the category.
- Next at Step 225, the keywords used for the search in Step 205 are specified as the keywords for that category node.
- Next at Step 230, the documents selected in Step 210 are specified as the feature documents for that category node.
- Next at Step 235, an information source is specified for that category node. The information source may be a path used to find the documents in the previous Step 205, or may be the URL or path of the documents found in the case where the user has performed the query through a search engine. Naturally, multiple information sources can also be specified for the category node, for example, when the documents found are from different locations.
- The method of constructing a personalized category tree of the present embodiment has been described above in conjunction with the accompanying drawings.
- From the above description it can be seen that, since each category node in a category tree is created separately according to the respective needs during construction of the category tree, the category nodes are equal to and independent of each other, so the personalized category tree constructed with the above embodiments does not have the problem of semantic constraints between the category nodes, thereby allowing semantic inconsistencies. In addition, because no or only a few feature documents need to be specified for each category node, rather than a large set of training samples specified by a language expert as with a traditional category tree, the process of constructing a category tree is greatly simplified, saving a great deal of manpower and precious time.
- Further, because information sources can be specified for each category node respectively and one category node can have multiple information sources in the personalized category tree, it is even more convenient for a user to manage documents using a personalized category tree.
- In addition, in the previous embodiment, the work of adding a new category node into a personalized category tree can be accomplished at the same time as a user retrieves documents, thereby combining the retrieval and the creation of the personalized category tree, further simplifying the user's work.
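The search-driven node creation of FIG. 2 (Steps 205 to 235) can be sketched as follows. This is an illustration under stated assumptions: nodes are plain dictionaries, the corpus is an in-memory mapping from document id to text, `search` is a hypothetical stand-in that requires every keyword to occur as a substring, and the `choose` callback plays the role of the user picking documents from the results.

```python
from typing import Callable, Dict, List


def search(corpus: Dict[str, str], keywords: List[str]) -> List[str]:
    """Step 205 stand-in: ids of documents whose text contains every keyword."""
    lowered = [k.lower() for k in keywords]
    return [doc_id for doc_id, text in corpus.items()
            if all(k in text.lower() for k in lowered)]


def create_node_from_search(parent: dict, label: str, keywords: List[str],
                            corpus: Dict[str, str], source: str,
                            choose: Callable[[List[str]], List[str]]) -> dict:
    hits = search(corpus, keywords)          # Step 205: search with the keywords
    node = {
        "label": label,                      # Step 220: label for the node
        "keywords": list(keywords),          # Step 225: reuse the search keywords
        "feature_docs": choose(hits),        # Steps 210 and 230: selected documents
        "info_sources": [source],            # Step 235: where the search was run
        "children": [],
    }
    parent["children"].append(node)          # Step 215: attach at the chosen place
    return node


corpus = {
    "a.txt": "IBM PC desktop review",
    "b.txt": "IBM Server rack guide",
    "c.txt": "DELL PC laptop review",
}
root = {"label": "root", "children": []}
pc = create_node_from_search(root, "PC", ["PC"], corpus, "file:///docs",
                             choose=lambda hits: hits[:1])
```

The search for "PC" returns both the IBM and the DELL document, and either may become a feature document, since the node is not constrained by its position in the tree.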
- Method for Maintaining a Personalized Category Tree
- Under the same inventive concept, according to another aspect of the present invention, there is provided a method for maintaining a personalized category tree, which may be generated with, for example, the above described method for constructing a personalized category tree. The method will be described below in conjunction with the accompanying drawings.
- FIG. 3 is a flowchart of a method for maintaining a personalized category tree according to one embodiment of the present invention. It should be pointed out that the present embodiment applies to a personalized category tree that is generated by the above described method for constructing a personalized category tree, and that includes at least one category node, each category node including a label, at least one keyword and an information source used to indicate the source of relevant documents of the category node.
- As shown in FIG. 3, first at Step 305, a category node is selected from the personalized category tree. Because a parent node and a child node in a personalized category tree of the present invention are relatively independent and have no strict semantic constraint relationship, the category nodes can be selected for processing one by one in any order, such as depth-first, breadth-first or another order, when maintaining the personalized category tree.
- Next at Step 310, for the selected category node, the keywords are used to search for relevant documents from the information source specified for the category node.
- Next at Step 315, at least one document is selected from the search results in the previous step as the feature document for the category node.
- Next at Step 320, topic tracking is performed on the documents in the information source specified for the category node according to the at least one feature document, to add relevant documents of the category node.
- There are various methods for topic tracking in the prior art, such as those proposed in the article “Unsupervised and Supervised Clustering for Topic Tracking” (NAACL-2001) by Martin Franz, et al., and those in the article “NIST's 1998 Topic Detection and Tracking Evaluation (TDT2)” (issued in Proceedings of the DARPA Broadcast News Workshop, 1999) by J. G. Fiscus, et al., all of which are incorporated herein by reference in their entirety. The tracking method will be described in detail below in conjunction with the accompanying drawings.
- Next at Step 325, it is determined whether the maintenance work for the personalized category tree has been accomplished. If any other node in the personalized category tree needs to be maintained, the method proceeds to Step 330. At Step 330, the next category node that needs to be maintained is selected, and the method returns to Step 310 and repeats the above described Steps 310 to 325.
- If the determination at Step 325 is that all nodes have been processed, then the method ends at Step 335.
- In addition, according to a variation of the present embodiment, feature documents have already been specified for the category nodes in the personalized category tree, so during the process of maintaining a node, Step 310 and Step 315 can be omitted, and the topic tracking can be performed directly according to the specified feature documents. In addition, according to another variation of the present embodiment, an information source has not been specified for each category node in the personalized category tree, so during the process of maintaining a node, document finding and/or topic tracking can be performed on documents in a common information source.
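The maintenance loop of FIG. 3 can be sketched as a traversal that visits every node once. This is only an illustration: the dictionary node layout and the three callbacks (`search`, `select_features`, `track`) are hypothetical, breadth-first order is chosen arbitrarily because the text permits any order, and the word-overlap tracker in the example is a crude placeholder rather than a real topic tracking method.

```python
from collections import deque
from typing import Any, Callable, Dict, List

Node = Dict[str, Any]  # hypothetical node layout used only in this sketch


def make_node(label: str, keywords: List[str], sources: List[str],
              children: List[Node] = ()) -> Node:
    return {"label": label, "keywords": list(keywords), "info_sources": list(sources),
            "feature_docs": [], "relevant_docs": [], "children": list(children)}


def maintain_tree(root: Node,
                  search: Callable[[List[str], List[str]], List[str]],
                  select_features: Callable[[List[str]], List[str]],
                  track: Callable[[List[str], List[str]], List[str]]) -> None:
    queue = deque([root])                      # Step 305: pick a first node
    while queue:                               # Step 325: any node left to maintain?
        node = queue.popleft()                 # Step 330: next node (breadth-first)
        if not node["feature_docs"]:           # variation: skip 310/315 if already set
            hits = search(node["keywords"], node["info_sources"])       # Step 310
            node["feature_docs"] = select_features(hits)                # Step 315
        for doc in track(node["feature_docs"], node["info_sources"]):   # Step 320
            if doc not in node["relevant_docs"]:
                node["relevant_docs"].append(doc)
        queue.extend(node["children"])


# Toy callbacks over an in-memory corpus, for demonstration only.
DOCS = {"d1": "ibm pc", "d2": "dell pc", "d3": "ibm server"}


def search(keywords, sources):
    return [d for d, text in DOCS.items() if any(k in text.split() for k in keywords)]


def select_features(hits):
    return hits[:1]


def track(feature_docs, sources):
    words = set()
    for f in feature_docs:
        words |= set(DOCS[f].split())
    return [d for d, text in DOCS.items() if words & set(text.split())]


pc = make_node("PC", ["pc"], ["docs"])
root = make_node("root", ["ibm"], ["docs"], children=[pc])
maintain_tree(root, search, select_features, track)
```

Because each node is processed with only its own keywords, feature documents and sources, the order of traversal does not affect the result for any node.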
- FIG. 4 shows the detailed steps of performing topic tracking for a category node to retrieve relevant documents in the method for maintaining a personalized category tree according to one embodiment of the present invention.
- As shown in FIG. 4, first at Step 405, at least one keyword is extracted from the feature documents of the category node. Specifically, this can be done using, for example, the tf (term frequency) method or the tf-idf (term frequency-inverse document frequency) method, etc. The tf method ranks the keywords and calculates their weights according to the number of times each keyword appears in the document. The tf-idf method, by contrast, determines the weight of each keyword by calculating tf×idf, wherein tf is the frequency (number of times) the word appears in the document, and idf = all_sentences/term_sentences, wherein all_sentences is the number of all sentences in the document and term_sentences is the number of sentences that include the word. Then, one or more keywords with high weight are extracted according to the above calculation results.
- Next at Step 410, a document is selected from the information source specified for the category node.
- Next at Step 415, the at least one keyword extracted from the feature documents and the keywords included in the category node are used to perform document length normalization for the document in the information source of the category node.
- The structures and lengths of various kinds of documents differ, and sometimes a document includes content relevant to the desired topic in some parts and irrelevant content in others. In this case, if the degree of topic relevance between the document and the feature documents is calculated directly, the calculated degree of relevance is often very low, so that relevant documents which ought to be selected are missed. Therefore, in the present embodiment, length normalization is performed on the documents in the information source of the category node by using the keywords extracted from the feature documents as well as the keywords specified for the category node, in order to solve the above problem.
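The keyword weighting of Step 405 can be sketched directly from the definitions given there: tf is the raw count of a word in the document, and idf is the number of sentences in the document divided by the number of sentences containing the word. The sentence splitter, the tokenizer, the tie-breaking rule and the function names are assumptions made for this sketch; a practical version would likely also filter stop words, which the text does not discuss.

```python
import re
from collections import Counter
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(text: str, top_n: int = 3,
                     use_idf: bool = True) -> List[Tuple[str, float]]:
    """Rank words by tf, or by tf x idf with the sentence-based idf of Step 405."""
    sentences = [set(tokenize(s)) for s in split_sentences(text)]
    tf = Counter(tokenize(text))
    weights = {}
    for word, count in tf.items():
        if use_idf:
            term_sentences = sum(1 for s in sentences if word in s)
            idf = len(sentences) / term_sentences   # all_sentences / term_sentences
            weights[word] = count * idf
        else:
            weights[word] = float(count)
    # Highest weight first; ties are broken alphabetically (an arbitrary choice).
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


text = "The server rack holds one server. Cooling matters. Power matters too."
top = extract_keywords(text, top_n=1)
```

In this example "server" and "matters" both occur twice, but "server" is concentrated in one of the three sentences (weight 2 × 3/1) while "matters" is spread over two (weight 2 × 3/2), so the sentence-based idf ranks "server" first.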
-
FIG. 5 is a diagram for illustrating document length normalization for a document. As shown inFIG. 5 , specifically, the document length normalization for the document is to treat the each of the keywords as a seed. For each seed occurred in the document, the surrounding texts that include the seed from the document are extracted, and here, the basic unit of the surrounding texts extracted is a paragraph that includes the seed in the document. Then, the extracted surrounding texts are combined as the length normalized structure of the document. Thus the parts in the text that are irrelevant to the desired topic are excluded. - Next at
Step 420, the degree of topic relevance between the length-normalized document and the length-normalized feature documents in the category node is calculated. - Specifically, various methods can be used to calculate the degree of topic relevance, such as those described in the above-mentioned prior art papers. In the present embodiment, the Okapi formula is used to calculate the degree of topic relevance between a first document and a second document, that is
[equation not reproduced in the source text]
wherein $d_1$ represents the first document, $d_2$ represents the second document, and $t_w^i$ is the adjusted term frequency of word $w$ in document $i$,
[equation not reproduced in the source text]
wherein [symbol not reproduced] is the term frequency of word $w$ in document $i$; $\alpha$ is an adjusting coefficient for adjusting the difference between the maximum and the minimum value of term frequency; $\mu$ is the feature document set included in the node; $\lambda(w,\mu)=\mathrm{idf}_0(w)+\Delta\lambda(w,\mu)$, where $\mathrm{idf}_0(w)$ is the inverse document frequency of word $w$ and $\Delta\lambda(w,\mu)$ is mainly used to compare the degree of similarity between two document sets, the two document sets being $D_w$, the set of documents that include the word $w$, and $\mu$, the set of feature documents included in the category node.
[equation not reproduced in the source text]
wherein $n_w$ is the total number of documents that include word $w$, $n_\mu$ is the total number of feature documents included in the category node, and $n_{w\mu}$ is the total number of documents in the document set $\mu$ that include word $w$; $\lambda_0$ is an adjustable proportional coefficient for adjusting the degree of importance of the term $\Delta\lambda(w,\mu)$. - Next at
Step 425, it is determined whether the degree of topic relevance between the document and the feature documents in the category node is greater than a first specified threshold. The first specified threshold may be, for example, 40%. If the determination is “Yes”, then the method proceeds to Step 430, adding the document into the node as a relevant document; otherwise, the method proceeds to Step 445. - Step 435 is performed after
Step 430, determining if the degree of topic relevance between the document and the feature documents in the category node is greater than a second specified threshold, which is greater than the first specified threshold, and may be, for example, 60%. If the determination is “Yes”, then the method proceeds to Step 440, adding the document as a feature document of the category node; otherwise, the method proceeds to Step 445. - Then at
Step 445, it is determined whether all the documents in the information source of the category node have been processed. If there are still documents to be processed, then the method proceeds to Step 450, selecting the next document in the information source and returning to Step 415, repeating the above described process to process that document; otherwise, the method ends at Step 455. - Further, according to a variation of the present embodiment, document length normalization may not be performed for the document to be processed, therefore,
Step - In addition, according to another embodiment of the present invention, during the process of maintaining a personalized category tree, a determination is made as to whether the number of feature documents in a node is greater than a predetermined number, for example, 100, and if “Yes”, the maintenance may be performed using a traditional statistical categorization method.
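The length normalization of Step 415 and the relevance calculation of Step 420 can be sketched together. The paragraph-level extraction follows the description of FIG. 5. The scoring functions are assumptions: the equations are not reproduced in this text, so the sketch only combines the quantities the description names (an adjusted term frequency controlled by α, an inverse document frequency idf0, and an overlap term scaled by λ0 built from n_w, n_μ and n_wμ). The saturating form tf/(α+tf), the logarithmic idf0, the overlap ratio, the cosine-style normalization to the range 0 to 1, and all function names and constant values are hypothetical choices, not the patent's formulas.

```python
import math
import re
from collections import Counter

ALPHA = 0.5    # assumed value of the adjusting coefficient
LAMBDA0 = 1.0  # assumed value of the proportional coefficient


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def length_normalize(document, keywords):
    """Keep only paragraphs containing at least one seed (single-word keywords assumed)."""
    seeds = {k.lower() for k in keywords}
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    return "\n\n".join(p for p in paragraphs if seeds & set(tokenize(p)))


def term_weight(word, collection, feature_docs):
    """Assumed lambda(w, mu) = idf0(w) + lambda0 * overlap(D_w, mu)."""
    n_w = sum(1 for d in collection if word in d)       # documents containing w
    n_mu = len(feature_docs)                            # feature documents
    n_w_mu = sum(1 for d in feature_docs if word in d)  # feature documents containing w
    idf0 = math.log((len(collection) + 1) / (n_w + 1))
    overlap = 2 * n_w_mu / (n_w + n_mu) if n_w + n_mu else 0.0
    return idf0 + LAMBDA0 * overlap


def raw_score(tf1, tf2, collection, feature_docs):
    total = 0.0
    for word in tf1.keys() & tf2.keys():
        t1 = tf1[word] / (ALPHA + tf1[word])  # assumed adjusted term frequency
        t2 = tf2[word] / (ALPHA + tf2[word])
        total += t1 * t2 * term_weight(word, collection, feature_docs)
    return total


def relevance(doc1, doc2, collection_texts, feature_texts):
    """Score in [0, 1]; the normalization is an assumption so thresholds read as percentages."""
    tf1, tf2 = Counter(tokenize(doc1)), Counter(tokenize(doc2))
    collection = [set(tokenize(t)) for t in collection_texts]
    features = [set(tokenize(t)) for t in feature_texts]
    norm = math.sqrt(raw_score(tf1, tf1, collection, features)
                     * raw_score(tf2, tf2, collection, features))
    return raw_score(tf1, tf2, collection, features) / norm if norm else 0.0
```

In use, a candidate document would first be passed through `length_normalize` with the node's keywords, and the result would then be scored against each feature document. How scores against several feature documents are combined is left open here, since the text does not say.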
- From the above description it can be seen that the maintenance of a personalized category tree is realized by each category node obtaining relevant documents from the information source directly; therefore, there is no need to consider the problem of semantic constraints between category nodes. Moreover, categorization of documents can be performed without specifying any feature document or by specifying only a few feature documents.
- In addition, in the method of maintaining a personalized category tree of the present embodiment, while a node is being maintained, i.e. while documents are being categorized, the number of feature documents of a category node can be extended continuously, thereby automatically adjusting the topic tracking and gradually increasing the accuracy of document categorization as well.
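The maintenance loop of Steps 410 to 455, including the way the feature set grows, can be sketched as follows. The two thresholds use the example values from the text (40% and 60%). Everything else is an assumption made for illustration: the `CategoryNode` fields, the function names, and in particular `max_jaccard`, which is a simple word-overlap stand-in and not the Okapi formula of the embodiment.

```python
from dataclasses import dataclass, field

RELEVANT_THRESHOLD = 0.40  # first threshold (example value from the text)
FEATURE_THRESHOLD = 0.60   # second threshold (example value from the text)


@dataclass
class CategoryNode:
    label: str
    feature_docs: list = field(default_factory=list)
    relevant_docs: list = field(default_factory=list)


def tokens(text):
    return set(text.lower().split())


def max_jaccard(doc, feature_docs):
    """Stand-in relevance: best word-set overlap with any feature document."""
    words = tokens(doc)
    best = 0.0
    for feature in feature_docs:
        other = tokens(feature)
        union = words | other
        if union:
            best = max(best, len(words & other) / len(union))
    return best


def maintain_node(node, source_docs, relevance):
    for doc in source_docs:                        # Steps 410, 445, 450
        score = relevance(doc, node.feature_docs)  # Step 420
        if score > RELEVANT_THRESHOLD:             # Step 425
            node.relevant_docs.append(doc)         # Step 430
            if score > FEATURE_THRESHOLD:          # Step 435
                node.feature_docs.append(doc)      # Step 440
    return node


node = CategoryNode("trees", feature_docs=["category tree node"])
maintain_node(node, ["category tree node label", "node label", "weather forecast"], max_jaccard)
print(node.relevant_docs)
print(node.feature_docs)
```

In this example the second document only clears the first threshold because the first document was promoted to a feature document just before it was scored, which illustrates the topic-tracking effect described above.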
- Further, in the method for maintaining a personalized category tree, when the number of feature documents of a category node in the personalized category tree reaches a predefined number, a traditional categorization method can be used. The method of the present embodiment can therefore also serve as a complementary method when a category node in a traditional categorization method has relatively few training samples.
- Method for Displaying Documents by Category by Utilizing a Personalized Category Tree
- Under the same inventive concept, according to another aspect of the present invention, there is provided a method for displaying documents by category by utilizing a personalized category tree.
- The method will be described below in conjunction with the accompanying drawings.
-
FIG. 6 is a flowchart of the method for displaying documents by category with a personalized category tree according to one embodiment of the present invention. The personalized category tree may be, for example, a personalized category tree generated by the above described method for constructing a personalized category tree and maintained by the above described method for maintaining a personalized category tree. The personalized category tree includes at least one category node, and each category node includes a label, keywords, feature documents and relevant documents belonging to the category node. - As shown in
FIG. 6, first at Step 605, a category node in the personalized category tree is selected. - Next at
Step 610, a display mode is selected, that is, the user selects the mode for displaying a document with an input device. In the present embodiment, the modes of displaying a document include: Common view, Lower view, Upper view and Limited view. By default, the relevant documents in a selected category node are displayed to a user in “Common view”. In “Common view”, only the relevant documents belonging to the selected category node will be displayed; in “Lower view”, the relevant documents belonging to the selected category node and its child node(s) will be displayed, as shown in FIG. 7B; in “Upper view”, the relevant documents belonging to the selected category node and its parent node will be displayed, as shown in FIG. 7A; in “Limited view”, the relevant documents belonging to the child node(s) of the category node will be excluded, as shown in FIG. 7C. - It is noted that the above mentioned display modes can be used in combination to display the relevant documents. For example, when “Upper view” and “Limited view” are selected in combination, as shown in
FIG. 7C, the documents can be displayed by category with strict semantics as in a traditional category tree. - Specifically, at
Step 615, it is determined whether the user has selected “Lower view”. If “Yes”, then Step 625 is performed, displaying the relevant documents belonging to the selected category node and its child node(s). - Next at
Step 620, it is determined whether the user has selected “Upper view”. If “Yes”, then Step 630 is performed, displaying the relevant documents belonging to the selected category node and its parent node. - Next at
Step 635, it is determined whether the user has selected “Limited view”. If “Yes”, then Step 640 is performed, excluding the relevant documents belonging to the child node(s) of the category node from the list of the displayed documents. - Finally, the method ends at
Step 645. Naturally, the above steps can be performed repeatedly, thereby allowing the user to continually select category nodes to display documents by category. - Further, in the present embodiment, apart from displaying the list of documents that meet the criteria to a user, the abstract information of the selected documents in that list can also be displayed. At the same time, the document list also displays the documents in the order of the degree of relevance between the relevant documents and the feature documents in the category node.
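The combination of display modes in Steps 615 to 640 can be sketched as set operations over the relevant documents of a node, its parent and its children. This is a minimal illustration under stated assumptions: the `CategoryNode` class, the `documents_to_display` function and its boolean flags are hypothetical names, "child node(s)" is read as direct children only, and the result is sorted alphabetically rather than by degree of relevance to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CategoryNode:
    label: str
    relevant_docs: set = field(default_factory=set)
    parent: Optional["CategoryNode"] = field(default=None, repr=False)
    children: list = field(default_factory=list)

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child


def documents_to_display(node, lower=False, upper=False, limited=False):
    shown = set(node.relevant_docs)        # "Common view" baseline
    if lower:                              # Steps 615 and 625
        for child in node.children:
            shown |= child.relevant_docs
    if upper and node.parent is not None:  # Steps 620 and 630
        shown |= node.parent.relevant_docs
    if limited:                            # Steps 635 and 640
        for child in node.children:
            shown -= child.relevant_docs
    return sorted(shown)


root = CategoryNode("news", {"a", "b"})
sports = root.add_child(CategoryNode("sports", {"b", "c"}))
tennis = sports.add_child(CategoryNode("tennis", {"c", "d"}))
print(documents_to_display(sports, upper=True, limited=True))
```

Because each node collects its documents independently, the same document may sit in both a node and its child. Applying the "Limited view" exclusion last, as in the flowchart order, is what removes such overlaps when modes are combined.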
- From the above description it can be seen that the method of displaying documents by category with a personalized category tree of the present embodiment may use the above described personalized category tree to display the relevant documents by category. Moreover, by utilizing the multiple display modes provided in the present embodiment, the relevant documents can be organized in multiple ways for display; further, inconsistency in the personalized category tree can also be remedied.
- Personalized Categorization System
- Under the same inventive concept, according to another aspect of the present invention, there is provided a personalized categorization system. The system will be described below in conjunction with the accompanying drawings.
-
FIG. 8 is a block diagram of a personalized categorization system according to an embodiment of the present invention. As shown in FIG. 8, the personalized categorization system 800 of the present embodiment comprises: a category tree editor 801, a category node editor 802, a crawler 803, a personalized categorizer 804, a category display means 806 and a category tree storage means 807. - Wherein, the
category tree editor 801 is used to create and modify a personalized category tree, such as adding a category node, deleting a category node and modifying the tree structure, etc. - The
category node editor 802 is used to configure the category nodes in the personalized category tree, such as defining a label, keywords, feature documents and an information source, etc. for a node. When the user has not specified the keywords, feature documents and information source for a category node, the category node editor can, by default, have the node inherit the settings of its parent node. - The
crawler 803 is used to obtain documents from specified information sources. The crawler 803 may be a network crawler known in the prior art. When each category node in a personalized category tree has been assigned an information source, the crawler 803 can get documents from the information source specified for each category node. - The
personalized categorizer 804 is used to categorize the documents obtained by the crawler 803 into the personalized category tree. According to the present embodiment, the personalized categorizer 804 further comprises: a keyword extraction unit 8042, a document length normalization unit 8044 and a relevance calculation unit 8046. - Wherein, the
keyword extraction unit 8042 is used to extract keywords from the specified feature documents. The document length normalization unit 8044 is used to perform length normalization on the documents based on the keywords. The relevance calculation unit 8046 is used to calculate the degree of topic relevance between the documents processed and the set of feature documents, for example, by using the above described Okapi formula. Further, the personalized categorizer 804 can determine whether a document should be categorized into the node based on its degree of topic relevance; in addition, it can determine whether a relevant document should be added as a feature document for the node based on its degree of topic relevance. - The category display means 806 is used to display the relevant documents by category by utilizing the personalized category tree. In the present embodiment, the category display means 806 can display the relevant documents in the various display modes described above.
- The category tree storage means 807 is used to store the personalized category tree, including: for example, the attribute information in each category node and its relevant documents, feature documents, etc.
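One way to picture what the category node editor configures and what the storage means persists is a small node record whose unset settings fall back to the parent's. This is a sketch under assumptions, not the system's actual data model: `NodeConfig`, `resolve`, `to_json`, the JSON layout and the example URL are all hypothetical. Only the set of inheritable settings (keywords, feature documents, information source) comes from the description of the category node editor.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Settings that default to the parent's value when the user leaves them unset.
INHERITED = ("keywords", "feature_docs", "information_source")


@dataclass
class NodeConfig:
    label: str
    keywords: Optional[list] = None
    feature_docs: Optional[list] = None
    information_source: Optional[str] = None
    children: list = field(default_factory=list)


def resolve(node, parent_settings=None):
    """Return a plain dict with unset settings filled in from the parent."""
    parent_settings = parent_settings or {}
    settings = {}
    for name in INHERITED:
        own = getattr(node, name)
        settings[name] = own if own is not None else parent_settings.get(name)
    resolved = {"label": node.label, **settings}
    resolved["children"] = [resolve(child, settings) for child in node.children]
    return resolved


def to_json(root):
    # Serialized form a storage component might persist (layout is an assumption).
    return json.dumps(resolve(root), indent=2)


root = NodeConfig("News", keywords=["news"], information_source="https://example.com/news")
root.children.append(NodeConfig("Sports", keywords=["sports"]))
print(to_json(root))
```

Here the "Sports" node keeps its own keywords but picks up the parent's information source, while the label is never inherited.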
- From the above description it can be seen that with the personalized categorization system of the present embodiment, the above described method for constructing a personalized category tree, the method for maintaining a personalized category tree and the method for displaying documents by category by utilizing a personalized category tree can be realized.
- It should be pointed out that the personalized categorization system of the present invention and its components may be implemented in the form of hardware and software, and may be combined with other devices as needed. For example, the system may be implemented on various devices with information processing capabilities, such as a personal computer, a server, a notebook computer, a handheld computer, a PDA, etc., and its components can be physically separated from, yet operationally interconnected with, each other in order to function.
- Although a method for constructing a personalized category tree, a method for maintaining a personalized category tree, a method for displaying documents by category by utilizing a personalized category tree and a personalized categorization system of the present invention have been described in detail through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art can make various variations and modifications thereof within the spirit and scope of the present invention. Therefore, the present invention is not limited to these embodiments, and the scope of the present invention is only defined by the appended claims.
- Variations described for the present invention can be realized in any combination desirable for each particular application. Thus particular limitations, and/or embodiment enhancements described herein, which may have particular advantages to a particular application need not be used for all applications. Also, not all limitations need be implemented in methods, systems and/or apparatus including one or more concepts of the present invention. Methods may be implemented as signal methods employing signals to implement one or more steps. Signals include those emanating from the Internet, etc.
- The present invention can be realized in hardware, software, or a combination of hardware and software. A visualization tool according to the present invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system—or other apparatus adapted for carrying out the methods and/or functions described herein—is suitable. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
- Computer program means or computer program in the present context include any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after conversion to another language, code or notation, and/or reproduction in a different material form.
- Thus the invention includes an article of manufacture which comprises a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the article of manufacture comprises computer readable program code means for causing a computer to effect the steps of a method of this invention. Similarly, the present invention may be implemented as a computer program product comprising a computer usable medium having computer readable program code means embodied therein for causing a function described above. The computer readable program code means in the computer program product comprises computer readable program code means for causing a computer to effect one or more functions of this invention. Furthermore, the present invention may be implemented as a program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for causing one or more functions of this invention.
- It is noted that the foregoing has outlined some of the more pertinent objects and embodiments of the present invention. This invention may be used for many applications. Thus, although the description is made for particular arrangements and methods, the intent and concept of the invention is suitable and applicable to other arrangements and applications. It will be clear to those skilled in the art that modifications to the disclosed embodiments can be effected without departing from the spirit and scope of the invention. The described embodiments ought to be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be realized by applying the disclosed invention in a different manner or modifying the invention in ways known to those familiar with the art.
Claims (28)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2004100546318A CN100538695C (en) | 2004-07-22 | 2004-07-22 | The method and system of structure, the personalized classification tree of maintenance |
CN200410054631 | 2004-07-22 | ||
CN200410054631.8 | 2004-07-22 |
Publications (2)
Publication Number | Publication Date |
---|---|
US20060020588A1 true US20060020588A1 (en) | 2006-01-26 |
US7865530B2 US7865530B2 (en) | 2011-01-04 |
Family
ID=35658481
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/188,194 Expired - Fee Related US7865530B2 (en) | 2004-07-22 | 2005-07-22 | Constructing and maintaining a personalized category tree, displaying documents by category and personalized categorization system |
Country Status (2)
Country | Link |
---|---|
US (1) | US7865530B2 (en) |
CN (1) | CN100538695C (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100395755C (en) * | 2006-02-23 | 2008-06-18 | 无锡永中科技有限公司 | Method for building tree file structure in computer |
KR100849497B1 (en) * | 2006-09-29 | 2008-07-31 | 한국전자통신연구원 | Method of Protein Name Normalization Using Ontology Mapping |
CN101315624B (en) * | 2007-05-29 | 2015-11-25 | 阿里巴巴集团控股有限公司 | A kind of method and apparatus of text subject recommending |
CN101714142B (en) * | 2008-10-06 | 2012-10-17 | 易搜比控股公司 | Method for merging file clusters |
US20110282858A1 (en) * | 2010-05-11 | 2011-11-17 | Microsoft Corporation | Hierarchical Content Classification Into Deep Taxonomies |
CN103106262B (en) * | 2013-01-28 | 2016-05-11 | 新浪网技术(中国)有限公司 | The method and apparatus that document classification, supporting vector machine model generate |
KR20170142215A (en) * | 2013-07-02 | 2017-12-27 | 콘비다 와이어리스, 엘엘씨 | Mechanisms for semantics publishing and discovery |
US9400839B2 (en) | 2013-07-03 | 2016-07-26 | International Business Machines Corporation | Enhanced keyword find operation in a web page |
CN103605796B (en) * | 2013-12-05 | 2016-08-03 | 用友优普信息技术有限公司 | Support document management apparatus and the document management method of version iteration |
CN108512854B (en) * | 2018-04-09 | 2021-09-07 | 平安科技(深圳)有限公司 | System information safety monitoring method and device, computer equipment and storage medium |
CN108509424B (en) * | 2018-04-09 | 2021-08-10 | 平安科技(深圳)有限公司 | System information processing method, apparatus, computer device and storage medium |
CN112015893B (en) * | 2020-08-12 | 2024-11-29 | 北京字节跳动网络技术有限公司 | Data processing method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5644764A (en) * | 1995-01-31 | 1997-07-01 | Unisys Corporation | Method for supporting object modeling in a repository |
US6002750A (en) * | 1997-12-12 | 1999-12-14 | U S West, Inc. | Method and system for providing integrated wireline/wireless voice messaging service |
US6014662A (en) * | 1997-11-26 | 2000-01-11 | International Business Machines Corporation | Configurable briefing presentations of search results on a graphical interface |
US6047284A (en) * | 1997-05-14 | 2000-04-04 | Portal Software, Inc. | Method and apparatus for object oriented storage and retrieval of data from a relational database |
US6055540A (en) * | 1997-06-13 | 2000-04-25 | Sun Microsystems, Inc. | Method and apparatus for creating a category hierarchy for classification of documents |
US6216134B1 (en) * | 1998-06-25 | 2001-04-10 | Microsoft Corporation | Method and system for visualization of clusters and classifications |
US6223145B1 (en) * | 1997-11-26 | 2001-04-24 | Zerox Corporation | Interactive interface for specifying searches |
US6345252B1 (en) * | 1999-04-09 | 2002-02-05 | International Business Machines Corporation | Methods and apparatus for retrieving audio information using content and speaker information |
US20020032672A1 (en) * | 2000-03-09 | 2002-03-14 | The Web Access, Inc | Method and apparatus for formatting information within a directory tree structure into an encylopedia-like entry |
US7162540B2 (en) * | 2000-05-15 | 2007-01-09 | Catchfire Systems, Inc. | Method and system for prioritizing network services |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5097328B2 (en) | 2001-05-25 | 2012-12-12 | オラクル・オゥ・ティー・シィ・サブシディアリィ・リミテッド・ライアビリティ・カンパニー | Hierarchical data driven navigation system and method for information retrieval |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080263594A1 (en) * | 2005-04-06 | 2008-10-23 | Ruzz Tv Pty Ltd | Schedule of a Broadcast Management System |
US20080068641A1 (en) * | 2006-09-19 | 2008-03-20 | Xerox Corporation | Document processing system |
US20080162456A1 (en) * | 2006-12-27 | 2008-07-03 | Rakshit Daga | Structure extraction from unstructured documents |
US7562088B2 (en) * | 2006-12-27 | 2009-07-14 | Sap Ag | Structure extraction from unstructured documents |
US9069883B2 (en) | 2007-03-17 | 2015-06-30 | Samsung Electronics Co., Ltd. | Document management method and document management apparatus using the same |
US20080228734A1 (en) * | 2007-03-17 | 2008-09-18 | Samsung Electronics Co., Ltd. | Document management method and document management apparatus using the same |
KR101182280B1 (en) * | 2007-03-17 | 2012-09-14 | 삼성전자주식회사 | Document management method and document management apparatus |
US8799336B1 (en) * | 2007-04-12 | 2014-08-05 | United Services Automobile Association | Electronic file management hierarchical structure |
EP3309182A2 (en) | 2007-11-15 | 2018-04-18 | Univation Technologies, LLC | Polymerization catalysts, methods of making; methods of using, and polyolefinproducts made therefrom |
US9234060B2 (en) | 2011-11-08 | 2016-01-12 | Univation Technologies, Llc | Methods of preparing a catalyst system |
WO2013070601A2 (en) | 2011-11-08 | 2013-05-16 | Univation Technologies, Llc | Methods of preparing a catalyst system |
US20140236918A1 (en) * | 2012-06-19 | 2014-08-21 | Bublup, Inc. | Systems and methods for semantic overlay for a searchable space |
US20140229460A1 (en) * | 2012-06-19 | 2014-08-14 | Bublup, Inc. | Systems and methods for semantic overlay for a searchable space |
US20150012543A1 (en) * | 2013-07-02 | 2015-01-08 | Via Technologies, Inc. | Region labeling method and device of data documents |
US9569728B2 (en) | 2014-11-14 | 2017-02-14 | Bublup Technologies, Inc. | Deriving semantic relationships based on empirical organization of content by users |
US10155826B2 (en) | 2014-12-12 | 2018-12-18 | Exxonmobil Research And Engineering Company | Olefin polymerization catalyst system comprising mesoporous organosilica support |
US11314757B2 (en) * | 2015-06-12 | 2022-04-26 | Bublup, Inc. | Search results modulator |
WO2018191000A1 (en) | 2017-04-10 | 2018-10-18 | Exxonmobil Chemicl Patents Inc. | Methods for making polyolefin polymer compositions |
CN108052636A (en) * | 2017-12-20 | 2018-05-18 | 北京工业大学 | Determine the method, apparatus and terminal device of the text subject degree of correlation |
Also Published As
Publication number | Publication date |
---|---|
CN100538695C (en) | 2009-09-09 |
US7865530B2 (en) | 2011-01-04 |
CN1725213A (en) | 2006-01-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JACOVI, MICHAL;SOROKA, VLADIMIR;SIGNING DATES FROM 20050915 TO 20051006;REEL/FRAME:016865/0870 Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JACOVI, MICHAL;SOROKA, VLADIMIR;REEL/FRAME:016865/0870;SIGNING DATES FROM 20050915 TO 20051006 |
|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LIU, SHI XIA;YANG, LI PING;REEL/FRAME:016866/0970 Effective date: 20050822 |
|
REMI | Maintenance fee reminder mailed | ||
LAPS | Lapse for failure to pay maintenance fees | ||
STCH | Information on status: patent discontinuation |
Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
|
FP | Lapsed due to failure to pay maintenance fee |
Effective date: 20150104 |