
CN114238654A - Method, device and computer-readable storage medium for constructing knowledge graph - Google Patents

Info

Publication number
CN114238654A
Authority
CN
China
Prior art keywords
knowledge
title
text
category
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111536550.1A
Other languages
Chinese (zh)
Other versions
CN114238654B (en
Inventor
李新鹏
彭加琪
王松
崔玉波
李春杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111536550.1A priority Critical patent/CN114238654B/en
Publication of CN114238654A publication Critical patent/CN114238654A/en
Application granted granted Critical
Publication of CN114238654B publication Critical patent/CN114238654B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The present application discloses a method and device for constructing a knowledge graph, and a computer-readable storage medium. The method includes: acquiring a category mapping table between the directory titles of a document to be processed and knowledge categories; generating a knowledge point category based on the text data of the document to be processed and the category mapping table, where the knowledge point category is the knowledge category corresponding to the text data; parsing the text data to obtain parsed data, where the parsed data includes an entity attribute name and an attribute value; and generating a knowledge graph based on the knowledge point category, the entity attribute name and the attribute value. In this way, the present application can reduce labor costs.

Figure 202111536550

Description

Knowledge graph construction method and device and computer readable storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method and a device for constructing a knowledge graph and a computer-readable storage medium.
Background
The richness of knowledge is an important factor in machine intelligence. An Artificial Intelligence (AI) system acquires knowledge from diverse sources and in diverse formats: for example, knowledge that people compile by hand and pass to the AI system, knowledge gathered from the Internet, or unstructured knowledge derived from specific product manuals. A knowledge graph can be formed by extracting this knowledge, and constructing knowledge graphs quickly and efficiently has become one of the basic tasks of knowledge graph research. However, the construction schemes used in the related art suffer from low efficiency or high model training cost.
Disclosure of Invention
The application provides a method and a device for constructing a knowledge graph and a computer-readable storage medium, which can reduce labor cost.
To solve the above technical problem, the technical solution adopted by the present application is to provide a method for constructing a knowledge graph, the method comprising: acquiring a category mapping table between the directory titles of a document to be processed and knowledge categories; generating a knowledge point category based on the text data of the document to be processed and the category mapping table, wherein the knowledge point category is the knowledge category corresponding to the text data; parsing the text data to obtain parsed data, wherein the parsed data comprises an entity attribute name and an attribute value; and generating the knowledge graph based on the knowledge point category, the entity attribute name and the attribute value.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a knowledge-graph building apparatus comprising a memory and a processor connected to each other, wherein the memory is used for storing a computer program, and the computer program is used for implementing the method for building a knowledge-graph in the above technical solution when being executed by the processor.
In order to solve the above technical problem, another technical solution adopted by the present application is: there is provided a computer-readable storage medium for storing a computer program for implementing the method for constructing a knowledge-graph in the above technical solution when the computer program is executed by a processor.
With the above scheme, the beneficial effects of the present application are as follows. First, a mapping relationship between the directory of the document to be processed and the knowledge categories is constructed manually to generate a category mapping table. A machine then acquires the category mapping table and uses it, together with the text data of the document to be processed, to generate the knowledge point category, entity attribute name and attribute value corresponding to each knowledge point in the document, thereby constructing a knowledge graph covering each knowledge point. Because no knowledge extraction model needs to be trained, no training data set has to be acquired and no large-scale manual labeling is required. Compared with schemes that rely mainly on manual extraction or on human-machine coupling, less labor is invested and more efficient and accurate results can be obtained.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic flow chart diagram of an embodiment of a method for constructing a knowledge graph provided herein;
FIG. 2 is a schematic flow chart diagram of another embodiment of a method for constructing a knowledge graph provided herein;
FIG. 3 is a schematic illustration of the last-level and penultimate-level titles provided herein;
FIG. 4 is a schematic flow chart of S24 in the embodiment shown in FIG. 2;
FIG. 5 is a schematic flow chart of S43 in the embodiment shown in FIG. 4;
FIG. 6 is a schematic illustration of the directory of an automobile manual provided herein;
FIG. 7 is a mapping of the chapters of FIG. 6 to knowledge categories;
FIG. 8 is a schematic illustration of a sub-section of the "operation components" chapter of FIG. 6;
FIG. 9 is a detail page of the "front seat" section of FIG. 8;
FIG. 10 is a schematic diagram of the hierarchy corresponding to FIGS. 6, 8 and 9;
FIG. 11 shows the knowledge information corresponding to FIG. 9;
FIG. 12 is a schematic diagram of an embodiment of a knowledge graph constructing apparatus provided herein;
FIG. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be noted that the following examples are only illustrative of the present application, and do not limit the scope of the present application. Likewise, the following examples are only some examples and not all examples of the present application, and all other examples obtained by a person of ordinary skill in the art without any inventive work are within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
It should be noted that the terms "first", "second" and "third" in the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of technical features indicated. Thus, a feature defined as "first", "second" or "third" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two or three, unless explicitly limited otherwise. Furthermore, the terms "include" and "have", as well as any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements, but may optionally include other steps or elements that are not listed or that are inherent to the process, method, article, or apparatus.
First, the construction of knowledge graphs is introduced. A knowledge graph for a specific field (such as automotive or medical knowledge) usually has a strict knowledge structure, and its construction comprises ontology modeling and entity knowledge extraction. In ontology modeling, a domain expert sorts out and designs the knowledge structure. The entity knowledge extraction stage then converts source data of different origins and forms into structured knowledge through a corresponding technical scheme. For application scenarios with a small amount of knowledge, the entity knowledge can be sorted out or imported entirely by hand, but for scenarios such as automobiles or medicine, building the knowledge graph entirely by hand is clearly time-consuming and laborious.
For unstructured or semi-structured data (such as product manuals), knowledge extraction mainly relies on human-machine coupling: data is labeled manually, a machine trains a knowledge extraction model, and machine-assisted extraction is combined with manual verification. Specifically, a list of documents in the same domain is collected; taking the field of automobile parts as an example, the document list consists of the product manuals of the various vehicle series. The document list is uploaded to a labeling platform, where a labeling task is created and assigned to annotators, and labeling uses the knowledge structure labels defined in the knowledge platform. The labeling results form a training data set, which is input to a training platform. The training platform creates a training task, loads the training data set and generates a knowledge extraction model. The machine then runs the knowledge extraction model to extract knowledge points from the document to be extracted (such as the manual of a specific vehicle model). Finally, the extracted results must be confirmed by an auditor before being stored.
Compared with the traditional scheme that depends entirely on manual extraction, combining manual labeling with machine-assisted extraction achieves a degree of automation. Business practice shows, however, that the effect of machine-assisted extraction depends on the size and quality of the training data set. On the one hand, the more data is used to train the knowledge extraction model, the better the model performs and the more accurate its results; ensuring good performance and accuracy therefore requires labeling more training data, which consumes more manpower. On the other hand, manual participation in the early labeling process is subjective and affects the quality of the training data set; if that quality is low, the knowledge extraction model and the machine-assisted extraction results are poor. Therefore, once the labeled data is insufficient or the training data set is of low quality, the model is inaccurate, and a large amount of manpower is still needed for subsequent auditing and confirmation to guarantee the quality of the stored results.
In some application scenarios this scheme even costs more and performs worse than fully manual extraction. Take extracting the content of an automobile manual as an example, where the goal is to accurately extract and store the knowledge in a manual (of, say, 400 pages). Fully manual extraction takes 3 to 5 people about 2 months. With the human-machine coupling scheme described above, manuals of at least the same length must first be labeled, so the manpower spent on labeling the training data set already equals that of fully manual extraction. In the model training stage, developers must then be assigned to develop and optimize the model, and in the audit and storage stage auditors must review the results. The total manpower of this scheme is therefore clearly higher than that of fully manual extraction, and model training additionally consumes expensive Graphics Processing Unit (GPU) server resources.
Based on the above problem, the present application provides an extraction scheme based on document structure. The technical scheme adopted in the present application is described in detail below.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a method for constructing a knowledge graph provided by the present application, the method including:
S11: Acquire a category mapping table between the directory titles of the document to be processed and knowledge categories.
The document to be processed can be obtained from a document database, downloaded from the Internet, or received from another device. It can be an unstructured document with a directory (table of contents), such as a product manual, a paper, or a book.
Further, the knowledge categories are mainly derived from the chapter titles in the document to be processed. To build knowledge categories that are unified and standardized across a domain, chapter titles are not used directly as knowledge categories; instead, a mapping from chapter titles to knowledge categories is defined in advance. Specifically, for the obtained document, the directory structure is analyzed manually, a mapping relationship between each chapter in the directory and a knowledge category is established, and a category mapping table is generated for subsequent use. The directory titles include top-level titles, and the category mapping table contains each top-level title and its corresponding knowledge category.
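A category mapping table of this kind can be pictured as a plain lookup from top-level directory titles to knowledge categories. The following is a minimal sketch, assuming the automobile manual example used later in this description; the names `CATEGORY_MAP` and `categories_in_use` are hypothetical, and the "driving" row is an assumed mapping added only for illustration.

```python
# Hypothetical category mapping table: top-level directory title -> knowledge category.
# The two "parts" rows follow the automobile manual example in this description;
# the "driving" row is an assumption made for illustration.
CATEGORY_MAP = {
    "instrument cluster": "parts",
    "operation components": "parts",
    "driving": "driving methods",  # assumed mapping
}


def categories_in_use(mapping):
    """Return the distinct knowledge categories, sorted for stable output."""
    return sorted(set(mapping.values()))
```

Several chapters may share one knowledge category, which is why the table is keyed by title rather than by category.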
S12: Generate a knowledge point category based on the text data of the document to be processed and the category mapping table.
After the category mapping table is obtained, the text data of the document to be processed is processed together with the table to generate a knowledge point category, which is the knowledge category corresponding to the text data. Specifically, the document is parsed to obtain its text data, which comprises text, formulas or pictures. The text data is then processed with a document processing method from the related art to obtain the titles in the text data (the text titles). Each text title is then matched against the category mapping table: using the text title as a keyword, the table is searched for a similar directory title, and if one exists, the knowledge category corresponding to that directory title is taken as the knowledge point category.
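The description does not specify how similarity between a text title and a directory title is measured. The sketch below assumes a simple string-similarity lookup using `difflib.get_close_matches` from the Python standard library; the function name `match_category`, the lowercase keys and the `cutoff` value of 0.8 are all assumptions for illustration.

```python
import difflib


def match_category(text_title, category_map, cutoff=0.8):
    """Return the knowledge category of the directory title most similar to
    text_title, or None when no directory title is similar enough.

    category_map is assumed to use lowercase directory titles as keys.
    """
    key = text_title.strip().lower()
    candidates = difflib.get_close_matches(key, list(category_map), n=1, cutoff=cutoff)
    return category_map[candidates[0]] if candidates else None
```

Raising `cutoff` to 1.0 reduces the lookup to matching identical titles only (after lowercasing and trimming).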
S13: Parse the text data to obtain parsed data.
After the text data is obtained, it can be parsed with a document parsing method from the related art to generate parsed data. The parsed data comprises entity attribute names and attribute values, which correspond to the knowledge point categories: an entity attribute name is the name of a knowledge point, and an attribute value is the attribute information of that knowledge point.
S14: Generate the knowledge graph based on the knowledge point category, the entity attribute name and the attribute value.
After the knowledge point category, entity attribute name and attribute value corresponding to each knowledge point are obtained, the correspondence among them can be established to obtain the knowledge graph. Alternatively, the knowledge point category, entity attribute name and attribute value can be stored in a document, or the correspondence among them can be recorded in a knowledge correspondence table for later operations, such as updating or modifying its content.
It can be understood that, after the knowledge point categories, entity attribute names and attribute values are obtained, this knowledge information can be verified manually to ensure the accuracy of the final knowledge graph.
This embodiment provides a knowledge extraction scheme based on document structure, suitable for knowledge graph construction. A mapping relationship between the directory of the document to be processed and the knowledge categories is constructed manually to obtain a category mapping table. A machine (the knowledge graph construction device) then acquires the category mapping table and, using it together with the text data of the document and a document parsing method from the related art, obtains the knowledge point category, entity attribute name and attribute value corresponding to each knowledge point, thereby constructing a knowledge graph covering each knowledge point. Because no knowledge extraction model needs to be trained, no training data set has to be acquired and no large-scale manual labeling is required. Manual work is still needed, but only for building the category mapping table and in the verification stage, so the labor cost is lower than that of schemes based on a knowledge extraction model. Unstructured documents can thus be turned into a knowledge graph efficiently, improving the efficiency and accuracy of knowledge extraction.
Referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of a method for constructing a knowledge graph provided by the present application, the method including:
S21: Acquire a category mapping table between the directory titles of the document to be processed and knowledge categories.
S21 is the same as S11 in the previous embodiment and is not repeated here.
S22: Parse the text data to obtain text title data and text content data.
After the text data of the document to be processed is obtained, it can be classified with a document processing method from the related art into two parts: text title data and text content data. The text title data consists of the titles that appear in the body of the document, and the text content data consists of everything in the body other than the titles. By parsing the complete document structure, the data corresponding to the titles at every level (the text title data) and the data corresponding to the body content (the text content data) are generated.
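The split into title data and content data can be sketched as follows. The description leaves the document processing method to the related art, so this is only an illustration under a strong assumption: titles are lines that start with a dotted section number such as "1.1.2". The names `HEADING` and `split_titles_and_content` are hypothetical.

```python
import re

# Assumed title format: a dotted section number followed by the title text,
# e.g. "1.1 Adjust seat". Real manuals may need a different detector.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def split_titles_and_content(lines):
    """Split body lines into title data and content data.

    Returns (titles, content): titles is a list of (level, title) in document
    order, and content maps each title to the non-empty lines that follow it.
    """
    titles, content, current = [], {}, None
    for line in lines:
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            level = match.group(1).count(".") + 1  # "1.1.1" -> level 3
            current = match.group(2)
            titles.append((level, current))
            content[current] = []
        elif current is not None and stripped:
            content[current].append(stripped)
    return titles, content
```

A content line that itself begins with a number followed by a space would be misread as a title under this assumption, so a production parser would rely on layout or style information instead.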
S23: Match the text titles against the category mapping table to obtain the matching knowledge category, and take that knowledge category as the knowledge point category.
The text title data comprises a plurality of text titles. After the text title data is obtained, a knowledge point category can be generated from the category mapping table (the mapping between knowledge categories and directory titles) and the title of each chapter in the body of the document (the text title): it is determined whether the category mapping table contains a directory title identical to the text title, and if so, the knowledge category corresponding to that directory title is the knowledge point category.
Further, the plurality of text titles include a first-level title, which is the first title to appear in a chapter. The first-level title can be matched against the category mapping table to obtain the corresponding knowledge point category.
S24: Parse the text title data to generate an entity name and an attribute name.
The entity attribute name comprises an entity name and an attribute name, which can be resolved from the last-level title and the penultimate-level title (the title one level above the last-level title). The text title data is parsed to obtain a plurality of text titles, and these are processed to generate the entity name and the attribute name. Specifically, the text titles include a last-level title and a penultimate-level title. Either the last-level title is resolved into the entity name and the penultimate-level title into the attribute name, or the last-level title is resolved into the attribute name and the penultimate-level title into the entity name.
Further, some schemes always resolve the last-level title into the entity name and the penultimate-level title into the attribute name, but this does not always hold. For example, as shown in FIG. 3, the last-level title "steering wheel adjustment" should be resolved into an attribute name and the penultimate-level title "steering wheel" into an entity name. The resolution of entity names and attribute names therefore needs further refinement; a specific implementation is described below.
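Before deciding which title is the entity name, the last-level and penultimate-level titles have to be paired. A minimal sketch, assuming the titles are available as `(level, title)` tuples in document order and that levels do not skip; the function name `leaf_pairs` and the sample titles other than "steering wheel" and "steering wheel adjustment" are hypothetical.

```python
def leaf_pairs(titles):
    """Return (penultimate_title, last_title) for every last-level title.

    titles is a list of (level, title) in document order, level 1 being the
    top level. A title counts as last-level when the next title is not deeper.
    """
    pairs, stack = [], []
    for index, (level, title) in enumerate(titles):
        del stack[level - 1:]          # drop titles at this level or deeper
        stack.append(title)
        is_last_level = index + 1 == len(titles) or titles[index + 1][0] <= level
        if is_last_level and len(stack) >= 2:
            pairs.append((stack[-2], stack[-1]))
    return pairs
```

Each returned pair is a candidate for the entity/attribute decision described in the following steps.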
The processing follows the scheme shown in FIG. 4 and comprises the following steps:
S41: Select the title to be processed from the last-level title and the penultimate-level title.
Taking a product manual as the document to be processed, a title that can serve as an entity name should be a noun, and a title that can serve as an attribute name should be a verb or a combination of several words containing at least one verb. The entity name and the attribute name can therefore be determined by segmenting the text title into words and examining the parts of speech and the number of words in the segmentation result.
Further, either the last-level title or the penultimate-level title can be selected for word segmentation, and whether the other title is an entity name or an attribute name is then determined from the segmentation result.
S42: Perform word segmentation on the title to be processed to obtain a word segmentation result.
The title to be processed is split with a word segmentation method from the related art to generate a word segmentation result that contains at least one title word. Further, the word segmentation result includes the number of title words and the part-of-speech information of each title word.
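The shape of a word segmentation result can be illustrated with a toy segmenter. The description relies on an existing word segmentation method, which is not reproduced here; the sketch below swaps in a greedy longest-match lookup over a small hand-tagged vocabulary. The function name `segment`, the vocabulary and the tags ("n" for noun, "v" for verb, "x" for unknown) are assumptions for illustration.

```python
def segment(title, vocab, max_len=3):
    """Greedy longest-match segmentation over space-separated words.

    vocab maps a term (one or more words) to a part-of-speech tag.
    Returns a list of (title_word, tag); unknown single words get tag "x".
    """
    words, result, i = title.lower().split(), [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            term = " ".join(words[i:i + n])
            if term in vocab or n == 1:
                result.append((term, vocab.get(term, "x")))
                i += n
                break
    return result


# Hand-assigned tags for illustration only.
VOCAB = {"steering wheel": "n", "adjustment": "v"}
tokens = segment("Steering wheel adjustment", VOCAB)
noun_count = sum(1 for _, tag in tokens if tag == "n")
```

The number of title words is `len(tokens)`, and the number of title words with a given part of speech is obtained by filtering on the tag, as in `noun_count`.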
S43: Generate the entity name and the attribute name based on the word segmentation result.
In the final knowledge graph, an entity name should be a general, standardized term of the field, such as a part name in the automotive field or a drug name in the medical field. A lexicon of entity names (the preset labeled lexicon) can therefore be established. The preset labeled lexicon is shared within a field; for example, it applies to the manuals of vehicle models of different brands. When extracting entity names, the words in the preset labeled lexicon can be consulted, and the word segmentation result is processed with the scheme shown in FIG. 5, which comprises the following steps:
S51: Determine whether the title word exists in the preset labeled lexicon.
The preset labeled lexicon comprises a plurality of entity names.
S52: If the title word exists in the preset labeled lexicon, determine the title word as the entity name, and determine the other title (whichever of the last-level title and the penultimate-level title is not the title to be processed) as the attribute name.
If an entity name identical to the title word is found in the preset labeled lexicon, the title word is extracted as the entity name. The other title of the pair is then determined as the attribute name, or a keyword in the other title can be extracted as the attribute name.
S53: If the title word does not exist in the preset labeled lexicon, determine the entity name and the attribute name based on the number and part of speech of the title words.
It is determined whether the number of title words in the word segmentation result whose part of speech is the preset part of speech equals the preset number. If it does, the title word is set as the entity name and the other title of the pair is determined as the attribute name. If it does not, the title word is set as the attribute name and the other title of the pair is determined as the entity name. Specifically, the preset part of speech is noun and the preset number is 1: if the word segmentation result contains exactly one noun, that noun is extracted as the entity name; otherwise the title word is extracted as the attribute name.
Further, when the title word does not exist in the preset labeled lexicon, the title word is stored in the preset labeled lexicon to update it; in other words, newly extracted entity names enrich and expand the lexicon.
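Steps S51 to S53 amount to a small decision procedure, sketched below under stated assumptions. The word segmentation result is supplied as hand-tagged `(word, tag)` pairs rather than produced by a real segmenter; the function name `resolve_names` is hypothetical; and when the noun count does not equal the preset number, the sketch uses the whole processed title as the attribute name, which is one reading of the step.

```python
def resolve_names(title, other_title, tokens, lexicon, pos="n", preset_count=1):
    """Decide which of two adjacent titles is the entity name.

    title        -- the title selected for processing
    other_title  -- the remaining title of the last-level/penultimate pair
    tokens       -- segmentation result of `title` as (word, tag) pairs
    lexicon      -- set of known entity names; updated in place
    Returns (entity_name, attribute_name).
    """
    # S51/S52: a title word found in the lexicon is the entity name.
    for word, _ in tokens:
        if word in lexicon:
            return word, other_title
    # S53: count the title words with the preset part of speech.
    matches = [word for word, tag in tokens if tag == pos]
    if len(matches) == preset_count:
        entity = matches[0]
        lexicon.add(entity)  # expand the lexicon with the new entity name
        return entity, other_title
    # Otherwise the processed title is the attribute, the other the entity.
    return other_title, title
```

For the FIG. 3 example, processing the penultimate-level title "steering wheel" with a lexicon that already contains it yields the entity name "steering wheel" and the attribute name "steering wheel adjustment".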
S25: analyzing the text content data to obtain an attribute value.
The text content data is analyzed, and the text content under the last-stage title is parsed into an attribute value, so that the text content data is extracted as attribute values.
S26: establishing the correspondence among the knowledge point category, the entity name, the attribute name and the attribute value to generate a knowledge graph.
The above scheme completes the extraction of the entity name, the attribute value and the knowledge category to which they belong, and the knowledge graph can be constructed from the extracted knowledge information.
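As a rough sketch of step S26, the correspondence among the four elements can be held in a nested mapping. The function name `build_graph`, the nested-dictionary layout, and the placeholder attribute value are assumptions made for illustration; the present application does not prescribe a storage format.

```python
def build_graph(records):
    """Nest (category, entity name, attribute name, attribute value) tuples."""
    graph = {}
    for category, entity, attribute, value in records:
        # category -> entity name -> attribute name -> attribute value
        graph.setdefault(category, {}).setdefault(entity, {})[attribute] = value
    return graph


records = [
    # The value is a placeholder for the text content under the last-stage title.
    ("parts", "driver seat", "adjustment step", "<text under the last-stage title>"),
]
graph = build_graph(records)
```

The same tuples could equally be written to a graph database as nodes and edges; the nested mapping is only the smallest structure that preserves the correspondence.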
In a specific embodiment, a product manual is taken as an example to better explain the technical solution adopted in this embodiment.
A product manual is generally organized by a table of contents. Taking an automobile manual as an example, it includes chapters such as "safety notice", "instrument cluster", "operation components", "driving", "audio system", and "in-vehicle equipment", as shown in fig. 6.
Knowledge categories may be defined based on the chapter titles, for example "parts" or "driving methods". Depending on how the knowledge system is organized, knowledge categories and chapters do not necessarily correspond one to one. For example, as shown in fig. 6 and 7, the contents of the two chapters "instrument cluster" and "operation components" can both be extracted as knowledge of the category "parts", where chapter denotes the chapter and concept denotes the knowledge category. A mapping from chapters to knowledge categories therefore needs to be established.
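A chapter-to-category mapping of this kind can be sketched as a simple lookup table. The two "parts" entries follow the example above; the "driving" entry, the names `CATEGORY_MAP` and `knowledge_point_category`, and the choice to return `None` for unmapped chapters are assumptions for illustration only.

```python
# Many-to-one mapping from chapter titles to knowledge categories.
CATEGORY_MAP = {
    "instrument cluster": "parts",
    "operation components": "parts",
    "driving": "driving methods",  # assumed pairing, for illustration only
}


def knowledge_point_category(chapter_title, mapping=CATEGORY_MAP):
    """Return the knowledge category for a chapter title, or None if unmapped."""
    return mapping.get(chapter_title)


cluster_category = knowledge_point_category("instrument cluster")
components_category = knowledge_point_category("operation components")
unmapped = knowledge_point_category("audio system")
```

Because several chapters may share one category, the table is keyed by chapter title rather than by category.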
The content of a specific chapter is also hierarchically structured and may comprise first-level titles, second-level titles, and so on. For example, for the "adjust seat-front seat" section of fig. 8, the content page is shown in fig. 9, and the hierarchical structure obtained by parsing is shown in fig. 10. The section mainly describes the specific adjustment steps of the driver seat together with concrete content (i.e., the combined image-and-text step descriptions in fig. 9). The expected final extraction result is shown in fig. 11, where concept is the knowledge category, entity is the entity name, property is the attribute name, and value is the attribute value.
As shown in fig. 9 to 11, the last-stage title "driver seat" is parsed as the entity name, and its parent title (i.e., the penultimate-stage title) "adjustment step" is parsed as the attribute name. The two intermediate levels of titles can be kept or discarded according to the knowledge system to be finally constructed; for example, "front row seats" can be parsed as a subtype of "parts", but it is not a core knowledge point that needs to be extracted. At this point, the key elements of knowledge extraction have all been obtained: knowledge category, entity name, attribute name, and attribute value.
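The mapping from a parsed title hierarchy to the four fields of fig. 11 can be sketched as below. The tree is a hypothetical stand-in for the hierarchy of fig. 10, the value is a placeholder, and the function names are invented. The sketch hard-codes the assignment used in this example (last-stage title as entity name, penultimate-stage title as attribute name); in the method itself that choice is made in steps S52 and S53.

```python
def iter_leaf_paths(node, path=()):
    """Yield (title_path, text) for each last-stage title in a title tree."""
    path = path + (node["title"],)
    children = node.get("children", [])
    if not children:
        yield path, node.get("text", "")
    for child in children:
        yield from iter_leaf_paths(child, path)


def to_record(path, text, category):
    """Map one title path to the four fields named in fig. 11."""
    if len(path) < 2:
        raise ValueError("need a penultimate-stage and a last-stage title")
    return {
        "concept": category,   # knowledge category
        "entity": path[-1],    # last-stage title
        "property": path[-2],  # penultimate-stage title
        "value": text,         # text content under the last-stage title
    }


# Hypothetical tree; the real hierarchy is the one shown in fig. 10.
tree = {
    "title": "front row seats",
    "children": [{
        "title": "adjustment step",
        "children": [{"title": "driver seat", "text": "<step description>"}],
    }],
}
records = [to_record(path, text, "parts") for path, text in iter_leaf_paths(tree)]
```

Intermediate titles such as "front row seats" remain available in `path` should the knowledge system later need them as subtypes.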
Based on the structural characteristics of the document to be processed, this embodiment judges the number and the part of speech of the word segmentation results of the last-stage title or the penultimate-stage title and, combined with means such as a pre-established labeled word library of entity names, extracts the knowledge information (including knowledge category, entity name, attribute name and attribute value) from the document to be processed. With the knowledge extraction results of this scheme, the knowledge graph of the document to be processed can be constructed with only a small amount of manual verification, which saves substantial labor and machine-training costs.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of a knowledge graph constructing apparatus provided in the present application, the knowledge graph constructing apparatus 120 includes a memory 121 and a processor 122 connected to each other, the memory 121 is used for storing a computer program, and the computer program is used for implementing the method for constructing the knowledge graph in the foregoing embodiment when being executed by the processor 122.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application, where the computer-readable storage medium 130 is used for storing a computer program 131, and the computer program 131 is used for implementing a method for constructing a knowledge graph in the foregoing embodiment when being executed by a processor.
The computer-readable storage medium 130 may be a server, a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an example of the present application and is not intended to limit its scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and the drawings, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of protection of the present application.

Claims (11)

1. A method for constructing a knowledge graph, comprising:
acquiring a category mapping table between a directory title and a knowledge category of a document to be processed;
generating a knowledge point category based on the text data of the document to be processed and the category mapping table, wherein the knowledge point category is a knowledge category corresponding to the text data;
analyzing the text data to obtain analyzed data, wherein the analyzed data comprises entity attribute names and attribute values;
and generating a knowledge graph based on the knowledge point category, the entity attribute name and the attribute value.
2. The method for constructing a knowledge graph according to claim 1, wherein the entity attribute names include entity names and attribute names, and the step of parsing the text data to obtain parsed data includes:
analyzing the text data to obtain text title data and text content data;
analyzing the text title data to generate the entity name and the attribute name;
and analyzing the text content data to obtain the attribute value.
3. The method for constructing a knowledge graph according to claim 2, wherein the text title data includes a plurality of text titles, and the step of analyzing the text title data to generate the entity name and the attribute name includes:
analyzing the text title data to obtain the plurality of text titles;
and processing the plurality of text titles to generate the entity name and the attribute name.
4. The method of constructing a knowledge-graph of claim 3, wherein the plurality of text titles include a last-stage title and a penultimate-stage title, and wherein the step of processing the plurality of text titles to generate the entity name and the attribute name includes:
selecting a title to be processed from the last-stage title and the penultimate-stage title;
performing word segmentation processing on the title to be processed to obtain a word segmentation result;
and generating the entity name and the attribute name based on the word segmentation result.
5. The method for constructing a knowledge graph according to claim 4, wherein the word segmentation result comprises at least one title word, and the step of generating the entity name and the attribute name based on the word segmentation result comprises:
judging whether the title words exist in a preset labeled word library or not;
if so, determining the title words as the entity names, and determining the titles except the to-be-processed title in the last-stage title and the penultimate-stage title as the attribute names;
if not, determining the entity name and the attribute name based on the number and the part of speech of the title words.
6. The method for constructing a knowledge graph according to claim 5, wherein the step of determining the entity name and the attribute name based on the number and the part of speech of the title words comprises:
judging whether the number of the title words with the part of speech being a preset part of speech in the word segmentation result is a preset number or not;
if so, setting the title words as the entity names, and determining the titles except the to-be-processed title in the last-stage title and the penultimate-stage title as the attribute names;
if not, setting the title words as the attribute names, and determining the titles except the to-be-processed title in the last-stage title and the penultimate-stage title as the entity names.
7. The method of constructing a knowledge-graph of claim 5, wherein the method further comprises:
and when the title words do not exist in the preset labeled word library, storing the title words into the preset labeled word library so as to update the preset labeled word library.
8. The method for constructing a knowledge graph according to claim 3, wherein the step of generating the knowledge point category based on the text data of the document to be processed and the category mapping table comprises:
matching the text title with the category mapping table to obtain a knowledge category matched with the text title, and determining the knowledge category matched with the text title as the knowledge point category;
the step of generating a knowledge graph based on the knowledge point category, the entity attribute name, and the attribute value includes:
and establishing the corresponding relation among the knowledge point category, the entity name, the attribute name and the attribute value, and generating the knowledge graph.
9. The method of constructing a knowledge-graph of claim 1,
the directory titles include a top level title, and the category mapping table includes the top level title and a knowledge category corresponding to the top level title.
10. A knowledge-graph building apparatus comprising a memory and a processor connected to each other, wherein the memory is used for storing a computer program, and the computer program is used for implementing the knowledge-graph building method according to any one of claims 1 to 8 when being executed by the processor.
11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, is configured to implement the method of constructing a knowledge-graph of any one of claims 1-8.
CN202111536550.1A 2021-12-15 2021-12-15 Knowledge graph construction method and device and computer readable storage medium Active CN114238654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111536550.1A CN114238654B (en) 2021-12-15 2021-12-15 Knowledge graph construction method and device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111536550.1A CN114238654B (en) 2021-12-15 2021-12-15 Knowledge graph construction method and device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN114238654A (en) 2022-03-25
CN114238654B (en) 2024-10-29

Family

ID=80756656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111536550.1A Active CN114238654B (en) 2021-12-15 2021-12-15 Knowledge graph construction method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114238654B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115809311A (en) * 2022-12-22 2023-03-17 企查查科技有限公司 Data processing method and device of knowledge graph and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446337A (en) * 2018-09-19 2019-03-08 中国信息通信研究院 A kind of knowledge mapping construction method and device
CN113190687A (en) * 2021-05-08 2021-07-30 上海爱数信息技术股份有限公司 Knowledge graph determining method and device, computer equipment and storage medium
CN113407678A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Knowledge graph construction method, device and equipment
CN113553439A (en) * 2021-06-18 2021-10-26 杭州摸象大数据科技有限公司 Method and system for knowledge graph mining
WO2021226809A1 (en) * 2020-05-09 2021-11-18 北京中科院软件中心有限公司 Method and system for constructing knowledge map of manufacturing field


Also Published As

Publication number Publication date
CN114238654B (en) 2024-10-29

Similar Documents

Publication Publication Date Title
CN111259631B (en) Referee document structuring method and referee document structuring device
Hatzigeorgiu et al. Design and Implementation of the Online ILSP Greek Corpus.
WO2010038540A1 (en) System for extracting term from document containing text segment
US8577887B2 (en) Content grouping systems and methods
CN109101551B (en) Question-answer knowledge base construction method and device
CN112699677B (en) Event extraction method and device, electronic equipment and storage medium
Bjarnadóttir The database of modern Icelandic inflection (Beygingarlýsing íslensks nútímamáls)
US7398196B1 (en) Method and apparatus for summarizing multiple documents using a subsumption model
US20090019362A1 (en) Automatic Reusable Definitions Identification (Rdi) Method
Mayr et al. A user centered approach to requirements modeling
Pivk et al. From tables to frames
CN112667815A (en) Text processing method and device, computer readable storage medium and processor
US20120078950A1 (en) Techniques for Extracting Unstructured Data
CN112749272A (en) Intelligent new energy planning text recommendation method for unstructured data
CN113157888A (en) Multi-knowledge-source-supporting query response method and device and electronic equipment
Bontcheva et al. Using human language technology for automatic annotation and indexing of digital library content
CN114238654A (en) Method, device and computer-readable storage medium for constructing knowledge graph
KR102280028B1 (en) Method for managing contents based on chatbot using big-data and artificial intelligence and apparatus for the same
CN111274354B (en) Referee document structuring method and referee document structuring device
WO2022032685A1 (en) Method and device for constructing multi-level knowledge graph
CN118395970A (en) Document processing method and device based on natural language, computer equipment and storage medium
CN118446315A (en) Problem solving method, device, storage medium and computer program product
CN117216214A (en) Question and answer extraction generation method, device, equipment and medium
US8719693B2 (en) Method for storing localized XML document values
CN116108170A (en) Emergency plan text extraction method and system based on natural language processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant