Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be noted that the following examples are only illustrative of the present application, and do not limit the scope of the present application. Likewise, the following examples are only some examples and not all examples of the present application, and all other examples obtained by a person of ordinary skill in the art without any inventive work are within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
It should be noted that the terms "first", "second" and "third" in the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying the number of the indicated technical features. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless expressly limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
First, the construction of a knowledge graph is introduced. A knowledge graph for a specific field (such as automobile knowledge or medical knowledge) usually has a strict knowledge structure, and its construction process comprises ontology modeling and entity knowledge extraction. In ontology modeling, a domain expert organizes and designs the knowledge structure. In the subsequent entity knowledge extraction stage, source data from different sources and in various forms is converted into structured knowledge through a corresponding technical scheme. In the entity knowledge extraction stage, for application scenarios with a small amount of knowledge content, the knowledge can be sorted or imported entirely by hand; for scenarios such as automobiles or medicine, however, building the knowledge graph entirely by hand is time-consuming and laborious.
For unstructured or semi-structured data (such as product usage specifications), knowledge extraction mainly depends on human-machine coupling, i.e., manual labeling of data, machine training of a knowledge extraction model, and machine-assisted extraction combined with manual validation. Specifically, a list of documents in the same domain is collected; taking the knowledge field of automobile parts as an example, the document list consists of the product use instructions of each major vehicle series. The document list is uploaded to a labeling platform, and a labeling task is created and distributed to labeling personnel. The labeling process uses the knowledge structure labels defined in the knowledge platform. A training data set is generated from the labeling results and input to the training platform; the training platform creates a training task, loads the training data set, and generates a knowledge extraction model. The machine then runs the knowledge extraction model to extract the knowledge points from a document to be extracted (such as the use specification of a specific vehicle model). Finally, the extracted results must be confirmed by an auditor before being stored.
Compared with the traditional scheme that depends entirely on manual extraction, the scheme combining manual labeling and machine-assisted extraction achieves a certain degree of automation; however, business practice shows that the effect of machine-assisted extraction depends on the size and quality of the training data set. On the one hand, the more data used to train the knowledge extraction model, the better the model performs and the more accurate its results; to ensure the effect and accuracy of the model, more training data must therefore be labeled, which consumes more manpower. On the other hand, the subjectivity of manual participation in the early labeling process affects the quality of the training data set, and if that quality is not high, the effect of the knowledge extraction model and the accuracy of machine-assisted extraction are poor. Therefore, once the amount of data labeled at the early stage is insufficient or the quality of the training data set is not high, the accuracy of the model is low, and a large amount of manpower must still be invested in subsequent auditing and confirmation to ensure the quality of the stored knowledge.
In some application scenarios, this scheme even has a higher input cost and a poorer effect than a scheme depending entirely on manual extraction. For example, in the application scenario of extracting the contents of an automobile use instruction, the aim is to accurately extract and store the knowledge in an instruction manual (of, for example, 400 pages). With fully manual extraction, the work takes 3 to 5 persons 2 months. If the above human-machine coupling scheme is adopted, specifications of at least the same length must first be labeled, that is, the manpower used in labeling the training data set already equals the manpower for fully manual extraction; then, in the model training stage, research and development personnel must be invested to develop and optimize the model; finally, in the audit and storage stage, an auditor must participate in the audit. The overall manpower investment of this scheme is therefore clearly higher than that of the fully manual extraction scheme. In addition, expensive Graphics Processing Unit (GPU) server resources are consumed for model training.
Based on the above problem, the present application provides an extraction scheme based on a document structure, and the following describes in detail the technical scheme adopted in the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a method for constructing a knowledge graph provided by the present application, the method including:
S11: Acquire a category mapping table between the directory titles of the document to be processed and knowledge categories.
The document to be processed can be obtained from a document database, or downloaded from the internet, or received from other devices, and the document to be processed can be an unstructured document with a catalog, such as a product specification, a paper, or a book.
Further, the knowledge categories are mainly derived from the chapter titles in the document to be processed. To facilitate the construction of knowledge categories that are unified and standardized across the domain, the chapter titles are not adopted directly as knowledge categories; instead, a mapping from chapter titles to knowledge categories is defined and established in advance. Specifically, for the obtained document to be processed, the directory structure is analyzed manually, a mapping relationship between each chapter in the directory and a knowledge category is established, and a category mapping table is generated for subsequent use. The directory titles include top-level titles, and the category mapping table includes each top-level title and the knowledge category corresponding to it.
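As a minimal sketch of what such a category mapping table might look like in practice, the table can be held as a simple lookup from top-level directory title to knowledge category. The chapter names and category labels below are illustrative assumptions, as are the helper names `normalize` and `lookup_category`; the application does not prescribe a storage format.

```python
# Hypothetical category mapping table: top-level directory title -> knowledge category.
# Titles and categories are illustrative only; in the scheme they are defined manually.
CATEGORY_MAPPING = {
    "instrument cluster": "parts",
    "operation components": "parts",
    "driving": "driving methods",
}


def normalize(title: str) -> str:
    """Lower-case the title and collapse whitespace so lookups tolerate formatting noise."""
    return " ".join(title.lower().split())


def lookup_category(directory_title: str):
    """Return the knowledge category for a top-level title, or None if it is unmapped."""
    return CATEGORY_MAPPING.get(normalize(directory_title))
```

Several chapters may map to the same category, which is why the table is keyed by title rather than by category.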
S12: Generate a knowledge point category based on the text data of the document to be processed and the category mapping table.
After the category mapping table is obtained, the text data of the document to be processed is processed together with the category mapping table to generate a knowledge point category, where the knowledge point category is the knowledge category corresponding to the text data. Specifically, the document to be processed is parsed to obtain its text data, which includes characters, formulas, or pictures; the text data is then processed by a document processing method in the related art to obtain the titles in the text data (namely, text titles); each text title is then matched against the category mapping table, that is, the text title is used as a keyword to search the category mapping table for a directory title similar to the text title, and if one exists, the knowledge category corresponding to that directory title is taken as the knowledge point category.
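The similarity search described above could be sketched as follows. The application does not specify a similarity measure; this example assumes the ratio computed by Python's standard `difflib` module, and the function name `match_knowledge_category` and the 0.8 cutoff are hypothetical choices.

```python
from difflib import get_close_matches


def match_knowledge_category(text_title, mapping, cutoff=0.8):
    """Find the directory title most similar to text_title and return its category.

    mapping: dict of directory title -> knowledge category.
    cutoff:  minimum difflib similarity ratio (an assumed threshold).
    Returns None when no directory title is similar enough.
    """
    # Compare case-insensitively while remembering the original keys.
    lowered = {key.lower(): key for key in mapping}
    hits = get_close_matches(text_title.lower(), list(lowered), n=1, cutoff=cutoff)
    return mapping[lowered[hits[0]]] if hits else None
```

A text title that differs from the directory title only slightly (for example in plural form or capitalization) still resolves to the same category, while an unrelated title yields no category and can be left for manual review.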
S13: Parse the text data to obtain parsed data.
After the text data is obtained, it can be parsed by a document parsing method in the related art to generate parsed data. The parsed data includes entity attribute names and attribute values corresponding to the knowledge point categories; an entity attribute name is the name of a knowledge point, and an attribute value is the attribute information of that knowledge point.
S14: Generate the knowledge graph based on the knowledge point category, the entity attribute name, and the attribute value.
After the knowledge point category, the entity attribute name, and the attribute value corresponding to each knowledge point are obtained, a correspondence among them can be established to obtain the knowledge graph. Alternatively, they can be stored in a document, or a correspondence among them can be established to obtain a knowledge correspondence table for subsequent operations, such as updating or modifying the content of the table.
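One possible shape for the knowledge correspondence table, given only as a sketch, is a nested lookup from knowledge point category to entity attribute name to attribute value. The function names `build_table` and `update_value` and the sample values in the usage note are assumptions made for illustration.

```python
def build_table(points):
    """Build a correspondence table from (category, entity_attribute_name, value) tuples."""
    table = {}
    for category, name, value in points:
        # Group knowledge points by category, keyed by their entity attribute name.
        table.setdefault(category, {})[name] = value
    return table


def update_value(table, category, name, new_value):
    """Modify the attribute value of an existing knowledge point."""
    if name not in table.get(category, {}):
        raise KeyError(f"unknown knowledge point: {category!r} / {name!r}")
    table[category][name] = new_value
```

For instance, `build_table([("parts", "steering wheel adjustment", "old text")])` followed by `update_value(table, "parts", "steering wheel adjustment", "new text")` replaces the stored value in place, which corresponds to the later updating or modifying of the table content.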
Understandably, after the knowledge point categories, the entity attribute names and the attribute values are obtained, the knowledge information can be verified manually, so that the accuracy of the finally constructed knowledge graph is ensured.
This embodiment provides a knowledge extraction scheme based on document structure, suitable for knowledge graph construction scenarios. A mapping relationship between the directory of the document to be processed and the knowledge categories is constructed manually to obtain a category mapping table. A machine (namely, a knowledge graph constructing apparatus) then acquires the category mapping table and, using the category mapping table and the text data of the document to be processed together with a document parsing method in the related art, obtains the knowledge point category, entity attribute name, and attribute value corresponding to each knowledge point in the document, thereby constructing a knowledge graph covering each knowledge point. Because no knowledge extraction model needs to be trained, no training data set needs to be acquired and no large-scale manual labeling is required. Although the scheme still involves manual work, its labor cost is lower than that of a scheme using a knowledge extraction model: manual work is needed only to construct the directory-related category mapping table and in the verification stage. An unstructured document can thus be efficiently built into a knowledge graph, and the efficiency and accuracy of knowledge extraction can be improved.
Referring to fig. 2, fig. 2 is a schematic flow chart of another embodiment of a method for constructing a knowledge graph provided by the present application, the method including:
S21: Acquire a category mapping table between the directory titles of the document to be processed and knowledge categories.
S21 is the same as S11 in the previous embodiment, and is not repeated here.
S22: Parse the text data to obtain text title data and text content data.
After the text data of the document to be processed is obtained, it can be classified by a document processing method in the related art into two parts: text title data and text content data. The text title data are the titles appearing in the text of the document to be processed, and the text content data are the contents of the text other than the titles. By parsing the complete document structure, data corresponding to the titles at each level of the text (namely, the text title data) and data corresponding to the text content (namely, the text content data) are finally generated.
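The split into title data and content data can be illustrated with a small sketch. It assumes, purely for illustration, that the text has already been reduced to plain lines and that titles carry dotted section numbers such as "3.1.1"; real manuals would need a layout-aware parser, and a content line that happens to start with a number would be misread as a title here. The function name `split_titles_and_content` is hypothetical.

```python
import re

# Assumed title format: a dotted section number, whitespace, then the title text.
TITLE_PATTERN = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")


def split_titles_and_content(lines):
    """Separate lines into titles and the content found under each title.

    Returns (titles, content):
      titles:  list of (level, title) in document order
      content: dict mapping a title's index in `titles` to its content lines
    """
    titles = []
    content = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TITLE_PATTERN.match(line)
        if match:
            level = match.group(1).count(".") + 1  # "3.1.1" -> level 3
            titles.append((level, match.group(2)))
            content[len(titles) - 1] = []
        elif titles:
            # A non-title line belongs to the most recent title.
            content[len(titles) - 1].append(line)
    return titles, content
```

Keeping the level with each title preserves the hierarchy, so the last-level and penultimate-level titles can be identified afterwards.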
S23: Match the text title against the category mapping table to obtain the knowledge category matching the text title, and determine that knowledge category as the knowledge point category.
The text title data includes a plurality of text titles. After the text title data is obtained, the knowledge point category can be generated from the mapping relationship between knowledge categories and directory titles (namely, the category mapping table) and the title of each chapter in the text of the document to be processed (namely, the text title): it is determined whether a directory title identical to the text title exists in the category mapping table, and if so, the knowledge category corresponding to that directory title is the knowledge point category.
Further, the plurality of text titles includes a first-level title, which is the title that appears first in a given chapter; the first-level title can be matched against the category mapping table to obtain the knowledge point category corresponding to it.
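A brief sketch of the first-level matching follows. It uses exact matching against the category mapping table and, as an assumption not stated in the text, lets the lower-level titles of a chapter inherit the category found for that chapter's first-level title. The function name `assign_categories` and the sample titles in the usage note are hypothetical.

```python
def assign_categories(titles, mapping):
    """Assign a knowledge point category to every title.

    titles:  list of (level, title) in document order, level 1 being a chapter title
    mapping: dict of directory title -> knowledge category
    Returns a list of categories aligned with `titles` (None when unmapped).
    """
    current = None
    categories = []
    for level, title in titles:
        if level == 1:
            # Exact match of the first-level title against the mapping table.
            current = mapping.get(title)
        # Assumption: sub-titles inherit the category of their chapter.
        categories.append(current)
    return categories
```

With `titles = [(1, "Instrument cluster"), (2, "Warning lamps"), (1, "Audio system")]` and a mapping containing only "Instrument cluster", the first two titles receive that chapter's category and the third receives `None`, flagging an unmapped chapter.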
S24: Parse the text title data to generate an entity name and an attribute name.
The entity attribute name includes an entity name and an attribute name, which can be parsed from the last-level title and the penultimate-level title (i.e., the title one level above the last-level title). The text title data can be parsed to obtain a plurality of text titles, which are then processed to generate the entity name and the attribute name. Specifically, the plurality of text titles includes a last-level title and a penultimate-level title; the last-level title can be parsed to obtain the entity name and the penultimate-level title to obtain the attribute name, or the last-level title can be parsed to obtain the attribute name and the penultimate-level title to obtain the entity name.
Further, although in some schemes the last-level title is parsed as the entity name and the penultimate-level title as the attribute name, this is not absolute. For example, as shown in FIG. 3, the last-level title "steering wheel adjustment" should be parsed as an attribute name and the penultimate-level title "steering wheel" as a specific entity name. The parsing of entity names and attribute names therefore needs to be further clarified and refined; specific implementations are described below.
The processing follows the scheme shown in fig. 4, which specifically comprises the following steps:
S41: Select the title to be processed from the last-level title and the penultimate-level title.
Taking a product specification as the document to be processed, for a specification-type document a title that can serve as an entity name should be a noun, and a title that can serve as an attribute name should be a verb or a combination of multiple words (containing at least one verb). The entity name and the attribute name can therefore be determined by segmenting the text title into words and then examining the parts of speech and the number of words in the segmentation result.
Further, either the last-level title or the penultimate-level title can be selected for word segmentation, and whether the other title is an entity name or an attribute name is determined based on the word segmentation result.
S42: Perform word segmentation on the title to be processed to obtain a word segmentation result.
The title to be processed is split by a word segmentation method in the related art to generate a word segmentation result, which includes at least one title word; further, the word segmentation result includes the number of title words and the part-of-speech information of each title word.
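To make the shape of a word segmentation result concrete, the sketch below uses forward maximum matching over a tiny hand-written vocabulary with part-of-speech tags. This is a stand-in, not the segmentation method of the application: a production system would use an established segmenter, and the vocabulary `TOY_VOCAB`, the tag "n" for noun, the tag "x" for unknown words, and the function name `segment` are all assumptions.

```python
from collections import Counter

# Hypothetical vocabulary: known term -> part-of-speech tag ("n" = noun).
TOY_VOCAB = {
    "steering wheel": "n",
    "driver seat": "n",
    "adjustment": "n",
}


def segment(title, vocab, max_words=3):
    """Forward maximum matching: at each position take the longest known term.

    Returns a list of (title_word, part_of_speech); unknown single words get "x".
    """
    words = title.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, falling back to a single word.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in vocab or n == 1:
                result.append((phrase, vocab.get(phrase, "x")))
                i += n
                break
    return result


segmented = segment("Steering wheel adjustment", TOY_VOCAB)
pos_counts = Counter(pos for _, pos in segmented)  # number of title words per part of speech
```

Here `segmented` holds two title words, both tagged as nouns, so the result carries both the count and the part-of-speech information needed by the following step.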
S43: Generate the entity name and the attribute name based on the word segmentation result.
In the finally constructed knowledge graph, an entity name should be a general, standardized term in the field, such as a part in the automobile field or a drug name in the medical field. A word library of entity names (i.e., a preset labeled word library) can therefore be established, which is common within a given field; for example, it is applicable to the use specifications of vehicle models of different brands. In extracting the entity name, the words in the preset labeled word library can be consulted, and the word segmentation result is processed by the scheme shown in fig. 5, which specifically comprises the following steps:
S51: Determine whether the title word exists in the preset labeled word library.
The preset labeled word library comprises a plurality of entity names.
S52: If the title word exists in the preset labeled word library, determine the title word as the entity name, and determine the title other than the title to be processed, among the last-level title and the penultimate-level title, as the attribute name.
If an entity name identical to the title word is detected in the preset labeled word library, the title word is extracted as the entity name. In this case, the other title (that is, whichever of the last-level title and the penultimate-level title is not the title to be processed) is determined as the attribute name, or a keyword in that other title can be extracted as the attribute name.
S53: If the title word does not exist in the preset labeled word library, determine the entity name and the attribute name based on the number and the part of speech of the title words.
It is determined whether the number of title words in the word segmentation result whose part of speech is a preset part of speech equals a preset number. If so, the title word is set as the entity name, and the title other than the title to be processed, among the last-level title and the penultimate-level title, is determined as the attribute name. If not, the title words are set as the attribute name, and the title other than the title to be processed is determined as the entity name. Specifically, the preset part of speech is noun and the preset number is 1; that is, if the word segmentation result contains only one noun, that noun is extracted as the entity name; otherwise, the title words are extracted as the attribute name.
Further, when the title word does not exist in the preset labeled word library, the title word is stored in the preset labeled word library to update it; that is, the extracted entity names further enrich and expand the preset labeled word library.
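Steps S51 to S53 can be sketched as a single decision function. The sketch takes an already segmented title as a list of (word, part of speech) pairs, with "n" assumed to mark nouns. It reads the library update as applying to the entity name extracted by the noun rule, which is one interpretation of the text; the function name `resolve_names` and the word "headlight" in the usage note are hypothetical.

```python
def resolve_names(segmented, pending_title, other_title, word_library):
    """Decide which title supplies the entity name and which the attribute name.

    segmented:     list of (title_word, part_of_speech) for the title being processed
    pending_title: the title that was segmented
    other_title:   the other one of the last-level / penultimate-level titles
    word_library:  set of known entity names (the preset labeled word library)
    Returns (entity_name, attribute_name).
    """
    # S51 / S52: a title word found in the library is taken as the entity name.
    for word, _ in segmented:
        if word in word_library:
            return word, other_title

    # S53: otherwise decide by the number of nouns (preset part of speech: noun,
    # preset number: 1).
    nouns = [word for word, pos in segmented if pos == "n"]
    if len(nouns) == 1:
        entity = nouns[0]
        word_library.add(entity)  # enrich the library with the new entity name
        return entity, other_title

    # Not exactly one noun: the processed title is the attribute name and the
    # other title supplies the entity name.
    return other_title, pending_title
```

For the steering wheel case, calling the function with the segmented last-level title `[("steering wheel", "n"), ("adjustment", "n")]` and the penultimate-level title "steering wheel" returns `("steering wheel", "steering wheel adjustment")`: two nouns are present, so the processed title becomes the attribute name. A single unseen noun such as "headlight" would instead be returned as the entity name and added to the library.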
S25: Parse the text content data to obtain an attribute value.
The text content data is parsed, and the text content under the last-level title is parsed as the attribute value, so that the text content data is extracted as attribute values.
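A sketch of this step is given below, assuming the titles are available as (level, title) pairs in document order and the content lines are keyed by title index. A title is treated as last-level when no deeper title follows it directly; this criterion, the function name `attribute_values`, and joining content lines with newlines are assumptions.

```python
def attribute_values(titles, content):
    """Collect the content under each last-level title as its attribute value.

    titles:  list of (level, title) in document order
    content: dict mapping a title's index in `titles` to its content lines
    Returns a list of (title, attribute_value) for last-level titles only.
    """
    values = []
    for i, (level, title) in enumerate(titles):
        # A title is last-level when it is the final title or the next title
        # is not nested more deeply.
        is_last_level = i + 1 == len(titles) or titles[i + 1][0] <= level
        if is_last_level:
            values.append((title, "\n".join(content.get(i, []))))
    return values
```

Returning a list rather than a dictionary keeps repeated titles (for example, several sections all named "adjustment step") from overwriting one another.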
S26: Establish the correspondence among the knowledge point category, the entity name, the attribute name, and the attribute value to generate the knowledge graph.
The above scheme completes the extraction of the entity name, the attribute value, and the knowledge category to which they belong, and the knowledge graph can be constructed from the extracted knowledge information.
In a specific embodiment, to better explain the technical solution adopted in this embodiment, a product specification is taken as an example.
A product manual is generally organized by a directory. Taking an automobile manual as an example, it includes chapters such as "safety notice", "instrument cluster", "operation components", "driving", "audio system", and "in-vehicle equipment", as shown in fig. 6.
Knowledge categories may be defined based on the chapter titles, such as "parts" or "driving methods". According to the result of organizing the knowledge system, knowledge categories and chapters do not necessarily correspond one to one. For example, as shown in fig. 6 and 7, the contents of the two chapters "instrument cluster" and "operation components" can both be extracted as knowledge of the category "parts", where chapter denotes the chapter and concept denotes the knowledge category. A mapping from chapters to knowledge categories therefore needs to be established.
The content of a specific chapter is also hierarchically structured and can include first-level titles, second-level titles, and so on; for example, fig. 8 shows a content page of the "adjust seat-front seat" section as shown in fig. 9, and the hierarchical structure obtained by parsing is shown in fig. 10. The specific adjustment steps of the driver seat are mainly described by the specific content information (i.e., the step description combining images and text in fig. 9), and the final expected extraction result is shown in fig. 11, where concept is the knowledge category, entity is the entity name, property is the attribute name, and value is the attribute value.
As shown in fig. 9 to 11, the last-level title "driver seat" is parsed as the entity name, and the title one level above it (i.e., the penultimate-level title) "adjustment step" is parsed as the attribute name. The two intermediate levels of titles can be handled according to the knowledge system to be finally constructed; for example, "front row seats" can be parsed as a subtype of "parts" but is not a core knowledge point that needs to be extracted. At this point, the key elements of knowledge extraction have been extracted: knowledge category, entity name, attribute name, and attribute value.
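The four extracted elements can be assembled into a record with the field names concept, entity, property, and value, and then expressed as graph triples. The following is only a sketch: the attribute value is left as a placeholder because the step description itself is in the figures, and the predicate name "category" and the function names `make_record` and `to_triples` are assumptions.

```python
def make_record(concept, entity, prop, value):
    """Bundle the four extracted elements under the field names used in the example."""
    return {"concept": concept, "entity": entity, "property": prop, "value": value}


def to_triples(record):
    """Express one record as (subject, predicate, object) triples of a graph."""
    entity = record["entity"]
    return [
        # Link the entity to its knowledge category (predicate name is assumed).
        (entity, "category", record["concept"]),
        # Link the entity to its attribute value through the attribute name.
        (entity, record["property"], record["value"]),
    ]


record = make_record(
    concept="parts",
    entity="driver seat",
    prop="adjustment step",
    value="(step description from the content page)",  # placeholder, not real content
)
triples = to_triples(record)
```

Each record thus contributes one edge from the entity to its knowledge category and one edge from the entity to its attribute value, which can be loaded into whatever graph store is used.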
This embodiment uses the structural characteristics of the document to be processed, examines the number and parts of speech of the words in the segmentation result of the last-level or penultimate-level title, and combines these with a pre-established labeled word library of entity names to extract the knowledge information (including knowledge category, entity name, attribute name, and attribute value) in the document to be processed. With the knowledge extraction result of this scheme, the knowledge graph of the document to be processed can be constructed with only a small amount of manual verification, saving a large amount of labor and machine training cost.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an embodiment of a knowledge graph constructing apparatus provided in the present application, the knowledge graph constructing apparatus 120 includes a memory 121 and a processor 122 connected to each other, the memory 121 is used for storing a computer program, and the computer program is used for implementing the method for constructing the knowledge graph in the foregoing embodiment when being executed by the processor 122.
Referring to fig. 13, fig. 13 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application, where the computer-readable storage medium 130 is used for storing a computer program 131, and the computer program 131 is used for implementing a method for constructing a knowledge graph in the foregoing embodiment when being executed by a processor.
The computer-readable storage medium 130 may be a server, a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative: the division into modules or units is merely a logical division, and other divisions are possible in an actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.