CN115687655A

CN115687655A - PDF document-based knowledge graph construction method, system, equipment and storage medium

Info

Publication number: CN115687655A
Application number: CN202211418911.7A
Authority: CN
Inventors: 张明
Original assignee: Xinhua Zhiyun Technology Co ltd
Current assignee: Xinhua Fusion Media Technology Development Beijing Co ltd; Xinhua Zhiyun Technology Co ltd
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-02-03

Abstract

The application discloses a PDF document-based knowledge graph construction method, system, device and storage medium, relating to the technical field of data mining. The method comprises the following steps: splitting a PDF document into a plurality of first pictures according to page numbers, and performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts; splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the recognized titles; and performing entity recognition and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document from the obtained entities and entity relationships. The application thus provides a method for automatically constructing a knowledge graph from books in PDF format that preserves the chapter information of the book, so that graph analysis can be performed on specific chapters and the varied requirements of business scenarios can be met.

Description

PDF document-based knowledge graph construction method, system, equipment and storage medium

Technical Field

The present application relates to the field of data mining technologies, and in particular, to a method, a system, a device, and a storage medium for constructing a knowledge graph based on a PDF document.

Background

At the present stage, the source data for automatically generated knowledge graphs is mainly text data: structured knowledge is extracted from the text data, and a knowledge graph of that content is then generated.

Disclosure of Invention

The application provides a knowledge graph construction method based on a PDF document, which aims to solve the problem that the prior art cannot independently construct a knowledge graph for each chapter of an input document according to the hierarchical structure of the text.

To achieve this purpose, the application adopts the following technical solution:

A knowledge graph construction method based on a PDF document comprises the following steps:

splitting a PDF document into a plurality of first pictures according to page numbers, and carrying out optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts;

splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;

and performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.

Preferably, the splitting the PDF document into a plurality of first pictures according to page numbers includes:

splitting the PDF document into a plurality of sub-documents according to page numbers, and converting each sub-document into a picture;

numbering each picture according to the corresponding page number, and cutting out redundant information in each picture to generate a plurality of first pictures.

Preferably, the performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts includes:

performing optical character recognition on each first picture to obtain all text segments contained in each first picture, together with the position coordinates and pixel size of the picture region corresponding to each text segment;

comparing the pixel sizes of the text segment regions in all the first pictures, and taking the text segment whose region has the largest pixel size as a candidate chapter title;

determining the position of each candidate chapter title in its first picture according to the position coordinates of the corresponding region, and taking the candidate chapter titles that are centered in their line as final chapter titles;

all the text segments contained in each first picture are spliced, and the corresponding final chapter titles are marked in the spliced text segments to obtain a plurality of first plain texts.

Preferably, the step of splicing all the first plain texts to obtain a second plain text includes:

and manually checking each first plain text according to the PDF document, and sequentially splicing all checked first plain texts into a second plain text according to the page number.

Preferably, the performing entity identification and entity relationship extraction on each chapter text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships includes:

recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;

disambiguating each entity, and merging the entity and entity relationship in all chapter texts after disambiguation;

and inputting the merging result into a Neo4j system to construct a visual knowledge graph of the whole PDF document.

Preferably, the performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships further includes:

and disambiguating the entities in each chapter text, and respectively inputting the disambiguated entities and entity relationships of each chapter text into a Neo4j system to construct a visual knowledge graph of each chapter of the PDF document.

A PDF document based knowledge graph building system, comprising:

the identification module is used for splitting the PDF document into a plurality of first pictures according to the page number, and carrying out optical character identification and title identification on each first picture to obtain a plurality of first plain texts;

the dividing module is used for splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;

and the construction module is used for performing entity identification and entity relationship extraction on each chapter text and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and the obtained entity relationship.

Preferably, the dividing module includes:

the analysis unit is used for performing optical character recognition on each first picture to obtain all text segments contained in each first picture, together with the position coordinates and pixel size of the picture region corresponding to each text segment;

the selecting unit is used for comparing the pixel sizes of the text segment regions in all the first pictures and taking the text segment whose region has the largest pixel size as a candidate chapter title;

the determining unit is used for determining the position of each candidate chapter title in its first picture according to the position coordinates of the corresponding region, and taking the candidate chapter titles that are centered in their line as final chapter titles;

and the splicing unit is used for splicing all the text segments contained in each first picture and marking the corresponding final chapter titles in the text segments to obtain a plurality of first plain texts.

An electronic device comprising a memory and a processor, the memory storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement a PDF document-based knowledge graph construction method as recited in any one of the above.

A computer-readable storage medium storing a computer program which, when executed, causes a computer to implement a method of constructing a PDF document based knowledge graph as defined in any one of the above.

The invention has the following beneficial effects:

The application mainly provides a method for automatically constructing a knowledge graph from books in PDF format. On the one hand, it converts PDF files into plain text while preserving the chapter information of the book, which provides an important information source for the subsequent construction of a knowledge graph for each chapter. On the other hand, the text content can be automatically turned into a knowledge graph and displayed visually, which addresses the difficulty that arises in the many business scenarios where a knowledge graph is required but the data source is in PDF format. At the same time, graph analysis can be carried out at the granularity of a specific chapter, so that graphs are built on demand.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a PDF document-based knowledge graph construction method of the present application;

FIG. 2 is a schematic structural diagram of a PDF document-based knowledge graph building system according to the present application;

FIG. 3 is a schematic diagram of an electronic device implementing a PDF document-based knowledge graph construction method according to the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," and the like in the claims and in the description of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and merely describe a manner of distinguishing between similar elements in the embodiments of the present application. The terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Examples

As shown in fig. 1, the present embodiment provides a method for constructing a knowledge graph based on a PDF document, which includes the following steps:

S110, splitting a PDF document into a plurality of first pictures according to page numbers, and carrying out optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts;

S120, splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;

S130, performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.

PDF (Portable Document Format) is an electronic file format developed by Adobe. The format is independent of the operating system platform and works the same on Windows, Unix and Mac OS, which makes it an ideal format for publishing electronic documents and disseminating digital information on the Internet. In addition, because a PDF document is unlikely to show content errors or distortion when opened by different software, more and more electronic books are distributed in PDF format.

Specifically, a PDF document is divided into a plurality of sub-documents according to page numbers, and each sub-document is converted into a picture;

Books in PDF format are paginated by page number, each page is converted into a picture, and each picture is numbered with the corresponding page number of the PDF book so that the pictures are stored in order. This operation is mainly performed with the Python PyPDF2 module; PyPDF2 is a pure-Python PDF library that can split and merge PDF documents and is simple to use. Redundant information in each picture is then cut out as needed so that only the text area to be recognized is kept. Redundant information includes headers, footers and any other information that is useless for constructing the knowledge graph; the user judges what is redundant and saves the cropped pictures.
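The numbering and cropping conventions described above can be illustrated with a minimal sketch. It uses only the Python standard library and does not call PyPDF2 or any image library; the names `Box`, `page_image_name` and `text_area` are hypothetical, and the header and footer heights are assumed to be supplied by the user, who judges what is redundant.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Pixel rectangle (hypothetical helper type for this sketch)."""
    left: int
    top: int
    right: int
    bottom: int


def page_image_name(page_number: int, total_pages: int, ext: str = "png") -> str:
    """Zero-pad the page number so that file-name order equals page order."""
    width = len(str(total_pages))
    return f"page_{page_number:0{width}d}.{ext}"


def text_area(width: int, height: int, header_px: int, footer_px: int) -> Box:
    """Return the region that remains after cutting the header and footer bands.

    header_px and footer_px are user-chosen values (an assumption of this sketch).
    """
    if header_px < 0 or footer_px < 0 or header_px + footer_px >= height:
        raise ValueError("header and footer must leave a non-empty text area")
    return Box(0, header_px, width, height - footer_px)
```

For example, `text_area(1000, 1400, 100, 80)` describes the band between pixel rows 100 and 1320 of a 1000 x 1400 page picture, which could then be passed to whatever cropping tool is in use.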

Specifically, performing optical character recognition on each first picture to obtain all text segments contained in each first picture and position coordinates and pixels of the picture corresponding to each text segment;

Content recognition is performed on each cropped picture using Optical Character Recognition (OCR) to obtain all the text segments contained in each picture, together with the position coordinates and pixel size of the picture region corresponding to each text segment. The Python CnOCR module can be used for this; CnOCR is a character recognition toolkit for Python 3 that supports recognition of common simplified Chinese, traditional Chinese, English and numeric characters, as well as vertically arranged text. Titles are then marked. In general the character size of a title differs markedly from that of the body text, so titles are recognized from the character size and the layout of the characters in the picture. The specific flow is as follows:

1) The pixel information of all text segments is analyzed statistically to obtain the distribution of pixel sizes. In general, if a picture contains a title, body text, diagrams and similar information, the title has the largest font, followed by the body text and the diagram text; and normally, the larger the font size, the larger the pixel size. Based on this logic, the text segment with the largest font size is extracted as a candidate chapter title;

2) The position information of the text segment is analyzed. A title generally occupies a single line and is displayed centered, that is, in the middle of the line. The specific position of each candidate chapter title is therefore analyzed from the position coordinates of its picture region, and when that region lies in the central coordinate area of the line, the candidate chapter title is taken as a final chapter title and marked with the special symbol [h1].
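The two-step flow above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Segment` fields stand in for whatever an OCR tool returns, box height is used as a proxy for font size, the tolerances `size_tol` and `center_tol` are hypothetical parameters, and placing `[h1]` directly in front of the title text is an assumed convention.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    x0: float      # left edge of the segment's bounding box, in pixels
    x1: float      # right edge of the bounding box
    height: float  # box height, used here as a proxy for font size


def mark_titles(pages, page_width, size_tol=0.05, center_tol=0.1):
    """Return one plain text per page, with final chapter titles prefixed by [h1].

    pages is a list of pages, each a list of Segment objects in reading order.
    """
    # Step 1: the largest size is taken over all pages, not per page.
    max_height = max((s.height for page in pages for s in page), default=0.0)
    marked = []
    for page in pages:
        lines = []
        for s in page:
            is_largest = s.height >= max_height * (1 - size_tol)
            # Step 2: a title must also sit in the horizontal center of the line.
            offset = abs((s.x0 + s.x1) / 2 - page_width / 2)
            is_centered = offset <= page_width * center_tol
            lines.append(f"[h1]{s.text}" if is_largest and is_centered else s.text)
        marked.append("\n".join(lines))
    return marked
```

Taking the maximum over all pages means a page that contains only body text yields no title, even if one of its short lines happens to be centered.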

Specifically, each first plain text is manually checked according to the PDF document, and all checked first plain texts are sequentially spliced into a second plain text according to the page number.

OCR text recognition has a certain error rate, so a manual verification step is added: the recognized text is checked manually against the content of the PDF document. After the errors in the text are corrected, the texts recognized from the individual pictures are spliced in picture-number order to obtain a single plain text, that is, a text in txt format.

The generated txt text is then split according to the business scenario. In this embodiment the text is split by the chapters of the book, on the basis of the marked final chapter titles, that is, the chapter title symbol [h1]. After splitting, the single original txt file is converted into as many sub-files as there are chapters.
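The splicing and splitting steps can be sketched as below, assuming each checked page text is held in a dictionary keyed by page number and that each title line starts with the `[h1]` symbol. The function names and the choice to return `(title, body)` pairs rather than write sub-files are assumptions made for illustration.

```python
def splice_pages(page_texts):
    """Join the per-page texts in ascending page-number order."""
    return "\n".join(page_texts[n] for n in sorted(page_texts))


def split_chapters(full_text, marker="[h1]"):
    """Split the spliced text into (title, body) pairs at each marked title.

    Text that precedes the first title is returned with an empty title.
    """
    chapters = []
    title, body = "", []
    for line in full_text.splitlines():
        if line.startswith(marker):
            if title or body:
                chapters.append((title, "\n".join(body)))
            title, body = line[len(marker):].strip(), []
        else:
            body.append(line)
    if title or body:
        chapters.append((title, "\n".join(body)))
    return chapters
```

Because splitting happens after splicing, a chapter that runs across a page boundary stays in one piece.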

Specifically, entities in each chapter text are identified by using a named entity identification technology, and the relationship among the entities in each chapter text is determined according to a co-occurrence principle;

disambiguating each entity, and combining the entity and the entity relationship in all the chapter texts after disambiguation;

Or recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;

For each chapter text obtained by splitting, entities such as persons, organizations, regions and times are identified by named entity recognition (NER). The number of occurrences of each extracted entity is counted at the same time, as a basis for analyzing entity importance in subsequent services. Relationships between entities are extracted according to the co-occurrence principle, that is, from the number of times two entities appear together in the same sentence; the relationship is named "co-occurrence", and the number of occurrences of each relationship is likewise counted as a basis for analyzing the importance of entity relationships in subsequent services. The entities extracted from each chapter text are then disambiguated, which can be done using NLP techniques. All disambiguated entities and entity relationships are merged to obtain the details and statistics of all entities and relationships in the whole PDF book, which are stored in csv format and then imported into the Neo4j system for visual graph display. If the service requires visual knowledge graph analysis of the text of a particular chapter, the entity and relationship data of that chapter, as they were before merging, are imported into the Neo4j system instead.
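The counting, merging and csv steps described above can be sketched with the standard library alone. The sketch assumes the NER step has already produced, for each sentence, a list of entity names; disambiguation is reduced to a hypothetical alias table mapping variant names to a canonical name, which is only one of many possible techniques. The csv column names are also assumptions.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def count_chapter(sentences, aliases=None):
    """Count entity mentions and same-sentence co-occurrences for one chapter.

    sentences is a list of entity-name lists, one list per sentence.
    aliases maps variant names to canonical names (assumed disambiguation table).
    """
    aliases = aliases or {}
    entity_freq, pair_freq = Counter(), Counter()
    for names in sentences:
        canonical = [aliases.get(n, n) for n in names]
        entity_freq.update(canonical)
        # Each unordered pair is counted once per sentence in which it appears.
        for a, b in combinations(sorted(set(canonical)), 2):
            pair_freq[(a, b)] += 1
    return entity_freq, pair_freq


def merge_chapters(chapter_counts):
    """Sum per-chapter counters into whole-document counters."""
    entities, pairs = Counter(), Counter()
    for entity_freq, pair_freq in chapter_counts:
        entities.update(entity_freq)
        pairs.update(pair_freq)
    return entities, pairs


def relations_csv(pair_freq):
    """Serialize the relations as csv text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "relation", "count"])
    for (a, b), n in sorted(pair_freq.items()):
        writer.writerow([a, b, "co-occurrence", n])
    return buf.getvalue()
```

Keeping the per-chapter counters separate until `merge_chapters` is called leaves both options open: the merged result for the whole book, or a single chapter's counters for chapter-level analysis.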

This embodiment provides a method for converting a book in PDF format into a text file while retaining the information of each chapter through title recognition. On this basis, a knowledge graph is automatically constructed from the text content. The graph can be built at different granularities, for the whole book or for a single chapter, which meets the individualized requirements of business scenarios.
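One way to keep whole-book and single-chapter graphs apart when loading co-occurrence data into Neo4j is to tag each relationship with its scope. The sketch below only builds Cypher statements and their parameters; it does not connect to a database. The `Entity` label, the `CO_OCCURS` relationship type and the `scope` and `count` properties are hypothetical names chosen for illustration, not part of the described method.

```python
# Parameterized Cypher: MERGE creates the nodes and relationship only if absent.
CYPHER = (
    "MERGE (a:Entity {name: $source}) "
    "MERGE (b:Entity {name: $target}) "
    "MERGE (a)-[r:CO_OCCURS {scope: $scope}]->(b) "
    "SET r.count = $count"
)


def relation_statements(pair_freq, scope="book"):
    """Yield (query, parameters) pairs, one per co-occurrence relation.

    pair_freq maps (source, target) name pairs to co-occurrence counts.
    scope is "book" for merged data or a chapter identifier for one chapter.
    """
    for (source, target), count in sorted(pair_freq.items()):
        yield CYPHER, {
            "source": source,
            "target": target,
            "scope": scope,
            "count": count,
        }
```

Each yielded pair can be handed to whichever Neo4j client is in use; filtering on the `scope` property then selects either the whole-book graph or one chapter's graph.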

As shown in fig. 2, the present embodiment further provides a system for constructing a knowledge graph based on a PDF document, including:

and the construction module is used for performing entity identification and entity relationship extraction on each chapter of text and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.

One embodiment of the above system may be: the identification module divides the PDF document into a plurality of first pictures according to page numbers, and performs optical character identification and title identification on each first picture to obtain a plurality of first plain texts; the dividing module splices all the first plain texts to obtain second plain texts, and divides the second plain texts into a plurality of chapter texts according to the identified titles; and the construction module performs entity identification and entity relationship extraction on each chapter text, and constructs a knowledge graph corresponding to the PDF document according to the obtained entities and the entity relationship.

As shown in fig. 3, the present embodiment further provides an electronic device, which includes a memory 301 and a processor 302, where the memory 301 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 302 to implement a method for constructing a knowledge graph based on a PDF document as described above.

It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.

A computer-readable storage medium storing a computer program which, when executed by a computer, implements a method for constructing a knowledge-graph based on a PDF document as described above.

Illustratively, a computer program may be divided into one or more modules/units, one or more modules/units are stored in the memory 301 and executed by the processor 302, and the I/O interface transmission of data is performed by the input interface 305 and the output interface 306 to accomplish the present invention, and one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer program in the computer device.

The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer device may include, but is not limited to, the memory 301 and the processor 302, and those skilled in the art will appreciate that the present embodiment is only an example of the computer device, and does not constitute a limitation of the computer device, and may include more or less components, or combine some components, or different components, for example, the computer device may further include the input device 307, the network access device, the bus, and the like.

The processor 302 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor or the like.

The memory 301 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory 301 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash memory card (Flash Card) provided on the computer device. Further, the memory 301 may include both an internal storage unit and an external storage device of the computer device. The memory 301 is used for storing computer programs and other programs and data required by the computer device, and may also be used for temporary storage for the output unit 308. The aforementioned storage media include various media capable of storing program code, such as a USB disk, a removable hard disk, a ROM 303, a RAM 304, a magnetic disk and an optical disk.

The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A PDF document-based knowledge graph construction method is characterized by comprising the following steps:

and performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.

2. The method according to claim 1, wherein the splitting the PDF document into a plurality of first pictures according to page number comprises:

3. The method as claimed in claim 1, wherein the step of performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts comprises:

determining the position of each candidate chapter title in all first pictures according to the position coordinates of the picture corresponding to the candidate chapter title, and taking the candidate chapter title positioned in the middle of all first picture rows as a final chapter title;

4. The method of claim 1, wherein the step of splicing all the first plain texts to obtain a second plain text comprises:

and manually checking each first plain text according to the PDF document, and sequentially splicing all the checked first plain texts into a second plain text according to the page number.

5. The method as claimed in claim 1, wherein the step of performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships comprises:

6. The method according to claim 1, wherein the performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships, further comprises:

7. A PDF document-based knowledge graph construction system is characterized by comprising:

8. The PDF document-based knowledge graph building system of claim 7, wherein the partitioning module comprises:

the selecting unit is used for comparing pixels of the text segment pictures in all the first pictures and taking the text segment corresponding to the picture with the largest pixel as a candidate chapter title;

the determining unit is used for determining the positions of the candidate chapter titles in all the first pictures according to the position coordinates of the pictures corresponding to the candidate chapter titles, and taking the candidate chapter titles positioned in the middle of all the first picture rows as final chapter titles;

9. An electronic device, comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the PDF document-based knowledge graph construction method according to any one of claims 1 to 6.

10. A computer-readable storage medium storing a computer program, wherein the computer program is configured to cause a computer to execute a method for constructing a knowledge graph based on a PDF document according to any one of claims 1 to 6.