
CN115687655A - PDF document-based knowledge graph construction method, system, equipment and storage medium - Google Patents

PDF document-based knowledge graph construction method, system, equipment and storage medium

Info

Publication number
CN115687655A
Authority
CN
China
Prior art keywords
text
chapter
picture
entity
pdf document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211418911.7A
Other languages
Chinese (zh)
Inventor
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Fusion Media Technology Development Beijing Co ltd
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd filed Critical Xinhua Zhiyun Technology Co ltd
Priority to CN202211418911.7A priority Critical patent/CN115687655A/en
Publication of CN115687655A publication Critical patent/CN115687655A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a PDF document-based knowledge graph construction method, system, equipment and storage medium, relating to the technical field of data mining. The method comprises the following steps: splitting a PDF document into a plurality of first pictures according to page numbers, and performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts; splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the recognized titles; and performing entity recognition and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document from the obtained entities and entity relationships. The application provides a method for automatically constructing a knowledge graph from books in PDF format that preserves the chapter information of the book, so that graph analysis can be performed on specific chapters and the diverse requirements of business scenarios can be met.

Description

PDF document-based knowledge graph construction method, system, equipment and storage medium
Technical Field
The present application relates to the field of data mining technologies, and in particular, to a method, a system, a device, and a storage medium for constructing a knowledge graph based on a PDF document.
Background
At the present stage, the raw data used for automatic knowledge graph generation is mainly text data: structured knowledge is extracted from the text data, and a knowledge graph of that data content is then generated.
Disclosure of Invention
The application provides a knowledge graph construction method based on a PDF document, which aims to solve the problem that the prior art can not independently construct the knowledge graph of each chapter in an input document according to a text hierarchical structure.
In order to achieve the purpose, the following technical scheme is adopted in the application:
The knowledge graph construction method based on the PDF document comprises the following steps:
splitting a PDF document into a plurality of first pictures according to page numbers, and performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts;
splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
and performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.
Preferably, the splitting the PDF document into a plurality of first pictures according to page numbers includes:
splitting a PDF document into a plurality of sub-documents according to page numbers, and converting each sub-document into a picture;
numbering each picture according to the corresponding page number, and cutting out redundant information in each picture to generate a plurality of first pictures.
Preferably, the performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts includes:
performing optical character recognition on each first picture to obtain all text segments contained in each first picture, together with the position coordinates and pixel size of the picture corresponding to each text segment;
comparing the pixel sizes of the text segment pictures in all the first pictures, and taking the text segment corresponding to the picture with the largest pixel size as a candidate chapter title;
determining the positions of the candidate chapter titles in the first pictures according to the position coordinates of the pictures corresponding to the candidate chapter titles, and taking the candidate chapter titles that are centered in their line of the first picture as final chapter titles;
all the text segments contained in each first picture are spliced, and the corresponding final chapter titles are marked in the spliced text segments to obtain a plurality of first plain texts.
Preferably, the step of splicing all the first plain texts to obtain a second plain text includes:
and manually checking each first plain text according to the PDF document, and sequentially splicing all checked first plain texts into a second plain text according to the page number.
Preferably, the performing entity identification and entity relationship extraction on each chapter text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships includes:
recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;
disambiguating each entity, and merging the entity and entity relationship in all chapter texts after disambiguation;
and inputting the merging result into a Neo4j system to construct a visual knowledge graph of the whole PDF document.
Preferably, the performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships further includes:
recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;
and disambiguating the entities in each chapter text, and respectively inputting the disambiguated entities and entity relationships of each chapter text into a Neo4j system to construct a visual knowledge graph of each chapter of the PDF document.
A PDF document based knowledge graph building system, comprising:
the identification module is used for splitting the PDF document into a plurality of first pictures according to the page number, and carrying out optical character identification and title identification on each first picture to obtain a plurality of first plain texts;
the dividing module is used for splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
and the construction module is used for performing entity identification and entity relationship extraction on each chapter text and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and the obtained entity relationship.
Preferably, the dividing module includes:
the analysis unit is used for carrying out optical character recognition on each first picture to obtain all text segments contained in each first picture and position coordinates and pixels of the picture corresponding to each text segment;
the selecting unit is used for comparing the pixels of the text segment pictures in all the first pictures and taking the text segment corresponding to the picture with the largest pixel as a candidate chapter title;
the determining unit is used for determining the positions of the candidate chapter titles in all the first pictures according to the position coordinates of the pictures corresponding to the candidate chapter titles, and taking the candidate chapter titles positioned in the middle of all the first picture lines as final chapter titles;
and the splicing unit is used for splicing all the text segments contained in each first picture and marking the corresponding final chapter titles in the text segments to obtain a plurality of first plain texts.
An electronic device comprising a memory and a processor, the memory storing one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement a PDF document-based knowledge graph construction method as recited in any one of the above.
A computer-readable storage medium storing a computer program which, when executed, causes a computer to implement a method of constructing a PDF document based knowledge graph as defined in any one of the above.
The invention has the following beneficial effects:
The application mainly provides a method for automatically constructing a knowledge graph from books in PDF format. On the one hand, it provides a way to convert a PDF file into plain text while preserving the chapter information of the book, which supplies an important information source for the subsequent construction of each chapter's knowledge graph. On the other hand, the text content can be automatically turned into a knowledge graph and displayed visually, which addresses the difficulty, in the many business scenarios that require knowledge graphs, of having data sources in PDF format; at the same time, graph analysis can be performed at the granularity of a specific chapter, so that graphs are built on demand.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments of the present application, and for those skilled in the art, other drawings may be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a PDF document-based knowledge graph construction method of the present application;
FIG. 2 is a schematic structural diagram of a PDF document-based knowledge graph building system according to the present application;
FIG. 3 is a schematic diagram of an electronic device implementing a PDF document-based knowledge graph construction method according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the claims and in the description of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely a way of distinguishing between similar elements in the embodiments of the present application. The terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Examples
As shown in fig. 1, the present embodiment provides a method for constructing a knowledge graph based on a PDF document, which includes the following steps:
S110, splitting a PDF document into a plurality of first pictures according to page numbers, and performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts;
S120, splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
S130, performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.
PDF (Portable Document Format) is an electronic file format developed by Adobe. The format is independent of the operating system platform and works the same on Windows, Unix, and Mac OS, which makes it an ideal format for electronic document publishing and digital information distribution on the Internet. In addition, because the same PDF document is unlikely to show content errors or layout distortion when opened by different software, more and more electronic books are distributed in PDF format.
Specifically, a PDF document is divided into a plurality of sub-documents according to page numbers, and each sub-document is converted into a picture;
numbering each picture according to the corresponding page number, and cutting out redundant information in each picture to generate a plurality of first pictures.
A book in PDF format is paged according to page numbers, each page is converted into a picture, and the pictures are numbered according to the page numbers of the PDF book so that they can be stored in order. This operation is mainly performed with the Python PyPDF2 module; PyPDF2 is a pure-Python PDF library that can split and merge PDF documents and is simple to use. Redundant information in each picture is then cut out as needed so that only the text area to be recognized is kept. The redundant information includes headers, footers, and other information that is useless for constructing the knowledge graph; the user judges what is redundant and saves the cropped pictures.
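As a rough illustration of the paging and numbering step, the sketch below splits a PDF into single-page sub-documents with PyPDF2 and derives zero-padded file names from the page numbers. The helper names (`page_filename`, `crop_box`, `split_pdf`), the `page_` prefix, and the 10% header and footer bands are assumptions made for this example, not values from the application. Rendering each sub-document to a picture needs a separate rendering tool and is not shown.

```python
from pathlib import Path


def page_filename(page_number: int, total_pages: int, suffix: str = ".pdf") -> str:
    """Zero-padded name so that files sort in page order (naming scheme is hypothetical)."""
    width = len(str(total_pages))
    return f"page_{page_number:0{width}d}{suffix}"


def crop_box(width: int, height: int, header_frac: float = 0.1, footer_frac: float = 0.1):
    """Return (left, top, right, bottom) of the text area, origin at the top-left.

    The header/footer fractions are illustrative defaults; the application
    leaves the choice of what is redundant to the user.
    """
    top = round(height * header_frac)
    bottom = height - round(height * footer_frac)
    return (0, top, width, bottom)


def split_pdf(pdf_path: str, out_dir: str) -> list:
    """Write one single-page PDF per page and return the paths in page order."""
    # Imported lazily so the pure helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    total = len(reader.pages)
    paths = []
    for number, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        target = out / page_filename(number, total)
        with open(target, "wb") as handle:
            writer.write(handle)
        paths.append(target)
    return paths
```

The tuple returned by `crop_box` follows the left, top, right, bottom convention used by common imaging libraries, so it could be handed to whatever cropping routine is applied to the rendered pictures.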
Specifically, optical character recognition is performed on each first picture to obtain all text segments contained in each first picture, together with the position coordinates and pixel size of the picture corresponding to each text segment;
comparing the pixel sizes of the text segment pictures in all the first pictures, and taking the text segment corresponding to the picture with the largest pixel size as a candidate chapter title;
determining the positions of the candidate chapter titles in the first pictures according to the position coordinates of the pictures corresponding to the candidate chapter titles, and taking the candidate chapter titles that are centered in their line of the first picture as final chapter titles;
all the text segments contained in each first picture are spliced, and the corresponding final chapter titles are marked in the spliced text segments to obtain a plurality of first plain texts.
Content recognition is performed on each cropped picture using optical character recognition (OCR), yielding all the text segments contained in each picture together with the position coordinates and pixel size of the picture region corresponding to each text segment. The Python CnOCR module can be used for this step; CnOCR is a character recognition toolkit for Python 3 that supports common simplified Chinese, traditional Chinese, English, and numeric characters, as well as vertically arranged text. Titles are then marked. In general, the font size of a title differs clearly from that of the body text, so the titles in a picture are recognized from the font size and the layout of the text in the picture. The specific flow is as follows:
1) The pixel information of all the text segments is analyzed statistically to obtain the distribution of pixel sizes. In general, if a picture contains a title, body text, figures, and similar information, the title has the largest font, followed by the body text and the figure information; and normally, the larger the font size, the larger the pixel size. Based on this logic, the text segment with the largest font size is extracted as a candidate chapter title;
2) The position information of the text segment is analyzed. In general, a title occupies a single line and is displayed centered, that is, it lies in the middle of the whole text line. The specific position of a candidate chapter title is therefore analyzed from the position coordinates of its corresponding picture, and when that picture lies in the central coordinate area of the line, the candidate chapter title is taken as the final chapter title and marked with the special symbol [h1].
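The two-step title heuristic above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `Segment` type, the use of bounding-box height as a proxy for pixel size, the two tolerance parameters, and the choice to put `[h1]` in front of the title line are all hypothetical and are not specified by the application. The OCR call itself is omitted; the segments stand in for its output.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One OCR text segment with its bounding box in pixel coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        # Box height is used here as a stand-in for font size (assumption).
        return self.y1 - self.y0


def mark_titles(pages, page_width, height_tol=0.05, center_tol=0.05):
    """Return one plain text per page, with final chapter titles prefixed by [h1].

    Step 1: segments whose height is within height_tol of the largest height
            found across all pages become candidate chapter titles.
    Step 2: a candidate is kept only if its horizontal midpoint lies within
            center_tol * page_width of the page's horizontal centre.
    """
    max_height = max((seg.height for page in pages for seg in page), default=0.0)
    texts = []
    for page in pages:
        lines = []
        for seg in page:
            is_candidate = seg.height >= max_height * (1 - height_tol)
            offset = abs((seg.x0 + seg.x1) / 2 - page_width / 2)
            is_centered = offset <= page_width * center_tol
            lines.append("[h1]" + seg.text if is_candidate and is_centered else seg.text)
        texts.append("\n".join(lines))
    return texts
```

Comparing heights across all pages, rather than page by page, keeps a page that contains only body text from promoting one of its ordinary lines to a title.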
Specifically, each first plain text is manually checked according to the PDF document, and all checked first plain texts are sequentially spliced into a second plain text according to the page number.
OCR text recognition has a certain error rate, so a manual verification step is added: the recognized text is checked manually against the content of the PDF document. After the erroneous content has been corrected, the texts recognized from the individual pictures are spliced in picture-number order to obtain a single plain text, that is, a text in txt format.
The generated txt text is then split according to the business scenario. In this embodiment, the text is split by book chapter; the split is based on the marked final chapter titles, that is, the chapter title symbol [h1], and it converts the original single txt file into as many sub-files as there are chapters.
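A minimal sketch of the splicing and chapter-splitting step might look like the following. It assumes that each title occupies its own line and that the `[h1]` marker sits at the start of that line; the function name `split_chapters` and the handling of text that precedes the first title (kept under a `None` title) are choices made for this example only.

```python
def split_chapters(page_texts, marker="[h1]"):
    """Splice page texts in order and split them into (title, body) pairs.

    page_texts must already be sorted by page number. Text that appears
    before the first marked title is returned with a title of None.
    """
    full_text = "\n".join(page_texts)
    chapters = []
    title, body = None, []
    for line in full_text.splitlines():
        if line.startswith(marker):
            # A new title closes the chapter collected so far.
            if title is not None or body:
                chapters.append((title, "\n".join(body)))
            title, body = line[len(marker):].strip(), []
        else:
            body.append(line)
    if title is not None or body:
        chapters.append((title, "\n".join(body)))
    return chapters
```

Each returned pair could then be written to its own txt sub-file, giving one file per chapter as described above.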
Specifically, entities in each chapter text are identified by using a named entity identification technology, and the relationship among the entities in each chapter text is determined according to a co-occurrence principle;
disambiguating each entity, and combining the entity and the entity relationship in all the chapter texts after disambiguation;
and inputting the merging result into a Neo4j system to construct a visual knowledge graph of the whole PDF document.
Or recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;
and disambiguating the entities in each chapter text, and respectively inputting the disambiguated entities and entity relationships of each chapter text into a Neo4j system to construct a visual knowledge graph of each chapter of the PDF document.
For each chapter text obtained by splitting, NER (named entity recognition) is used to identify entities such as persons, organizations, regions, and times, and the number of occurrences of each extracted entity is counted as a basis for later business analysis of entity importance. Relationships between entities are extracted according to the co-occurrence principle, that is, from the number of times two entities appear together in the same sentence; the relationship is named "co-occurrence", and the number of occurrences of each relationship is counted as a basis for later business analysis of relationship importance. The entities extracted from each chapter text are then disambiguated, for which NLP techniques can be used. After disambiguation, all entities and entity relationships are merged to obtain the entity and relationship details and statistics for the whole PDF book, which are stored in csv format and then imported into the Neo4j system for visual graph display. If the business needs visual knowledge graph analysis of the text of one particular chapter, the pre-merge entity and relationship data of that chapter are imported into the Neo4j system instead.
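The counting and merging described above can be illustrated with a short sketch. It assumes that named entity recognition and disambiguation have already been done, so each sentence arrives as a list of canonical entity names; the function names and the csv column names are hypothetical and not taken from the application.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence(sentences):
    """Count entity mentions and same-sentence entity pairs for one chapter.

    sentences: one list of entity names per sentence.
    Pairs are stored in sorted order so (A, B) and (B, A) share one key.
    """
    entity_freq, pair_freq = Counter(), Counter()
    for entities in sentences:
        entity_freq.update(entities)
        for a, b in combinations(sorted(set(entities)), 2):
            pair_freq[(a, b)] += 1
    return entity_freq, pair_freq


def merge_counts(chapter_results):
    """Sum per-chapter (entity_freq, pair_freq) results into whole-book counts."""
    entities, pairs = Counter(), Counter()
    for entity_freq, pair_freq in chapter_results:
        entities.update(entity_freq)
        pairs.update(pair_freq)
    return entities, pairs


def relations_csv(pair_freq):
    """Serialise relationship counts as csv text (column names are illustrative)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "relation", "count"])
    for (a, b), count in sorted(pair_freq.items()):
        writer.writerow([a, b, "co-occurrence", count])
    return buffer.getvalue()
```

Keeping the per-chapter counters around before merging is what makes the chapter-level import possible: the same `relations_csv` call works on either a single chapter's counts or the merged ones.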
This embodiment provides a method for converting a book in PDF format into a text file while preserving the information of each chapter of the book through title recognition. On this basis, a knowledge graph is constructed automatically from the text content. In terms of granularity, both the graph of the whole book and the graph analysis of an individual chapter are supported, so the individualized requirements of business scenarios can be met.
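The choice between a whole-book graph and a single-chapter graph can be sketched as a selection step in front of the Neo4j import. The node label `Entity`, the relationship type `CO_OCCURS`, the `weight` property, and the function names are assumptions made for this illustration; the application only states that the data are imported into Neo4j. The sketch prepares a Cypher `MERGE` statement and its parameters without opening a database connection.

```python
from collections import Counter

# Parameterised Cypher: MERGE creates each node and relationship only if missing.
MERGE_QUERY = (
    "MERGE (a:Entity {name: $source}) "
    "MERGE (b:Entity {name: $target}) "
    "MERGE (a)-[r:CO_OCCURS]->(b) "
    "SET r.weight = $weight"
)


def select_relations(chapter_relations, chapter=None):
    """Pick the relationship counts to import.

    chapter_relations: dict mapping chapter title -> {(source, target): count}.
    With chapter=None the counts of all chapters are merged (whole-book graph);
    otherwise only the pre-merge counts of the named chapter are returned.
    """
    if chapter is not None:
        return dict(chapter_relations[chapter])
    merged = Counter()
    for relations in chapter_relations.values():
        merged.update(relations)
    return dict(merged)


def to_cypher_params(relations):
    """Turn relationship counts into one parameter dict per MERGE_QUERY run."""
    return [
        {"source": a, "target": b, "weight": weight}
        for (a, b), weight in sorted(relations.items())
    ]
```

With the official Neo4j Python driver, each parameter dict could then be passed to a session along the lines of `session.run(MERGE_QUERY, **params)`; connection details are deployment-specific and are left out here.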
As shown in fig. 2, the present embodiment further provides a system for constructing a knowledge graph based on a PDF document, including:
the identification module is used for splitting the PDF document into a plurality of first pictures according to the page number, and carrying out optical character identification and title identification on each first picture to obtain a plurality of first plain texts;
the dividing module is used for splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
and the construction module is used for performing entity identification and entity relationship extraction on each chapter of text and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.
One embodiment of the above system may be: the identification module divides the PDF document into a plurality of first pictures according to page numbers, and performs optical character identification and title identification on each first picture to obtain a plurality of first plain texts; the dividing module splices all the first plain texts to obtain second plain texts, and divides the second plain texts into a plurality of chapter texts according to the identified titles; and the construction module performs entity identification and entity relationship extraction on each chapter text, and constructs a knowledge graph corresponding to the PDF document according to the obtained entities and the entity relationship.
As shown in fig. 3, the present embodiment further provides an electronic device, which includes a memory 301 and a processor 302, where the memory 301 is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor 302 to implement a method for constructing a knowledge graph based on a PDF document as described above.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
A computer-readable storage medium storing a computer program which, when executed by a computer, implements a method for constructing a knowledge-graph based on a PDF document as described above.
Illustratively, a computer program may be divided into one or more modules/units, one or more modules/units are stored in the memory 301 and executed by the processor 302, and the I/O interface transmission of data is performed by the input interface 305 and the output interface 306 to accomplish the present invention, and one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution process of the computer program in the computer device.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer device may include, but is not limited to, the memory 301 and the processor 302, and those skilled in the art will appreciate that the present embodiment is only an example of the computer device, and does not constitute a limitation of the computer device, and may include more or less components, or combine some components, or different components, for example, the computer device may further include the input device 307, the network access device, the bus, and the like.
The processor 302 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor or the like.
The memory 301 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. The memory 301 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the computer device. Further, the memory 301 may include both an internal storage unit and an external storage device of the computer device. The memory 301 is used for storing computer programs and other programs and data required by the computer device, and may also be used to temporarily store data that has been or will be output by the output unit 308. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM 303, a RAM 304, a magnetic disk, and an optical disk.
The above description is only an embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions within the technical scope of the present invention are intended to be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A PDF document-based knowledge graph construction method is characterized by comprising the following steps:
splitting a PDF document into a plurality of first pictures according to page numbers, and performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts;
splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
and performing entity identification and entity relationship extraction on each chapter text, and constructing a knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.
2. The method according to claim 1, wherein the splitting the PDF document into a plurality of first pictures according to page number comprises:
splitting a PDF document into a plurality of sub-documents according to page numbers, and converting each sub-document into a picture;
numbering each picture according to the corresponding page number, and cutting out redundant information in each picture to generate a plurality of first pictures.
3. The method as claimed in claim 1, wherein the step of performing optical character recognition and title recognition on each first picture to obtain a plurality of first plain texts comprises:
performing optical character recognition on each first picture to obtain all text segments contained in each first picture and position coordinates and pixels of the picture corresponding to each text segment;
comparing pixels of the text segment pictures in all the first pictures, and taking the text segment corresponding to the picture with the largest pixel as a candidate chapter title;
determining the position of each candidate chapter title in all first pictures according to the position coordinates of the picture corresponding to the candidate chapter title, and taking the candidate chapter title positioned in the middle of all first picture rows as a final chapter title;
all the text segments contained in each first picture are spliced, and the corresponding final chapter titles are marked in the spliced text segments to obtain a plurality of first plain texts.
4. The method of claim 1, wherein the step of splicing all the first plain texts to obtain a second plain text comprises:
and manually checking each first plain text according to the PDF document, and sequentially splicing all the checked first plain texts into a second plain text according to the page number.
5. The method as claimed in claim 1, wherein the step of performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships comprises:
recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;
disambiguating each entity, and combining the entity and the entity relationship in all the chapter texts after disambiguation;
and inputting the merging result into a Neo4j system to construct a visual knowledge graph of the whole PDF document.
6. The method according to claim 1, wherein the performing entity identification and entity relationship extraction on each chapter of text, and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships, further comprises:
recognizing the entities in each chapter text by using a named entity recognition technology, and determining the relationship among the entities in each chapter text according to a co-occurrence principle;
and disambiguating the entities in each chapter text, and respectively inputting the disambiguated entities and entity relationships of each chapter text into a Neo4j system to construct a visual knowledge graph of each chapter of the PDF document.
7. A PDF document-based knowledge graph construction system is characterized by comprising:
the identification module is used for splitting the PDF document into a plurality of first pictures according to the page number, and carrying out optical character identification and title identification on each first picture to obtain a plurality of first plain texts;
the dividing module is used for splicing all the first plain texts to obtain a second plain text, and dividing the second plain text into a plurality of chapter texts according to the identified title;
and the construction module is used for performing entity identification and entity relationship extraction on each chapter of text and constructing the knowledge graph corresponding to the PDF document according to the obtained entities and entity relationships.
8. The PDF document-based knowledge graph construction system of claim 7, wherein the dividing module comprises:
the analysis unit is used for carrying out optical character recognition on each first picture to obtain all text segments contained in each first picture and position coordinates and pixels of the picture corresponding to each text segment;
the selecting unit is used for comparing pixels of the text segment pictures in all the first pictures and taking the text segment corresponding to the picture with the largest pixel as a candidate chapter title;
the determining unit is used for determining the positions of the candidate chapter titles in all the first pictures according to the position coordinates of the pictures corresponding to the candidate chapter titles, and taking the candidate chapter titles positioned in the middle of all the first picture rows as final chapter titles;
and the splicing unit is used for splicing all the text segments contained in each first picture and marking the corresponding final chapter titles in the text segments to obtain a plurality of first plain texts.
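The four units above can be illustrated with a minimal sketch that works on OCR output given as text segments with bounding boxes. Several readings are assumptions rather than statements of the claim: "largest pixel" is taken to mean the largest bounding-box height found over all pages, "middle of the row" is taken to mean a box whose horizontal center lies near the page center, and the tolerances, the `Segment` type and the `<<TITLE>> ` marker are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    x: float       # left edge of the segment's bounding box, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


def mark_titles(pages, page_width, size_tol=0.9, center_tol=0.1):
    """Return one plain text per page, with detected titles prefixed by a marker.

    Assumed reading of the claim: a candidate title has (nearly) the largest
    bounding-box height over all pages, and a final title is a candidate
    whose horizontal center lies near the page center.
    """
    max_height = max(s.height for segs in pages for s in segs)
    texts = []
    for segs in pages:
        lines = []
        # Reading order: top to bottom, then left to right.
        for s in sorted(segs, key=lambda s: (s.y, s.x)):
            is_candidate = s.height >= size_tol * max_height
            center_x = s.x + s.width / 2
            is_centered = abs(center_x - page_width / 2) <= center_tol * page_width
            prefix = "<<TITLE>> " if is_candidate and is_centered else ""
            lines.append(prefix + s.text)
        texts.append("\n".join(lines))
    return texts


pages = [
    [
        Segment("Chapter 1", x=400, y=50, width=200, height=40),
        Segment("Body text of the first page.", x=100, y=120, width=800, height=18),
    ],
    [
        Segment("BIG PULL QUOTE", x=50, y=60, width=300, height=40),
        Segment("More body text.", x=100, y=130, width=800, height=18),
    ],
]
first_plain_texts = mark_titles(pages, page_width=1000)
```

The second page shows why the position check is applied after the size check: the pull quote is as large as a title but sits at the left of the page, so it is not marked.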
9. An electronic device, comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the PDF document-based knowledge graph construction method according to any one of claims 1 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program is configured to cause a computer to execute a method for constructing a knowledge graph based on a PDF document according to any one of claims 1 to 6.
CN202211418911.7A 2022-11-14 2022-11-14 PDF document-based knowledge graph construction method, system, equipment and storage medium Pending CN115687655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211418911.7A CN115687655A (en) 2022-11-14 2022-11-14 PDF document-based knowledge graph construction method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115687655A true CN115687655A (en) 2023-02-03

Family

ID=85051914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211418911.7A Pending CN115687655A (en) 2022-11-14 2022-11-14 PDF document-based knowledge graph construction method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115687655A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090560A (en) * 2023-04-06 2023-05-09 北京大学深圳研究生院 Knowledge graph establishment method, device and system based on teaching materials
CN116090560B (en) * 2023-04-06 2023-08-01 北京大学深圳研究生院 Method, device and system for building knowledge map based on teaching materials
CN116912867A (en) * 2023-09-13 2023-10-20 之江实验室 Teaching material structure extraction method and device combining automatic labeling and recall completion
CN116912867B (en) * 2023-09-13 2023-12-29 之江实验室 Teaching material structure extraction method and device combining automatic labeling and recall completion
CN117391192A (en) * 2023-12-08 2024-01-12 杭州悦数科技有限公司 Method and device for constructing knowledge graph from PDF by using LLM based on graph database
CN117391192B (en) * 2023-12-08 2024-03-15 杭州悦数科技有限公司 Method and device for constructing knowledge graph from PDF by using LLM based on graph database
CN117763206A (en) * 2024-02-20 2024-03-26 暗物智能科技(广州)有限公司 Knowledge tree generation method and device, electronic equipment and storage medium
CN117763206B (en) * 2024-02-20 2024-06-11 暗物智能科技(广州)有限公司 Knowledge tree generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108932294B (en) Resume data processing method, device, equipment and storage medium based on index
US11914968B2 (en) Official document processing method, device, computer equipment and storage medium
CN109062874B (en) Financial data acquisition method, terminal device and medium
CN115687655A (en) PDF document-based knowledge graph construction method, system, equipment and storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
US8577882B2 (en) Method and system for searching multilingual documents
CN112016273A (en) Document directory generation method and device, electronic equipment and readable storage medium
CN110569335B (en) Triple verification method and device based on artificial intelligence and storage medium
US9495357B1 (en) Text extraction
CN111597800B (en) Method, device, equipment and storage medium for obtaining synonyms
CN111930976A (en) Presentation generation method, device, equipment and storage medium
CN115203445A (en) Multimedia resource searching method, device, equipment and medium
CN119129529A (en) PDF document conversion method, device, equipment, storage medium and product
WO2018208412A1 (en) Detection of caption elements in documents
CN113177407A (en) Data dictionary construction method and device, computer equipment and storage medium
CN109670183B (en) Text importance calculation method, device, equipment and storage medium
CN107168627B (en) Text editing method and device for touch screen
CN111368693A (en) Identification method and device for identity card information
CN115481599A (en) Document processing method and device, electronic equipment and storage medium
CN114548054B (en) PDF file conversion method, device, computer equipment and readable storage medium
CN113761906B (en) Method, apparatus, device and computer readable medium for parsing document
CN113486171B (en) Image processing method and device and electronic equipment
WO2022105120A1 (en) Text detection method and apparatus from image, computer device and storage medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium
CN114239562A (en) Method, device and equipment for identifying program code blocks in document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230717

Address after: Room 430, Cultural and Entertainment Center, No. 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310050

Applicant after: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.

Applicant after: Xinhua fusion media technology development (Beijing) Co.,Ltd.

Address before: Room 430, cultural center, 460 Wenyi West Road, Xihu District, Hangzhou City, Zhejiang Province, 310012

Applicant before: XINHUA ZHIYUN TECHNOLOGY Co.,Ltd.