
CN112632959A - EPUB file analysis method - Google Patents

EPUB file analysis method

Info

Publication number
CN112632959A
Authority
CN
China
Prior art keywords
file
parsing
tag
epub
chapter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011590270.4A
Other languages
Chinese (zh)
Other versions
CN112632959B (en)
Inventor
黄希希
蒋芹
佘晓龙
徐磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Original Assignee
Hubei University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University
Priority to CN202011590270.4A
Publication of CN112632959A
Application granted
Publication of CN112632959B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to an EPUB file parsing method comprising the following steps. Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed. Parsing the main body content of the book: the main body content is contained in xml/html files, which a page processing module classifies into a cover, a front-matter (preamble) part, a chapter part and a back-matter (postscript) part; a cover parsing unit, a front-matter parsing unit and a chapter parsing unit are then called respectively to semanticize the first three parts, while the back-matter part is not processed. Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them. The method parses EPUB data into JSON-format data.

Description

EPUB file analysis method
Technical Field
The invention relates to the technical field of file reading methods, in particular to the semantization of the EPUB format, and specifically to restructuring a book in EPUB format and converting it into a JSON file.
Background
EPUB (electronic publication) is a free and open standard for reflowable content: the text can be laid out in the way best suited to reading on a given device, according to that device's characteristics. Internally, an EPUB archive uses XHTML or DTBook (an XML standard proposed by the DAISY Consortium) to represent text, and packages the archive contents in zip-compressed form. The EPUB format optionally includes Digital Rights Management (DRM) functionality. At present, however, EPUB e-book data contains a great deal of redundancy and is not easy to extract and reuse. If the book information can be converted, without loss, into the JSON format, which is highly extensible and simple to parse, it can then easily be converted into other file types, which gives the approach broad application prospects for converting and displaying e-book information.
The existing EPUB parsing scheme parses an EPUB file based on epublib-core-latest. A file header converter, a page converter and an object converter are then generated from the parsed information: the file header converter obtains the author and title of the file, the page converter obtains the content of the file, and the object converter integrates the information from the file header converter and the page converter.
The prior art has the following defects. The data obtained from epublib parsing is rich text content in HTML format; if plain text is needed, the text content of the tags can be obtained through jsoup, but the layout of the resulting plain text is messy. In addition, the steps of generating the file header converter, the page converter and the object converter from the parsed information are overly cumbersome and not very extensible.
Disclosure of Invention
The purpose of the invention is to provide an EPUB file parsing method that solves the above problems, namely that EPUB e-book data contains a large amount of redundancy and is difficult to extract and use.
To solve these problems, the technical scheme adopted by the invention is as follows:
An EPUB file parsing method, in which a decompression module is called to rename the EPUB file to a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:
(1) Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed;
(2) Parsing the main body content of the book: the main body content is contained in xml/html files, which a page processing module classifies; the classification obtains the epub-type attribute and the data-type attribute of the root tag of each file and, according to their values, divides the files into a cover, a front-matter (preamble) part, a chapter part and a back-matter (postscript) part; a cover parsing unit, a front-matter parsing unit and a chapter parsing unit are then called respectively to semanticize the first three parts; the back-matter part is not processed;
(3) Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them.
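Step (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented module: the function name `read_rootfile_path` and the sample document are made up for the example, while the namespace URI and the `full-path` attribute are those of the EPUB container format.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def read_rootfile_path(container_xml: str) -> str:
    """Return the full-path attribute of the first <rootfile> element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


# Made-up sample; a real file is read from META-INF/container.xml.
SAMPLE_CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

print(read_rootfile_path(SAMPLE_CONTAINER))  # prints OEBPS/content.opf
```

The returned path is relative to the root of the decompressed archive, so the opf file would be opened at that location inside the extraction directory.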
Furthermore, parsing of the opf file is divided into two parts: parsing the book meta information and parsing the book media file paths.
Further, the meta information of the book is contained in the tag named metadata, and the tag information to be parsed is as follows:
[Table (images BDA0002868756950000021 and BDA0002868756950000031): meta-information name, tag name and attribute of each metadata tag to be parsed]
The meta-information parsing module is called to build the opf file into a document tree; the metadata and manifest tags are traversed, the values of the text nodes corresponding to the meta-information tags are extracted and stored in a predefined semantic JSON structure, and the paths of the xml and html files are stored as instance variables.
Furthermore, the cover parsing unit directly locates the cover picture in the file, extracts it, stores it in a designated folder and names it by its md5 value. Specifically, the img tag is searched for in the document tree, the picture is located in the picture directory through the src attribute of the img tag, and the picture is moved and renamed with its computed md5 value.
Furthermore, the front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute. Five kinds of tags are parsed, namely those whose class is BookTitlePage, CopyrightPage, Dedication, Preface or BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgment information respectively. Parsing proceeds by recursive traversal, converting each tag into the predefined semantic JSON according to its structure.
Further, the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction. The first part is the title and introductory text of each chapter; the second part, the main content, is handled in step (3). The title is an h1-h4 tag, and it is extracted by locating the tag by name and taking its text node value. The introductory text resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used for it: the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text. The p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and are also the core of the chapter parsing unit.
Further, the p tag parsing unit works as follows: the child nodes are traversed; a child that is a text node is stored in a list, and otherwise the processing unit corresponding to the child's tag is called to process it; finally the results are assembled into key-value data.
Further, the div tag parsing unit works as follows: the div tag is traversed to take out all of its child tags, and the content of each child tag is passed to the appropriate processing module. Unlike the p tag parsing unit, whose result always has the key "p", the div tag parsing unit returns an independent key-value pair whose key is determined by the class of the div tag, which the usual HTML-to-JSON conversion methods cannot achieve; redundant div tags are removed at the same time.
The technical scheme provided by the invention has at least the following beneficial effects: it solves the problem of making use of EPUB e-book information; by converting the book into a semantic JSON file it preserves the book information to the greatest extent while minimizing storage cost; and because the JSON format is highly extensible and simple to parse, the result is easy to convert into other file types.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a parsing flowchart of an EPUB file parsing method disclosed in the embodiment of the present invention.
Fig. 2 is a main block diagram of an EPUB file parsing method disclosed in the embodiment of the present invention.
Fig. 3 is a schematic diagram of an xhtml structure of a chapter parsing unit disclosed in the embodiment of the present invention.
Fig. 4 is a flowchart of parsing a P tag parsing unit according to an embodiment of the present invention.
Fig. 5 is a parsing flowchart of the div tag parsing unit according to the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention relates to an EPUB file parsing method; the parsing steps are shown in Fig. 1 and the main modules in Fig. 2. First, a decompression module is called to convert the EPUB file into a zip archive and decompress it, and the decompressed files are then parsed in full.
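The decompression step can be sketched with the Python standard library. The function name `unpack_epub` and the directory layout are assumptions made for illustration; the sketch copies the file to a `.zip` name instead of renaming it in place, so that the source EPUB is left untouched.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_epub(epub_path: str, work_dir: str) -> Path:
    """Copy book.epub to book.zip inside work_dir, extract it, and return the extraction directory."""
    src = Path(epub_path)
    out = Path(work_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The "rename to zip" step, done on a copy rather than on the original file.
    zip_copy = out / (src.stem + ".zip")
    shutil.copyfile(src, zip_copy)

    # An EPUB is an ordinary zip archive, so zipfile can extract it directly.
    target = out / src.stem
    with zipfile.ZipFile(zip_copy) as archive:
        archive.extractall(target)
    return target
```

Strictly speaking the rename is not required for `zipfile`, which does not look at the file extension; it is kept here only to mirror the step as described.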
First, the meta information is parsed. After decompression, an e-book complying with the EPUB 3.0 standard generally contains three kinds of entries: the META-INF directory, the OEBPS (OPS) directory and the mimetype file. The method first calls the meta-information parsing module to parse the container.xml file in the META-INF directory and read the root file (rootfile) path of the e-book. The opf file is then located according to the root file path, and the meta information in it is parsed. Parsing of the opf file is divided into two parts: parsing the book meta information and parsing the book media file paths. The meta information of the book is mainly contained in the tag named metadata; the tags to be parsed are as follows:
Meta information        | Tag name    | Attribute
ISBN                    | Identifier  |
Book title              | Title       | Id=pub-title
Book subtitle           | Title       | Id=pub-subtitle
Language of the book    | Language    |
Author                  | Creator     |
Publisher               | Publisher   |
Copyright information   | Rights      |
The meta-information parsing module is called to build the opf file into a document tree; the metadata and manifest tags are traversed, the values of the text nodes corresponding to the metadata tags are extracted and stored in a predefined semantic JSON structure, and the paths of the xml and html files are stored as instance variables.
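A rough sketch of this opf parsing step is given below. The JSON field names (`isbn`, `title`, and so on), the function name `parse_opf` and the sample package document are assumptions; the patent does not give its predefined JSON structure. The sketch also assumes the tags in the table are the Dublin Core elements (`dc:identifier`, `dc:title`, ...) normally found in an opf file, and it returns the page paths instead of storing them as instance variables.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical JSON field name -> Dublin Core tag name.
SIMPLE_FIELDS = {
    "isbn": "identifier",
    "language": "language",
    "author": "creator",
    "publisher": "publisher",
    "rights": "rights",
}
# Title tags are told apart by their id attribute, as in the table.
TITLE_IDS = {"pub-title": "title", "pub-subtitle": "subtitle"}


def parse_opf(opf_xml: str):
    """Return (metadata dict, list of xhtml page paths) for an opf document."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        raise ValueError("opf file has no <metadata> element")

    meta = {}
    for field, tag in SIMPLE_FIELDS.items():
        node = metadata.find(f"dc:{tag}", NS)
        if node is not None and node.text:
            meta[field] = node.text.strip()
    for node in metadata.findall("dc:title", NS):
        field = TITLE_IDS.get(node.get("id"))
        if field and node.text:
            meta[field] = node.text.strip()

    # Media file paths: keep only the xhtml pages listed in the manifest.
    pages = [
        item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml"
    ]
    return meta, pages


# Made-up sample data for illustration only.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="isbn">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="isbn">9780000000000</dc:identifier>
    <dc:title id="pub-title">Sample Book</dc:title>
    <dc:title id="pub-subtitle">A Short Demo</dc:title>
    <dc:language>en</dc:language>
    <dc:creator>Jane Doe</dc:creator>
    <dc:publisher>Example Press</dc:publisher>
    <dc:rights>All rights reserved</dc:rights>
  </metadata>
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="ch01.xhtml" media-type="application/xhtml+xml"/>
    <item id="img" href="images/cover.jpg" media-type="image/jpeg"/>
  </manifest>
</package>"""

meta, pages = parse_opf(SAMPLE_OPF)
```

The `href` values are relative to the directory that contains the opf file, so a caller would join them with that directory before opening the pages.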
Second, the main body content of the book is parsed. The xml/html files are first classified by the page processing module: the epub-type attribute and the data-type attribute of the root tag of each file are obtained, and the file is classified according to their values. The book content is divided into four parts, namely a cover, a front-matter (preamble) part, a chapter part and a back-matter (postscript) part, and a different module is then called to process each; this is a key innovation of the semantic parsing method.
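The classification step might look like the sketch below. Several points are assumptions: the attribute the text calls epub-type is taken to be the EPUB 3 `epub:type` attribute, the outermost element carrying either attribute is treated as the root tag, and the value-to-category table `CATEGORIES` is hypothetical, since the patent does not list the values it checks.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the epub:type attribute.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Hypothetical mapping from attribute values to the four page categories.
CATEGORIES = {
    "cover": "cover",
    "frontmatter": "front_matter",
    "titlepage": "front_matter",
    "preface": "front_matter",
    "chapter": "chapter",
    "bodymatter": "chapter",
}


def classify_page(xhtml: str) -> str:
    """Classify a page as cover, front_matter, chapter or back_matter."""
    root = ET.fromstring(xhtml)
    for node in root.iter():  # document order, starting at the root
        value = node.get(EPUB_TYPE) or node.get("data-type")
        if value:
            for token in value.split():
                if token in CATEGORIES:
                    return CATEGORIES[token]
            break  # only the outermost typed element decides
    # Everything else is treated as back matter, which is left unprocessed.
    return "back_matter"


PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" '
    'xmlns:epub="http://www.idpf.org/2007/ops">'
    '<body epub:type="frontmatter"><p>Preface text</p></body></html>'
)
print(classify_page(PAGE))  # prints front_matter
```

A caller would then dispatch on the returned category, calling the cover, front-matter or chapter parsing unit and skipping pages classified as back matter.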
The cover parsing unit directly locates the cover picture in the file, extracts it, stores it in a designated folder and names it by its md5 value. Specifically, the img tag is searched for in the document tree, the picture is located in the picture directory through the src attribute of the img tag, and the picture is moved and renamed with its computed md5 value.
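A minimal sketch of the cover step follows. The function name `extract_cover` is an assumption, and so is the choice to resolve the src attribute relative to the cover page and to keep the original file extension after the md5 digest.

```python
import hashlib
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_cover(cover_page: Path, out_dir: Path) -> Path:
    """Move the image referenced by the first <img> tag and name it by its MD5 digest."""
    root = ET.parse(cover_page).getroot()
    # Compare local names so that namespaced XHTML tags also match.
    img = next((n for n in root.iter() if n.tag.split("}")[-1] == "img"), None)
    if img is None or not img.get("src"):
        raise ValueError("cover page has no <img src=...> tag")

    # The src attribute is resolved relative to the page that contains it.
    source = (cover_page.parent / img.get("src")).resolve()
    digest = hashlib.md5(source.read_bytes()).hexdigest()

    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (digest + source.suffix)
    shutil.move(str(source), str(target))
    return target
```

Naming the file by its content digest means the same cover extracted twice ends up under the same name, which is one plausible reason for the md5 naming.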
The front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute. It mainly parses five kinds of tags, those whose class is BookTitlePage, CopyrightPage, Dedication, Preface or BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgment information respectively. Parsing proceeds by recursive traversal, converting each tag into the predefined semantic JSON according to its structure.
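The front-matter step can be illustrated as below. The class spellings used as dictionary keys, the JSON field names and the helper names are all assumptions made for this sketch, and the element carrying the front-matter marker is looked up through the EPUB 3 `epub:type` attribute.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# class value -> JSON field name (both sides are assumptions for this sketch).
CLASS_TO_FIELD = {
    "BookTitlePage": "title_page",
    "CopyrightPage": "copyright",
    "Dedication": "dedication",
    "Preface": "preface",
    "BookAcknowledgments": "acknowledgments",
}


def local(tag: str) -> str:
    """Strip the namespace part from an ElementTree tag name."""
    return tag.split("}")[-1]


def content_of(node):
    """Recursively turn an element's content into a list of strings and {tag: [...]} dicts."""
    items = []
    if node.text and node.text.strip():
        items.append(node.text.strip())
    for child in node:
        items.append({local(child.tag): content_of(child)})
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_front_matter(xhtml: str) -> dict:
    root = ET.fromstring(xhtml)
    result = {}
    for front in root.iter():
        if "frontmatter" not in (front.get(EPUB_TYPE) or "").split():
            continue
        # Traverse every tag below the front-matter element and map it by class.
        for node in front.iter():
            for cls in (node.get("class") or "").split():
                if cls in CLASS_TO_FIELD:
                    result[CLASS_TO_FIELD[cls]] = content_of(node)
    return result


SAMPLE_FRONT = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<body epub:type="frontmatter">
  <section class="BookTitlePage"><h1>Sample Book</h1></section>
  <section class="Dedication"><p>For <em>you</em>.</p></section>
</body></html>"""

front = parse_front_matter(SAMPLE_FRONT)
```

For the sample above, `front["dedication"]` is `[{"p": ["For", {"em": ["you"]}, "."]}]`; note that this sketch strips surrounding whitespace from each text fragment, which a real implementation may prefer to keep.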
The structure of the chapter-type xhtml to be parsed by the chapter parsing unit is shown in Fig. 3; the invention splits it into two parts for extraction. The first part is the title and introductory text of each chapter. The title is generally an h1-h4 tag, and it is extracted by locating the tag by name and taking its text node value. The introductory text resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used for it, for example the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text. The p tag parsing unit and the div tag parsing unit are the entry points to almost all of the remaining parsing units and are also the core of the chapter parsing unit.
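The split into title, introductory text and sections could be sketched as follows. Fig. 3 is not reproduced in this text, so the layout assumed here (the heading, the introductory p or div tags and the section tags are siblings under one container element) is a guess, and the introductory text is reduced to plain strings rather than passed through the p and div parsing units.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4"}


def local(tag: str) -> str:
    """Strip the namespace part from an ElementTree tag name."""
    return tag.split("}")[-1]


def split_chapter(xhtml: str) -> dict:
    """Return the chapter title, its introductory text and the number of sections."""
    root = ET.fromstring(xhtml)
    # Assumption: the first element that has a heading as a direct child is the chapter container.
    container = next(
        (n for n in root.iter() if any(local(c.tag) in HEADINGS for c in n)),
        None,
    )
    title, intro, sections = None, [], []
    if container is None:
        return {"title": title, "intro": intro, "sections": 0}
    for child in container:
        name = local(child.tag)
        if name in HEADINGS and title is None:
            title = "".join(child.itertext()).strip()
        elif name == "section":
            sections.append(child)
        elif name in ("p", "div") and not sections:
            # Paragraphs before the first section are the introductory text.
            intro.append("".join(child.itertext()).strip())
    return {"title": title, "intro": intro, "sections": len(sections)}


SAMPLE_CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml">
<body data-type="chapter">
  <h1>Chapter 1. Getting Started</h1>
  <p>This chapter introduces the basics.</p>
  <section><h2>First section</h2><p>Body text.</p></section>
  <section><h2>Second section</h2></section>
</body></html>"""

chapter = split_chapter(SAMPLE_CHAPTER)
```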
The specific steps of the p tag parsing unit are shown in Fig. 4. The child nodes are traversed first; a child that is a text node is stored in a list, and otherwise the processing unit corresponding to the child's tag is called to process it. Finally the result is assembled into key-value data such as {"p": ["text", {"span": ["text"]}, {"code": "text"}]}, which is highly extensible. Note that although the results for different tags have almost the same format, the processing methods for different tags differ.
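The p tag parsing unit might be sketched like this. The handler registry `HANDLERS`, the fallback `parse_inline` and the special treatment of code tags are assumptions introduced to show how different tags can share a result format while being processed differently.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the namespace part from an ElementTree tag name."""
    return tag.split("}")[-1]


def parse_children(node):
    """Collect text nodes as strings and delegate child tags to their handlers."""
    items = []
    if node.text and node.text.strip():
        items.append(node.text.strip())
    for child in node:
        handler = HANDLERS.get(local(child.tag), parse_inline)
        items.append(handler(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_inline(node):
    """Fallback handler: {tag: [content...]}."""
    return {local(node.tag): parse_children(node)}


def parse_code(node):
    """Code keeps its text verbatim as a single string."""
    return {"code": "".join(node.itertext())}


# Hypothetical registry: tag name -> processing unit.
HANDLERS = {"code": parse_code}


def parse_p(node):
    """The key of the result is always "p"."""
    return {"p": parse_children(node)}


sample = ET.fromstring("<p>text <span>text</span> <code>text</code></p>")
print(parse_p(sample))
# prints {'p': ['text', {'span': ['text']}, {'code': 'text'}]}
```

New tags are supported by registering another entry in `HANDLERS`, which is one way to read the extensibility claim.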
The specific steps of the div tag parsing unit are shown in Fig. 5. The div tag is traversed to take out all of its child tags, and the content of each child tag is passed to the appropriate processing module. Unlike the p tag parsing unit, whose result always has the key "p", the div tag parsing unit returns an independent key-value pair whose key is determined by the class of the div tag. For example, the extraction result of <div><div class="ol"><li>text</li><li>text</li></div></div> is "ol": [{"li": [...]}, {"li": [...]}], which the usual HTML-to-JSON conversion methods cannot achieve; redundant div tags are removed at the same time.
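One way to realise the div tag parsing unit is sketched below. It assumes that the key comes from the first class token of the div, that a div without a class falls back to the key "div", and that a "redundant" div is one with no class, no text and a single div child; none of these rules is spelled out in the source.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the namespace part from an ElementTree tag name."""
    return tag.split("}")[-1]


def content_of(node):
    """Text nodes become strings; child tags are dispatched to their parsing unit."""
    items = []
    if node.text and node.text.strip():
        items.append(node.text.strip())
    for child in node:
        items.append(dispatch(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_div(node):
    children = list(node)
    tokens = (node.get("class") or "").split()
    has_text = bool(node.text and node.text.strip())
    # A bare wrapper around a single div carries no information: unwrap it.
    if not tokens and not has_text and len(children) == 1 and local(children[0].tag) == "div":
        return parse_div(children[0])
    key = tokens[0] if tokens else "div"
    return {key: content_of(node)}


def dispatch(node):
    if local(node.tag) == "div":
        return parse_div(node)
    return {local(node.tag): content_of(node)}


sample = ET.fromstring('<div><div class="ol"><li>one</li><li>two</li></div></div>')
print(parse_div(sample))
# prints {'ol': [{'li': ['one']}, {'li': ['two']}]}
```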
Next, the main content of each chapter is extracted. The main content consists of several sections, each wrapped in a section tag. Because the structure beneath a section is similar to that of a chapter, the section tag parsing unit can be called recursively. It is similar to the chapter parsing unit; the difference is that it recognizes not only div and p tags, calling the corresponding units for them, but also tags such as figure, ol, ul and pre.
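The recursive section parsing can be sketched as follows. The output shape (`title` and `content` keys) and the per-tag handlers are assumptions, and the p and div handlers are reduced to plain text here to keep the example short.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4"}


def local(tag: str) -> str:
    """Strip the namespace part from an ElementTree tag name."""
    return tag.split("}")[-1]


def text_of(node) -> str:
    return "".join(node.itertext()).strip()


def parse_list(node):
    """ol and ul: one entry per li child."""
    return {local(node.tag): [{"li": text_of(li)} for li in node if local(li.tag) == "li"]}


def parse_figure(node):
    img = next((n for n in node.iter() if local(n.tag) == "img"), None)
    caption = next((n for n in node.iter() if local(n.tag) == "figcaption"), None)
    return {"figure": {
        "src": img.get("src") if img is not None else None,
        "caption": text_of(caption) if caption is not None else None,
    }}


def parse_pre(node):
    """Preformatted text keeps its whitespace."""
    return {"pre": "".join(node.itertext())}


# Hypothetical handler table; p and div are simplified to plain text.
HANDLERS = {
    "p": lambda n: {"p": text_of(n)},
    "div": lambda n: {"div": text_of(n)},
    "ol": parse_list,
    "ul": parse_list,
    "figure": parse_figure,
    "pre": parse_pre,
}


def parse_section(node):
    """Parse a section; nested sections are handled by recursion."""
    title, content = None, []
    for child in node:
        name = local(child.tag)
        if name in HEADINGS and title is None:
            title = text_of(child)
        elif name == "section":
            content.append(parse_section(child))
        elif name in HANDLERS:
            content.append(HANDLERS[name](child))
    return {"section": {"title": title, "content": content}}


SAMPLE_SECTION = """<section>
  <h2>Install</h2>
  <p>Run the installer.</p>
  <ol><li>Download</li><li>Unpack</li></ol>
  <pre>pip install example</pre>
  <figure><img src="images/fig1.png" alt=""/><figcaption>Figure 1</figcaption></figure>
  <section><h3>Notes</h3><p>Optional.</p></section>
</section>"""

parsed = parse_section(ET.fromstring(SAMPLE_SECTION))
```

Tags that are not in `HANDLERS` are silently skipped in this sketch; a fuller implementation would probably log or preserve them.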
The back-matter (postscript) part is not processed, because its information value is considered low.
A module for converting the semantic JSON back into html has also been developed alongside the invention. It converts the JSON data into html files without loss of information, and the resulting files are simpler than the html files of the source EPUB, which reflects the simplicity of the semantic JSON format and the ease of converting it into other file formats.
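The reverse conversion is not described in detail, so the following is only a sketch of how such a module could work, assuming the JSON uses the shape {tag: content} with strings, lists and nested dictionaries as content. The function name `json_to_html` is hypothetical.

```python
from html import escape


def json_to_html(node) -> str:
    """Render a {tag: content} structure back into an HTML fragment."""
    if isinstance(node, str):
        return escape(node)  # text nodes are escaped
    if isinstance(node, list):
        return "".join(json_to_html(item) for item in node)
    # dict: each key is used as the tag name, each value as its content
    return "".join(
        f"<{tag}>{json_to_html(content)}</{tag}>" for tag, content in node.items()
    )


data = {"p": ["text", {"span": ["text"]}, {"code": "text"}]}
print(json_to_html(data))
# prints <p>text<span>text</span><code>text</code></p>
```

Because keys are emitted directly as tag names, a key derived from a class value such as "ol" comes back as an `<ol>` element; attributes and whitespace between fragments are not reconstructed in this sketch.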
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (8)

1. An EPUB file parsing method, in which a decompression module is called to rename the EPUB file to a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:
(1) Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed;
(2) Parsing the main body content of the book: the main body content is contained in xml/html files, which a page processing module classifies; the classification obtains the epub-type attribute and the data-type attribute of the root tag of each file and, according to their values, divides the files into a cover, a front-matter (preamble) part, a chapter part and a back-matter (postscript) part; a cover parsing unit, a front-matter parsing unit and a chapter parsing unit are then called respectively to semanticize the first three parts; the back-matter part is not processed;
(3) Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them.
2. The EPUB file parsing method of claim 1, wherein parsing of the opf file is divided into two parts, one parsing the book meta information and the other parsing the book media file paths.
3. The EPUB file parsing method of claim 2, wherein the meta information of the book is contained in a tag named metadata, and the tag information to be parsed is as follows:
[Table (images FDA0002868756940000011 and FDA0002868756940000021): meta-information name, tag name and attribute of each metadata tag to be parsed]
The meta-information parsing module is called to build the opf file into a document tree; the metadata and manifest tags are traversed, the values of the text nodes corresponding to the meta-information tags are extracted and stored in a predefined semantic JSON structure, and the paths of the xml and html files are stored as instance variables.
4. The EPUB file parsing method of claim 1, wherein the cover parsing unit directly locates the cover picture in the file, extracts it, stores it in a designated folder and names it by its md5 value; specifically, the img tag is searched for in the document tree, the picture is located in the picture directory through the src attribute of the img tag, and the picture is moved and renamed with its computed md5 value.
5. The EPUB file parsing method of claim 1, wherein the front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute; five kinds of tags are parsed according to their class attribute values, containing the book title, copyright, dedication, preface and acknowledgment information respectively; parsing proceeds by recursive traversal, converting each tag into the predefined semantic JSON according to its structure.
6. The EPUB file parsing method according to claim 1, wherein the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction; the first part is the title and introductory text of each chapter, where the title is an h1-h4 tag and is extracted by locating the tag by name and taking its text node value, and the introductory text resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used for it: the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text; the p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and are also the core of the chapter parsing unit.
7. The EPUB file parsing method according to claim 6, wherein the p tag parsing unit works as follows: the child nodes are traversed; a child that is a text node is stored in a list, and otherwise the processing unit corresponding to the child's tag is called to process it; finally the results are assembled into key-value data.
8. The EPUB file parsing method according to claim 6 or 7, wherein the div tag parsing unit works as follows: the div tag is traversed to take out all of its child tags, and the content of each child tag is passed to the appropriate processing module; unlike the p tag parsing unit, whose result always has the key "p", the div tag parsing unit returns an independent key-value pair whose key is determined by the class of the div tag, which the usual HTML-to-JSON conversion methods cannot achieve; redundant div tags are removed at the same time.
CN202011590270.4A 2020-12-29 2020-12-29 A method for parsing EPUB files Active CN112632959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011590270.4A CN112632959B (en) 2020-12-29 2020-12-29 A method for parsing EPUB files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011590270.4A CN112632959B (en) 2020-12-29 2020-12-29 A method for parsing EPUB files

Publications (2)

Publication Number Publication Date
CN112632959A (en) 2021-04-09
CN112632959B (en) 2023-09-01

Family

ID=75285911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011590270.4A Active CN112632959B (en) 2020-12-29 2020-12-29 A method for parsing EPUB files

Country Status (1)

Country Link
CN (1) CN112632959B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114372028A (en) * 2022-01-13 2022-04-19 上海呵呵呵文化传播有限公司 File processing method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
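Ye Lan (叶兰): "The New E-book Specification EPUB 3.0 and Its Applications", Library Journal (图书馆杂志), no. 08, pages 53-59 *
Tang Guangqian (唐光前): "An Analysis of the Open eBook Publication Structure (OEBPS) Specification", Library and Information Service (图书情报工作), no. 05, pages 58-62 *
Chi Liang (迟亮): "Research on EPUB 3.1 Digital Publishing Technology", Computer Knowledge and Technology (电脑知识与技术), no. 19, pages 239-240 *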

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114372028A (en) * 2022-01-13 2022-04-19 上海呵呵呵文化传播有限公司 File processing method and device
CN114372028B (en) * 2022-01-13 2025-06-06 上海呵呵呵文化传播有限公司 File processing method and device

Also Published As

Publication number Publication date
CN112632959B (en) 2023-09-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant