CN112632959A

CN112632959A - EPUB file analysis method

Info

Publication number: CN112632959A
Application number: CN202011590270.4A
Authority: CN
Inventors: 黄希希; 蒋芹; 佘晓龙; 徐磊
Original assignee: Hubei University
Current assignee: Hubei University
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-09
Anticipated expiration: 2040-12-29
Also published as: CN112632959B

Abstract

The invention relates to an EPUB file parsing method comprising the following steps. Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed. Parsing the book body content: the book body content is contained in xml/html files, which a page processing module classifies into a cover, a front-matter (preamble) part, a chapter part and a postscript part; a cover parsing unit, a front-matter parsing unit and a chapter parsing unit are then called respectively to semanticize the first three parts, while the postscript part is not processed. Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them. The method can parse EPUB data into JSON format data.

Description

EPUB file analysis method

Technical Field

The invention relates to the technical field of file reading methods, in particular to the semantization of the EPUB format, and specifically to restructuring a book in EPUB format and converting it into a JSON file.

Background

EPUB (electronic publication) is a free and open file format standard for reflowable content; that is, the text can be displayed in the manner best suited to reading, depending on the characteristics of the reading device. Internally, an EPUB file uses XHTML or DTBook (an XML standard proposed by the DAISY Consortium) to represent text, and packages its contents in a zip-compressed archive. The EPUB format optionally includes Digital Rights Management (DRM) functionality. However, EPUB e-book data currently contains a great deal of redundancy and is not easy to extract and reuse. If the book information can be preserved while being converted into JSON, a format that is highly extensible and simple to parse, it can then be easily converted into other file types, which offers broad application prospects for converting and displaying e-book information.

The existing EPUB parsing scheme parses an EPUB file using epublib-core-latest, and then generates a file header converter, a page converter and an object converter from the parsed information. The file header converter obtains the author and title of the file, the page converter obtains the content of the file, and the object converter integrates the output of the file header converter and the page converter.

The prior art has the following defects. The data obtained from epublib parsing is rich text content in html format. If plain text is needed, the text content of the tags can be obtained through jsoup, but the resulting plain text is poorly laid out. In addition, generating the file header converter, the page converter and the object converter from the parsed information involves overly complicated steps and offers little extensibility.

Disclosure of Invention

Purpose of the invention: to solve the above problems, the invention provides an EPUB file parsing method that addresses the large amount of redundancy in EPUB e-book data and the difficulty of extracting and using that data.

In order to solve the problems, the technical scheme adopted by the invention is as follows:

An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:

(1) parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed;

(2) parsing the book body content: the book body content is contained in xml/html files, which are classified by a page processing module; the classification reads the epub-type attribute and the data-type attribute of each file's root tag and, according to their values, divides the files into a cover, a front-matter (preamble) part, a chapter part and a postscript part; a cover parsing unit, a front-matter parsing unit and a chapter parsing unit are then called respectively to semanticize the first three parts; the postscript part is not processed;

(3) extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them.
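Step (1) can be illustrated with a minimal Python sketch that reads the root file path from container.xml. It assumes the standard OCF container namespace; the function name `read_rootfile_path` and the sample document are hypothetical and are not taken from the invention's own code.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the OCF container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def read_rootfile_path(container_xml: str) -> str:
    """Return the full-path attribute of the first <rootfile> element."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.attrib["full-path"]


# Hypothetical sample; a real container.xml is read from the unpacked book.
SAMPLE_CONTAINER = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

opf_path = read_rootfile_path(SAMPLE_CONTAINER)
```

The returned path is relative to the root of the unpacked archive, so the opf file is found by joining it with the extraction directory.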

Further, OPF file parsing is divided into two parts: parsing the book meta information and parsing the paths of the book media files.

Further, the meta information of the book is contained in the tag named metadata; the part of the tag information to be parsed is listed in the table in the Detailed Description.

The meta-information parsing module is called to build the opf file into a document tree and to traverse the metadata and manifest tags. The values of the text nodes corresponding to the meta-information tags are extracted and stored in a predefined semantic JSON structure, and the paths of the xml and html files are stored as instance variables.

Further, the cover parsing unit directly locates and extracts the cover picture in the file, stores it in a designated folder and names it by its md5 value. Specifically, it searches the document tree for the img tag, locates the picture in the image directory through the src attribute of the img tag, moves the picture, and calculates its md5 value for renaming.

Further, the front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute. It parses five types of tags, with class BookTitlePage, CopyrightPage, Dedication, Preface and BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgment information respectively. Parsing proceeds by recursive traversal, and the tags are converted into the predefined semantic JSON according to their structure.

Further, the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction. The first part is the title and the introductory phrase of each chapter. The title consists of h1-h4 tags and is extracted by locating the tag by name and taking its text node value. The introductory phrase resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, namely the p tag parsing unit or the div tag parsing unit, where the p tag represents simple text and the div tag represents more complex text. The p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and are also the core of the chapter parsing unit.

Further, the specific steps of the p tag parsing unit are as follows: the child nodes are traversed first; if a child node is a text node, it is stored in a list; otherwise the processing unit corresponding to its tag is called to process it; finally the result is built as key-value data.

Further, the specific steps of the div tag parsing unit are as follows: the div tag is traversed to take out all the sub-tags beneath it, and the content of each sub-tag is passed to the appropriate processing module. Unlike the p tag parsing unit, whose result always uses the key "p", each result here is returned as an independent key-value pair whose key depends on its class, which ordinary Html-to-JSON conversion cannot achieve. Redundant div tags are removed at the same time.

The technical scheme provided by the invention has at least the following beneficial effects: it makes the information in EPUB e-books usable; by converting the book into a semantic JSON file it preserves the book information to the greatest extent while minimizing storage cost; and because the JSON format is highly extensible and simple to parse, the result is easy to convert into other file types.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

Fig. 1 is a parsing flowchart of an EPUB file parsing method disclosed in the embodiment of the present invention.

Fig. 2 is a main block diagram of an EPUB file parsing method disclosed in the embodiment of the present invention.

Fig. 3 is a schematic diagram of an xhtml structure of a chapter parsing unit disclosed in the embodiment of the present invention.

Fig. 4 is a flowchart of parsing a P tag parsing unit according to an embodiment of the present invention.

Fig. 5 is a parsing flowchart of the div tag parsing unit according to the embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

The invention relates to an EPUB file parsing method; the parsing steps are shown in fig. 1 and the main modules in fig. 2. First, a decompression module is called to convert the EPUB file into a zip archive and decompress it, and the decompressed files are then parsed comprehensively.
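The decompression step can be sketched in Python with the standard `zipfile` module. This is only an illustration under stated assumptions: the function name `unpack_epub` and the output layout are hypothetical, and the file is copied rather than renamed in place so that the source EPUB stays untouched.

```python
import shutil
import zipfile
from pathlib import Path


def unpack_epub(epub_path: str, out_dir: str) -> Path:
    """Copy book.epub to book.zip, then extract it into out_dir/book/."""
    src = Path(epub_path)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    # An EPUB is a zip archive, so changing the extension is sufficient.
    zip_copy = target / (src.stem + ".zip")
    shutil.copyfile(src, zip_copy)

    extracted = target / src.stem
    with zipfile.ZipFile(zip_copy) as archive:
        archive.extractall(extracted)
    return extracted
```

A call such as `unpack_epub("book.epub", "work")` would leave the unpacked files under `work/book/`, ready for the parsing steps that follow.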

First, the meta information is parsed. After decompression, an e-book complying with the EPUB 3.0 standard generally contains three kinds of entries: the META-INF directory, the OEBPS (OPS) directory, and the mimetype file. The method first calls the meta-information parsing module to parse the container.xml file in the META-INF directory and read the root file (rootfile) path of the e-book. The opf file is then located according to the root file path, and the meta information in the opf file is parsed. OPF file parsing is divided into two parts: parsing the book meta information and parsing the paths of the book media files. The meta information of the book is mainly contained in the tag named metadata; the part of the tag information to be parsed is as follows:

Name of meta information	Tag name	Attribute
ISBN	identifier
Book title	title	id=pub-title
Book subtitle	title	id=pub-subtitle
Language of the book	language
Author	creator
Publisher	publisher
Copyright information	rights

The meta-information parsing module is called to build the opf file into a document tree and to traverse the metadata and manifest tags. The values of the text nodes corresponding to the meta-information tags are extracted and stored in a predefined semantic JSON structure, and the paths of the xml and html files are stored as instance variables.
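The traversal of the metadata and manifest tags might look like the following Python sketch. The tag names follow the table above and the usual OPF and Dublin Core namespaces; the JSON field names (`isbn`, `author` and so on), the function name `parse_opf` and the sample document are assumptions made for illustration, since the text does not give the predefined JSON structure.

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

# Tag names follow the table above; the JSON field names are hypothetical.
FIELD_BY_TAG = {
    "identifier": "isbn",
    "language": "language",
    "creator": "author",
    "publisher": "publisher",
    "rights": "rights",
}
PAGE_SUFFIXES = (".xhtml", ".html", ".xml")


def parse_opf(opf_xml: str):
    """Return (meta, pages): a metadata dict and the list of page hrefs."""
    root = ET.fromstring(opf_xml)
    meta = {}
    metadata = root.find("opf:metadata", NS)
    for el in (metadata if metadata is not None else []):
        tag = el.tag.split("}")[-1]  # drop the "{namespace}" prefix
        text = (el.text or "").strip()
        if tag == "title":
            # Title and subtitle share a tag and differ only by id.
            field = "subtitle" if el.get("id") == "pub-subtitle" else "title"
            meta[field] = text
        elif tag in FIELD_BY_TAG:
            meta[FIELD_BY_TAG[tag]] = text
    pages = [
        item.get("href")
        for item in root.iterfind("opf:manifest/opf:item", NS)
        if item.get("href", "").lower().endswith(PAGE_SUFFIXES)
    ]
    return meta, pages


# Hypothetical sample package document.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier>9780000000000</dc:identifier>
    <dc:title id="pub-title">Example Book</dc:title>
    <dc:title id="pub-subtitle">A Subtitle</dc:title>
    <dc:language>en</dc:language>
    <dc:creator>A. Author</dc:creator>
  </metadata>
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="ch01.xhtml" media-type="application/xhtml+xml"/>
    <item id="img" href="images/cover.jpg" media-type="image/jpeg"/>
  </manifest>
</package>"""

meta, pages = parse_opf(SAMPLE_OPF)
```

Here the page list is returned to the caller; the text instead stores the paths as instance variables, which suggests the original module is class based.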

Second, the content of the book body is parsed. The xml/html files are first classified by the page processing module: the epub-type attribute and the data-type attribute of each file's root tag are read, and the file is classified according to their values. The book content is thereby divided into four parts, namely a cover, a front-matter (preamble) part, a chapter part and a postscript part, and a different module is then called to process each part. This classification is a key innovation of the book-content semantic parsing method.
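A possible shape for the classification step is sketched below in Python. The text does not list the attribute values that map to each part, so the `CATEGORY_BY_VALUE` table is purely hypothetical. The sketch also accepts the namespaced `epub:type` attribute used by EPUB 3 alongside the `epub-type` spelling used in this description, and it scans from the root tag downward until it finds a recognised value, which is a looser rule than inspecting the root tag alone.

```python
import xml.etree.ElementTree as ET

# "epub-type" as written in the text, plus the namespaced epub:type form.
TYPE_KEYS = ("epub-type", "{http://www.idpf.org/2007/ops}type", "data-type")

# Hypothetical value-to-part table; real books may use other values.
CATEGORY_BY_VALUE = {
    "cover": "cover",
    "frontmatter": "frontmatter",
    "chapter": "chapter",
    "bodymatter": "chapter",
    "backmatter": "postscript",
}


def classify_page(xhtml: str) -> str:
    """Return the part a page belongs to, or "unknown" if none matches."""
    root = ET.fromstring(xhtml)
    for el in root.iter():  # the root element is visited first
        for key in TYPE_KEYS:
            for value in el.get(key, "").split():
                if value in CATEGORY_BY_VALUE:
                    return CATEGORY_BY_VALUE[value]
    return "unknown"
```

Pages classified as `postscript` would simply be skipped, matching the decision not to process that part.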

The cover parsing unit directly locates and extracts the cover picture in the file, stores it in a designated folder and names it by its md5 value. Specifically, it searches the document tree for the img tag, locates the picture in the image directory through the src attribute of the img tag, moves the picture, and calculates its md5 value for renaming.
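The cover step can be sketched in Python as follows. The function name `extract_cover` and the choice to keep the original file extension are assumptions; the text only states that the picture is moved and renamed by its md5 value.

```python
import hashlib
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Optional


def extract_cover(cover_page: Path, out_dir: Path) -> Optional[Path]:
    """Move the first <img> of the cover page to out_dir, named by md5."""
    root = ET.parse(cover_page).getroot()
    for el in root.iter():
        if el.tag.split("}")[-1] == "img" and el.get("src"):
            # src is resolved relative to the page that references it.
            image = (cover_page.parent / el.get("src")).resolve()
            digest = hashlib.md5(image.read_bytes()).hexdigest()
            out_dir.mkdir(parents=True, exist_ok=True)
            target = out_dir / (digest + image.suffix)
            shutil.move(str(image), str(target))
            return target
    return None  # no img tag found on the page
```

Naming by content hash means the same cover extracted twice lands on the same file name, so repeated runs do not accumulate duplicates.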

The front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute. It mainly parses five types of tags, with class BookTitlePage, CopyrightPage, Dedication, Preface and BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgment information respectively. Parsing proceeds by recursive traversal, and the tags are converted into the predefined semantic JSON according to their structure.
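The front-matter step might be sketched in Python as below. The class names are read from the list of five tag types in the text and their exact spelling in real books may differ; the JSON field names, the function names and the sample markup are hypothetical. For brevity the recursive traversal here only collects text, whereas the text describes converting each tag according to its structure.

```python
import xml.etree.ElementTree as ET

# Class names as read from the text; JSON field names are hypothetical.
FIELD_BY_CLASS = {
    "BookTitlePage": "title_page",
    "CopyrightPage": "copyright",
    "Dedication": "dedication",
    "Preface": "preface",
    "BookAcknowledgments": "acknowledgments",
}


def collect_text(el) -> list:
    """Recursively gather the non-empty text nodes under el, in order."""
    parts = []
    if el.text and el.text.strip():
        parts.append(el.text.strip())
    for child in el:
        parts.extend(collect_text(child))
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


def parse_frontmatter(xhtml: str) -> dict:
    root = ET.fromstring(xhtml)
    front = next(
        (el for el in root.iter() if el.get("epub-type") == "frontmatter"),
        None,
    )
    result = {}
    if front is None:
        return result
    for el in front.iter():
        field = FIELD_BY_CLASS.get(el.get("class", ""))
        if field is not None:
            result[field] = collect_text(el)
    return result


# Hypothetical sample markup.
SAMPLE_FRONT = """<section epub-type="frontmatter">
  <div class="BookTitlePage"><h1>Example Book</h1></div>
  <div class="Dedication"><p>For my family.</p></div>
</section>"""

front_json = parse_frontmatter(SAMPLE_FRONT)
```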

The structure of the chapter-type xhtml to be parsed by the chapter parsing unit is shown in fig. 3, and the invention splits it into two parts for extraction. The first part is the title and the introductory phrase of each chapter. The title generally consists of h1-h4 tags and is extracted by locating the tag by name and taking its text node value. The introductory phrase resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, for example the p tag parsing unit or the div tag parsing unit, where the p tag represents simple text and the div tag represents more complex text. The p tag parsing unit and the div tag parsing unit are the entry points to almost all the remaining parsing units and are also the core of the chapter parsing unit.
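Title extraction by tag name can be sketched in a few lines of Python. The function name `extract_titles` and the output shape (one small dict per heading) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

HEADING_TAGS = {"h1", "h2", "h3", "h4"}


def local_name(el) -> str:
    """Tag name without the "{namespace}" prefix."""
    return el.tag.split("}")[-1]


def extract_titles(xhtml: str) -> list:
    """Return [{tag: text}, ...] for every h1-h4 in document order."""
    root = ET.fromstring(xhtml)
    titles = []
    for el in root.iter():
        if local_name(el) in HEADING_TAGS:
            # Join nested text nodes and collapse runs of whitespace.
            text = " ".join("".join(el.itertext()).split())
            titles.append({local_name(el): text})
    return titles
```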

The specific steps of the p tag parsing unit are shown in fig. 4. The child nodes are traversed first; if a child node is a text node, it is stored in a list; otherwise the processing unit corresponding to its tag is called to process it. Finally the result is built as key-value data, for example {"p": [text, {"span": [text]}, {"code": [text]}]}, which is highly extensible. Note that although the results for different tags have almost the same format, the tags themselves are processed by different methods.
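The p tag traversal can be sketched in Python as follows. The `HANDLERS` table stands in for the per-tag processing units mentioned in the text and is left empty here, so every child tag falls back to one generic handler; all names are hypothetical, and stripping whitespace around text nodes is a simplification.

```python
import xml.etree.ElementTree as ET


def local_name(el) -> str:
    return el.tag.split("}")[-1]


def parse_inline(el) -> dict:
    """Generic handler: {tag: [text nodes and nested results]}."""
    return {local_name(el): parse_children(el)}


# Hypothetical per-tag overrides, e.g. HANDLERS["code"] = parse_code.
HANDLERS = {}


def parse_children(el) -> list:
    items = []
    if el.text and el.text.strip():
        items.append(el.text.strip())  # leading text node
    for child in el:
        handler = HANDLERS.get(local_name(child), parse_inline)
        items.append(handler(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())  # text after the child tag
    return items


def parse_p(p_el) -> dict:
    """The key of the result is always "p"."""
    return {"p": parse_children(p_el)}


sample = ET.fromstring("<p>Hello <span>world</span> and <code>x = 1</code>.</p>")
p_json = parse_p(sample)
```

Because each handler returns the same `{tag: [...]}` shape, a new tag type can be supported by registering one more function without touching the traversal.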

The specific steps of the div tag parsing unit are shown in fig. 5. The div tag is traversed to take out all the sub-tags beneath it, and the content of each sub-tag is passed to the appropriate processing module. Unlike the p tag parsing unit, whose result always uses the key "p", each result here is returned as an independent key-value pair whose key depends on its class. For example, nested div tags that wrap an ordered list of text items yield a result of the form "ol": [{"li": [...]}, {"li": [...]}], which ordinary Html-to-JSON conversion cannot achieve, and the redundant div tags are removed.
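One way to realise this unwrapping is sketched below in Python. It assumes that the key of each returned pair is the tag name of the sub-tag, as the "ol" example suggests, and that every div wrapper is redundant; both are interpretations rather than details confirmed by the text, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(el) -> str:
    return el.tag.split("}")[-1]


def to_json(el) -> dict:
    """Generic conversion: {tag: [text nodes and converted children]}."""
    items = []
    if el.text and el.text.strip():
        items.append(el.text.strip())
    for child in el:
        items.extend(parse_node(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return {local_name(el): items}


def parse_node(el) -> list:
    """Dispatch: div wrappers are unwrapped, other tags are converted."""
    if local_name(el) == "div":
        return parse_div(el)
    return [to_json(el)]


def parse_div(div_el) -> list:
    """Drop the div itself and return its sub-tags as separate pairs."""
    results = []
    for child in div_el:
        results.extend(parse_node(child))
    return results


nested = ET.fromstring(
    '<div><div class="list"><ol><li>one</li><li>two</li></ol></div></div>'
)
div_json = parse_div(nested)
```

In this sketch both div layers disappear and only the list survives, so the output carries the content without the layout-only wrappers.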

The main content of each chapter is then extracted. The main content consists of several sections, each wrapped in a section tag. Because the structure beneath a section is similar to that of a chapter, the section tag parsing unit can be called recursively to process it. The section tag parsing unit is similar to the chapter parsing unit; the difference is that it not only recognizes div and p tags and calls the corresponding units, but also recognizes and parses tags such as figure, ol, ul and pre.
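The recursive structure of the section tag parsing unit can be sketched in Python. To keep the example short, each recognised tag is reduced to its flattened text instead of being passed to a dedicated unit; the function names and the output keys `section` and `title` are hypothetical.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4"}
BLOCK_TAGS = {"p", "div", "figure", "ol", "ul", "pre"}


def local_name(el) -> str:
    return el.tag.split("}")[-1]


def flat_text(el, keep_whitespace: bool = False) -> str:
    text = "".join(el.itertext())
    return text if keep_whitespace else " ".join(text.split())


def parse_section(section_el) -> dict:
    content = []
    for child in section_el:
        name = local_name(child)
        if name == "section":
            # Nested sections have the same structure, so recurse.
            content.append(parse_section(child))
        elif name in HEADINGS:
            content.append({"title": flat_text(child)})
        elif name in BLOCK_TAGS:
            # Stand-in for the dedicated p, div, figure and list units;
            # whitespace inside <pre> is preserved.
            content.append({name: flat_text(child, name == "pre")})
        # Any other tag is ignored in this sketch.
    return {"section": content}


chapter = ET.fromstring(
    "<section><h2>1.1 Setup</h2><p>Install it.</p>"
    "<section><h3>Details</h3><pre>pip install x</pre></section>"
    "<aside>skipped</aside></section>"
)
section_json = parse_section(chapter)
```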

The postscript part is not processed by the invention, because its information value is low.

The invention also provides a companion module that converts the semantic JSON into html. It can convert the JSON data into html files without information loss, and the result is simpler than the html files of the source EPUB, which reflects the simplicity of the semantic JSON format and the ease of converting it into other file formats.
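A reverse conversion can be sketched in Python for JSON of the `{tag: [text or nested objects]}` shape used in the p tag example. The function name `json_to_html` is hypothetical, and this sketch makes no claim to be lossless: attributes and whitespace between inline elements are not represented in that JSON shape.

```python
import html


def json_to_html(node) -> str:
    """Render a str or a {tag: [children]} dict as an html fragment."""
    if isinstance(node, str):
        return html.escape(node)  # escape &, <, > and quotes in text
    parts = []
    for tag, children in node.items():
        inner = "".join(json_to_html(child) for child in children)
        parts.append(f"<{tag}>{inner}</{tag}>")
    return "".join(parts)


fragment = json_to_html({"ol": [{"li": ["one"]}, {"li": ["a < b"]}]})
```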

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims

1. An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:

2. The EPUB file parsing method of claim 1, wherein OPF file parsing is divided into two parts, one part parsing the book meta information and the other part parsing the paths of the book media files.

3. The EPUB file parsing method of claim 2, wherein the meta information of the book is contained in a tag named metadata, and the part of the tag information to be parsed is as follows:

4. The EPUB file parsing method of claim 1, wherein the cover parsing unit directly locates and extracts the cover picture in the file, stores it in a designated folder and names it by its md5 value; specifically, the img tag is searched for in the document tree, the picture is located in the image directory through the src attribute of the img tag, the picture is moved, and its md5 value is calculated for renaming.

5. The EPUB file parsing method of claim 1, wherein the front-matter (preamble) parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it, and converts them into different JSON fields according to the value of the class attribute; five types of tags are parsed according to the class attribute value, containing the book title, copyright, dedication, preface and acknowledgment information respectively; parsing proceeds by recursive traversal, and the tags are converted into the predefined semantic JSON according to their structure.

6. The EPUB file parsing method according to claim 1, wherein the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction; the first part is the title and the introductory phrase of each chapter, where the title consists of h1-h4 tags and is extracted by locating the tag by name and taking its text node value, and the introductory phrase resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, namely the p tag parsing unit or the div tag parsing unit, where the p tag represents simple text and the div tag represents more complex text; the p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and are also the core of the chapter parsing unit.

7. The EPUB file parsing method according to claim 6, wherein the specific steps of the p tag parsing unit include: the child nodes are traversed first; if a child node is a text node, it is stored in a list; otherwise the processing unit corresponding to its tag is called to process it; finally the result is built as key-value data.

8. The EPUB file parsing method according to claim 6 or 7, wherein the specific steps of the div tag parsing unit include: the div tag is traversed to take out all the sub-tags beneath it, and the content of each sub-tag is passed to the appropriate processing module; unlike the p tag parsing unit, whose result always uses the key "p", each result is returned as an independent key-value pair whose key depends on its class, which ordinary Html-to-JSON conversion cannot achieve; redundant div tags are removed at the same time.