CN112632959B

CN112632959B - A method for parsing EPUB files

Info

Publication number: CN112632959B
Application number: CN202011590270.4A
Authority: CN
Inventors: 黄希希; 蒋芹; 佘晓龙; 徐磊
Original assignee: Hubei University
Current assignee: Hubei University
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2023-09-01
Anticipated expiration: 2040-12-29
Also published as: CN112632959A

Abstract

The application relates to an EPUB file parsing method comprising the following steps. Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF directory and read the root file path of the e-book; the opf file is located from the root file path and the meta information in it is parsed. Parsing the main content of the book: the main content is contained in xhtml/html files, which a page processing module classifies into a cover, a preface part, a chapter part and a postscript part; a cover parsing unit, a preface parsing unit and a chapter parsing unit are then called to convert these three parts into semantic form, while the postscript part is not processed. Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them. The method parses EPUB data into JSON data.

Description

A method for parsing EPUB files

Technical field

The invention relates to the technical field of file reading methods, and in particular to the semantic conversion of the EPUB format: books in EPUB format are restructured and converted into JSON files.

Background

EPUB (Electronic Publication) is a free and open standard for "reflowable" content, meaning that the text can be displayed in whatever way best suits the reading device. Internally, an EPUB file uses XHTML or DTBook (an XML standard proposed by the DAISY Consortium) to represent text, and packages its contents as a zip archive. The format also includes optional digital rights management (DRM) features. However, EPUB e-book data currently contains a large amount of redundancy and is difficult to extract and reuse. If a book could be converted into JSON, a format that is extensible and simple to parse, while its information is preserved, it could then easily be converted into other file types, which has broad application prospects for the conversion and display of e-book information.

The existing EPUB parsing solution parses EPUB files using epublib-core-latest.jar. From the parsed information it generates a file header converter, a page converter and an object converter: the file header converter obtains the author and title of the book, the page converter obtains the content of the file, and the object converter integrates the information produced by the file header converter and the page converter.

The disadvantage of this prior art is that the data obtained by parsing with epublib is rich text in HTML format. If plain text is required, the text content of the tags can be obtained with jsoup, but the layout of the resulting plain text becomes disordered. In addition, generating a file header converter, a page converter and an object converter from the parsed information involves too many steps and is not very extensible.

Summary of the invention

The purpose of the invention is to solve the above problems. The invention proposes an EPUB file parsing method that addresses the heavy redundancy of EPUB e-book data and the difficulty of extracting and reusing it, by parsing EPUB data into semantic JSON data that is extensible, compact, expressive and generally easy to convert into formats such as PDF and Word.

To solve the above problems, the technical solution adopted by the invention is as follows:

An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:

(1) Parse the meta information: call the meta-information parsing module to parse the container.xml file in the META-INF directory and read the root file path of the e-book; locate the opf file from the root file path and parse the meta information in the opf file.

(2) Parse the main content of the book: the main content is contained in xhtml/html files. The page processing module classifies the xhtml/html files by reading the epub-type attribute and the data-type attribute of each file's root tag and sorting the files by the values of these attributes into a cover, a preface part, a chapter part and a postscript part. The cover parsing unit, the preface parsing unit and the chapter parsing unit are then called to convert these three parts into semantic form. The postscript part is not processed.

(3) Extract the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and the section tag parsing unit is called to process them.

Further, parsing of the opf file is divided into two parts: parsing the book's meta information, and parsing the paths of the book's media files.

Further, the meta information of the book is contained in the tag named metadata; the tags to be parsed are listed in the table given in the detailed description.

The media file paths of the book are contained in the tag named manifest, where each item tag represents one media file. Only item tags whose media-type attribute is "application/xhtml+xml" are parsed, which yields the paths of all xhtml/html files. To parse the file, the meta-information parsing module builds the opf file into a document tree, traverses the metadata and manifest tags, extracts the text-node value of each meta-information tag and stores it in a predefined semantic JSON structure. The xhtml and html file paths are stored as instance variables.

Further, the cover parsing unit directly locates and extracts the cover image in the cover file, stores it in a specified folder and names it by its MD5 value. Specifically, it finds the img tag in the document tree, uses the tag's src attribute to locate the image in the image directory, moves the image, computes its MD5 value and renames it.

Further, the preface parsing unit locates the tag whose epub-type is frontmatter and traverses all tags beneath it, converting each into a different JSON field according to its class attribute value. Five kinds of tags are parsed, namely those whose class is BookTitlePage, CopyrightPage, Dedication, Preface or BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgments respectively. Parsing is done by recursive traversal, and each tag is converted into predefined semantic JSON according to its structure.

Further, the chapter parsing unit splits the chapter-type xhtml it parses into two parts for extraction. The first part is the title and commentary of each chapter. The title is an h1-h4 tag, which is located by tag name and whose text-node value is extracted. The commentary resembles an ordinary paragraph in EPUB, so it is parsed with a paragraph parsing unit, either the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text. The P tag parsing unit and the div tag parsing unit are the entry points to the other parsing units and form the core of the chapter parsing unit.

Further, the P tag parsing unit works as follows: it traverses the child nodes of the tag; a text node is stored in a list, and any other node is handed to the processing unit for the corresponding tag. The result is assembled into key-value data, a format that is highly extensible.

Further, the div tag parsing unit works as follows: it traverses the div tag to collect all child tags beneath it and passes their content to the appropriate processing modules. Unlike the P tag parsing unit, whose result always uses the key "p", this unit returns each child as a separate key-value pair chosen according to its class. General-purpose HTML-to-JSON methods cannot do this, and the redundant div tags are removed at the same time.

The beneficial effects of the above technical solution include at least the following: the invention addresses the problem of reusing EPUB e-book information. Converting a book into a semantic JSON file preserves as much of the book's information as possible while greatly reducing storage cost, and because JSON is extensible and simple to parse, the result can easily be converted into other file types.

Additional features and advantages of the invention will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description, the claims and the accompanying drawings.

The technical solution of the invention is described in further detail below with reference to the accompanying drawings and embodiments.

Description of drawings

The accompanying drawings provide a further understanding of the invention and constitute a part of the description. Together with the embodiments they serve to explain the invention and do not limit it. In the drawings:

Fig. 1 is a parsing flowchart of the EPUB file parsing method disclosed in an embodiment of the invention.

Fig. 2 is a diagram of the main modules of the EPUB file parsing method disclosed in an embodiment of the invention.

Fig. 3 is a schematic diagram of the xhtml structure handled by the chapter parsing unit disclosed in an embodiment of the invention.

Fig. 4 is a parsing flowchart of the P tag parsing unit disclosed in an embodiment of the invention.

Fig. 5 is a parsing flowchart of the div tag parsing unit disclosed in an embodiment of the invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments, the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

The invention relates to an EPUB file parsing method; the parsing steps are shown in Fig. 1 and the main modules in Fig. 2. First, the decompression module is called to rename the EPUB file as a zip archive and decompress it. The decompressed files are then fully parsed.
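The decompression step can be sketched as follows. This is a minimal illustration, not the patent's decompression module: the function name `extract_epub` and its parameters are hypothetical, and it relies only on Python's standard `zipfile` module.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, out_dir):
    """Unpack an EPUB (a zip archive) into out_dir and return that directory."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    # zipfile looks at the file contents rather than the extension, so the
    # rename from .epub to .zip described in the text is not needed here.
    with zipfile.ZipFile(epub_path) as archive:
        archive.extractall(target)
    return target
```

A call such as `extract_epub("book.epub", "book_unpacked")` would leave the META-INF directory, the mimetype file and the content directory under `book_unpacked`.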

Step 1: parse the meta information. An e-book that follows the EPUB 3.0 standard generally contains three kinds of entries after decompression: the META-INF directory, the OEBPS (OPS) directory and the mimetype file. The invention first calls the meta-information parsing module to parse the container.xml file in the META-INF directory and read the path of the e-book's root file (rootfile). It then locates the opf file from the root file path and parses the meta information in the opf file. Parsing of the opf file is divided into two parts: parsing the book's meta information, and parsing the paths of the book's media files. The meta information of a book is mainly contained in the tag named metadata; the tags to be parsed are as follows:

Meta information | Tag name | Attribute
ISBN | Identifier | -
Book title | Title | Id=pub-title
Book subtitle | Title | Id=pub-subtitle
Language of the book | Language | -
Author | Creator | -
Publisher | Publisher | -
Copyright information | Rights | -

The media file paths of the book are mainly contained in the tag named manifest, where each item tag represents one media file. Only item tags whose media-type attribute is "application/xhtml+xml" are parsed, which yields the paths of all xhtml/html files. To parse the file, the meta-information parsing module builds the opf file into a document tree, traverses the metadata and manifest tags, extracts the text-node value of each meta-information tag and stores it in a predefined semantic JSON structure. The xhtml and html file paths are stored as instance variables.
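A minimal sketch of this first step, using Python's standard `xml.etree.ElementTree`, might look like the following. The function names and the flat dictionary layout are assumptions made for illustration; the patent does not disclose its predefined JSON structure. The namespace URIs are those used by EPUB container and package documents, where the Dublin Core elements appear in lowercase (`dc:title`, `dc:creator`, and so on).

```python
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def rootfile_path(container_xml):
    """Return the OPF path stored in the full-path attribute of <rootfile>."""
    root = ET.fromstring(container_xml)
    return root.find("c:rootfiles/c:rootfile", NS).get("full-path")


def parse_opf(opf_xml):
    """Return (metadata dict, hrefs of the XHTML items in the manifest)."""
    root = ET.fromstring(opf_xml)
    meta = {}
    metadata = root.find("opf:metadata", NS)
    for el in ([] if metadata is None else metadata):
        if not el.tag.startswith("{" + NS["dc"] + "}"):
            continue  # skip non Dublin Core entries such as <meta>
        name = el.tag.split("}", 1)[1]
        # The table distinguishes title and subtitle by the id attribute.
        if name == "title" and el.get("id") == "pub-subtitle":
            name = "subtitle"
        meta[name] = (el.text or "").strip()
    pages = [
        item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml"
    ]
    return meta, pages
```

The hrefs returned here are relative to the location of the opf file, so a caller would normally join them with the directory part of the root file path.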

Step 2: parse the main content of the book. The main content is contained in xhtml/html files. The page processing module first classifies these files, mainly by reading the epub-type attribute and the data-type attribute of each file's root tag and sorting the files by the values of these attributes into a cover, a preface part, a chapter part and a postscript part. Different modules are then called to process each class. This is the key innovation of the invention: the book content is converted into semantic form in three separate parts.
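The classification step could be sketched as below. Several points are assumptions: that the "epub-type" attribute is the namespaced `epub:type` attribute of EPUB 3, that it may sit on either the `html` or the `body` element, and that the values `cover`, `frontmatter`, `chapter` and `backmatter` map to the four classes. The patent names only `frontmatter` explicitly.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"
XHTML_BODY = "{http://www.w3.org/1999/xhtml}body"

# Hypothetical mapping from attribute values to the four page classes.
CLASSES = {
    "cover": "cover",
    "frontmatter": "preface",
    "chapter": "chapter",
    "backmatter": "postscript",
}


def classify_page(xhtml):
    """Classify a page by the epub:type or data-type of its html/body element."""
    root = ET.fromstring(xhtml)
    for el in (root, root.find(XHTML_BODY)):
        if el is None:
            continue
        for attr in (EPUB_TYPE, "data-type"):
            # Attribute values may hold several space-separated tokens.
            for token in (el.get(attr) or "").split():
                if token in CLASSES:
                    return CLASSES[token]
    return "unknown"
```

A dispatcher would then send "cover", "preface" and "chapter" pages to their parsing units and skip "postscript" pages.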

The cover parsing unit directly locates and extracts the cover image in the cover file, stores it in a specified folder and names it by its MD5 value. Specifically, it finds the img tag in the document tree, uses the tag's src attribute to locate the image in the image directory, moves the image, computes its MD5 value and renames it.
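A sketch of the cover step under stated assumptions: the cover page is well-formed XHTML containing an `img` element, and the `src` attribute is a path relative to the cover page. The function name `save_cover` is hypothetical, and the sketch copies the image rather than moving it so that the unpacked book stays intact.

```python
import hashlib
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path

XHTML_IMG = "{http://www.w3.org/1999/xhtml}img"


def save_cover(cover_page, out_dir):
    """Copy the first <img> of the cover page into out_dir, named by its MD5."""
    cover_page, out_dir = Path(cover_page), Path(out_dir)
    root = ET.parse(cover_page).getroot()
    img = root.find(f".//{XHTML_IMG}")
    if img is None:
        raise ValueError("no <img> element found in cover page")
    # src is resolved relative to the page that references it.
    source = (cover_page.parent / img.get("src")).resolve()
    digest = hashlib.md5(source.read_bytes()).hexdigest()
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / (digest + source.suffix)
    shutil.copyfile(source, target)
    return target
```

Naming by content hash means the same cover image stored twice ends up as a single file.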

The preface parsing unit locates the tag whose epub-type is frontmatter and traverses all tags beneath it, converting each into a different JSON field according to its class attribute value. Five kinds of tags are mainly parsed, namely those whose class is BookTitlePage, CopyrightPage, Dedication, Preface or BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgments respectively. Parsing is again done by recursive traversal, and each tag is converted into predefined semantic JSON according to its structure.
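The class-to-field mapping can be sketched as follows. The JSON field names are hypothetical, and for brevity the sketch flattens each matching element to plain text, whereas the patent describes converting each tag into structured JSON.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# class value -> JSON field name (the field names are hypothetical)
FIELDS = {
    "BookTitlePage": "title_page",
    "CopyrightPage": "copyright",
    "Dedication": "dedication",
    "Preface": "preface",
    "BookAcknowledgments": "acknowledgments",
}


def parse_frontmatter(xhtml):
    """Collect the text of the five front-matter blocks, keyed by JSON field."""
    root = ET.fromstring(xhtml)
    result = {}
    for front in root.iter():
        if "frontmatter" not in (front.get(EPUB_TYPE) or "").split():
            continue
        for el in front.iter():
            field = FIELDS.get(el.get("class"))
            if field:
                # itertext() walks the whole subtree; whitespace is normalized.
                result[field] = " ".join("".join(el.itertext()).split())
    return result
```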

The structure of the chapter-type xhtml handled by the chapter parsing unit is shown in Fig. 3. The invention splits it into two parts for extraction. The first part is the title and commentary of each chapter. The title is generally an h1-h4 tag, which is located by tag name and whose text-node value is extracted. The commentary resembles an ordinary paragraph in EPUB, so it is parsed with a paragraph parsing unit such as the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text. The P tag parsing unit and the div tag parsing unit are the entry points to almost all other parsing units and form the core of the chapter parsing unit.
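The first part of chapter extraction might be sketched like this. It assumes the heading and commentary are direct children of the element passed in and precede the first `section`; the function name and the output keys are hypothetical, and the commentary is reduced to plain text rather than handed to the p or div parsing units.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
HEADINGS = {XHTML + f"h{level}" for level in range(1, 5)}


def chapter_head(body):
    """Return the chapter title and the text blocks before the first <section>."""
    head = {"title": None, "commentary": []}
    for child in body:
        if child.tag == XHTML + "section":
            break  # the sections belong to the chapter's main content
        text = " ".join("".join(child.itertext()).split())
        if child.tag in HEADINGS and head["title"] is None:
            head["title"] = text
        elif child.tag in (XHTML + "p", XHTML + "div"):
            head["commentary"].append(text)
    return head
```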

For the P tag parsing unit, the detailed steps are shown in Fig. 4. The unit traverses the child nodes of the tag; a text node is stored in a list, and any other node is handed to the processing unit for the corresponding tag. The result is assembled into key-value data, for example "p":["text",{"span":["text"]},{"code":"text"}]. This data format is highly extensible. Note that although the results for different tags have almost the same format, the tags are not all processed in the same way.
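The traversal described for the P tag can be sketched with a small handler table. The handler names are hypothetical, and the sketch strips surrounding whitespace from text nodes, which a faithful converter might prefer to preserve. In ElementTree, the text nodes of an element are exposed as `el.text` and as the `tail` of each child.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Drop the namespace part of an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_code(el):
    # <code> keeps its text as a single string, as in the example above.
    return {"code": "".join(el.itertext())}


def parse_inline(el):
    # Default handler: same list format, keyed by the tag name.
    return {local(el.tag): children_to_list(el)}


HANDLERS = {"code": parse_code}


def children_to_list(el):
    items = []
    if el.text and el.text.strip():
        items.append(el.text.strip())
    for child in el:
        handler = HANDLERS.get(local(child.tag), parse_inline)
        items.append(handler(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_p(el):
    return {"p": children_to_list(el)}
```

Supporting another inline tag then only requires adding an entry to `HANDLERS`, which is one way to read the extensibility claimed for the format.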

For the div tag parsing unit, the detailed steps are shown in Fig. 5. The unit first traverses the div tag to collect all child tags beneath it and passes their content to the appropriate processing modules. Unlike the P tag parsing unit, whose result always uses the key "p", this unit returns each child as a separate key-value pair chosen according to its class. For example, <div><div class="OrderedList"><ul><li>text value</li></li></div></div> is extracted as "ol":[{"li":[]},{"li":[]}]. General-purpose HTML-to-JSON methods cannot do this, and the redundant div tags are removed at the same time.
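A sketch of the class-based unwrapping follows. The markup in the source example appears abbreviated, so the sketch uses its own well-formed input. The class-to-key mapping beyond `OrderedList` is an assumption, and the list handler flattens each item to text, which means nested lists are not preserved.

```python
import xml.etree.ElementTree as ET

# class value -> key used in the JSON output ("UnorderedList" is hypothetical)
CLASS_KEYS = {"OrderedList": "ol", "UnorderedList": "ul"}


def local(tag):
    """Drop the namespace part of an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_list(el):
    # One {"li": [...]} entry per list item found below the classed div.
    return [
        {"li": [" ".join("".join(li.itertext()).split())]}
        for li in el.iter()
        if local(li.tag) == "li"
    ]


def parse_div(div):
    """Unwrap wrapper divs and return key-value pairs keyed by class."""
    out = {}
    for child in div:
        if local(child.tag) != "div":
            continue
        key = CLASS_KEYS.get(child.get("class"))
        if key:
            out[key] = parse_list(child)
        else:
            out.update(parse_div(child))  # plain wrapper: descend and drop it
    return out
```

The outer wrapper div never appears in the output, which corresponds to the removal of redundant div tags mentioned above.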

Next, the main content of each chapter is extracted. The main content consists of several sections, each wrapped in a section tag. The structure beneath a section is similar to that of a chapter, so the section tag parsing unit is called recursively. The section tag parsing unit resembles the chapter parsing unit, except that in addition to recognizing div and p tags and calling the corresponding units, it also recognizes and parses tags such as figure, ol, ul and pre.
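The recursion over nested sections can be sketched as follows. The output keys `title`, `content` and `sections` are hypothetical, and each block tag is reduced to its text as a placeholder for the per-tag parsing units described above.

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"p", "div", "figure", "ol", "ul", "pre"}
HEADING_TAGS = {"h1", "h2", "h3", "h4"}


def local(tag):
    """Drop the namespace part of an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def text_of(el, keep_whitespace=False):
    raw = "".join(el.itertext())
    return raw if keep_whitespace else " ".join(raw.split())


def parse_section(section):
    """Convert a <section> and its nested sections into a nested dict."""
    node = {"title": None, "content": [], "sections": []}
    for child in section:
        name = local(child.tag)
        if name in HEADING_TAGS and node["title"] is None:
            node["title"] = text_of(child)
        elif name == "section":
            node["sections"].append(parse_section(child))  # recurse
        elif name in BLOCK_TAGS:
            # Placeholder: a full parser would dispatch to a per-tag unit here.
            # <pre> keeps its whitespace because it is significant.
            node["content"].append({name: text_of(child, name == "pre")})
    return node
```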

The postscript part is not processed by the invention, since its information value is considered low.

The invention also includes a companion module that converts the semantic JSON back to HTML. The resulting HTML file loses no information compared with the HTML files of the source EPUB and is more concise, which demonstrates that the semantic JSON format is compact and easy to convert into other file formats.
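For the key-value format shown for the P tag, the reverse conversion can be sketched in a few lines. This is an illustration under the assumption that every dictionary key is a tag name and every value is a string, a list or another dictionary; it is not the patent's module, and it does not validate tag names or restore attributes.

```python
from html import escape


def to_html(node):
    """Render the nested key/list structure back into HTML markup."""
    if isinstance(node, str):
        return escape(node)
    if isinstance(node, list):
        return "".join(to_html(item) for item in node)
    # dict: each key is a tag name, each value is that tag's content
    return "".join(
        f"<{tag}>{to_html(value)}</{tag}>" for tag, value in node.items()
    )
```

For example, `to_html({"p": ["text", {"code": "x"}]})` yields a `p` element containing the text followed by a `code` element.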

It should be understood that the specific order or hierarchy of steps in the disclosed processes is an example of exemplary approaches. Based on design preferences, the specific order or hierarchy of steps may be rearranged without departing from the scope of protection of the present disclosure. The accompanying method claims present the elements of the various steps in a sample order and are not meant to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment to simplify the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that embodiments of the claimed subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, the invention lies in less than all features of a single disclosed embodiment. The following claims are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

Those skilled in the art should also understand that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the embodiments herein may be implemented as electronic hardware, computer software or a combination of both. To illustrate this interchangeability of hardware and software clearly, the illustrative components, blocks, modules, circuits and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as departing from the scope of protection of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

For a software implementation, the techniques described in this application may be implemented with modules (for example, procedures, functions and so on) that perform the functions described herein. The software code may be stored in memory units and executed by processors. A memory unit may be implemented within the processor or external to it, in which case it is communicatively coupled to the processor by various means known in the art.

The above description includes examples of one or more embodiments. It is of course not possible to describe every conceivable combination of components or methods for the purpose of describing these embodiments, but one of ordinary skill in the art will recognize that many further combinations and permutations of the embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of protection of the appended claims. Furthermore, to the extent that the term "includes" is used in the description or the claims, it is intended to be inclusive in a manner similar to the term "comprising", as "comprising" is interpreted when employed as a transitional word in a claim. In addition, any use of the term "or" in the description or the claims is intended to mean a non-exclusive "or".

Claims

1. An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip archive and decompress it, and the decompressed files are then parsed, characterized in that the parsing method comprises the following steps:

(1) Parsing the meta information: calling the meta-information parsing module to parse the container.xml file in the META-INF directory and read the root file path of the e-book; locating the opf file from the root file path and parsing the meta information in the opf file;

(2) Parsing the main content of the book: the main content of the book is contained in xhtml/html files; the page processing module classifies the xhtml/html files by reading the epub-type attribute and the data-type attribute of each file's root tag and sorting the files by the values of these attributes into a cover, a preface part, a chapter part and a postscript part; the cover parsing unit, the preface parsing unit and the chapter parsing unit are then called to convert these three parts into semantic form; the postscript part is not processed;

the cover parsing unit directly locates and extracts the cover image in the file, stores it in a specified folder and names it by its MD5 value, specifically by finding the img tag in the document tree, using its src attribute to locate the image in the image directory, moving the image, computing its MD5 value and renaming it;

the preface parsing unit locates the tag whose epub-type is frontmatter and traverses all tags beneath it, converting each into a different JSON field according to its class attribute value; five kinds of tags are parsed, namely those whose class is BookTitlePage, CopyrightPage, Dedication, Preface or BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgments respectively; parsing is done by recursive traversal, and each tag is converted into predefined semantic JSON according to its structure;

the chapter parsing unit splits the chapter-type xhtml it parses into two parts for extraction; the first part is the title and commentary of each chapter; the title is an h1-h4 tag, which is located by tag name and whose text-node value is extracted; the commentary resembles an ordinary paragraph in EPUB and is therefore parsed with a paragraph parsing unit, either the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text; the P tag parsing unit and the div tag parsing unit are the entry points to the other parsing units and form the core of the chapter parsing unit;

(3) Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and the section tag parsing unit is called to process them.

2. The EPUB file parsing method as claimed in claim 1, characterized in that parsing of the opf file is divided into two parts: parsing the book's meta information, and parsing the paths of the book's media files.

3. The EPUB file parsing method as claimed in claim 2, characterized in that the meta information of the book is contained in the tag named metadata, and the tags to be parsed are as follows:

Meta information | Tag name | Attribute
ISBN | Identifier | -
Book title | Title | Id=pub-title
Book subtitle | Title | Id=pub-subtitle
Language of the book | Language | -
Author | Creator | -
Publisher | Publisher | -
Copyright information | Rights | -

The media file paths of the book are contained in the tag named manifest, where each item tag represents one media file; only item tags whose media-type attribute is "application/xhtml+xml" are parsed, which yields the paths of all xhtml/html files; to parse the file, the meta-information parsing module builds the opf file into a document tree, traverses the metadata and manifest tags, extracts the text-node value of each meta-information tag and stores it in a predefined semantic JSON structure; the xhtml and html file paths are stored as instance variables.

4. The EPUB file parsing method as claimed in claim 1, characterized in that the P tag parsing unit performs the following steps: it traverses the child nodes of the tag; a text node is stored in a list, and any other node is handed to the processing unit for the corresponding tag; the result is assembled into key-value data, a format that is highly extensible.

5. The EPUB file parsing method as claimed in claim 1, characterized in that the div tag parsing unit performs the following steps: it traverses the div tag to collect all child tags beneath it and passes their content to the appropriate processing modules; unlike the P tag parsing unit, whose result always uses the key "p", this unit returns each child as a separate key-value pair chosen according to its class, which general-purpose HTML-to-JSON methods cannot do, and the redundant div tags are removed at the same time.