CN112632959A - EPUB file analysis method - Google Patents
- Publication number
- CN112632959A
- Authority
- CN
- China
- Prior art keywords
- file
- parsing
- tag
- epub
- chapter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to an EPUB file parsing method comprising the following steps. Parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF folder and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed. Parsing the book body content: the body content is contained in xml/html files, which a page processing module classifies into a cover, a preamble part, a chapter part and a postscript part; a cover parsing unit, a preamble parsing unit and a chapter parsing unit are then called to semanticize the first three parts, while the postscript part is not processed. Extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them. The method can parse EPUB data into JSON-format data.
Description
Technical Field
The invention relates to the technical field of file reading methods, in particular to the semantization of the EPUB format, and specifically to restructuring a book in EPUB format and converting it into a JSON file.
Background
EPUB (electronic publication) is a free and open standard for reflowable content; that is, the text can be displayed in the manner best suited to reading, depending on the characteristics of the reading device. An EPUB archive internally uses XHTML or DTBook (an XML standard proposed by the DAISY Consortium) to present text and packages the archive contents in zip-compressed form. The EPUB format optionally includes Digital Rights Management (DRM) functionality. However, the data in EPUB e-books currently contain a great deal of redundancy and are not easy to extract and reuse. If a book can be converted, with its information preserved, into the JSON format, which is highly extensible and simple to parse, it can easily be converted into other file types, which has broad application prospects for converting and displaying e-book information.
The existing EPUB parsing scheme parses an EPUB file based on epublib-core-latest. A file header converter, a page converter and an object converter are generated from the parsed information: the file header converter obtains the author and title of the file, the page converter obtains the content of the file, and the object converter integrates the information from the file header converter and the page converter.
The prior art has the following defects. The data obtained from epublib parsing is rich-text content in html format; if plain text is needed, the text content of the tags can be obtained through jsoup, but the layout of the resulting plain text is messy. In addition, generating the file header converter, the page converter and the object converter from the parsed information involves too many steps, and the scheme is not very extensible.
Disclosure of Invention
The purpose of the invention is to provide an EPUB file parsing method that addresses the large amount of redundancy in EPUB e-book data and the difficulty of extracting and reusing it.
To solve these problems, the invention adopts the following technical scheme:
An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip-compressed file and decompress it, after which the decompressed files are parsed, characterized in that the parsing method comprises the following steps:
(1) parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF folder and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed;
(2) parsing the book body content: the body content is contained in xml/html files, which are classified by a page processing module; the classification obtains the epub-type attribute and the data-type attribute of each file's root tag and, according to their values, divides the files into a cover, a preamble part, a chapter part and a postscript part; a cover parsing unit, a preamble parsing unit and a chapter parsing unit are then called to semanticize the first three parts; the postscript part is not processed;
(3) extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them.
Further, parsing of the opf file is divided into two parts: parsing the book meta information and parsing the paths of the book's media files.
Further, the meta information of the book is contained in the tag named metadata, and the tag information to be parsed covers the ISBN, title, subtitle, language, author, publisher and copyright information.
A meta-information parsing module is called to build the opf file into a document tree, the metadata and manifest tags are traversed, and the text-node values of the meta-information tags are extracted and stored in a predefined semantic JSON structure, while the paths of the xml and html files are stored as instance variables.
Further, the cover parsing unit directly locates and extracts the cover image in the file, stores it in a designated folder and names it by its md5 value; specifically, it searches the document tree for the img tag, locates the image in the image directory through the src attribute of the img tag, moves the image and computes its md5 value for renaming.
Further, the preamble parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it and converts them into different JSON fields according to the value of the class attribute; it parses five types of tags, with class BookTitlePage, CopyrightPage, Dedication, Preface and BookAcknowledgments, which contain the book title, copyright, dedication, preface and acknowledgment information respectively; the parsing is a recursive traversal that converts the tags into the predefined semantic JSON according to their structure.
Further, the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction; the first part is the title and the introductory passage of each chapter: the title is an h1-h4 tag, extracted by locating the tag by name and reading its text-node value, while the introductory passage resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, namely the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text; the p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and the core of the chapter parsing unit.
Further, the p tag parsing unit performs the following steps: the child nodes are traversed; a text node is stored in a list, whereas any other node is passed to the processing unit for its tag; finally the results are assembled into key-value data.
Further, the div tag parsing unit performs the following steps: the div tag is traversed to take out all of its child tags, whose contents are passed to different processing modules; unlike the p tag parsing unit, whose result always uses the key "p", the div tag parsing unit returns each result as an independent key-value pair keyed by the class of the div, which ordinary HTML-to-JSON conversion cannot achieve; redundant div tags are removed at the same time.
The technical scheme provided by the invention has at least the following beneficial effects: it makes the information in EPUB e-books usable; by converting a book into a semantic JSON file it preserves as much of the book's information as possible while keeping storage cost as low as possible; and because the JSON format is highly extensible and simple to parse, the result is easy to convert into other file types.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a parsing flowchart of an EPUB file parsing method disclosed in the embodiment of the present invention.
Fig. 2 is a main block diagram of an EPUB file parsing method disclosed in the embodiment of the present invention.
Fig. 3 is a schematic diagram of an xhtml structure of a chapter parsing unit disclosed in the embodiment of the present invention.
Fig. 4 is a flowchart of parsing a P tag parsing unit according to an embodiment of the present invention.
Fig. 5 is a parsing flowchart of the div tag parsing unit according to the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The invention relates to an EPUB file parsing method; the parsing steps are shown in Fig. 1 and the main modules in Fig. 2. First, a decompression module is called to convert the EPUB file into a zip-compressed file and decompress it, and the decompressed files are then parsed in full.
First, the meta information is parsed. After decompression, an e-book that complies with the EPUB 3.0 standard generally contains three kinds of entries: the META-INF folder, the OEBPS (OPS) folder and the mimetype file. A meta-information parsing module is first called to parse the container file in the META-INF folder and read the root file (rootfile) path of the e-book. The opf file is then located according to the root file path, and the meta information in it is parsed. Parsing of the opf file is divided into two parts: parsing the book meta information and parsing the paths of the book's media files. The meta information of the book is mainly contained in the tag named metadata, and the tag information to be parsed is as follows:
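The decompression step can be sketched in Python as follows. This is a minimal illustration rather than the patented module: the function names `unpack_epub`, `build_sample_epub` and `demo` are hypothetical, and the sample archive holds placeholder content only. Python's standard `zipfile` module reads an EPUB directly, so the renaming to `.zip` described above is not needed in this sketch.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def unpack_epub(epub_source, target_dir):
    """Extract an EPUB (a zip archive) into target_dir and return its entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_source) as archive:
        archive.extractall(target)
        return archive.namelist()


def build_sample_epub():
    """Build a tiny in-memory archive with placeholder entries (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/content.opf", "<package/>")
    buffer.seek(0)
    return buffer


def demo():
    """Unpack the sample archive into a temporary folder and list what was written."""
    with tempfile.TemporaryDirectory() as workdir:
        names = unpack_epub(build_sample_epub(), workdir)
        extracted = sorted(
            path.relative_to(workdir).as_posix()
            for path in Path(workdir).rglob("*")
            if path.is_file()
        )
    return names, extracted
```

Calling `unpack_epub("book.epub", "unpacked")` on a real file would leave the META-INF folder, the content folder and the mimetype file under `unpacked`, ready for the parsing steps that follow.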
Meta information | Tag name | Attribute |
---|---|---|
ISBN | Identifier | |
Book title | Title | Id=pub-title |
Book subtitle | Title | Id=pub-subtitle |
Language of the book | Language | |
Author | Creator | |
Publisher | Publisher | |
Copyright information | Rights | |
In practice, the meta-information parsing module is called to build the opf file into a document tree, the metadata and manifest tags are traversed, and the text-node values of the metadata tags are extracted and stored in a predefined semantic JSON structure, while the paths of the xml and html files are stored as instance variables.
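A minimal sketch of this meta-information step is given below, using Python's standard `xml.etree.ElementTree`. The function names and the output field names (`title`, `subtitle` and so on) are assumptions made for illustration; the patent does not publish its JSON schema. The XML namespaces are those defined by the EPUB container and package formats and by Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_rootfile_path(container_xml):
    """Return the full-path attribute of the first rootfile in container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", NS)
    return None if rootfile is None else rootfile.get("full-path")


def parse_opf(opf_xml):
    """Return (metadata dict, list of xhtml page paths) for an opf document."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    meta = {}
    if metadata is not None:
        # Single-valued Dublin Core tags from the table above.
        for field in ("identifier", "language", "creator", "publisher", "rights"):
            node = metadata.find(f"dc:{field}", NS)
            if node is not None and node.text:
                meta[field] = node.text.strip()
        # The title tag appears twice; the id attribute tells them apart.
        for title in metadata.findall("dc:title", NS):
            key = "subtitle" if title.get("id") == "pub-subtitle" else "title"
            meta[key] = (title.text or "").strip()
    # Media file paths: keep only the xhtml pages listed in the manifest.
    pages = [
        item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
        if item.get("media-type") == "application/xhtml+xml"
    ]
    return meta, pages
```

The returned dictionary can be written out with `json.dumps`, and the page list plays the role of the stored xml/html paths that later steps iterate over.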
Second, the book body content is parsed. The page processing module first classifies the xml/html files, mainly by obtaining the epub-type attribute and the data-type attribute of each file's root tag and classifying according to their values. The book content is divided into four parts, namely the cover, the preamble part, the chapter part and the postscript part, and a different module is then called for each; this is a key innovation of the method's semantic parsing of book content.
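The classification can be sketched as below. Only the value `frontmatter` is named in this description; the other attribute values in `CATEGORY_BY_VALUE` are assumptions borrowed from common EPUB usage, and the choice to scan for the first element carrying either attribute is an illustrative reading of "root tag".

```python
import xml.etree.ElementTree as ET

# epub:type lives in the EPUB "ops" namespace; data-type is a plain attribute.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Assumed mapping from attribute value to the four categories of the method.
CATEGORY_BY_VALUE = {
    "cover": "cover",
    "frontmatter": "preamble",
    "chapter": "chapter",
    "backmatter": "postscript",
}


def classify_page(xhtml):
    """Classify a page by the first recognised epub:type or data-type value."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        value = element.get(EPUB_TYPE) or element.get("data-type")
        if not value:
            continue
        # The attribute may hold several space-separated tokens.
        for token in value.split():
            if token in CATEGORY_BY_VALUE:
                return CATEGORY_BY_VALUE[token]
    return "unknown"
```

A dispatcher would then route pages classified as cover, preamble or chapter to the matching parsing unit and skip postscript pages.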
The cover parsing unit directly locates and extracts the cover image in the file, stores it in a designated folder and names it by its md5 value. Specifically, it searches the document tree for the img tag, locates the image in the image directory through the src attribute of the img tag, moves the image and computes its md5 value for renaming.
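The cover step might look like the following sketch. The names `extract_cover` and `demo`, the folder layout and the fake image bytes are all hypothetical; the sketch assumes the cover page is well-formed XHTML and that the src attribute is a path relative to the page.

```python
import hashlib
import shutil
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

XHTML = "{http://www.w3.org/1999/xhtml}"


def extract_cover(page_path, output_dir):
    """Move the first image referenced by the cover page, renamed to its md5 digest."""
    page = Path(page_path)
    root = ET.parse(str(page)).getroot()
    img = root.find(f".//{XHTML}img")
    if img is None or not img.get("src"):
        return None
    # Resolve the src attribute relative to the folder that holds the page.
    source = (page.parent / img.get("src")).resolve()
    digest = hashlib.md5(source.read_bytes()).hexdigest()
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{digest}{source.suffix}"
    shutil.move(str(source), str(target))
    return target


def demo():
    """Run extract_cover on a throwaway page and image."""
    with tempfile.TemporaryDirectory() as workdir:
        base = Path(workdir)
        (base / "images").mkdir()
        (base / "images" / "cover.jpg").write_bytes(b"fake image bytes")
        (base / "cover.xhtml").write_text(
            '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
            '<img src="images/cover.jpg" alt="cover"/></body></html>',
            encoding="utf-8",
        )
        target = extract_cover(base / "cover.xhtml", base / "covers")
        return target.name, target.exists(), (base / "images" / "cover.jpg").exists()
```

Naming by digest means that two books sharing the same cover image map to the same file name in the output folder.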
The preamble parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it and converts them into different JSON fields according to the class attribute value. It mainly parses five types of tags, with class BookTitlePage, CopyrightPage, Dedication, Preface and BookAcknowledgement, which contain the book title, copyright, dedication, preface and acknowledgment information respectively. The parsing is a recursive traversal that converts the tags into the predefined semantic JSON according to their structure.
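The preamble step can be illustrated with the sketch below. The class-name spellings in `FIELD_BY_CLASS` and the JSON field names are assumptions, since the exact spellings and the target schema are not recoverable from this text; a real implementation would use whatever class names the source books actually carry. `Element.iter()` performs the depth-first traversal of all descendant tags.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Assumed class names (keys) and assumed JSON field names (values).
FIELD_BY_CLASS = {
    "BookTitlePage": "title_page",
    "CopyrightPage": "copyright",
    "Dedication": "dedication",
    "Preface": "preface",
    "BookAcknowledgments": "acknowledgments",
}


def normalise(element):
    """Join all text beneath an element and collapse runs of whitespace."""
    return " ".join(" ".join(element.itertext()).split())


def parse_front_matter(xhtml):
    """Collect the text of recognised front-matter blocks, keyed by JSON field."""
    root = ET.fromstring(xhtml)
    container = next(
        (el for el in root.iter() if "frontmatter" in (el.get(EPUB_TYPE) or "").split()),
        None,
    )
    result = {}
    if container is None:
        return result
    for node in container.iter():
        field = FIELD_BY_CLASS.get(node.get("class", ""))
        if field:
            result.setdefault(field, []).append(normalise(node))
    return result
```

This version flattens each block to plain text; a fuller version would hand the block's children to the p and div tag parsing units instead.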
The structure of the chapter-type xhtml handled by the chapter parsing unit is shown in Fig. 3; the invention splits it into two parts for extraction. The first part is the title and the introductory passage of each chapter. The title is generally an h1-h4 tag and is extracted by locating the tag by name and reading its text-node value. The introductory passage resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, for example the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text. The p tag parsing unit and the div tag parsing unit are the entry points to almost all of the remaining parsing units and are the core of the chapter parsing unit.
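Title extraction by tag name can be sketched as follows. The function name `extract_headings` and the one-key-per-heading output shape are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
HEADING_TAGS = {f"{XHTML}h{level}" for level in range(1, 5)}


def extract_headings(xhtml):
    """Return the h1-h4 headings of a chapter page in document order."""
    root = ET.fromstring(xhtml)
    headings = []
    for node in root.iter():
        if node.tag in HEADING_TAGS:
            # Join nested inline text (for example <em>) and collapse whitespace.
            text = " ".join("".join(node.itertext()).split())
            headings.append({node.tag[len(XHTML):]: text})
    return headings
```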
The specific steps of the p tag parsing unit are shown in Fig. 4. The child nodes are traversed first: a text node is stored in a list, whereas any other node is passed to the processing unit for its tag. The results are finally assembled into key-value data such as {"p": ["text", {"span": ["text"]}, {"code": "text"}]}, a form that is highly extensible. Note that although the results for different tags have almost the same format, the tags are processed in different ways.
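The p tag traversal can be sketched as below. The handler table, the default handler and the decision to strip surrounding whitespace from text nodes are assumptions; only the overall shape (text nodes collected in a list, other nodes dispatched by tag, result keyed by "p") comes from the description.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def parse_inline(element):
    """Default handler: key by tag name, value is the list of child results."""
    return {local_name(element): collect_children(element)}


def parse_code(element):
    """Handler for code: keep the raw text as a single string."""
    return {"code": "".join(element.itertext())}


# Tag-specific handlers; tags without an entry fall back to parse_inline.
HANDLERS = {"code": parse_code}


def collect_children(element):
    """Walk child nodes in order, keeping text nodes and dispatching tags."""
    items = []
    if element.text and element.text.strip():
        items.append(element.text.strip())
    for child in element:
        handler = HANDLERS.get(local_name(child), parse_inline)
        items.append(handler(child))
        # In ElementTree, the text that follows a child is stored as its tail.
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_p(element):
    """p tag parsing unit: the result key is always 'p'."""
    return {"p": collect_children(element)}
```

Adding support for another inline tag is then a matter of registering one more function in `HANDLERS`, which is one way to read the extensibility claimed for this format.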
The specific steps of the div tag parsing unit are shown in Fig. 5. The div tag is traversed to take out all of its child tags, whose contents are passed to different processing modules. Unlike the p tag parsing unit, whose result always uses the key "p", the div tag parsing unit returns each result as an independent key-value pair keyed by the class of the div. For example, a nested structure such as <div><div class="ol"><li>text</li><li>text</li></div></div> is extracted as "ol": [{"li": ["text"]}, {"li": ["text"]}], which ordinary HTML-to-JSON conversion cannot achieve, and the redundant div tag is removed.
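A sketch of the div handling follows. The rule used here for a "redundant" div (no class, no text of its own, exactly one div child) and the fallback key "div" are assumptions; the description only states that the key comes from the class and that redundant div tags are dropped.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def parse_children(element):
    """Parse text and child tags of an element into an ordered list."""
    items = []
    if element.text and element.text.strip():
        items.append(element.text.strip())
    for child in element:
        items.append(parse_node(child))
        if child.tail and child.tail.strip():
            items.append(child.tail.strip())
    return items


def parse_node(element):
    """Dispatch on tag name: div gets class-based keys, others use the tag name."""
    if local_name(element) == "div":
        return parse_div(element)
    return {local_name(element): parse_children(element)}


def parse_div(element):
    """div tag parsing unit: unwrap redundant wrappers, then key by class."""
    while (
        element.get("class") is None
        and len(element) == 1
        and local_name(element[0]) == "div"
        and not (element.text or "").strip()
    ):
        element = element[0]
    tokens = (element.get("class") or "").split()
    key = tokens[0] if tokens else "div"
    return {key: parse_children(element)}
```

With this rule, the nested example above collapses to a single "ol" entry and the outer wrapper leaves no trace in the output.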
Next, the main content of each chapter is extracted. The main content consists of several sections, each wrapped in a section tag. Because the structure beneath a section is similar to that of a chapter, the section tag parsing unit can be called recursively. It is similar to the chapter parsing unit, the difference being that it not only recognizes div and p tags and calls the corresponding units, but also recognizes and parses tags such as figure, ol, ul and pre.
The postscript part is not processed, given its low information value.
The invention also provides a companion module that converts the semantic JSON into html. It converts JSON data into html files without loss of information, and the resulting files are simpler than the html files of the source EPUB, which reflects the simplicity of the semantic JSON format and the ease of converting it into other file formats.
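The recursive section handling can be sketched as follows. For brevity this sketch flattens headings, p and div tags to plain text instead of calling the full p and div units, and the output shapes for figure, lists and pre are assumptions.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without any XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def text_of(element):
    """Join all text beneath an element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def parse_list(element):
    """ol/ul: one entry per li child."""
    items = [{"li": text_of(li)} for li in element if local_name(li) == "li"]
    return {local_name(element): items}


def parse_figure(element):
    """figure: keep the image source and the caption text."""
    img = next((n for n in element.iter() if local_name(n) == "img"), None)
    caption = next((n for n in element.iter() if local_name(n) == "figcaption"), None)
    return {"figure": {
        "src": img.get("src") if img is not None else None,
        "caption": text_of(caption) if caption is not None else "",
    }}


def parse_section(element):
    """Section tag parsing unit: recurse into nested sections."""
    content = []
    for child in element:
        name = local_name(child)
        if name == "section":
            content.append(parse_section(child))
        elif name in ("ol", "ul"):
            content.append(parse_list(child))
        elif name == "figure":
            content.append(parse_figure(child))
        elif name == "pre":
            # Preformatted text keeps its whitespace untouched.
            content.append({"pre": "".join(child.itertext())})
        elif name in ("h1", "h2", "h3", "h4", "p", "div"):
            content.append({name: text_of(child)})
    return {"section": content}
```

Because a nested section is handled by the same function, the depth of the output mirrors the depth of the section tags in the source page.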
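The reverse direction can be illustrated with a short sketch, assuming the nested key-and-list JSON shape used in the examples above. The function name `json_to_html` is hypothetical, and this version simply uses each key as a tag name, so keys that were derived from a class attribute would need an extra mapping that is not shown here.

```python
from html import escape


def json_to_html(node):
    """Render nested semantic JSON (strings, lists, one-key dicts) as html text."""
    if isinstance(node, str):
        # Escape characters that would otherwise be read as markup.
        return escape(node)
    if isinstance(node, list):
        return "".join(json_to_html(item) for item in node)
    # A dict maps a tag name to its content.
    return "".join(
        f"<{tag}>{json_to_html(value)}</{tag}>" for tag, value in node.items()
    )
```

Since the JSON holds only tag names and text, the generated markup carries none of the styling attributes or wrapper elements of the source page, which is consistent with the simpler output described above.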
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
Claims (8)
1. An EPUB file parsing method, in which a decompression module is called to rename the EPUB file as a zip-compressed file and decompress it, after which the decompressed files are parsed, characterized in that the parsing method comprises the following steps:
(1) parsing the meta information: a meta-information parsing module is called to parse the container.xml file in the META-INF folder and read the root file path of the e-book; the opf file is located according to the root file path, and the meta information in the opf file is parsed;
(2) parsing the book body content: the body content is contained in xml/html files, which are classified by a page processing module; the classification obtains the epub-type attribute and the data-type attribute of each file's root tag and, according to their values, divides the files into a cover, a preamble part, a chapter part and a postscript part; a cover parsing unit, a preamble parsing unit and a chapter parsing unit are then called to semanticize the first three parts; the postscript part is not processed;
(3) extracting the main content of each chapter: the main content consists of several sections, each wrapped in a section tag, and a section tag parsing unit is called to process them.
2. The EPUB file parsing method of claim 1, wherein parsing of the opf file is divided into two parts: parsing the book meta information and parsing the paths of the book's media files.
3. The EPUB file parsing method of claim 2, wherein the meta information of the book is contained in the tag named metadata, and the tag information to be parsed covers the ISBN, title, subtitle, language, author, publisher and copyright information;
a meta-information parsing module is called to build the opf file into a document tree, the metadata and manifest tags are traversed, and the text-node values of the meta-information tags are extracted and stored in a predefined semantic JSON structure, while the paths of the xml and html files are stored as instance variables.
4. The EPUB file parsing method of claim 1, wherein the cover parsing unit directly locates and extracts the cover image in the file, stores it in a designated folder and names it by its md5 value; specifically, it searches the document tree for the img tag, locates the image in the image directory through the src attribute of the img tag, moves the image and computes its md5 value for renaming.
5. The EPUB file parsing method of claim 1, wherein the preamble parsing unit locates the tag whose epub-type is frontmatter, traverses all tags beneath it and converts them into different JSON fields according to the class attribute value; it parses five types of tags, distinguished by their class attribute values, which contain the book title, copyright, dedication, preface and acknowledgment information; the parsing is a recursive traversal that converts the tags into the predefined semantic JSON according to their structure.
6. The EPUB file parsing method according to claim 1, wherein the chapter parsing unit splits the chapter-type xhtml to be parsed into two parts for extraction; the first part is the title and the introductory passage of each chapter: the title is an h1-h4 tag, extracted by locating the tag by name and reading its text-node value, while the introductory passage resembles an ordinary paragraph in an EPUB, so a paragraph parsing unit is used, namely the p tag parsing unit or the div tag parsing unit, where a p tag represents simple text and a div tag represents more complex text; the p tag parsing unit and the div tag parsing unit are the entry points to the remaining parsing units and the core of the chapter parsing unit.
7. The EPUB file parsing method according to claim 6, wherein the p tag parsing unit performs the following steps: the child nodes are traversed; a text node is stored in a list, whereas any other node is passed to the processing unit for its tag; finally the results are assembled into key-value data.
8. The EPUB file parsing method according to claim 6 or 7, wherein the div tag parsing unit performs the following steps: the div tag is traversed to take out all of its child tags, whose contents are passed to different processing modules; unlike the p tag parsing unit, whose result always uses the key "p", the div tag parsing unit returns each result as an independent key-value pair keyed by the class of the div, which ordinary HTML-to-JSON conversion cannot achieve; redundant div tags are removed at the same time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011590270.4A CN112632959B (en) | 2020-12-29 | 2020-12-29 | A method for parsing EPUB files |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011590270.4A CN112632959B (en) | 2020-12-29 | 2020-12-29 | A method for parsing EPUB files |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112632959A (en) | 2021-04-09 |
CN112632959B (en) | 2023-09-01 |
Family
ID=75285911
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011590270.4A Active CN112632959B (en) | 2020-12-29 | 2020-12-29 | A method for parsing EPUB files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112632959B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114372028A (en) * | 2022-01-13 | 2022-04-19 | 上海呵呵呵文化传播有限公司 | File processing method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130139052A1 (en) * | 2011-11-26 | 2013-05-30 | Huawei Technologies Co., Ltd. | Method and apparatus for loading epub electronic book |
CN103761277A (en) * | 2014-01-09 | 2014-04-30 | 北京掌阔技术有限公司 | ePub electronic book loading method and system |
CN106202169A (en) * | 2016-06-24 | 2016-12-07 | 北京玖扬博文文化发展有限公司 | Conversion method for ePub file format |
CN106802880A (en) * | 2015-11-25 | 2017-06-06 | 阿里巴巴集团控股有限公司 | Electronic document content display and processing method and device |
US10007646B1 (en) * | 2015-04-03 | 2018-06-26 | Nutanix, Inc. | Method and system for presenting multiple levels of content for a document |
CN110532233A (en) * | 2019-08-20 | 2019-12-03 | 武汉鼎森电子科技有限公司 | Epub document generating method and system |
CN110554996A (en) * | 2019-08-20 | 2019-12-10 | 武汉鼎森电子科技有限公司 | Method and system for quickly opening epub file |
Non-Patent Citations (3)
Title |
---|
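Ye Lan (叶兰): "The new e-book specification EPUB 3.0 and its applications" (电子图书新规范EPUB3.0及其应用), Library Journal (图书馆杂志), no. 08, pages 53-59 * |
Tang Guangqian (唐光前): "An analysis of the Open eBook Publication Structure (OEBPS) specification" (开放式电子图书出版结构(OEBPS)规范的解析), Library and Information Service (图书情报工作), no. 05, pages 58-62 * |
Chi Liang (迟亮): "Research on EPUB 3.1 digital publishing technology" (EPUB 3.1数字出版技术研究), Computer Knowledge and Technology (电脑知识与技术), no. 19, pages 239-240 * |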
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114372028A (en) * | 2022-01-13 | 2022-04-19 | 上海呵呵呵文化传播有限公司 | File processing method and device |
CN114372028B (en) * | 2022-01-13 | 2025-06-06 | 上海呵呵呵文化传播有限公司 | File processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN112632959B (en) | 2023-09-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11036808B2 (en) | | System and method for indexing electronic discovery data |
EP1679625B1 (en) | | Method and apparatus for structuring documents based on layout, content and collection |
US5778400A (en) | | Apparatus and method for storing, searching for and retrieving text of a structured document provided with tags |
JP4373721B2 (en) | | Method and system for encoding markup language documents |
US8407585B2 (en) | | Context-aware content conversion and interpretation-specific views |
US8200702B2 (en) | | Independently variably scoped content rule application in a content management system |
US20060236228A1 (en) | | Extensible markup language schemas for bibliographies and citations |
CN108228676B (en) | | Information extraction method and system |
CN111062187A (en) | | Structured parsing method and system for docx format document |
KR20040087903A (en) | | Database model for hierarchical data formats |
Elizarov et al. | | Scientific documents ontologies for semantic representation of digital libraries |
CN105701091B (en) | | Processing method and processing unit for semantic-based PDF documents |
WO2006081475A2 (en) | | System and method for processing XML documents |
CN115796146A (en) | | File comparison method and device |
CN112632959A (en) | | EPUB file analysis method |
CN102063415B (en) | | Method and system for embedding single-byte fonts in PDF (Portable Document Format) file |
CN112818687B (en) | | Method, device, electronic equipment and storage medium for constructing title recognition model |
Sirsat et al. | | Pattern matching for extraction of core contents from news web pages |
Changuel et al. | | A general learning method for automatic title extraction from html pages |
CN110554996A (en) | | Method and system for quickly opening epub file |
CN102360351A (en) | | Method and system for carrying out semantic description on content of electronic-book (e-book) |
KR102075874B1 (en) | | Method for conversion of e-book and apparatus using the method |
CN113657080A (en) | | An XML-based structured system and data package creation method |
CN112836477B (en) | | Method and device for generating code annotation document, electronic equipment and storage medium |
JP4388092B2 (en) | | Structured document database management system and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |