
CN112069771B - Method and device for analyzing pictures in PDF (portable document format) file - Google Patents


Info

Publication number
CN112069771B
Authority
CN
China
Prior art keywords
pdf file
file
page
resolution
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010871131.2A
Other languages
Chinese (zh)
Other versions
CN112069771A (en
Inventor
黄世玉
李保仓
胡赞华
龚正
李月龙
汪丹萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202010871131.2A
Publication of CN112069771A
Application granted
Publication of CN112069771B
Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Facsimiles In General (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for parsing pictures in a PDF file, and relates to the technical field of computers. One embodiment of the method comprises: identifying file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages; for each of the one or more file pages: determining, based on the identified file information of the PDF file, a picture parsing resolution for parsing a picture from the file page; parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and compressing the obtained picture for further processing. This embodiment improves the efficiency of parsing pictures in PDF files and improves the user experience.

Description

Method and device for analyzing pictures in PDF (portable document format) file
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for parsing pictures in a PDF file.
Background
At present, PDF is a widely used file format that can encapsulate text, tables, pictures, and the like. It has the advantages of a small storage footprint, resistance to casual tampering, convenient transmission, and freedom from compatibility problems. PDF files also have the disadvantage that text, tables, pictures, and the like cannot be exported directly, which greatly inconveniences users who attempt to extract such content from a PDF file.
In the prior art, some technical schemes for parsing PDF files exist. They mainly comprise the following steps: uploading a PDF file; executing a picture parsing function on the PDF file to parse each page in the PDF file; and sending all the parsed content to other services or storing it in a storage device.
However, in the process of implementing the present invention, the inventors found that at least the following problems exist in the prior art:
When the PDF file is large or the resolution of the pictures in the PDF file is high, parsing the PDF file takes a long time and can even overload the system and cause it to crash, which affects the user experience. In addition, usually only the first few pictures in the PDF file contain valid data, while the later pictures usually do not and need no further parsing; the traditional method nevertheless parses all the pictures in the PDF file, which wastes system resources, reduces parsing efficiency, and further affects the user experience.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method and an apparatus for parsing pictures in a PDF file, which improve parsing efficiency and the user experience by limiting the file size of the PDF file and the number of pages of the PDF file that are parsed. Moreover, parsing efficiency and the user experience are further improved by determining an appropriate picture parsing resolution based on the file information of the PDF file.
To achieve the above object, according to one aspect of an embodiment of the present invention, there is provided a method for parsing a picture in a PDF file.
The method for parsing a picture in a PDF file according to an embodiment of the present invention comprises the following steps:
identifying file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages;
for each of the one or more file pages:
determining, based on the identified file information of the PDF file, a picture parsing resolution for parsing a picture from the file page; and
parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and
compressing the obtained picture for further processing.
Optionally, the method further comprises: determining whether there is a limit on the file size of the PDF file:
in response to determining that there is a limit on the file size of the PDF file, determining whether the file size of the PDF file exceeds a PDF file parsing threshold file size; and
in response to determining that there is no limit on the file size of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file.
Optionally, determining whether the file size of the PDF file exceeds the PDF file parsing threshold file size further includes:
in response to determining that the file size of the PDF file exceeds the PDF file parsing threshold file size, ending the parsing of the PDF file and returning error information; and
in response to determining that the file size of the PDF file does not exceed the PDF file parsing threshold file size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
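The file-size gate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `PdfTooLargeError`, `resolution_after_size_check`, and `max_file_size` are hypothetical, sizes are assumed to be in bytes, and the 150 DPI default is borrowed from the example value given later in the description.

```python
from typing import Optional

DEFAULT_DPI = 150  # example default from the description; not a fixed value


class PdfTooLargeError(Exception):
    """Raised when the PDF exceeds the parsing threshold file size."""


def resolution_after_size_check(file_size: int,
                                max_file_size: Optional[int] = None,
                                default_dpi: int = DEFAULT_DPI) -> int:
    """Return the parsing resolution, or raise if the file is too large.

    max_file_size=None models "no file size limit is configured".
    Sizes are in bytes (an assumption; the source mentions KB and MB).
    """
    if max_file_size is None:
        # No limit exists: use the default resolution.
        return default_dpi
    if file_size > max_file_size:
        # Limit exceeded: end parsing and report an error.
        raise PdfTooLargeError(
            f"PDF is {file_size} bytes; limit is {max_file_size} bytes")
    # Within the limit: use the default resolution.
    return default_dpi
```

Raising an exception is one way to model "ending the parsing and returning error information"; a caller could equally return an error code.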
Optionally, the method further comprises:
determining whether there is a limit on the number of pages of the PDF file:
in response to determining that there is a limit on the number of pages of the PDF file, determining whether the number of pages of the PDF file exceeds a PDF file parsing threshold number of pages; and
in response to determining that there is no limit on the number of pages of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file.
Optionally, determining whether the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages further includes:
in response to determining that the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages, identifying the picture parsing resolution as a lowest resolution for parsing pictures of the PDF file; and
in response to determining that the number of pages of the PDF file does not exceed the PDF file parsing threshold number of pages, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
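As a rough sketch of the page-count rule above, the selection reduces to a single comparison. The function name, the parameter names, and the 72 DPI lowest resolution are assumptions made for illustration; the source does not state a value for the lowest resolution.

```python
from typing import Optional


def resolution_for_page_count(num_pages: int,
                              max_pages: Optional[int] = None,
                              default_dpi: int = 150,
                              lowest_dpi: int = 72) -> int:
    """Pick a parsing resolution from the number of pages.

    max_pages=None models "no page-count limit is configured".
    lowest_dpi=72 is a hypothetical value, not taken from the source.
    """
    if max_pages is not None and num_pages > max_pages:
        # Too many pages: fall back to the lowest resolution.
        return lowest_dpi
    # No limit, or within the limit: use the default resolution.
    return default_dpi
```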
Optionally, the method further comprises:
determining whether there is a limit on the page size of the PDF file:
in response to determining that there is a limit on the page size of the PDF file, determining whether the page size of the PDF file exceeds a PDF file parsing threshold page size; and
in response to determining that there is no limit on the page size of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file.
Optionally, determining whether the page size of the PDF file exceeds the PDF file parsing threshold page size further includes:
in response to determining that the page size of the PDF file exceeds the PDF file parsing threshold page size, identifying the picture parsing resolution as 2 × (the lowest resolution for parsing pictures of the PDF file) / (the number of pages of the PDF file); and
in response to determining that the page size of the PDF file does not exceed the PDF file parsing threshold page size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
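The page-size rule can be illustrated with a short sketch. The function and parameter names and the 72 DPI lowest resolution are hypothetical. The formula itself, 2 × lowest resolution / number of pages, comes from the text above; clamping the result so that it never falls below the lowest resolution is an interpretation based on the detailed description, which says the selected resolution should not be lower than the system minimum.

```python
from typing import Optional


def resolution_for_page_size(page_size: int,
                             num_pages: int,
                             max_page_size: Optional[int] = None,
                             default_dpi: int = 150,
                             lowest_dpi: int = 72) -> float:
    """Pick a parsing resolution from the size of one file page.

    max_page_size=None models "no page-size limit is configured".
    """
    if max_page_size is None or page_size <= max_page_size:
        return float(default_dpi)
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    # Formula from the text: 2 x lowest resolution / number of pages.
    candidate = 2 * lowest_dpi / num_pages
    # Assumption: never go below the system's lowest resolution.
    return max(float(lowest_dpi), candidate)
```

With this clamp, the formula only yields a value above the lowest resolution for a single-page file; for two or more pages the result equals the lowest resolution.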
Optionally, after parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page, the method further comprises: determining whether the file page is the last page of the PDF file.
Optionally, determining whether the file page is the last page of the PDF file further includes:
in response to determining that the file page is the last page of the PDF file, ending the parsing of the PDF file; and
in response to determining that the file page is not the last page of the PDF file, determining whether there is a limit on the number of read pages of the PDF file.
Optionally, determining whether there is a limit on the number of read pages of the PDF file further includes:
in response to determining that there is a limit on the number of read pages of the PDF file, determining whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages; and
in response to determining that there is no limit on the number of read pages of the PDF file and that the file page is not the last page of the PDF file, continuing to parse the next file page.
Optionally, determining whether the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages further includes:
in response to determining that the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages, ending the parsing; and
in response to determining that the number of pages of the PDF file that have been parsed does not exceed the PDF file parsing threshold number of read pages and that the file page is not the last page of the PDF file, continuing to parse the next file page.
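The page loop with its two stop conditions (last page reached, or read-page limit reached) might look like the sketch below. The names are hypothetical, and the sketch stops as soon as the parsed count reaches the threshold, which is one reading of "exceeds"; a stricter reading would parse one more page.

```python
from typing import List, Optional


def pages_to_parse(num_pages: int,
                   max_read_pages: Optional[int] = None) -> List[int]:
    """Return the zero-based indices of the pages that would be parsed.

    max_read_pages=None models "no read-page limit is configured".
    """
    parsed: List[int] = []
    for page_index in range(num_pages):
        parsed.append(page_index)  # stands in for parsing this page
        if page_index == num_pages - 1:
            break  # last page: end parsing
        if max_read_pages is not None and len(parsed) >= max_read_pages:
            break  # read-page limit reached: end parsing
    return parsed
```

This addresses the problem noted in the background that often only the first few pages contain useful data, so later pages need not be parsed at all.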
Optionally, the method further comprises: using the obtained picture for an artificial intelligence service.
Optionally, the PDF file parsing threshold file size is set automatically by a system or manually by a user.
Optionally, the PDF file parsing threshold number of pages is set automatically by a system or manually by a user.
Optionally, the PDF file parsing threshold page size is set automatically by a system or manually by a user.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an apparatus for parsing a picture in a PDF file.
The apparatus for parsing a picture in a PDF file according to an embodiment of the present invention comprises: a file information identification module, a picture parsing resolution determination module, a picture parsing module, and a picture compression module; wherein,
the file information identification module is configured to identify file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages;
the picture parsing resolution determination module is configured to determine, for each of the one or more file pages, a picture parsing resolution for parsing a picture from the file page based on the identified file information of the PDF file;
the picture parsing module is configured to parse the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and
the picture compression module is configured to compress the obtained picture for further processing.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided an electronic device for parsing a picture in a PDF file.
The electronic device for parsing a picture in a PDF file comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for parsing pictures in a PDF file according to an embodiment of the present invention.
To achieve the above object, according to still another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium.
A computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program which, when executed by a processor, implements a method for parsing a picture in a PDF file of an embodiment of the present invention.
One embodiment of the above invention has the following advantages or benefits: limiting the file size of the PDF file and the number of pages of the PDF file that are parsed improves parsing efficiency and the user experience. Moreover, parsing efficiency and the user experience are further improved by determining an appropriate picture parsing resolution based on the file information of the PDF file.
Further effects of the above-described non-conventional optional implementations are described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for parsing a picture in a PDF file according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main steps of another method for parsing a picture in a PDF file according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the main modules of an apparatus for parsing a picture in a PDF file according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be applied;
FIG. 5 is a schematic diagram of a computer system suitable for use in implementing an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention and the technical features in the embodiments may be combined with each other provided they do not conflict.
FIG. 1 is a schematic diagram of the main steps of a method for parsing a picture in a PDF file according to an embodiment of the present invention.
As shown in FIG. 1, a method for parsing a picture in a PDF file according to an embodiment of the present invention mainly includes the following steps:
Step S101: file information of a PDF file is identified, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages.
Step S102: for each of the one or more file pages, a picture parsing resolution for parsing a picture from the file page is determined based on the identified file information of the PDF file.
Step S103: the picture is parsed from the file page according to the determined picture parsing resolution to obtain the picture of the file page.
Step S104: the obtained picture is compressed for further processing.
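Steps S101 to S104 can be read as a simple per-page pipeline. The sketch below is only an illustration under assumptions: `FileInfo`, `parse_pictures`, `choose_dpi`, and `render_page` are hypothetical names, the page renderer is injected as a callable because the source does not name a PDF library, and `zlib` stands in for whatever picture compression an implementation would actually use.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FileInfo:
    """File information identified in step S101."""
    file_size: int         # bytes
    num_pages: int
    page_sizes: List[int]  # bytes per file page


def parse_pictures(info: FileInfo,
                   choose_dpi: Callable[[FileInfo, int], int],
                   render_page: Callable[[int, int], bytes]) -> List[bytes]:
    """Run S102 to S104 for every file page and return compressed pictures."""
    compressed: List[bytes] = []
    for page_index in range(info.num_pages):
        dpi = choose_dpi(info, page_index)         # S102: pick a resolution
        picture = render_page(page_index, dpi)     # S103: parse the picture
        compressed.append(zlib.compress(picture))  # S104: compress it
    return compressed


# Usage with a fake renderer that returns a label instead of pixel data.
info = FileInfo(file_size=2000, num_pages=2, page_sizes=[1000, 1000])
pictures = parse_pictures(
    info,
    choose_dpi=lambda file_info, page: 150,
    render_page=lambda page, dpi: f"page{page}@{dpi}".encode(),
)
```

Passing the resolution policy and the renderer as callables keeps the sketch independent of any particular PDF toolkit.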
Prior to identifying the file information of a PDF file, the PDF file typically needs to be uploaded to a user system by an individual user or an enterprise user in order for the system to identify various file information of the PDF file. In general, file information of a PDF file includes a file size of the PDF file (which may be expressed in units of MB, KB, and the like), the number of pages of the PDF file, and a page size of each of one or more file pages (which may be expressed in units of KB, and the like).
In the present embodiment, the page size of each file page may be set to the file size of the PDF file divided by the number of pages of the PDF file, or may be set according to the actual page size of the identified file page; the present invention is not limited in this respect.
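The two ways of obtaining a page size mentioned above can be sketched as follows. The function name and the `actual_sizes` parameter are hypothetical; sizes are assumed to be in bytes.

```python
from typing import List, Optional


def page_sizes(file_size: int,
               num_pages: int,
               actual_sizes: Optional[List[int]] = None) -> List[float]:
    """Return one size per file page.

    If actual per-page sizes were identified, use them; otherwise
    estimate each page as file_size / num_pages.
    """
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    if actual_sizes is not None:
        if len(actual_sizes) != num_pages:
            raise ValueError("expected one size per page")
        return [float(size) for size in actual_sizes]
    average = file_size / num_pages
    return [average] * num_pages
```

The average is cheap to compute but hides outliers: a file with one very large scanned page and many small text pages would report the same size for every page.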
After identifying the file size of the PDF file, the system may determine whether to parse the PDF file according to factors such as the identified file size and the performance of the system. The performance of the system may be characterized by, for example, the word length of the system's CPU, the CPU's cache capacity, the CPU's frequency, the front-side bus frequency, and so on. For example, if the file size of a PDF file exceeds a certain file size threshold (e.g., 50 MB, 100 MB, or 200 MB; the present invention is not limited in this respect), the system takes too long to parse the PDF file, which can cause a system crash or make the user wait too long. Limiting the file size of the PDF file effectively avoids system crashes and excessive waiting times caused by oversized PDF files, thereby improving the user experience.
Further, after identifying the number of pages of the PDF file, the system may select an appropriate picture parsing resolution based on the identified number of pages in order to parse all pictures of the PDF file quickly. For example, if the PDF file has 100 pages (which exceeds the system's default maximum allowed number of pages) and the system's default picture parsing resolution is 150 DPI (or 200 DPI or any other value; the present invention is not limited in this respect), parsing the PDF file at the default picture parsing resolution may take too long and affect the user experience because the PDF file has too many pages. On this basis, the picture parsing resolution can be adjusted according to the number of pages of the PDF file. The larger the number of pages of the PDF file, the lower the picture parsing resolution that may be selected, but it should not be lower than the minimum picture resolution set by the system; otherwise the visual quality of the picture may be affected.
In addition, after identifying the page size of each of the one or more file pages of the PDF file, the system may further select an appropriate picture parsing resolution based on the page size of each file page in order to parse all pictures of the PDF file quickly. For example, if the page size of a file page exceeds 500 KB (or 1 MB or any other value; the present invention is not limited in this respect), parsing the PDF file at the system's default picture parsing resolution may take too long and affect the user experience because the file page is too large. On this basis, the picture parsing resolution can be adjusted according to the page size of the file page. The larger the page size of the file page, the lower the picture parsing resolution that may be selected, but it should not be lower than the minimum picture resolution set by the system; otherwise the visual quality of the picture may be affected.
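The text above states only that the resolution should fall as the page count or page size grows, with the system minimum as a floor; it does not fix a particular curve. One possible policy, shown purely as an assumption, scales the default resolution in inverse proportion to how far the measured value exceeds its threshold. All names and the 72 DPI floor are hypothetical.

```python
def scaled_resolution(value: float,
                      threshold: float,
                      default_dpi: float = 150.0,
                      lowest_dpi: float = 72.0) -> float:
    """Lower the resolution as `value` grows past `threshold`.

    `value` may be a page count or a page size; `threshold` must use
    the same unit. Inverse-proportional scaling is an illustrative
    choice, not a rule stated in the source.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if value <= threshold:
        return default_dpi
    candidate = default_dpi * threshold / value
    # Never drop below the minimum resolution set by the system.
    return max(lowest_dpi, candidate)
```

For instance, a 100-page file against a 50-page threshold would be parsed at 75 DPI under this policy, while a 1000-page file would be held at the 72 DPI floor.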
In this embodiment of the invention, the parsed picture may be compressed for further processing. Further processing may include packaging the pictures for transmission to a storage device or for transmission to other processing services (e.g., various processing services such as an artificial intelligence processing service).
Further, according to the embodiment of the invention, two or all three of the following factors can be combined to select an appropriate picture parsing resolution: the identified file size of the PDF file, the number of pages of the PDF file, and the page size of each of one or more file pages of the PDF file. Pictures in the file pages are then parsed at that resolution, which improves the parsing effectiveness of the system and further improves the user experience.
Fig. 2 is a schematic diagram of main steps of another method for parsing a picture in a PDF file according to an embodiment of the present invention. As shown in fig. 2, the main steps of another method for parsing a picture in a PDF file according to an embodiment of the present invention may include the following steps:
Step S201: file information of a PDF file is identified, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages.
Step S202: it is determined whether the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages. If the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages, go to step S203; otherwise, go to step S204.
In addition, before determining whether the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages, the method further includes determining whether there is a limit on the number of pages of the PDF file: in response to determining that there is a limit on the number of pages of the PDF file, determining whether the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages; and in response to determining that there is no limit on the number of pages of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file.
Before determining whether there is a limit on the number of pages of the PDF file, the method further includes determining whether the file size of the PDF file exceeds a PDF file parsing threshold file size: in response to determining that the file size of the PDF file exceeds the PDF file parsing threshold file size, ending the parsing of the PDF file and returning error information; and in response to determining that the file size of the PDF file does not exceed the PDF file parsing threshold file size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
Before determining whether the file size of the PDF file exceeds the PDF file parsing threshold file size, the method further includes determining whether there is a limit on the file size of the PDF file: in response to determining that there is a limit on the file size of the PDF file, determining whether the file size of the PDF file exceeds the PDF file parsing threshold file size; and in response to determining that there is no limit on the file size of the PDF file, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
Step S203: the picture parsing resolution is identified as the lowest resolution for parsing pictures of the PDF file.
Step S204: the picture parsing resolution is identified as the default resolution for parsing pictures of the PDF file.
Step S205: it is determined whether the page size of the PDF file exceeds the PDF file parsing threshold page size. If the page size of the PDF file exceeds the PDF file parsing threshold page size, go to step S206; otherwise, go to step S207.
In addition, before determining whether the page size of the PDF file exceeds the PDF file parsing threshold page size, the method further includes determining whether there is a limit on the page size of the PDF file: in response to determining that there is a limit on the page size of the PDF file, determining whether the page size of the PDF file exceeds the PDF file parsing threshold page size; and in response to determining that there is no limit on the page size of the PDF file, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
Step S206: the picture parsing resolution is identified as 2 × (the lowest resolution for parsing pictures of the PDF file) / (the number of pages of the PDF file).
Step S207: the picture parsing resolution is identified as the default resolution for parsing pictures of the PDF file.
Step S208: the picture is parsed from the file page according to the determined picture parsing resolution to obtain the picture of the file page.
Step S209: it is determined whether the file page is the last page of the PDF file. If the file page is the last page of the PDF file, go to step S210; otherwise, go to step S211.
Further, after determining that the file page is not the last page of the PDF file, the method further includes determining whether there is a limit on the number of read pages of the PDF file: in response to determining that there is a limit on the number of read pages of the PDF file, determining whether the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages; and in response to determining that there is no limit on the number of read pages of the PDF file and that the file page is not the last page of the PDF file, continuing to parse the next file page.
Step S210: the parsing of the PDF file is ended and the parsed pictures are compressed.
Step S211: it is determined whether the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages.
Furthermore, after ending the parsing of the PDF file and compressing the parsed pictures, the method further includes using the obtained pictures for an artificial intelligence service.
The PDF file parsing threshold file size is set automatically by a system or manually by a user.
The PDF file parsing threshold number of pages is set automatically by a system or manually by a user.
The PDF file parsing threshold page size is set automatically by a system or manually by a user.
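Because each threshold may be absent, set by the system, or overridden by a user, one convenient way to model them is a small configuration object in which `None` means "no limit". The sketch below is an assumption about how such settings could be represented; `ParseLimits`, `SYSTEM_LIMITS`, and `with_user_overrides` are hypothetical names, and the 50 MB and 500 KB values are the example figures from the description rather than required defaults.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ParseLimits:
    """Parsing thresholds; None means the corresponding limit is absent."""
    max_file_size: Optional[int] = None   # bytes
    max_pages: Optional[int] = None       # page-count threshold
    max_page_size: Optional[int] = None   # bytes per page
    max_read_pages: Optional[int] = None  # pages to parse before stopping


# Values the system might set automatically (example figures only).
SYSTEM_LIMITS = ParseLimits(max_file_size=50 * 1024 * 1024,
                            max_page_size=500 * 1024)


def with_user_overrides(base: ParseLimits, **overrides) -> ParseLimits:
    """Return a copy of `base` with user-supplied thresholds applied."""
    return replace(base, **overrides)


# A user manually sets a page-count threshold on top of system values.
limits = with_user_overrides(SYSTEM_LIMITS, max_pages=20)
```

Keeping the object immutable means a user override never silently changes the system-wide defaults.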
The present invention is described below using some parameters of the file information of a PDF file, listed in Table 1, as an example.
TABLE 1
The results obtained by applying the method of the present invention with the parameters and default values shown in Table 1 above are given in Table 2 below and compared in detail with the results obtained using the prior art.
TABLE 2
As can be seen from the comparison in Table 2 above, for the same input PDF file, the parsing efficiency of embodiment 1 of the present invention is improved by about 8 times compared with embodiment 1 of the prior art, and the parsing efficiency of embodiment 2 of the present invention is improved by about 2 times compared with embodiment 2 of the prior art.
The method for parsing pictures in a PDF file provided by the embodiment of the invention limits the file size of the PDF file and the number of pages of the PDF file that are parsed, which improves parsing efficiency and the user experience. Moreover, parsing efficiency and the user experience are further improved by determining an appropriate picture parsing resolution based on the file information of the PDF file.
FIG. 3 is a schematic diagram of an apparatus for parsing a picture in a PDF file according to an embodiment of the present invention.
As shown in FIG. 3, an apparatus 300 for parsing a picture in a PDF file according to an embodiment of the present invention includes: a file information identification module 301, a picture parsing resolution determination module 302, a picture parsing module 303, and a picture compression module 304; wherein,
the file information identification module 301 is configured to identify file information of a PDF file, where the file information includes a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages;
the picture parsing resolution determination module 302 is configured to determine, for each of the one or more file pages, a picture parsing resolution for parsing a picture from the file page based on the identified file information of the PDF file;
the picture parsing module 303 is configured to parse the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and
the picture compression module 304 is configured to compress the obtained picture for further processing.
In one embodiment of the present invention, the picture resolution determination module 302 is configured to determine whether the file size of the PDF file exceeds a PDF file resolution threshold file size; determining whether the file size of the PDF file exceeds the file size of a PDF file analysis threshold value, ending analysis of the PDF file and returning error information; and determining that the file size of the PDF file does not exceed the PDF file analysis threshold file size, and identifying the picture analysis resolution as a default resolution for analyzing the picture of the PDF file.
In one embodiment of the present invention, the picture resolution determination module 302 is configured to determine whether there is a limit on the number of pages of the PDF file: determining whether the number of pages of the PDF file exceeds a PDF file resolution threshold number of pages in response to determining that a limit exists on the number of pages of the PDF file; and in response to determining that there is no limit to the number of pages of the PDF file, identifying the picture resolution as a default resolution for resolving pictures of the PDF file.
In one embodiment of the present invention, the picture resolution determination module 302 is configured to determine whether the number of pages of the PDF file exceeds a PDF file resolution threshold number of pages: in response to determining that the number of pages of the PDF file exceeds the PDF file resolution threshold number of pages, identifying the picture resolution as a lowest resolution for resolving pictures of the PDF file; and in response to determining that the number of pages of the PDF file does not exceed the PDF file resolution threshold number of pages, identifying the picture resolution as a default resolution for resolving pictures of the PDF file.
In one embodiment of the present invention, the picture resolution determination module 302 is configured to determine whether there is a limitation on the page size of the PDF file: determining whether a page size of the PDF file exceeds a PDF file resolution threshold page size in response to determining that a limit exists on the page size of the PDF file; and in response to determining that there is no restriction in the page size of the PDF file, identifying the picture resolution as a default resolution for resolving pictures of the PDF file.
In one embodiment of the present invention, the picture resolution determination module 302, when determining whether the page size of the PDF file exceeds the PDF file parsing threshold page size, is further configured to: in response to determining that the page size of the PDF file exceeds the PDF file parsing threshold page size, identify the picture parsing resolution as 2 × (the lowest resolution for parsing pictures of the PDF file) / (the number of pages of the PDF file); and in response to determining that the page size of the PDF file does not exceed the PDF file parsing threshold page size, identify the picture parsing resolution as a default resolution for parsing pictures of the PDF file.
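The page-size rule can be illustrated with a short sketch. Several things here are assumptions rather than part of the embodiment: the function name, the 150 and 72 DPI values, the treatment of page size as a single number (for example an area), and the reading of the formula as (2 × lowest resolution) divided by the number of pages.

```python
from typing import Optional

DEFAULT_DPI = 150.0  # assumed default parsing resolution (hypothetical value)
LOWEST_DPI = 72.0    # assumed lowest parsing resolution (hypothetical value)


def resolution_for_page_size(page_size: float,
                             page_count: int,
                             max_page_size: Optional[float]) -> float:
    """Pick a parsing resolution from the page size.

    max_page_size=None models the "no limit on page size" branch.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if max_page_size is not None and page_size > max_page_size:
        # Oversized page: 2 x lowest resolution / number of pages.
        return 2 * LOWEST_DPI / page_count
    # No limit configured, or the page is within the limit.
    return DEFAULT_DPI
```

With these assumed values, an oversized page in a 4-page file is parsed at 2 × 72 / 4 = 36 DPI, so under this formula the resolution falls as the page count grows.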
According to the device for parsing pictures in a PDF file provided by the embodiment of the invention, parsing efficiency and user experience are improved by limiting the file size and the number of pages of the PDF file. Parsing efficiency and user experience are further improved by determining an appropriate picture parsing resolution based on the file information of the PDF file.
Fig. 4 illustrates an exemplary system architecture 400 of a method for parsing a picture in a PDF file or an apparatus for parsing a picture in a PDF file to which embodiments of the present invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 is used as a medium to provide communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 405 via the network 404 using the terminal devices 401, 402, 403 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 401, 402, 403.
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server providing support for shopping-type websites browsed by the user using the terminal devices 401, 402, 403. The background management server can analyze and otherwise process received data, such as a product information query request, and feed back the processing results (such as target push information and product information) to the terminal devices.
It should be noted that, the method for parsing a picture in a PDF file provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the device for parsing a picture in a PDF file is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 509, and/or installed from the removable medium 511. The above-described functions defined in the system of the present invention are performed when the computer program is executed by the Central Processing Unit (CPU) 501.
The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may, for example, be described as: a processor comprising a file information identification module, a picture resolution determination module, a picture parsing module, and a picture compression module. In some cases the names of these modules do not limit the modules themselves; for example, the picture resolution determination module may also be described as a "module that determines a picture parsing resolution".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform operations including: identifying file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages; and, for each of the one or more file pages: determining a picture parsing resolution for parsing a picture from the file page based on the identified file information of the PDF file; parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and compressing the obtained picture for further processing.
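The per-page flow listed above (identify file information, choose a resolution, parse the page into a picture, compress it) can be sketched as a simple loop. Every name in this sketch is hypothetical: `FileInfo`, `parse_pictures`, and the three callables stand in for whatever resolution policy, PDF renderer, and image compressor an implementation would actually use. The optional `max_read_pages` argument models the read-page limit recited in the claims; stopping once that many pages have been parsed is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class FileInfo:
    """File information identified up front (field names are assumptions)."""
    file_size: int
    page_count: int
    page_sizes: Sequence[float]


def parse_pictures(info: FileInfo,
                   choose_dpi: Callable[[FileInfo, int], float],
                   render_page: Callable[[int, float], bytes],
                   compress: Callable[[bytes], bytes],
                   max_read_pages: Optional[int] = None) -> List[bytes]:
    """Parse each page into a picture and compress it.

    choose_dpi, render_page and compress are injected so the sketch does not
    depend on any particular PDF or imaging library.
    """
    pictures: List[bytes] = []
    for index in range(info.page_count):
        if max_read_pages is not None and index >= max_read_pages:
            # Read-page limit reached: stop before parsing further pages.
            break
        dpi = choose_dpi(info, index)      # resolution for this page
        picture = render_page(index, dpi)  # parse the page into a picture
        pictures.append(compress(picture)) # compress for further processing
    return pictures
```

Injecting the three steps as callables keeps the loop testable with stubs and mirrors the module split described in this document (resolution determination, picture parsing, picture compression).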
According to the technical solution provided by the embodiment of the invention, parsing efficiency and user experience are improved by limiting the file size of the PDF file and the number of pages of the PDF file that are parsed. Parsing efficiency and user experience are further improved by determining an appropriate picture parsing resolution based on the file information of the PDF file.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (18)

1. A method for parsing a picture in a PDF file, comprising:
identifying file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages;
for each of the one or more file pages:
determining a picture parsing resolution for parsing a picture from the file page based on the identified file information of the PDF file;
parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page; and
compressing the obtained picture for further processing;
the method further comprising:
determining whether there is a limit on the file size of the PDF file:
in response to determining that there is a limit on the file size of the PDF file, determining whether the file size of the PDF file exceeds a PDF file parsing threshold file size; and
in response to determining that there is no limit on the file size of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
the method further comprising:
determining whether there is a limit on the number of pages of the PDF file:
in response to determining that there is a limit on the number of pages of the PDF file, determining whether the number of pages of the PDF file exceeds a PDF file parsing threshold number of pages; and
in response to determining that there is no limit on the number of pages of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
the method further comprising:
determining whether there is a limit on the page size of the PDF file:
in response to determining that there is a limit on the page size of the PDF file, determining whether the page size of the PDF file exceeds a PDF file parsing threshold page size; and
in response to determining that there is no limit on the page size of the PDF file, identifying the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
wherein determining whether there is a limit on the number of read pages of the PDF file further comprises:
in response to determining that there is a limit on the number of read pages of the PDF file, determining whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages; and
in response to determining that there is no limit on the number of read pages of the PDF file and that the file page is not the last page of the PDF file, continuing to parse the next file page.
2. The method of claim 1, wherein determining whether the file size of the PDF file exceeds a PDF file parsing threshold file size further comprises:
in response to determining that the file size of the PDF file exceeds the PDF file parsing threshold file size, ending parsing of the PDF file and returning error information; and
in response to determining that the file size of the PDF file does not exceed the PDF file parsing threshold file size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
3. The method of claim 1, wherein determining whether the number of pages of the PDF file exceeds a PDF file parsing threshold number of pages further comprises:
in response to determining that the number of pages of the PDF file exceeds the PDF file parsing threshold number of pages, identifying the picture parsing resolution as a lowest resolution for parsing pictures of the PDF file; and
in response to determining that the number of pages of the PDF file does not exceed the PDF file parsing threshold number of pages, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
4. The method of claim 1, wherein determining whether the page size of the PDF file exceeds a PDF file parsing threshold page size further comprises:
in response to determining that the page size of the PDF file exceeds the PDF file parsing threshold page size, identifying the picture parsing resolution as 2 × (the lowest resolution for parsing pictures of the PDF file) / (the number of pages of the PDF file); and
in response to determining that the page size of the PDF file does not exceed the PDF file parsing threshold page size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
5. The method of claim 3, further comprising:
determining whether there is a limit on the page size of the PDF file:
in response to determining that there is a limit on the page size of the PDF file, determining whether the page size of the PDF file exceeds a PDF file parsing threshold page size; and
in response to determining that there is no limit on the page size of the PDF file, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
6. The method of claim 5, wherein determining whether the page size of the PDF file exceeds a PDF file parsing threshold page size further comprises:
in response to determining that the page size of the PDF file exceeds the PDF file parsing threshold page size, identifying the picture parsing resolution as 2 × (the lowest resolution for parsing pictures of the PDF file) / (the number of pages of the PDF file); and
in response to determining that the page size of the PDF file does not exceed the PDF file parsing threshold page size, identifying the picture parsing resolution as the default resolution for parsing pictures of the PDF file.
7. The method of claim 1, further comprising, after parsing the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page:
determining whether the file page is the last page of the PDF file.
8. The method of claim 7, wherein determining whether the file page is the last page of the PDF file further comprises:
in response to determining that the file page is the last page of the PDF file, ending the parsing of the PDF file; and
in response to determining that the file page is not the last page of the PDF file, determining whether there is a limit on the number of read pages of the PDF file.
9. The method of claim 1, wherein determining whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages further comprises:
in response to determining that the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages, ending the parsing; and
in response to determining that the number of pages of the PDF file that have been parsed does not exceed the PDF file parsing threshold number of read pages and that the file page is not the last page of the PDF file, continuing to parse the next file page.
10. The method of claim 8, wherein determining whether there is a limit on the number of read pages of the PDF file further comprises:
in response to determining that there is a limit on the number of read pages of the PDF file, determining whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages; and
in response to determining that there is no limit on the number of read pages of the PDF file and that the file page is not the last page of the PDF file, continuing to parse the next file page.
11. The method of claim 10, wherein determining whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages further comprises:
in response to determining that the number of pages of the PDF file that have been parsed exceeds the PDF file parsing threshold number of read pages, ending the parsing; and
in response to determining that the number of pages of the PDF file that have been parsed does not exceed the PDF file parsing threshold number of read pages, continuing to parse the next file page.
12. The method of claim 1, further comprising:
using the obtained picture for an artificial intelligence service.
13. The method of claim 1 or 2, wherein the PDF file parsing threshold file size is set automatically by a system or manually by a user.
14. The method of claim 1 or 3, wherein the PDF file parsing threshold number of pages is set automatically by a system or manually by a user.
15. The method of claim 1 or 4, wherein the PDF file parsing threshold page size is set automatically by a system or manually by a user.
16. An apparatus for parsing a picture in a PDF file, comprising: a file information identification module, a picture resolution determination module, a picture parsing module, and a picture compression module; wherein
the file information identification module is configured to identify file information of a PDF file, the file information including a file size of the PDF file, a number of pages of the PDF file, and a page size of each of one or more file pages;
the picture resolution determination module is configured to determine a picture parsing resolution for parsing a picture from the file page based on the identified file information of the PDF file;
the picture parsing module is configured to parse the picture from the file page according to the determined picture parsing resolution to obtain the picture of the file page;
the picture compression module is configured to compress the obtained picture for further processing;
the picture resolution determination module is further configured to:
determine whether there is a limit on the file size of the PDF file:
in response to determining that there is a limit on the file size of the PDF file, determine whether the file size of the PDF file exceeds a PDF file parsing threshold file size; and
in response to determining that there is no limit on the file size of the PDF file, identify the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
the picture resolution determination module is further configured to:
determine whether there is a limit on the number of pages of the PDF file:
in response to determining that there is a limit on the number of pages of the PDF file, determine whether the number of pages of the PDF file exceeds a PDF file parsing threshold number of pages; and
in response to determining that there is no limit on the number of pages of the PDF file, identify the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
the picture resolution determination module is further configured to:
determine whether there is a limit on the page size of the PDF file:
in response to determining that there is a limit on the page size of the PDF file, determine whether the page size of the PDF file exceeds a PDF file parsing threshold page size; and
in response to determining that there is no limit on the page size of the PDF file, identify the picture parsing resolution as a default resolution for parsing pictures of the PDF file;
the picture resolution determination module is further configured to:
in response to determining that there is a limit on the number of read pages of the PDF file, determine whether the number of pages of the PDF file that have been parsed exceeds a PDF file parsing threshold number of read pages; and
in response to determining that there is no limit on the number of read pages of the PDF file and that the file page is not the last page of the PDF file, continue to parse the next file page.
17. An electronic device for parsing a picture in a PDF file, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-15.
18. A computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-15.
CN202010871131.2A 2020-08-26 2020-08-26 Method and device for analyzing pictures in PDF (portable document format) file Active CN112069771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010871131.2A CN112069771B (en) 2020-08-26 2020-08-26 Method and device for analyzing pictures in PDF (portable document format) file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010871131.2A CN112069771B (en) 2020-08-26 2020-08-26 Method and device for analyzing pictures in PDF (portable document format) file

Publications (2)

Publication Number Publication Date
CN112069771A CN112069771A (en) 2020-12-11
CN112069771B true CN112069771B (en) 2024-05-28

Family

ID=73658922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010871131.2A Active CN112069771B (en) 2020-08-26 2020-08-26 Method and device for analyzing pictures in PDF (portable document format) file

Country Status (1)

Country Link
CN (1) CN112069771B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12175183B2 (en) 2020-11-16 2024-12-24 Issu, Inc. Device dependent rendering of PDF content including multiple articles and a table of contents
US12248747B2 (en) * 2020-11-16 2025-03-11 Issuu, Inc. Device dependent rendering of PDF content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441713A (en) * 2007-11-19 2009-05-27 汉王科技股份有限公司 Optical character recognition method and apparatus of PDF document
CN103197908A (en) * 2013-04-09 2013-07-10 广东粤铁瀚阳科技有限公司 Information display platform based PDF (portable document format) file display method and information display platform based PDF file display system
CN109271613A (en) * 2018-09-25 2019-01-25 四川译讯信息科技有限公司 A kind of pdf document analytic method
CN109582654A (en) * 2018-11-30 2019-04-05 万兴科技股份有限公司 PDF document compression method, device, computer equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8660373B2 (en) * 2008-07-22 2014-02-25 Xerox Corporation PDF de-chunking and object classification
US9177142B2 (en) * 2011-10-14 2015-11-03 Trustwave Holdings, Inc. Identification of electronic documents that are likely to contain embedded malware

Also Published As

Publication number Publication date
CN112069771A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN110362372B (en) Page translation method, device, medium and electronic equipment
US20170109371A1 (en) Method and Apparatus for Processing File in a Distributed System
US11080322B2 (en) Search methods, servers, and systems
CN113704222B (en) A method and device for processing a service request
CN113220981A (en) Method and device for optimizing cache
CN112069771B (en) Method and device for analyzing pictures in PDF (portable document format) file
CN113011201A (en) Code file processing method and device
CN110647327B (en) Method and device for dynamic control of user interface based on card
CN112214250B (en) Application program component loading method and device
CN111427899A (en) Method, device, equipment and computer readable medium for storing file
CN105187562A (en) System and method for operating remote file
CN113779018A (en) A data processing method and device
CN116304403A (en) Web page access method, device, computer equipment and storage medium
CN113141403B (en) Log transmission method and device
CN113760965B (en) Data query method and device
CN112699116B (en) A data processing method and system
CN112783615B (en) A cleaning method and device for data processing tasks
CN113703760A (en) Page jump control method and device
CN113079165B (en) Access processing method and device
CN112487765B (en) Method and device for generating notification text
CN117648509A (en) A rendering data processing method and device
CN113761433B (en) Service processing method and device
CN112667627B (en) A data processing method and device
CN115495316A (en) Management method and device for historical page maintenance records
CN112131287B (en) A method and device for reading data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220920
Address after: 25 Financial Street, Xicheng District, Beijing 100033
Applicant after: CHINA CONSTRUCTION BANK Corp.
Address before: 25 Financial Street, Xicheng District, Beijing 100033
Applicant before: CHINA CONSTRUCTION BANK Corp.; Jianxin Financial Science and Technology Co.,Ltd.
GR01 Patent grant