CN109656733B - Method and device for intelligent scheduling of multi-OCR recognition engines - Google Patents

Method and device for intelligent scheduling of multi-OCR recognition engines

Publication number
CN109656733B
Authority
CN
China
Prior art keywords
ocr
queue
recognition engine
file
ocr recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811615258.7A
Other languages
Chinese (zh)
Other versions
CN109656733A (en
Inventor
陈辉鑫
周文贵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Shangji Network Technology Co ltd
Original Assignee
Xiamen Shangji Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Shangji Network Technology Co ltd
Priority to CN201811615258.7A
Publication of CN109656733A
Application granted
Publication of CN109656733B
Current legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for intelligently scheduling multiple OCR engines, comprising the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file; determining, according to the type of the file, the consumption type of the OCR recognition engine that will process it, wherein an OCR queue is created for each consumption type of OCR recognition engine and there is at least one OCR recognition engine of each consumption type; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first performing timeout detection to judge whether the OCR recognition engine of that consumption type has timed out; if it has not timed out, the file enters the corresponding OCR queue; if it has timed out, the file cannot be enqueued and the user is notified. The advantages of the invention are that it makes full use of the strengths of the OCR recognition engines of each consumption type to achieve high-performance, high-accuracy recognition, schedules the OCR recognition engines in an orderly manner, and prevents queue congestion, system errors and engine paralysis.

Figure 201811615258

Description

Method and equipment for intelligently scheduling multiple OCR recognition engines
Technical Field
The invention relates to a method and equipment for intelligently scheduling multiple OCR recognition engines, belonging to the field of OCR recognition.
Background
Many OCR recognition engines are currently on the market, but their consumption types differ; that is, each engine has a different focus: some focus on recognizing full text, some on recognizing fragments, some on recognition accuracy, and some on recognition performance. When a batch of bill documents containing both full text and fragments must be processed at the same time, how to route each type of bill to a matching OCR recognition engine in a reasonable and orderly way, while preventing recognition tasks from piling up and the program from failing, is a technical problem that urgently needs to be solved.
Chinese invention patent publication No. CN106326741A, "Malicious program detection method and apparatus based on multi-engine system", discloses a multi-engine system comprising a plurality of engines, each of which has a processing type it is best suited to. The disclosed method comprises the following steps: analyzing the type of the program to be tested; determining, according to that type and the best-suited processing types of the plurality of engines, a first engine suited to processing the program to be tested; and detecting the program to be tested with the first engine, taking the detection result of the first engine as the detection result of the multi-engine system. The disclosed scheme only selects a first engine suited to the program to be tested; it does not address how multiple engines can process batches of multi-type texts in an orderly manner while ensuring that the type of each text matches the type that an OCR recognition engine is best suited to process.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for intelligently scheduling multiple OCR engines. The method sorts different types of files to be recognized into different OCR queues according to user requirements, and then consumes the tasks in each queue according to the characteristics of each OCR recognition engine, making full use of the strengths of each engine. It also performs a timeout judgment before a file to be recognized is enqueued, which prevents the queues from becoming congested and achieves high-performance, high-accuracy recognition.
The first technical scheme of the invention is as follows:
A method for intelligently scheduling multiple OCR engines, comprising the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file; determining, according to the type of the file, the consumption type of the OCR recognition engine that will process it, wherein an OCR queue is created for each consumption type of OCR recognition engine and there is at least one OCR recognition engine of each consumption type; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first performing timeout detection to judge whether the OCR recognition engine of that consumption type has timed out; if it has not timed out, the file enters the corresponding OCR queue; if it has timed out, the file cannot be enqueued and the user is notified.
More preferably, the timeout detection is: comparing the estimated time consumption of the file to be recognized, for the OCR recognition engine of the corresponding consumption type, with a preset task timeout to determine whether the OCR queue has timed out.
More preferably, the timeout detection is: multiplying the average time consumption of the OCR recognition engines of a consumption type by (the number of files to be recognized in the list of the corresponding OCR queue + the minimum task count of the OCR recognition engines of that consumption type) to obtain the estimated time consumption for that consumption type, and comparing the estimated time consumption with the preset task timeout to determine whether the OCR queue has timed out; each OCR recognition engine has its own task list, and the minimum task count is the number of tasks in the shortest task list among the OCR recognition engines of the same consumption type.
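The estimate described above can be written compactly. The symbols below are illustrative names introduced here, not notation from the patent, and the strict inequality is an assumption, since the text only says that the two values are compared:

```latex
T_{\mathrm{est}} = \bar{t}\,\bigl(N_{\mathrm{queue}} + M_{\min}\bigr),
\qquad
M_{\min} = \min_{i}\, m_i,
\qquad
\text{timed out} \iff T_{\mathrm{est}} > T_{\mathrm{timeout}}
```

Here $\bar{t}$ is the average time consumption of the consumption type, $N_{\mathrm{queue}}$ is the number of files waiting in its OCR queue, $m_i$ is the number of tasks in the task list of the $i$-th engine of that type, and $T_{\mathrm{timeout}}$ is the preset task timeout.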
Preferably, the average time consumption of the OCR recognition engines of each consumption type is determined as follows: an initial value is preset for the OCR recognition engines of each consumption type; the recognition time consumption of each OCR recognition engine over time period A is sampled periodically; the recognition time consumption and the total number of recognition tasks of each OCR recognition engine over time period B are then taken and averaged to obtain the per-task time consumption of each OCR recognition engine; among the OCR recognition engines of the same consumption type, the smallest per-task time consumption is taken as the minimum time consumption of that consumption type; finally, the minimum time consumption is compared with the preset initial value: if it is greater than the preset initial value, the preset initial value is taken as the average time consumption of the OCR recognition engines, and if it is smaller, the minimum time consumption is taken as the average time consumption; time period A is shorter than time period B.
Preferably, the method for intelligently scheduling multiple OCR engines further comprises creating an OCR queue consumer that continuously monitors the OCR queues, continuously turns queue members into tasks, and distributes the tasks to the corresponding OCR recognition engines, as follows: acquiring the numerical value of each OCR recognition engine, each numerical value being preset; reading the list of the OCR queue corresponding to the OCR recognition engines of each consumption type, together with the version number of that list, and determining the number of queue members in each OCR queue that can be used to create tasks: if the number of files to be recognized in the OCR queue is greater than the numerical value, the number of queue members that can be used to create tasks equals the numerical value; otherwise it equals the number of files to be recognized in the OCR queue; judging whether the load of an OCR recognition engine exceeds its preset maximum load, the load comprising the number of tasks in the engine's task list plus the number of queue members that can be used to create tasks; if it does, no queue members are turned into tasks; if it does not, checking whether the version number of the OCR queue list that was read matches the current version number of that list; if they match, updating the state of the queue members used to create tasks and the version number of the current OCR queue list; if they do not match, abandoning task creation; and acquiring the queue members whose state has been updated and inserting them into the task list of the OCR recognition engine as tasks.
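The consumer steps above resemble an optimistic version check. The following Python sketch illustrates that reading under stated assumptions: the class and function names (`OcrQueue`, `Engine`, `plan_batch`, `commit_batch`), the name `batch_size` for the preset per-engine value, and the modelling of the "dequeued" state as removal from the list are all hypothetical and are not taken from the patent.

```python
class OcrQueue:
    """One OCR queue: a list of waiting files plus a list version number."""
    def __init__(self, files):
        self.files = list(files)
        self.version = 0


class Engine:
    """One OCR recognition engine with its own task list."""
    def __init__(self, batch_size, max_load):
        self.batch_size = batch_size  # preset per-engine value (hypothetical name)
        self.max_load = max_load      # preset maximum load
        self.tasks = []


def plan_batch(queue, engine):
    """Read the queue and decide which members could become tasks.

    Returns (version_read, batch), or None if nothing should be created.
    """
    take = min(len(queue.files), engine.batch_size)
    # Load = tasks already held + members about to be turned into tasks.
    if take == 0 or len(engine.tasks) + take > engine.max_load:
        return None
    return queue.version, queue.files[:take]


def commit_batch(queue, engine, version_read, batch):
    """Create the tasks only if the queue list has not changed since it was read.

    In a real deployment this compare-and-update would have to be atomic
    (for example inside a transaction); this sketch is single-threaded.
    """
    if version_read != queue.version:
        return False                  # another consumer got there first: abandon
    del queue.files[:len(batch)]      # mark members as dequeued
    queue.version += 1                # bump the list version
    engine.tasks.extend(batch)        # insert the members as tasks
    return True
```

If two engines of the same consumption type both plan a batch from the same snapshot, only the first commit succeeds; the second sees a changed version number and abandons its batch, so no queue member becomes a task twice.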
Preferably, the OCR recognition engines of each consumption type periodically perform the timeout detection so that congestion problems are discovered promptly.
Preferably, the method for intelligently scheduling multiple OCR engines further comprises a priority setting step for intelligently scheduling the files to be recognized, using one or any combination of the following setting modes: 1) setting task priority: the priority of a file to be recognized is set through a management interface, or a priority is set for a user, in which case a file submitted by that user is automatically marked with the user's priority; 2) configuring, through a management system, the priority of the OCR queue corresponding to the OCR recognition engines of each consumption type; 3) setting priority according to task aging time: within the same OCR queue, files with an earlier aging time are recognized first; 4) setting priority according to recognition type: within the same OCR queue, priority is determined by the type of the file to be recognized; 5) setting priority according to enqueue time: the earlier the enqueue time, the higher the priority, so that recognition within the same OCR queue follows a first-in, first-out principle; when several setting modes are combined, a priority is also set for each mode.
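One way to combine several of these setting modes within a single OCR queue is a composite sort key whose field order reflects the priority assigned to each mode. The sketch below is an illustration only: the `PendingFile` fields, the convention that a smaller number means higher priority, and the default mode order are assumptions, not details given in the patent.

```python
from dataclasses import dataclass


@dataclass
class PendingFile:
    name: str
    task_priority: int   # mode 1: smaller = more important (assumed convention)
    aging_time: float    # mode 3: earlier aging time is recognized first
    type_rank: int       # mode 4: rank derived from the file type (assumed)
    enqueued_at: float   # mode 5: earlier enqueue time first (FIFO)


def order_queue(files, mode_order=("task_priority", "aging_time",
                                   "type_rank", "enqueued_at")):
    """Sort one queue; mode_order states which mode outranks which."""
    return sorted(files, key=lambda f: tuple(getattr(f, m) for m in mode_order))


if __name__ == "__main__":
    queue = [
        PendingFile("a", task_priority=1, aging_time=50, type_rank=0, enqueued_at=3),
        PendingFile("b", task_priority=0, aging_time=90, type_rank=1, enqueued_at=2),
        PendingFile("c", task_priority=0, aging_time=90, type_rank=1, enqueued_at=1),
    ]
    print([f.name for f in order_queue(queue)])
    print([f.name for f in order_queue(queue, ("aging_time", "enqueued_at"))])
```

Changing `mode_order` changes which mode wins when two modes disagree, which is one possible interpretation of setting "the priority of each mode".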
Preferably, the consumption types of the OCR recognition engines are emergency, fragment, full text and general, namely an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine; an OCR queue is created for each consumption type, giving an emergency queue, a fragment queue, a full-text queue and a general queue, with queue priority from high to low: emergency queue, fragment queue, full-text queue, general queue; the OCR application interface is called, specifying the type of the file to be recognized and the consumption type of the OCR recognition engine that will process it; the best-matching OCR recognition engine for that type of file is determined as follows: judging whether the priority of the file to be recognized is urgent; if so, the emergency OCR recognition engine performs timeout detection: if it has not timed out, the file enters the emergency queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not urgent, proceeding to the next judgment; judging whether the file to be recognized is a fragment file; if so, the fragment OCR recognition engine performs timeout detection: if it has not timed out, the file enters the fragment queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not a fragment file, proceeding to the next judgment; judging whether the file to be recognized is a full-text file; if so, the full-text OCR recognition engine performs timeout detection: if it has not timed out, the file enters the full-text queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not a full-text file, the general OCR recognition engine performs timeout detection: if it has not timed out, the file enters the general queue, and if it has timed out, the file cannot be enqueued and the user is notified.
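The four-way judgment above is a simple cascade. A minimal Python sketch follows, assuming string labels for the consumption types and a caller-supplied `timed_out` predicate; these names, and the shape of the returned tuple, are illustrative rather than taken from the patent.

```python
# Queue priority from high to low, as stated in the text.
QUEUE_PRIORITY = ["emergency", "fragment", "full_text", "general"]


def choose_consumption_type(is_urgent, file_kind):
    """Walk the judgments in order: urgent, fragment, full text, else general."""
    if is_urgent:
        return "emergency"
    if file_kind == "fragment":
        return "fragment"
    if file_kind == "full_text":
        return "full_text"
    return "general"


def route(is_urgent, file_kind, timed_out):
    """Return (queue_name, None) on success or (None, message) on rejection.

    timed_out is a hypothetical callable: consumption type -> bool.
    A timed-out target rejects the file; it does not fall through to
    the next queue.
    """
    target = choose_consumption_type(is_urgent, file_kind)
    if timed_out(target):
        return None, "notify user: %s queue timed out" % target
    return target, None
```

Note that an urgent fragment file goes to the emergency queue, because the urgency judgment comes first.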
The invention also provides equipment for intelligently scheduling the multiple OCR engines.
The second technical scheme of the invention is as follows:
An apparatus for intelligently scheduling multiple OCR engines, comprising a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file; determining, according to the type of the file, the consumption type of the OCR recognition engine that will process it, wherein an OCR queue is created for each consumption type of OCR recognition engine and there is at least one OCR recognition engine of each consumption type; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first performing timeout detection to judge whether the OCR recognition engine of that consumption type has timed out; if it has not timed out, the file enters the corresponding OCR queue; if it has timed out, the file cannot be enqueued and the user is notified.
More preferably, the timeout detection is: comparing the estimated time consumption of the file to be recognized, for the OCR recognition engine of the corresponding consumption type, with a preset task timeout to determine whether the OCR queue has timed out.
More preferably, the timeout detection is: multiplying the average time consumption of the OCR recognition engines of a consumption type by (the number of files to be recognized in the list of the corresponding OCR queue + the minimum task count of the OCR recognition engines of that consumption type) to obtain the estimated time consumption for that consumption type, and comparing the estimated time consumption with the preset task timeout to determine whether the OCR queue has timed out; each OCR recognition engine has its own task list, and the minimum task count is the number of tasks in the shortest task list among the OCR recognition engines of the same consumption type.
Preferably, the average time consumption of the OCR recognition engines of each consumption type is determined as follows: an initial value is preset for the OCR recognition engines of each consumption type; the recognition time consumption of each OCR recognition engine over time period A is sampled periodically; the recognition time consumption and the total number of recognition tasks of each OCR recognition engine over time period B are then taken and averaged to obtain the per-task time consumption of each OCR recognition engine; among the OCR recognition engines of the same consumption type, the smallest per-task time consumption is taken as the minimum time consumption of that consumption type; finally, the minimum time consumption is compared with the preset initial value: if it is greater than the preset initial value, the preset initial value is taken as the average time consumption of the OCR recognition engines, and if it is smaller, the minimum time consumption is taken as the average time consumption; time period A is shorter than time period B.
Preferably, the instructions, once loaded by the processor, further perform the following steps: creating an OCR queue consumer that continuously monitors the OCR queues, continuously turns queue members into tasks, and distributes the tasks to the corresponding OCR recognition engines, as follows: acquiring the numerical value of each OCR recognition engine, each numerical value being preset; reading the list of the OCR queue corresponding to the OCR recognition engines of each consumption type, together with the version number of that list, and determining the number of queue members in each OCR queue that can be used to create tasks: if the number of files to be recognized in the OCR queue is greater than the numerical value, the number of queue members that can be used to create tasks equals the numerical value; otherwise it equals the number of files to be recognized in the OCR queue; judging whether the load of an OCR recognition engine exceeds its preset maximum load, the load comprising the number of tasks in the engine's task list plus the number of queue members that can be used to create tasks; if it does, no queue members are turned into tasks; if it does not, checking whether the version number of the OCR queue list that was read matches the current version number of that list; if they match, updating the state of the queue members used to create tasks and the version number of the current OCR queue list; if they do not match, abandoning task creation; and acquiring the queue members whose state has been updated and inserting them into the task list of the OCR recognition engine as tasks.
Preferably, the OCR recognition engines of each consumption type periodically perform the timeout detection so that congestion problems are discovered promptly.
Preferably, the instructions, once loaded by the processor, further perform a priority setting step for intelligently scheduling the files to be recognized, using one or any combination of the following setting modes: 1) setting task priority: the priority of a file to be recognized is set through a management interface, or a priority is set for a user, in which case a file submitted by that user is automatically marked with the user's priority; 2) configuring, through a management system, the priority of the OCR queue corresponding to the OCR recognition engines of each consumption type; 3) setting priority according to task aging time: within the same OCR queue, files with an earlier aging time are recognized first; 4) setting priority according to recognition type: within the same OCR queue, priority is determined by the type of the file to be recognized; 5) setting priority according to enqueue time: the earlier the enqueue time, the higher the priority, so that recognition within the same OCR queue follows a first-in, first-out principle; when several setting modes are combined, a priority is also set for each mode.
Preferably, the consumption types of the OCR recognition engines are emergency, fragment, full text and general, namely an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine; an OCR queue is created for each consumption type, giving an emergency queue, a fragment queue, a full-text queue and a general queue, with queue priority from high to low: emergency queue, fragment queue, full-text queue, general queue; the processor performs the following steps: calling the OCR application interface, specifying the type of the file to be recognized and the consumption type of the OCR recognition engine that will process it; determining the best-matching OCR recognition engine for that type of file, specifically: judging whether the priority of the file to be recognized is urgent; if so, the emergency OCR recognition engine performs timeout detection: if it has not timed out, the file enters the emergency queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not urgent, proceeding to the next judgment; judging whether the file to be recognized is a fragment file; if so, the fragment OCR recognition engine performs timeout detection: if it has not timed out, the file enters the fragment queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not a fragment file, proceeding to the next judgment; judging whether the file to be recognized is a full-text file; if so, the full-text OCR recognition engine performs timeout detection: if it has not timed out, the file enters the full-text queue, and if it has timed out, the file cannot be enqueued and the user is notified; if the file is not a full-text file, the general OCR recognition engine performs timeout detection: if it has not timed out, the file enters the general queue, and if it has timed out, the file cannot be enqueued and the user is notified.
The invention has the following beneficial effects:
1. The method and equipment for intelligently scheduling multiple OCR engines determine whether the OCR recognition engine matched to a file to be recognized has timed out, make full use of the strengths of the OCR recognition engines of each consumption type to achieve high-performance, high-accuracy recognition, schedule the OCR recognition engines in an orderly manner, and prevent system errors, engine paralysis and similar failures caused by queue congestion;
2. In the method and equipment for intelligently scheduling multiple OCR engines, the timeout detection accurately reflects the current time consumption of the OCR recognition engines and helps ensure stable operation of the system;
3. In the method and equipment for intelligently scheduling multiple OCR engines, the OCR queue consumer detects and controls the load of the OCR recognition engines, which prevents the engines from being paralyzed by excessive load; it also checks the version number of the OCR queue list and updates the queue member state and the list version number, which effectively prevents a queue member from being created as a task more than once across multiple OCR recognition engines of the same consumption type;
4. In the method and equipment for intelligently scheduling multiple OCR engines, the OCR recognition engines of each consumption type periodically perform timeout detection, which ensures the normal operation of each OCR recognition engine and allows congestion problems to be discovered and handled promptly;
5. The method and equipment for intelligently scheduling multiple OCR engines also provide multiple priority setting modes to realize intelligent scheduling of tasks;
6. The method and equipment for intelligently scheduling multiple OCR engines can intelligently schedule an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine at the same time, scheduling them in order according to queue priority, so that high-performance, high-accuracy OCR recognition is achieved.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a schematic flow chart of the present invention;
FIG. 3 is a schematic diagram of the average elapsed time flow of the present invention;
FIG. 4 is a flow chart of the OCR queue consuming side of the present invention;
FIG. 5 is a flow diagram of the present invention employing four consuming type OCR recognition engines.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example one
Referring to FIG. 1 and FIG. 2, a method for intelligently scheduling multiple OCR engines includes the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file; determining, according to the type of the file, the consumption type of the OCR recognition engine that will process it, wherein an OCR queue is created for each consumption type of OCR recognition engine and there is at least one OCR recognition engine of each consumption type; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first performing timeout detection to judge whether the OCR recognition engine of that consumption type has timed out; if it has not timed out, the file enters the corresponding OCR queue; if it has timed out, the file cannot be enqueued and the user is notified. Determining the matching OCR recognition engine and judging whether it has timed out before enqueuing makes full use of the strengths of the OCR recognition engines of each consumption type to achieve high-performance, high-accuracy recognition, allows the OCR recognition engines to be scheduled in order, and prevents system errors, engine paralysis and similar failures caused by queue congestion.
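The enqueue flow of this embodiment can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Scheduler` class, a caller-supplied mapping from file type to consumption type, and a pluggable `timeout_check` callable; none of these names come from the patent, and user notification is modelled as a simple in-memory list.

```python
from collections import deque


class Scheduler:
    """Keeps one OCR queue per consumption type and gates enqueueing."""

    def __init__(self, type_to_consumption, timeout_check):
        # type_to_consumption: file type -> consumption type (assumed mapping)
        # timeout_check: (consumption type, queue) -> True if timed out
        self.type_to_consumption = type_to_consumption
        self.timeout_check = timeout_check
        self.queues = {c: deque() for c in set(type_to_consumption.values())}
        self.notifications = []

    def submit(self, file_id, file_type):
        """Enqueue the file, or reject it and record a user notification."""
        consumption = self.type_to_consumption[file_type]
        queue = self.queues[consumption]
        if self.timeout_check(consumption, queue):
            self.notifications.append((file_id, "rejected: queue timed out"))
            return False
        queue.append(file_id)
        return True


if __name__ == "__main__":
    scheduler = Scheduler(
        {"invoice": "full_text", "stamp": "fragment"},
        timeout_check=lambda consumption, queue: len(queue) >= 1,  # toy rule
    )
    print(scheduler.submit("a", "invoice"))   # accepted
    print(scheduler.submit("b", "invoice"))   # rejected by the toy rule
    print(scheduler.notifications)
```

The toy `timeout_check` simply rejects once a queue holds one file; in the method described here it would be the estimated-time comparison.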
The timeout detection is: comparing the estimated time consumption of the file to be recognized, for the OCR recognition engine of the corresponding consumption type, with a preset task timeout to determine whether the OCR queue has timed out. Specifically, the average time consumption of the OCR recognition engines of a consumption type is multiplied by (the number of files to be recognized in the list of the corresponding OCR queue + the minimum task count of the OCR recognition engines of that consumption type) to obtain the estimated time consumption for that consumption type, and the estimated time consumption is compared with the preset task timeout to determine whether the OCR queue has timed out. Each OCR recognition engine has its own task list, and the minimum task count is the number of tasks in the shortest task list among the OCR recognition engines of the same consumption type.
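As a small illustration of this check, the Python sketch below computes the estimate and compares it with the timeout. The function and parameter names are hypothetical, and treating "timed out" as a strict greater-than comparison is an assumption, since the text only says the two values are compared.

```python
def estimated_time(avg_seconds, queued_files, engine_task_counts):
    """Average time x (files waiting in the queue + shortest engine task list).

    engine_task_counts holds the task-list length of every engine of the
    same consumption type; at least one engine is required.
    """
    if not engine_task_counts:
        raise ValueError("at least one engine per consumption type is required")
    return avg_seconds * (queued_files + min(engine_task_counts))


def is_timed_out(avg_seconds, queued_files, engine_task_counts, task_timeout_seconds):
    """Assumed rule: timed out when the estimate exceeds the preset timeout."""
    estimate = estimated_time(avg_seconds, queued_files, engine_task_counts)
    return estimate > task_timeout_seconds
```

For example, with an average of 0.5 seconds per task, 10 files waiting and engines holding 4, 2 and 7 tasks, the estimate is 0.5 × (10 + 2) = 6 seconds, so a 5-second task timeout would reject the new file under this assumed rule.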
Referring to FIG. 3, there are several methods for determining the average time consumption of the OCR recognition engines. In the first method, an initial value is preset for the OCR recognition engines of each consumption type. The recognition time consumption of each OCR recognition engine over time period A is sampled periodically, for example by sampling, once per minute, the recognition time consumption of each OCR recognition engine over the last 6 hours. The recognition time consumption and the total number of recognition tasks of each OCR recognition engine over time period B are then taken and averaged to obtain the per-task time consumption of each OCR recognition engine, and among the OCR recognition engines of the same consumption type the smallest per-task time consumption is taken as the minimum time consumption of that consumption type. For example, if 3 fragment OCR recognition engines have total time consumption of 10 seconds, 8 seconds and 15 seconds over a 24-hour period, each with 100 completed tasks, their per-task time consumption is 0.1 second, 0.08 second and 0.15 second, and 0.08 second is taken as the minimum time consumption of the fragment OCR recognition engines. Finally, the minimum time consumption is compared with the preset initial value: if it is greater than the preset initial value, the preset initial value is taken as the average time consumption of the OCR recognition engines, and if it is smaller, the minimum time consumption is taken as the average time consumption. Time period A is shorter than time period B, or B = KA, where K is a positive integer.
The larger the time period over which the average time consumption is taken, the more accurate the resulting value. However, repeatedly sampling recognition time consumption over a large period slows data processing, occupies more cache resources, and affects processor performance. The second method is relatively simple and easy to implement, but the average time consumption it yields is less accurate: it is easily affected by bursts of tasks, which make the average time consumption over the period too large.
The first average time consumption determination method can accurately reflect the time consumption situation of the current OCR recognition engine, and provides guarantee for the stable operation of the system.
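The first method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the input layout (one pair of total seconds and completed tasks per engine, accumulated over time period B) and the fallback used when no tasks were completed are hypothetical and are not specified by the embodiment.

```python
def average_time_consumption(initial_value_s, engine_totals):
    """Return the average time consumption for one consumption type.

    engine_totals holds one (total_recognition_seconds, completed_tasks)
    pair per engine of that consumption type, accumulated over period B.
    """
    # Task time consumption value of each engine: total time / task count.
    per_task = [total / count for total, count in engine_totals if count > 0]
    if not per_task:
        # Assumption: with no completed tasks, fall back to the preset value.
        return initial_value_s
    minimum = min(per_task)
    # The preset initial value acts as an upper bound on the result.
    return min(minimum, initial_value_s)


# Figures from the example above; the preset values 0.5 and 0.05 are made up.
fragment_engines = [(10.0, 100), (8.0, 100), (15.0, 100)]
print(average_time_consumption(0.5, fragment_engines))   # minimum value is used
print(average_time_consumption(0.05, fragment_engines))  # preset value is used
```

Taking the smaller of the two values is equivalent to the comparison described above: the measured minimum is used unless it exceeds the preset initial value.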
Referring to fig. 4, this embodiment further creates an OCR queue consuming end, which continuously monitors the OCR queues, creates tasks from queue members, and distributes the tasks to the corresponding OCR recognition engines. Specifically: the fetch value of each OCR recognition engine is acquired, each fetch value being a preset value. The list of the corresponding OCR queue and the version number of that list are read, and the number of queue members that can be used to create tasks in each OCR queue is determined: if the number of files to be recognized in the OCR queue is larger than the fetch value, the number of queue members that can be used to create tasks equals the fetch value; otherwise it equals the number of files to be recognized in the OCR queue. It is then judged whether the load of the OCR recognition engine exceeds the preset maximum bearing range of that engine, where the load comprises the number of tasks in the engine's task list plus the number of queue members that can be used to create tasks. If the load exceeds the range, no queue members are created as tasks. If it does not, the list version number that was read is checked against the current list version number of the OCR queue. If the two are consistent, the state of the queue members used to create tasks and the version number of the current list are updated, for example by setting the member state to "dequeued" and changing the list version from V1.0 to V1.1; if they are inconsistent, task creation is abandoned. Finally, the queue members whose state was updated are acquired and inserted into the task list of the OCR recognition engine in the form of tasks.
For example, 2 fragment OCR recognition engines are used to recognize fragment files, and the fetch value of each fragment OCR recognition engine is 3. The list and version number of the fragment OCR queue are read: the list contains 2 files to be recognized and its version number is V1.0, so the number of queue members that can be used to create tasks in the fragment OCR queue is determined to be 2. Suppose the maximum bearing range of the fragment OCR recognition engines is set to 3-5, the task list of the first fragment OCR recognition engine contains 3 tasks, and the task list of the second contains 5 tasks. The load of the first fragment OCR recognition engine is then the 3 tasks in its task list plus the 2 queue members available for creating tasks, i.e. 5, which is within the maximum bearing range. Therefore, at update time it is checked whether the version of the current list of the OCR queue is still V1.0; if so, the two files to be recognized in the list are marked as "dequeued" and the version number of the list is updated to V1.1. Finally, the two files marked as "dequeued" are each created as tasks and inserted into the task list of the first fragment OCR recognition engine. The task list of the second fragment OCR recognition engine has already reached the maximum bearing range, so no new tasks are created in it.
The OCR queue consuming end monitors and controls the load of the OCR recognition engines, preventing an engine from being paralyzed by overload. In addition, because the list version number of the OCR queue is verified before the queue member state and the list version number are updated, a queue member is effectively prevented from being created as a task repeatedly in several OCR recognition engines of the same consumption type.
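The consuming-end logic can be sketched as follows. This is a minimal, single-process illustration and not the embodiment's implementation: the class and function names are hypothetical, the list version is modeled as an integer counter rather than V1.0/V1.1, and removing a member from the pending list stands in for marking it "dequeued".

```python
from dataclasses import dataclass, field


@dataclass
class OcrQueue:
    pending: list = field(default_factory=list)  # files still waiting
    version: int = 0                              # list version number


@dataclass
class Engine:
    fetch_value: int  # preset number of queue members taken per round
    max_load: int     # upper end of the maximum bearing range
    tasks: list = field(default_factory=list)


def read_snapshot(queue, engine):
    """Read the list version and how many members may become tasks."""
    available = min(len(queue.pending), engine.fetch_value)
    return queue.version, available


def commit(queue, engine, seen_version, available):
    """Create tasks only if the load allows it and the list is unchanged."""
    if available == 0:
        return 0
    if len(engine.tasks) + available > engine.max_load:
        return 0  # the engine would exceed its maximum bearing range
    if seen_version != queue.version:
        return 0  # the list changed since it was read: abandon creation
    taken = queue.pending[:available]
    del queue.pending[:available]  # stands in for marking "dequeued"
    queue.version += 1             # e.g. V1.0 -> V1.1
    engine.tasks.extend(taken)
    return len(taken)


queue = OcrQueue(pending=["a.png", "b.png"])
first = Engine(fetch_value=3, max_load=5, tasks=["t1", "t2", "t3"])
seen_version, available = read_snapshot(queue, first)
print(commit(queue, first, seen_version, available), queue.version)
```

In a deployment with several consuming ends, the version comparison and the update would have to run as one atomic step (for example inside a transaction or a compare-and-set operation); the sketch only shows the order of the checks.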
To detect timeout problems in the recognition process promptly, in this embodiment the OCR recognition engines of each consumption type periodically perform the timeout detection, which ensures the normal operation of each OCR recognition engine and allows congestion problems to be found and handled in time.
This embodiment also provides a priority setting method for intelligently scheduling the files to be recognized, which is one or any combination of the following modes: 1) setting task priority: the priority of a file to be recognized is set through a management interface, or a user priority is set, in which case the priority of a file is set automatically according to the user priority when that user submits the file; 2) configuring, through a management system, the priority of the OCR queue corresponding to the OCR recognition engines of each consumption type; 3) setting priority according to the task aging time: under otherwise equal conditions, the earlier the aging time, the earlier the file is recognized; 4) setting priority according to the recognition type: within the same OCR queue, priority is determined by recognition type; for example, for an OCR recognition engine of the general consumption type, the corresponding OCR queue may contain both full-text recognition and fragment recognition, so files of the fragment type can be set to be recognized first, because fragment files are recognized quickly while full-text recognition takes a long time; 5) setting priority according to the enqueue time: the earlier the enqueue time, the higher the priority, i.e. under otherwise equal conditions, files are recognized on a first-in first-out basis.
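One way to combine modes 1), 3), 4) and 5) inside a single OCR queue is a composite sort key, sketched below. The field names, the convention that a smaller number means a higher task priority, and the order in which the modes are compared are all assumptions made for illustration; mode 2) applies across queues and is not shown.

```python
from dataclasses import dataclass


@dataclass
class PendingFile:
    name: str
    task_priority: int   # assumption: smaller number = higher priority
    aging_time: float    # task aging time, as a timestamp
    is_fragment: bool    # recognition type: fragment vs. full text
    enqueued_at: float   # enqueue time, as a timestamp


def priority_key(f):
    # Tuples compare element by element, so earlier entries dominate.
    # False sorts before True, hence "not f.is_fragment" puts fragments first.
    return (f.task_priority, f.aging_time, not f.is_fragment, f.enqueued_at)


def recognition_order(files):
    """Return the files of one OCR queue in the order they are recognized."""
    return sorted(files, key=priority_key)


files = [
    PendingFile("full.pdf", 1, 100.0, False, 1.0),
    PendingFile("frag.png", 1, 100.0, True, 2.0),
    PendingFile("vip.pdf", 0, 200.0, False, 3.0),
]
print([f.name for f in recognition_order(files)])
```

Changing the order of the tuple entries changes which mode takes precedence when several modes are combined.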
Referring to fig. 5, take as an example a system comprising an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine. The OCR recognition engines of each consumption type correspondingly create one OCR queue, and the priority of queue task distribution is set, from high to low, as: emergency queue, fragment queue, full-text queue and general queue;
step 1, a client calls the OCR application interface, uploads a file to be recognized, and specifies the type of the file and the consumption type of the OCR recognition engine that is to process it; the client can also set whether the cache file is used; the cache file is the result of the last recognition of the file to be recognized;
step 2, after receiving the request, the server judges whether the cache file is to be used; if so, it extracts the last recognition result from the cache and returns it to the client; if not, or if no cache file exists, step 3 is executed;
step 3, the best-matching OCR recognition engine for processing this type of file is determined:
judging whether the priority of the file to be recognized is urgent; if so, the emergency OCR recognition engine executes the timeout detection: if not timed out, the file enters the emergency queue, and if timed out, the file cannot be enqueued and the user is notified; if the priority is not urgent, the next judgment is executed;
judging whether the file to be recognized is a fragment file; if so, the fragment OCR recognition engine executes the timeout detection: if not timed out, the file enters the fragment queue, and if timed out, the file cannot be enqueued and the user is notified; if it is not a fragment file, the next judgment is executed;
judging whether the file to be recognized is a full-text file; if so, the full-text OCR recognition engine executes the timeout detection: if not timed out, the file enters the full-text queue, and if timed out, the file cannot be enqueued and the user is notified; if it is not a full-text file, the next judgment is executed;
the general OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the general queue, and if timed out, the file cannot be enqueued and the user is notified.
By setting queue priorities and matching the four kinds of OCR recognition engines to the consumption types of the files to be recognized, the files can be scheduled in an orderly manner, achieving high-performance, high-accuracy OCR recognition.
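The chain of judgments in step 3 can be sketched as a small routing function. The queue names, the file-type strings and the `timed_out` callback are hypothetical; the callback stands in for the timeout detection of the engines of the chosen consumption type.

```python
QUEUE_PRIORITY = ["emergency", "fragment", "full_text", "general"]


def choose_queue(is_urgent, file_type):
    """Pick the target queue by walking the judgments in priority order."""
    if is_urgent:
        return "emergency"
    if file_type == "fragment":
        return "fragment"
    if file_type == "full_text":
        return "full_text"
    return "general"


def enqueue(file_name, is_urgent, file_type, queues, timed_out):
    """Enqueue the file, or return None so the caller can notify the user."""
    target = choose_queue(is_urgent, file_type)
    if timed_out(target):
        return None  # enqueuing fails; no fallback to another queue
    queues[target].append(file_name)
    return target


queues = {name: [] for name in QUEUE_PRIORITY}
print(enqueue("invoice.png", False, "fragment", queues, lambda q: False))
print(enqueue("contract.pdf", True, "full_text", queues, lambda q: False))
```

Note that a timed-out queue rejects the file outright instead of passing it to the next queue, which matches the flow described above.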
Example two
Referring to fig. 1 and 2, a device for intelligently scheduling multiple OCR engines comprises a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file; determining, according to the type of the file, the consumption type of the OCR recognition engine that processes it, where the OCR recognition engines of each consumption type create a corresponding OCR queue and there is at least one OCR recognition engine of each consumption type; and, when a file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first executing timeout detection to judge whether the OCR recognition engines of that consumption type have timed out: if not timed out, the file enters the corresponding OCR queue; if timed out, the file cannot be enqueued and the user is notified. By determining the matching OCR recognition engine and checking for timeout before enqueuing, the advantages of the OCR recognition engines of the various consumption types are fully exploited to achieve high-performance, high-accuracy recognition, the engines can be scheduled in an orderly manner, and system errors, engine paralysis and similar failures caused by queue congestion are prevented.
The timeout detection compares the estimated time consumption of the files to be recognized for the OCR recognition engines of a given consumption type with a preset task timeout time to determine whether the OCR queue has timed out. Specifically: estimated time consumption = average time consumption of the OCR recognition engines of that consumption type × (number of files to be recognized in the list of the OCR queue + minimum task number of the OCR recognition engines of that consumption type); the estimated time consumption is then compared with the preset task timeout time to determine whether the OCR queue has timed out. Each OCR recognition engine correspondingly creates a task list, and the minimum task number is the number of tasks in the shortest task list among the task lists of the OCR recognition engines of the same consumption type.
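The timeout detection reduces to one multiplication and one comparison, as the following sketch shows. The function names and the sample figures are hypothetical, and treating "estimated time strictly greater than the task timeout" as timed out is an assumption, since the text only says the two values are compared.

```python
def estimated_time_consumption(average_time_s, queued_files, task_list_sizes):
    """average time x (files waiting in the queue + minimum task number)."""
    # Minimum task number: size of the shortest task list among the
    # engines of the same consumption type.
    minimum_tasks = min(task_list_sizes)
    return average_time_s * (queued_files + minimum_tasks)


def is_timed_out(average_time_s, queued_files, task_list_sizes, task_timeout_s):
    # Assumption: only a strictly larger estimate counts as a timeout.
    estimate = estimated_time_consumption(
        average_time_s, queued_files, task_list_sizes
    )
    return estimate > task_timeout_s


# Two engines holding 3 and 5 tasks, 2 files waiting, 0.08 s per task.
print(estimated_time_consumption(0.08, 2, [3, 5]))
print(is_timed_out(0.08, 2, [3, 5], task_timeout_s=1.0))
```

Using the shortest task list makes the estimate optimistic: it assumes the new file will be served by the least loaded engine of that consumption type.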
Referring to fig. 3, the average time consumption of the OCR recognition engines can be determined in several ways. In the first method, an initial value is preset for the OCR recognition engines of each consumption type. The recognition time consumption value of each OCR recognition engine is sampled periodically over time period A, for example by sampling, once every minute, the recognition time consumption value of each OCR recognition engine over the last 6 hours. The recognition time consumption value and the total number of recognition tasks of each OCR recognition engine within time period B are then taken and averaged to obtain the task time consumption value of each OCR recognition engine. Among the OCR recognition engines of the same consumption type, the smallest task time consumption value is taken as the minimum time consumption value of that consumption type. For example, if the time consumption values of 3 fragment OCR recognition engines within a 24-hour period are 10 seconds, 8 seconds and 15 seconds respectively, and each engine completed 100 tasks, the task time consumption values of the 3 engines are 0.1 second, 0.08 second and 0.15 second, and 0.08 second is taken as the minimum time consumption value of the fragment OCR recognition engines. Finally, the minimum time consumption value is compared with the preset initial value: if it is greater than the preset initial value, the preset initial value is used as the average time consumption of the OCR recognition engine; if it is smaller than the preset initial value, the minimum time consumption value is used as the average time consumption. Time period A is shorter than time period B, or B = KA, where K is a positive integer.
The longer the time period over which the average time consumption is taken, the more accurate the resulting value; however, repeatedly sampling recognition time consumption values over a long period slows data processing, occupies more cache resources, and affects processor performance. The second method is relatively simple and easy to implement, but the average time consumption it yields is less accurate: it is easily skewed by bursts of tasks, which make the average time consumption within the period too large.
The first average time consumption determination method can accurately reflect the time consumption situation of the current OCR recognition engine, and provides guarantee for the stable operation of the system.
Referring to fig. 4, in this embodiment the following steps are further executed after the instructions are loaded by the processor: creating an OCR queue consuming end, which continuously monitors the OCR queues, creates tasks from queue members, and distributes the tasks to the corresponding OCR recognition engines. Specifically: the fetch value of each OCR recognition engine is acquired, each fetch value being a preset value. The list of the corresponding OCR queue and the version number of that list are read, and the number of queue members that can be used to create tasks in each OCR queue is determined: if the number of files to be recognized in the OCR queue is larger than the fetch value, the number of queue members that can be used to create tasks equals the fetch value; otherwise it equals the number of files to be recognized in the OCR queue. It is then judged whether the load of the OCR recognition engine exceeds the preset maximum bearing range of that engine, where the load comprises the number of tasks in the engine's task list plus the number of queue members that can be used to create tasks. If the load exceeds the range, no queue members are created as tasks. If it does not, the list version number that was read is checked against the current list version number of the OCR queue. If the two are consistent, the state of the queue members used to create tasks and the version number of the current list are updated, for example by setting the member state to "dequeued" and changing the list version from V1.0 to V1.1; if they are inconsistent, task creation is abandoned. Finally, the queue members whose state was updated are acquired and inserted into the task list of the OCR recognition engine in the form of tasks.
For example, 2 fragment OCR recognition engines are used to recognize fragment files, and the fetch value of each fragment OCR recognition engine is 3. The list and version number of the fragment OCR queue are read: the list contains 2 files to be recognized and its version number is V1.0, so the number of queue members that can be used to create tasks in the fragment OCR queue is determined to be 2. Suppose the maximum bearing range of the fragment OCR recognition engines is set to 3-5, the task list of the first fragment OCR recognition engine contains 3 tasks, and the task list of the second contains 5 tasks. The load of the first fragment OCR recognition engine is then the 3 tasks in its task list plus the 2 queue members available for creating tasks, i.e. 5, which is within the maximum bearing range. Therefore, at update time it is checked whether the version of the current list of the OCR queue is still V1.0; if so, the two files to be recognized in the list are marked as "dequeued" and the version number of the list is updated to V1.1. Finally, the two files marked as "dequeued" are each created as tasks and inserted into the task list of the first fragment OCR recognition engine. The task list of the second fragment OCR recognition engine has already reached the maximum bearing range, so no new tasks are created in it.
The OCR queue consuming end monitors and controls the load of the OCR recognition engines, preventing an engine from being paralyzed by overload. In addition, because the list version number of the OCR queue is verified before the queue member state and the list version number are updated, a queue member is effectively prevented from being created as a task repeatedly in several OCR recognition engines of the same consumption type.
To detect timeout problems in the recognition process promptly, in this embodiment the OCR recognition engines of each consumption type periodically perform the timeout detection, which ensures the normal operation of each OCR recognition engine and allows congestion problems to be found and handled in time.
In this embodiment, after being loaded by the processor the instructions further perform a priority setting step for intelligently scheduling the files to be recognized, which is one or any combination of the following modes: 1) setting task priority: the priority of a file to be recognized is set through a management interface, or a user priority is set, in which case the priority of a file is set automatically according to the user priority when that user submits the file; 2) configuring, through a management system, the priority of the OCR queue corresponding to the OCR recognition engines of each consumption type; 3) setting priority according to the task aging time: under otherwise equal conditions, the earlier the aging time, the earlier the file is recognized; 4) setting priority according to the recognition type: within the same OCR queue, priority is determined by recognition type; for example, for an OCR recognition engine of the general consumption type, the corresponding OCR queue may contain both full-text recognition and fragment recognition, so files of the fragment type can be set to be recognized first, because fragment files are recognized quickly while full-text recognition takes a long time; 5) setting priority according to the enqueue time: the earlier the enqueue time, the higher the priority, i.e. under otherwise equal conditions, files are recognized on a first-in first-out basis.
Referring to fig. 5, take as an example a system comprising an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine. The OCR recognition engines of each consumption type correspondingly create one OCR queue, and the priority of queue task distribution is set, from high to low, as: emergency queue, fragment queue, full-text queue and general queue;
step 1, a client calls the OCR application interface, uploads a file to be recognized, and specifies the type of the file and the consumption type of the OCR recognition engine that is to process it; the client can also set whether the cache file is used; the cache file is the result of the last recognition of the file to be recognized;
step 2, after receiving the request, the server judges whether the cache file is to be used; if so, it extracts the last recognition result from the cache and returns it to the client; if not, or if no cache file exists, step 3 is executed;
step 3, the best-matching OCR recognition engine for processing this type of file is determined:
judging whether the priority of the file to be recognized is urgent; if so, the emergency OCR recognition engine executes the timeout detection: if not timed out, the file enters the emergency queue, and if timed out, the file cannot be enqueued and the user is notified; if the priority is not urgent, the next judgment is executed;
judging whether the file to be recognized is a fragment file; if so, the fragment OCR recognition engine executes the timeout detection: if not timed out, the file enters the fragment queue, and if timed out, the file cannot be enqueued and the user is notified; if it is not a fragment file, the next judgment is executed;
judging whether the file to be recognized is a full-text file; if so, the full-text OCR recognition engine executes the timeout detection: if not timed out, the file enters the full-text queue, and if timed out, the file cannot be enqueued and the user is notified; if it is not a full-text file, the next judgment is executed;
the general OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the general queue, and if timed out, the file cannot be enqueued and the user is notified.
The above description is only an embodiment of the present invention and is not intended to limit its scope; all equivalent structures and equivalent processes derived from this specification and drawings, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims (10)

1. A method for intelligently scheduling multiple OCR engines, characterized by comprising the following steps: calling an OCR application interface to upload a file to be recognized and specifying the type of the file to be recognized; determining, according to the type of the file to be recognized, the consumption type of the OCR recognition engine that processes the file, wherein the OCR engines of each consumption type create a corresponding OCR queue and the number of OCR recognition engines of each consumption type is at least one; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first executing timeout detection for judging whether the OCR recognition engine of that consumption type has timed out: comparing the estimated time consumption of the file to be recognized corresponding to the OCR recognition engine of any consumption type with a preset task timeout time; if not timed out, the file to be recognized enters the corresponding OCR queue; if timed out, the file cannot be enqueued and the user is notified; wherein the estimated time consumption is determined as: average time consumption of the OCR recognition engine of any consumption type * (number of files to be recognized in the list of the OCR queue + minimum task number of the OCR recognition engine of that consumption type). The average time consumption is determined as follows: an initial value is preset for the OCR recognition engine of each consumption type; the recognition time consumption value of each OCR recognition engine within time period A is sampled periodically, then the recognition time consumption value and the total number of recognition tasks of each OCR recognition engine within time period B are taken and averaged to obtain the task time consumption value of each OCR recognition engine. Among the OCR recognition engines of the same consumption type, the smallest task time consumption value is taken as the minimum time consumption value of the OCR recognition engine of that consumption type. Finally, the minimum time consumption value is compared with the preset initial value: if it is greater than the preset initial value, the preset initial value is taken as the average time consumption of the OCR recognition engine, and if it is smaller than the preset initial value, the minimum time consumption value is taken as the average time consumption of the OCR recognition engine; time period A is shorter than time period B. Each OCR recognition engine correspondingly creates a task list, and the minimum task number is taken from the task list with the fewest tasks among the task lists corresponding to the OCR recognition engines of the same consumption type.
2. The method for intelligently scheduling multiple OCR engines according to claim 1, characterized in that: an OCR queue consuming end is further created, which continuously monitors the OCR queues, continuously creates queue members as tasks, and distributes them to the corresponding OCR recognition engines:
acquiring the fetch value of each OCR recognition engine, each fetch value being a preset value;
reading the list of the OCR queue corresponding to the OCR recognition engine of each consumption type and the version number of the list of the OCR queue, and determining the number of queue members that can be used to create tasks in each OCR queue: if the number of files to be recognized in the OCR queue is greater than the fetch value, the number of queue members that can be used to create tasks equals the fetch value; otherwise, the number of queue members that can be used to create tasks equals the number of files to be recognized in the OCR queue;
judging whether the load of the OCR recognition engine exceeds the preset maximum bearing range of the OCR recognition engine, the load of the OCR recognition engine comprising the number of tasks in the task list of the OCR recognition engine and the number of queue members that can be used to create tasks; if it exceeds the range, queue members are no longer created as tasks; if it does not, first verifying whether the list version number of the OCR queue that was read is consistent with the version number of the current list of the OCR queue; if consistent, updating the state of the queue members used to create tasks and the version number of the current list of the OCR queue; if inconsistent, abandoning task creation;
acquiring the queue members whose state was updated and inserting them into the task list of the OCR recognition engine in the form of tasks.
3. The method for intelligently scheduling multiple OCR engines according to claim 1, characterized in that: the OCR recognition engines of each consumption type periodically execute the timeout detection so that congestion problems are discovered in time.
4. The method for intelligently scheduling multiple OCR engines according to claim 1, characterized in that: the method further comprises a priority setting step for intelligently scheduling the files to be recognized, which is one or any combination of the following setting modes: 1) setting task priority: the priority of a file to be recognized is set through a management interface, or a user priority is set, and when that user submits a file to be recognized, the priority of the file is marked automatically according to the user priority; 2) configuring, through a management system, the priority of the OCR queue corresponding to the OCR recognition engine of each consumption type; 3) setting priority according to the task aging time: within the same OCR queue, the earlier the aging time, the earlier the file is recognized; 4) setting priority according to the recognition type: within the same OCR queue, priority is determined according to the type of the file to be recognized; 5) setting priority according to the enqueue time: the earlier the enqueue time, the higher the priority, and within the same OCR queue files are recognized on a first-in first-out basis; when several setting modes are combined, the priority of each mode is set.
5. The method for intelligently scheduling multiple OCR engines according to claim 1, characterized in that: the consumption types of the OCR recognition engines include emergency, fragment, full-text and general, specifically: an emergency OCR recognition engine, a fragment OCR recognition engine, a full-text OCR recognition engine and a general OCR recognition engine; the OCR recognition engines of each consumption type correspondingly create an OCR queue, including an emergency queue, a fragment queue, a full-text queue and a general queue, and the priority of the OCR queues is set, from high to low, as: emergency queue, fragment queue, full-text queue and general queue;
calling the OCR application interface and specifying the type of the file to be recognized and the consumption type of the OCR recognition engine used to process the file to be recognized;
determining the best-matching OCR recognition engine for processing this type of file:
judging whether the priority of the file to be recognized is urgent; if so, the emergency OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the emergency queue, and if timed out, the file cannot be enqueued and the user is notified; if not, the next judgment is executed;
judging whether the file to be recognized is a fragment file; if so, the fragment OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the fragment queue, and if timed out, the file cannot be enqueued and the user is notified; if not, the next judgment is executed;
judging whether the file to be recognized is a full-text file; if so, the full-text OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the full-text queue, and if timed out, the file cannot be enqueued and the user is notified; if not, the next judgment is executed;
the general OCR recognition engine executes the timeout detection: if not timed out, the file to be recognized enters the general queue, and if timed out, the file cannot be enqueued and the user is notified.
6. A device for intelligently scheduling multiple OCR engines, characterized by comprising a memory and a processor, the memory storing instructions adapted to be loaded by the processor to perform the following steps:
calling an OCR application interface to upload a file to be recognized and specifying the type of the file to be recognized; determining, according to the type of the file to be recognized, the consumption type of the OCR recognition engine that processes the file, wherein the OCR engines of each consumption type create a corresponding OCR queue and the number of OCR recognition engines of each consumption type is at least one; when the file to be recognized requests to enter the OCR queue of the corresponding OCR recognition engine, first executing timeout detection for judging whether the OCR recognition engine of that consumption type has timed out; if not timed out, the file to be recognized enters the corresponding OCR queue; if timed out, the file cannot be enqueued and the user is notified;
wherein the estimated time consumption is determined as: average time consumption of the OCR recognition engine of any consumption type * (number of files to be recognized in the list of the OCR queue + minimum task number of the OCR recognition engine of that consumption type). The average time consumption is determined as follows: an initial value is preset for the OCR recognition engine of each consumption type; the recognition time consumption value of each OCR recognition engine within time period A is sampled periodically, then the recognition time consumption value and the total number of recognition tasks of each OCR recognition engine within time period B are taken and averaged to obtain the task time consumption value of each OCR recognition engine. Among the OCR recognition engines of the same consumption type, the smallest task time consumption value is taken as the minimum time consumption value of the OCR recognition engine of that consumption type.
Finally, the minimum time consumption is compared with the preset initial value, if it is greater than the preset initial value, the preset initial value is used as the average time consumption of the OCR recognition engine, and if it is less than the preset initial value, the minimum time consumption As the average time consumption of the OCR recognition engine, the A time period is less than the B time period; each of the OCR recognition engines correspondingly creates a task list, and the minimum number of tasks takes the value of each OCR of the same consumption type In the task list corresponding to the recognition engine, the task list with the least number of tasks. 7.根据权利要求6所述的一种智能调度多OCR引擎的设备,其特征在于,7. the equipment of a kind of intelligent scheduling multiple OCR engine according to claim 6, is characterized in that, 所述指令由处理器加载后还执行如下步骤:创建一OCR队列消费端,其不断检测OCR队列并将队列成员不断创建成任务,然后分发给对应的OCR识别引擎:After the instruction is loaded by the processor, the following steps are also executed: an OCR queue consumer is created, which continuously detects the OCR queue and continuously creates the queue members into tasks, and then distributes them to the corresponding OCR recognition engine: 获取各OCR识别引擎的取数值,各取数值为预设值;Obtain the value of each OCR recognition engine, and each value is a preset value; 读取各消费类型的OCR识别引擎对应的OCR队列的列表以及OCR队列的列表的版本号,确定各OCR队列中可用于创建任务的队列成员数量:若OCR队列中的待识别文件数量大于取数值,则可用于创建任务的队列成员数量等于取数值,反之,可用于创建任务的队列成员数量等于OCR队列中的待识别文件数量;Read the list of OCR queues corresponding to the OCR recognition engine of each consumption type and the version number of the list of OCR queues, and determine the number of queue members that can be used to create tasks in each OCR queue: If the number of files to be recognized in the OCR queue is greater than the value , the number of queue members that can be used to create tasks is equal to the value, otherwise, the number of queue members that can be used to create tasks is equal to the number of files to be identified in the OCR queue; 
判断OCR识别引擎的负载是否超过该OCR识别引擎预设的最大承受范围,所述OCR识别引擎的负载包括OCR识别引擎的任务列表中的任务数和可用于创建任务的队列成员数量,若超过,则不再将队列成员创建为任务,若未超过,先核验读取到的OCR队列的列表版本号与当前OCR队列的列表的版本号是否一致,若一致,更新用于创建任务的队列成员的状态和当前OCR队列的列表的版本号,若不一致,则放弃创建任务;Whether the load of judging the OCR recognition engine exceeds the preset maximum bearing range of the OCR recognition engine, the load of the OCR recognition engine comprises the number of tasks in the task list of the OCR recognition engine and the number of queue members that can be used to create a task, if it exceeds, Then the queue members will no longer be created as tasks. If not exceeded, first check whether the version number of the read OCR queue list is consistent with the version number of the current OCR queue list. If they are consistent, update the queue member used to create the task. The status and the version number of the current OCR queue list, if they are inconsistent, the creation task will be abandoned; 获取更新状态的队列成员,以任务的形式插入OCR识别引擎的任务列表。Get the queue members with updated status and insert them into the task list of the OCR recognition engine in the form of tasks. 8.根据权利要求6所述的一种智能调度多OCR引擎的设备,其特征在于:各消费类型的OCR识别引擎周期性地执行所述超时检测,及时发现拥挤问题。8 . The device for intelligently scheduling multiple OCR engines according to claim 6 , wherein the OCR identification engine of each consumption type periodically performs the timeout detection, so as to discover the congestion problem in time. 9 . 9.根据权利要求6所述的一种智能调度多OCR引擎的设备,其特征在于:所述指令由处理器加载后还执行用于智能调度待识别文件的优先级设置步骤,其为以下几种设置方式中的一种或任意方式的组合:1)设置任务优先级:通过管理界面设置待识别文件的优先级,或设置用户优先级,该用户提交待识别文件时,根据用户优先级自动标记该待识别文件的优先级;2)通过管理系统配置各消费类型的OCR识别引擎对应的OCR队列优先级;3)根据任务时效时间设定优先级,在同一OCR队列中,时效时间越早的优先识别;4)根据识别类型设定优先级:在同一OCR队列中,根据待识别文件的类型确定优先级;5)根据入队时间设定优先级:入队时间越早优先级越高,在同一OCR队列中,按先入先出原则进行识别;采用上述多种方式组合设置时,同时设置各方式的优先级。9. 
The equipment of a kind of intelligent scheduling multi-OCR engine according to claim 6, it is characterized in that: after described instruction is loaded by processor, also execute the priority setting step for intelligently scheduling to-be-recognized file, it is following several One or any combination of these setting methods: 1) Set task priority: set the priority of the file to be recognized through the management interface, or set the priority of the user, when the user submits the file to be recognized, automatically according to the user priority Mark the priority of the file to be recognized; 2) Configure the OCR queue priority corresponding to the OCR recognition engine of each consumption type through the management system; 3) Set the priority according to the task aging time. In the same OCR queue, the aging time is earlier. 4) Set the priority according to the recognition type: in the same OCR queue, determine the priority according to the type of the file to be identified; 5) Set the priority according to the queue entry time: the earlier the queue entry time, the higher the priority , in the same OCR queue, the identification is carried out according to the principle of first-in, first-out; when the combination of the above methods is used, the priority of each method is set at the same time. 10.根据权利要求6所述的一种智能调度多OCR引擎的设备,其特征在于:所述OCR识别引擎的消费类型包括紧急、碎片、全文、通用,具体为:紧急OCR识别引擎、碎片OCR识别引擎、全文OCR识别引擎和通用OCR识别引擎;各所述消费类型的OCR识别引擎对应创建一OCR队列,包括紧急队列、碎片队列、全文队列以及通用队列,设置各OCR队列的优先级从高到低为:紧急队列、碎片队列、全文队列以及通用队列;所述处理器执行以下步骤:10. 
The device for intelligently scheduling multiple OCR engines according to claim 6, wherein the consumption types of the OCR recognition engine include emergency, fragmentation, full text, and general, specifically: emergency OCR recognition engine, fragmented OCR Recognition engine, full-text OCR recognition engine and general OCR recognition engine; OCR recognition engines of each described consumption type create an OCR queue correspondingly, including emergency queue, fragment queue, full-text queue and general queue, and set the priority of each OCR queue from high To low are: emergency queue, fragmentation queue, full-text queue and general queue; the processor performs the following steps: 调用OCR应用接口,指定待识别文件的类型和用于处理该待识别文件的OCR识别引擎的消费类型;Call the OCR application interface to specify the type of the file to be recognized and the consumption type of the OCR recognition engine used to process the file to be recognized; 确定处理该类型文件的最匹配的OCR识别引擎,具体为:Determine the best matching OCR recognition engine for processing this type of file, specifically: 判断待识别文件优先级是否为紧急,若是,紧急OCR识别引擎执行所述超时检测,未超时,则待识别文件进入紧急队列,若超时,无法入队并通知用户;若否,执行下一判断;Determine whether the priority of the file to be recognized is urgent. If so, the emergency OCR recognition engine executes the timeout detection. If it does not time out, the file to be recognized enters the emergency queue. If it times out, it cannot enter the queue and notify the user; if not, execute the next judgment ; 判断待识别文件为碎片文件,若是,碎片OCR识别引擎执行所述超时检测,未超时,则待识别文件进入碎片队列,若超时,无法入队并通知用户;若否,执行下一判断;It is judged that the file to be recognized is a fragmented file. If it is, the fragment OCR recognition engine performs the timeout detection. If it does not time out, the file to be recognized enters the fragmented queue. If it times out, it cannot enter the queue and notify the user; if not, execute the next judgment; 判断待识别文件为全文文件,若是,全文OCR识别引擎执行所述超时检测,未超时,则待识别文件进入全文队列,若超时,无法入队并通知用户;若否,执行下一判断;It is judged that the file to be recognized is a full-text file. 
If yes, the full-text OCR recognition engine executes the timeout detection. If the timeout is not exceeded, the file to be recognized enters the full-text queue. If it times out, it cannot be queued and the user is notified; if not, the next judgment is performed; 所述通用OCR识别引擎执行超时检测,若未超时,待识别文件进入通用队列,若超时,无法入队并通知用户。The general OCR recognition engine performs time-out detection. If the time-out is not exceeded, the file to be recognized enters the general-purpose queue.
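The estimated-time and average-time rules of claim 6 can be illustrated with a minimal Python sketch. Everything named here (`Engine`, `samples`, `average_time`, `estimated_time`, `has_timed_out`, `timeout_threshold`) is hypothetical and not taken from the patent. In particular, comparing the estimated time against a configured threshold is an assumption about how the timeout detection uses the estimate, and each sample is assumed to be an (elapsed seconds, tasks completed) pair collected once per period A and retained for period B.

```python
from dataclasses import dataclass, field


@dataclass
class Engine:
    """One OCR recognition engine of a given consumption type (hypothetical model)."""
    task_list: list = field(default_factory=list)  # tasks already dispatched to this engine
    # Assumed sample shape: (elapsed_seconds, tasks_completed), one per period A,
    # kept for the longer period B.
    samples: list = field(default_factory=list)

    def per_task_time(self):
        """Mean time per task over period B, or None if nothing was measured."""
        total_tasks = sum(n for _, n in self.samples)
        if total_tasks == 0:
            return None
        return sum(t for t, _ in self.samples) / total_tasks


def average_time(engines, preset_initial):
    """Average time for one consumption type.

    Takes the fastest engine's per-task time and caps it at the preset
    initial value, as the claim describes.
    """
    measured = [t for t in (e.per_task_time() for e in engines) if t is not None]
    if not measured:
        return preset_initial  # assumption: fall back to the preset before any sample exists
    return min(preset_initial, min(measured))


def estimated_time(engines, queue_length, preset_initial):
    """average time x (files waiting in the queue + shortest engine task list)."""
    min_tasks = min(len(e.task_list) for e in engines)
    return average_time(engines, preset_initial) * (queue_length + min_tasks)


def has_timed_out(engines, queue_length, preset_initial, timeout_threshold):
    """Assumed decision rule: reject when the estimate exceeds the threshold."""
    return estimated_time(engines, queue_length, preset_initial) > timeout_threshold
```

With two engines whose per-task times are 3.0 s and 2.0 s, a preset of 2.5 s, four queued files, and a shortest task list of one task, the sketch estimates 2.0 × (4 + 1) = 10.0 s.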
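The routing cascade of claims 5 and 10 (urgent first, then fragment, then full-text, then general) might be sketched in Python as below. The names `ConsumptionType`, `select_queue`, and `submit`, the string values for priority and file type, and the `timed_out` callback are all illustrative assumptions; the claims do not specify an interface.

```python
from enum import Enum


class ConsumptionType(Enum):
    # Declared in queue-priority order, highest first.
    EMERGENCY = "emergency"
    FRAGMENT = "fragment"
    FULL_TEXT = "full_text"
    GENERAL = "general"


def select_queue(priority, file_type):
    """Walk the determinations in the order given by the claims."""
    if priority == "urgent":
        return ConsumptionType.EMERGENCY
    if file_type == "fragment":
        return ConsumptionType.FRAGMENT
    if file_type == "full_text":
        return ConsumptionType.FULL_TEXT
    return ConsumptionType.GENERAL


def submit(file_id, priority, file_type, queues, timed_out):
    """Enqueue the file, or return None when the target engines have timed out.

    `timed_out` is a hypothetical callable taking a ConsumptionType and
    returning True when the timeout detection fails for that type.
    Returning None stands in for "cannot be enqueued, notify the user".
    """
    target = select_queue(priority, file_type)
    if timed_out(target):
        return None
    queues[target].append(file_id)
    return target
```

Note that an urgent file goes to the emergency queue regardless of its file type, because the urgency check comes first in the cascade.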
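The consumer steps of claim 7 resemble an optimistic-concurrency check on the queue list's version number. The following Python sketch shows one way this could look, under stated assumptions: `OcrQueue`, `Engine`, `read_snapshot`, and `commit` are hypothetical names, "updating the status of queue members" is simplified to removing them from the pending list, and a real implementation would need the version check and update to be atomic (for example in a database or a lock), which this single-threaded sketch does not provide.

```python
from dataclasses import dataclass, field


@dataclass
class OcrQueue:
    pending: list = field(default_factory=list)  # files waiting to be recognized
    version: int = 0                             # version number of the queue list


@dataclass
class Engine:
    fetch_count: int  # preset number of members taken per round
    max_load: int     # preset maximum capacity
    task_list: list = field(default_factory=list)


def read_snapshot(queue, engine):
    """Read the list version and decide how many members may become tasks."""
    batch_size = min(len(queue.pending), engine.fetch_count)
    return queue.version, batch_size


def commit(queue, engine, seen_version, batch_size):
    """Create tasks from the snapshot; return the number of tasks created."""
    if batch_size == 0:
        return 0
    # Load = tasks already on the engine + members about to become tasks.
    if len(engine.task_list) + batch_size > engine.max_load:
        return 0
    # Another consumer changed the list since it was read: abandon creation.
    if seen_version != queue.version:
        return 0
    batch = queue.pending[:batch_size]
    del queue.pending[:batch_size]   # simplified "status update" of the members
    queue.version += 1               # update the list's version number
    engine.task_list.extend(batch)   # insert the members as tasks
    return batch_size
```

Splitting the work into `read_snapshot` and `commit` makes the stale-version case visible: if the version changes between the two calls, `commit` creates no tasks and leaves the queue untouched.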
CN201811615258.7A 2018-12-27 2018-12-27 Method and device for intelligent scheduling of multi-OCR recognition engines Expired - Fee Related CN109656733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811615258.7A CN109656733B (en) 2018-12-27 2018-12-27 Method and device for intelligent scheduling of multi-OCR recognition engines

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811615258.7A CN109656733B (en) 2018-12-27 2018-12-27 Method and device for intelligent scheduling of multi-OCR recognition engines

Publications (2)

Publication Number Publication Date
CN109656733A CN109656733A (en) 2019-04-19
CN109656733B true CN109656733B (en) 2021-03-12

Family

ID=66117273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811615258.7A Expired - Fee Related CN109656733B (en) 2018-12-27 2018-12-27 Method and device for intelligent scheduling of multi-OCR recognition engines

Country Status (1)

Country Link
CN (1) CN109656733B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689019B (en) * 2019-09-27 2022-05-24 中国银行股份有限公司 OCR recognition model determining method and device
CN112820390B (en) * 2019-11-15 2024-11-26 深圳迈瑞生物医疗电子股份有限公司 A priority setting method, a testing method and a sample analysis system
CN112348022B (en) * 2020-10-28 2024-05-07 富邦华一银行有限公司 Free-form document identification method based on deep learning
CN113239921A (en) * 2021-05-10 2021-08-10 上海交大慧谷通用技术有限公司 Task grading and distributing method and system for OCR (optical character recognition) service
CN116189210A (en) * 2023-04-23 2023-05-30 福昕鲲鹏(北京)信息科技有限公司 Image OCR (optical character recognition) method, electronic equipment and storage medium
CN118014072B (en) * 2024-04-10 2024-08-16 中国电建集团昆明勘测设计研究院有限公司 Construction method and system of knowledge graph for hydraulic and hydroelectric engineering

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150213B (en) * 2011-12-06 2016-06-22 北大方正集团有限公司 Balancing method of loads and device
CN102780773B (en) * 2012-07-16 2015-01-07 西安电子科技大学 Method for keeping cache consistency in network using content as center
CN106097088A (en) * 2016-06-07 2016-11-09 中国建设银行股份有限公司 The processing method of accounting data and system
CN106569891B (en) * 2016-11-09 2021-01-29 苏州浪潮智能科技有限公司 Method and device for scheduling and executing tasks in storage system

Also Published As

Publication number Publication date
CN109656733A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109656733B (en) Method and device for intelligent scheduling of multi-OCR recognition engines
CN108776934B (en) Distributed data calculation method and device, computer equipment and readable storage medium
WO2020228177A1 (en) Batch data processing method and apparatus, computer device and storage medium
CN106557369B (en) Multithreading management method and system
CN112328399A (en) Cluster resource scheduling method and device, computer equipment and storage medium
US9558227B2 (en) Reducing lock occurrences in server/database systems
CN113835985B (en) Method, device and equipment for monitoring and analyzing jamming reason
US20030204552A1 (en) IO completion architecture for user-mode networking
CN111061556A (en) Optimization method and device for executing priority task, computer equipment and medium
WO2020172852A1 (en) Computing resource scheduling method, scheduler, internet of things system, and computer readable medium
CN110119307B (en) Data processing request processing method and device, storage medium and electronic device
CN110912949B (en) Method and device for submitting sites
CN113391911B (en) Dynamic scheduling method, device and equipment for big data resources
CN104991816A (en) Process scheduling method and apparatus
Banerjee et al. Heavy traffic scaling limits for shortest remaining processing time queues with heavy tailed processing time distributions
CN111240864A (en) Asynchronous task processing method, device, equipment and computer readable storage medium
CN103442087B (en) A kind of Web service system visit capacity based on response time trend analysis controls apparatus and method
CN110837509A (en) Method, device, equipment and storage medium for scheduling dependence
CN113806102B (en) Message queue processing method, device and computing device
WO2019029721A1 (en) Task scheduling method, apparatus and device, and storage medium
CN112395054B (en) Thread scheduling method, device and system
EP2477112A1 (en) Method for efficient scheduling in a resource-sharing system
CN114443241A (en) A task dynamic scheduling method, task delivery method and device thereof
CN112988417A (en) Message processing method and device, electronic equipment and computer readable medium
CN111597056A (en) Distributed scheduling method, system, storage medium and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210312