CN117112872A - Government affairs text archiving method and system based on semi-supervised learning
- Publication number
- CN117112872A (application number CN202311360019.2A)
- Authority
- CN
- China
- Prior art keywords
- text
- archiving
- semi
- materials
- government
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/93—Document management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/383—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/20—Software design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/31—Programming languages or programming paradigms
- G06F8/315—Object-oriented languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/00127—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture
- H04N1/00326—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a data reading, recognizing or recording apparatus, e.g. with a bar-code apparatus
- H04N1/00328—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a data reading, recognizing or recording apparatus, e.g. with a bar-code apparatus with an apparatus processing optically-read information
- H04N1/00331—Connection or combination of a still picture apparatus with another apparatus, e.g. for storage, processing or transmission of still picture signals or of information associated with a still picture with a data reading, recognizing or recording apparatus, e.g. with a bar-code apparatus with an apparatus processing optically-read information with an apparatus performing optical character recognition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N1/00—Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
- H04N1/21—Intermediate information storage
- H04N1/2166—Intermediate information storage for mass storage, e.g. in document filing systems
Description
Technical Field
The invention relates to the technical fields of machine learning and smart government affairs, and specifically to a government affairs text archiving method and system based on semi-supervised learning.
Background
With the development of e-government, automated workflows are increasingly common in government systems, and paperless office work has become an inevitable trend. However, a large volume of paper materials submitted during business handling still has to be digitized and archived. At present this is mostly done by manual entry, but excessive manual intervention makes archive classification inconsistent, and manual archiving is inefficient.
Summary of the Invention
The technical task of the invention is to address the above shortcomings by providing a government affairs text archiving method and system based on semi-supervised learning that improves the accuracy of recognizing and classifying government affairs texts, reduces the uncertainty in recognition and archiving caused by excessive manual intervention, and thereby raises the level of business service. It also helps digitize information materials and greatly facilitates the management of government data.
The technical solution adopted by the invention to solve this technical problem is as follows:
A government affairs text archiving method based on semi-supervised learning comprises a manual entry stage and an automatic archiving stage. In the manual entry stage, during business handling, the handling staff scan the materials and enter or select a label, and the background program stores the scanned materials under the designated path according to the label, completing text archiving. In the automatic archiving stage, the handling staff scan the materials, the automatic verification module determines the label to which each material belongs, and the background program stores the scanned materials under the designated path according to that label.
The automatic archiving stage starts a self-learning mechanism: first the feature vectors of the text materials under each label are extracted, then the relationship between labels and text-material feature vectors is established, and finally the relationship table is updated into the automatic verification module. This cycle repeats, achieving dynamic management of labels.
The text materials come from manually labeled texts and automatically labeled texts. Manually labeled texts have an exact correspondence between text material and label, whereas automatically labeled texts are uncertain, so a penalty mechanism is applied to automatically labeled texts to limit the noisy text information that enters the trainer. After training, the resulting model is updated into the automatic verification module, and the cycle repeats to achieve continuous model optimization.
This method replaces manual entry and archiving with automatic archiving, making government services far more convenient. In application scenarios it improves the accuracy of recognizing and classifying government affairs texts and reduces the uncertainty in recognition and archiving caused by excessive manual intervention, thereby raising the level of business service. A well-developed automatic archiving method also helps digitize information materials and greatly facilitates the management of government data.
Preferably, the automatic verification module processes the material information obtained in real time: for textual materials it extracts character strings as one basis for classifying the text, and then converts the set of strings that describe the text's features into a feature vector.
Further, because the extracted strings contain many worthless characters, a designed strategy is needed to extract multiple characters that can describe the features of the text.
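The string-to-vector step can be illustrated with a minimal sketch. The patent does not specify the filtering strategy or the vectorization, so everything here is an assumption: a small stop-word list stands in for the "designed strategy", and feature hashing stands in for the vectorizer. The names `STOPWORDS`, `DIM`, `extract_tokens`, and `to_feature_vector` are hypothetical, and real Chinese text would additionally need a word segmenter.

```python
import hashlib
import math
import re

DIM = 64  # assumed vector size, not taken from the patent
STOPWORDS = {"the", "of", "and", "no", "page"}  # hypothetical list of worthless tokens


def extract_tokens(ocr_text: str) -> list[str]:
    """Keep word tokens that are not stop words, pure digits, or single characters."""
    tokens = re.findall(r"[^\W_]+", ocr_text.lower())
    return [t for t in tokens
            if t not in STOPWORDS and not t.isdigit() and len(t) > 1]


def to_feature_vector(tokens: list[str], dim: int = DIM) -> list[float]:
    """Hash each token into one of `dim` buckets and L2-normalize the counts."""
    vec = [0.0] * dim
    for token in tokens:
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


tokens = extract_tokens("Business License No. 12345 of the Company")
vector = to_feature_vector(tokens)
```

Hashing keeps the vector size fixed regardless of vocabulary, which is convenient when new labels and documents keep arriving; a learned embedding could be substituted without changing the surrounding flow.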
Further, a label set corresponding to the directories to be archived is created in advance, and a feature vector is pre-built for each label, forming the feature vector set of the labels.
The feature vector generated from the newly entered material is compared one by one against the label feature vector set by correlation analysis, yielding a group of strongly correlated labels. To reach a relatively high accuracy, the label with the strongest correlation can be obtained by controlling the correlation degree. If no label satisfies the condition, the label can be created, or the material can be uniformly assigned to "other".
The archive information repository is then queried with the obtained label to obtain the archive information.
Finally, the back-end program performs the archiving operation according to the archive information.
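The matching and lookup steps above can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is used as the correlation measure (the patent does not name one), the threshold value 0.8 is arbitrary, and `ARCHIVE_INFO`, `match_label`, and `archive_path` are hypothetical names standing in for the archive information repository and its query.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_label(doc_vec: list[float],
                label_vectors: dict[str, list[float]],
                threshold: float = 0.8,
                fallback: str = "other") -> str:
    """Return the most correlated label, or `fallback` if none reaches the threshold."""
    scores = {label: cosine(doc_vec, vec) for label, vec in label_vectors.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return fallback
    return best


# Hypothetical archive information repository: label -> target directory.
ARCHIVE_INFO = {
    "contract": "/archive/contracts",
    "id_card": "/archive/id_cards",
    "other": "/archive/other",
}


def archive_path(label: str) -> str:
    """Look up where material with this label should be stored."""
    return ARCHIVE_INFO.get(label, ARCHIVE_INFO["other"])


label_vectors = {"contract": [1.0, 0.0, 0.0], "id_card": [0.0, 1.0, 0.0]}
label = match_label([0.9, 0.1, 0.0], label_vectors)
target = archive_path(label)
```

Raising the threshold trades coverage for precision: more materials fall through to "other" and wait for a human to label them or create a new label.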
Preferably, the business implemented by the method mainly comprises data collection, information entry, archive management and data storage management, and archives texts including licenses and certificates, contracts, powers of attorney, policies and regulations, and supporting documents, wherein:
Data collection obtains government affairs texts by means including document cameras, scanners, electronic materials, and photographs taken with portable devices;
Information entry includes a manual review entry mode and an automatic review entry mode;
Archive management covers creating a new archive directory or managing an existing one, depending on whether an archive directory already exists;
Data management includes data storage, ER indexing, data query and data deletion, so that the data can be fully reused by other services.
Preferably, the method is implemented with a task scheduling module, a business processing module, a data management module and an AI service module, wherein:
The task scheduling module acts as the Controller (main control) and coordinates the operation of the other modules, including turning the automatic archiving mode and the self-learning mode on or off;
The business processing module handles the tasks that arise during business handling, including entering or selecting labels, scanning materials, and storage operations;
The data management module is responsible for creating, deleting, updating and querying data and for coordinating data resources;
The AI service module is responsible for intelligent computing services, including text recognition and strategy judgment.
Preferably, the specific implementation of the method comprises a manual entry stage, a semi-supervised learning stage and an unsupervised learning stage.
In the manual entry stage, material images are entered in the conventional way, gradually accumulating a large number of valid, labeled government affairs text images.
In the semi-supervised learning stage, manual entry continues while the self-learning model is switched on, making full use of the accumulated sample images for classification learning and gradually improving recognition accuracy; that is, manual entry and self-learning run at the same time. Once enough samples have accumulated, the self-learning function is turned on to assist manual entry.
In the unsupervised learning stage, the system already has autonomous learning capability and high accuracy, so manual entry and archiving are no longer needed at all; the applicant only needs to submit the materials for the text image materials to be archived automatically.
The invention also claims a government affairs text archiving system based on semi-supervised learning, which implements the method described above. The system comprises interactive clients, an application server cluster, an AI server cluster, various data and database systems, and components that round out its functionality.
The interactive clients include the service hall, mobile clients, Web clients and an administrator client; they provide users with functions such as information entry and lookup, and provide administrators with operation and maintenance functions.
The application server cluster implements the basic functions of the system, and the AI server cluster provides AI computing services for the system. A configuration center is used to define custom archiving tasks and to control the system's operating parameters.
A database service is deployed to provide data storage and create, delete, update and query operations, while a message queue service and a cache service are deployed to enhance system stability.
Preferably, all clients connect to the gateway cluster through Nginx plus a firewall to ensure the security of system information.
Mobile devices and Web clients of individual customers access the system from the public cloud through the firewall to complete business handling; the service hall, internal private devices and the operation and maintenance client access the system from the private cloud.
Preferably, the back-end servers include an application server (App Server), an AI server (AI Server) and a database server (DB Server), which run the application programming interface service (API Service), the AI service (AI Service) and database (DB) operations respectively, according to the type of task.
Compared with the prior art, the government affairs text archiving method and system based on semi-supervised learning of the invention has the following beneficial effects:
1. It makes full use of the existing service workflow to implement a semi-supervised government affairs text archiving scheme;
2. The self-learning process based on labeled materials improves the accuracy and efficiency of the algorithm;
3. The penalty mechanism and threshold control (correlation control) enhance the robustness and stability of the algorithm;
4. Implementing this technical solution greatly reduces labor and material costs and improves the efficiency of business handling;
5. The system is designed and developed in a modular way, uses few computing resources, and is simple to deploy and convenient to use.
Brief Description of the Drawings
Figure 1 is a flowchart of the implementation of the government affairs text archiving method based on semi-supervised learning provided by an embodiment of the invention;
Figure 2 is a business logic diagram of the government affairs text archiving method based on semi-supervised learning provided by an embodiment of the invention;
Figure 3 is a business implementation flowchart of the government affairs text archiving method based on semi-supervised learning provided by an embodiment of the invention;
Figure 4 is a flowchart of the self-learning process provided by an embodiment of the invention;
Figure 5 is a diagram of the system's functional modules provided by an embodiment of the invention;
Figure 6 is a deployment diagram of the government affairs text archiving system based on semi-supervised learning provided by an embodiment of the invention;
Figure 7 is a network architecture diagram of the government affairs text archiving system based on semi-supervised learning provided by an embodiment of the invention.
Detailed Description
An embodiment of the invention provides a government affairs text archiving method based on semi-supervised learning, comprising a manual entry stage and an automatic archiving stage. As shown in Figure 3, in the manual entry stage, during business handling, the handling staff scan the materials and enter or select a label, and the background program stores the scanned materials under the designated path according to the label, completing text archiving; this is also the current way of handling business. In the automatic archiving stage, the handling staff scan the materials, the automatic verification module determines the label to which each material belongs, and the background program stores the scanned materials under the designated path according to that label.
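The storage step shared by both stages (file the scan under a path determined by its label) might look like the sketch below. The directory layout, one subdirectory per label under a common root, is an assumption, and `archive_by_label` is a hypothetical name; the patent only says the material is stored "under the designated path according to the label".

```python
import shutil
from pathlib import Path


def archive_by_label(scan_file: Path, label: str, archive_root: Path) -> Path:
    """Copy a scanned file into <archive_root>/<label>/ and return the new path."""
    target_dir = archive_root / label
    # Create the label directory on first use; this also covers newly created labels.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / scan_file.name
    shutil.copy2(scan_file, target)  # copy2 also preserves file timestamps
    return target
```

In the manual entry stage the `label` argument would come from the handling staff; in the automatic archiving stage it would come from the automatic verification module. A production version would also need to validate the label so that it cannot escape the archive root.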
The automatic archiving stage starts a self-learning mechanism: first the feature vectors of the text materials under each label are extracted, then the relationship between labels and text-material feature vectors is established, and finally the relationship table is updated into the automatic verification module. This cycle repeats, achieving dynamic management of labels and thus self-learning.
The text materials come from manually labeled texts and automatically labeled texts. Manually labeled texts have an exact correspondence between text material and label, whereas automatically labeled texts are uncertain, so a penalty mechanism is applied to automatically labeled texts to limit the noisy text information that enters the trainer. After training, the resulting model is updated into the automatic verification module, and the cycle repeats to achieve continuous model optimization. See Figure 4.
The penalty mechanism is a mechanism commonly used in machine learning to control how the regularization process adjusts the error, and is usually implemented through a penalty function and its coefficient. This document does not restrict the specific penalty mechanism, but the goal is the same in every case: to strengthen the fitting ability of the machine learning model so that it makes more accurate inferences.
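Because the patent leaves the penalty mechanism open, the following is only one possible reading, not the patented method: automatically labeled samples are down-weighted by a penalty coefficient times their confidence, and dropped entirely below a confidence floor, when the per-label feature vectors (the "relationship table") are recomputed. The names `Sample` and `update_label_vectors` and the values `penalty=0.5` and `min_confidence=0.6` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    vector: list[float]
    label: str
    manual: bool              # True if a person assigned the label
    confidence: float = 1.0   # classifier confidence for automatic labels


def update_label_vectors(samples: list[Sample],
                         penalty: float = 0.5,
                         min_confidence: float = 0.6) -> dict[str, list[float]]:
    """Recompute each label's vector as a weighted mean of its samples."""
    sums: dict[str, list[float]] = {}
    weights: dict[str, float] = {}
    for s in samples:
        if s.manual:
            w = 1.0                       # manual labels are trusted fully
        elif s.confidence >= min_confidence:
            w = penalty * s.confidence    # automatic labels are penalized
        else:
            continue                      # too noisy: keep out of training
        acc = sums.setdefault(s.label, [0.0] * len(s.vector))
        for i, v in enumerate(s.vector):
            acc[i] += w * v
        weights[s.label] = weights.get(s.label, 0.0) + w
    return {label: [v / weights[label] for v in acc] for label, acc in sums.items()}


samples = [
    Sample([1.0, 0.0], "contract", manual=True),
    Sample([0.0, 1.0], "contract", manual=False, confidence=0.8),
    Sample([0.0, 1.0], "contract", manual=False, confidence=0.3),  # dropped
]
table = update_label_vectors(samples)
```

Running this function periodically and pushing its result into the matching step corresponds to one turn of the self-learning cycle described above.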
The automatic verification module is the key to realizing semi-supervised learning; it aims to obtain the archivable information of entered materials with little human intervention. It works as follows: the material information obtained in real time is processed, and for textual materials character strings are extracted as one basis for classifying the text. Because the extracted strings contain many worthless characters, a designed strategy is needed to extract multiple characters that can describe the features of the text; this set of strings describing the text's features is then converted into a feature vector.
A label set corresponding to the directories to be archived can be created in advance, and a feature vector pre-built for each label, forming the feature vector set of the labels.
The feature vector generated from the newly entered material can then be compared one by one against the label feature vector set by correlation analysis, yielding a group of strongly correlated labels. To reach a relatively high accuracy, the label with the strongest correlation can be obtained by controlling the correlation degree. If no label satisfies the condition, the label can be created, or the material can be uniformly assigned to "other".
The archive information repository is then queried with the obtained label to obtain the archive information.
Finally, the back-end program performs the archiving operation according to the archive information.
This process is shown in Figure 1.
As shown in Figure 2, the business implemented by the method mainly consists of four parts: data collection, information entry, archive management and data storage management. It can archive texts such as licenses and certificates, contracts, powers of attorney, policies and regulations, and supporting documents. Specifically:
Data collection can use a variety of means to obtain government affairs texts, for example document cameras, scanners, electronic materials and portable devices;
The information entry stage is divided into two modes: manual review entry and automatic review entry;
Archive management covers two cases, creating a new archive directory or managing an existing one, depending on whether an archive directory already exists;
The data management stage mainly includes data storage, ER indexing, data query and data deletion, so that the data can be fully reused by other services.
ER index: ER stands for Entity Relationship. It is usually expressed as a diagram, the entity relationship diagram, which is a way of describing entities, attributes and relationships. Using this approach, the relationships between a business matter entity and its various materials, and between one material and another, are established, and an index describing these complex relationships is provided; this is called the ER index.
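A minimal in-memory sketch of such an index is given below. It is an assumption about the data structure, since the patent describes the ER index only conceptually; the class `ERIndex` and its method names are hypothetical, and a real deployment would presumably keep these relations in the database.

```python
from collections import defaultdict


class ERIndex:
    """Tracks matter -> material and material <-> material relationships."""

    def __init__(self) -> None:
        self._materials_by_matter: dict[str, set[str]] = defaultdict(set)
        self._related: dict[str, set[str]] = defaultdict(set)

    def link_matter(self, matter_id: str, material_id: str) -> None:
        """Record that a business matter uses a material."""
        self._materials_by_matter[matter_id].add(material_id)

    def link_materials(self, a: str, b: str) -> None:
        """Record a symmetric relationship between two materials."""
        self._related[a].add(b)
        self._related[b].add(a)

    def materials_for(self, matter_id: str) -> list[str]:
        return sorted(self._materials_by_matter.get(matter_id, set()))

    def related_to(self, material_id: str) -> list[str]:
        return sorted(self._related.get(material_id, set()))


index = ERIndex()
index.link_matter("business_registration", "id_card")
index.link_matter("business_registration", "power_of_attorney")
index.link_materials("power_of_attorney", "id_card")
```

With such an index, another service handling the same applicant could look up already archived materials instead of asking for them again, which is the reuse the text describes.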
As shown in Figure 5, the method is implemented with a task scheduling module, a business processing module, a data management module and an AI service module, wherein:
The task scheduling module acts as the Controller (main control) and coordinates the operation of the other modules, including turning the automatic archiving mode and the self-learning mode on or off;
The business processing module handles the tasks that arise during business handling, including entering or selecting labels, scanning materials, and storage operations;
The data management module is responsible for creating, deleting, updating and querying data and for coordinating data resources;
The AI service module mainly provides intelligent computing services such as text recognition and strategy judgment.
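The coordinating role of the Controller can be sketched as below. The other three modules are reduced to plain callables, and the class layout, attribute names (`auto_archive`, `self_learning`, `training_pool`) and method `handle` are illustrative assumptions rather than the patent's design.

```python
from typing import Callable


class Controller:
    """Coordinates the modules; both modes are off by default."""

    def __init__(self,
                 classify: Callable[[str], str],      # AI service module
                 ask_operator: Callable[[str], str],  # business processing module
                 store: Callable[[str, str], str]) -> None:  # data management module
        self.classify = classify
        self.ask_operator = ask_operator
        self.store = store
        self.auto_archive = False
        self.self_learning = False
        self.training_pool: list[tuple[str, str, bool]] = []

    def handle(self, material: str) -> str:
        """Label one material manually or automatically, store it, and
        optionally queue it as a training sample."""
        manual = not self.auto_archive
        label = self.ask_operator(material) if manual else self.classify(material)
        location = self.store(material, label)
        if self.self_learning:
            self.training_pool.append((material, label, manual))
        return location


controller = Controller(
    classify=lambda m: "contract",
    ask_operator=lambda m: "id_card",
    store=lambda m, label: f"/archive/{label}/{m}",
)
first = controller.handle("a.png")   # manual mode
controller.auto_archive = True
controller.self_learning = True
second = controller.handle("b.png")  # automatic mode, sample queued
```

Keeping the two switches independent matches the text: self-learning can run alongside manual entry before automatic archiving is enabled.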
如下以某市智慧审批系统中政务文本自动归档的实现过程来具体描述本方法的应用:The application of this method is specifically described as follows based on the implementation process of automatic archiving of government documents in a city's smart approval system:
该项目要求将办理业务群众提供的文本归档,常见的文本材料有身份证、营业执照、银行卡、合同、委托书、政策法规、证明材料等,材料类型一般分为拍照、扫描件、电子材料等。归档的材料可以用于政务系统内部的资源共享,减少其他环节的办事流程,提高办事效率。This project requires the archiving of texts provided by people handling business. Common text materials include ID cards, business licenses, bank cards, contracts, power of attorney, policies and regulations, certification materials, etc. The types of materials are generally divided into photos, scanned copies, and electronic materials. wait. Archived materials can be used for resource sharing within the government system, reducing work processes in other links and improving work efficiency.
一般地,采用人工审核录入的方式对提交的材料逐一审核、归档。随着业务量的增长,大量的材料审核工作严重影响了事项办理进度,甚至带来归档出错风险,需要一种智能的方法实现对提交材料的自动审核和归档。Generally, the submitted materials are reviewed and archived one by one using manual review and entry. With the growth of business volume, a large amount of material review work has seriously affected the progress of matter processing, and even brought the risk of filing errors. An intelligent method is needed to realize the automatic review and filing of submitted materials.
With this method, the existing system can be fully reused and upgraded. Specifically, implementation is divided into three stages: the manual entry stage, the semi-supervised learning stage, and the unsupervised learning stage.
In the manual entry stage, material images are entered in the conventional way, gradually accumulating a large number of valid, labeled government text images.
In the semi-supervised learning stage, manual entry continues while the self-learning model is switched on, so that the accumulated sample images are used for classification learning and recognition accuracy improves step by step. Manual entry and self-learning thus run at the same time, and once enough samples have accumulated, the self-learning function can assist manual entry.
In the unsupervised learning stage, the system can learn autonomously with high accuracy, so manual entry and archiving are no longer needed; the person handling the case only has to submit the materials for the text image materials to be archived automatically.
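The three stages can be read as a simple state decision. The sketch below is illustrative only: the text gives no numeric criteria for moving between stages, so the thresholds `min_samples` and `min_accuracy`, and the function names, are assumptions made here.

```python
from enum import Enum


class Stage(Enum):
    MANUAL = "manual entry"
    SEMI_SUPERVISED = "semi-supervised learning"
    UNSUPERVISED = "unsupervised learning"


def current_stage(labeled_samples: int, accuracy: float,
                  min_samples: int = 1000, min_accuracy: float = 0.98) -> Stage:
    """Pick the stage from accumulated labels and measured accuracy.

    Both thresholds are hypothetical defaults, not values from the patent.
    """
    if labeled_samples < min_samples:
        return Stage.MANUAL             # still accumulating labeled images
    if accuracy < min_accuracy:
        return Stage.SEMI_SUPERVISED    # manual entry and self-learning run together
    return Stage.UNSUPERVISED           # fully automatic archiving


def needs_manual_label(stage: Stage) -> bool:
    """Manual entry continues in every stage except the last one."""
    return stage is not Stage.UNSUPERVISED
```

In practice the transition criteria would be set by the operator, for example through the configuration center.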
When a system is developed with this method, it can be divided into four modules: task scheduling, business processing, data management, and AI service. The architecture is designed for decoupling. The first three modules are developed by front-end and back-end engineers in Java, the mainstream language of government affairs systems, which keeps the system compatible with other systems. The AI service is developed by algorithm engineers in Python, the mainstream language in the industry, which plays to the strengths of the algorithms, exposes them as web services to the whole system, and allows rapid optimization and iteration.
Text recognition technology is widely used in the digital construction of government affairs systems, where it improves administrative efficiency and service quality. Typical scenarios include digital processing of official documents, digital entry of form information, and the conversion of smart-city data to text with automatic classification. Its use speeds up processing in government affairs systems and lowers processing costs. Automatic text archiving is an artificial-intelligence-based method of organizing and managing files that helps users identify and archive documents, pictures, audio, and video files quickly and accurately. It mainly relies on natural language processing, machine learning, and deep learning to classify, label, and archive documents.
Common machine learning methods are mainly divided into supervised learning and unsupervised learning. Put simply, whether learning is supervised depends on whether the input data carry labels: with labels it is supervised learning, and without labels it is unsupervised learning. Semi-supervised learning lies between the two. Part of its training data is labeled and the rest is not, and the unlabeled data usually far outnumber the labeled data, which matches real-world conditions. The underlying principle of semi-supervised learning is that data are not distributed completely at random, so the local features of a small amount of labeled data, combined with the overall distribution of a larger amount of unlabeled data, can yield acceptable or even very good classification results.
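One common way to exploit unlabeled data is self-training with pseudo-labels, sketched below on toy feature vectors. This is an illustrative stand-in, not the classifier described in the patent: the nearest-centroid rule, the `margin` confidence test, and all function names are assumptions chosen to keep the example dependency-free.

```python
from math import dist


def centroids(points, labels):
    """Return the mean vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def self_train(labeled, unlabeled, margin=0.5, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels.

    labeled: list of (point, label) with at least two classes; unlabeled: list of points.
    A point is pseudo-labeled when its nearest centroid is at most `margin`
    times as far away as the second-nearest one (hypothetical confidence rule).
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids([p for p, _ in labeled], [y for _, y in labeled])
        confident, rest = [], []
        for p in pool:
            ranked = sorted(cents, key=lambda y: dist(p, cents[y]))
            d1, d2 = dist(p, cents[ranked[0]]), dist(p, cents[ranked[1]])
            if d1 <= margin * d2:
                confident.append((p, ranked[0]))
            else:
                rest.append(p)          # ambiguous: leave for manual review
        if not confident:
            break
        labeled.extend(confident)
        pool = rest
    cents = centroids([p for p, _ in labeled], [y for _, y in labeled])
    return cents, labeled, pool
```

Points that never pass the confidence test stay in the returned pool, which corresponds to materials that would still need manual entry.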
An embodiment of the present invention further provides a government text archiving system based on semi-supervised learning, which implements the government text archiving method based on semi-supervised learning described in the above embodiments. As shown in Figure 6, the system includes interactive clients, an application server cluster, an AI server cluster, various data and database systems, and components that round out its functions.
The interactive clients include the business hall, mobile clients, Web clients, and the administrator client. They provide user functions such as information entry and lookup, and operation and maintenance functions for administrator users.
External clients of all kinds connect to the gateway cluster through "Nginx + firewall" to protect system information security.
Inside the system there are an application server cluster and an AI server cluster. The application server cluster implements the basic functions of the system, and the AI server cluster provides AI computing services for the system. Through the configuration center, archiving tasks can be customized and system operating parameters can be controlled.
PHP application server cluster: PHP stands for Hypertext Preprocessor and is a general-purpose open-source scripting language; an application server cluster developed in this language can be highly concurrent and distributed. K8s: short for Kubernetes, with the "8" standing for the eight letters between "K" and "s"; it is a well-known open-source container-based cluster management platform, used here to build the AI server cluster and to provide Docker (container) management and load balancing. Many similar management platforms exist, and a suitable one can be chosen at the technology selection stage according to actual needs.
Database services are deployed to provide data storage and create, delete, update, and query operations, and message queue and cache services are deployed to improve system stability. Kafka is an open-source distributed streaming platform from the Apache Software Foundation and serves as a high-throughput, persistent, distributed publish-subscribe message queue system. Redis is a free, open-source, high-performance key-value non-relational database released under the BSD license, used here to implement the cache service.
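How a cache and a message queue sit around the database can be sketched with standard-library stand-ins. The example deliberately does not use the Kafka or Redis client libraries: a `dict` plays the cache, `queue.Queue` plays the message queue, and the class name `ArchiveStore` and the event tuple format are hypothetical.

```python
import queue


class ArchiveStore:
    """Cache-aside reads plus an event queue, using in-process stand-ins."""

    def __init__(self):
        self.db = {}                   # stand-in for the database service
        self.cache = {}                # stand-in for the Redis cache service
        self.events = queue.Queue()    # stand-in for a Kafka topic
        self.db_reads = 0              # counts how often the database is hit

    def save(self, doc_id, record):
        """Write to the database, drop any stale cache entry, publish an event."""
        self.db[doc_id] = record
        self.cache.pop(doc_id, None)
        self.events.put(("archived", doc_id))

    def get(self, doc_id):
        """Serve from the cache when possible; otherwise read the database and fill the cache."""
        if doc_id in self.cache:
            return self.cache[doc_id]
        self.db_reads += 1
        record = self.db.get(doc_id)
        if record is not None:
            self.cache[doc_id] = record
        return record
```

Repeated reads of the same record reach the database only once, and consumers such as the AI service can pick up archived items from the queue at their own pace.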
Figure 7 shows the network architecture of a system developed with this method. Individual users' mobile devices and Web clients access the system through the public cloud and a firewall to complete their business; the business hall, internal private devices, and operation and maintenance clients access the system through the private cloud.
The back-end servers include application servers (App Server), AI servers (AI Server), and database servers (DB Server), which run application programming interface services (API Service), AI services (AI Service), and database (DB) operations respectively, according to the task type.
From the specific embodiments above, those skilled in the art can readily implement the present invention. It should be understood, however, that the invention is not limited to these embodiments. On the basis of the disclosed embodiments, those skilled in the art may combine different technical features as desired to obtain different technical solutions.
Apart from the technical features described in the specification, everything else is technology known to those skilled in the art.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311360019.2A CN117112872B (en) | 2023-10-20 | 2023-10-20 | Government affair text archiving method and system based on semi-supervised learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117112872A true CN117112872A (en) | 2023-11-24 |
CN117112872B CN117112872B (en) | 2024-07-12 |
Family
ID=88796891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311360019.2A Active CN117112872B (en) | 2023-10-20 | 2023-10-20 | Government affair text archiving method and system based on semi-supervised learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117112872B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109658062A (en) * | 2018-12-13 | 2019-04-19 | 广州华资软件技术有限公司 | A kind of electronic record intelligent processing method based on deep learning |
CN111461636A (en) * | 2019-01-22 | 2020-07-28 | 广东鼎义互联科技股份有限公司 | Virtual robot-based government affair service platform and application |
CN112182326A (en) * | 2020-10-16 | 2021-01-05 | 山东浪潮商用系统有限公司 | Efficient electronic archive management method and system |
CN113312476A (en) * | 2021-02-03 | 2021-08-27 | 珠海卓邦科技有限公司 | Automatic text labeling method and device and terminal |
WO2023019120A2 (en) * | 2021-08-13 | 2023-02-16 | Pricewaterhousecoopers Llp | Methods and systems for artificial intelligence-assisted document annotation |
CN115827939A (en) * | 2022-11-28 | 2023-03-21 | 华东冶金地质勘查局八一五地质队 | Digital archive management system |
CN116756395A (en) * | 2023-05-12 | 2023-09-15 | 严福 | Electronic archiving method and system for urban construction archives |
Non-Patent Citations (3)
Title |
---|
S. SHASHANK HOLLA 等: "End-to-End Speech Recognition for Low Resource Language Sanskrit using Self-Supervised Learning", 2022 INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS SIGNAL PROCESSING AND NETWORKING (WISPNET), 31 December 2022 (2022-12-31) * |
宋华;: "在线政务服务平台电子文件归档管理对策研究", 浙江档案, no. 05, 31 May 2019 (2019-05-31) * |
龚炜;: "一套基于人工智能技术的政务服务平台设计", 中国科技信息, no. 12, 15 June 2020 (2020-06-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN117112872B (en) | 2024-07-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||