CN117909440B

CN117909440B - Intelligent archive index and retrieval system

Info

Publication number: CN117909440B
Application number: CN202410277736.7A
Authority: CN
Inventors: 黄一强
Original assignee: Xiamen Lanji Archives Technology Co ltd
Current assignee: Xiamen Lanji Archives Technology Co ltd
Priority date: 2024-03-12
Filing date: 2024-03-12
Publication date: 2024-06-04
Anticipated expiration: 2044-03-12
Also published as: CN117909440A

Abstract

The invention provides an intelligent archive indexing and retrieval system, belonging to the field of archive retrieval. The system comprises an archive data acquisition module, a data preprocessing module, a semantic analysis unit and a data processing module. The archive data acquisition module scans paper archives to form scanned archives and retrieves archived electronic archives from an electronic archive database. The data preprocessing module processes the electronic archives and the scanned archives to form archives to be indexed, and the semantic analysis unit extracts labels and abstracts from the archives to be indexed. The system solves the prior-art problem of managing paper and electronic archives in a unified way, and realizes automatic processing and efficient retrieval of data. Through semantic analysis, classification coding and related techniques, it improves the accuracy and efficiency of retrieval, meets complex and changing retrieval requirements, and brings more convenience and value to enterprises and organizations.

Description

Intelligent archive index and retrieval system

Technical Field

The invention belongs to the field of archive retrieval, and particularly relates to an intelligent archive indexing and retrieving system.

Background

With the advent of the digital age, businesses and organizations face vast amounts of electronic archive data, and managing and retrieving these archives efficiently has become an urgent need. Traditional archive management is often based on manual classification and labeling, which is inefficient and struggles to meet complex and changing retrieval requirements. The advent of intelligent archive indexing and retrieval systems has therefore revolutionized archive management. Such a system uses computer technology to process and analyze electronic archives automatically, so that indexing and retrieval become efficient and accurate.

Natural Language Processing (NLP) is an important branch of the fields of computer science and artificial intelligence, which studies how to let computers understand and process human language. In intelligent archive indexing and retrieval systems, NLP techniques are used to extract key information in the archive, such as labels, summaries, etc., and to understand the query intent of the user. Machine learning is a technique that allows computer systems to learn from data and improve performance. In intelligent archive indexing and retrieval systems, machine learning techniques are used to train models to achieve automatic classification, clustering, and ordering of archive data. The models can be continuously optimized according to historical data and user feedback, and the retrieval accuracy and efficiency are improved. Information retrieval is a branch of computer science that studies how to find and obtain relevant information from a large collection of documents. In intelligent archive indexing and retrieval systems, information retrieval techniques are used to construct efficient indexing structures and algorithms to support fast and accurate archive retrieval. With the increasing number of electronic files, processing and analyzing these data requires significant computing power. Big data technology provides a distributed storage and computation framework such as Hadoop and Spark, and efficient data processing and analysis tools, and provides powerful technical support for intelligent archive indexing and retrieval systems.

The intelligent archive index and retrieval system improves archive management efficiency and quality, and brings more commercial value and competitive advantage to enterprises and organizations. Through the system, a user can quickly find out required file information, so that the working efficiency is improved; meanwhile, enterprises and organizations can better mine and utilize the value of the archive data, and support decision making. Therefore, the intelligent archive index and retrieval system has wide application prospect and important significance in modern enterprises and organizations.

Disclosure of Invention

The invention provides an intelligent archive index and retrieval system, which solves the problem of unified management of paper and electronic archives in the prior art, and realizes automatic processing and efficient retrieval of data. Through semantic analysis, classification coding and other technologies, the accuracy and the efficiency of retrieval are improved, the complex and changeable retrieval requirements are met, and more convenience and value are brought to enterprises and organizations.

The technical scheme of the invention is realized as follows. The intelligent archive indexing and retrieval system comprises an archive data acquisition module, a data preprocessing module, a semantic analysis unit and a retrieval module. The archive data acquisition module scans paper archives to form scanned archives and, at the same time, retrieves archived electronic archives from an electronic archive database. The data preprocessing module processes the electronic archives and the scanned archives to form archives to be indexed. The semantic analysis unit extracts labels and abstracts from the archives to be indexed, and the extracted labels and abstracts are classified semantically. The indexes in the label and abstract lists are grouped into set classifications, the set classifications are classified by archive category, and an optimized position index is set for each classified set classification. The indexes in each set classification are encoded according to the set classification category to form coding groups, each coding group corresponding to one set classification category, and all coding groups are imported into the retrieval module. When a user performs a retrieval, the retrieval module parses the user's retrieval instruction, matches the parsed instruction against the set classification categories, and selects the set classification category with a high matching degree. The analysis code of the retrieval instruction is then matched within the corresponding coding group, a retrieval list is formed according to the matching degree, and the retrieval list is sent to the user.

Compared with the present system, a traditional archive management system usually handles either electronic archives or paper archives, but not both, and paper documents are typically entered by hand or classified manually after scanning. The intelligent archive indexing and retrieval system processes paper and electronic archives together: paper archives are converted into a digital format by image scanning and are then processed in the same way as electronic archives. Many existing archive systems rely on simple keyword matching or metadata-based retrieval and lack deep semantic understanding. This system uses a semantic analysis unit to extract and classify labels and abstracts, so it can understand archive content more accurately and thereby improve the precision and recall of retrieval. Traditional archive retrieval methods generally do not classify and encode sets of labels and abstracts. This system not only performs semantic classification but also further classifies the set classifications by archive category and assigns an optimized position index and a code to each category, and this structured approach increases retrieval efficiency. A typical retrieval system may search the full library directly, which is inefficient and may return a large number of irrelevant results. In this system, the user's retrieval instruction is first matched to the set classification categories and then matched in detail within a particular coding group, and this hierarchical retrieval improves both speed and accuracy. Conventional retrieval results are typically presented as a simple list and may not contain relevance scores or other information related to the query.
The retrieval list generated by this system is ordered by matching degree and may also contain additional information such as relevance scores and document sources, giving the user a more complete and more intuitive retrieval experience. Through unified archive processing, deep semantic analysis, an optimized retrieval flow and rich presentation of results, the intelligent archive indexing and retrieval system differs from, and has clear advantages over, the prior art.

As a preferred embodiment, when the optimized position index is set, the position indexes within one set classification are denoted {0, 1, 2, …, a, b}; when a+1<b, the data objects at position indexes 0 to i have all been accessed and position index a+1 has not been accessed, and a is set as the optimized position index for the current access.

As a preferred embodiment, when the abstracts and indexes are classified into sets, K-means clustering is used to form index sets; the labels and abstract indexes in each set classification are then analyzed in depth to obtain the data distribution of each category; the ranking criteria of each set classification are determined from the analysis results; a weight is calculated for each ranking criterion; a position score is calculated for every index in each set classification by combining the ranking criteria and the weights; and the indexes in the set classification are sorted by position score.

As a preferred embodiment, a coding scheme is created according to the characteristics and the number of the set classification categories, a corresponding coding group is created for each set classification category according to the determined coding scheme, and each set classification category is mapped with the corresponding coding group.

As a preferred embodiment, after a search instruction is input by a user, analyzing the search instruction, extracting key information and parameters in the search instruction, generating an analysis code of the search instruction, determining a matching rule according to the characteristics of a coding group and the analysis code, comparing the analysis code with codes in the coding group one by one, judging whether the matched codes exist according to the matching rule, and recording the successfully matched codes and the corresponding set classification categories thereof to form a search result.

As a preferred embodiment, when the data preprocessing module processes the electronic file and the scanning file, the duplicate data in the electronic file and the scanning file are deduplicated, the redundant file is cleaned, and the electronic file and the scanning file are formatted uniformly to form the file to be indexed.

With the above technical scheme, the invention has the following beneficial effects. High efficiency: the system greatly improves the efficiency of archive management through automatic acquisition, processing and analysis of archive data. Paper archives are quickly converted into a digital format by image scanning and processed together with electronic archives, which eliminates the tedious and time-consuming steps of traditional paper archive management. Accuracy: semantic analysis is used to understand and classify archives in depth, which ensures accurate indexing and retrieval. The system extracts the key information in an archive and classifies and encodes it according to its semantic content, improving the precision and recall of retrieval. Comprehensiveness: the system supports unified management of paper and electronic archives and ensures that archive data is complete. Users search through a single interface without switching between different systems and formats, which improves working efficiency and user experience. Flexibility: through the design of set classifications and coding groups, the system can handle archive data of different categories and complexity. The coding groups allow efficient retrieval by archive category and meet diverse and personalized retrieval needs. Scalability: the system has a modular design in which the modules are independent of one another, so it is easy to extend and upgrade. As technology develops and user needs change, new functional modules can be added or existing modules optimized, keeping the system current and adaptable.
User friendliness: the retrieval module provides an intuitive, easy-to-use interface that lets users retrieve archives with simple operations. The retrieval list is ordered by matching degree and provides relevance scores and other additional information to help users find the required archive information quickly.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a block diagram of a system of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Examples:

As shown in FIG. 1, the intelligent archive indexing and retrieval system comprises an archive data acquisition module, a data preprocessing module, a semantic analysis unit and a retrieval module. The archive data acquisition module scans paper archives to form scanned archives and, at the same time, retrieves archived electronic archives. The data preprocessing module processes the electronic archives and the scanned archives to form archives to be indexed. The semantic analysis unit extracts labels and abstracts from the archives to be indexed, and the extracted labels and abstracts are classified semantically. The indexes in the label and abstract lists are grouped into set classifications, the set classifications are classified by archive category, and an optimized position index is set for each classified set classification. The indexes in each set classification are encoded according to the set classification category to form coding groups, each coding group corresponding to one set classification category, and all coding groups are imported into the retrieval module. When a user performs a retrieval, the retrieval module parses the user's retrieval instruction, matches the parsed instruction against the set classification categories, and selects the set classification category with a high matching degree. The analysis code of the retrieval instruction is then matched within the corresponding coding group, a retrieval list is formed according to the matching degree, and the retrieval list is sent to the user.

The intelligent archive indexing and retrieval system works as follows. Paper archives are converted into digital images by image scanning to form scanned archives. Archived electronic archives are automatically fetched or recalled from the electronic archive database. The electronic archives and the scanned archives undergo standardization, denoising, OCR (optical character recognition) and similar processing in preparation for subsequent indexing and retrieval. Natural language processing is used to analyze the text of the archives to be indexed in depth and to extract key labels and abstracts. A semantic analysis algorithm classifies the extracted labels and abstracts to form archive sets with similar topics. The classified archive sets are indexed to optimize retrieval speed. The indexes in each set classification are encoded, and different coding groups are formed by archive category, each coding group corresponding to a specific set classification. The retrieval instruction entered by the user is parsed and its key information is extracted. The parsed instruction is matched against the existing set classifications, and the set classification with a high matching degree is selected for further retrieval. Detailed matching is then carried out within the selected coding group to find the archives that correspond to the user's instruction. A retrieval list is formed according to the matching degree; it contains the title, abstract, matching degree and other information of the relevant archives. The retrieval list is sent to the user, who browses and obtains the relevant archives as needed.

The workflow of the whole system is as follows. Initialization: system parameters are configured, the connection to the electronic archive database is established, and the data preprocessing module is prepared. Data acquisition: paper and electronic archives are processed together, and their data is collected and preprocessed. Semantic analysis: the archives to be indexed are analyzed semantically, labels and abstracts are extracted, and semantic classification is performed. Index construction: a set classification index is built from the semantic classification results, and the coding groups are designed. User interaction: the user's retrieval instruction is received, parsed and matched. Retrieval execution: retrieval is performed within the selected set classification and coding group to find the relevant archives. Result presentation: the retrieval list is generated and displayed to the user. User feedback and optimization: the system is continuously optimized and upgraded according to user feedback and usage. Through a highly automated processing flow, semantic analysis and an optimized retrieval algorithm, the intelligent archive indexing and retrieval system improves the efficiency and accuracy of archive management and gives users a convenient and efficient retrieval experience.

When the optimized position index is set, the position indexes within one set classification are denoted {0, 1, 2, …, a, b}; when a+1<b, the data objects at position indexes 0 to i have all been accessed and position index a+1 has not been accessed, and a is set as the optimized position index for the current access. In more detail, there is a series of set classifications whose elements are ordered by some rule or attribute, and the positions of these elements are labeled {0, 1, 2, …, a, b}. The condition a+1<b means that at least one position lies between a and b. All data objects between position index 0 and i have been accessed, that is, the data at these positions has already been processed or retrieved. In contrast, the data at position index a+1 has not yet been accessed, which means that position a+1 was skipped during processing or retrieval. On this basis, the system sets a as the optimized position index for the current access. Starting from position a in the next processing or retrieval task may then be the faster choice, because the positions after a (for example a+1) have not yet been accessed. In other words, when data in a set classification is processed or retrieved, the next starting position is derived from the previous access record. In this way the system avoids repeated or unnecessary operations and improves overall performance.
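The rule above can be sketched in a few lines of Python. This is only an illustrative reading, not the patent's implementation: it assumes that i coincides with a, so that the optimized position index is the end of the longest fully accessed prefix starting at position 0. The function name `optimized_position_index` and the sample access log are hypothetical.

```python
from typing import Iterable, Optional


def optimized_position_index(accessed: Iterable[int]) -> Optional[int]:
    """Return the largest a such that positions 0..a have all been accessed.

    Returns None when position 0 itself has not been accessed yet.
    Assumption: the patent's "0 to i" is read as "0 to a".
    """
    seen = set(accessed)
    a = -1
    # Walk forward until the first position that has not been accessed.
    while (a + 1) in seen:
        a += 1
    return a if a >= 0 else None


# Positions 0..3 were accessed, 4 and 5 were skipped, 6 was accessed (a = 3, b = 6).
access_log = [0, 1, 2, 3, 6]
a = optimized_position_index(access_log)
print(a)  # 3
```

A later access could resume scanning at position a + 1 instead of re-reading positions 0 to a.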

As a preferred embodiment, when the abstracts and indexes are classified into sets, K-means clustering is used to form index sets; the labels and abstract indexes in each set classification are then analyzed in depth to obtain the data distribution of each category; the ranking criteria of each set classification are determined from the analysis results; a weight is calculated for each ranking criterion; a position score is calculated for every index in each set classification by combining the ranking criteria and the weights; and the indexes in the set classification are sorted by position score. K-means is an unsupervised machine learning algorithm that divides data into K clusters, or sets. Here the abstracts and indexes are treated as data points and divided into sets by the K-means algorithm, so that the abstracts and indexes within one set are similar or related to some degree. Once the data has been divided into sets, the labels and abstract indexes in each set are analyzed in depth. The analysis may count label frequencies, abstract lengths, keyword occurrences and similar quantities in each set to learn more about the data distribution. Based on these results, one or more ranking criteria are determined for each set. For example, if label frequencies vary widely within a set, label frequency may be an important ranking criterion. The choice of ranking criteria depends on the analysis results and on the specific business needs or objectives. A weight is then calculated for each ranking criterion; it expresses how much that criterion contributes to the final ranking. For example, if label frequency is considered very important, it may be given a higher weight.
By combining the ranking criteria with their weights, a position score is calculated for every index in each set. This score reflects the importance or relevance of the index within its set. Finally, the indexes in each set are sorted by position score, so that when a user queries or retrieves data the results can be presented in this order, with the more relevant information first. The process combines machine learning (K-means clustering) with data analysis to classify and rank abstracts and indexes, thereby improving the efficiency and accuracy of retrieval.
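A minimal sketch of this clustering and scoring step follows. The patent does not specify feature vectors, criteria or weights, so everything concrete here is an assumption made for illustration: each index entry carries a hypothetical two-dimensional `vector`, the ranking criteria are `tag_frequency` and `keyword_hits`, the weights are 0.7 and 0.3, and the position score is a weighted sum of min-max normalised criteria. The K-means routine is plain Lloyd's algorithm with deterministic farthest-point seeding, written with the standard library only.

```python
import math
from collections import defaultdict


def kmeans(points, k, iterations=20):
    """Lloyd's K-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def position_scores(entries, weights):
    """Weighted sum of min-max normalised criteria inside one set classification."""
    scores = defaultdict(float)
    for criterion, weight in weights.items():
        values = [e[criterion] for e in entries]
        low = min(values)
        span = (max(values) - low) or 1  # avoid division by zero when all values are equal
        for e in entries:
            scores[e["id"]] += weight * (e[criterion] - low) / span
    return dict(scores)


# Hypothetical index entries: a feature vector for clustering plus two ranking criteria.
entries = [
    {"id": "doc-a", "vector": (1.0, 1.0), "tag_frequency": 2, "keyword_hits": 5},
    {"id": "doc-b", "vector": (1.2, 0.8), "tag_frequency": 10, "keyword_hits": 1},
    {"id": "doc-c", "vector": (0.9, 1.1), "tag_frequency": 6, "keyword_hits": 3},
    {"id": "doc-d", "vector": (8.0, 8.0), "tag_frequency": 4, "keyword_hits": 4},
    {"id": "doc-e", "vector": (8.2, 7.9), "tag_frequency": 4, "keyword_hits": 2},
    {"id": "doc-f", "vector": (7.8, 8.1), "tag_frequency": 8, "keyword_hits": 2},
]
weights = {"tag_frequency": 0.7, "keyword_hits": 0.3}  # assumed weights

labels = kmeans([e["vector"] for e in entries], k=2)
clusters = defaultdict(list)
for entry, label in zip(entries, labels):
    clusters[label].append(entry)

ranking = {}
for label, members in clusters.items():
    scores = position_scores(members, weights)
    ranking[label] = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In practice the weights would come from the data analysis described above rather than being fixed by hand, and a library implementation of K-means could replace the hand-written routine.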

As a preferred embodiment, a coding scheme is created according to the characteristics and number of the set classification categories, a corresponding coding group is created for each set classification category according to the determined coding scheme, and each set classification category is mapped to its coding group. Different set classification categories may have different characteristics, such as their size, the number of elements they contain, and their relationships to other categories. The number of set classification categories also matters, because it determines the complexity of the coding scheme and the number of coding groups required. A suitable coding scheme is designed on this basis. The scheme should represent each set classification category efficiently and allow it to be identified quickly and accurately in later processing, and it may use specific characters, numbers or combinations of symbols to represent the different categories. A coding group is then created for each set classification category according to the scheme; the coding group contains all the codes used to represent that category. For example, if there are 10 different set classification categories and a numeric coding scheme is used, there may be 10 coding groups from 00 to 09, each corresponding to one category. Mapping associates a set classification category with its coding group, typically by creating a lookup table or database that stores the correspondence between categories and coding groups.
Once the mapping is established, the system can quickly find the coding group that corresponds to a set classification, which is useful in data processing, storage and retrieval. In this way the system handles set classification categories in a structured and efficient manner, which is important for large-scale data processing and information management.
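The numeric example above (10 categories, coding groups 00 to 09) can be sketched as follows. The category names, the function names and the `group-position` layout of individual index codes are hypothetical; the patent only requires that each set classification category maps to one coding group.

```python
def build_coding_groups(categories):
    """Map each set classification category to a zero-padded numeric group code."""
    # Use at least two digits, and more when there are over 100 categories.
    width = max(2, len(str(max(len(categories) - 1, 0))))
    return {category: f"{number:0{width}d}" for number, category in enumerate(categories)}


def encode_index(group_code, position):
    """Code for one index entry: group code plus its position inside the set."""
    return f"{group_code}-{position:04d}"


# Ten hypothetical set classification categories.
categories = ["contracts", "personnel", "finance", "engineering", "legal",
              "procurement", "research", "sales", "audit", "correspondence"]

group_of = build_coding_groups(categories)                    # category -> coding group
category_of = {code: cat for cat, code in group_of.items()}   # reverse lookup table

print(group_of["contracts"], group_of["correspondence"])  # 00 09
print(encode_index(group_of["finance"], 17))               # 02-0017
print(category_of["04"])                                   # legal
```

Keeping both directions of the lookup table lets the retrieval side go from a matched code back to its set classification category without scanning.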

As a preferred embodiment, after the user inputs a retrieval instruction, the instruction is parsed, the key information and parameters in it are extracted, and an analysis code of the instruction is generated; a matching rule is determined according to the characteristics of the coding group and the analysis code; the analysis code is compared one by one with the codes in the coding group; whether a matching code exists is judged according to the matching rule; and the successfully matched codes and their corresponding set classification categories are recorded to form the retrieval result. The user enters the retrieval instruction through some interface or device, in the form of text, speech, an image or the like, expecting the system to return data or information related to it. After receiving the instruction, the system first parses it. Parsing may involve natural language processing, speech recognition, image recognition and similar techniques in order to extract the key information and parameters. The system then generates an analysis code from the extracted key information and parameters. The analysis code is an encoded representation of the original retrieval instruction and contains all the key information needed for subsequent matching. Before the actual code matching, the system determines a set of matching rules. These rules define how the analysis code is compared with the codes in a coding group and which comparison results count as a match; they may be based on similarity calculation, pattern matching, semantic analysis or similar methods. The system then compares the analysis code one by one with the codes in each coding group according to the matching rules.
The comparison may involve calculating similarity scores, checking whether particular patterns are present, analyzing semantic relationships and similar operations. After all comparisons are complete, the system judges according to the matching rules whether any code matches the analysis code. A matching code means that data or information related to the user's retrieval instruction has been found. For each successfully matched code, the system records the code and its corresponding set classification category, and this information is used to form the final retrieval result. Finally, the system integrates all successfully matched codes and their set classification categories into a complete retrieval result, which is ranked in some way (for example by similarity score or by time) and presented to the user. Through this process the system understands and responds to the user's retrieval needs and finds the data or information related to the instruction quickly and accurately.
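One possible shape of this parse-and-match flow is sketched below. The patent leaves the form of the analysis code and the matching rule open, so the choices here are assumptions for illustration only: the analysis code is a set of keywords, each stored code is associated with a keyword set, and the matching rule is a Jaccard similarity at or above a threshold. The stop-word list, the sample coding groups and the threshold 0.2 are hypothetical.

```python
import re

# Hypothetical stop words dropped while parsing a retrieval instruction.
STOPWORDS = frozenset({"find", "the", "from", "a", "of", "for"})


def parse_instruction(text):
    """Extract lowercase keywords; this keyword set stands in for the analysis code."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def jaccard(a, b):
    """Similarity of two keyword sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(text, coding_groups, threshold=0.2):
    """Compare the analysis code with every code, keep matches, rank by score."""
    query = parse_instruction(text)
    results = []
    for category, codes in coding_groups.items():
        for code, keywords in codes.items():
            score = jaccard(query, keywords)
            if score >= threshold:  # assumed matching rule
                results.append((score, code, category))
    # Highest score first; ties are broken by code for a stable order.
    return sorted(results, key=lambda r: (-r[0], r[1]))


# Hypothetical coding groups: category -> {code: keywords of the indexed archive}.
coding_groups = {
    "contracts": {
        "00-0001": frozenset({"lease", "contract", "2021"}),
        "00-0002": frozenset({"supplier", "contract"}),
    },
    "personnel": {
        "01-0001": frozenset({"employee", "onboarding", "record"}),
    },
}

results = search("Find the lease contract from 2021", coding_groups)
for score, code, category in results:
    print(f"{score:.2f} {code} {category}")
```

Each result keeps the matched code together with its set classification category, which mirrors the record that forms the retrieval result.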

As a preferred embodiment, when the data preprocessing module processes the electronic archives and the scanned archives, duplicate data in them is removed, redundant archives are cleaned, and the formats of the electronic archives and the scanned archives are unified to form the archives to be indexed. An electronic archive generally refers to a document stored in a digital format such as PDF, Word or Excel. A scanned archive is a paper document that has been converted into a digital image by a scanner and is typically stored in JPEG, TIFF or PDF format. Deduplication is an important data processing step that eliminates duplicate items in a data set. Here it means identifying and removing archives or data items with the same or similar content across the electronic archives and the scanned archives, in order to reduce redundancy and improve data quality. Redundant archive cleaning refers to further cleaning and optimization of the data set to remove unnecessary or repeated information. It may include deleting invalid data, correcting erroneous data and normalizing data formats, so as to ensure that the data is accurate, consistent and valid. Because the sources and formats of electronic and scanned archives vary, they must be converted to a uniform format before further processing. Format unification converts documents of different formats to the same standard format for subsequent indexing, storage and retrieval. It typically relies on file conversion, such as converting PDF files to text files or JPEG images to PDF. After these steps, the electronic archives and the scanned archives are integrated into one unified, cleaned data set, called the archives to be indexed. "To be indexed" means that the archives are ready for indexing, so that users can quickly find the desired information by searching or other query methods.
The process improves the quality, consistency and accessibility of the electronic and scanned archives and lays a foundation for subsequent data management and analysis.
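The preprocessing steps above can be sketched as follows, under several stated assumptions: text has already been extracted from every archive (for scanned archives, by an OCR step not shown), the unified format is a plain dictionary with normalised text, and only exact duplicates after normalisation are removed by comparing SHA-256 digests. Near-duplicate detection would need a similarity measure instead. The record layout and function names are hypothetical.

```python
import hashlib


def normalise(text):
    """Unify formatting: collapse whitespace and lowercase the text."""
    return " ".join(text.split()).lower()


def build_archives_to_be_indexed(electronic, scanned):
    """Merge both sources into one cleaned, deduplicated list of records."""
    seen = set()
    unified = []
    for source, docs in (("electronic", electronic), ("scanned", scanned)):
        for doc in docs:
            body = normalise(doc["text"])
            if not body:
                continue  # cleaning: drop archives with no usable content
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # deduplication: same content already kept
            seen.add(digest)
            unified.append({"id": doc["id"], "source": source,
                            "text": body, "sha256": digest})
    return unified


# Hypothetical inputs; s1 duplicates e1 once whitespace and case are normalised.
electronic = [{"id": "e1", "text": "Annual  Report 2023"},
              {"id": "e2", "text": ""}]
scanned = [{"id": "s1", "text": "annual report 2023\n"},
           {"id": "s2", "text": "Lease contract"}]

to_be_indexed = build_archives_to_be_indexed(electronic, scanned)
print([record["id"] for record in to_be_indexed])  # ['e1', 's2']
```

Storing the digest alongside each record also makes it cheap to check later whether a newly acquired archive is already present.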

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. An intelligent archive indexing and retrieval system, characterized by comprising an archive data acquisition module, a data preprocessing module, a semantic analysis unit and a retrieval module, wherein: the archive data acquisition module scans paper archives to form scanned archives and also retrieves archived electronic archives; the data preprocessing module processes the electronic archives and the scanned archives to form archives to be indexed; the semantic analysis unit extracts labels and abstracts from the archives to be indexed; the extracted labels and abstracts are classified semantically; the indexes in the label and abstract lists are grouped into set classifications; the set classifications are classified by archive category; an optimized position index is set for each classified set classification; the indexes in each set classification are encoded according to the set classification category to form coding groups, each coding group corresponding to one set classification category; all coding groups are imported into the retrieval module; when a user performs a retrieval, the retrieval module parses the user's retrieval instruction, matches the parsed instruction against the set classification categories, and selects the set classification category with a high matching degree; the analysis code of the retrieval instruction is matched within the corresponding coding group, a retrieval list is formed according to the matching degree, and the retrieval list is sent to the user; when the optimized position index is set, the position indexes within one set classification are denoted {0, 1, 2, …, a, b}, and when a+1<b, the data objects at position indexes 0 to i have all been accessed, position index a+1 has not been accessed, and a is set as the optimized position index for the current access; and when the abstracts and indexes are classified into sets, K-means clustering is used to form index sets, the labels and abstract indexes in each set classification are analyzed in depth to obtain the data distribution of each category, the ranking criteria of each set classification are determined from the analysis results, a weight is calculated for each ranking criterion, a position score is calculated for every index in each set classification by combining the ranking criteria and the weights, and the indexes in the set classification are sorted by position score.

2. The intelligent archive indexing and retrieval system of claim 1, wherein: a coding scheme is created according to the characteristics and number of the set classification categories, a corresponding coding group is created for each set classification category according to the determined coding scheme, and each set classification category is mapped to its corresponding coding group.

3. The intelligent archive indexing and retrieval system of claim 1, wherein: after the user inputs a retrieval instruction, the instruction is parsed, the key information and parameters in it are extracted, and an analysis code of the instruction is generated; a matching rule is determined according to the characteristics of the coding group and the analysis code; the analysis code is compared one by one with the codes in the coding group; whether a matching code exists is judged according to the matching rule; and the successfully matched codes and their corresponding set classification categories are recorded to form a retrieval result.

4. The intelligent archive indexing and retrieval system of claim 1, wherein: when the data preprocessing module processes the electronic archives and the scanned archives, duplicate data in the electronic archives and the scanned archives is removed, redundant archives are cleaned, and the formats of the electronic archives and the scanned archives are unified to form the archives to be indexed.