CN118862036B - An intelligent archive management system and method based on big data - Google Patents
- Publication number
- CN118862036B (application CN202411113876.7A)
- Authority
- CN
- China
- Prior art keywords
- classification
- uploaded
- database
- file
- determined
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/64—Protecting data integrity, e.g. using checksums, certificates or signatures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an intelligent archive management system and method based on big data, relating to the technical field of archive management. In the method, identity verification ensures that only authorized personnel can upload files, and an upload feasibility check prevents non-compliant archives from being stored. Intelligent classification and dynamic retrieval further optimize the organizational structure of the archive, keeping it current and relevant, while outdated or no-longer-needed archives are automatically backed up and removed, reducing the storage burden and ensuring data integrity and accessibility. The intelligent classification and periodic cleaning mechanism helps the organization manage storage space effectively, reduces unnecessary storage costs, and effectively improves the efficiency and security of archive management.
Description
Technical Field
The invention relates to the technical field of archive management, in particular to an intelligent archive management system and method based on big data.
Background
Through reasonable use of an electronic archive management system, the files of enterprises and public institutions can not only be managed electronically, but the multidimensional information in those files can also be processed as data, promoting the development of the corresponding institutions and increasing the added value of enterprises.
In archive information management and utilization, retrieving electronic archives is a ubiquitous step. Current electronic archive management systems support a variety of search modes, with full-text search, fuzzy search, and precise search being dominant. Full-text search suits a searcher who remembers only part of a document's content: when partial content is entered into the full-text search engine, the system automatically searches its word stock for electronic archives containing that content. Because the searcher enters only a small amount of text, the target document cannot be located precisely; most full-text search word stocks are therefore massive databases, designed to return and display as many documents containing the queried content as possible.
When managing archives, the prior art cannot conveniently classify them intelligently based on their keywords; the classification process is complex and the storage burden is increased.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an intelligent archive management system and method based on big data, solving the prior-art problems that archives are inconvenient to classify intelligently based on their keywords, the classification process is complicated, and the storage burden is increased.
The intelligent archive management method based on big data comprises the following steps: verifying the identity of the person uploading an archive; after verification passes, performing an upload feasibility check on the archive to be uploaded, where a failed check forbids uploading and a successful check permits it; intelligently classifying the archive permitted to be uploaded and determining its final recommended classification type in the database; retrieving the archives already stored under that final recommended classification type and determining whether any archive should be removed; and, if so, backing up the archive to be removed and then removing it.
Verifying the uploader's identity comprises: obtaining the uploader's identity information, consisting of an account number, a password, and a staff number, and verifying it against the database. If the identity information does not match a person stored in the database, verification fails and a non-permitted-login prompt is issued; if it matches, verification passes and the archive to be uploaded is received. Keywords are then extracted from the archive to obtain a content keyword set, which is compared with the forbidden-word set stored in the database. If the two sets intersect, the upload feasibility check fails; if they do not intersect, the check succeeds and the upload is permitted.
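The verification and feasibility-check steps above can be sketched as follows; the function names, field names, and example data are illustrative assumptions, not taken from the patent:

```python
def verify_uploader(identity, registered_users):
    """Return True only if the (account, password, staff number) record
    matches a record stored in the database (modeled here as a list)."""
    return identity in registered_users

def upload_feasible(content_keywords, forbidden_words):
    """Upload is feasible only when the extracted content-keyword set
    shares no element with the stored forbidden-word set."""
    return not (set(content_keywords) & set(forbidden_words))

# Example: a keyword overlapping the forbidden set blocks the upload.
registered = [{"account": "a01", "password": "pw", "staff_no": "1001"}]
user = {"account": "a01", "password": "pw", "staff_no": "1001"}
ok = None
if verify_uploader(user, registered):
    ok = upload_feasible({"budget", "contract"}, {"classified", "contract"})
    # ok is False: "contract" appears in the forbidden-word set
```

The intersection test mirrors the patent's check exactly: any non-empty overlap between the content keyword set and the forbidden-word set fails the feasibility check.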
Further, intelligently classifying the archive permitted to be uploaded and determining its final recommended classification type in the database comprises the following steps: extracting keywords from the cited content of the archive to obtain a cited-document keyword set; extracting keywords from the legends of the archive to obtain a legend keyword set; obtaining the proportion of structured to unstructured content of the archive; obtaining, for each classification type stored in the database, the content-keyword reference set, the cited-document keyword reference set, the legend keyword reference set, and the structured-to-unstructured content reference proportion; and fusing these data to obtain the classification matching index of each classification type according to the following classification matching index formula:
Where i is the number of a classification type stored in the database; Fz_i is the classification matching index of the i-th classification type; A_1 is the content keyword set of the archive permitted to be uploaded, B_1 its cited-document keyword set, C_1 its legend keyword set, and BL its proportion of structured to unstructured content; A_2i, B_2i and C_2i are, respectively, the content-keyword, cited-document keyword, and legend keyword reference sets of the i-th classification type stored in the database; BL_2i is the structured-to-unstructured content reference proportion of the i-th classification type; e is the natural constant; and σ(·,·) denotes the similarity between a keyword set of the archive and the corresponding reference set of the i-th classification type. The classification matching indexes are sorted by size; based on a selection number N set in the database, the N classification types with the highest-ranking matching indexes are taken as the to-be-determined recommended classification types, and one of them is determined as the final recommended classification type.
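The formula image itself is not reproduced in the published text. A hedged reconstruction consistent with the symbol definitions above, summing the three similarity terms and applying an exponential penalty to the proportion mismatch, is an assumption rather than the patented formula, and might read:

```latex
Fz_i = \sigma(A_1, A_{2i}) + \sigma(B_1, B_{2i}) + \sigma(C_1, C_{2i})
       + e^{-\left| BL - BL_{2i} \right|}
```

Under this reading, a classification type whose reference keyword sets are similar to the archive's sets and whose reference content proportion is close to the archive's proportion receives a high matching index.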
Determining the final recommended classification type from the to-be-determined recommended classification types comprises: pre-storing the archive permitted to be uploaded under each to-be-determined recommended classification type and retrieving it from each, thereby obtaining actual retrieval-condition data for each type, comprising the average response time, retrieval success rate, average CPU utilization, and average bandwidth utilization; obtaining the reference retrieval-condition data stored in the database for each type, comprising the reference average response time, reference retrieval success rate, reference average CPU utilization, and reference average bandwidth utilization; obtaining the retrieval-condition allowable-deviation data for each type, comprising the allowable deviations of the average response time, retrieval success rate, average CPU utilization, and average bandwidth utilization; fusing the actual data, the reference data, and the allowable deviations to obtain a retrieval-condition evaluation value for each to-be-determined recommended classification type; and determining the type with the largest evaluation value as the final recommended classification type.
Further, the calculation formula of the calling condition evaluation value is as follows:
In the formula, j is the number of a to-be-determined recommended classification type; Ts is the actual average response time, Ds the actual retrieval success rate, CPs the actual average CPU utilization, and Ks the actual average bandwidth utilization; Tc_j, Dc_j, CPc_j and Kc_j are, respectively, the reference average response time, reference retrieval success rate, reference average CPU utilization, and reference average bandwidth utilization of the j-th to-be-determined recommended classification type stored in the database; and ΔTc_j, ΔDc_j, ΔCPc_j and ΔKc_j are the corresponding allowable deviations of the j-th type stored in the database.
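Since the evaluation-value formula image is likewise not reproduced, the following sketch only illustrates the kind of fusion the symbol list suggests: each metric's deviation from its reference is normalized by its allowable deviation and mapped through a decaying exponential, so that a perfect match scores 1.0. The metric names and the exp(-x) mapping are assumptions:

```python
import math

def retrieval_eval(actual, reference, tolerance):
    """Hypothetical retrieval-condition evaluation value: sum each
    metric's |actual - reference| normalized by its allowable deviation,
    then map through exp(-x) so larger deviations score lower."""
    total = sum(abs(actual[k] - reference[k]) / tolerance[k] for k in actual)
    return math.exp(-total)

# Illustrative metrics for one candidate classification type.
actual = {"resp_ms": 120, "success": 0.97, "cpu": 0.55, "bandwidth": 0.40}
reference = {"resp_ms": 100, "success": 0.99, "cpu": 0.50, "bandwidth": 0.35}
tolerance = {"resp_ms": 50, "success": 0.05, "cpu": 0.20, "bandwidth": 0.20}
score = retrieval_eval(actual, reference, tolerance)  # in (0, 1]
```

Computing this score for every candidate type and picking the largest reproduces the selection rule described above.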
Obtaining the retrieval-condition allowable-deviation data for each to-be-determined recommended classification type comprises: obtaining archive storage-condition data under each type, comprising the number of stored archives, the total archive storage space, and the data transmission rate; fusing the storage-condition data of each type to obtain an allowable-deviation-data comparison parameter for each type; computing the difference between each comparison parameter and the comparison standard value stored in the database for that type to obtain a deviation value for each type; and matching each deviation value against the deviation ranges stored in the database to obtain the retrieval-condition allowable-deviation data corresponding to the range within which that deviation value falls.
Further, the calculation formula of the allowable deviation data comparison parameter is as follows:
Where j is the number of a to-be-determined recommended classification type; YB_j is the allowable-deviation-data comparison parameter of the j-th type; Sd_j is the number of archives stored under the j-th type; Ck_j is the total archive storage space of the j-th type; CL_j is the data transmission rate of the j-th type; α_1, α_2 and α_3 are the weight factors, stored in the database, of the number of stored archives, the total archive storage space, and the data transmission rate, respectively; and e is the natural constant.
Further, retrieving the archives under the determined final recommended classification type and determining whether any archive should be removed comprises the following steps: obtaining the historical review data of each archive under the final recommended classification type, comprising the number of reviews, the average opening duration, and the repeated-content ratio; obtaining the review limit data stored in the database for the final recommended classification type, comprising the review limit count, the average-opening-duration limit, and the repeated-content limit ratio; fusing each archive's historical review data with the review limit data to obtain a historical review evaluation value for each archive; and comparing each evaluation value with the historical review evaluation threshold stored in the database for the final recommended classification type. If an archive's evaluation value is smaller than the threshold, that archive needs to be removed; otherwise, it does not.
Further, the calculation formula of the history review evaluation value is as follows:
Where f is the number of an archive; LCo_f is the historical review evaluation value of the f-th archive; Sc_f is the number of reviews of the f-th archive; Dc_f is its average opening duration; CZ_f is its repeated-content ratio; Ds is the review limit count; DD is the average-opening-duration limit; ScZ is the repeated-content limit ratio; and e is the natural constant.
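With this formula image also absent, the sketch below only demonstrates a score with the stated behavior: usage terms normalized by their limits, discounted by a duplication penalty, so rarely consulted and highly duplicated archives score low and fall under the removal threshold. The functional form and all numbers are assumptions:

```python
import math

def history_review_value(refs, avg_open_s, dup_ratio,
                         ref_limit, open_limit_s, dup_limit):
    """Hypothetical history-review score: reference count and average
    opening duration, each normalized by its limit, discounted by an
    exponential penalty on the repeated-content ratio."""
    usage = refs / ref_limit + avg_open_s / open_limit_s
    return usage * math.exp(-(dup_ratio / dup_limit))

def should_remove(value, threshold):
    """An archive scoring below the stored threshold is backed up and
    then removed."""
    return value < threshold

# A rarely consulted archive with 60% duplicated content scores low.
v = history_review_value(refs=2, avg_open_s=30, dup_ratio=0.6,
                         ref_limit=50, open_limit_s=300, dup_limit=0.2)
remove = should_remove(v, threshold=0.5)
```

Note that removal is gated on backup first, matching the backup-then-remove order required by the method.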
The intelligent archive management system based on big data comprises an upload feasibility check module, an archive classification module, and an existing-archive processing module. The upload feasibility check module verifies the identity of the person uploading an archive and, after verification passes, performs an upload feasibility check on the archive to be uploaded; a failed check forbids uploading and a successful check permits it. The archive classification module intelligently classifies the archive permitted to be uploaded and determines its final recommended classification type in the database. The existing-archive processing module retrieves the archives under the determined final recommended classification type, determines whether any archive should be removed, and, if so, removes it after backing it up.
The invention has the following beneficial effects:
According to the intelligent archive management method based on big data, identity verification ensures that only authorized personnel can upload files, and the upload feasibility check prevents non-compliant archives from being stored. Intelligent classification and dynamic retrieval further optimize the organizational structure of the archive so that it stays current and relevant, while outdated or no-longer-needed archives are automatically backed up and removed, reducing the storage burden and ensuring data integrity and accessibility. The intelligent classification and periodic cleaning mechanism helps the organization manage its storage space effectively, reduces unnecessary storage costs, and effectively improves the efficiency and security of archive management.
Of course, no single product practicing the invention necessarily achieves all of the advantages set forth above at the same time.
Drawings
FIG. 1 is a flow chart of an intelligent archive management method based on big data.
FIG. 2 is a flow chart of the intelligent archive management system based on big data.
Detailed Description
According to the embodiment of the application, in the intelligent archive management method and system based on big data, identity verification ensures that only authorized personnel can upload files; the upload feasibility check prevents storage of non-compliant archives; intelligent classification and dynamic retrieval further optimize the organizational structure of the archive so that it stays current and relevant; outdated or no-longer-needed archives are automatically backed up and removed, reducing the storage burden while ensuring data integrity and accessibility; and the intelligent classification and periodic cleaning mechanism helps the organization manage storage space effectively, reducing unnecessary storage costs and effectively improving the efficiency and security of archive management.
Referring to FIG. 1, an embodiment of the present invention provides a technical solution comprising: verifying the identity of the person uploading an archive; after verification passes, performing an upload feasibility check on the archive to be uploaded, where a failed check forbids uploading and a successful check permits it; intelligently classifying the archive permitted to be uploaded and determining its final recommended classification type in the database; retrieving the archives under that final recommended classification type and determining whether any archive should be removed; and, if so, backing up the archive to be removed and then removing it.
First, the identity of the person uploading the archive is verified, ensuring that only authorized users can upload files; this step safeguards information security and ensures compliance with relevant laws and regulations. Before an archive is uploaded, its compliance can be checked automatically, ensuring that all uploaded archives meet the organization's standards and requirements. Uploaded archives are automatically sorted into the corresponding database classifications by an algorithm, improving retrieval efficiency based on information such as the archives' content. Archives under the same classification are then retrieved and evaluated for updating or removal, which includes automatically backing up and deleting obsolete or redundant archives to keep the archive current and accurate.
Strict identity verification and archive inspection reduce the risks of information leakage and illegal access. The intelligent classification and periodic cleaning mechanism helps the organization manage storage space effectively, reducing unnecessary storage costs, and the efficiency and security of archive management are effectively improved. Identity verification ensures that only authorized personnel can upload files, and the upload feasibility check avoids storage of non-compliant archives; intelligent classification and dynamic retrieval further optimize the organizational structure so that the archive stays current and relevant, while outdated or no-longer-needed archives are automatically backed up and removed, reducing the storage burden and ensuring data integrity and accessibility. Such an archive management method is particularly useful in environments that process large amounts of data and have high data-security requirements, such as government authorities, large enterprises, and research institutions.
Verifying the uploader's identity comprises: obtaining the uploader's identity information (account number, password, and staff number) and verifying it. If the information does not match a person stored in the database, verification fails and a non-permitted-login prompt is issued; if it matches, verification passes. The archive to be uploaded is then received, keywords are extracted from it to obtain a content keyword set, and this set is compared with the forbidden-word set stored in the database: if the sets intersect, the upload feasibility check fails; if they do not intersect, the check succeeds and the upload is permitted.
In this embodiment, the identity information of the person uploading the file, comprising the account number, the password, and the staff number, is obtained and compared with the records in the database. If the information does not match, verification fails, the system prompts that the person is not permitted to log in, and further operation is prevented; if it matches, verification succeeds and the user proceeds to the next step. The archive to be uploaded is received, keywords are extracted from it, and the extracted keywords are compared with the forbidden-word set pre-stored in the database. If the keywords intersect the forbidden words, the archive may contain unsuitable or sensitive content and the check fails; if there is no intersection, the archive content meets the requirements, the upload feasibility check passes, and the archive can be uploaded.
Identity verification ensures that only authorized personnel can upload files, preventing unauthorized access and potential data leakage, while the forbidden-word check blocks the upload of sensitive or illegal content, protecting the organization from legal and reputational risk. By combining identity verification with content monitoring, information security is ensured, the quality of data management and user operation is improved, and the archive management process becomes more intelligent and better suited to the needs of modern enterprises.
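The two-stage gate described above (credential lookup, then a forbidden-word intersection test) can be sketched as follows. The function names, the in-memory "database", and the sample data are illustrative assumptions, not the patent's actual schema:

```python
# Sketch of the upload gate: (1) all three identity fields must match a
# stored record; (2) the content keyword set must be disjoint from the
# forbidden word set. Both stores are illustrative in-memory stand-ins.

REGISTERED_UPLOADERS = {              # account -> (password, job number)
    "alice": ("s3cret", "E1001"),
}
FORBIDDEN_WORDS = {"classified", "leak"}

def verify_uploader(account: str, password: str, job_no: str) -> bool:
    """Verification passes only if all three fields match a stored person."""
    record = REGISTERED_UPLOADERS.get(account)
    return record is not None and record == (password, job_no)

def check_upload_feasibility(content_keywords: set) -> bool:
    """Upload is permitted only when the keyword and forbidden sets are disjoint."""
    return content_keywords.isdisjoint(FORBIDDEN_WORDS)

print(verify_uploader("alice", "s3cret", "E1001"))             # True
print(check_upload_feasibility({"budget", "2024", "report"}))  # True
print(check_upload_feasibility({"budget", "leak"}))            # False
```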
The content keyword set may be obtained using natural language processing (NLP) techniques. The process first preprocesses the archive content, including removing noise and normalizing the text format. Existing algorithms such as TF-IDF (term frequency-inverse document frequency), text mining algorithms, or deep learning models (e.g., BERT) may then be applied to identify and extract keywords from the text. These keywords are a condensed summary of the document's main content and subject matter and effectively reflect the core information of the archive. Finally, the extracted keyword set can be used for further content auditing, classification, and index construction, improving the efficiency and accuracy of the archive management system.
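A minimal TF-IDF keyword extractor illustrating this step, using only the standard library (whitespace tokenization, smoothed IDF); it is a simplified stand-in for the NLP pipeline above, and the corpus and document are invented examples:

```python
# Minimal TF-IDF keyword extraction: score each token of `doc` by
# term frequency times smoothed inverse document frequency over `corpus`,
# and return the top_k highest-scoring tokens.
import math
from collections import Counter

def extract_keywords(doc, corpus, top_k=3):
    corpus_tokens = [d.lower().split() for d in corpus]
    target = doc.lower().split()
    n_docs = len(corpus)
    tf = Counter(target)

    def idf(term):
        df = sum(1 for toks in corpus_tokens if term in toks)
        return math.log((1 + n_docs) / (1 + df)) + 1.0   # smoothed IDF

    scores = {t: (tf[t] / len(target)) * idf(t) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

corpus = ["the archive stores personnel records",
          "the quarterly budget report",
          "the hiring memo for new staff"]
print(extract_keywords("the annual budget report", corpus))
# -> ['annual', 'budget', 'report']  (the common word "the" is scored down)
```

Production systems would add stop-word removal, stemming or word segmentation, and possibly a learned model, but the ranking principle is the same.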
Specifically, intelligently classifying the file to be uploaded that is permitted to upload, and determining its final recommended classification type in the database, comprises the following steps:

Keyword extraction is performed on the reference content of the file to obtain a reference document keyword set. First, the reference content portions of the file are obtained, including document references, cited paragraphs, and notes; the reference content is then analyzed with natural language processing (NLP) techniques such as word segmentation and part-of-speech tagging to extract keywords.

Keyword extraction is performed on the legends of the file to obtain a drawing keyword set. First, all legend text under the figures and tables in the file is identified and extracted. This text is then preprocessed, for example to remove noise and standardize formats, and analyzed with NLP techniques such as word segmentation and part-of-speech tagging. Keyword extraction algorithms (e.g., TF-IDF or other text mining techniques) are applied to identify the important words in the legends that reflect the primary content of each figure or table. The extracted keywords constitute a drawing keyword set that can be used to enhance the indexing and search functions of the file content, improving the overall efficiency and data availability of the archive management system.

The proportion of structured content to unstructured content in the file is obtained. First, the file is scanned and analyzed to identify structured content (such as tables, lists, and database export data) and unstructured content (such as free text and image descriptions). This typically requires document parsing techniques, such as OCR (optical character recognition) or dedicated parsing algorithms, to extract and tag the different data types. The system then measures the amount of data of each content type, for example by character count, word count, or data blocks, and computes the ratio of structured to unstructured content. This ratio helps the intelligent classification system better understand the composition of the archive so that it can be processed and classified more accurately.
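The structured-to-unstructured ratio can be sketched with a crude line-level heuristic; real systems would use document parsers or OCR as noted above, and the "table row" test here (pipe- or tab-delimited lines) is purely an illustrative assumption:

```python
# Estimate the structured : unstructured ratio of a plain-text document.
# Lines that look like table rows (contain '|' or a tab) count as
# structured; all other lines count as unstructured. The ratio is taken
# over character counts, one of the measures the text mentions.

def structured_ratio(text: str) -> float:
    structured_chars = 0
    unstructured_chars = 0
    for line in text.splitlines():
        if "|" in line or "\t" in line:      # crude table-row heuristic
            structured_chars += len(line)
        else:
            unstructured_chars += len(line)
    if unstructured_chars == 0:
        return float("inf")                  # purely structured document
    return structured_chars / unstructured_chars
```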
And acquiring a content keyword reference set, a reference document keyword reference set, a drawing keyword reference set and a structured content and unstructured content reference proportion of files to be uploaded, which correspond to each classification type and are stored in a database. The reference set and reference proportion acquisition process corresponding to each classification type stored in the database is to collect samples from existing files and data, wherein the samples should represent different classification types, and the sample files are subjected to detailed content analysis, including text, reference documents, notes and the like. Keywords are extracted from these content using natural language processing techniques such as TF-IDF, LDA (latent dirichlet allocation) model, etc. The sample archive is parsed to identify structured content (e.g., tables, database records) and unstructured content (e.g., paragraph text, image annotations). The ratio of the two contents in each sample is calculated. Based on the analysis, a keyword reference set is created for each classification type, including content keywords, reference document keywords, and drawing keywords. At the same time, a typical proportion of structured and unstructured content for each category is determined.
The content keyword set, reference document keyword set, drawing keyword set, and structured-to-unstructured content proportion of the file to be uploaded are fused with the content keyword reference set, reference document keyword reference set, drawing keyword reference set, and structured-to-unstructured content reference proportion stored in the database, obtaining a classification matching index for each classification type. The classification matching indexes are arranged from large to small, and based on the selection number N set in the database, the classification types corresponding to the top N indexes are marked as recommended classification types to be determined; one of these is then determined as the final recommended classification type.
Keywords are extracted from the different parts of the permitted file (e.g., body content, reference documents, legends), and the proportions of structured content (e.g., tables, lists) and unstructured content (e.g., free text) are analyzed. The preset classification reference sets in the database are obtained; each comprises content keywords, reference document keywords, a drawing keyword set, and the structured-to-unstructured content proportion for that class. The keyword sets and content proportion of the file to be uploaded are fused with the reference sets in the database to generate the classification matching indexes. The N classifications with the highest matching indexes are selected as candidates, and the final recommended classification type is determined from them.
By comprehensively analyzing text keywords and content structures, the method can more accurately classify files into the most relevant classifications in the database, and improves the efficiency and accuracy of data retrieval. The method is particularly suitable for processing a large amount of archival data and can be expanded to cope with the requirements of a large-scale information system.
The calculation formula of the classification matching index is as follows:

$$Fz_i = e^{\sigma(A_1,\,A_{2i})} + e^{\sigma(B_1,\,B_{2i})} + e^{\sigma(C_1,\,C_{2i})} - \ln\!\left(1 + \sqrt{\left|BL - cBL_i\right|}\right)$$

In the formula, i is the number of the classification type stored in the database; Fz_i is the classification matching index corresponding to the i-th classification type; A_1 is the content keyword set of the file to be uploaded that is permitted to upload; B_1 is its reference document keyword set; C_1 is its drawing keyword set; BL is its structured-to-unstructured content proportion; A_2i is the content keyword reference set corresponding to the i-th classification type stored in the database; B_2i is the reference document keyword reference set corresponding to the i-th classification type; C_2i is the drawing keyword reference set corresponding to the i-th classification type; cBL_i is the structured-to-unstructured content reference proportion corresponding to the i-th classification type; e is a natural constant; and σ(A_1,A_2i), σ(B_1,B_2i), and σ(C_1,C_2i) are the similarities between the file's content keyword set, reference document keyword set, and drawing keyword set and the corresponding reference sets of the i-th classification type.
In this embodiment, the content keyword set represents the keywords of the document body text. It directly reflects the subject and key content of the document and is the most direct information source in the classification decision; used in combination with the other parameters, it provides a core view of the document content. The reference document keyword set contains keywords extracted from the reference portion of the archive, which typically contains supporting arguments or supplemental information that aids in understanding the academic or literature context of the archive. Combined with the keywords of the subject content, the reference document keywords help reveal the depth and scope of the archive, providing a supplementary perspective that is critical to classification accuracy. The drawing keyword set relates to the explanatory text of graphics, tables, or other visual content, which is typically a visual interpretation or supplemental description of the subject content. It provides an understanding of the archive's visual information, which is particularly important when processing archives rich in graphics and tables, and helps evaluate the multidimensional information of the content more fully. The proportion of structured to unstructured content indicates the ratio of structured data (such as tables and lists) to unstructured text (such as free-text descriptions) in the archive. It reflects how the information is organized and the form the data takes; it helps gauge the information density and organizational complexity of the archive, and combining it with the keyword sets allows more accurate classification, especially when distinguishing a highly structured report from a general text document.
Using these four parameters together, the content characteristics and suitability of the file to be uploaded can be comprehensively evaluated from different dimensions. By comparing these detailed data points, the intelligent archive management system can accurately assign the archive to the most appropriate category, ensuring quick retrieval and efficient management of information. This multi-dimensional analysis improves classification precision and reduces the likelihood of misclassification, thereby improving overall data processing efficiency and the practicability of the system.
σ(A_1,A_2i), σ(B_1,B_2i), and σ(C_1,C_2i) represent the similarities between the content keyword set, reference document keyword set, and drawing keyword set of the file to be uploaded and the corresponding reference sets of each classification in the database. These similarity values can be calculated by, for example, cosine similarity, which is well suited to high-dimensional text data: it effectively measures the directional similarity of two keyword sets while ignoring differences in their sizes, and thereby measures the relevance and consistency of the keywords. The influence of keyword similarity is adjusted through an exponential function, so that high similarity contributes more strongly to the matching index. The difference between the structured-to-unstructured content proportion of the file to be uploaded and the reference proportion of each category in the database is also considered, reflecting the consistency of the content structure; the logarithm and square root are used to dampen the influence of the proportion difference on the result, making it smoother and more manageable.
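A sketch of the matching index under these assumptions: cosine similarity over keyword sets treated as binary term vectors, exponential boosting of the similarities, and a log/square-root-damped penalty for the proportion gap. The exact way the patent combines these terms is not reproduced in the text, so the combination below is an assumption consistent with the description:

```python
# Classification matching index sketch: exponentially boosted keyword
# similarities minus a damped penalty for the structured-content
# proportion mismatch (assumed combination of the described terms).
import math

def cosine_set_similarity(s1: set, s2: set) -> float:
    """Cosine similarity of two keyword sets viewed as binary term vectors."""
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

def matching_index(a1, a2i, b1, b2i, c1, c2i, bl, cbl_i) -> float:
    boost = (math.exp(cosine_set_similarity(a1, a2i))
             + math.exp(cosine_set_similarity(b1, b2i))
             + math.exp(cosine_set_similarity(c1, c2i)))
    penalty = math.log(1.0 + math.sqrt(abs(bl - cbl_i)))
    return boost - penalty
```

A class whose reference sets match the file's keyword sets exactly and whose reference proportion equals the file's proportion scores 3e (the maximum under this form); any mismatch lowers the index.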
Files permitted for upload are pre-stored under each recommended classification type to be determined, and actual call condition data for each such type are acquired, comprising average response time, call success rate, CPU average utilization, and bandwidth average utilization. Average response time is the average time from sending a request to receiving a response: the system logs the start and end times of each file or data retrieval request, sums the response times of all requests, and divides by the total number of requests. Call success rate is the ratio of successfully completed call requests to total call requests: all call requests are monitored, successes and failures are counted, and the number of successes is divided by the total number of requests. CPU average utilization is the proportion of time the CPU is occupied by call tasks over a period; it is monitored using operating system tools (such as the top command in Linux or the Windows Task Manager) or dedicated performance monitoring software. Bandwidth utilization is the utilization of network bandwidth over a period: a network monitoring tool (such as Wireshark or NetFlow) captures network traffic data, and the ratio of the amount of data transmitted to the total network bandwidth capacity in a given time is calculated.
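The four metrics above can be aggregated from a simple request log as follows. The log shape (start time, end time, success flag) and the field names are illustrative assumptions:

```python
# Aggregate the four call-condition metrics from a request log.
# Each request is (start_time, end_time, succeeded); CPU and bandwidth
# utilisation are given as lists of sampled ratios in [0, 1].

def call_condition_stats(requests, cpu_samples, bw_samples) -> dict:
    n = len(requests)
    avg_response = sum(end - start for start, end, _ in requests) / n
    success_rate = sum(1 for *_, ok in requests if ok) / n
    return {
        "avg_response_time": avg_response,
        "call_success_rate": success_rate,
        "cpu_avg_utilization": sum(cpu_samples) / len(cpu_samples),
        "bandwidth_avg_utilization": sum(bw_samples) / len(bw_samples),
    }

stats = call_condition_stats(
    [(0.0, 0.2, True), (1.0, 1.4, False)],   # two retrieval requests
    cpu_samples=[0.5, 0.7],
    bw_samples=[0.2, 0.4],
)
print(stats)
```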
Reference call condition data corresponding to each recommended classification type to be determined are acquired from the database, comprising reference average response time, reference call success rate, CPU reference average utilization, and bandwidth reference average utilization. These reference data are generally derived from historical performance data and expected performance standards. They reflect the performance of each classification type under ideal or typical conditions and serve as the benchmark for comparing and evaluating actual performance. Obtaining them includes analyzing the historical call records of different classification types and computing the average response time, success rate, and CPU and bandwidth usage for each class. Targeted performance tests are designed and executed, simulating various operations, to ensure the reliability and accuracy of each index. Reference values for each performance index are then set according to the organization's service level agreement (SLA) or performance objectives, based on the optimal performance standard, and stored in the database.
Call condition allowable deviation data corresponding to each recommended classification type to be determined are acquired, comprising the allowable deviations of average response time, call success rate, CPU average utilization, and bandwidth average utilization. The actual call condition data, the reference call condition data stored in the database, and the allowable deviation data for each recommended classification type to be determined are fused to obtain a call condition evaluation value for each type. The minimum call condition evaluation value is identified, and the recommended classification type to be determined that corresponds to it is determined as the final recommended classification type.
The files permitted for upload are temporarily stored under each candidate recommended category; this step simulates the storage state of the files under each category. The call performance of the pre-stored files under each category is then tested, and the key performance indicators are collected: average response time, call success rate, CPU utilization, and bandwidth utilization. Preset call performance criteria (reference data) are obtained, including the predetermined response time, success rate, and CPU and bandwidth usage standards for each category. An allowable deviation is defined for each performance metric, which aids in evaluating the difference between the actual performance data and the predetermined criteria. The actual call data, reference data, and allowable deviation data are fused to calculate a comprehensive call condition evaluation value, which reflects the performance suitability of the archive under each category. The category with the lowest (best-performing) call condition evaluation value is selected from all candidates as the final recommended category.
Actually testing and evaluating retrieval performance under the different classifications ensures that the selected classification performs best, which is crucial for efficient data access; evaluating CPU and bandwidth utilization helps optimize resource allocation and management and reduces operating cost. Improving call success rate and response time directly improves the user experience, especially in environments where large amounts of data must be frequently accessed and processed. Selecting the classification with a data-driven method reduces the interference of subjective judgment and ensures the objectivity and scientific rigor of the decision.
The calculation formula of the call condition evaluation value is as follows:

$$Pd_j = \tanh\!\left(1 + \frac{\left|Ts - Tc_j\right|}{\Delta Tc_j} + \frac{\left|Ds - Dc_j\right|}{\Delta Dc_j} + \frac{\left|CPs - CPc_j\right|}{\Delta CPc_j} + \frac{\left|Ks - Kc_j\right|}{\Delta Kc_j}\right)$$

In the formula, j is the number of the recommended classification type to be determined; Pd_j is the call condition evaluation value corresponding to the j-th type; Ts is the average response time; Ds is the call success rate; CPs is the CPU average utilization; Ks is the bandwidth average utilization; Tc_j is the reference average response time corresponding to the j-th recommended classification type to be determined stored in the database; Dc_j is the corresponding reference call success rate; CPc_j is the corresponding CPU reference average utilization; Kc_j is the corresponding bandwidth reference average utilization; ΔTc_j is the corresponding allowable deviation of average response time; ΔDc_j is the corresponding allowable deviation of call success rate; ΔCPc_j is the corresponding allowable deviation of CPU average utilization; and ΔKc_j is the corresponding allowable deviation of bandwidth average utilization.
In this embodiment, for each performance index the difference between the actual value and the reference value is calculated and then divided by the allowable deviation, giving a normalized deviation value. All normalized deviation values are summed and 1 is added; this "plus 1" prevents the result from being zero after the hyperbolic tangent function is applied when every performance index exactly meets its reference value (i.e., all deviations are zero).
Driving the parameter values toward the set criteria (reference values), rather than simply maximizing or minimizing them, ensures that the overall performance of the system matches the expected business needs and operating environment. Response time is a direct indicator of user experience, reflecting the speed with which the system processes requests; in a data retrieval system, fast response time means efficient data retrieval and processing. Response time is affected by CPU and bandwidth utilization: high resource-usage efficiency yields faster responses, but pursuing excessively fast response times may consume unnecessary or uneconomical amounts of resources (such as CPU and bandwidth), raising operating costs needlessly. Success rate measures the proportion of requests that are correctly and completely answered; it directly relates to reliability and error handling, is closely tied to system stability and resource management (such as CPU and bandwidth), and insufficient or poorly managed resources can cause calls to fail. CPU utilization reflects the occupation of processor resources: efficient CPU usage supports more concurrent requests and faster processing, while excessive CPU utilization leads to processing delays and service degradation. Bandwidth utilization reflects the use of network resources and is particularly important for a data calling system that relies on network transmission; it affects the speed and stability of data transmission and thus directly influences the response time and success rate of data calls, and insufficient bandwidth can cause network congestion and call failures.
Average response time (Ts), call success rate (Ds), CPU average utilization (CPs), and bandwidth average utilization (Ks) are selected as the key parameters for performance evaluation because these metrics collectively reflect the efficiency, reliability, and resource management capability of a data processing system.
By dividing the performance bias by the allowable bias, the importance and impact of different performance indicators can be normalized, ensuring that no single indicator has an excessive impact on the assessment results. Allowing different reference values and bias limits to be set for different classification types provides a high degree of customization, adapting to different business requirements and performance goals.
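The normalized-deviation scheme described above (each |actual - reference| divided by its allowable deviation, summed, plus 1, then passed through the hyperbolic tangent) can be sketched as follows; the metric values are invented examples:

```python
# Call condition evaluation: normalise each metric's deviation by its
# tolerance, sum, add 1, squash with tanh. Smaller values mean performance
# closer to the reference across all four metrics, so the minimum wins.
import math

def call_condition_evaluation(actual, reference, tolerance) -> float:
    total = sum(abs(a - r) / t for a, r, t in zip(actual, reference, tolerance))
    return math.tanh(1.0 + total)

# Metric order: avg response time (s), success rate, CPU util, bandwidth util.
ref = [0.30, 0.99, 0.60, 0.50]
tol = [0.10, 0.01, 0.10, 0.10]
print(call_condition_evaluation([0.30, 0.99, 0.60, 0.50], ref, tol))  # tanh(1) ~ 0.7616
print(call_condition_evaluation([0.45, 0.97, 0.70, 0.55], ref, tol))  # larger (worse)
```

Note that tanh bounds the evaluation value in (0, 1), so a single wildly out-of-tolerance metric cannot dominate the comparison arbitrarily.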
Archive storage condition data under each recommended classification type to be determined are acquired, comprising the number of archives stored, the total archive storage space, and the data transmission rate. These data are fused to obtain an allowable-deviation-data comparison parameter for each recommended classification type to be determined. The difference between each comparison parameter and the corresponding comparison standard value stored in the database is calculated to obtain a deviation value for each type. Each deviation value is then compared with the deviation values stored in the database for that type; the comparison succeeds when a stored deviation value matches the calculated one (for example, if the deviation values stored in the database are 1, 2, and 3 and the calculated deviation value is 2, the stored value 2 is matched), and the call condition allowable deviation data corresponding to the successfully matched stored deviation value are obtained.
The call condition allowable deviation data stored in the database are set based on historical performance data, expected business requirements, and system capabilities. First, historical operational data are analyzed to see how different classification types perform in actual operation, including response time, success rate, and CPU and bandwidth utilization. Next, an ideal or acceptable range is determined for each performance indicator, taking into account business requirements and system design goals. Before being adopted as allowable deviations, these ranges are typically validated by expert review or with simulated data to ensure they are both realistic and aligned with business objectives. Finally, the allowable deviation values are standardized and stored in the database for subsequent performance monitoring and management, ensuring the system operates stably within the preset parameter ranges. In this way, the call condition allowable deviation data help manage system performance, prevent unexpected performance fluctuations, and provide a basis for tuning and optimization.
First, archive storage data for each recommended classification type to be determined are collected, including the number of archives stored, the total storage space, and the data transmission rate; these data provide preliminary information about archive storage efficiency and capacity. The collected data are fused to generate the allowable-deviation-data comparison parameter representing the storage efficiency and performance of each classification type. The difference between the obtained comparison parameter and the preset standard value in the database is calculated; this step finds the deviation between the actual storage condition and the expected standard. The calculated deviation value is then compared with the deviation values preset in the database.
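These two steps, fusing the storage indicators into a comparison parameter and then looking up the tolerance data keyed by the matching stored deviation value, can be sketched as follows. The weighted-sum fusion and the example weights are assumptions (the patent states only that the three indicators are fused with database-stored weight factors), and "closest stored value" generalizes the exact-match example in the text:

```python
# Sketch: (1) fuse archive count, total space, and transfer rate into a
# comparison parameter with assumed weights; (2) match a computed deviation
# against stored deviation keys (e.g. keys 1, 2, 3; computed 2 matches 2)
# and return the associated tolerance data.

def comparison_parameter(count, space, rate, w=(0.5, 0.3, 0.2)) -> float:
    """Weighted fusion of the three storage indicators (assumed form)."""
    return w[0] * count + w[1] * space + w[2] * rate

def lookup_tolerance(deviation, stored: dict):
    """Return the tolerance data whose stored deviation key is closest
    to the computed deviation value."""
    key = min(stored, key=lambda k: abs(k - deviation))
    return stored[key]
```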
The calculation formula of the allowable deviation data comparison parameter is as follows:

$$YB_j = \alpha_1 \ln\!\left(e + Sd_j\right) + \alpha_2 \ln\!\left(e + Ck_j\right) + \alpha_3 \ln\!\left(e + CL_j\right)$$

In the formula, j is the number of the recommended classification type to be determined; YB_j is the allowable-deviation-data comparison parameter corresponding to the j-th type; Sd_j is the number of archives stored under the j-th type; Ck_j is the total archive storage space under the j-th type; CL_j is the data transmission rate corresponding to the j-th type; α_1, α_2, and α_3 are the weight factors stored in the database for the number of archives stored, the total archive storage space, and the data transmission rate, respectively; and e is a natural constant.
In this embodiment, deriving the weight factors from historical data generally involves statistical analysis and machine learning techniques. First, past performance data are collected, including data relating the number of archives stored, the storage space, and the data transmission rate to system operating efficiency and stability. Multiple regression analysis is then used to measure the actual extent to which these indices affect system performance, identifying which indices matter most and setting the weights accordingly. These weight factors can then be used to translate historical data into input parameters of a predictive model or decision-support tool, optimizing future performance and resource allocation policies. By integrating the three key performance indexes, the performance and resource requirements of a classification type can be comprehensively evaluated, ensuring that the decision is based on comprehensive data analysis.
Specifically, searching the files under the final recommended classification type corresponding to the determined file to be uploaded in the database, and determining whether files to be removed exist under that type, comprises the following steps: historical review data for each file under the final recommended classification type are acquired, comprising the number of reviews, the average open duration, and the repeated content ratio. Each time a user or the system accesses an archive, the system automatically records a review; this is implemented by middleware or a database trigger, and the corresponding review counter is incremented each time the archive is opened. The number of reviews is typically stored in the database in association with the metadata of the corresponding archive. The timestamps at which each archive is opened and closed are tracked: one timestamp is recorded when the file is opened and another when it is closed, and the difference between the two is the duration of that review. The average open duration is obtained by dividing the accumulated duration of all reviews of a single file by the number of reviews; this data is also stored in the database in association with the corresponding archive. The repeated content ratio may be evaluated with natural language processing (NLP) tools or dedicated data comparison software that measures the repetition rate of the content.
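The counter-and-timestamp bookkeeping described above can be sketched as a small tracker object; the class and attribute names are illustrative, and a production system would persist these fields alongside the archive's metadata rather than in memory:

```python
# Per-archive review bookkeeping: a counter bumped on every open, and
# open/close timestamps whose differences accumulate into the total
# review duration, from which the average open duration is derived.
import time

class ReviewTracker:
    def __init__(self):
        self.count = 0            # number of reviews
        self.total_duration = 0.0
        self._opened_at = None

    def open(self, ts=None):
        self.count += 1
        self._opened_at = time.time() if ts is None else ts

    def close(self, ts=None):
        closed_at = time.time() if ts is None else ts
        self.total_duration += closed_at - self._opened_at
        self._opened_at = None

    @property
    def average_open_duration(self) -> float:
        return self.total_duration / self.count if self.count else 0.0
```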
Review definition data for files under the final recommended classification type are acquired from the database, comprising the review defining number, the average open duration defining time, and the repeated content defining ratio. First, a large amount of historical review data is collected; these data reflect how different files perform in actual use. Statistical analysis of the collected data determines the medians of the number of reviews and the average open duration, which are defined as the review defining number and the average open duration defining time, and the repeated content ratio of the files at those medians is defined as the repeated content defining ratio.
The historical review data of each file under the final recommended classification type and the review definition data corresponding to that type stored in the database are fused to obtain a historical review evaluation value for each file. Each historical review evaluation value is compared with the historical review evaluation threshold corresponding to the final recommended classification type stored in the database. If a file's historical review evaluation value is smaller than the threshold, the file needs to be removed; if it is not smaller than the threshold, the file does not need to be removed.
The historical review evaluation threshold corresponding to the final recommended classification type stored in the database is obtained by historical data analysis, which includes collecting and evaluating a large amount of data on historical review behavior for the relevant classification. The analyst calculates statistics of the review count, the average open duration and the repeated-content ratio, such as the mean, median and standard deviation, and takes business needs and actual usage into account, thereby setting a review evaluation threshold appropriate for the classification.
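As a concrete illustration of this statistical threshold-setting, the sketch below derives a threshold from historical evaluation values using the median and standard deviation. The median-minus-margin policy is an assumption: the text only names the statistics involved and leaves the exact rule to the analyst.

```python
import statistics

def review_threshold(evaluation_values, margin=0.5):
    """Derive a removal threshold from historical evaluation values.

    Assumed policy: the threshold sits `margin` standard deviations
    below the median of the historical values.
    """
    # Median of the historical evaluation values as the central tendency.
    med = statistics.median(evaluation_values)
    # Population standard deviation as the spread measure.
    sd = statistics.pstdev(evaluation_values)
    return med - margin * sd
```

In practice the margin would be tuned to the business needs mentioned above, so that only clearly under-used archives fall below the threshold.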
Historical review data for each archive, including the review count, average open duration and repeated-content ratio, is collected; these data provide an intuitive picture of how frequently an archive is used and how interested users are in its content. A reference standard (the definition data), such as a review count threshold, a minimum requirement for average open duration and an acceptable repeated-content ratio, is set for the archives under each classification. The collected historical review data and the review definition data are fused to generate a historical review evaluation value for each archive, and the calculated value is compared with the preset threshold. If an archive's evaluation value is below the threshold, its actual review record has not reached the expected frequency of use or attention; the archive is therefore considered no longer needed or outdated, and removal is recommended. Removing archives with a low frequency of use or a high repetition rate frees valuable storage space for more active or important content, and reducing redundant content in the database improves retrieval efficiency, letting users find the information they need faster.
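The collect-fuse-compare-remove flow just described can be sketched as follows; the function names and tuple layout are illustrative, not taken from the patent.

```python
def files_to_remove(archives, evaluate, threshold):
    """Flag archives whose historical review evaluation value falls
    below the threshold, for backup and subsequent removal.

    archives: iterable of (name, review_count, avg_open_duration,
    repeated_content_ratio) tuples.
    evaluate: the fusion function producing the historical review
    evaluation value from those three quantities.
    """
    flagged = []
    for name, sc, dc, cz in archives:
        value = evaluate(sc, dc, cz)
        # Below expected frequency of use or attention: recommend removal.
        if value < threshold:
            flagged.append(name)
    return flagged
```

The flagged archives would then be backed up before deletion, as the method requires.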
The calculation formula of the historical review evaluation value is as follows:
where f is the archive index, LCo_f is the historical review evaluation value of the f-th archive, Sc_f is the review count of the f-th archive, Dc_f is the average open duration of the f-th archive, CZ_f is the repeated-content ratio of the f-th archive, Ds is the review definition count, DD is the average-open-duration definition time, SCZ is the repeated-content definition ratio, and e is a natural constant.
In this embodiment, the review count visually reflects the popularity or frequency of use of an archive. A high review count typically indicates that the archive's content is of high value or in high demand, making it a direct measure of the archive's importance when assessing its appeal to the target user group. The average open duration measures the average time a user spends after opening the archive, reflecting the user's engagement with and interest in its content; longer open times may mean the content is more engaging or more informative. Paired with the review count, it gives a fuller picture of users' actual investment in, and valuation of, the archive's content. A high content repetition rate may indicate a large amount of repeated information in the archive, which can affect its originality and value; this indicator is used to evaluate the originality and novelty of the content, which is particularly important for maintaining high-quality, highly original archives. The review count and the average open duration are typically positively correlated, because users who find the content useful or interesting are likely to spend more time reading or processing it. In theory, if an archive's content is highly repetitive it will attract fewer repeat visits, unless the repeated information has significant reference value; users may also close a highly repetitive archive quickly, once they discover that it offers no new knowledge or data.
The first part of the formula is an exponential function that evaluates the ratios of the archive's review count and average open duration relative to the set standards. The exponential processing emphasizes the effect of the review count or duration exceeding or falling short of the standard, so that a significant deviation in either index noticeably shifts the evaluation result. The second part is processed by the hyperbolic tangent function; it considers the difference between the archive's repeated-content ratio and the set threshold, reflecting the influence of content repetition on the archive's value: the higher the repetition rate, the lower the value.
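The published formula itself is a figure not reproduced in this text. The sketch below is one plausible reading consistent with the description above (an exponential term over the usage ratios, minus a hyperbolic-tangent term over the repeated-content deviation) and should not be taken as the patented formula.

```python
import math

def historical_review_value(sc, dc, cz, ds, dd, scz):
    """Assumed form of the historical review evaluation value.

    sc, dc, cz: review count, average open duration, repeated-content
    ratio of the archive; ds, dd, scz: the corresponding definition
    values for the classification type.
    """
    # Exponential part: usage ratios against their definition values,
    # so a large shortfall or excess in either ratio moves the value
    # sharply (as the description emphasizes).
    usage = math.exp((sc / ds + dc / dd) / 2 - 1)
    # tanh part: penalty that grows (and saturates) as the archive's
    # repeated-content ratio exceeds the defined ratio.
    repetition = math.tanh(cz - scz)
    return usage - repetition
```

Under this reading, a heavily used archive with little repeated content scores well above a rarely opened, highly repetitive one, which matches the removal logic described earlier.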
An intelligent archive management system based on big data, as shown in fig. 2, comprises an uploading feasibility checking module, an archive classification module and an existing archive processing module. The uploading feasibility checking module is used to verify the identity of archive uploading personnel and, after the verification is passed, to perform an uploading feasibility check on the archive to be uploaded; if the check fails, uploading is not allowed, and if it succeeds, uploading is allowed. The archive classification module is used to intelligently classify the archives to be uploaded that are allowed to be uploaded and to determine the preset recommended classification type in the database corresponding to each such archive. The existing archive processing module is used to perform archive retrieval under the corresponding preset recommended classification type in the database for the archives determined to be allowed to be uploaded, and to determine whether there are archives to be removed; if so, the archives to be removed are backed up and then removed.
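A skeletal sketch of the uploading feasibility checking module's two gates (identity verification, then forbidden-word screening, as the system and claim 2 describe); the class and method names are illustrative, not from the patent.

```python
class UploadFeasibilityChecker:
    def __init__(self, allowed_personnel, forbidden_words):
        # allowed_personnel: identity records stored in the database.
        self.allowed_personnel = set(allowed_personnel)
        self.forbidden_words = set(forbidden_words)

    def check(self, personnel_id, content_keywords):
        """Return True when uploading is allowed."""
        # Gate 1: identity must correspond to personnel stored in
        # the database; otherwise verification fails.
        if personnel_id not in self.allowed_personnel:
            return False
        # Gate 2: the archive's content keyword set must have no
        # intersection with the forbidden word set.
        return not (set(content_keywords) & self.forbidden_words)
```

A real deployment would verify account, password and personnel number rather than a single identifier; the single `personnel_id` here is a simplification.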
An electronic device includes a processor and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the big data based intelligent archive management method as described above.
A computer-readable storage medium stores a program which, when executed by a processor, implements the intelligent archive management method based on big data described above.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of systems, apparatuses (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (3)
1. An intelligent archive management method based on big data, characterized by comprising the following steps:
verifying the identity of archive uploading personnel and, after the verification is passed, performing an uploading feasibility check on the archive to be uploaded; if the uploading feasibility check fails, uploading is not allowed, and if it succeeds, uploading is allowed;
intelligently classifying the archive to be uploaded that is allowed to be uploaded, and determining the final recommended classification type in the database corresponding to the archive;
performing archive retrieval under the corresponding final recommended classification type in the database for the archive determined to be allowed to be uploaded, and determining whether there are archives to be removed under the final recommended classification type; if so, backing up the archives to be removed and then removing them;
the intelligently classifying the archive to be uploaded that is allowed to be uploaded and determining the final recommended classification type in the database corresponding to the archive comprises the following steps:
extracting keywords from the reference documents of the archive to be uploaded that is allowed to be uploaded to obtain a reference document keyword set;
extracting keywords from the drawings of the archive to be uploaded that is allowed to be uploaded to obtain a drawing keyword set;
acquiring the structured-content to unstructured-content proportion of the archive to be uploaded that is allowed to be uploaded;
acquiring the content keyword reference set, reference document keyword reference set, drawing keyword reference set and structured-content to unstructured-content reference proportion of archives corresponding to each classification type stored in the database;
fusing the content keyword set, reference document keyword set, drawing keyword set and structured-content to unstructured-content proportion of the archive to be uploaded with the content keyword reference set, reference document keyword reference set, drawing keyword reference set and structured-content to unstructured-content reference proportion corresponding to each classification type stored in the database, to obtain a classification matching index corresponding to each classification type;
the calculation formula of the classification matching index is as follows:
wherein i is the index of the classification type stored in the database, Fz_i is the classification matching index corresponding to the i-th classification type, A1 is the content keyword set of the archive to be uploaded that is allowed to be uploaded, B1 is the reference document keyword set of that archive, C1 is the drawing keyword set of that archive, BL is the structured-content to unstructured-content proportion of that archive, A2i is the content keyword reference set corresponding to the i-th classification type stored in the database, B2i is the reference document keyword reference set corresponding to the i-th classification type stored in the database, C2i is the drawing keyword reference set corresponding to the i-th classification type stored in the database, BLc_i is the structured-content to unstructured-content reference proportion corresponding to the i-th classification type stored in the database, e is a natural constant, and σ(X, Y) denotes the similarity between a keyword set X of the archive and the corresponding keyword reference set Y, for example σ(A1, A2i) is the similarity between the content keyword set and the content keyword reference set corresponding to the i-th classification type stored in the database;
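Neither the similarity σ nor the fusion is fixed to a particular function in this text; the sketch below uses Jaccard similarity as an assumed stand-in for σ and an assumed combination of the three set similarities with the proportion term.

```python
import math

def jaccard(a, b):
    # Assumed stand-in for sigma(., .): overlap of two keyword sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classification_match(a1, b1, c1, bl, a2i, b2i, c2i, blc_i):
    """Illustrative classification matching index.

    a1/b1/c1: content, reference-document and drawing keyword sets of
    the archive; bl: its structured/unstructured proportion.
    a2i/b2i/c2i/blc_i: the reference counterparts for classification
    type i stored in the database.
    """
    set_term = jaccard(a1, a2i) + jaccard(b1, b2i) + jaccard(c1, c2i)
    # Closeness of the structured/unstructured proportions, shaped
    # with e as in the claim's list of constants (assumed form).
    ratio_term = math.exp(-abs(bl - blc_i))
    return set_term + ratio_term
```

With this combination, an archive whose keyword sets and proportion coincide with a type's references scores the maximum for that type.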
the classification matching indexes are arranged in order from small to large; based on the selection number N set in the database, the classification types corresponding to the first N classification matching indexes are obtained and recorded as recommended classification types to be determined;
one classification type is determined from the recommended classification types to be determined as the final recommended classification type;
the determining one classification type from the recommended classification types to be determined as the final recommended classification type comprises the following steps:
pre-storing the archive to be uploaded that is allowed to be uploaded into each recommended classification type to be determined;
retrieving the pre-stored archive to be uploaded from each recommended classification type to be determined, and acquiring actual call condition data corresponding to each recommended classification type to be determined, wherein the actual call condition data comprises an average response time, a call success rate, a CPU average utilization rate and a bandwidth average utilization rate;
acquiring reference call condition data corresponding to each recommended classification type to be determined stored in the database, wherein the reference call condition data comprises a reference average response time, a reference call success rate, a reference CPU average utilization rate and a reference bandwidth average utilization rate;
acquiring call condition allowable deviation data corresponding to each recommended classification type to be determined, wherein the call condition allowable deviation data comprises an average response time allowable deviation, a call success rate allowable deviation, a CPU average utilization rate allowable deviation and a bandwidth average utilization rate allowable deviation;
fusing the actual call condition data corresponding to each recommended classification type to be determined, the reference call condition data corresponding to each recommended classification type to be determined stored in the database, and the call condition allowable deviation data corresponding to each recommended classification type to be determined, to obtain a call condition evaluation value corresponding to each recommended classification type to be determined;
acquiring the minimum call condition evaluation value, and determining the recommended classification type to be determined corresponding to the minimum call condition evaluation value as the final recommended classification type;
the calculation formula of the call condition evaluation value is as follows:
wherein j is the index of the recommended classification type to be determined, TS_j is the actual average response time of the j-th recommended classification type to be determined, DS_j is the actual call success rate of the j-th recommended classification type to be determined, CPS_j is the actual CPU average utilization rate of the j-th recommended classification type to be determined, KS_j is the actual bandwidth average utilization rate of the j-th recommended classification type to be determined, Tc_j is the reference average response time of the j-th recommended classification type to be determined stored in the database, Dc_j is the reference call success rate of the j-th recommended classification type to be determined stored in the database, CPc_j is the reference CPU average utilization rate of the j-th recommended classification type to be determined stored in the database, Kc_j is the reference bandwidth average utilization rate of the j-th recommended classification type to be determined stored in the database, ΔTc_j is the average response time allowable deviation of the j-th recommended classification type to be determined stored in the database, ΔDc_j is the call success rate allowable deviation, ΔCPc_j is the CPU average utilization rate allowable deviation, and ΔKc_j is the bandwidth average utilization rate allowable deviation of the j-th recommended classification type to be determined stored in the database;
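One plausible fusion consistent with the step above (the patented formula is a figure and not reproduced here): deviations of the actual call condition data from the reference values, each normalized by its allowable deviation, summed so that the type with the minimum evaluation value is the best fit.

```python
METRICS = ("response_time", "success_rate", "cpu_usage", "bandwidth_usage")

def call_evaluation(actual, reference, allowed_dev):
    """Illustrative call condition evaluation value.

    actual, reference, allowed_dev: dicts keyed by METRICS. A smaller
    total means the pre-stored archive behaved closer to the type's
    reference profile, so the minimum value is selected as final.
    """
    return sum(
        abs(actual[m] - reference[m]) / allowed_dev[m] for m in METRICS
    )
```

Normalizing by the allowable deviation puts the four metrics, which have very different units, on a common scale before summing.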
the acquiring call condition allowable deviation data corresponding to each recommended classification type to be determined comprises the following steps:
acquiring archive storage condition data of each recommended classification type to be determined, wherein the archive storage condition data comprises the number of archives stored, the total archive storage space and the data transmission rate;
fusing the archive storage condition data under each recommended classification type to be determined to obtain an allowable deviation data comparison parameter corresponding to each recommended classification type to be determined;
calculating the difference between the allowable deviation data comparison parameter corresponding to each recommended classification type to be determined and the comparison standard value corresponding to that type stored in the database, to obtain a deviation value corresponding to each recommended classification type to be determined;
comparing the deviation value corresponding to each recommended classification type to be determined with the deviation values stored in the database for that type, and obtaining the call condition allowable deviation data corresponding to the stored deviation value that is successfully matched;
the calculation formula of the allowable deviation data comparison parameter is as follows:
wherein j is the index of the recommended classification type to be determined, YB_j is the allowable deviation data comparison parameter corresponding to the j-th recommended classification type to be determined, Sd_j is the number of archives stored under the j-th recommended classification type to be determined, Ck_j is the total archive storage space corresponding to the j-th recommended classification type to be determined, CL_j is the data transmission rate corresponding to the j-th recommended classification type to be determined, α1 is the weight factor of the number of archives stored in the database, α2 is the weight factor of the total archive storage space stored in the database, α3 is the weight factor of the data transmission rate stored in the database, and e is a natural constant;
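The comparison parameter is described as a weighted combination of the three storage quantities. A minimal sketch of that combination is below; the exact patented expression is a figure (and also involves the constant e), so this linear weighting and its default weights are purely illustrative.

```python
def comparison_parameter(sd, ck, cl, a1=0.4, a2=0.3, a3=0.3):
    """Illustrative allowable deviation data comparison parameter.

    sd: number of archives stored; ck: total archive storage space;
    cl: data transmission rate; a1..a3: weight factors that the claim
    says are stored in the database (defaults here are assumptions).
    """
    return a1 * sd + a2 * ck + a3 * cl
```

The resulting parameter is then differenced against the stored comparison standard value to yield the deviation value used for matching.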
the performing archive retrieval under the corresponding final recommended classification type in the database for the archive determined to be allowed to be uploaded, and determining whether there are archives to be removed under the final recommended classification type, comprises the following steps:
acquiring historical review data of each archive under the final recommended classification type, wherein the historical review data comprises a review count, an average open duration and a repeated-content ratio;
acquiring review definition data of archives corresponding to the final recommended classification type stored in the database, wherein the review definition data comprises a review definition count, an average-open-duration definition time and a repeated-content definition ratio;
fusing the historical review data of each archive under the final recommended classification type with the review definition data corresponding to the final recommended classification type stored in the database to obtain a historical review evaluation value of each archive;
comparing the historical review evaluation value of each archive with the historical review evaluation threshold corresponding to the final recommended classification type stored in the database, and judging whether the historical review evaluation value of each archive is smaller than that threshold:
if the historical review evaluation value is smaller than the historical review evaluation threshold corresponding to the final recommended classification type stored in the database, the archive corresponding to the historical review evaluation value needs to be removed;
if the historical review evaluation value is not smaller than the historical review evaluation threshold corresponding to the final recommended classification type stored in the database, the archive corresponding to the historical review evaluation value does not need to be removed;
the calculation formula of the historical review evaluation value is as follows:
where f is the archive index, LCo_f is the historical review evaluation value of the f-th archive, Sc_f is the review count of the f-th archive, Dc_f is the average open duration of the f-th archive, CZ_f is the repeated-content ratio of the f-th archive, Ds is the review definition count, DD is the average-open-duration definition time, SCZ is the repeated-content definition ratio, and e is a natural constant.
2. The intelligent archive management method based on big data according to claim 1, wherein the verifying the identity of archive uploading personnel and, after the verification is passed, performing an uploading feasibility check on the archive to be uploaded comprises the following steps:
acquiring identity information of the archive uploading personnel and verifying it, wherein the identity information comprises an account number, a password and an archive uploading personnel number:
if the identity information of the archive uploading personnel does not correspond to personnel stored in the database, the verification is not passed, and a non-permitted-personnel login prompt is issued;
if the identity information of the archive uploading personnel corresponds to personnel stored in the database, the verification is passed;
receiving the archive to be uploaded and extracting keywords from it to obtain a content keyword set;
comparing the obtained content keyword set with the forbidden word set stored in the database:
if there is an intersection between the content keyword set and the forbidden word set, the uploading feasibility check fails and uploading is not allowed;
if there is no intersection between the content keyword set and the forbidden word set, the uploading feasibility check succeeds and uploading is allowed.
3. An intelligent archive management system based on big data, used for the intelligent archive management method based on big data according to any one of claims 1-2, characterized by comprising an uploading feasibility checking module, an archive classification module and an existing archive processing module, wherein:
the uploading feasibility checking module is used to verify the identity of archive uploading personnel and, after the verification is passed, to perform an uploading feasibility check on the archive to be uploaded; if the check fails, uploading is not allowed, and if it succeeds, uploading is allowed;
the archive classification module is used to intelligently classify the archives to be uploaded that are allowed to be uploaded, and to determine the preset recommended classification type in the database corresponding to each such archive;
the existing archive processing module is used to perform archive retrieval under the preset recommended classification type in the database corresponding to the archive determined to be allowed to be uploaded, to determine whether there are archives to be removed, and, if so, to back up the archives to be removed before removing them.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411113876.7A CN118862036B (en) | 2024-08-13 | 2024-08-13 | An intelligent archive management system and method based on big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118862036A CN118862036A (en) | 2024-10-29 |
CN118862036B true CN118862036B (en) | 2025-01-14 |
Family
ID=93166499
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411113876.7A Active CN118862036B (en) | 2024-08-13 | 2024-08-13 | An intelligent archive management system and method based on big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118862036B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN120046713B (en) * | 2025-04-25 | 2025-07-18 | 浙江一山智慧医疗研究有限公司 | Health knowledge mining method, system and terminal based on dynamic feedback mechanism |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460675A (en) * | 2018-10-26 | 2019-03-12 | 温州博盈科技有限公司 | A kind of enterprise information security management method |
CN114834813A (en) * | 2022-07-02 | 2022-08-02 | 济南亚正企业管理咨询有限公司 | Human resource management file query method and file storage device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222160B (en) * | 2019-05-06 | 2023-09-15 | 平安科技(深圳)有限公司 | Intelligent semantic document recommendation method and device and computer readable storage medium |
CN116775972A (en) * | 2023-06-30 | 2023-09-19 | 深圳市世迦科技有限公司 | Remote resource arrangement service method and system based on information technology |
CN117556112B (en) * | 2024-01-11 | 2024-04-16 | 中国标准化研究院 | Electronic archive information intelligent management system |
Also Published As
Publication number | Publication date |
---|---|
CN118862036A (en) | 2024-10-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||