CN116955745A

CN116955745A - Archive data processing method and device

Info

Publication number: CN116955745A
Application number: CN202310967374.XA
Authority: CN
Inventors: 洪泽慧; 李博; 尚琪林
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2023-08-02
Filing date: 2023-08-02
Publication date: 2023-10-27

Abstract

The specification relates to the technical field of artificial intelligence and discloses an archive data processing method and device. The method comprises the following steps: acquiring target archive data; dividing the target archive data into structured data and unstructured data; preprocessing the structured data and the unstructured data separately to obtain preprocessed structured data and preprocessed unstructured data; merging the preprocessed structured data and unstructured data to obtain preprocessed target archive data; and inputting the preprocessed target archive data into a target classification model to obtain association relationship data between the target archive data and historical archive data. With this scheme, the association relationships between archive data can be obtained, so that implicit associations between archive data can be established.

Description

Archive data processing method and device

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing archive data.

Background

Financial management is the management, under a given overall objective, of the purchase of assets, the financing of capital, cash flows in operations, and profit allocation. It is an integral part of enterprise management and an important task in organizing an enterprise's financial activities and handling its financial relationships. Financial archive management is, in turn, an important component of financial management.

At present, financial archives are mainly managed manually, archives are mainly collected on request, and archival resources are still at a stage where they are mainly just stored. That is, prior-art archive management schemes only store archives and fail to establish associations between them, so implicitly associated data is difficult to find when searching, and usage requirements cannot be met.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the specification provide an archive data processing method and device to solve the problem that prior-art archive management schemes have difficulty searching for implicitly associated data.

The embodiments of the specification provide an archive data processing method, which comprises the following steps:

acquiring target archive data; dividing the target archive data into structured data and unstructured data;

respectively preprocessing the structured data and the unstructured data to obtain preprocessed structured data and unstructured data;

combining the preprocessed structured data and unstructured data to obtain preprocessed target archive data;

and inputting the preprocessed target archive data into a target classification model to obtain the association relation data between the target archive data and the historical archive data.

In one embodiment, preprocessing the structured data and the unstructured data respectively to obtain preprocessed structured data and unstructured data includes:

performing data cleaning on the structured data to obtain preprocessed structured data;

and performing word segmentation on the unstructured data, and extracting feature data from the segmented unstructured data to obtain the preprocessed unstructured data.

In one embodiment, after inputting the preprocessed target archive data into the target classification model to obtain the association relationship data between the target archive data and the history archive data, the method further includes:

determining information of the historical archive data associated with the target archive data based on the association relationship data between the target archive data and the historical archive data;

and storing the information of the historical archive data associated with the target archive data in a database in association with the target archive data.

In one embodiment, the method further comprises:

receiving a search request sent by a client; the search request comprises search words;

inquiring archive data corresponding to the search word and archive data associated with the archive data from a database based on the search word;

and feeding back the archive data corresponding to the search word and the archive data associated with the archive data to the client.

In one embodiment, the target classification model is constructed by:

acquiring a historical archive data set; preprocessing each piece of historical archive data in the historical archive data set to obtain a preprocessed historical archive data set;

determining history archive data associated with each history archive data in the preprocessed history archive data set;

taking the preprocessed historical archive data set as input, and taking information of the historical archive data associated with each historical archive data as a label to obtain a training set;

and training the preset classification model by using the training set to obtain a target classification model.

In one embodiment, determining the history archive data associated with each of the preprocessed history archive data sets includes:

extracting features from each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data;

calculating association parameters between every two historical archive data in the historical archive data set based on the feature matrix corresponding to each historical archive data;

and determining the historical archive data associated with each historical archive data in the historical archive data set according to the association parameters between every two historical archive data in the historical archive data set.

In one embodiment, extracting features from each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data includes:

performing binary matrix processing on each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data;

and calculating the probability of each feature in the feature matrix, and taking the probability of each feature as the weight value of each feature.

In one embodiment, calculating the relevance parameter between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data includes:

determining whether the first historical archive data and the second historical archive data have a feature in common;

and, if a common feature exists, determining the relevance parameter between the first historical archive data and the second historical archive data according to the first weight value of that feature in the first historical archive data and its second weight value in the second historical archive data.

In one embodiment, calculating the relevance parameter between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data may include:

extracting a first feature set with feature weights meeting a first preset condition from a feature matrix corresponding to the first historical archive data; extracting a second feature set with feature weights meeting a second preset condition from a feature matrix corresponding to second historical archive data;

calculating association parameters between one or more features in the first feature set and one or more features in the second feature set;

and calculating a relevance parameter between the first historical archive data and the second historical archive data based on the feature weights of one or more features in the first feature set, the feature weights of one or more features in the second feature set and the calculated relevance parameter.

In one embodiment, determining the historical archive data associated with each historical archive data in the historical archive data set according to the association parameters between every two historical archive data in the historical archive data set includes:

determining two pieces of historical archive data whose relevance parameter is greater than a preset threshold as being associated with each other.

The embodiments of the specification also provide an archive data processing device, which comprises:

the acquisition module is used for acquiring target archive data; dividing the target archive data into structured data and unstructured data;

the preprocessing module is used for respectively preprocessing the structured data and the unstructured data to obtain preprocessed structured data and unstructured data;

the merging module is used for merging the preprocessed structured data and unstructured data to obtain preprocessed target archive data;

and the classification module is used for inputting the preprocessed target archive data into a target classification model to obtain the association relation data between the target archive data and the historical archive data.

The embodiments of the present disclosure also provide a computer device, including a processor and a memory for storing instructions executable by the processor, where the processor executes the instructions to implement the steps of the archive data processing method described in any of the embodiments above.

The present description also provides a computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the archive data processing method described in any of the above embodiments.

In the embodiments of the present disclosure, an archive data processing method is provided, in which target archive data may be obtained and divided into structured data and unstructured data; the structured data and the unstructured data are preprocessed separately to obtain preprocessed structured data and preprocessed unstructured data; the preprocessed structured data and unstructured data are merged to obtain preprocessed target archive data; and the preprocessed target archive data is input into a target classification model to obtain association relationship data between the target archive data and historical archive data. In this scheme, the archive data is divided by type into structured data and unstructured data, which are processed separately and then combined into preprocessed archive data, and the target classification model processes the preprocessed archive data to obtain the association relationships between archive data. Implicit associations between archive data can thus be established, so that a manager can find the required implicitly associated data when searching, which makes the archive data easier to manage and improves search results.

Drawings

The accompanying drawings are included to provide a further understanding of the specification, and are incorporated in and constitute a part of this specification. In the drawings:

FIG. 1 is a schematic diagram of an application scenario of an archive data processing method in an embodiment of the present disclosure;

FIG. 2 is a flow chart illustrating a method of archive data processing in an embodiment of the present disclosure;

FIG. 3 is a flow chart illustrating a method of archive data processing in an embodiment of the present disclosure;

FIG. 4 is a flow chart illustrating a method of archive data processing in an embodiment of the present disclosure;

FIG. 5 is a flow chart of constructing a target classification model in an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of an archive data processing device in an embodiment of the present disclosure;

FIG. 7 is a schematic diagram of a computer device in an embodiment of the present disclosure.

Detailed Description

The principles and spirit of the present specification will be described below with reference to several exemplary embodiments. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and practice the present description, and are not intended to limit the scope of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Those skilled in the art will appreciate that the embodiments of the present description may be implemented as a system, apparatus, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

The embodiments of the specification provide an archive data processing method. FIG. 1 is a schematic diagram of an application scenario of the archive data processing method in an embodiment of the present disclosure. As shown in FIG. 1, a user may input or generate target archive data through a client, and the client may send the target archive data to a server. The server may divide the target archive data into structured data and unstructured data, preprocess the two separately to obtain preprocessed structured data and preprocessed unstructured data, and merge them to obtain preprocessed target archive data. The server may then input the preprocessed target archive data into a target classification model to obtain association relationship data between the target archive data and historical archive data. Based on that association relationship data, the server may determine information of the historical archive data associated with the target archive data and store this information in a database in association with the target archive data. The server may also receive from the client a search request that includes a search term. Based on the search term, the server may query the database for the archive data corresponding to the search term and the archive data associated with it, and feed both back to the client.

The server may be a single server, a server cluster, or a cloud server; its specific form is not limited in the present application. The client may be a desktop computer, a notebook computer, a mobile phone terminal, a PDA, or the like; the present application places no limitation on it as long as it is a device capable of displaying content to a user or business person and receiving operation instructions.

FIG. 2 shows a flowchart of an archive data processing method in an embodiment of the present specification. Although the present specification provides the method steps and apparatus structures shown in the following embodiments or figures, the method or apparatus may include more or fewer steps or modular units on the basis of conventional or non-inventive effort. For steps or structures that have no logically necessary causal relationship, the execution order of the steps or the module structure of the apparatus is not limited to that described in the embodiments and shown in the drawings of the present specification. When applied in an actual device or end product, the described methods or module structures may be executed sequentially or in parallel (e.g., in a parallel-processor or multithreaded environment, or even in a distributed processing environment) according to the embodiments or the connections shown in the figures.

Specifically, as shown in fig. 2, the archive data processing method provided in an embodiment of the present disclosure may include the following steps:

Step S10, acquiring target archive data; dividing the target archive data into structured data and unstructured data.

The method in this embodiment can be applied to a server. The server may obtain the target archive data, which may include both structured data and unstructured data. In this embodiment, structured data refers to data with a fixed structure; in financial archives, data in a relational archive database belongs to this category, and data may also have a fixed structure after being processed by a computer. Unstructured data refers to data without a fixed structure. The target archive data may be divided into structured data and unstructured data.
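As a rough illustration of step S10, the sketch below assumes that one archive record arrives as a dictionary of named fields and that the fields belonging to a relational archive database are known in advance. The field names, the `STRUCTURED_FIELDS` schema and the `split_archive` helper are hypothetical; the specification does not prescribe how the division is made.

```python
from typing import Any, Dict, Tuple

# Hypothetical schema: fields that come from a relational archive database.
STRUCTURED_FIELDS = {"archive_id", "amount", "voucher_date", "department"}


def split_archive(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Split one archive record into structured and unstructured parts."""
    structured: Dict[str, Any] = {}
    unstructured: Dict[str, str] = {}
    for name, value in record.items():
        if name in STRUCTURED_FIELDS:
            structured[name] = value
        else:
            # Anything outside the fixed schema is treated as free text.
            unstructured[name] = str(value)
    return structured, unstructured


record = {
    "archive_id": "A-001",
    "amount": 1250.0,
    "voucher_date": "2023-08-02",
    "remarks": "Reimbursement for equipment purchase, approved by finance.",
}
structured, unstructured = split_archive(record)
```

Here `structured` holds the three schema fields and `unstructured` holds only the free-text `remarks` field.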

Step S20, preprocessing the structured data and the unstructured data separately to obtain preprocessed structured data and preprocessed unstructured data.

Step S30, merging the preprocessed structured data and unstructured data to obtain preprocessed target archive data.

Because the structured data and the unstructured data are different in type, the structured data and the unstructured data can be preprocessed respectively, and the preprocessed structured data and unstructured data can be obtained.

In some embodiments of the present disclosure, preprocessing the structured data and the unstructured data to obtain preprocessed structured data and unstructured data may include: performing data cleaning on the structured data to obtain preprocessed structured data; and performing word segmentation on the unstructured data, and extracting feature data from the segmented unstructured data to obtain the preprocessed unstructured data.

Specifically, the structured data may be cleaned to remove missing values, erroneous values, and noise. The unstructured data may be subjected to word segmentation, and feature data may be extracted from the segmented data. Word segmentation refers to obtaining the words in an archive; the features of those words, namely feature words, are then obtained for subsequent relationship construction. This preprocessing removes irrelevant data, which reduces the subsequent data processing load and improves processing efficiency, and it also facilitates the subsequent data association processing.

After the preprocessed structured data and unstructured data are obtained, the preprocessed structured data and unstructured data can be combined to obtain preprocessed target archive data.
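Steps S20 and S30 can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the specification: the regular-expression tokenizer stands in for a real word segmenter (Chinese archives would need a dedicated one), the stop-word list and the "negative amount is erroneous" rule are invented for the example, and all function names are hypothetical.

```python
import re
from collections import Counter
from typing import Any, Dict, List

STOP_WORDS = {"the", "for", "by", "and", "of", "a", "to"}  # illustrative only


def clean_structured(structured: Dict[str, Any]) -> Dict[str, Any]:
    """Drop missing values and an obviously erroneous (negative) amount."""
    cleaned = {k: v for k, v in structured.items() if v is not None and v != ""}
    amount = cleaned.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        del cleaned["amount"]
    return cleaned


def segment(text: str) -> List[str]:
    """Stand-in word segmentation: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_feature_words(text: str, top_n: int = 5) -> List[str]:
    """Keep the most frequent non-stop-word tokens as feature words."""
    counts = Counter(t for t in segment(text) if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def preprocess(structured: Dict[str, Any], text: str, top_n: int = 5) -> Dict[str, Any]:
    """Preprocess both parts separately, then merge them into one record."""
    return {
        "structured": clean_structured(structured),
        "feature_words": extract_feature_words(text, top_n),
    }


merged = preprocess(
    {"archive_id": "A-001", "amount": 1250.0, "department": None},
    "Reimbursement for equipment purchase; equipment approved by finance.",
    top_n=3,
)
```

In this example the missing `department` value is dropped, and "equipment", the only word that occurs twice, is the first feature word.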

Step S40, inputting the preprocessed target archive data into a target classification model to obtain association relationship data between the target archive data and the historical archive data.

The preprocessed target archive data can be input into the target classification model to obtain association relationship data between the target archive data and the historical archive data. The historical archive data here may refer to archive data stored in a server or database. The target classification model is constructed in advance and is used to determine the association relationship data between the input target archive data and the historical archive data.

In one embodiment, the association relationship data may include association degree parameters and/or association types of the historical archive data associated with the target archive data. In another embodiment, the association relationship data may include association degree parameters and/or association types between the target archive data and each of a plurality of historical archive data.

For example, after the target archive data is input into the target classification model, it may be determined that the target archive data is associated with historical archive data A, historical archive data B, and historical archive data C, where the association type with historical archive data A is a first type, the association type with historical archive data B is a second type, and the association type with historical archive data C is a third type.

In the above embodiment, the archive data is divided by type into structured data and unstructured data, which are processed separately and then combined into preprocessed archive data, and the target classification model processes the preprocessed archive data to obtain the association relationships between archive data. Implicit associations between archive data can thus be established, so that a manager can find the required implicitly associated data when searching, which makes the archive data easier to manage and improves search results.

FIG. 3 shows a flowchart of an archive data processing method in an embodiment of the present specification. As shown in FIG. 3, in some embodiments of the present disclosure, after inputting the preprocessed target archive data into the target classification model in step S40 to obtain the association relationship data between the target archive data and the historical archive data, the method may further include:

Step S50, determining information of the historical archive data associated with the target archive data based on the association relationship data between the target archive data and the historical archive data.

Step S60, storing the information of the historical archive data associated with the target archive data in a database in association with the target archive data.

In this embodiment, the server may further determine information of the historical archive data associated with the target archive data based on the association relationship data between the target archive data and the historical archive data. This information may include an identification of the associated historical archive data, an association degree parameter, an association type, and/or the like.

In one embodiment, the association relationship data may include association degree parameters and/or association types of history archive data associated with the target archive data. In this way, the identification of the history archive data associated with the target archive data, the association degree parameter and/or the association type and the like can be directly determined based on the association relationship data.

In another embodiment, the association relationship data may include association degree parameters and/or association types between the target archive data and each of a plurality of historical archive data. In this way, the historical archive data associated with the target archive data can be determined from the plurality of historical archive data according to these association degree parameters and/or association types. Historical archive data whose association degree parameter and/or association type meets a preset condition can be determined as the historical archive data associated with the target archive data. The preset condition may include at least one of the following: the association degree parameter is greater than a preset threshold; the association degree parameter ranks within a preset number of top positions when the association degree parameters are sorted in descending order; and the association type belongs to a preset type.

After the information of the historical archive data associated with the target archive data is determined based on the association relationship data, this information may be stored in the database in association with the target archive data. In this way, both the archive data and the association relationships between archive data can be stored in the database.
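The preset conditions listed above can be sketched as a small filter over candidate associations. The `Association` record, the `select_associated` helper and the choice to combine the supplied conditions conjunctively are assumptions for illustration; the specification only requires that at least one such condition be used.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Association:
    history_id: str  # identifier of a historical archive record
    degree: float    # association degree parameter
    kind: str        # association type


def select_associated(
    candidates: List[Association],
    threshold: Optional[float] = None,
    top_k: Optional[int] = None,
    allowed_kinds: Optional[Set[str]] = None,
) -> List[Association]:
    """Apply whichever preset conditions are supplied, highest degree first."""
    selected = sorted(candidates, key=lambda a: a.degree, reverse=True)
    if threshold is not None:
        selected = [a for a in selected if a.degree > threshold]
    if allowed_kinds is not None:
        selected = [a for a in selected if a.kind in allowed_kinds]
    if top_k is not None:
        selected = selected[:top_k]
    return selected


candidates = [
    Association("A", 0.91, "same_project"),
    Association("B", 0.42, "same_vendor"),
    Association("C", 0.77, "same_project"),
]
result = select_associated(candidates, threshold=0.5, top_k=2)
```

With a threshold of 0.5 and a top-2 limit, records A and C are kept and B is discarded.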

FIG. 4 shows a flowchart of an archive data processing method in an embodiment of the present specification. As shown in FIG. 4, in some embodiments of the present description, the method may further include the following steps.

Step S401, receiving a search request sent by a client; the search request includes a search term.

Step S402, based on the search word, inquiring archive data corresponding to the search word and archive data associated with the archive data from a database.

Step S403, feeding back the archive data corresponding to the search word and the archive data associated with the archive data to the client.

The server may receive a search request sent by a user via the client. The search request may include search terms or search conditions. Based on the search term or search condition, the server may query the database for the corresponding archive data and the archive data associated with it. The archive data corresponding to the search term and the archive data associated with it can then be fed back to the client. For example, if the server finds archive data A corresponding to the search term, together with archive data C and archive data D associated with archive data A, it may return archive data A, C and D to the client. In this way, associated archive data can be returned when the user searches.

FIG. 5 illustrates a flow chart of constructing a target classification model in an embodiment of the present description. As shown in FIG. 5, in some embodiments of the present description, the target classification model may be constructed in the following manner.
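The storage step S60 and the search steps S401 to S403 can be sketched with Python's built-in `sqlite3` module. The two-table layout, the column names and the substring match used for the search term are assumptions chosen for the example; the specification does not fix a database schema or a matching rule.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE archive (id TEXT PRIMARY KEY, content TEXT);
    CREATE TABLE association (
        archive_id TEXT, related_id TEXT, degree REAL, kind TEXT
    );
    """
)
# Hypothetical archive records and stored associations (step S60).
conn.executemany("INSERT INTO archive VALUES (?, ?)", [
    ("A", "equipment purchase reimbursement"),
    ("C", "equipment vendor contract"),
    ("D", "vendor payment voucher"),
    ("E", "travel expense report"),
])
conn.executemany("INSERT INTO association VALUES (?, ?, ?, ?)", [
    ("A", "C", 0.91, "same_project"),
    ("A", "D", 0.77, "same_vendor"),
])


def search(term: str) -> List[str]:
    """Return ids of archives matching the term plus their associated archives."""
    direct = [row[0] for row in conn.execute(
        "SELECT id FROM archive WHERE content LIKE ? ORDER BY id",
        (f"%{term}%",),
    )]
    results = list(direct)
    for archive_id in direct:
        rows = conn.execute(
            "SELECT related_id FROM association "
            "WHERE archive_id = ? ORDER BY degree DESC",
            (archive_id,),
        )
        for (related_id,) in rows:
            if related_id not in results:
                results.append(related_id)
    return results
```

Searching for "reimbursement" matches archive A directly and also returns its associated archives C and D, mirroring the example in the text.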

Step S501, acquiring a historical archive data set; preprocessing each piece of historical archive data in the historical archive data set to obtain a preprocessed historical archive data set.

Step S502, determining the history archive data associated with each history archive data in the preprocessed history archive data set.

Step S503, taking the preprocessed history archive data set as input, and taking information of history archive data associated with each history archive data as a label, to obtain a training set.

Step S504, training a preset classification model by using the training set to obtain a target classification model.

Specifically, a historical archive data set may be obtained. The historical archive data set may include a large amount of historical archive data. Each piece of historical archive data in the set may be preprocessed to obtain a preprocessed historical archive data set. For example, each piece of historical archive data may be divided into structured data and unstructured data and preprocessed, thereby obtaining the preprocessed historical archive data set. Thereafter, the historical archive data associated with each piece of historical archive data in the preprocessed set may be determined. A training set is obtained by taking the preprocessed historical archive data set as input and the information of the historical archive data associated with each piece of historical archive data as the label. Training the preset classification model with the training set yields the target classification model, which can then be used to determine the implicit associations between archive data.

In some embodiments of the present disclosure, determining the historical archive data associated with each piece of historical archive data in the preprocessed historical archive data set may include: extracting features from each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data; calculating association parameters between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data; and determining the historical archive data associated with each piece of historical archive data in the historical archive data set according to those association parameters.

Specifically, feature extraction can be performed on each piece of historical archive data in the preprocessed historical archive data set to obtain a corresponding feature matrix. In some embodiments of the present disclosure, this feature extraction may include: performing binary matrix processing on each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data; and calculating the probability of each feature in the feature matrix, and taking the probability of each feature as the weight value of that feature. Of course, other matrixing methods may be used to determine the feature matrix corresponding to the historical archive data.

Then, the relevance parameters between every two pieces of historical archive data in the historical archive data set are calculated based on the feature matrix corresponding to each piece of historical archive data. The relevance parameters here may include an association degree parameter and/or an association type.

In one embodiment, the relevance parameter between two pieces of historical archive data may be determined based on the proportion of identical features in the feature matrices corresponding to the two pieces of historical archive data, thereby determining whether the two are associated.
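Step S503 can be sketched as pairing each preprocessed record with the identifiers of its associated records. The representation below (feature-word lists as inputs, sorted identifier lists as labels) and the `build_training_set` helper are assumptions for illustration; the specification does not name the preset classification model, so training itself (step S504) is left out of the sketch.

```python
from typing import Dict, List, Set, Tuple


def build_training_set(
    preprocessed: Dict[str, List[str]],
    associated: Dict[str, Set[str]],
) -> List[Tuple[List[str], List[str]]]:
    """Pair each preprocessed record (input) with its associated ids (label)."""
    training_set = []
    for archive_id, features in preprocessed.items():
        # Records with no known association get an empty label.
        label = sorted(associated.get(archive_id, set()))
        training_set.append((features, label))
    return training_set


# Hypothetical preprocessed records and the associations found in step S502.
preprocessed = {
    "A": ["equipment", "purchase"],
    "B": ["equipment", "vendor"],
    "C": ["travel"],
}
associated = {"A": {"B"}, "B": {"A"}}
training_set = build_training_set(preprocessed, associated)
```

Each element of `training_set` is an (input, label) pair that a classifier of the reader's choice could then be trained on.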
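One possible reading of the binary matrix processing and probability weighting is sketched below: each record becomes a 0/1 row over a shared vocabulary of feature words, and the probability of a feature is taken to be the fraction of records that contain it. Both the reading and the helper names are assumptions; the text leaves open whether the probability is computed across the data set or within a single record.

```python
from typing import Dict, List, Tuple


def binary_feature_matrix(
    docs: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Build a 0/1 row per record over the sorted vocabulary of feature words."""
    vocabulary = sorted({word for words in docs.values() for word in words})
    matrix = {}
    for doc_id, words in docs.items():
        present = set(words)
        matrix[doc_id] = [1 if word in present else 0 for word in vocabulary]
    return vocabulary, matrix


def feature_weights(
    vocabulary: List[str], matrix: Dict[str, List[int]]
) -> Dict[str, float]:
    """Weight of a feature = fraction of records in which it appears."""
    if not matrix:
        return {}
    n = len(matrix)
    return {
        word: sum(row[j] for row in matrix.values()) / n
        for j, word in enumerate(vocabulary)
    }


docs = {
    "A": ["equipment", "purchase"],
    "B": ["equipment", "vendor"],
    "C": ["travel"],
}
vocabulary, matrix = binary_feature_matrix(docs)
weights = feature_weights(vocabulary, matrix)
```

Here "equipment" appears in two of the three records, so it receives a weight of 2/3, while every other feature receives 1/3.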

In another embodiment, the relevance between two pieces of historical archive data may be calculated based on the association between features in their corresponding feature matrices. For example, suppose the feature matrix corresponding to historical archive data A includes feature 1 and feature 2 with higher weights, and the feature matrix corresponding to historical archive data B includes feature 3 with a higher weight. If, across all historical archive data in the preprocessed historical archive data set, the probability that feature 1, feature 2 and feature 3 occur together is greater than a preset probability, historical archive data A may be determined to be associated with historical archive data B, and the association degree parameter may be determined based on the weights of the features and the probability of their co-occurrence.

In some embodiments of the present disclosure, calculating the relevance parameter between two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data may include: determining whether the first historical archive data and the second historical archive data have a feature in common; and, if a common feature exists, determining the relevance parameter between the first historical archive data and the second historical archive data according to the first weight value of that feature in the first historical archive data and its second weight value in the second historical archive data. For example, the product of the first weight value and the second weight value may be calculated as the relevance parameter between the two pieces of historical archive data, and if the relevance parameter is greater than a preset value, the first historical archive data and the second historical archive data are determined to be associated.
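The weight-product example can be sketched as follows, assuming each record carries its own weight for each feature. Summing the products when several features are shared is an assumption added for the sketch, as are the helper names and the threshold value.

```python
from typing import Dict


def relevance_from_shared_features(
    weights_a: Dict[str, float], weights_b: Dict[str, float]
) -> float:
    """Sum, over shared features, of (weight in record A) * (weight in record B)."""
    shared = weights_a.keys() & weights_b.keys()
    return sum(weights_a[f] * weights_b[f] for f in shared)


def is_associated(
    weights_a: Dict[str, float], weights_b: Dict[str, float], preset_value: float
) -> bool:
    """Two records are associated when the relevance exceeds the preset value."""
    return relevance_from_shared_features(weights_a, weights_b) > preset_value


# Hypothetical per-record feature weights.
first = {"equipment": 0.6, "purchase": 0.4}
second = {"equipment": 0.5, "vendor": 0.5}
score = relevance_from_shared_features(first, second)
```

The only shared feature is "equipment", so the score is 0.6 * 0.5 = 0.3, which exceeds a preset value of, say, 0.2.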

In some embodiments of the present disclosure, calculating the association parameter between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data may include: extracting, from the feature matrix corresponding to first historical archive data, a first feature set whose feature weights meet a first preset condition; extracting, from the feature matrix corresponding to second historical archive data, a second feature set whose feature weights meet a second preset condition; calculating feature association parameters between one or more features in the first feature set and one or more features in the second feature set; and calculating the association parameter between the first historical archive data and the second historical archive data based on the feature weights of the one or more features in the first feature set, the feature weights of the one or more features in the second feature set, and the calculated feature association parameters.

Specifically, a first feature set whose feature weights meet a first preset condition may be extracted from the feature matrix corresponding to the first historical archive data. The first preset condition may include one of the following: the feature weight is greater than a first preset weight; or the feature ranks within a first preset number of features when the features are sorted by feature weight in descending order. A second feature set whose feature weights meet a second preset condition may be extracted from the feature matrix corresponding to the second historical archive data. The second preset condition may include one of the following: the feature weight is greater than a second preset weight; or the feature ranks within a second preset number of features when the features are sorted by feature weight in descending order.

After the first feature set and the second feature set are extracted, feature association parameters between one or more features in the first feature set and one or more features in the second feature set may be calculated. In one embodiment, one or more first features may be taken from the first feature set and one or more second features may be taken from the second feature set. For example, the number of times the one or more first features and the one or more second features appear together in the feature matrices corresponding to all the historical archive data in the historical archive data set may be counted, and the feature association parameter between the one or more first features and the one or more second features may be determined based on that number. Thereafter, the association parameter between the first historical archive data and the second historical archive data may be calculated based on the feature weights of the one or more features in the first feature set, the feature weights of the one or more features in the second feature set, and the calculated feature association parameters. The larger the feature association parameters and the larger the feature weights, the larger the association parameter between the first historical archive data and the second historical archive data, and the higher the degree of association. In this way, implicit associations between historical archive data can be discovered.
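The co-occurrence-based calculation described above might look like the following sketch. The source does not fix a formula for combining feature weights with co-occurrence counts, so the weighted sum used here is an assumption, as are the function names and the representation of each archive as a set of features.

```python
from itertools import product
from typing import Dict, List, Set


def select_features(weights: Dict[str, float], preset_weight: float) -> Set[str]:
    """Keep the features whose weight exceeds the preset weight."""
    return {f for f, w in weights.items() if w > preset_weight}


def cooccurrence_ratio(all_feature_sets: List[Set[str]], f1: str, f2: str) -> float:
    """Fraction of archives in the data set that contain both features."""
    if not all_feature_sets:
        return 0.0
    hits = sum(1 for s in all_feature_sets if f1 in s and f2 in s)
    return hits / len(all_feature_sets)


def archive_association(
    weights_a: Dict[str, float],
    weights_b: Dict[str, float],
    all_feature_sets: List[Set[str]],
    preset_weight_a: float,
    preset_weight_b: float,
) -> float:
    """Assumed formula: sum of weight_a * weight_b * co-occurrence ratio over feature pairs."""
    set_a = select_features(weights_a, preset_weight_a)
    set_b = select_features(weights_b, preset_weight_b)
    return sum(
        weights_a[fa] * weights_b[fb] * cooccurrence_ratio(all_feature_sets, fa, fb)
        for fa, fb in product(set_a, set_b)
    )
```

Under this assumed formula the result grows with both the feature weights and the co-occurrence ratio, which matches the monotonic relationship stated in the text.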

In some embodiments of the present disclosure, determining, according to the association parameter between every two pieces of historical archive data in the historical archive data set, the historical archive data associated with each piece of historical archive data in the historical archive data set may include: determining two pieces of historical archive data whose association parameter is greater than a preset threshold as historical archive data that are associated with each other. In this way, whether two pieces of historical archive data are associated can be determined according to the calculated association parameter.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. Specific reference may be made to the foregoing description of related embodiments of the related process, which is not described herein in detail.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The above method is described below in connection with a specific embodiment, however, it should be noted that this specific embodiment is only for better illustrating the present specification and should not be construed as unduly limiting the present specification.

In this embodiment, a method for processing archive data is provided, which may include the following steps.

Step 1, acquiring a financial archive, and preprocessing financial archive data.

Further, preprocessing the financial archive data includes: dividing the financial archive data into structured data and unstructured data according to differences in the data structure of the financial archive; cleaning the structured data to remove missing values, erroneous values and noise; performing word segmentation on the unstructured data, and extracting feature data from the segmented data; and merging the processed structured data and the processed unstructured data to form new financial archive data.
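The preprocessing steps above can be illustrated with a simplified sketch. Several assumptions apply: structured data is modelled as a list of dictionaries, cleaning only drops rows with missing values (erroneous values and noise are not handled), and a plain regular-expression tokenizer over Latin text stands in for a real word segmenter. All function names are hypothetical.

```python
import re
from collections import Counter
from typing import Any, Dict, List


def clean_structured(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop rows that contain a missing value (None or an empty string)."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]


def extract_feature_words(text: str, top_n: int = 3) -> List[str]:
    """Tokenize the text and return the top_n most frequent words as feature words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(top_n)]


def preprocess_archive(
    rows: List[Dict[str, Any]], text: str, top_n: int = 3
) -> Dict[str, Any]:
    """Process both parts separately, then merge them into one record."""
    return {
        "structured": clean_structured(rows),
        "features": extract_feature_words(text, top_n),
    }
```

For Chinese-language archives a dedicated segmentation library would replace the regular expression; the overall split, process and merge flow stays the same.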

In this embodiment, it should be noted that structured data refers to data having a fixed structure. In financial archives, data in a relational archive database belongs to this category, and such data may have a fixed structure after being processed by a computer. Unstructured data refers to data without a fixed structure.

Word segmentation refers to splitting the text of an archive into words, and then extracting features from the segmented words, namely feature words, for use in the subsequent construction of associations.

Through the preprocessing, on one hand, irrelevant data can be removed, the subsequent data processing amount is reduced, and further, the data processing efficiency is improved, and on the other hand, the subsequent data association processing can be facilitated.

Step 2, classifying the preprocessed financial archive data by data relevance using a classification model. Data with relevance is classified into one category, so that invisibly associated data can be mined and managed, and the data search results can be optimized.

Further, before use, the classification model needs to be trained. The training process is as follows: preprocessing historical financial archive data, and then performing binary matrix processing; calculating the probability of each feature in the obtained binary matrix, and taking that probability as the weight value of the corresponding feature; and establishing virtual associations between the historical financial archive data. For example, if one of the two files contains cable purchasing specifications, a virtual association between the two files is established. A value is assigned to each virtual association according to the weight of the feature; different feature weights are assigned different values, for example, the highest level is assigned 5, the next highest level is assigned 4, and so on. The historical financial archive data with virtual associations and assigned values is divided into a training set and a test set; the training set is used to train the classification model and the test set is used to test it, thereby obtaining the trained classification model.

The binary matrix processing is as follows: selecting a plurality of features with the highest weights from the preprocessed financial archive data; and converting the selected features into binary form to obtain the corresponding binary matrix.
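A minimal sketch of the binary matrix and the probability-based feature weights is given below. It assumes that each archive has already been reduced to a set of selected features and that the "probability" of a feature is the fraction of archives in which it appears; the function names and the fixed vocabulary list are illustrative assumptions.

```python
from typing import Dict, List, Set


def build_binary_matrix(
    feature_sets: List[Set[str]], vocabulary: List[str]
) -> List[List[int]]:
    """One row per archive, one column per feature; 1 if the archive has the feature."""
    return [[1 if f in s else 0 for f in vocabulary] for s in feature_sets]


def feature_weights(matrix: List[List[int]], vocabulary: List[str]) -> Dict[str, float]:
    """Weight of a feature = fraction of rows in which its column is 1."""
    n_rows = len(matrix)
    if n_rows == 0:
        return {f: 0.0 for f in vocabulary}
    return {
        f: sum(row[j] for row in matrix) / n_rows for j, f in enumerate(vocabulary)
    }
```

For four archives in which "cable" appears three times, the column for "cable" sums to 3 and its weight is 3/4.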

Further, if the test result meets the preset threshold (the set accuracy), training of the classification model is complete; otherwise, the classification model is trained again until the test result meets the preset threshold.

In addition, whether to establish a virtual association between historical financial archive data is determined by a confidence measure.

In this embodiment, it should be noted that the classification model used is preferably a neural network model, and that during model training the ratio of the number of samples in the training set to the number of samples in the test set is 7:1, so as to help ensure that the trained classification model has good accuracy.
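The 7:1 division into training and test sets could be done as in the sketch below. The shuffling, the fixed seed and the function name are assumptions for illustration; the source only specifies the ratio.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], train_parts: int = 7, test_parts: int = 1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them in the ratio train_parts:test_parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]
```

With 16 labelled samples this yields 14 training samples and 2 test samples.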

When a plurality of features with highest weights are selected from the preprocessed financial archival data, the number of occurrences of the extracted feature words is used as the weight.

Take the calculation for the rule {feature 1, feature 2} -> {feature 3} as an example. The item set {feature 1, feature 2} appears 3 times in all transactions, and the total number of transactions is 6, so the support of {feature 1, feature 2} is 3/6 = 0.5. The confidence of the rule {feature 1, feature 2} -> {feature 3} is the quotient of the support count of the item set {feature 1, feature 2, feature 3} and the support count of {feature 1, feature 2}. The support count of {feature 1, feature 2, feature 3} is 2 and the support count of {feature 1, feature 2} is 3, so the confidence of the rule is 2/3 ≈ 0.67. If 0.67 is within the rule threshold, a virtual association (a data association between the financial archive data) is established; otherwise, no virtual association is established. Being within the rule threshold indicates that the set relevance is met, that is, an association exists, and this association is established through the feature words.
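The support and confidence figures in the worked example above can be reproduced with a short sketch. The six transactions below are hypothetical data chosen so that {feature 1, feature 2} appears 3 times and {feature 1, feature 2, feature 3} appears 2 times; the function names are likewise illustrative.

```python
from typing import List, Set


def support_count(transactions: List[Set[str]], itemset: Set[str]) -> int:
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Support = support count divided by the total number of transactions."""
    return support_count(transactions, itemset) / len(transactions)


def confidence(
    transactions: List[Set[str]], antecedent: Set[str], consequent: Set[str]
) -> float:
    """Confidence of antecedent -> consequent; 0.0 if the antecedent never occurs."""
    base = support_count(transactions, antecedent)
    if base == 0:
        return 0.0
    return support_count(transactions, antecedent | consequent) / base


# Hypothetical transactions matching the counts in the example.
transactions = [
    {"feature 1", "feature 2", "feature 3"},
    {"feature 1", "feature 2", "feature 3"},
    {"feature 1", "feature 2"},
    {"feature 1"},
    {"feature 3"},
    {"feature 2", "feature 3"},
]
print(support(transactions, {"feature 1", "feature 2"}))  # 0.5
print(round(confidence(transactions, {"feature 1", "feature 2"}, {"feature 3"}), 2))  # 0.67
```

Whether 0.67 leads to a virtual association then depends on the rule threshold chosen, which the source does not specify.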

The method in the embodiment can establish the invisible relevance among the financial archival data, so that a manager can search out the required invisible relevance data during searching.

Step 3, storing the classified financial archive data.

Further, before the classified financial archive data is stored, the virtually associated data is sorted according to weight value, and priorities are established according to the sorted order. Here, the number of occurrences of a feature is used as its weight value, and the ranking obtained after sorting is the search priority.
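The sorting step above can be sketched as follows. The mapping from an archive identifier to an occurrence count and the function name are assumptions for illustration; ties keep their original order because Python's sort is stable.

```python
from typing import Dict, List, Tuple


def rank_by_weight(occurrence_counts: Dict[str, int]) -> List[Tuple[str, int]]:
    """Sort associated archives by weight (descending) and assign priorities from 1."""
    ordered = sorted(occurrence_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (archive_id, priority)
        for priority, (archive_id, _) in enumerate(ordered, start=1)
    ]
```

The resulting priority can be stored alongside each associated archive so that search results can later be returned in that order.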

In this embodiment, search results are output in order of priority when a manager searches, so that the manager can review the results conveniently.

In this embodiment, a financial archive is acquired, the financial archive data is preprocessed, the preprocessed financial archive data is classified by data relevance using the classification model, and the classified financial archive data is then stored. In this way, invisible associations between financial archive data can be established, so that a manager can find the required invisibly associated data when searching, which facilitates the management of related data and optimizes the data search results.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Based on the same inventive concept, an archive data processing device is also provided in the embodiments of the present specification, as described in the following embodiments. Since the principle by which the archive data processing device solves the problem is similar to that of the archive data processing method, the implementation of the archive data processing device can refer to the implementation of the archive data processing method, and repeated description is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Fig. 6 is a block diagram of an archive data processing device according to an embodiment of the present disclosure. As shown in Fig. 6, the device includes an acquisition module 601, a preprocessing module 602, a merging module 603, and a classification module 604. The structure is described below.

The acquisition module 601 is configured to acquire target archive data and to divide the target archive data into structured data and unstructured data.

The preprocessing module 602 is configured to preprocess the structured data and the unstructured data respectively, so as to obtain preprocessed structured data and unstructured data.

The merging module 603 is configured to merge the preprocessed structured data and unstructured data to obtain preprocessed target archive data.

The classification module 604 is configured to input the preprocessed target archive data into a target classification model, so as to obtain association relationship data between the target archive data and the historical archive data.

In some embodiments of the present description, the preprocessing module may be specifically configured to: perform data cleaning on the structured data to obtain preprocessed structured data; and perform word segmentation on the unstructured data, and extract feature data from the segmented unstructured data to obtain preprocessed unstructured data.

In some embodiments of the present disclosure, the apparatus further includes a storage module, where the storage module is specifically configured to: after the classification module inputs the preprocessed target archive data into the target classification model to obtain the association relationship data between the target archive data and the historical archive data, determine information of the historical archive data associated with the target archive data based on the association relationship data between the target archive data and the historical archive data; and store the target archive data and the information of the historical archive data associated with the target archive data in a database in association with each other.

In some embodiments of the present disclosure, the apparatus further includes a search module, where the search module is specifically configured to: receive a search request sent by a client, the search request carrying a search word; query, from the database based on the search word, the archive data corresponding to the search word and the archive data associated with that archive data; and feed back the archive data corresponding to the search word and the archive data associated with that archive data to the client.
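The behaviour of the search module can be illustrated with an in-memory sketch. The two dictionaries standing in for the database, the substring match on the search word and the function name are all assumptions for illustration; an actual deployment would query a real database.

```python
from typing import Dict, List


def search_archives(
    database: Dict[str, str],
    associations: Dict[str, List[str]],
    search_word: str,
) -> Dict[str, List[str]]:
    """Return archives matching the search word plus the archives associated with them."""
    matches = [aid for aid, text in database.items() if search_word in text]
    associated: List[str] = []
    for aid in matches:
        for other in associations.get(aid, []):
            # Avoid listing an archive twice.
            if other not in matches and other not in associated:
                associated.append(other)
    return {"matches": matches, "associated": associated}
```

A query for "cable" would thus return both the archives whose text contains "cable" and any archives stored as associated with them, even if those do not contain the search word.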

In some embodiments of the present description, the target classification model is constructed by: acquiring a historical archive data set; preprocessing each piece of historical archive data in the historical archive data set to obtain a preprocessed historical archive data set; determining the historical archive data associated with each piece of historical archive data in the preprocessed historical archive data set; taking the preprocessed historical archive data set as input, and taking information of the historical archive data associated with each piece of historical archive data as a label, to obtain a training set; and training a preset classification model by using the training set to obtain the target classification model.

In some embodiments of the present description, determining the history archive data associated with each history archive data in the preprocessed history archive data set includes: extracting characteristics of each history file data in the preprocessed history file data set to obtain a characteristic matrix corresponding to each history file data; calculating association parameters between every two historical archive data in the historical archive data set based on the feature matrix corresponding to each historical archive data; and determining the historical archive data associated with each historical archive data in the historical archive data set according to the association parameters between every two historical archive data in the historical archive data set.

In some embodiments of the present disclosure, feature extraction is performed on each history archive data in the preprocessed history archive data set to obtain a feature matrix corresponding to each history archive data, including: performing binary matrix processing on each history file data in the preprocessed history file data set to obtain a feature matrix corresponding to each history file data; and calculating the probability of each feature in the feature matrix, and taking the probability of each feature as the weight value of each feature.

In some embodiments of the present disclosure, calculating, based on the feature matrix corresponding to each piece of historical archive data, the association parameter between every two pieces of historical archive data in the historical archive data set includes: determining whether first historical archive data and second historical archive data have a same feature; and, in a case where a same feature is determined to exist, determining the association parameter between the first historical archive data and the second historical archive data according to a first weight value of the same feature in the first historical archive data and a second weight value of the same feature in the second historical archive data.

In some embodiments of the present disclosure, determining, according to the association parameter between every two pieces of historical archive data in the historical archive data set, the historical archive data associated with each piece of historical archive data in the historical archive data set includes: determining two pieces of historical archive data whose association parameter is greater than a preset threshold as historical archive data that are associated with each other.

From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects: the archive data is divided by type into structured data and unstructured data, which are processed separately and then combined to obtain preprocessed archive data; the preprocessed archive data is input into and processed by the target classification model to obtain the association relationships among the archive data. In this way, invisible associations among the archive data can be established, so that a manager can find the required invisibly associated data when searching, the archive data can be managed conveniently, and the data search results are optimized.

An embodiment of the present disclosure further provides a computer device. Fig. 7 is a schematic structural diagram of a computer device based on the archive data processing method provided by the embodiments of the present disclosure. Referring to Fig. 7, the computer device may specifically include an input device 71, a processor 72, and a memory 73. The memory 73 is used to store processor-executable instructions. The processor 72, when executing the instructions, implements the steps of the archive data processing method described in any of the embodiments above.

In this embodiment, the input device may specifically be one of the main apparatuses for exchanging information between a user and the computer system. The input device may include a keyboard, a mouse, a camera, a scanner, a light pen, a handwriting input board, a voice input device, and the like; the input device is used to input raw data and programs for processing the data into the computer. The input device may also acquire and receive data transmitted from other modules, units, and devices. The processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on. The memory may specifically be a memory device used in modern information technology for storing information. The memory may exist at multiple levels: in a digital system, anything capable of storing binary data may be a memory; in an integrated circuit, a circuit that has a storage function but no physical form is also called a memory, such as a RAM or a FIFO; and in a system, a storage device in physical form is also called a memory, such as a memory module or a TF card.

In this embodiment, the specific functions and effects of the computer device may be understood with reference to the other embodiments, and are not described again here.

There is also provided in an embodiment of the present specification a computer storage medium based on an archive data processing method, the computer storage medium storing computer program instructions which, when executed, implement the steps of the archive data processing method described in any of the embodiments above.

In the present embodiment, the storage medium includes, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.

In this embodiment, the functions and effects of the program instructions stored in the computer storage medium may be understood with reference to the other embodiments, and are not described again here.

It will be apparent to those skilled in the art that the modules or steps of the embodiments described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, embodiments of the present specification are not limited to any specific combination of hardware and software.

It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the disclosure should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the embodiments of the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the protection scope of the present specification.

Claims

1. A method of archive data processing comprising:

2. The archive data processing method according to claim 1, wherein preprocessing the structured data and the unstructured data respectively to obtain preprocessed structured data and unstructured data includes:

3. The archive data processing method according to claim 1, wherein after inputting the preprocessed target archive data into a target classification model to obtain association relationship data between the target archive data and historical archive data, the method further comprises:

4. The archive data processing method according to claim 3, further comprising:

5. The archive data processing method according to claim 1, wherein the target classification model is constructed by:

6. The archive data processing method according to claim 5, wherein determining the historical archive data associated with each piece of historical archive data in the preprocessed historical archive data set comprises:

7. The archive data processing method according to claim 6, wherein extracting features of each piece of historical archive data in the preprocessed historical archive data set to obtain a feature matrix corresponding to each piece of historical archive data includes:

8. The archive data processing method according to claim 6, wherein calculating the association parameter between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data includes:

9. The archive data processing method according to claim 6, wherein calculating the association parameter between every two pieces of historical archive data in the historical archive data set based on the feature matrix corresponding to each piece of historical archive data includes:

10. The archive data processing method according to claim 6, wherein determining the historical archive data associated with each piece of historical archive data in the historical archive data set based on the association parameter between every two pieces of historical archive data in the historical archive data set comprises:

11. An archive data processing device, comprising:

12. A computer device comprising a processor and a memory for storing processor-executable instructions which when executed by the processor implement the steps of the method of any one of claims 1 to 10.

13. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the steps of the method of any of claims 1 to 10.