Disclosure of Invention
The embodiment of the disclosure provides a method, a device, equipment and a storage medium for analyzing data heat. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present disclosure provides a method for analyzing data heat, including:
receiving query information, reference information, user interaction information, business attribute importance degree information, data release time information and data use time information of a data table to be counted in a data warehouse;
calculating data dimension values of query information, reference information, user interaction information and service attribute importance degree information of a data table to be counted according to a pre-trained linear regression model;
and calculating the heat of the data according to the data dimension value, the data release time information and the data use time information.
In one embodiment, after receiving the user interaction information of the data table to be counted in the data warehouse, the method further includes:
calculating the exposure times, browsing times, praise times and user scores of the data table to be counted;
and calculating a user interaction information value according to the exposure times, the browsing times, the praise times and the user scores.
In one embodiment, after receiving the business attribute importance information of the data table to be counted in the data warehouse, the method further includes:
acquiring a service attribute category corresponding to a data table to be counted;
and inquiring the corresponding business attribute importance degree value according to the business attribute category.
In one embodiment, before calculating the data dimension value according to the pre-trained linear regression model, the method further comprises:
carrying out data dimension numerical value labeling on query information, reference information, user interaction information and service attribute importance degree information of a plurality of data tables;
dividing the marked data into a training set and a test set;
and training the linear regression model according to the training set and the test set to obtain the trained linear regression model.
In one embodiment, the formula for the linear regression model is as follows:
$$S = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4;$$
wherein $S$ represents the data dimension value, $x_1$ represents the number of times the data table is queried within a preset period, $x_2$ represents the number of times the data table is referenced within a preset period, $x_3$ represents the user interaction information value, $x_4$ represents the service attribute importance value, and $w_0, \dots, w_4$ represent weight parameters.
In one embodiment, calculating the heat of the data according to the data dimension value, the data publishing time information and the data using time information comprises calculating the heat of the data according to the following formula:
wherein f represents the data heat, s represents the data dimension value, MageInhours represents the difference between the data publishing time and the current time, and MusedTimeInHour represents the latest using time of the data.
In one embodiment, after calculating the heat of the data, the method further comprises:
and sequencing all the data in the data warehouse from high to low according to the corresponding heat values, and pushing a preset number of data with the heat values ranked in the top to a client for display.
In a second aspect, an embodiment of the present disclosure provides an apparatus for analyzing data heat, including:
the data receiving module is used for receiving query information, reference information, user interaction information, service attribute importance degree information, data release time information and data use time information of a data table to be counted in the data warehouse;
the first calculation module is used for calculating data dimension values of query information, reference information, user interaction information and service attribute importance degree information of a data table to be counted according to a pre-trained linear regression model;
and the second calculation module is used for calculating the heat of the data according to the data dimension value, the data release time information and the data use time information.
In a third aspect, the present disclosure provides an apparatus for analyzing data heat, including a processor and a memory storing program instructions, where the processor is configured to execute the method for analyzing data heat provided by the foregoing embodiments when executing the program instructions.
In a fourth aspect, the disclosed embodiments provide a computer-readable medium, on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement a method for analyzing data heat provided by the above embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
The data heat analysis method provided by the embodiment of the disclosure comprehensively considers data information of multiple dimensions, such as the query count, reference count, release time, user behavior and business importance of the data. A linear regression algorithm is used as the model, and the weight of each data dimension is calculated through the model, so that a heat value with high accuracy is obtained. The method is therefore better suited to calculating data heat in a data warehouse in the field of freight transportation.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of systems and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects and means that three relationships are possible; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The data heat analysis method in this embodiment adopts dimensions better suited to the data heat of a data warehouse and uses linear regression to select suitable dimension weights. The data mart tables of the data warehouse and the underlying tables on which they depend are used as an interaction-volume dimension, and the heat is calculated from multiple dimensions, including user-related dimensions. The method is therefore well compatible with dimension expansion and better suited to the data warehouse scenario.
Fig. 1 is a schematic flow chart illustrating a method for analyzing data heat according to an exemplary embodiment, and referring to fig. 1, the method specifically includes the following steps.
S101, receiving query information, reference information, user interaction information, service attribute importance degree information, data release time information and data use time information of a data table to be counted in a data warehouse.
In this embodiment, the analysis of data heat in a freight-field data warehouse is taken as an example. To improve the accuracy of the data heat statistics, statistics are collected from data of multiple dimensions.
First, the query information of the data table to be counted in the data warehouse is acquired. The SQL statement scripts executed by users on the computing platform are parsed to generate syntax data, the tables used are then extracted, and the number of times each table is used is accumulated to obtain the query count of the data table to be counted.
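The query-count step can be sketched as follows. This is a minimal illustration, not the embodiment's parser: a regular expression that captures the identifier after FROM or JOIN stands in for real syntax analysis, and the names `TABLE_REF`, `count_table_queries` and the sample scripts are hypothetical. Counting a table once per script is also an assumption.

```python
import re
from collections import Counter

# Simplified stand-in for syntax analysis: capture the identifier that
# follows FROM or JOIN. A real system would walk the parser's syntax tree.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def count_table_queries(sql_scripts):
    """Return a Counter mapping table name -> number of scripts that used it."""
    counts = Counter()
    for script in sql_scripts:
        # Assumption: a table counts once per script, however often it appears.
        tables = {name.lower() for name in TABLE_REF.findall(script)}
        counts.update(tables)
    return counts


# Hypothetical scripts collected from the computing platform.
scripts = [
    "SELECT * FROM dw.orders o JOIN dw.vehicles v ON o.vid = v.id",
    "select count(*) from dw.orders",
]
query_counts = count_table_queries(scripts)
```

A regular expression misses subqueries, CTE names and quoted identifiers, which is why the embodiment relies on syntax analysis rather than pattern matching.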
Further, the number of references is calculated according to the blood relationship (lineage) of the data table to be counted. This includes building the blood relationship of the tables from the SQL scripts executed by users. The table blood relationship is constructed as follows:
i. collecting DML statements and insert DDL statements;
ii. performing syntax analysis on the statements collected in step i to generate an abstract syntax tree;
iii. traversing the syntax tree to acquire the inputTable and outputTable information in the syntax tree;
iv. building the relationships obtained in step iii into a tree-structured blood relationship.
Fig. 2 is a schematic diagram illustrating a data blood relationship according to an exemplary embodiment. As shown in fig. 2, data table 2 references the data table, and data table 3 and data table 4 reference the data table. The number of references of the data table to be counted can be calculated from this blood relationship.
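Steps iii and iv and the reference count can be sketched as below, assuming the (inputTable, outputTable) pairs have already been extracted from the syntax trees. The function names and the sample edges are hypothetical and are not taken from fig. 2. Whether the embodiment counts only direct references or all downstream tables is not stated, so both are shown.

```python
from collections import defaultdict


def build_lineage(edges):
    """edges: iterable of (input_table, output_table) pairs.

    Returns a dict mapping each table to the set of tables that read it directly.
    """
    downstream = defaultdict(set)
    for input_table, output_table in edges:
        downstream[input_table].add(output_table)
    return downstream


def reference_count(downstream, table, transitive=False):
    """Number of tables referencing `table`: direct only, or all descendants."""
    if not transitive:
        return len(downstream.get(table, ()))
    seen = set()
    stack = [table]
    while stack:
        for child in downstream.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    seen.discard(table)  # guard against cycles that lead back to the start
    return len(seen)


# Hypothetical lineage: table2 reads table1; table3 and table4 read table2.
edges = [("table1", "table2"), ("table2", "table3"), ("table2", "table4")]
lineage = build_lineage(edges)
```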
Further, the method also comprises the steps of collecting user interaction information of the data table to be counted, and calculating a user interaction value according to the received user interaction information, wherein the steps comprise: and calculating the exposure times, the browsing times, the praise times and the user scores of the data table to be counted, and calculating the user interaction information value according to the exposure times, the browsing times, the praise times and the user scores. In one possible implementation, the user interaction information value is calculated according to the following formula:
Furthermore, the method also comprises collecting business attribute importance information of the data table to be counted. The business attribute importance information indicates how important the business corresponding to the data table is. For example, in the field of freight transportation, vehicle data has a higher importance and financial data has a lower importance. The importance is described by a business attribute importance value.
In one embodiment, a service importance information table is set according to the importance of the service corresponding to the data, and the service importance information table stores different types of services and importance values corresponding to the different types of services. As shown in the following table:
| Business | Service attribute importance degree value |
|---|---|
| Vehicle | 100 |
| User | 80 |
| Enterprise | 60 |
| Others | 40 |
The service attribute category corresponding to the data table to be counted is obtained, and the service importance information table is queried by that category to obtain the corresponding service attribute importance value. Taking the service attribute information into account yields high-heat data that better fits the application field and improves the applicability of the data.
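The lookup can be sketched as a simple mapping that uses the values from the table above. The category keys, the function name and the choice to fall back to the "Others" value for unknown categories are assumptions made for illustration.

```python
# Importance values from the service importance information table.
IMPORTANCE_BY_CATEGORY = {"vehicle": 100, "user": 80, "enterprise": 60}
OTHERS_IMPORTANCE = 40  # the "Others" row


def business_importance(category):
    """Return the importance value for a category, defaulting to 'Others'."""
    return IMPORTANCE_BY_CATEGORY.get(category.strip().lower(), OTHERS_IMPORTANCE)
```

For example, `business_importance("Vehicle")` returns 100, and a category that is not in the table, such as a hypothetical "finance", falls back to 40.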
And finally, collecting and storing the time information of the data, wherein the time information comprises the publishing time information of the data and the using time information of the data.
According to the step, the data of multiple dimensions of the data table to be counted are collected and analyzed, and the accuracy and the applicability of heat calculation are improved.
S102, calculating data dimension values of query information, reference information, user interaction information and service attribute importance degree information of the data table to be counted according to the pre-trained linear regression model.
In order to improve the calculation accuracy, the embodiment of the present disclosure calculates the weight of each dimension data by using a linear regression model, and obtains the numerical value of each data dimension.
Specifically, firstly, query information, reference information, user interaction information and service attribute importance information of a large number of data tables are obtained, and query times, reference times, user interaction information values and service attribute importance values of each data table are calculated according to the obtained data.
Then, professional business personnel evaluate the heat of each table, and the acquired data set is labeled manually to obtain labeled data. The labeled data are preprocessed, for example by filling null values with zero, deleting abnormal data, and normalizing the dimension data, to obtain preprocessed labeled data. The preprocessed labeled data are divided into a training set and a test set, and a linear regression model is trained to obtain the trained linear regression model. The dimension weight parameters in this embodiment are obtained by training on a large amount of data, are better suited to the application scenario, and solve the problem in the prior art that manual labeling and rule setting are inaccurate.
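The preprocessing and split described above can be sketched in plain Python. Several details are assumptions rather than part of the embodiment: negative values are treated as abnormal data, normalization is min-max scaling per dimension, and the split ratio is 80/20 with a fixed shuffle seed. All names and sample values are hypothetical.

```python
import random


def preprocess(samples):
    """samples: list of (features, label); None marks a missing feature value."""
    # Fill null values with zero.
    filled = [([0.0 if v is None else float(v) for v in feats], label)
              for feats, label in samples]
    # Assumed anomaly rule: a negative count or score is invalid.
    cleaned = [(f, y) for f, y in filled if all(v >= 0 for v in f)]
    # Min-max normalize each dimension to [0, 1].
    columns = list(zip(*(f for f, _ in cleaned)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        ([(v - lo) / (hi - lo) if hi > lo else 0.0
          for v, lo, hi in zip(f, lows, highs)], y)
        for f, y in cleaned
    ]


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle deterministically and return (training set, test set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled rows: (query count, reference count, interaction, importance).
raw = [
    ([10, 2, None, 100], 0.9),
    ([0, 0, 1.0, 40], 0.1),
    ([-5, 1, 0.5, 60], 0.3),  # dropped by the anomaly rule
    ([5, 4, 0.5, 80], 0.6),
]
processed = preprocess(raw)
train_set, test_set = train_test_split(processed)
```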
In one embodiment, the formula for the linear regression model is as follows:
$$S = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4$$
wherein $S$ represents the data dimension value, $x_1$ represents the number of times the data table is queried within a preset period, $x_2$ represents the number of times the data table is referenced within a preset period, $x_3$ represents the user interaction information value, $x_4$ represents the service attribute importance value, and $w_0, \dots, w_4$ represent weight parameters.
In an alternative embodiment, the data dimension may be extended according to the following formula:
$$S = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + \dots + w_n x_n;$$
wherein $x_n$ represents an extended data dimension, $S$ represents the data dimension value, $x_1$ represents the number of times the data table is queried within a preset period, $x_2$ represents the number of times the data table is referenced within a preset period, $x_3$ represents the user interaction information value, $x_4$ represents the service attribute importance value, and $w_0, \dots, w_n$ represent weight parameters. The linear regression model provided by the embodiment of the disclosure is well compatible with dimension expansion, and a person skilled in the art can expand the data dimensions according to the practical application.
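Evaluating the linear model for any number of dimensions can be sketched as below. The function name and the weight and feature values are hypothetical placeholders chosen so the arithmetic is easy to check; in the embodiment the weights come from training.

```python
def dimension_value(weights, features):
    """Compute S = w0 + w1*x1 + ... + wn*xn, where weights[0] is the intercept w0."""
    if len(weights) != len(features) + 1:
        raise ValueError("expected one weight per feature plus an intercept")
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))


# Four base dimensions: query count, reference count, interaction value, importance.
base = dimension_value([1.0, 0.5, 0.25, 2.0, 0.125], [4, 8, 1.5, 80])

# Adding a fifth, extended dimension only appends one weight and one feature.
extended = dimension_value([1.0, 0.5, 0.25, 2.0, 0.125, 1.0], [4, 8, 1.5, 80, 2])
```

Because the model is a plain weighted sum, extending it to a new dimension changes neither the function nor the existing weights' positions.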
In a possible implementation, when the linear regression model is trained, model parameters can be selected through 5-fold cross validation and a grid search over hyperparameters. The configuration with the best RMSE value is selected and evaluated on the test set, and the model parameters are adjusted until the performance on the test set is optimal and close to that on the training set, so that the trained linear regression model is obtained.
The 5-fold cross validation method comprises the following steps. Step 1: divide the data set into 5 parts. Step 2: select one part as the test set and use the other four parts as the training set. Step 3: perform step 2 five times, selecting a different part as the test set each time. Evaluating the model by cross validation can improve the accuracy of model training.
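The three steps and the RMSE score can be sketched as follows. The model is passed in as `fit` and `predict` callables, so the sketch does not depend on a particular regression library. All names are hypothetical, the folds are interleaved on the assumption that the data has already been shuffled, and the mean predictor at the end is only a placeholder baseline, not the linear regression model.

```python
import math


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) k times; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]  # step 1
    for i, test_idx in enumerate(folds):                      # steps 2 and 3
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def cross_validate(xs, ys, fit, predict, k=5):
    """Average RMSE over k folds. fit(xs, ys) -> model; predict(model, xs) -> list."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = predict(model, [xs[i] for i in test_idx])
        scores.append(rmse(preds, [ys[i] for i in test_idx]))
    return sum(scores) / k


# Placeholder baseline: always predict the mean label of the training fold.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, xs):
    return [model] * len(xs)
```

In a grid search, `cross_validate` would be called once per candidate configuration and the configuration with the lowest average RMSE kept.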
After the trained linear regression model is obtained, the query times, the reference times, the user interaction information values and the business attribute importance degree values of the data table to be counted in the step S101 are input into the linear regression model to obtain data dimension values.
S103, calculating the heat of the data according to the data dimension value, the data release time information and the data use time information.
In one embodiment, calculating the heat of the data according to the data dimension value, the data publishing time information and the data using time information comprises calculating the heat of the data according to the following formula:
wherein f represents the data heat, s represents the data dimension value, MageInhours represents the difference between the data publishing time and the current time, and MusedTimeInHour represents the latest using time of the data.
In an optional embodiment, after the heat of the data is calculated, the heat value is stored in the metadata. The method further includes sorting all the data in the data warehouse from high to low according to the corresponding heat values, and pushing a preset number of top-ranked data to the client for display. When a user searches or views data assets, data with higher heat can be displayed more prominently, so that the user can quickly find assets of interest. Furthermore, after the data is released, the value of the assets can be evaluated through the heat, and resources can be invested in the iterative development of data assets with higher heat, improving data quality and expanding the scenarios of data application.
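Selecting the preset number of top-ranked tables can be sketched with the standard-library `heapq.nlargest`, which avoids sorting the whole warehouse when only a few entries are pushed to the client. The function name, table names and heat values are hypothetical.

```python
import heapq


def top_n_by_heat(heat_by_table, n):
    """Return the n (table, heat) pairs with the highest heat, highest first."""
    return heapq.nlargest(n, heat_by_table.items(), key=lambda item: item[1])


# Hypothetical heat values read back from the metadata.
heat_by_table = {"dw.orders": 9.5, "dw.vehicles": 7.0, "dw.tmp_export": 3.0}
hottest = top_n_by_heat(heat_by_table, 2)
```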
In an optional embodiment, after the heat of the data is calculated, the method further includes obtaining the types of the data tables, classifying the data according to the types, sorting the data in each type from high to low according to the heat value, and adding a preset number of top-ranked data to the heat information table. The data in the heat information table are stored by category, giving the data with higher heat in each data type. Data with higher heat can thus be displayed more prominently, so that the user can quickly find assets of interest.
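The per-type ranking can be sketched as a group-then-sort step. The function name, the tuple layout and the sample table types are assumptions for illustration.

```python
from collections import defaultdict


def top_n_per_type(tables, n):
    """tables: iterable of (name, table_type, heat).

    Returns {table_type: [(name, heat), ...]} with at most n entries per type,
    ordered from highest to lowest heat.
    """
    by_type = defaultdict(list)
    for name, table_type, heat in tables:
        by_type[table_type].append((name, heat))
    return {
        table_type: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for table_type, items in by_type.items()
    }


# Hypothetical tables with their types and heat values.
tables = [
    ("orders_daily", "fact", 9.5),
    ("orders_hourly", "fact", 4.0),
    ("orders_raw", "fact", 6.0),
    ("vehicle_dim", "dimension", 7.0),
]
heat_table = top_n_per_type(tables, 2)
```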
In an optional embodiment, with the development of big data platforms, big data centers such as large-scale data warehouses and data lakes are increasingly common. While continuously accumulating data, these data centers also bring storage and performance pressure. Therefore, after the heat of the data is calculated, the method further comprises cleaning up data whose heat value has stayed below a preset heat threshold within a preset period. For example, the storage time of the data is acquired, and the data is cleaned up automatically when its storage time is greater than a preset time threshold and its heat is lower than the preset heat threshold. Alternatively, the data meeting the cleanup condition can be sent to an administrator, and the data is cleaned up after a deletion instruction from the administrator is received.
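The cleanup condition can be sketched as a filter over per-table statistics. The record layout, the use of hours as the time unit and the threshold values are assumptions. The function only lists candidates, which fits both the automatic path and the path that waits for an administrator's deletion instruction.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    storage_hours: float  # how long the data has been stored (assumed unit: hours)
    heat: float


def cleanup_candidates(tables, max_storage_hours, heat_threshold):
    """Names of tables stored longer than the time threshold AND colder than the heat threshold."""
    return [
        t.name
        for t in tables
        if t.storage_hours > max_storage_hours and t.heat < heat_threshold
    ]


# Hypothetical statistics and thresholds.
stats = [
    TableStats("tmp_export_2019", storage_hours=9000, heat=0.2),
    TableStats("orders_daily", storage_hours=9000, heat=8.1),
    TableStats("new_scratch", storage_hours=12, heat=0.1),
]
to_clean = cleanup_candidates(stats, max_storage_hours=24 * 180, heat_threshold=1.0)
```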
The data heat degree analysis method provided by the embodiment of the disclosure considers not only the query use times of data, but also the data of multiple dimensions such as the service attribute importance degree of the data, the user interaction information and the like, uses a linear regression algorithm as a model, calculates the weight of each data dimension through the model, can continuously train and adjust the model in the actual use process, obtains the weight parameters which more accord with the data in the freight transportation field, and further obtains the heat degree value with higher accuracy. And convenience is brought to later data application and expansion.
The embodiment of the present disclosure further provides an apparatus for analyzing data heat, which is configured to execute the method for analyzing data heat of the foregoing embodiment, as shown in fig. 3, the apparatus includes:
the data receiving module 301 is configured to receive query information, reference information, user interaction information, service attribute importance information, data publishing time information, and data using time information of a data table to be counted in a data warehouse;
the first calculation module 302 is configured to calculate data dimension values of query information, reference information, user interaction information, and service attribute importance information of a data table to be counted according to a pre-trained linear regression model;
the second calculating module 303 is configured to calculate the heat of the data according to the data dimension value, the data publishing time information, and the data using time information.
It should be noted that, when the data heat analysis apparatus provided in the foregoing embodiment executes the data heat analysis method, only the division of the functional modules is taken as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules, so as to complete all or part of the functions described above. In addition, the data heat degree analysis device provided by the above embodiment and the data heat degree analysis method embodiment belong to the same concept, and the detailed implementation process is shown in the method embodiment, which is not described herein again.
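One possible way to wire the receiving module and the two calculation modules together is sketched below. It is only an illustration of the module division: the class and field names are hypothetical, and both the dimension model and the heat formula are injected as callables so that the sketch makes no assumption about their form. The lambdas at the end are placeholders that only demonstrate the wiring.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TableSignals:
    """Inputs gathered by the data receiving module for one table."""
    features: Sequence[float]   # query count, reference count, interaction value, importance
    age_in_hours: float         # time since the data was released
    used_time_in_hours: float   # latest use time of the data


class HeatAnalyzer:
    """Chains the first and second calculation modules."""

    def __init__(
        self,
        dimension_model: Callable[[Sequence[float]], float],
        heat_formula: Callable[[float, float, float], float],
    ):
        self._dimension_model = dimension_model  # first calculation module
        self._heat_formula = heat_formula        # second calculation module

    def analyze(self, signals: TableSignals) -> float:
        s = self._dimension_model(signals.features)
        return self._heat_formula(s, signals.age_in_hours, signals.used_time_in_hours)


# Placeholder callables, used only to show the wiring.
analyzer = HeatAnalyzer(
    dimension_model=lambda features: sum(features),
    heat_formula=lambda s, age, used: s,
)
heat = analyzer.analyze(TableSignals([1.0, 2.0, 3.0, 4.0], 24.0, 2.0))
```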
The embodiment of the present disclosure further provides an electronic device corresponding to the method for analyzing data heat provided by the foregoing embodiment, so as to execute the method for analyzing data heat.
Referring to fig. 4, a schematic diagram of an electronic device provided in some embodiments of the present application is shown. As shown in fig. 4, the electronic apparatus includes: a processor 400, a memory 401, a bus 402 and a communication interface 403, wherein the processor 400, the communication interface 403 and the memory 401 are connected through the bus 402; the memory 401 stores a computer program that can be executed on the processor 400, and the processor 400 executes the computer program to perform the method for analyzing the heat of data provided by any of the foregoing embodiments of the present application.
The Memory 401 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 403 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used.
Bus 402 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 401 is used for storing a program, and the processor 400 executes the program after receiving an execution instruction, and the method for analyzing data heat disclosed in any of the foregoing embodiments of the present application may be applied to the processor 400, or implemented by the processor 400.
Processor 400 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 400. The processor 400 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. It may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 401, and the processor 400 reads the information in the memory 401 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the application is based on the same inventive concept as the method for analyzing data heat provided by the embodiment of the application, and has the same beneficial effects as the method it adopts, runs or implements.
Referring to fig. 5, the computer readable storage medium is an optical disc 500, on which a computer program (i.e., a program product) is stored, and when the computer program is executed by a processor, the computer program performs the method for analyzing the heat of data provided by any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above embodiments of the present application is based on the same inventive concept as the method for analyzing data heat provided by the embodiments of the present application, and has the same beneficial effects as the method adopted, run or implemented by the application program stored on it.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.