CN111339175B - Data processing method, device, electronic equipment and readable storage medium - Google Patents
Data processing method, device, electronic equipment and readable storage medium Download PDFInfo
- Publication number
- CN111339175B (application number CN202010133602.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- monitoring
- integrated
- integrated data
- monitoring data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application provides a data processing method, a device, an electronic device and a readable storage medium, comprising: obtaining a plurality of pieces of original data; sending the plurality of pieces of original data to a distributed message queue Kafka; obtaining the plurality of pieces of original data from Kafka by using Flink and performing ETL processing to obtain a plurality of pieces of integrated data, wherein each piece of integrated data carries a tag; writing each piece of integrated data into a corresponding database according to its tag; and collecting monitoring data from Kafka, Flink and the API interfaces corresponding to the databases, and monitoring the monitoring data, wherein the monitoring data comprises the original data and the integrated data. The data processing method provided by the embodiment of the application can use Flink to perform ETL processing on the original data and monitor the data generated at each stage of the entire data processing flow. Monitoring the data generated at each stage makes it easier to grasp how the data changes as it is processed, unifies the data calibers, and helps ensure data consistency.
Description
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data processing method, a device, an electronic apparatus, and a readable storage medium.
Background
In the prior art, the demand for real-time data keeps growing, and many independent real-time tasks cause a great waste of cluster resources and incur high development and operation costs, so a unified real-time data warehouse is needed to improve task extensibility and save cluster resources.
The existing real-time data warehouse is usually developed based on a chimney (siloed) architecture, in which each requirement usually has its own task. Individual requirements are computed separately, data calibers are often inconsistent, and data consistency is difficult to guarantee.
Disclosure of Invention
An embodiment of the application aims to provide a data processing method, a device, electronic equipment and a readable storage medium, which are used to solve the prior-art problems that data calibers are not unified and data consistency is difficult to guarantee.
In a first aspect, an embodiment of the present application provides a data processing method, where the method includes: obtaining a plurality of original data; transmitting the plurality of raw data to a distributed message queue Kafka; obtaining the plurality of original data from Kafka by utilizing a Flink and carrying out ETL processing to obtain a plurality of integrated data, wherein the integrated data is provided with a label; writing each integrated data into a corresponding database according to the label of each integrated data; and collecting monitoring data from the Kafka, the Flink and the API interfaces corresponding to the databases, and monitoring the monitoring data, wherein the monitoring data comprises original data and integrated data.
In the above embodiment, each piece of integrated data carries a tag, according to which it can be written into the corresponding database. The data processing method provided by the embodiment of the application can use Flink to perform ETL processing on the original data and monitor the data generated at each stage of the entire data processing flow. Monitoring the data generated at each stage makes it easier to grasp how the data changes as it is processed, unifies the data calibers, and guarantees data consistency.
In one possible design, the monitoring of the monitoring data includes: determining whether a specific value of the monitoring data exceeds its corresponding numerical range; if so, issuing an alarm signal.
In the above embodiment, when the monitoring data is monitored, a specific value of the monitoring data may be obtained and compared against the numerical range corresponding to that monitoring data; if the value exceeds the range, the data is abnormal, and an alarm signal is issued to alert the user.
In one possible design, the monitoring of the monitoring data includes: determining whether a heartbeat signal of the monitoring data is received within a preset time period; if not, issuing an alarm signal.
In the above embodiment, when the monitoring data is monitored, a timer may be started each time a heartbeat signal corresponding to the monitoring data is received; if the elapsed time since the heartbeat was last received exceeds the preset time period, the corresponding monitoring data is abnormal, and an alarm signal is issued to alert the user.
In one possible design, the databases include Redis; writing each piece of integrated data into the corresponding database according to its tag includes the following step: if the tag of the integrated data indicates that frequent query and read-write actions need to be performed on the integrated data, writing the integrated data into Redis.
In this embodiment, data stored in Redis can be queried and read or written more conveniently and quickly; storing the integrated data that meets this requirement in Redis improves the running speed and efficiency of the overall data processing method.
In one possible design, the databases include MySQL; writing each piece of integrated data into the corresponding database according to its tag includes the following step: if the tag of the integrated data indicates that the data volume of the integrated data is small or that the integrated data needs to be provided to a fixed report, writing the integrated data into MySQL.
In this embodiment, data stored in MySQL can be more conveniently provided to fixed reports; storing the integrated data that meets this requirement in MySQL improves the running speed and efficiency of the overall data processing method.
In one possible design, the databases include GreenPlum; writing each piece of integrated data into the corresponding database according to its tag includes the following step: if the tag of the integrated data indicates that multi-dimensional analysis needs to be performed on the integrated data, writing the integrated data into GreenPlum.
In this embodiment, data stored in GreenPlum can be more conveniently subjected to multi-dimensional analysis; storing the integrated data that meets this requirement in GreenPlum improves the running speed and efficiency of the overall data processing method.
In one possible design, after each piece of integrated data is written into the corresponding database according to its tag, the method further includes: using a unified service platform to acquire and process data from at least one of the plurality of databases according to the service type.
In the above embodiment, the unified service platform supports multiple service types; it can acquire data from the corresponding database according to the service type, and each request may draw from any of the multiple databases, thereby multiplexing the databases and reducing cost.
In one possible design, the method further comprises: analyzing the operation rules of the monitoring data by using analysis code to obtain the source and destination of the monitoring data and the operations the monitoring data has undergone; and determining the topological relationship among the monitoring data according to that source, destination and operation history.
In the above embodiment, the analysis code can be used to analyze the real-time processing logic of the monitoring data, recording where each piece of data is obtained from and where it is sent after processing, thereby constructing the lineage (blood) relationships between data and improving metadata management.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including: the original data acquisition module is used for acquiring a plurality of original data; the data storage module is used for sending the plurality of original data to a distributed message queue Kafka; the integrated data acquisition module is used for acquiring the plurality of original data from Kafka by utilizing the Flink and carrying out ETL processing to acquire a plurality of integrated data, wherein the integrated data is provided with a label; the data writing module is used for writing each integrated data into a corresponding database according to the label of each integrated data; and the data monitoring module is used for collecting monitoring data from the Kafka, the Flink and the API interfaces corresponding to the databases and monitoring the monitoring data, wherein the monitoring data comprises original data and integrated data.
In one possible design, the data monitoring module is specifically configured to determine whether a specific value of the monitored data exceeds a corresponding numerical range; if yes, an alarm signal is sent out.
In one possible design, the data monitoring module is specifically configured to determine whether a heartbeat signal of the monitoring data is received within a preset time period; if not, an alarm signal is sent out.
In one possible design, the databases include Redis; the data writing module is specifically configured to write the integrated data into Redis when the tag of the integrated data indicates that frequent query and read-write actions need to be performed on it.
In one possible design, the databases include MySQL; the data writing module is specifically configured to write the integrated data into MySQL when the tag of the integrated data indicates that the data volume of the integrated data is small or that the integrated data needs to be provided to a fixed report.
In one possible design, the databases include GreenPlum; the data writing module is specifically configured to write the integrated data into GreenPlum when the tag of the integrated data indicates that multi-dimensional analysis needs to be performed on it.
In one possible design, the apparatus further comprises: and the platform processing module is used for acquiring and processing data from at least one database in the plurality of databases according to the service type by utilizing the unified service platform.
In one possible design, the apparatus further comprises: the rule analysis module is used for analyzing the operation rule of the monitoring data by utilizing the analysis code to obtain the source and the destination of the monitoring data and the operation process of the monitoring data; and the topological relation determining module is used for determining the topological relation among the monitoring data according to the source and the destination of the monitoring data and the operation process of the monitoring data.
In a third aspect, the present application provides an electronic device comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect or any alternative implementation of the first aspect.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program which when executed by a processor performs the method of the first aspect or any alternative implementation of the first aspect.
In a fifth aspect, the application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any of the possible implementations of the first aspect.
In order to make the above objects, features and advantages of the embodiments of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of step S140 in FIG. 1;
FIG. 3 is a flowchart illustrating a portion of steps of a data processing method according to an embodiment of the present application;
fig. 4 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
In a comparative embodiment, the existing real-time data warehouse is developed based on a chimney (siloed) architecture, in which individual requirements are computed separately, data calibers are often inconsistent, and data consistency is difficult to guarantee. The data processing method provided by the embodiment of the application can use Flink to perform ETL processing on the original data and monitor the data generated at each stage of the entire data processing flow; this makes it easier to grasp how the data changes as it is processed, unifies the data calibers, and guarantees data consistency.
Fig. 1 is a schematic diagram of a data processing method according to an embodiment of the present application, where the data processing method may be executed by an electronic device, and the electronic device may be a server or a terminal device, and the data processing method according to the embodiment of the present application includes steps S110 to S150 as follows:
step S110, obtaining a plurality of original data.
The raw data includes data generated by the business system and event data. Data generated by the business system includes user transaction information, complaint information, and the like; event data includes user browsing logs, gateway logs, and the like.
The raw data may be obtained as follows: it is collected with data collection tools, typically by accessing the data source; such tools include Canal, Flume, OGG for Big Data, and the like.
Step S120, transmitting the plurality of raw data to the distributed message queue Kafka.
Data generated by the business system, usually in the form of a MySQL binlog, is fed into Kafka; event data, which typically exists in JSON format, is also fed into Kafka.
Step S130, obtaining the plurality of original data from Kafka by utilizing Flink and performing ETL processing to obtain a plurality of integrated data, wherein each piece of integrated data carries a tag.
The raw data is obtained from Kafka using the Flink real-time computing framework, and Extract-Transform-Load (ETL) processing is performed on it to obtain tagged integrated data; the tag reflects an attribute of the integrated data.
A real-time data warehouse built on the Flink real-time computing framework is highly extensible, and each module of the system has high cohesion and low coupling, which facilitates subsequent architectural evolution. The data model has clear layers; when the business changes, fast iteration can be achieved by changing only the bottom-layer logic and assembling by module, so the change is responded to quickly. Construction and maintenance costs are low, modeling is unified, repeated computation is reduced, and data consistency is guaranteed.
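The ETL step of S130 can be sketched as a plain transformation function. This is a minimal illustration, not the patent's implementation: the field names (`user_id`, `amount`, `hot`, `for_report`) and the tag vocabulary are hypothetical, since the patent does not specify them.

```python
def etl_transform(raw_record: dict) -> dict:
    """Clean one raw record and attach a tag reflecting its attributes (S130)."""
    integrated = {
        "user_id": raw_record.get("user_id"),
        "amount": float(raw_record.get("amount", 0)),  # normalize string amounts
    }
    # The tag later decides which database the record is written to (S140).
    if raw_record.get("hot"):            # needs frequent query and read-write
        integrated["tag"] = "frequent_rw"
    elif raw_record.get("for_report"):   # small volume, used by fixed reports
        integrated["tag"] = "fixed_report"
    else:                                # needs multi-dimensional analysis
        integrated["tag"] = "multi_dim"
    return integrated

record = etl_transform({"user_id": "u1", "amount": "3.5", "hot": True})
```

In a real deployment this logic would run inside a Flink job reading from Kafka; here it is reduced to a pure function for clarity.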
And step S140, writing each integrated data into a corresponding database according to the label of each integrated data.
Optionally, referring to fig. 2, in a specific embodiment, the databases may include Redis, MySQL, and GreenPlum, and step S140 may include the following steps S141 to S143:
in step S141, if the tag of the integrated data indicates that the integrated data needs to be frequently queried and read/written, the integrated data is written into the Redis.
In step S142, if the tag of the integrated data indicates that the data volume of the integrated data is small or that the integrated data needs to be provided to a fixed report, the integrated data is written into MySQL.
In step S143, if the tag of the integrated data indicates that multi-dimensional analysis needs to be performed on the integrated data, the integrated data is written into GreenPlum.
Storing the integrated data that meets each requirement in the corresponding database improves the running speed and efficiency of the overall data processing method.
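Steps S141 to S143 amount to a tag-to-database routing table, which can be sketched as follows. The tag values and the in-memory `sinks` stand-in for real database writes are illustrative assumptions.

```python
def route_integrated_data(record: dict, sinks: dict) -> str:
    """Write one tagged record to the sink matching its tag (steps S141-S143)."""
    tag_to_db = {
        "frequent_rw": "redis",      # S141: frequent query and read-write actions
        "fixed_report": "mysql",     # S142: small volume / fixed reports
        "multi_dim": "greenplum",    # S143: multi-dimensional analysis
    }
    target = tag_to_db[record["tag"]]
    sinks[target].append(record)     # stand-in for the real database write
    return target

sinks = {"redis": [], "mysql": [], "greenplum": []}
chosen = route_integrated_data({"tag": "fixed_report", "v": 1}, sinks)
```

Keeping the routing in one table means a new database or tag only extends the mapping, without touching the ETL logic upstream.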
After step S140, step S150 is performed: monitoring data is collected from Kafka, Flink and the API interfaces corresponding to the databases, and the monitoring data is monitored.
The monitoring data comprises the original data and the integrated data, and the API interfaces may be interface services provided by Elasticsearch and HBase. Through the API interfaces and programs provided by Kafka, Flink, the multiple databases and other components, staff can monitor the state of each task, thereby realizing full-link monitoring from data access and data processing to data serving, as well as monitoring of data quality. For data-quality monitoring, the monitoring data is typically computed in real time according to monitoring rules, which include data-volume fluctuation, primary-key uniqueness, and the like. The monitoring results of the monitoring data and of the data quality can be displayed through the open-source tool Grafana.
After the plurality of original data is obtained, it is stored in Kafka; Flink then obtains the original data from Kafka and performs ETL processing to obtain integrated data (i.e., the processed original data), each piece of which carries a tag. Each piece of integrated data can be written into the corresponding database according to its tag. The original data and the integrated data may be collected from Kafka, Flink and the API interfaces corresponding to the multiple databases and monitored. The data processing method provided by the embodiment of the application can use Flink to perform ETL processing on the original data and monitor the data generated at each stage of the entire data processing flow, so that the data calibers are unified and data consistency is guaranteed.
Optionally, in a specific embodiment, step S150 includes: determining whether a specific value of the monitoring data exceeds its corresponding numerical range; if so, issuing an alarm signal.
When the monitoring data is monitored, a specific value of the monitoring data can be obtained and compared against the numerical range corresponding to that monitoring data; if the value exceeds the range, the data is abnormal, and an alarm signal is issued to alert the user.
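The range check just described can be sketched as a small function; the metric name and bounds are hypothetical examples, and the returned dict stands in for whatever alarm signal the system actually emits.

```python
def check_value_range(name: str, value: float, low: float, high: float):
    """Return an alarm signal if a monitored value leaves its [low, high] range."""
    if low <= value <= high:
        return None  # value within the corresponding numerical range: no alarm
    return {"alarm": name, "value": value, "expected": (low, high)}

# e.g. row-count fluctuation monitoring: 900..1100 rows expected per window
ok = check_value_range("row_count", 950, 900, 1100)
alarm = check_value_range("row_count", 120, 900, 1100)
```

A rule engine would evaluate many such (metric, range) pairs in real time over the collected monitoring data.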
Step S150 may further include: determining whether a heartbeat signal of the monitoring data is received within a preset time period; if not, issuing an alarm signal.
When the monitoring data is monitored, a timer can be started each time a heartbeat signal corresponding to the monitoring data is received. If the elapsed time since the heartbeat was last received exceeds the preset time period, the corresponding monitoring data is abnormal; for example, if the heartbeat signal of Kafka is not received within the corresponding preset time period, the Kafka service can be judged unavailable, and an alarm signal is issued to alert the user.
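The heartbeat timeout can be expressed as a comparison between the time since the last heartbeat and the preset period. Timestamps here are plain floats (seconds) and the alarm dict is a placeholder for the real signal.

```python
def heartbeat_alarm(last_seen: float, now: float, timeout: float):
    """Raise an alarm when no heartbeat has arrived within the preset period."""
    silent_for = now - last_seen
    if silent_for <= timeout:
        return None  # heartbeat is fresh enough: the component is considered alive
    return {"alarm": "heartbeat missing", "silent_for": silent_for}

# e.g. Kafka heartbeat expected at least once per 60 seconds
fresh = heartbeat_alarm(last_seen=100.0, now=130.0, timeout=60.0)
stale = heartbeat_alarm(last_seen=100.0, now=200.0, timeout=60.0)
```

In practice `last_seen` is updated on every received heartbeat and the check runs periodically for each monitored component.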
The alarm signal may be an audible and visual alert, such as an intermittently flashing light or a continuous alarm sound; it may also be a communication message, such as a push message from a communication application, an SMS notification, a telephone notification, and the like. The specific type of alarm signal should not be construed as limiting the application.
Optionally, when the alarm signal is a communication information signal, the format may be as follows:
alarm event ID:1
Alarm type:
job task: task name
Data time: 20200217
Alarm details: task operation failure
Alarm time: 00:00:0
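The text format above can be rendered by a small helper; the argument values below are the placeholders from the example, and the exact field order simply mirrors the format shown.

```python
def format_alarm_message(event_id, alarm_type, task, data_time, detail, alarm_time):
    """Render an alarm as the six-line text format shown above."""
    return "\n".join([
        f"Alarm event ID: {event_id}",
        f"Alarm type: {alarm_type}",
        f"Job task: {task}",
        f"Data time: {data_time}",
        f"Alarm details: {detail}",
        f"Alarm time: {alarm_time}",
    ])

msg = format_alarm_message(1, "job", "task name", "20200217",
                           "Task operation failure", "00:00:00")
```

The rendered string can then be sent through any of the communication channels mentioned above (push message, SMS, telephone notification).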
Optionally, after step S140, the method may further include: and acquiring and processing the data from at least one database in the plurality of databases according to the service type by utilizing the unified service platform.
There are multiple service types. For example, if a certain service type acquires one piece of data per execution, the unified service platform correspondingly acquires one piece of data from the interface of one of the multiple databases (the Redis database, the MySQL database, the GreenPlum database).
For another example, if a certain service type queries data according to a query condition, the unified service platform correspondingly obtains a plurality of pieces of data from one of the multiple databases (the Redis database, the MySQL database, the GreenPlum database) and screens them according to the query condition.
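The two service types just described can be sketched as methods on a platform object. The class name, method names and the in-memory lists standing in for the Redis/MySQL/GreenPlum interfaces are all illustrative assumptions.

```python
class UnifiedServicePlatform:
    """Dispatches requests to one of several databases by service type (sketch)."""

    def __init__(self, databases: dict):
        # In reality these would be connections; here, plain lists of records.
        self.databases = databases

    def get_one(self, db: str):
        """Service type 1: fetch a single piece of data from the chosen database."""
        return self.databases[db][0]

    def query(self, db: str, predicate):
        """Service type 2: screen records from the chosen database by a condition."""
        return [r for r in self.databases[db] if predicate(r)]

platform = UnifiedServicePlatform({
    "mysql": [{"id": 1}, {"id": 2}],
    "redis": [{"id": 9}],
})
```

Because every service type goes through the same platform, each underlying database is reused across services rather than duplicated per requirement.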
Referring to fig. 3, in a specific implementation, the data processing method provided by the embodiment of the present application may further include steps S210 to S220, where step S210 may be performed after step S140, or after the step of using the unified service platform to acquire and process data from at least one of the plurality of databases according to the service type.
Step S210, analyzing the operation rule of the monitoring data by utilizing the analysis code to obtain the source and the destination of the monitoring data and the operation process of the monitoring data.
Step S220, determining a topological relation between the monitoring data according to the source and destination of the monitoring data and the operation process undergone by the monitoring data.
The analysis code can be stored in the distributed version control system Git. The operation rules of the monitoring data may specifically be Flink SQL execution plans and data dependency relationships; analyzing these with the analysis code yields the topological relationship between the monitoring data. The source and destination of the monitoring data can be represented as tables connected through nodes, with the specific operations marked on the nodes.
The analysis code can be used to analyze the real-time processing logic of the monitoring data, recording where each piece of data is obtained from and where it is sent after processing, thereby constructing the lineage (blood) relationships between data and improving metadata management.
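The topology in S220 can be sketched as an adjacency structure built from (source, operation, destination) records. The tuple format and the table names are assumptions; a real implementation would extract them from the Flink SQL execution plans rather than receive them directly.

```python
def build_lineage(ops):
    """Build a lineage topology from (source, operation, destination) records."""
    graph = {}
    for src, op, dst in ops:
        # Each edge records which operation moved data from src to dst.
        graph.setdefault(src, []).append({"op": op, "to": dst})
    return graph

lineage = build_lineage([
    ("kafka.raw_orders", "flink ETL", "redis.hot_orders"),
    ("kafka.raw_orders", "flink ETL", "greenplum.orders_cube"),
])
```

Walking this graph from any table answers "where does this data come from and where does it go", which is exactly the lineage question metadata management needs.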
Referring to fig. 4, fig. 4 shows a data processing apparatus provided by an embodiment of the present application, where the apparatus 400 includes:
the raw data acquisition module 410 is configured to acquire a plurality of raw data.
The data storing module 420 is configured to send the plurality of raw data to the distributed message queue Kafka.
And the integrated data obtaining module 430 is configured to obtain the plurality of raw data from Kafka by using Flink and perform ETL processing to obtain a plurality of integrated data, where each piece of integrated data carries a tag.
The data writing module 440 is configured to write each piece of integrated data into a corresponding database according to the tag of each piece of integrated data.
And the data monitoring module 450 is configured to collect monitoring data from the Kafka, the Flink and the API interfaces corresponding to the databases, and monitor the monitoring data, where the monitoring data includes original data and integrated data.
The data writing module 440 is specifically configured to write the integrated data into Redis when the tag of the integrated data indicates that the integrated data needs to be frequently queried and read or written.
The data writing module 440 is specifically configured to write the integrated data into MySQL when the tag of the integrated data indicates that the data volume of the integrated data is small or that the integrated data needs to be provided to a fixed report.
The data writing module 440 is specifically configured to write the integrated data into GreenPlum when the tag of the integrated data indicates that multi-dimensional analysis needs to be performed on the integrated data.
The data monitoring module 450 is specifically configured to determine whether a specific value of the monitored data exceeds a corresponding numerical range; if yes, an alarm signal is sent out.
The data monitoring module 450 is specifically configured to determine whether a heartbeat signal of the monitoring data is received within a preset time period; if not, an alarm signal is sent out.
The apparatus further comprises:
The platform processing module is configured to obtain and process data from at least one of the plurality of databases according to the service type by using the unified service platform.
The rule analysis module is configured to analyze the operation rules of the monitoring data by using the analysis code to obtain the source and destination of the monitoring data and the operation process of the monitoring data.
The topological relation determining module is configured to determine the topological relations among the monitoring data according to the source and destination of the monitoring data and the operation process of the monitoring data.
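A topological relation of this kind can be represented as a directed graph built from (source, destination) pairs recovered by the rule analysis. The lineage-record shape below is an assumption for illustration, not the format the embodiments actually parse out of the Flink SQL execution plan:

```python
def build_topology(lineage_records):
    """Build adjacency sets mapping each data source to the
    destinations its data flows to, from parsed lineage records."""
    topology = {}
    for rec in lineage_records:
        topology.setdefault(rec["source"], set()).add(rec["destination"])
    return topology

# Hypothetical lineage records, as the rule analysis module
# might emit them after parsing the execution plan.
lineage = [
    {"source": "kafka:raw", "destination": "flink:etl", "op": "consume"},
    {"source": "flink:etl", "destination": "Redis", "op": "write"},
    {"source": "flink:etl", "destination": "GreenPlum", "op": "write"},
]
graph = build_topology(lineage)
```

Such a graph makes the "source and destination" relations queryable, e.g. for tracing which downstream stores are affected when one upstream source is abnormal.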
The working principle of the data processing apparatus provided in the embodiments of the present application is the same as that of the data processing method described above and will not be repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some communication interfaces, devices or units, in electrical, mechanical, or other forms.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (10)
1. A method of data processing, the method comprising:
obtaining a plurality of original data;
transmitting the plurality of raw data to a distributed message queue Kafka;
obtaining the plurality of original data from Kafka by using Flink and performing ETL processing to obtain a plurality of integrated data, wherein each piece of integrated data carries a tag;
writing each integrated data into a corresponding database according to the label of each integrated data;
collecting monitoring data from the Kafka, the Flink and API interfaces corresponding to a plurality of databases, and monitoring the monitoring data, wherein the monitoring data comprises original data and integrated data;
the method further comprises the steps of:
analyzing the operation rules of the monitoring data by using analysis code to obtain the source and destination of the monitoring data and the operation process of the monitoring data, wherein the operation rules are a Flink SQL execution plan and data dependency relationships;
determining the topological relation among the monitoring data according to the source and the destination of the monitoring data and the operation process of the monitoring data;
and analyzing the Flink SQL execution plan and the data dependency relationships by using the analysis code to obtain the topological relations between the monitoring data.
2. The method of claim 1, wherein monitoring the monitoring data comprises:
judging whether the specific value of the monitoring data exceeds the corresponding numerical range;
if yes, an alarm signal is sent out.
3. The method of claim 1, wherein monitoring the monitoring data comprises:
judging whether a heartbeat signal of the monitoring data is received within a preset time period;
if not, an alarm signal is sent out.
4. The method of claim 1, wherein the database comprises a Redis;
the writing each integrated data into the corresponding database according to the label of the integrated data comprises the following steps:
if the tag of the integrated data indicates that frequent query and read-write actions need to be performed on the integrated data, writing the integrated data into Redis.
5. The method of claim 1, wherein the database comprises MySQL;
the writing each integrated data into the corresponding database according to the label of the integrated data comprises the following steps:
if the tag of the integrated data indicates that the data volume corresponding to the integrated data is small or that the integrated data needs to be provided for a fixed report, writing the integrated data into MySQL.
6. The method of claim 1, wherein the database comprises GreenPlum;
the writing each integrated data into the corresponding database according to the label of the integrated data comprises the following steps:
if the tag of the integrated data indicates that multi-dimensional analysis needs to be performed on the integrated data, writing the integrated data into GreenPlum.
7. The method of claim 1, wherein after said writing each piece of the integrated data into a corresponding database according to the tag of the integrated data, the method further comprises:
and acquiring and processing the data from at least one database in the plurality of databases according to the service type by utilizing the unified service platform.
8. A data processing apparatus, the apparatus comprising:
the original data acquisition module is used for acquiring a plurality of original data;
the data storage module is used for sending the plurality of original data to a distributed message queue Kafka;
the integrated data acquisition module is used for acquiring the plurality of original data from Kafka by utilizing the Flink and carrying out ETL processing to acquire a plurality of integrated data, wherein the integrated data is provided with a label;
the data writing module is used for writing each integrated data into a corresponding database according to the label of each integrated data;
the data monitoring module is used for collecting monitoring data from the Kafka, the Flink and the API interfaces corresponding to the databases and monitoring the monitoring data, wherein the monitoring data comprises original data and integrated data;
the device also comprises an analysis module, wherein the analysis module is used for: analyzing the operation rule of the monitoring data by utilizing the analysis code to obtain the source and the destination of the monitoring data and the operation process of the monitoring data; the operation rule is a flinksql execution plan and a data dependency relationship; determining the topological relation among the monitoring data according to the source and the destination of the monitoring data and the operation process of the monitoring data; and analyzing the flinksql execution plan and the data dependency relationship by using the analysis code to obtain the topological relationship between the monitoring data.
9. An electronic device, comprising: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the storage medium via the bus, and the processor executes the machine-readable instructions to perform the method of any one of claims 1-7.
10. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010133602.XA CN111339175B (en) | 2020-02-28 | 2020-02-28 | Data processing method, device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111339175A (en) | 2020-06-26
CN111339175B (en) | 2023-08-11
Family
ID=71184041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010133602.XA Active CN111339175B (en) | 2020-02-28 | 2020-02-28 | Data processing method, device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111339175B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111970195B (en) * | 2020-08-13 | 2022-04-19 | 上海哔哩哔哩科技有限公司 | Data transmission method and streaming data transmission system |
CN112181678A (en) * | 2020-09-10 | 2021-01-05 | 珠海格力电器股份有限公司 | Service data processing method, device and system, storage medium and electronic device |
CN112288907A (en) * | 2020-10-28 | 2021-01-29 | 山东超越数控电子股份有限公司 | Vehicle real-time monitoring method |
CN112287007B (en) * | 2020-10-30 | 2022-02-11 | 常州微亿智造科技有限公司 | Industrial production data real-time processing method and system based on Flink SQL engine |
CN112597203A (en) * | 2020-12-28 | 2021-04-02 | 恩亿科(北京)数据科技有限公司 | General data monitoring method and system based on big data platform |
CN112765130A (en) * | 2021-01-20 | 2021-05-07 | 银盛支付服务股份有限公司 | Data warehouse construction method and system, computer equipment and storage medium |
CN112527945A (en) * | 2021-02-10 | 2021-03-19 | 中关村科学城城市大脑股份有限公司 | Method and device for processing geographic space big data |
CN113111107B (en) * | 2021-04-06 | 2023-10-13 | 创意信息技术股份有限公司 | Data comprehensive access system and method |
CN114036183A (en) * | 2021-11-24 | 2022-02-11 | 杭州安恒信息技术股份有限公司 | A data ETL processing method, device, equipment and medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105939393A (en) * | 2016-06-30 | 2016-09-14 | 北京奇虎科技有限公司 | Task operating state monitoring method and system |
CN106202324A (en) * | 2016-06-30 | 2016-12-07 | 北京奇虎科技有限公司 | The data processing method of a kind of real-time calculating platform and device |
CN106209482A (en) * | 2016-09-13 | 2016-12-07 | 郑州云海信息技术有限公司 | A kind of data center monitoring method and system |
CN109684156A (en) * | 2018-08-27 | 2019-04-26 | 平安科技(深圳)有限公司 | Monitoring method, device, terminal and storage medium based on mixed mode applications |
CN109726074A (en) * | 2018-08-31 | 2019-05-07 | 网联清算有限公司 | Log processing method, device, computer equipment and storage medium |
CN109753531A (en) * | 2018-12-26 | 2019-05-14 | 深圳市麦谷科技有限公司 | A kind of big data statistical method, system, computer equipment and storage medium |
CN109829765A (en) * | 2019-03-05 | 2019-05-31 | 北京博明信德科技有限公司 | Method, system and device based on Flink and Kafka real time monitoring sales data |
CN110413701A (en) * | 2019-08-08 | 2019-11-05 | 江苏满运软件科技有限公司 | Distributed data base storage method, system, equipment and storage medium |
CN110442628A (en) * | 2019-07-09 | 2019-11-12 | 恩亿科(北京)数据科技有限公司 | A kind of data monitoring method, system and computer equipment |
CN110750562A (en) * | 2018-07-20 | 2020-02-04 | 武汉烽火众智智慧之星科技有限公司 | Storm-based real-time data comparison early warning method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10652174B2 (en) * | 2016-01-07 | 2020-05-12 | Wipro Limited | Real-time message-based information generation |
- 2020-02-28 CN CN202010133602.XA patent/CN111339175B/en active Active
Non-Patent Citations (1)
Title |
---|
Design and Implementation of a High-Performance Stream-Oriented Big Data Processing System; Meng Wang; 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC); pp. 1-4 *
Also Published As
Publication number | Publication date |
---|---|
CN111339175A (en) | 2020-06-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111339175B (en) | Data processing method, device, electronic equipment and readable storage medium | |
CN115809183A (en) | Method for discovering and disposing information-creating terminal fault based on knowledge graph | |
CN106487574A (en) | Automatic operating safeguards monitoring system | |
CN111339073A (en) | Real-time data processing method and device, electronic equipment and readable storage medium | |
CN113179173B (en) | Operation and maintenance monitoring system for expressway system | |
CN110309130A (en) | A kind of method and device for host performance monitor | |
CN105718351A (en) | Hadoop cluster-oriented distributed monitoring and management system | |
CN112800061B (en) | Data storage method, device, server and storage medium | |
CN117971606B (en) | Log management system and method based on elastic search | |
CN101989931A (en) | Operation alarm processing method and device | |
CN113067717A (en) | Network request log chain tracking method, full link call monitoring system and medium | |
CN112579552A (en) | Log storage and calling method, device and system | |
CN114095333A (en) | Network troubleshooting method, device, equipment and readable storage medium | |
CN113032252A (en) | Method and device for collecting buried point data, client device and storage medium | |
CN114090529A (en) | A log management method, device, system and storage medium | |
CN114780335A (en) | Correlation method, apparatus, computer equipment and storage medium for monitoring data | |
CN113342619A (en) | Log monitoring method and system, electronic device and readable medium | |
CN110677271B (en) | Big data alarm method, device, equipment and storage medium based on ELK | |
CN118627023B (en) | An analysis system for tracking calls between microservices | |
CN109522349B (en) | Cross-type data calculation and sharing method, system and equipment | |
CN116701525A (en) | Early warning method and system based on real-time data analysis and electronic equipment | |
KR102787233B1 (en) | System and method for log monitoring processing based on latent space | |
CN116795631A (en) | Service system monitoring alarm method, device, equipment and medium | |
CN112596974A (en) | Full link monitoring method, device, equipment and storage medium | |
CN113900898B (en) | Data processing system, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||