CN117648342A

CN117648342A - Multi-data stream parallel processing method and device and nonvolatile storage medium

Info

Publication number: CN117648342A
Application number: CN202311597814.3A
Authority: CN
Inventors: 王鹏哲; 阮宜龙
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2023-11-27
Filing date: 2023-11-27
Publication date: 2024-03-05
Also published as: WO2025112848A1

Abstract

The application discloses a multi-data stream parallel processing method and device and a nonvolatile storage medium. The method comprises: determining task configuration information of a target task flow, wherein the target task flow comprises at least one data stream, and the task configuration information comprises the association mode among the data streams in the target task flow, data characteristic information of each data stream in the target task flow, and the data table types of the data streams; in the pre-write batch stage of the target task flow, executing a first type of operation on the data streams in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of data deduplication and data association; and in the data write merge stage of the target task flow, persisting the data streams in the target task flow to disk according to the task configuration information. The application solves the technical problem in the related art that associating real-time data streams at the computing layer causes the computing task to cache an extremely large volume of data.

Description

Multi-data stream parallel processing method and device and nonvolatile storage medium

Technical Field

The present invention relates to the field of data processing, and in particular, to a method and apparatus for parallel processing of multiple data streams, and a nonvolatile storage medium.

Background

In the related art, real-time data streams are usually associated at the computing layer, and an external storage system must be introduced for the association. As a result, storage and network transmission pressure is high, performance bottlenecks form easily, and the volume of data cached by the computing task is extremely large.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the present application provide a multi-data stream parallel processing method, a multi-data stream parallel processing device and a nonvolatile storage medium, which at least solve the technical problem in the related art that associating real-time data streams at the computing layer causes the computing task to cache an extremely large volume of data.

According to an aspect of the embodiments of the present application, there is provided a multi-data stream parallel processing method, including: determining task configuration information of a target task flow, wherein the target task flow comprises at least one data stream, and the task configuration information comprises the association mode among the data streams in the target task flow, data characteristic information of each data stream in the target task flow, and the data table types of the data streams; in the pre-write batch stage of the target task flow, executing a first type of operation on the data streams in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of data deduplication and data association; and in the data write merge stage of the target task flow, persisting the data streams in the target task flow to disk according to the task configuration information.

Optionally, the association mode includes one of: normal write mode, wide table JOIN mode, and data stream JOIN mode.

Optionally, in the case that the association mode is the wide table JOIN mode, the step of executing the first type of operation on the data streams in the target task flow according to the task configuration information includes: determining whether target data with the same key value exists in the target task flow; when such target data exists, determining whether the target data with the same key value belongs to the same data stream; when the target data belongs to the same data stream, updating the data stream to which the target data belongs; and when the target data does not belong to the same data stream, performing data splicing on the data streams to which the target data belongs.

Optionally, in the case that the association mode is the data stream JOIN mode, the step of executing the first type of operation on the data streams in the target task flow according to the task configuration information includes: determining whether target data with the same key value exists in the target task flow; when such target data exists, determining whether the target data with the same key value belongs to the same data table; when the target data corresponds to the same data table, retaining the target data with the later generation time; and when the target data does not correspond to the same data table, performing data stream JOIN processing on the data streams corresponding to the target data.

Optionally, the step of performing data stream JOIN processing on the data streams corresponding to the target data includes: determining a JOIN order between the data streams, and associating the data streams according to the JOIN order; when the association mode of the next association in the JOIN order is inner join or right join and the JOIN association fails, returning the target data and marking it as hidden data; and when the association mode of the next association in the JOIN order is left join, returning the target data, and adding a hidden mark to the data in the data stream that does not participate in the association.

Optionally, in the case that the association mode is the data stream JOIN mode, the step of persisting the data streams in the target task flow to disk according to the task configuration information includes: determining the write merge logic corresponding to the table type information of the data streams, wherein the write merge logic comprises a write merge mode for a first type of data stream and a second type of data stream, the first type being a data stream whose data carries a hidden mark and the second type being a data stream whose data carries no hidden mark; and persisting each data stream to disk according to its corresponding write merge logic.

Optionally, in the case that the association mode is the wide table JOIN mode, the step of persisting the data streams in the target task flow to disk according to the task configuration information includes: determining target data with the same key value; determining the generation time of the target data, and determining the target columns in which the most recently generated target data is located; and overwriting the historical data with the target columns, and persisting the overwritten data associated with the data streams to disk.

Optionally, the task configuration information further includes a parallelism; the step of determining the task configuration information of the target task flow comprises: acquiring a preset parallelism parameter; or determining the number of data sources associated with the target task flow and configuring the parallelism parameter of the target task flow according to the number of data sources.

Optionally, after the step of persisting the data streams in the target task flow to disk according to the task configuration information, the multi-data stream parallel processing method further includes: determining the query type of a data query instruction and the table type of the data table corresponding to the data query instruction; and processing the data in the data table according to the query type and the table type.

According to another aspect of the embodiments of the present application, there is also provided a multi-data stream parallel processing apparatus, including: a first processing module, configured to determine task configuration information of a target task flow, wherein the target task flow comprises at least one data stream, and the task configuration information comprises the association mode among the data streams in the target task flow, data characteristic information of each data stream in the target task flow, and the data table types of the data streams; a second processing module, configured to execute a first type of operation on the data streams in the target task flow according to the task configuration information in the pre-write batch stage of the target task flow, wherein the first type of operation comprises at least one of data deduplication and data association; and a third processing module, configured to persist the data streams in the target task flow to disk according to the task configuration information in the data write merge stage of the target task flow.

According to another aspect of the embodiments of the present application, there is further provided a nonvolatile storage medium in which a program is stored, wherein, when the program runs, the device in which the nonvolatile storage medium is located is controlled to execute the multi-data stream parallel processing method.

According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory and a processor, wherein the processor is configured to run a program stored in the memory, and the program, when run, executes the multi-data stream parallel processing method.

In the embodiments of the present application, task configuration information of a target task flow is determined, wherein the target task flow comprises at least one data stream, and the task configuration information comprises the association mode among the data streams in the target task flow, data characteristic information of each data stream in the target task flow, and the data table types of the data streams; in the pre-write batch stage of the target task flow, a first type of operation is executed on the data streams in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of data deduplication and data association; and in the data write merge stage of the target task flow, the data streams in the target task flow are persisted to disk according to the task configuration information. Because the data streams are associated in the pre-write batch stage, the association logic of the data streams sinks to the storage layer of the data lake, so that not all data association tasks need to be executed at the computing layer. This solves the technical problem in the related art that associating real-time data streams at the computing layer causes the computing task to cache an extremely large volume of data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:

FIG. 1 is a schematic structural diagram of a computer terminal (or mobile device) provided according to an embodiment of the present application;

FIG. 2 is a flow chart of a multi-data-stream parallel processing method according to an embodiment of the present application;

FIG. 3 is a schematic flow diagram of multi-data-stream processing provided according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a multi-data-stream parallel processing apparatus according to an embodiment of the present application.

Detailed Description

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments of the present application are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

For better understanding of the embodiments of the present application, technical terms related in the embodiments of the present application are explained below:

Hudi: in this application, refers to the data lake.

Payload: the Hudi Payload is an extensible data processing mechanism. Different Payloads make it possible to customize how data is written in complex scenarios, which greatly increases the flexibility of data processing. A Payload is the tool that performs operations such as deduplication, filtering, and merging on data when it is written to and read from a Hudi table.

State: to meet the need for historical data in operator computation, with fault tolerance ensured by the checkpoint mechanism, State is a data structure kept in the state storage module for storing the states of various objects. State is used to store intermediate results, metadata, and similar information of the nodes in the computation.

Compaction (merge): an operation in Hudi that reconciles the differential data structures, for example moving updates from row-based log files into columnar files.

Base file (Base file): in Hudi, the base file is stored in the form of a columnar file.

Log file: in Hudi, delta files such as log files are stored as row-based files.

At present, with the development of real-time computing frameworks such as Flink and Spark Streaming and the rise of Kafka and MPP technologies, the technology in the real-time computing field is maturing. With the popularization of technologies such as the Internet of Things and machine learning, real-time stream computing is widely applied in fields such as intelligent recommendation, real-time fraud detection, complex event processing, and real-time machine learning. Across these application scenarios, traditional statistical aggregation over a single data stream can no longer meet the needs of index calculation in complex business scenarios, and real-time association of data streams urgently needs to be solved.

In the related art, the following data stream JOIN schemes are commonly used when associating data streams:

1. Multi-task parallel processing: the data streams to be associated are written into an external storage system, and the main data stream performs point lookups against the external storage to complete the data association.

2. Caching in the Flink streaming computing framework: the data to be associated is cached using Interval Join, Window Join, or Flink's own State.

3. Column storage: each streaming task corresponds to a single real-time data source and updates only the columns related to that data source.

However, these schemes have drawbacks. Scheme 1 depends on external storage, which increases the operation and maintenance burden; the data volume is large, which puts pressure on storage and network IO and easily creates performance bottlenecks. Scheme 2, Flink Interval Join or Window Join, must keep the cached data in memory, so the task is under heavy pressure, and an oversized State makes checkpoints take too long, which affects the timeliness of business data and the stability of the Flink streaming task.

In addition, when multiple streams write to the same table at the same time during association, write contention occurs: the write locks must be acquired in turn, which easily causes deadlock and performance loss.

In addition, in the related art, when data association is performed while hot and cold data are loaded, the hot data is held in memory and a reasonable TTL (time to live) cannot be set, so cached data may not be updated in time, resulting in data anomalies. Moreover, performing real-time association of data streams at the computing layer can lead to an extremely large volume of cached data in the computing task, task backpressure, and similar problems.

In order to solve the above-mentioned problems, related solutions are provided in the embodiments of the present application, and the following detailed description is provided.

According to the embodiments of the present application, a method embodiment of a multi-data stream parallel processing method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a block diagram of the hardware structure of a computer terminal (or mobile device) for implementing the multi-data stream parallel processing method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n), which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the computer terminal 10 may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.

It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination to interface).

The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the multi-data stream parallel processing method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the multi-data stream parallel processing method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission module 106 is configured to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.

The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).

In the above operating environment, the embodiment of the present application provides a multi-data stream parallel processing method, as shown in fig. 2, including the following steps:

Step S202, task configuration information of a target task flow is determined, wherein the target task flow comprises at least one data stream, and the task configuration information comprises the association mode among the data streams in the target task flow, data characteristic information of each data stream in the target task flow, and the data table types of the data streams;

In the technical solution provided in step S202, the association manner includes one of the following: normal write mode, wide table JOIN mode, data stream JOIN mode.

As an alternative implementation manner, when task configuration is performed on the target task flow, the configured content may include:

The connection mode between the data streams of the target task flow, for example: JOIN (data stream JOIN mode), common (normal write mode), or wide (wide table JOIN mode).

The parsing schema (data characteristics) of each single stream, used for data parsing and pattern matching, for example: schema = {"type": "record", "fields": [{"name": "field1", "type": "string"}, {"name": "field2", "type": "string"}]}. The schema includes information about the structure and types of each stream, such as field names, data types, lengths, and constraints.

The JOIN logic for data streams associated in JOIN mode, for example t1 LEFT JOIN t2, where t1 and t2 represent the two associated data streams.
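The configured content listed above can be pictured as a small configuration object. The following is only a minimal sketch: the names `AssociationMode`, `TaskConfig`, `join_logic`, and `table_type`, and the representation of the JOIN logic as (left stream, join type, right stream) tuples, are assumptions made for illustration and do not come from the application.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class AssociationMode(Enum):
    COMMON = "common"  # normal write mode
    WIDE = "wide"      # wide table JOIN mode
    JOIN = "join"      # data stream JOIN mode


@dataclass
class TaskConfig:
    mode: AssociationMode
    # One parsing schema per data stream, keyed by a hypothetical stream name.
    schemas: Dict[str, dict]
    # Ordered JOIN steps (left stream, join type, right stream); used in JOIN mode only.
    join_logic: List[Tuple[str, str, str]] = field(default_factory=list)
    table_type: str = "MERGE_ON_READ"   # hypothetical field for the data table type
    parallelism: Optional[int] = None   # None means "not configured"

    def validate(self) -> None:
        """Reject configurations whose JOIN logic is missing or inconsistent."""
        if self.mode is AssociationMode.JOIN and not self.join_logic:
            raise ValueError("JOIN mode requires at least one JOIN step")
        for left, how, right in self.join_logic:
            if how not in ("inner", "left", "right"):
                raise ValueError(f"unsupported join type: {how}")
            for name in (left, right):
                if name not in self.schemas:
                    raise ValueError(f"unknown stream in JOIN logic: {name}")


# The example schema from the text, reused for two streams t1 and t2.
RECORD_SCHEMA = {
    "type": "record",
    "fields": [
        {"name": "field1", "type": "string"},
        {"name": "field2", "type": "string"},
    ],
}

config = TaskConfig(
    mode=AssociationMode.JOIN,
    schemas={"t1": RECORD_SCHEMA, "t2": RECORD_SCHEMA},
    join_logic=[("t1", "left", "t2")],  # t1 LEFT JOIN t2
)
config.validate()
```

Validating the JOIN steps against the declared schemas at load time is a design choice of this sketch, so that a misnamed stream is caught before any data is read.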

In some embodiments of the present application, the task configuration information further includes a parallelism; the step of determining the task configuration information of the target task flow comprises: acquiring a preset parallelism parameter; or determining the number of data sources associated with the target task flow and configuring the parallelism parameter of the target task flow according to the number of data sources.

Specifically, the parallelism refers to the number of data streams that can be processed simultaneously in the target task flow. When setting the parallelism, a user can configure a global parallelism in the streaming computing framework and can also configure the parallelism of a single streaming task. When a task is executed, if no parallelism has been configured for it individually, the global parallelism is used as its default. If the user has set neither the global parallelism nor the parallelism of the single task, the parallelism can be inferred from the number of data sources. For example, for a data source of the Kafka Source type, the task parallelism is configured according to the number of Topic partitions. The parallelism may, for example, be set equal to the number of data sources or the number of Topic partitions.
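The fallback order described above (task-level setting, then global setting, then inference from the data sources) can be sketched as follows. The function name `resolve_parallelism` and its parameters are hypothetical; the choice to prefer the Topic partition count over the source count when both are known is also an assumption.

```python
from typing import Optional


def resolve_parallelism(
    task_parallelism: Optional[int] = None,
    global_parallelism: Optional[int] = None,
    source_count: int = 1,
    topic_partitions: Optional[int] = None,
) -> int:
    """Pick the parallelism of one streaming task."""
    if task_parallelism is not None:
        return task_parallelism        # 1. per-task setting wins
    if global_parallelism is not None:
        return global_parallelism      # 2. otherwise the global default
    # 3. Nothing configured: infer from the sources. For a Kafka-like source
    #    the Topic partition count is used; otherwise the number of sources.
    inferred = topic_partitions if topic_partitions is not None else source_count
    return max(inferred, 1)
```

For instance, `resolve_parallelism(None, None, source_count=1, topic_partitions=12)` returns 12 under these assumptions.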

After the configuration is completed, all data streams to be associated can be accessed: the streaming task is configured to read the list of data source Topics, a partition discovery mechanism is set up, and the number of partitions of each source Topic is obtained dynamically. For example, when partitions are added, consumption is registered for the added partitions and they are included in the data read list, so that the data streams to be associated are accessed dynamically in real time.
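The partition discovery step can be illustrated without any messaging library. In the sketch below, `PartitionRegistry` and `refresh` are hypothetical names, and the partition listing is passed in as a plain dictionary rather than fetched from a real broker.

```python
from typing import Dict, List, Set, Tuple


class PartitionRegistry:
    """Tracks which (topic, partition) pairs are already being consumed."""

    def __init__(self) -> None:
        self._registered: Set[Tuple[str, int]] = set()

    def refresh(self, listing: Dict[str, int]) -> List[Tuple[str, int]]:
        """Register partitions that appeared since the last refresh.

        listing maps each source Topic to its current partition count.
        Returns the newly registered (topic, partition) pairs.
        """
        added = []
        for topic in sorted(listing):
            for partition in range(listing[topic]):
                key = (topic, partition)
                if key not in self._registered:
                    self._registered.add(key)
                    added.append(key)
        return added


registry = PartitionRegistry()
first = registry.refresh({"t1": 2})            # initial discovery
second = registry.refresh({"t1": 3, "t2": 1})  # one partition and one Topic added
```

Calling `refresh` periodically yields only the partitions that still need consumption registered, which is the behaviour the paragraph describes.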

After the task configuration of the target task flow is completed, the target task flow loads the configuration file and reads the task configuration in the start-up stage, then determines the connection mode between the data streams and loads the Payload data processing class corresponding to that mode.
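Selecting a Payload class from the configured mode might look like the sketch below. The three classes are empty stand-ins, not Hudi classes; the JSON configuration format, the `mode` key, and the fallback to the normal write mode when no mode is given are all assumptions.

```python
import json


class CommonPayload:
    """Normal write mode: deduplicate only."""
    mode = "common"


class WidePayload:
    """Wide table JOIN mode: update or splice columns."""
    mode = "wide"


class JoinPayload:
    """Data stream JOIN mode: associate streams in the configured order."""
    mode = "join"


# Lookup table from configured mode to payload class.
PAYLOAD_BY_MODE = {cls.mode: cls for cls in (CommonPayload, WidePayload, JoinPayload)}


def load_payload(config_text: str):
    """Read the task configuration and instantiate the matching payload class."""
    config = json.loads(config_text)
    mode = config.get("mode", "common")  # assumption: default to normal write mode
    try:
        return PAYLOAD_BY_MODE[mode]()
    except KeyError:
        raise ValueError(f"unknown association mode: {mode!r}") from None


payload = load_payload('{"mode": "wide"}')
```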

Step S204, in the pre-writing batch phase of the target task stream, a first type of operation is executed on the data stream in the target task stream according to the task configuration information, wherein the first type of operation includes at least one of the following: data deduplication and data association;

During the pre-write batch stage, different data deduplication logic and association logic may be configured according to the connection mode between the streams. For example, for a data stream in the normal write mode, only deduplication is performed: data is updated, cleaned, and deleted according to the preset configuration, but the data stream is not connected or associated with other data streams.
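For the normal write mode, deduplication within one batch can be sketched as keeping, per record key, the record with the largest ordering value. The field names `key` and `ts` are hypothetical, and cleaning and deletion rules are left out of this sketch.

```python
from typing import Dict, List


def dedup_batch(records: List[dict], key_field: str = "key",
                order_field: str = "ts") -> List[dict]:
    """Keep one record per key: the one with the largest ordering value.

    On a tie the record that arrives later in the batch wins.
    """
    latest: Dict[object, dict] = {}
    for rec in records:
        current = latest.get(rec[key_field])
        if current is None or rec[order_field] >= current[order_field]:
            latest[rec[key_field]] = rec
    return list(latest.values())


batch = [
    {"key": "a", "ts": 1, "v": "old"},
    {"key": "b", "ts": 5, "v": "x"},
    {"key": "a", "ts": 3, "v": "new"},
]
deduped = dedup_batch(batch)
```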

In the technical solution provided in step S204, in the case that the association mode is the wide table JOIN mode, the step of executing the first type of operation on the data streams in the target task flow according to the task configuration information includes: determining whether target data with the same key value exists in the target task flow; when such target data exists, determining whether the target data with the same key value belongs to the same data stream; when the target data belongs to the same data stream, updating the data stream to which the target data belongs; and when the target data does not belong to the same data stream, performing data splicing on the data streams to which the target data belongs.

Specifically, the data processed at this point includes data written by different data streams, so the columns contained in each record may differ. Therefore, when data is merged and deduplicated, it is necessary to determine whether two records with the same record key come from the same data stream. If they come from the same data stream, the data corresponding to the key value is updated, that is, the most recently generated data is retained. If not, data splicing is performed. During data splicing, if the two data streams have overlapping columns, the overlapping columns take the latest data, that is, they are replaced with the latest data as ordered by the precombine field.
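The update-or-splice rule for the wide table JOIN mode can be sketched as a two-pass merge. The record layout (`key`, `stream`, `ts`, `cols`) is hypothetical, with `ts` standing in for the precombine field.

```python
from typing import Dict, List


def prewrite_wide(batch: List[dict]) -> Dict[str, dict]:
    """Merge a batch in wide table JOIN mode.

    Same key, same stream      -> keep the most recently generated record.
    Same key, different stream -> splice the columns; overlapping columns
                                  take the value from the newest record.
    """
    # Pass 1: key -> stream -> latest record seen from that stream.
    latest: Dict[str, Dict[str, dict]] = {}
    for rec in batch:
        per_stream = latest.setdefault(rec["key"], {})
        current = per_stream.get(rec["stream"])
        if current is None or rec["ts"] >= current["ts"]:
            per_stream[rec["stream"]] = rec

    # Pass 2: splice the per-stream records, oldest first, so that a newer
    # record overwrites any overlapping column.
    merged: Dict[str, dict] = {}
    for key, per_stream in latest.items():
        cols: Dict[str, object] = {}
        for rec in sorted(per_stream.values(), key=lambda r: r["ts"]):
            cols.update(rec["cols"])
        newest_ts = max(r["ts"] for r in per_stream.values())
        merged[key] = {"key": key, "ts": newest_ts, "cols": cols}
    return merged


batch = [
    {"key": "k1", "stream": "s1", "ts": 1, "cols": {"A": "a1", "B": "b1"}},
    {"key": "k1", "stream": "s1", "ts": 2, "cols": {"A": "a2", "B": "b2"}},
    {"key": "k1", "stream": "s2", "ts": 3, "cols": {"B": "b3", "C": "c3"}},
]
result = prewrite_wide(batch)
```

Here stream s1 is first updated to its ts=2 record, and column B, which both streams write, ends up with the value from the newer s2 record.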

As an optional implementation, in the case that the association mode is the data stream JOIN mode, the step of executing the first type of operation on the data streams in the target task flow according to the task configuration information includes: determining whether target data with the same key value exists in the target task flow; when such target data exists, determining whether the target data with the same key value belongs to the same data table; when the target data corresponds to the same data table, retaining the target data with the later generation time; and when the target data does not correspond to the same data table, performing data stream JOIN processing on the data streams corresponding to the target data.

In some embodiments of the present application, the step of performing data stream JOIN processing on the data streams corresponding to the target data includes: determining a JOIN order between the data streams, and associating the data streams according to the JOIN order; when the association mode of the next association in the JOIN order is inner join or right join and the JOIN association fails, returning the target data and marking it as hidden data; and when the association mode of the next association in the JOIN order is left join, returning the target data, and adding a hidden mark to the data in the data stream that does not participate in the association.

Specifically, when different data streams are connected in JOIN mode, the data to be processed by the target task flow includes data written by different data streams, that is, the data may originate from different data tables. Therefore, at this stage, for two records with the same recordKey (key value), it is further necessary to determine whether their schemas are the same. If they are the same, the two records come from the same data table, and the update is completed by overwriting the old data with the latest data. If they differ, the two records come from different data tables. In that case, the data stream schema is matched against the configured table schemas to determine which data stream table each record comes from (a data stream table can be regarded as a data stream in the present application), the data stream tables are joined according to the configured JOIN mode, and the streams are associated according to the JOIN order.

Then, if the JOIN mode of the next association between the data stream tables is Inner JOIN or Right JOIN and the JOIN finds no match, the previously associated intermediate data (that is, the data with the same key value found earlier) is returned and marked as hidden data. If the JOIN mode of the next association in the JOIN order is Left JOIN, the data can be returned without a hidden mark. Other data that does not participate in the association is given a hidden mark and returned, in preparation for the data writing of the next stage.

Returning data in the pre-writing batch phase means passing the data on to the write-merge stage.
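The pre-write JOIN handling described above can be illustrated with a minimal Python sketch. It is one possible reading of the text, not the implementation of the present application: the names `Record`, `Joined` and `pre_write_join`, the use of the stream name as a stand-in for the schema comparison, and the rule that keys without a driving-stream record are returned hidden are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Record:
    key: str                    # recordKey shared across streams
    stream: str                 # source data stream table (stands in for the schema check)
    ts: int                     # generation time used to pick the latest record
    columns: Dict[str, object]


@dataclass
class Joined:
    key: str
    columns: Dict[str, object]
    hidden: bool


def pre_write_join(records: List[Record],
                   join_order: List[Tuple[str, Optional[str]]]) -> Dict[str, Joined]:
    """join_order lists (stream, mode); the first entry is the driving stream."""
    # Same key and same stream (same schema): keep only the latest record.
    latest: Dict[Tuple[str, str], Record] = {}
    for rec in records:
        slot = (rec.key, rec.stream)
        if slot not in latest or rec.ts >= latest[slot].ts:
            latest[slot] = rec

    # Group the surviving records by key so the streams can be associated.
    by_key: Dict[str, Dict[str, Record]] = {}
    for (key, stream), rec in latest.items():
        by_key.setdefault(key, {})[stream] = rec

    driving = join_order[0][0]
    result: Dict[str, Joined] = {}
    for key, per_stream in by_key.items():
        if driving not in per_stream:
            # Assumed reading: data that cannot take part in the association
            # is returned with a hidden mark.
            cols: Dict[str, object] = {}
            for rec in per_stream.values():
                cols.update(rec.columns)
            result[key] = Joined(key, cols, hidden=True)
            continue
        cols = dict(per_stream[driving].columns)
        hidden = False
        for stream, mode in join_order[1:]:
            other = per_stream.get(stream)
            if other is not None:
                cols.update(other.columns)   # matched: splice the columns
            elif mode in ("inner", "right"):
                hidden = True                # unmatched Inner/Right JOIN: hide
            # Unmatched Left JOIN: returned without a hidden mark.
        result[key] = Joined(key, cols, hidden)
    return result
```

In this sketch a hidden row is still returned, so that a later write-merge step can complete it once the missing side arrives.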

Step S206, in the data write-merge stage of the target task flow, persistent disk-drop processing is performed on the data streams in the target task flow according to the task configuration information.

In the technical solution provided in step S206, in the data write-merge stage, the Hudi table determines the corresponding data write logic according to the data table type. For example, for the Copy On Write table type, the base file is copied again, and the historical data and incremental data are merged and updated during the rewrite to complete the persistent disk drop. For the Merge On Read table type, the incremental data is first written into a log file, and the data in the historical base file and the incremental log file is then merged and updated when the compaction action runs.

As an alternative implementation, for a data stream in the normal write mode, two records with the same recordKey are compared on the precombine field according to the Payload configuration, and the record that satisfies the Payload retention rule is subjected to persistent disk-drop processing.
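As a rough illustration of the normal write mode, the following Python sketch keeps one record per record key by comparing a precombine field. The retention rule used here (keep the record with the larger precombine value, ties going to the later arrival) is an assumption; the text only says that the rule comes from the Payload configuration. The function name and the default field names `id` and `ts` are hypothetical.

```python
from typing import Dict, Iterable, List


def precombine(records: Iterable[Dict[str, object]],
               record_key: str = "id",
               precombine_field: str = "ts") -> List[Dict[str, object]]:
    """Keep one record per record key, chosen by the precombine field."""
    kept: Dict[object, Dict[str, object]] = {}
    for rec in records:
        key = rec[record_key]
        current = kept.get(key)
        # Assumed retention rule: the larger precombine value wins.
        if current is None or rec[precombine_field] >= current[precombine_field]:
            kept[key] = rec
    return list(kept.values())
```

A different Payload could apply another rule (for example, keeping the first record seen); only the comparison line would change.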

In some embodiments of the present application, in the case that the association mode is the wide table connection mode, the step of performing persistent disk-drop processing on the data stream in the target task flow according to the task configuration information includes: determining target data with the same key value; determining the generation time of the target data, and determining the target columns in which the most recently generated target data is located; and overwriting the historical data with the target columns, and performing persistent disk-drop processing on the associated data of the data streams after the overwrite.

Specifically, for data streams connected in the wide table connection mode, for two records with the same recordKey, all columns of the latest data overwrite the corresponding historical columns, the result is spliced into a complete record, and persistent disk-drop processing is then performed. For example, for the record in the base file whose primary key is key1, suppose the data with the same key value found in the Map carries the three columns B, C and D, while column A is not carried by that data. In this case, the latest data is used to update columns B, C and D, and after the update the three columns B, C and D are spliced with column A again.
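The column overwrite and splice in the key1 example can be sketched as follows. This is a minimal illustration under assumptions: rows are plain dictionaries, each partial update carries a hypothetical `ts` field for its generation time, and `merge_wide_row` is an illustrative name rather than anything defined by the present application.

```python
from typing import Dict, Iterable


def merge_wide_row(base_row: Dict[str, object],
                   updates: Iterable[Dict[str, object]],
                   ts_field: str = "ts") -> Dict[str, object]:
    """Overlay partial-column updates onto the stored row, oldest first."""
    merged = dict(base_row)          # columns not carried by any update (e.g. A) survive
    for update in sorted(updates, key=lambda u: u[ts_field]):
        merged.update(update)        # newer values overwrite the columns they carry
    return merged


base = {"key": "key1", "A": 1, "B": 2, "C": 3, "D": 4, "ts": 1}
incoming = [
    {"key": "key1", "B": 20, "C": 30, "D": 40, "ts": 5},
    {"key": "key1", "B": 21, "ts": 3},
]
row = merge_wide_row(base, incoming)
```

Here column A keeps its stored value, while B, C and D end up with the values from the most recent update that carried them.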

As an optional implementation manner, in the case that the association mode is a data stream JOIN mode, the step of performing persistent disk drop processing on the data stream in the target task stream according to the task configuration information includes: determining write merging logic corresponding to table type information of data streams, wherein the write merging logic comprises a write merging mode for a first type of data stream and a second type of data stream, the first type of data stream is a data stream with a hidden mark in corresponding data, and the second type of data stream is a data stream without the hidden mark in the corresponding data; and performing persistent disc dropping processing on the data stream according to the write merging logic corresponding to the data stream.

Specifically, for data streams connected in the data stream JOIN mode, persistent disk-drop processing first requires determining the type of the data table to be processed and then the corresponding write-merge logic. For example, for data of the Merge On Read table type, when the base file and the log file go through the compaction stage: if the incremental log file contains data without a hidden mark and the corresponding base file contains no data with the same recordKey, that data is treated as newly associated data and is written into the base file corresponding to the new instant time (timestamp). If the base file already contains existing data with the same recordKey, the incremental data overwrites and updates the historical data, and the result is written into the base file corresponding to the new instant time.

If the corresponding log file contains data with a hidden mark and the base file contains data with the same recordKey, the columns of the incremental data are updated into the historical data, the data is spliced into a complete record, and persistent disk-drop processing is then performed. If the base file contains no data with that recordKey, the hidden data is written into the log file corresponding to the new instant time, so that it can be associated with the related data stream in the next compaction.

Data of the Copy On Write table type is equivalent to being compacted on every write and contains only base files. The write logic is therefore as follows: for data with the same recordKey, the new and old data are merged, and overlapping columns are updated by overwriting. Data without a matching recordKey continues to be marked as hidden data and is written into the base file corresponding to the new instant time.
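The four Merge On Read compaction cases described above can be summarized in a short Python sketch. It models the base file as a dictionary keyed by recordKey and the log file as a list of rows with a `hidden` flag; these structures and the name `compact` are hypothetical. Clearing the hidden mark once a log row is spliced with a base row is also an assumption, since the text only says the data is spliced into complete data.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def compact(base: Dict[str, Row], log: List[Row]) -> Tuple[Dict[str, Row], List[Row]]:
    """Return (new base file contents, hidden rows carried into the next log file)."""
    new_base = {key: dict(row) for key, row in base.items()}
    carried: List[Row] = []
    for rec in log:
        key = rec["key"]
        if key in new_base:
            # Base row exists: overwrite overlapping columns and splice the rest.
            merged = {**new_base[key], **rec}
            merged["hidden"] = False       # assumption: a spliced row is complete
            new_base[key] = merged
        elif rec.get("hidden", False):
            # Hidden and no base row: keep it in the log for the next compaction.
            carried.append(rec)
        else:
            # Not hidden and no base row: newly associated data goes to the base file.
            new_base[key] = dict(rec)
    return new_base, carried
```

For the Copy On Write case described above, the same merge branch would apply, but unmatched hidden rows would be written to the new base file rather than carried in a log.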

In some embodiments of the present application, after the step of performing persistent disk-dropping processing on the data stream in the target task stream according to the task configuration information, the multi-data stream parallel processing method further includes: determining the query type of the data query instruction and the table type of a data table corresponding to the data query instruction; and processing the data in the data table according to the query type and the table type.

Specifically, in the data application stage, such as data querying, the data in the data tables may be processed differently depending on the table type and the query instruction. For example, for a data table of the Copy On Write table type, no merging of incremental and historical data is needed at query time, so only the data containing hidden marks needs to be filtered out; the remaining data is the final associated real-time stream wide-table data.

For a data table of the Merge On Read table type, if the query type of the query instruction is a Snapshot View query, the historical base file and the incremental log file must first be merged once; the logic of this merge is the same as that of the compaction process in the write-merge stage. The data containing hidden marks is then filtered out, and the remaining data is the final associated real-time stream wide-table data.
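The query-side behaviour can be sketched in the same style. The table type and query type strings, the row layout with a `hidden` flag, and the function name `query_visible_rows` are assumptions for illustration; the merge step mirrors the compaction logic only in simplified form.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def query_visible_rows(table_type: str,
                       query_type: str,
                       base_rows: Iterable[Row],
                       log_rows: Iterable[Row] = ()) -> List[Row]:
    """Return the rows a query should see, with hidden rows filtered out."""
    rows = {row["key"]: dict(row) for row in base_rows}
    if table_type == "MERGE_ON_READ" and query_type == "snapshot":
        # Snapshot view: merge base and log once before filtering.
        for rec in log_rows:
            key = rec["key"]
            if key in rows:
                rows[key] = {**rows[key], **rec, "hidden": False}
            else:
                rows[key] = dict(rec)
    # Copy On Write needs no merge; both paths drop rows that still carry a hidden mark.
    return [row for row in rows.values() if not row.get("hidden", False)]
```

For a Copy On Write table the log argument is simply left empty, so the function reduces to the hidden-mark filter.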

In summary, the complete flow of processing multiple data streams in the present application is shown in fig. 3, and includes the following steps:

Step S302, loading a task configuration file in a streaming computing task, and determining the association mode between the data streams and the parsing characteristics of the data streams;

Step S304, loading the payload of the corresponding mode according to the configured association mode between the data streams;

Step S306, performing data pre-deduplication and association operations in the pre-writing batch phase according to the association mode of the data streams;

Step S308, merging the historical data and the incremental data in the data write-merge stage according to the association mode of the data streams and the data table type.
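Steps S302 and S304 can be pictured with a small configuration sketch. Everything named here is hypothetical: the configuration keys, the `AssociationMode` values, and the payload names in `PAYLOAD_BY_MODE` are placeholders and do not correspond to classes of any real library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class AssociationMode(Enum):
    NORMAL_WRITE = "normal_write"
    WIDE_TABLE = "wide_table"
    STREAM_JOIN = "stream_join"


@dataclass(frozen=True)
class TaskConfig:
    association_mode: AssociationMode
    table_type: str          # e.g. "COPY_ON_WRITE" or "MERGE_ON_READ"
    record_key: str
    precombine_field: str


# Placeholder payload names, one per association mode (not real class names).
PAYLOAD_BY_MODE: Dict[AssociationMode, str] = {
    AssociationMode.NORMAL_WRITE: "LatestRecordPayload",
    AssociationMode.WIDE_TABLE: "WideTableSplicePayload",
    AssociationMode.STREAM_JOIN: "StreamJoinPayload",
}


def load_task_config(raw: Dict[str, str]) -> TaskConfig:
    """S302: parse the task configuration loaded by the streaming task."""
    return TaskConfig(
        association_mode=AssociationMode(raw["association_mode"]),
        table_type=raw["table_type"],
        record_key=raw["record_key"],
        precombine_field=raw["precombine_field"],
    )


def select_payload(config: TaskConfig) -> str:
    """S304: choose the payload that matches the association mode."""
    return PAYLOAD_BY_MODE[config.association_mode]


config = load_task_config({
    "association_mode": "stream_join",
    "table_type": "MERGE_ON_READ",
    "record_key": "order_id",
    "precombine_field": "ts",
})
payload_name = select_payload(config)
```

The selected payload would then drive the pre-write operations of S306 and the write-merge of S308 for that task.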

It can be seen that, by sinking the JOIN process of the data streams into the Hudi storage of the data lake and performing the wide table connection with the Hudi payload mechanism, the present application does not need to associate data through cached state in the computing engine, nor to introduce external storage for caching the data to be associated. Therefore, no operation and maintenance of an external storage system is needed, and no additional pressure is placed on storage and network IO, so no performance bottleneck is introduced. In addition, a single streaming computing task can write the data of multiple associated dimension tables, which avoids the write-lock contention that may arise when multiple tasks write to a single table concurrently, improves task processing performance, and avoids the deadlocks that concurrent writes may cause.

In addition, the present application determines task configuration information of a target task flow, wherein the target task flow comprises at least one data flow, and the task configuration information comprises an association mode among the data flows in the target task flow, data characteristic information of each data flow in the target task flow and data table types of the data flows; in the pre-writing batch phase of the target task flow, a first type of operation is executed on the data flow in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of the following operations: data deduplication and data association; and in the data writing merging stage of the target task flow, persistent disk-drop processing is performed on the data flow in the target task flow according to the task configuration information. Because the data flows are associated in the pre-writing batch phase, the association logic of the data flows is sunk to the storage layer of the data lake, so that not all data association tasks need to be executed in the computing layer. This solves the technical problem in the related art that associating real-time data streams in the computing layer causes computing tasks to cache an extremely large amount of data.

The embodiment of the application provides a multi-data-stream parallel processing device, and fig. 4 is a schematic structural diagram of the device. As shown in fig. 4, the apparatus includes: a first processing module 40, configured to determine task configuration information of a target task flow, where the target task flow includes at least one data flow, and the task configuration information includes an association mode between the data flows in the target task flow, data characteristic information of each data flow in the target task flow, and a data table type of the data flow; a second processing module 42, configured to perform a first type of operation on the data stream in the target task stream according to the task configuration information in the pre-writing batch phase of the target task stream, where the first type of operation includes at least one of: data deduplication and data association; and a third processing module 44, configured to perform persistent disk-drop processing on the data stream in the target task stream according to the task configuration information in the data write merging stage of the target task stream.

In some embodiments of the present application, the association mode includes one of: normal write mode, wide table connection mode, data stream JOIN mode.

In some embodiments of the present application, the task configuration information further includes parallelism; the step of the first processing module 40 determining configuration information of the target task flow includes: acquiring a preset parallelism parameter; or determining the number of data sources associated with the target task flow; and configuring parallelism parameters of the target task flows according to the number of the data sources.
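The two ways of obtaining the parallelism parameter can be sketched as below. The mapping of one parallel slot per associated data source is an assumption; the text only states that parallelism is configured according to the number of data sources. The function name and the source names in the usage lines are hypothetical.

```python
from typing import Optional, Sequence


def resolve_parallelism(preset: Optional[int], data_sources: Sequence[str]) -> int:
    """Use the preset parallelism if given, otherwise derive it from the data sources."""
    if preset is not None:
        if preset < 1:
            raise ValueError("parallelism must be at least 1")
        return preset
    # Assumed rule: one parallel slot per associated data source, never below 1.
    return max(1, len(data_sources))


explicit = resolve_parallelism(8, ["orders", "users"])
derived = resolve_parallelism(None, ["orders", "users", "payments"])
```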

In some embodiments of the present application, the step of executing, by the second processing module 42, the first type of operation on the data stream in the target task stream according to the task configuration information in the case that the association mode is the wide table connection mode includes: determining whether target data with the same key value exists in the target task stream; determining whether the target data with the same key value belong to the same data stream in the case that the target data with the same key value is determined to exist; updating the data stream to which the target data belong in the case that the target data with the same key value belong to the same data stream; and performing data splicing processing on the data streams to which the target data belong in the case that the target data with the same key value do not belong to the same data stream.

In some embodiments of the present application, the step of executing, by the second processing module 42, the first type of operation on the data stream in the target task stream according to the task configuration information in the case that the association mode is the data stream JOIN mode includes: determining whether target data with the same key value exists in the target task stream; in the case that the target data with the same key value are determined to exist, determining whether the target data with the same key value belong to the same data table; under the condition that the target data corresponds to the same data table, the target data with later data generation time is reserved; and under the condition that the data do not correspond to the same data table, performing data stream JOIN processing on the data stream corresponding to the data.

In some embodiments of the present application, the step of performing, by the second processing module 42, data stream JOIN processing on the data streams corresponding to the data includes: determining a JOIN order between the data streams, and associating the data streams according to the JOIN order; when the next association in the JOIN order uses Inner JOIN or Right JOIN and the JOIN association fails, returning the target data and marking it as hidden data; and when the next association in the JOIN order uses Left JOIN, returning the target data, and adding a hidden mark to the data in the data streams that does not participate in the association.

In some embodiments of the present application, in the case that the association mode is the data stream JOIN mode, the step of performing, by the third processing module 44, persistent dropped-disk processing on the data stream in the target task stream according to the task configuration information includes: determining write merging logic corresponding to table type information of data streams, wherein the write merging logic comprises a write merging mode for a first type of data stream and a second type of data stream, the first type of data stream is a data stream with a hidden mark in corresponding data, and the second type of data stream is a data stream without the hidden mark in the corresponding data; and performing persistent disc dropping processing on the data stream according to the write merging logic corresponding to the data stream.

In some embodiments of the present application, in the case that the association mode is the wide table connection mode, the step of performing, by the third processing module 44, persistent disk-drop processing on the data stream in the target task stream according to the task configuration information includes: determining target data with the same key value; determining the generation time of the target data, and determining the target columns in which the most recently generated target data is located; and overwriting the historical data with the target columns, and performing persistent disk-drop processing on the associated data of the data streams after the overwrite.

In some embodiments of the present application, after the step of performing persistent disk-dropping processing on the data stream in the target task stream according to the task configuration information, the multi-data stream parallel processing device is further configured to: determining the query type of the data query instruction and the table type of a data table corresponding to the data query instruction; and processing the data in the data table according to the query type and the table type.

Note that each module in the above-described multi-data-stream parallel processing apparatus may be a program module (for example, a set of program instructions for implementing a specific function) or a hardware module. In the latter case, the modules may take, but are not limited to, the following forms: each module is implemented by its own processor, or the functions of the modules are implemented by a single processor.

According to an embodiment of the present application, there is provided a nonvolatile storage medium. The nonvolatile storage medium stores a program, wherein, when the program runs, the device in which the nonvolatile storage medium is located is controlled to execute the following multi-data stream parallel processing method: determining task configuration information of a target task flow, wherein the target task flow comprises at least one data flow, and the task configuration information comprises an association mode among the data flows in the target task flow, data characteristic information of each data flow in the target task flow and data table types of the data flows; in the pre-writing batch phase of the target task flow, executing a first type of operation on the data flow in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of the following operations: data deduplication and data association; and in the data writing merging stage of the target task flow, performing persistent disk-drop processing on the data flow in the target task flow according to the task configuration information.

According to an embodiment of the present application, there is provided an electronic device, including a memory and a processor, where the processor is configured to run a program stored in the memory, and the program, when run, executes the following multi-data-stream parallel processing method: determining task configuration information of a target task flow, wherein the target task flow comprises at least one data flow, and the task configuration information comprises an association mode among the data flows in the target task flow, data characteristic information of each data flow in the target task flow and data table types of the data flows; in the pre-writing batch phase of the target task flow, executing a first type of operation on the data flow in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of the following operations: data deduplication and data association; and in the data writing merging stage of the target task flow, performing persistent disk-drop processing on the data flow in the target task flow according to the task configuration information.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division of the units may be a division by logical function, and other divisions are possible in an actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications are also intended to fall within the scope of protection of the present application.

Claims

1. A method for parallel processing of multiple data streams, comprising:

determining task configuration information of a target task flow, wherein the target task flow comprises at least one data flow, the task configuration information comprises an association mode among the data flows in the target task flow, data characteristic information of each data flow in the target task flow and a data table type of the data flow;

in the pre-writing batch phase of the target task flow, executing a first type of operation on the data flow in the target task flow according to the task configuration information, wherein the first type of operation comprises at least one of the following operations: data deduplication and data association;

and in the data writing merging stage of the target task flow, performing persistent disk-drop processing on the data flow in the target task flow according to the task configuration information.

2. The method of parallel processing of multiple data streams according to claim 1, characterized in that the association mode comprises one of: normal write mode, wide table connection mode, data stream JOIN mode.

3. The method according to claim 2, wherein, in the case that the association mode is the wide table connection mode, the step of performing a first type of operation on the data stream in the target task stream according to the task configuration information includes:

determining whether target data with the same key value exists in the target task flow;

determining whether the target data with the same key value belong to the same data stream or not under the condition that the target data with the same key value is determined to exist;

updating the data stream to which the target data belongs under the condition that the target data with the same key value belongs to the same data stream;

and executing data splicing processing on the data stream to which the target data belongs under the condition that the target data with the same key value does not belong to the same data stream.

4. The method according to claim 2, wherein, in the case that the association mode is the data stream JOIN mode, the step of performing a first type of operation on the data stream in the target task stream according to the task configuration information includes:

determining whether the target data with the same key value belong to the same data table or not under the condition that the target data with the same key value is determined to exist;

under the condition that the target data corresponds to the same data table, reserving the target data with later data generation time;

and under the condition that the data does not correspond to the same data table, carrying out data stream JOIN processing on the data stream corresponding to the data.

5. The method for parallel processing of multiple data streams according to claim 4, wherein said step of performing data stream JOIN processing on the data stream corresponding to the data comprises:

determining a JOIN order between the data streams, and associating the data streams according to the JOIN order;

when the next association in the JOIN order uses Inner JOIN or Right JOIN and the JOIN association fails, returning the target data and marking it as hidden data;

and when the next association in the JOIN order uses Left JOIN, returning the target data, and adding a hidden mark to the data in the data stream that does not participate in the association.

6. The method for parallel processing of multiple data streams according to claim 5, wherein in the case that the association mode is the data stream JOIN mode, the step of performing persistent disk-drop processing on the data stream in the target task stream according to the task configuration information includes:

determining write merging logic corresponding to table type information of the data stream, wherein the write merging logic comprises a write merging mode for a first type data stream and a second type data stream, the first type data stream is a data stream with the hidden mark in corresponding data, and the second type data stream is a data stream without the hidden mark in corresponding data;

and performing persistent disc-dropping processing on the data stream according to the write-in merging logic corresponding to the data stream.

7. The method for parallel processing of multiple data streams according to claim 2, wherein, in the case that the association mode is the wide table connection mode, the step of performing persistent disk-dropping processing on the data stream in the target task stream according to the task configuration information includes:

determining target data with the same key value;

determining the generation time of the target data, and determining a target column in which the latest generated target data is located;

and overwriting historical data with the target column, and performing persistent disk-drop processing on the data associated with the data stream after the overwrite.

8. The multi-data stream parallel processing method according to claim 1, wherein the task configuration information further includes parallelism; the step of determining the configuration information of the target task flow comprises the following steps:

acquiring a preset parallelism parameter; or,

determining the number of data sources associated with the target task flow; and configuring the parallelism parameter of the target task flow according to the number of the data sources.

9. The multiple data stream parallel processing method according to claim 1, wherein after the step of performing persistent disk-drop processing on the data stream in the target task stream according to the task configuration information, the multiple data stream parallel processing method further comprises:

determining the query type of a data query instruction and the table type of a data table corresponding to the data query instruction;

and processing the data in the data table according to the query type and the table type.

10. A multi-data stream parallel processing apparatus, comprising:

the first processing module is used for determining task configuration information of a target task flow, wherein the target task flow comprises at least one data flow, the task configuration information comprises an association mode among the data flows in the target task flow, data characteristic information of each data flow in the target task flow and data table types of the data flows;

the second processing module is used for executing a first type of operation on the data stream in the target task stream according to the task configuration information in a pre-writing batch stage of the target task stream, wherein the first type of operation comprises at least one of the following operations: data deduplication and data association;

and the third processing module is used for performing persistent disk-drop processing on the data stream in the target task stream according to the task configuration information in the data writing merging stage of the target task stream.

11. A non-volatile storage medium, wherein a program is stored in the non-volatile storage medium, and wherein the program, when executed, controls a device in which the non-volatile storage medium is located to perform the multi-data stream parallel processing method according to any one of claims 1 to 9.

12. An electronic device, comprising: a memory and a processor for running a program stored in the memory, wherein the program is run to perform the multi-data stream parallel processing method according to any one of claims 1 to 9.