CN113472904A - Big data engine unified access system and method
Big data engine unified access system and method
- Publication number
- CN113472904A (application CN202111033433.3A)
- Authority
- CN
- China
- Prior art keywords: data service, engine, service request, data, computing
- Prior art date: 2021-09-03
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/562—Brokering proxy services
- H04L67/566—Grouping or aggregating service requests, e.g. for unified processing
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Storage Device Security (AREA)
Abstract
The invention discloses a big data engine unified access system and method, comprising: interfacing with and managing computing engines of different frameworks, wherein engine management adapts the computing engines of different frameworks to the creation and reclamation of engines of multiple versions; receiving and sending a user's data service request; acquiring the data service request and distributing it to a computing engine for processing through unified routing; and parsing the data service request, selecting a computing engine suited to processing it, and returning the result set of the computing engine's data service response after processing is finished. The invention has the following beneficial effects: by interfacing with and managing computing engines of different frameworks, system complexity is reduced, and selecting a suitable computing engine for each type of data service request improves the operating efficiency of the system and reduces its resource consumption; a standardized service interface is also provided, which lowers the difficulty of using big data services.
Description
Technical Field
The invention relates to the technical field of big data, in particular to a big data engine unified access system and a big data engine unified access method.
Background
With the popularization and application of artificial intelligence and big data technology, a rich set of computing and storage engines has emerged for different scenarios, such as distributed computing and storage frameworks like Apache Hive, Spark, HBase and Impala. At present, multiple open-source middleware components are typically introduced to meet business requirements, so the big data platform architecture keeps growing richer. However, as data applications, tool systems, and underlying computing and storage components multiply, the challenges of scalability, ease of use and access complexity for the overall data platform grow exponentially.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a big data engine unified access system and method that can aggregate multiple big data engines, provide a standardized service interface externally, reduce system complexity, and improve scalability and usability.
The invention provides a big data engine unified access system, which comprises a server side, wherein the server side comprises:
the data service SDK, used for receiving a user's data service request, sending the data service request to the computing engine, and acquiring the result set of the computing engine's data service response;
the unified routing access layer, used for receiving the data service request sent by the data service SDK, identifying the back-end service according to a standardized URL, performing unified routing distribution, and sending the data service request to a computing engine for processing;
the computing engine access layer, used for parsing the data service requests distributed by the unified routing access layer and selecting a computing engine suited to processing each request according to its type, the computing engine returning a response result according to the data service request;
and the engine manager, used for selecting the engine manager corresponding to each of the interfaced computing engines of different frameworks, the engine managers of different computing engines adapting to the creation and reclamation of multi-version engines.
As a further improvement of the invention, the data service SDK supports both asynchronous and synchronous response result sets, and operations on the result sets include: iterative output, file download, and returning the result set address.
As a further improvement of the invention, the unified routing access layer further comprises the steps of performing authority verification, access audit and parameter validity check on the data service request sent by the data service SDK.
As a further improvement of the present invention, the computing engine access layer further includes performing parameter validity check and data authority check on the parsed data service request.
As a further improvement of the present invention, the system further comprises a service registry, and different data service requests are registered and discovered by using the service registry.
As a further improvement of the invention, the system creates and starts different data service processes according to different data service requests, and the different data service processes communicate with one another through an RPC engine.
As a further improvement of the present invention, the parsing of the data service request by the computing engine access layer includes analyzing the data service request to identify the data source information on which it depends; the computing engine access layer then makes a prediction based on the data source information and selects a computing engine suited to processing the data service request.
As a further improvement of the invention, the system adopts containerization for resource management, including the following: the system selects a computing engine suited to processing each data service request according to its type, the computing engine applies for containerized resources by estimating the resource consumption required to respond to the data service request, and the resources are automatically reclaimed after the response is finished.
As a further improvement of the invention, the system further comprises a client used for permission verification and for submitting data service requests, the client performing permission verification and control by periodically obtaining the server-side configuration; a data service request first passes permission verification at the client and is then sent to the server side, and the server side re-verifies the server-side file permissions according to the data service request.
The invention also provides a big data engine unified access method, which comprises the following steps:
submitting and acquiring a plurality of data service requests;
respectively analyzing the data service requests and acquiring data source information on which each data service request depends;
and selecting a calculation engine suitable for processing the data service request according to the data source information of the data service request, distributing a plurality of data service requests to the corresponding calculation engines suitable for processing the data service requests through uniform routing distribution for processing, and returning a service response result.
As a further improvement of the invention, the method performs unified management on the computing engines of different frameworks, and comprises the following steps: an engine manager corresponding to the compute engine is selected, the engine managers of different compute engines adapting to multi-version engine creation and reclamation.
As a further improvement of the invention, the method creates and starts different data service processes according to different data service requests, and different data service processes are communicated by adopting an RPC engine.
As a further improvement of the present invention, the method further comprises: selecting a computing engine suitable for processing the data service request according to the data service requests of different types, performing containerized application by the computing engine through predicting resource consumption required by responding to the data service request, and automatically recovering resources after the data service request response is finished.
The invention has the following beneficial effects: by interfacing with and managing computing engines of different frameworks, system complexity is reduced, and selecting a suitable computing engine for each type of data service request improves the operating efficiency of the system and reduces its resource consumption; combining the big data computing engines with containerization for resource application, release and environment isolation solves the problem of isolating big data resources, permissions and operating environments. Meanwhile, a standardized service interface is provided externally, so a user can process and compute data on the big data platform without mastering complex big data computing engines, which lowers the threshold for using big data and facilitates the popularization of big data technology.
Drawings
Fig. 1 is a schematic diagram illustrating a system configuration of a big data engine unified access system according to an embodiment of the present invention;
fig. 2 is a system framework diagram of a big data engine unified access system according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a unified big data engine access method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that, if directional indications (such as up, down, left, right, front, back, etc.) are involved in the embodiments of the present invention, the directional indications are only used to explain the relative positional relationship, movement and the like between components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, in the description of the present invention, the terms used are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The terms "comprises" and/or "comprising" specify the presence of elements, steps, operations and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations and/or components. The terms "first", "second" and the like may be used to describe various elements, do not necessarily denote order, and do not limit those elements; they are only used to distinguish one element from another. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. These and other aspects will become apparent to those of ordinary skill in the art from the following drawings and the description of the embodiments. The drawings are used only for the purpose of illustrating embodiments of the disclosure. One skilled in the art will readily recognize from the following description that alternative embodiments of the illustrated structures and methods may be employed without departing from the principles of the present disclosure.
As shown in figs. 1-2, the big data engine unified access system according to an embodiment of the present invention includes a server side, which comprises a data service SDK, a unified routing access layer, a computing engine access layer and an engine manager.
The data service SDK provides a standardized interface for receiving a user's data service request, sending the data service request to the computing engine, and obtaining the result set of the computing engine's data service response.
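The patent describes the data service SDK only functionally and publishes no code; the following is a minimal Java sketch, with hypothetical type names (DataServiceClient, DataServiceRequest, ResultSetHandle), of what such a standardized submission interface could look like.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical request envelope: an SQL-like statement plus optional parameters. */
record DataServiceRequest(String statement, Map<String, String> params) {}

/** Opaque handle to the compute engine's response; result-set operations are sketched later. */
interface ResultSetHandle {}

/** Hypothetical client-side entry point of the standardized data service SDK. */
interface DataServiceClient {

    /** Submit a request and block until the compute engine responds (synchronous mode). */
    ResultSetHandle submitSync(DataServiceRequest request);

    /** Submit a request and receive the response asynchronously. */
    CompletableFuture<ResultSetHandle> submitAsync(DataServiceRequest request);
}
```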
The unified routing access layer is used for receiving the data service request sent by the data service SDK, identifying the back-end service according to a standardized URL, performing unified routing distribution, and sending the data service request to a computing engine for processing; the unified routing access layer provides API support, and the data service SDK sends data to the unified routing access layer through this API.
The computing engine access layer is used for parsing the data service requests distributed by the unified routing access layer and selecting a computing engine suited to processing each request according to its type; the computing engine returns a response result set according to the data service request, and the computing engine access layer waits asynchronously or synchronously for the returned result set and then outputs it in a standardized manner.
The engine manager is used to interface with the computing engines of different frameworks, for example Apache Hive, Spark, HBase and Impala, selecting the engine manager corresponding to each computing engine; the engine managers of different computing engines adapt to the creation and reclamation of multi-version engines. The computing engines support multiple components, such as Apache Hive, Spark, HBase, Impala and JDBC, and the same component supports multiple versions, for example Apache Hive versions 1.0 and 2.0. The engine manager may exist in the form of a plug-in: whether the user selects different components or different versions of the same component, the engine manager loads and initializes the corresponding plug-in. Through this plug-in style of engine management, computing engines of different versions can provide services at the same time. Between the engine manager and the engine, the communication protocol can be abstractly encapsulated, so only a common class library (Common Jar) needs to be introduced.
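A plug-in style engine manager of the kind described above could be sketched as follows; the EngineManagerPlugin contract and the use of java.util.ServiceLoader are assumptions made for illustration, not details from the patent.

```java
import java.util.ServiceLoader;

/** Hypothetical plug-in contract: one implementation per engine component/version pair. */
interface EngineManagerPlugin {
    String component();                        // e.g. "hive", "spark", "hbase", "impala", "jdbc"
    String version();                          // e.g. "1.0", "2.0"
    AutoCloseable createEngine();              // engine creation
    void reclaimEngine(AutoCloseable engine);  // engine reclamation
}

/** Loads and initializes the plug-in matching the requested component and version. */
class EngineManagerRegistry {
    EngineManagerPlugin select(String component, String version) {
        for (EngineManagerPlugin plugin : ServiceLoader.load(EngineManagerPlugin.class)) {
            if (plugin.component().equalsIgnoreCase(component)
                    && plugin.version().equals(version)) {
                return plugin;
            }
        }
        throw new IllegalArgumentException(
                "no engine manager plug-in for " + component + " " + version);
    }
}
```

With this shape, supporting a new engine version only means packaging another plug-in implementation on the classpath, which matches the text's claim that different versions can serve requests simultaneously.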
In an alternative embodiment, the data service SDK obtains the result set of the computing engine's data service response and supports both asynchronous and synchronous response result sets; operations on the result set include iterative output, file download, and returning the result set address. Depending on the situation, responding with the result set in different modes can improve the response speed.
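As an illustration of the three result-set operations named above, a hypothetical Java interface might look like this; the type and method names are assumptions, not part of the patent.

```java
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

/** Hypothetical view of the three result-set operations named in the text. */
interface DataServiceResultSet {

    /** Iterative output: stream rows one by one, e.g. for small results consumed in-process. */
    Iterator<List<Object>> rows();

    /** File download: materialize the whole result as a file, e.g. for large exports. */
    Path downloadTo(Path targetDirectory);

    /** Return only the address of the stored result set (for example an HDFS path). */
    String resultAddress();
}
```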
In an optional implementation, the unified routing access layer provides API support; the data service SDK sends a data service request to the unified routing access layer, which performs permission verification, access auditing, parameter validity checks and the like on the request, identifies the back-end service according to the standardized URL, performs unified routing distribution, and sends the data service request to the computing engine for processing. Performing permission verification, access auditing, parameter validity checks and similar operations in the unified routing access layer allows the user's data service requests to be managed effectively.
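The routing step could be sketched roughly as follows; the UnifiedRouter and BackendService names and the URL-prefix convention are assumptions made purely for illustration, and the checks are reduced to placeholders.

```java
import java.util.Map;

/** Hypothetical unified routing access layer: checks first, then dispatch by URL prefix. */
class UnifiedRouter {

    /** Maps a standardized URL prefix, e.g. "/hive", to a back-end service. */
    private final Map<String, BackendService> routes;

    UnifiedRouter(Map<String, BackendService> routes) {
        this.routes = routes;
    }

    Object route(String standardizedUrl, Object request, String user) {
        // Permission verification, access audit and parameter validity checks are named
        // in the text; only a trivial placeholder check is shown here.
        if (user == null || user.isBlank()) {
            throw new SecurityException("permission verification failed");
        }
        // Identify the back-end service from the first segment of the standardized URL.
        String path = standardizedUrl.startsWith("/") ? standardizedUrl.substring(1) : standardizedUrl;
        BackendService backend = routes.get("/" + path.split("/")[0]);
        if (backend == null) {
            throw new IllegalArgumentException("unknown back-end service: " + standardizedUrl);
        }
        return backend.handle(request);
    }

    /** Back-end service abstraction, e.g. the compute engine access layer. */
    interface BackendService {
        Object handle(Object request);
    }
}
```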
In an optional implementation, the computing engine access layer parses the data service request distributed by the unified routing access layer, performs parameter validity and data permission checks on the parsed request, selects a suitable computing engine after the checks pass, and submits the data service request task to the selected computing engine; the computing engine returns a response result set according to the request, and the computing engine access layer waits asynchronously or synchronously for the returned result set and then outputs it in a standardized manner. The parameter validity and data permission checks performed by the computing engine access layer further help manage the user's data service requests effectively.
In an optional implementation, the big data engine unified access system of this embodiment employs a microservice framework that includes a service registry, and different data services are registered and discovered through the service registry. Different data service processes are created and started according to different data service requests, and the different data service processes communicate through an RPC engine; the RPC framework is a declarative Web Service client, which simplifies calls between microservices.
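The patent names only "a declarative Web Service client" for inter-process calls; the sketch below assumes OpenFeign as that client and a hypothetical /engine/{type}/submit endpoint, purely to illustrate the declarative style.

```java
import feign.Feign;
import feign.Param;
import feign.RequestLine;

/** Hypothetical declarative client used by one data service process to call another. */
interface EngineAccessApi {

    /** Submit a data service request to the compute engine access layer. */
    @RequestLine("POST /engine/{type}/submit")
    String submit(@Param("type") String engineType, String requestBody);
}

class EngineAccessClientFactory {

    /** baseUrl would normally be resolved through the service registry. */
    static EngineAccessApi create(String baseUrl) {
        return Feign.builder().target(EngineAccessApi.class, baseUrl);
    }
}
```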
In an alternative embodiment, the computing engine access layer's handling of the data service request includes parsing the request and identifying the data source information on which it depends; the computing engine access layer then makes a prediction based on the data source information, for example according to the storage format and size of the data source and the fields required for the computation, and selects a computing engine suited to processing the data service request according to the prediction result. For example: for column-stored data whose volume lies within a certain range, Impala is faster; when the query range and data volume are large and Spark would consume too many resources, Hive is more appropriate; and frequently accessed data is automatically cached, according to its volume, in the memory of the HBase middleware, which greatly improves computing efficiency.
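As an illustration of the selection heuristic described above, the following sketch hard-codes an assumed size threshold and a hypothetical DataSourceInfo record; none of the concrete values or names come from the patent.

```java
/** Hypothetical summary of the data source a request depends on. */
record DataSourceInfo(String storageFormat, long sizeInBytes, int requiredFieldCount,
                      boolean frequentlyAccessed) {}

/** Picks an engine following the heuristic sketched in the text. */
class EngineSelector {

    String select(DataSourceInfo info) {
        long oneTerabyte = 1L << 40;        // illustrative threshold, not from the patent

        if (info.frequentlyAccessed()) {
            return "hbase-cache";           // hot data is cached in HBase memory
        }
        if ("columnar".equals(info.storageFormat()) && info.sizeInBytes() < oneTerabyte) {
            return "impala";                // fast for moderately sized column-stored data
        }
        if (info.sizeInBytes() >= oneTerabyte) {
            return "hive";                  // large scans where Spark would be too costly
        }
        return "spark";                     // general-purpose default
    }
}
```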
In an alternative implementation, the big data engine unified access system of this embodiment adopts containerization for resource management. There are many types of data query engines and their environment dependencies differ greatly; the resource consumption of a query engine is uncertain, raising problems of resource reclamation, capacity scaling and data isolation. The system of this embodiment therefore adopts containerization for resource management: according to the type of data service request, the computing engine applies for containerized resources by estimating the resource consumption required to respond, for example memory, CPU, I/O and network resources, and the system automatically reclaims the resources after the computing engine finishes processing. Since different computing engines are deployed in different environments, templating can also be used for resource management.
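A rough sketch of the containerized resource application and automatic reclamation described above; the ResourceEstimate fields and the container-platform calls are placeholders, not an API named in the patent.

```java
import java.util.function.Supplier;

/** Hypothetical resource estimate for responding to one data service request. */
record ResourceEstimate(int cpuCores, long memoryBytes, long ioBytes, long networkBytes) {}

class ContainerizedExecutor {

    /** Apply for a container sized to the estimate, run the job, then reclaim resources. */
    <T> T runInContainer(ResourceEstimate estimate, Supplier<T> job) {
        String containerId = requestContainer(estimate);
        try {
            return job.get();
        } finally {
            releaseContainer(containerId);  // automatic reclamation after the response
        }
    }

    private String requestContainer(ResourceEstimate estimate) {
        // Placeholder: a real system would call the container platform's API here.
        return "container-" + System.nanoTime();
    }

    private void releaseContainer(String containerId) {
        // Placeholder for returning the container's resources to the pool.
    }
}
```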
In an optional implementation, the big data engine unified access system of this embodiment further comprises a client used for permission verification and for submitting data service requests. The system adopts a microservice framework, integrates service disaster tolerance and load-balancing middleware, and can use an HTTP client with load balancing. The client performs permission verification and control by periodically obtaining the server-side configuration. The system can bind four roles: user, data group account, function group account and data resource; since most big data distributed storage middleware uses Linux users for permission verification, the system can initialize a Linux user according to the data group account. A data service request first passes permission verification at the client, which judges whether the data read/write permission exists, and is then sent to the server side; the server side re-verifies the server-side file permissions according to the data service request. This dual permission verification based on group accounts makes permission authentication rigorous and improves the security of the system.
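The dual client/server permission check could look roughly like the following sketch; the class names, the shape of the periodically refreshed configuration, and the file check are illustrative assumptions rather than the patent's implementation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

/** Hypothetical client-side check driven by periodically fetched server configuration. */
class ClientPermissionChecker {

    private volatile Set<String> writableGroups = Set.of();  // refreshed on a timer

    void refresh(Set<String> groupsFromServerConfig) {
        this.writableGroups = groupsFromServerConfig;
    }

    boolean mayWrite(String dataGroupAccount) {
        return writableGroups.contains(dataGroupAccount);
    }
}

/** Hypothetical server-side re-check of file permissions for the same request. */
class ServerPermissionChecker {

    boolean mayAccess(String linuxUser, Path file) {
        // Placeholder: a real check would consult the storage layer's ACLs for the
        // Linux user initialized from the data group account; here we only test
        // readability from the current process.
        return Files.isReadable(file);
    }
}
```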
The invention also provides a big data engine unified access method. As shown in fig. 3, the method comprises: submitting and acquiring a plurality of data service requests; parsing each data service request and acquiring the data source information on which it depends; selecting a computing engine suited to processing each data service request according to its data source information; distributing the data service requests through unified routing distribution to the corresponding suitable computing engines for processing; and returning the service response result set.
In an alternative embodiment, the computing engines support multiple components, such as Apache Hive, Spark, HBase, Impala and JDBC, and the same component supports multiple versions, for example Apache Hive versions 1.0 and 2.0. For a submitted data service request, handling the request includes parsing it, identifying the data source information on which it depends, and making a prediction based on that information, for example according to the storage format and size of the data source and the fields required for the computation; a computing engine suited to processing the data service request is then selected according to the prediction result. For example: for column-stored data whose volume lies within a certain range, Impala is faster; when the query range and data volume are large and Spark would consume too many resources, Hive is more appropriate; and frequently accessed data is automatically cached, according to its volume, in the memory of the HBase middleware, which greatly improves computing efficiency.
In an optional implementation, data service requests can be submitted and acquired through the standardized data service SDK; the data service SDK sends the requests to the unified routing access layer for unified routing distribution, and the requests are then sent to the computing engine access layer. The computing engine access layer parses the data service requests distributed by the unified routing access layer, selects a computing engine suited to processing each request according to its type, returns a response result set according to the request, and waits asynchronously or synchronously for the returned result set before outputting it in a standardized manner. The data service SDK obtains the result set of the computing engine's data service response and supports both asynchronous and synchronous response result sets; operations on the result set include iterative output, file download, and returning the result set address. Depending on the situation, responding with the result set in different modes can improve the response speed.
Further, the data service SDK sends a data service request to the unified routing access layer, which performs permission verification, access auditing, parameter validity checks and the like on the request. The computing engine access layer parses the data service request distributed by the unified routing access layer, performs parameter validity and data permission checks on the parsed request, selects a suitable computing engine after the checks pass, and submits the data service request task to the selected computing engine; the parameter validity and data permission checks performed by the computing engine access layer provide further validity authentication of the data service request.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Furthermore, those of ordinary skill in the art will appreciate that, although some embodiments described herein include certain features that other embodiments do not, combinations of features from different embodiments are meant to be within the scope of the invention and to form further embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It will be understood by those skilled in the art that while the present invention has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.
Claims (10)
1. A big data engine unified access system is characterized in that the system comprises a server side, and the server side comprises:
the data service SDK is used for receiving a data service request of a user, sending the data service request of the user to the computing engine and acquiring a result set of data service response of the computing engine;
the uniform routing access layer is used for receiving a data service request sent by the data service SDK, identifying back-end service according to a standardized URL, uniformly distributing routing, and sending the data service request to a computing engine for processing;
the computing engine access layer is used for analyzing the data service requests distributed by the unified routing access layer and selecting a computing engine suitable for processing the data service requests according to different types of data service requests, and the computing engine returns response results according to the data service requests;
and the engine manager is used for selecting the engine manager corresponding to each of the interfaced computing engines of different frameworks, the engine managers of different computing engines adapting to the creation and reclamation of multi-version engines.
2. The system of claim 1, wherein the data service SDK supports both asynchronous and synchronous response result sets, and wherein operations on the result sets comprise: iteratively outputting, downloading the file, and returning the result set address.
3. The system of claim 1, wherein the unified routing access layer further comprises performing authority verification, access audit and parameter validity check on the data service request sent by the data service SDK.
4. The system of claim 1, wherein the compute engine access layer further comprises performing a parameter validity check and a data permission check on the parsed data service request.
5. The system of claim 1, further comprising a service registry, wherein different data service requests use the service registry for registration and discovery of data services.
6. The system of claim 1, wherein the system creates and starts different data service processes according to different data service requests, and the different data service processes communicate with each other by using an RPC engine.
7. The system of claim 1, wherein the computing engine access layer interpreting the data service request comprises parsing the data service request to identify dependent data source information; and the calculation engine access layer predicts according to the data source information and selects a calculation engine suitable for processing the data service request.
8. The system of claim 1, wherein the system employs containerization for resource management, comprising: the system selects a calculation engine suitable for processing the data service request according to the data service requests of different types, the calculation engine carries out containerized application through predicting resource consumption required by responding to the data service request, and resources are automatically recycled after the data service request response is finished.
9. The system of claim 1, further comprising a client for performing the permission verification and submitting the data service request, wherein the permission verification performed by the client is controlled by periodically obtaining the configuration of the server; the data service request is firstly subjected to authority verification through the client side, then the data service request is sent to the server side, and the server side carries out re-verification on the authority of the server side file according to the data service request.
10. A big data engine unified access method is characterized by comprising the following steps:
submitting and acquiring a plurality of data service requests;
respectively analyzing the data service requests and acquiring data source information on which each data service request depends;
and selecting a calculation engine suitable for processing the data service request according to the data source information of the data service request, distributing a plurality of data service requests to the corresponding calculation engines suitable for processing the data service request through uniform routing distribution for processing, and returning a service response result after the processing is finished.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111033433.3A CN113472904A (en) | 2021-09-03 | 2021-09-03 | Big data engine unified access system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113472904A (en) | 2021-10-01 |
Family
ID=77867359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111033433.3A (CN113472904A, Pending) | Big data engine unified access system and method | 2021-09-03 | 2021-09-03 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113472904A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112256414A (en) * | 2020-10-19 | 2021-01-22 | 浪潮天元通信信息系统有限公司 | Method and system for connecting multiple computing storage engines |
CN112307396A (en) * | 2020-10-21 | 2021-02-02 | 五凌电力有限公司 | Platform Architecture and Processing Method Based on Multi-Engine Data Modeling and Computational Analysis |
US20210264372A1 (en) * | 2018-07-04 | 2021-08-26 | Imi Material Handling Logistics Inc. | Automated human resources management and engagement system and method |
- 2021: 2021-09-03, CN, application CN202111033433.3A, publication CN113472904A (en), status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20211001 |