CN112286961B

CN112286961B - SQL optimization query method and device

Info

Publication number: CN112286961B
Application number: CN202011048287.7A
Authority: CN
Inventors: 卢勇亮; 郑尧峰; 汤国强; 桑强; 孙迁
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: Jiangsu Suning Cloud Computing Co ltd; SuningCom Co ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-11-18
Anticipated expiration: 2040-09-29
Also published as: CA3132004A1; CN112286961A

Abstract

The invention discloses an SQL optimization query method and device, relates to the technical field of big data, and can avoid the starting of a fusing mechanism and improve the query efficiency. The method comprises the following steps: when the SQL query statement is analyzed to be a UNION set type query statement, counting the number of storage segments related to target data, wherein the UNION set type query statement comprises a plurality of sub query statements; when the number of related storage segments is larger than a threshold value, splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles, summarizing target data dimensions corresponding to the sub-query statements, and adding the summarized target data dimensions as new conditions into the filtering conditions of the sub-query statement; and pushing the sub-query statements containing the filter conditions down to a computing engine, reading the target data corresponding to each granularity period and the dimension value in the target data, and returning a query result to the user after the target data is computed by the processing end. The device is applied with the method provided by the scheme.

Description

SQL optimization query method and device

技术领域technical field

本发明涉及大数据技术领域，尤其涉及一种SQL优化查询方法及装置。The present invention relates to the technical field of big data, in particular to an SQL optimization query method and device.

背景技术Background technique

随着业务数据量的不断增加，流量模型摄入到Druid引擎中datasource的数据量巨大，每次SQL进行同环比查询时，内层的union all子句需要去查SQL指定周期内的所有数据。而为了保护druid引擎对其他任务的有效查询，druid引擎会判断当前SQL周期内的segment总数，并在segment总数超过阈值时启动熔断机制，即便未超过阈值，采用全量查询的方式也会增加查询时间，降低了查询效率。As the amount of business data continues to increase, the traffic model ingests a huge amount of data into the datasource of the Druid engine. Every time SQL performs a year-on-year query, the inner union all clause needs to check all the data within the specified period of SQL. In order to protect the effective query of other tasks by the druid engine, the druid engine will judge the total number of segments in the current SQL cycle, and start the fuse mechanism when the total number of segments exceeds the threshold. Even if the threshold is not exceeded, the full query method will also increase the query time , reducing the query efficiency.

发明内容Contents of the invention

本发明的目的在于提供一种SQL优化查询的方法及装置，能够避免熔断机制的启动，同时提升查询效率。The purpose of the present invention is to provide a method and device for SQL optimization query, which can avoid the activation of the fuse mechanism and improve the query efficiency at the same time.

为了实现上述目的，本发明的第一方面提供一种SQL优化查询的方法，包括：In order to achieve the above object, the first aspect of the present invention provides a method for SQL optimization query, including:

解析SQL查询语句为UNION集合类型的查询语句时，统计目标数据所涉及存储片段的数量，所述UNION集合类型的查询语句包括多个子查询语句；When parsing the SQL query statement as the query statement of the UNION set type, counting the number of storage segments involved in the target data, the query statement of the UNION set type includes a plurality of sub-query statements;

在涉及的所述存储片段的数量大于阈值时，将所述子查询语句对应的查询周期拆分成多个粒度周期，以及将多个子查询语句对应的目标数据维度汇总后作为新增条件加入所述子查询语句的过滤条件中；When the number of the storage segments involved is greater than the threshold, the query period corresponding to the subquery statement is split into multiple granularity periods, and the target data dimensions corresponding to the multiple subquery statements are summarized and added as new conditions to all In the filter condition of the subquery statement;

将包含所述过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果。Push down the sub-query statement containing the filter condition to the calculation engine and concurrently read the target data corresponding to each granularity period and the dimension value therein, and return the query result to the user after calculation by the processing end.

优选地，在所述存储片段的数量大于阈值时，将所述子查询语句对应的查询周期拆分成多个粒度周期的方法包括：Preferably, when the number of stored segments is greater than a threshold, the method for splitting the query cycle corresponding to the sub-query statement into multiple granularity cycles includes:

提取每个所述子查询语句对应的查询周期，将各查询周期按照预设粒度平均拆分成多个粒度周期。The query period corresponding to each sub-query statement is extracted, and each query period is evenly split into multiple granularity periods according to the preset granularity.

较佳地，将多个子查询语句对应的目标数据维度汇总后作为新增条件加入所述子查询语句的过滤条件中的方法包括：Preferably, the method of summarizing the target data dimensions corresponding to multiple sub-queries and adding them as new conditions to the filter conditions of the sub-queries includes:

识别每个子查询语句对应的目标数据维度，当各所述子查询语句的目标数据维度一致，则取任一所述子查询语句的目标数据维度作为新增条件加入每个所述子查询语句的过滤条件中；Identify the corresponding target data dimension of each sub-query statement, when the target data dimensions of each sub-query statement are consistent, then get the target data dimension of any one of the sub-query statements as a new condition and add each of the sub-query statements in the filter condition;

当各所述子查询语句的目标数据维度不一致，则分别将各所述子查询语句的目标数据维度汇总合并后作为新增条件加入每个所述子查询语句的过滤条件中。When the target data dimensions of each of the sub-query statements are inconsistent, the target data dimensions of each of the sub-query statements are aggregated and combined as new conditions and added to the filter conditions of each of the sub-query statements.

优选地，所述UNION集合类型的查询语句至少包括第一周期子查询语句和第二周期子查询语句，所述过滤条件除所述新增条件之外还包括查询度量、维度排序和目标数据查询数量。Preferably, the query statement of the UNION set type includes at least a first period subquery statement and a second period subquery statement, and the filter conditions include query metrics, dimension sorting and target data query in addition to the newly added conditions quantity.

较佳地，计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值的方法包括：Preferably, the method for the calculation engine to concurrently read the target data corresponding to each granularity period and the dimension values therein includes:

计算引擎根据每个子查询语句及其中的过滤条件，分别从相应数据源的对应存储片段中查询符合所述目标数据查询数量条件的目标数据并提取其中的维度值。According to each sub-query statement and the filtering conditions therein, the calculation engine respectively queries the target data meeting the target data query quantity condition from the corresponding storage segment of the corresponding data source and extracts the dimension value therein.

进一步地，经处理端计算后向用户返回查询结果的方法包括：Further, the method of returning query results to the user after being calculated by the processing terminal includes:

处理端根据各子查询语句返回的所述维度值，按照预设规则计算后向用户返回查询结果。The processing end returns query results to the user after calculating according to the dimension values returned by each sub-query statement according to preset rules.

优选地，在步骤将包含所述过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果之后还包括：Preferably, in the step, the sub-query statement containing the filter condition is pushed down to the calculation engine and concurrently reads the target data corresponding to each granularity period and the dimension value therein, and returns the query result to the user after the processing terminal calculates. include:

统计历史SQL查询语句中的热点数据源及其中的热点数据维度，并对热点数据维度对应的维值进行预提取，以备后续SQL查询语句的快速调用。Count the hot data sources and the hot data dimensions in the historical SQL query statements, and pre-extract the dimension values corresponding to the hot data dimensions for quick invocation of subsequent SQL query statements.

将每个子查询语句及对应的目标数据和其中的维度值在数据库中进行缓存，以在后续出现相同子查询语句时能够从数据库中快速调用。Each subquery statement, corresponding target data and dimension values therein are cached in the database, so that the same subquery statement can be quickly called from the database when the same subquery statement appears subsequently.

与现有技术相比，本发明提供的SQL优化查询的方法具有以下有益效果：Compared with the prior art, the method for SQL optimization query provided by the present invention has the following beneficial effects:

本发明提供的SQL优化查询方法中，在计算引擎根据SQL查询语句查询目标数据之前需对目标数据所涉及存储片段的数量进行预统计，当涉及的存储片段数量大于阈值时，若直接进行全量查询可能会触发计算引擎的熔断机制，故本发明对于此种情况采取的方案是将子查询语句对应的查询周期拆分成多个粒度周期，同时为了避免不同子查询语句对应的目标数据维度不同而导致无法输出查询结果的问题，在查询之前需将多个子查询语句对应的目标数据维度汇总作为新增条件加入每个子查询语句的过滤条件中，上述操作完成之后，最终将包含过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果。In the SQL optimization query method provided by the present invention, before the calculation engine queries the target data according to the SQL query statement, it is necessary to perform pre-statistics on the number of storage segments involved in the target data. When the number of storage segments involved is greater than the threshold, if the full query is directly performed It may trigger the fuse mechanism of the computing engine, so the solution adopted by the present invention for this situation is to split the query cycle corresponding to the sub-query statement into multiple granularity cycles, and at the same time, in order to avoid different target data dimensions corresponding to different sub-query statements The problem that the query results cannot be output. Before querying, it is necessary to summarize the target data dimensions corresponding to multiple subquery statements as new conditions and add them to the filter conditions of each subquery statement. After the above operations are completed, the subquery containing the filter conditions will eventually be The statement is pushed down to the calculation engine to concurrently read the target data corresponding to each granularity cycle and the dimension values in it, and return the query result to the user after calculation by the processing end.

可见，本发明不仅解决了在全量查询过程中易触发计算引擎熔断机制的问题，还能保证每个子查询语句对应的目标数据维度统一，确保能够输出查询结果。It can be seen that the present invention not only solves the problem that the fuse mechanism of the computing engine is easily triggered during the full query process, but also ensures that the target data dimensions corresponding to each sub-query statement are unified, and the query results can be output.

本发明的第二方面提供一种SQL优化查询的装置，应用于上述技术方案所述的SQL优化查询方法中，所述装置包括：The second aspect of the present invention provides a device for SQL optimized query, which is applied to the SQL optimized query method described in the above technical solution, and the device includes:

语句解析单元，用于解析SQL查询语句为UNION集合类型的查询语句时，统计目标数据所涉及存储片段的数量，所述UNION集合类型的查询语句包括多个子查询语句；A statement parsing unit, configured to count the number of storage segments involved in the target data when parsing the SQL query statement as a query statement of the UNION set type, the query statement of the UNION set type including a plurality of sub-query statements;

识别处理单元，用于在涉及的所述存储片段的数量大于阈值时，将所述子查询语句对应的查询周期拆分成多个粒度周期，以及将多个子查询语句对应的目标数据维度汇总后作为新增条件加入所述子查询语句的过滤条件中；An identification processing unit, configured to split the query cycle corresponding to the sub-query statement into multiple granularity cycles when the number of the storage segments involved is greater than a threshold, and after summarizing the target data dimensions corresponding to the multiple sub-query statements Adding it as a new condition into the filter condition of the subquery statement;

查询单元，用于将包含所述过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果。The query unit is configured to push down the sub-query statement containing the filter condition to the calculation engine and concurrently read the target data corresponding to each granularity period and the dimension value therein, and return the query result to the user after calculation by the processing terminal.

与现有技术相比，本发明提供的SQL优化查询装置的有益效果与上述技术方案提供的SQL优化查询方法的有益效果相同，在此不做赘述。Compared with the prior art, the beneficial effect of the SQL optimized query device provided by the present invention is the same as that of the SQL optimized query method provided by the above technical solution, and will not be repeated here.

本发明的第三方面提供一种计算机可读存储介质，计算机可读存储介质上存储有计算机程序，计算机程序被处理器运行时执行上述SQL优化查询方法的步骤。A third aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the above-mentioned SQL optimization query method are executed.

与现有技术相比，本发明提供的计算机可读存储介质的有益效果与上述技术方案提供的SQL优化查询方法的有益效果相同，在此不做赘述。Compared with the prior art, the beneficial effect of the computer-readable storage medium provided by the present invention is the same as the beneficial effect of the SQL optimization query method provided by the above technical solution, and will not be repeated here.

附图说明Description of drawings

此处所说明的附图用来提供对本发明的进一步理解，构成本发明的一部分，本发明的示意性实施例及其说明用于解释本发明，并不构成对本发明的不当限定。在附图中：The accompanying drawings described here are used to provide a further understanding of the present invention, and constitute a part of the present invention. The schematic embodiments of the present invention and their descriptions are used to explain the present invention, and do not constitute improper limitations to the present invention. In the attached picture:

图1为本发明实施例中SQL优化查询方法的部分流程示意图；Fig. 1 is a partial flow diagram of the SQL optimization query method in an embodiment of the present invention;

图2为本发明实施例中SQL优化查询方法的全流程示意图；Fig. 2 is a schematic diagram of the whole process of the SQL optimization query method in the embodiment of the present invention;

图3为本发明实施例中SQL优化查询方法的一种示例图。FIG. 3 is an example diagram of an SQL optimization query method in an embodiment of the present invention.

具体实施方式Detailed ways

为使本发明的上述目的、特征和优点能够更加明显易懂，下面将结合本发明实施例中的附图，对本发明实施例中的技术方案进行清楚、完整地描述。显然，所描述的实施例仅仅是本发明一部分实施例，而不是全部的实施例。基于本发明中的实施例，本领域普通技术人员在没有作出创造性劳动的前提下所获得的所有其它实施例，均属于本发明保护的范围。In order to make the above objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the drawings in the embodiments of the present invention. Apparently, the described embodiments are only some of the embodiments of the present invention, but not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

实施例一Embodiment one

请参阅图1，本实施例提供一种SQL优化查询方法，包括：Referring to Fig. 1, the present embodiment provides a SQL optimization query method, including:

解析SQL查询语句为UNION集合类型的查询语句时，统计目标数据所涉及存储片段的数量，UNION集合类型的查询语句包括多个子查询语句；在涉及的存储片段的数量大于阈值时，将子查询语句对应的查询周期拆分成多个粒度周期，以及将多个子查询语句对应的目标数据维度汇总后作为新增条件加入子查询语句的过滤条件中；将包含过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果。When parsing the SQL query statement as a query statement of the UNION collection type, count the number of storage segments involved in the target data. The query statement of the UNION collection type includes multiple sub-query statements; when the number of storage segments involved is greater than the threshold, the sub-query statement The corresponding query cycle is split into multiple granularity cycles, and the target data dimensions corresponding to multiple sub-query statements are summarized and added as new conditions to the filter conditions of the sub-query statement; the sub-query statement containing the filter condition is pushed down to the calculation The engine concurrently reads the target data corresponding to each granularity period and the dimension values in it, and returns the query result to the user after calculation by the processing end.

本实施例提供的SQL优化查询方法中，在计算引擎根据SQL查询语句查询目标数据之前需对目标数据所涉及存储片段的数量进行预统计，当涉及的存储片段数量大于阈值时，若直接进行全量查询可能会触发计算引擎的熔断机制，故本实施例对于此种情况采取的方案是将子查询语句对应的查询周期拆分成多个粒度周期，同时为了避免不同子查询语句对应的目标数据维度不同而导致无法输出查询结果的问题，在查询之前需将多个子查询语句对应的目标数据维度汇总作为新增条件加入每个子查询语句的过滤条件中，上述操作完成之后，最终将包含过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果。In the SQL optimization query method provided in this embodiment, before the calculation engine queries the target data according to the SQL query statement, it is necessary to perform pre-statistics on the number of storage segments involved in the target data. When the number of storage segments involved is greater than the threshold, if the full The query may trigger the fuse mechanism of the computing engine, so the solution adopted in this embodiment is to split the query period corresponding to the subquery statement into multiple granularity periods, and at the same time, in order to avoid the target data dimension corresponding to different subquery statements For the problem that the query results cannot be output because of the difference, before querying, it is necessary to summarize the target data dimensions corresponding to multiple subquery statements as new conditions and add them to the filter conditions of each subquery statement. After the above operations are completed, the filter conditions will eventually be included. The sub-query statement is pushed down to the calculation engine to concurrently read the target data corresponding to each granularity cycle and the dimension values in it, and return the query result to the user after calculation by the processing end.

可见，本实施例不仅解决了在全量查询过程中易触发计算引擎熔断机制的问题，还能保证每个子查询语句对应的目标数据维度统一，确保能够输出查询结果。It can be seen that this embodiment not only solves the problem that the fuse mechanism of the computing engine is easily triggered during the full query process, but also ensures that the target data dimensions corresponding to each sub-query statement are unified, so that the query result can be output.

具体实施时，UNION集合类型的查询语句通常表示为UNION ALL，包括有多个子查询语句，如包括第一周期子查询语句和第二周期子查询语句，过滤条件(limit条件)除新增条件之外还包括查询度量、维度排序和目标数据查询数量等。计算引擎为druid，数据源为datasource，存储片段为segment，处理端为spark端，其中，过滤条件中的目标数据查询数量主要是用于限制计算引擎目标的数据读取数量，为了避免触发熔断机制。In specific implementation, the query statement of the UNION collection type is usually expressed as UNION ALL, which includes multiple subquery statements, such as the first cycle subquery statement and the second cycle subquery statement, and the filter conditions (limit conditions) except the newly added conditions It also includes query measures, dimension sorting, and the number of target data queries. The calculation engine is druid, the data source is datasource, the storage segment is segment, and the processing end is spark. Among them, the number of target data queries in the filter condition is mainly used to limit the number of data read by the calculation engine target, in order to avoid triggering the fuse mechanism .

上述实施例中，在所述存储片段的数量大于阈值时，将子查询语句对应的查询周期拆分成多个粒度周期的方法包括：In the above embodiment, when the number of stored segments is greater than the threshold, the method for splitting the query cycle corresponding to the sub-query statement into multiple granularity cycles includes:

提取每个子查询语句对应的查询周期，将各查询周期按照预设粒度平均拆分成多个粒度周期。The query period corresponding to each sub-query statement is extracted, and each query period is evenly split into multiple granularity periods according to the preset granularity.

示例性地，SQL查询语句由内层语句和外层语句构成，以环比周期查询为例，内层语句包括第一周期子查询语句和第二周期子查询语句，也即当前周期子查询语句和历史周期子查询语句，内层语句的主要作用是从druid中分别读取第一周期子查询语句和第二周期子查询语句所对应的目标数据，外层语句的主要作用是在spark端基于内层语句读取的目标数据维度值做相应的环比计算。Exemplarily, the SQL query statement is composed of an inner statement and an outer statement. Taking the ring cycle query as an example, the inner statement includes the first period subquery statement and the second period subquery statement, that is, the current period subquery statement and The main function of the historical period subquery statement is to read the target data corresponding to the subquery statement of the first period and the subquery statement of the second period respectively from druid. The main function of the outer statement is to The target data dimension value read by the layer statement is calculated accordingly.

具体实施时，请参阅图2，通过解析SQL查询语句获取逻辑计划树，提取出可用于裁剪的元数据信息，例如包括目标数据所涉及的存储片段查询指令、目标数据维度查询指令、目标数据度量查询指令、目标数据周期查询指令等，然后查看该查询周期中目标数据所涉及的存储片段数量是否大于阈值，若不大于阈值可采用现有技术将SQL查询语句的整个子查询下推给druid进行全量查询，若大于阈值可将查询周期平均拆分成多个粒度周期进行并发查询，以此避免触发druid的熔断机制。比如datasource摄入时最小粒度是天，当前SQL查询语句要查询一个月的数据，则将一个月的查询周期拆分成30个粒度周期。接下来整理每个粒度周期对应的SQL查询语句，目的是为了使每个粒度周期对应的SQL查询语句仅保留所需要的目标维度数据，过滤掉其他无关的衍生度量数据。整理好这些粒度周期对应的SQL信息后，将其转化成druid所能接收的json语句，然后并发请求druid索取目标数据，druid查询每个粒度周期对应的目标数据后，会将每个粒度周期对应的维度集合及对应维度值返回，之后合并所有粒度周期返回的数据集合，通过选择排序算法，得到一个不多于目标数据查询数量的数据集合。For specific implementation, please refer to Figure 2, obtain the logical plan tree by parsing the SQL query statement, and extract the metadata information that can be used for clipping, such as storage fragment query instructions involved in the target data, target data dimension query instructions, and target data metrics Query command, target data cycle query command, etc., and then check whether the number of storage segments involved in the target data in the query cycle is greater than the threshold, if not greater than the threshold, the existing technology can be used to push down the entire subquery of the SQL query statement to druid for execution Full query, if it is greater than the threshold, the query period can be split into multiple granularity periods on average for concurrent query, so as to avoid triggering druid's fuse mechanism. For example, when the datasource is ingested, the minimum granularity is days, and the current SQL query statement needs to query data for one month, then the one-month query period is divided into 30 granularity periods. Next, organize the SQL query statements corresponding to each granularity period, so that the SQL query statements corresponding to each granularity period only retain the required target dimension data, and filter out other irrelevant derived measurement data. After sorting out the SQL information corresponding to these granularity periods, convert it into a json statement that druid can receive, and then concurrently request druid to obtain the target data. After druid queries the target data corresponding to each granularity period, it will correspond to each granularity period. The dimension set and the corresponding dimension value are returned, and then the data sets returned by all granularity periods are merged, and a data set that is not more than the number of target data queries is obtained through the selection sorting algorithm.

示例性地，请参阅图3，通过SQL查询语句汇总本月及上月同环比增长结果，设定过滤条件中的目标数据查询数量为500条数据。如果直接在druid进行周期为一个月的全量数据查询，druid大表中一共有2亿条目标数据，本实施例将一个月拆分成30个粒度周期，每个粒度周期对应有大概10万条目标数据，利用druid并发读取本月及上月每个粒度周期所对应的数据集合，由于过滤条件中目标数据查询数量为500，故从每个粒度周期中仅过滤出前500条目标数据，之后再将本月与上月属于同一粒度周期的目标数据汇总成500条目标数据，再利用druid对每个粒度周期的500条目标数据进行扫描，再次过滤后将最终得到500条查询结果返回给用户。相比较于现有技术中采用的全量查询的方案，本实施例仅根据需要设置过滤条件中目标数据查询数量为500条，然后在druid中根据过滤条件仅对排序靠前的500条目标数据进行计算读取，极大的减轻了druid的计算压力。Exemplarily, please refer to Figure 3, the year-on-year growth results of this month and last month are summarized through SQL query statements, and the target data query quantity in the filter condition is set to 500 pieces of data. If you directly perform a full-scale data query with a period of one month in Druid, there are a total of 200 million pieces of target data in the Druid large table. In this embodiment, a month is divided into 30 granularity periods, and each granularity period corresponds to about 100,000 pieces of data. For target data, use druid to concurrently read the data sets corresponding to each granularity cycle of this month and last month. Since the number of target data queries in the filter condition is 500, only the first 500 pieces of target data are filtered out from each granularity cycle, and then Then aggregate the target data belonging to the same granularity cycle of this month and last month into 500 target data, and then use druid to scan the 500 target data of each granularity cycle, and after filtering again, finally get 500 query results and return them to the user . Compared with the full query scheme adopted in the prior art, this embodiment only sets the number of target data queries in the filter condition to 500 according to needs, and then only performs the query on the top 500 target data according to the filter condition in druid. Computational reading greatly reduces the computational pressure of druid.

上述实施例中，将多个子查询语句对应的目标数据维度汇总后作为新增条件加入子查询语句的过滤条件中的方法包括：In the above embodiment, the method of summarizing the target data dimensions corresponding to multiple sub-query statements and adding them as new conditions to the filter conditions of the sub-query statements includes:

识别每个子查询语句对应的目标数据维度，当各子查询语句的目标数据维度一致，则取任一子查询语句的目标数据维度作为新增条件加入每个子查询语句的过滤条件中；当各子查询语句的目标数据维度不一致，则分别将各子查询语句的目标数据维度汇总合并后作为新增条件加入每个子查询语句的过滤条件中。Identify the target data dimension corresponding to each sub-query statement. When the target data dimensions of each sub-query statement are consistent, take the target data dimension of any sub-query statement as a new condition and add it to the filter condition of each sub-query statement; when each sub-query statement If the target data dimensions of the query statements are inconsistent, the target data dimensions of each sub-query statement are aggregated and merged and added as new conditions to the filter conditions of each sub-query statement.

具体实施时，为了防止因第一周期和第二周期对应目标数据维度不一致，导致spark端无法根据druid基于第一周期获取的目标数据和基于第二周期获取的目标数据进行相应运算的问题发生，如第一周期对应的目标数据维度数为100，第二周期对应的目标数据维度数为150，此时需将第一周期和第二周期对应的目标数据维度数合并起来作为新增条件加入每个子查询语句的过滤条件中，以使合并后的目标数据维度能够覆盖第一周期和第二周期对应目标数据的维度。During the specific implementation, in order to prevent the problem that the spark side cannot perform corresponding calculations based on the target data obtained in the first cycle and the target data obtained in the second cycle based on the target data acquired in the first cycle and the second cycle due to the inconsistency of the corresponding target data dimensions in the first cycle and the second cycle, For example, the number of target data dimensions corresponding to the first period is 100, and the number of target data dimensions corresponding to the second period is 150. At this time, the target data dimensions corresponding to the first period and the second period need to be combined as new conditions and added to each In the filter conditions of subquery statements, the merged target data dimension can cover the dimensions corresponding to the target data in the first period and the second period.

上述实施例中，计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值的方法包括：In the above embodiment, the method for the calculation engine to concurrently read the target data corresponding to each granularity period and the dimension values therein includes:

计算引擎根据每个子查询语句及其中的过滤条件，分别从相应数据源的对应存储片段中查询符合目标数据查询数量条件的目标数据并提取其中的维度值。According to each sub-query statement and the filter conditions therein, the computing engine queries the target data that meets the target data query quantity condition from the corresponding storage segment of the corresponding data source and extracts the dimension values therein.

上述实施例中，经处理端计算后向用户返回查询结果的方法包括：In the above-mentioned embodiment, the method for returning the query result to the user after being calculated by the processing terminal includes:

处理端根据各子查询语句返回的维度值，按照预设规则计算后向用户返回查询结果。例如，对于预设规则为增长率的计算，可将第一周期获取的维度值与第二周期获取的维度值相除，得到增长率结果。The processing end returns the query result to the user after calculation according to the dimension value returned by each sub-query statement according to the preset rules. For example, for the calculation that the preset rule is the growth rate, the dimension value obtained in the first cycle can be divided by the dimension value obtained in the second cycle to obtain the growth rate result.

上述实施例中，在步骤将包含所述过滤条件的子查询语句下推至计算引擎并发读取每个粒度周期所对应的目标数据及其中的维度值，经处理端计算后向用户返回查询结果之后还包括：In the above-mentioned embodiment, in the step, the sub-query statement containing the filter condition is pushed down to the calculation engine and concurrently reads the target data corresponding to each granularity period and the dimension value therein, and returns the query result to the user after calculation by the processing terminal Then also include:

具体实施时，在执行SQL查询语句的过程中不断统计，将使用最频繁的数据源及它的维度字段，在查询低谷期时动态构建druid语句，并对该维度及维值进行预提取，类似像物化视图那样预存储起来，后面发现如果有相同数据源及维度维值查询需求，可以从该物化视图中快速提取，而不需要再发请求到druid来取相应的维度维值集合数据，以此实现快速调用。During the specific implementation, statistics are kept during the execution of SQL query statements, the most frequent data source and its dimension fields will be used, and the druid statement will be dynamically constructed during the low query period, and the dimension and dimension value will be pre-extracted, similar to It is pre-stored like a materialized view, and later found that if there is the same data source and dimension value query requirements, it can be quickly extracted from the materialized view without sending a request to druid to fetch the corresponding dimension value set data, so as to This implements fast calls.

具体实施时，当本次SQL查询语句查询结果出来后，可以将每个子查询语句及对应的目标数据和其中的维度值在数据库中进行缓存，存储在Hbase或其它缓存库中，之后再有相同查询时可以直接根据缓存结果修改执行计划，而不需要重新执行，具体来讲，若当前SQL查询语句中包括了历史周期的查询结果，可将当前SQL查询语句查询周期进行拆分，对其中包括的历史周期查询结果直接从缓存库索取，对其中不包括的历史周期的当前期从druid查询获取。如：已缓存了1号到20号的维度数据，如果当前查询是1号到30号的维度，则只需要下压21号到30号之间的粒度周期去查询druid，大大减小了druid的压力。In specific implementation, when the query result of this SQL query statement comes out, each subquery statement and the corresponding target data and the dimension values in it can be cached in the database, stored in Hbase or other cache libraries, and then the same When querying, you can directly modify the execution plan based on the cached results without re-executing. Specifically, if the current SQL query statement includes the query results of the historical cycle, you can split the query cycle of the current SQL query statement, including The historical cycle query results are directly obtained from the cache library, and the current period of the historical cycle not included is obtained from the druid query. For example, dimension data from 1st to 20th has been cached. If the current query is from 1st to 30th dimension, you only need to press down on the granularity period between 21st and 30th to query druid, which greatly reduces druid pressure.

在得到当前期和历史期合并后的目标数据后，就可以开始整理SQL查询语句对应的子查询语句：After obtaining the combined target data of the current period and the historical period, you can start to organize the subquery statements corresponding to the SQL query statement:

1、给子查询语句的group by维度字段进行排序；1. Sort the group by dimension field of the subquery statement;

2、给子查询语句需要计算的衍生指标进行排序；2. Sort the derived indicators that need to be calculated by the subquery statement;

3、将多个子查询语句对应的目标数据维度汇总后作为新增条件加入子查询语句的过滤条件；3. Summarize the target data dimensions corresponding to multiple subquery statements and add them as new conditions to the filter conditions of the subquery statements;

整理好每个子查询语句后，将它转化成druid计算引擎的json语句，因为加上了过滤条件，此次子查询语句要查询的维度信息大于过滤条件。也就是说，druid无须扫描查询周期内的所有数据，也就不会发生熔断及查询超时异常问题，减小了druid的查询压力。After sorting out each subquery statement, convert it into a json statement of the druid computing engine. Because the filter condition is added, the dimension information to be queried by the subquery statement this time is greater than the filter condition. That is to say, Druid does not need to scan all the data in the query cycle, and there will be no fusing and query timeout exceptions, which reduces the query pressure of Druid.

又由于druid返回的数据量是过滤后的，所以spark端也无须接收查询周期内的所有数据，减少了spark端的数据处理量，整体效率将大幅提升。And because the amount of data returned by druid is filtered, the spark side does not need to receive all the data in the query cycle, which reduces the amount of data processing on the spark side, and the overall efficiency will be greatly improved.

实施例二Embodiment two

本实施例提供一种SQL优化查询装置，包括：This embodiment provides a SQL optimization query device, including:

与现有技术相比，本发明实施例提供的SQL优化查询装置的有益效果与上述实施例一提供的SQL优化查询方法的有益效果相同，在此不做赘述。Compared with the prior art, the beneficial effect of the SQL optimized query device provided in the embodiment of the present invention is the same as that of the SQL optimized query method provided in the first embodiment above, and will not be repeated here.

实施例三Embodiment Three

本实施例提供一种计算机可读存储介质，计算机可读存储介质上存储有计算机程序，计算机程序被处理器运行时执行上述SQL优化查询方法的步骤。This embodiment provides a computer-readable storage medium. A computer program is stored on the computer-readable storage medium. When the computer program is run by a processor, the steps of the above SQL optimization query method are executed.

与现有技术相比，本实施例提供的计算机可读存储介质的有益效果与上述技术方案提供的SQL优化查询方法的有益效果相同，在此不做赘述。Compared with the prior art, the beneficial effect of the computer-readable storage medium provided by this embodiment is the same as the beneficial effect of the SQL optimized query method provided by the above technical solution, which will not be repeated here.

本领域普通技术人员可以理解，实现上述发明方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成，上述程序可以存储于计算机可读取存储介质中，该程序在执行时，包括上述实施例方法的各步骤，而的存储介质可以是：ROM/RAM、磁碟、光盘、存储卡等。Those of ordinary skill in the art can understand that all or part of the steps in the above-mentioned inventive method can be completed by instructing related hardware through a program. The above-mentioned program can be stored in a computer-readable storage medium. When the program is executed, it includes: For each step of the method in the above embodiments, the storage medium may be: ROM/RAM, magnetic disk, optical disk, memory card, and the like.

以上，仅为本发明的具体实施方式，但本发明的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本发明揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本发明的保护范围之内。因此，本发明的保护范围应以所述权利要求的保护范围为准。The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Anyone familiar with the technical field can easily think of changes or replacements within the technical scope disclosed in the present invention, and should cover all Within the protection scope of the present invention. Therefore, the protection scope of the present invention should be determined by the protection scope of the claims.

Claims

1. An SQL optimization query method is characterized by comprising the following steps:

when the SQL query statement is analyzed to be a UNION set type query statement, counting the number of storage segments related to target data, wherein the UNION set type query statement comprises a plurality of sub query statements;

when the number of the related storage segments is larger than a threshold value, splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles, summarizing target data dimensions corresponding to the sub-query statements, and adding the summarized target data dimensions as newly added conditions into the filtering conditions of the sub-query statement;

pushing down the sub-query statements containing the filtering conditions to a computing engine, reading target data corresponding to each granularity period and a dimension value in the target data, and returning a query result to a user after the target data are computed by a processing end;

the method for adding the collected target data dimensions corresponding to the sub-query statements serving as the newly added conditions into the filter conditions of the sub-query statements comprises the following steps:

identifying a target data dimension corresponding to each sub-query statement, and when the target data dimensions of the sub-query statements are consistent, adding the target data dimension of any sub-query statement as a newly added condition into the filtering condition of each sub-query statement;

and when the target data dimensions of the sub-query statements are not consistent, summarizing and combining the target data dimensions of the sub-query statements respectively, and adding the combined target data dimensions as newly added conditions into the filtering conditions of each sub-query statement.

2. The method of claim 1, wherein splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles when the number of the storage segments is greater than a threshold value comprises:

extracting a query cycle corresponding to each sub-query statement, and averagely dividing each query cycle into a plurality of granularity cycles according to preset granularity.

3. The method of claim 1, wherein the query statements of UNION set type include at least a first periodic sub-query statement and a second periodic sub-query statement, and wherein the filter conditions include, in addition to the new conditions, query metrics, dimensional ordering, and target data query quantity.

4. The method of claim 3, wherein the method for the compute engine to concurrently read the target data and the dimension value thereof corresponding to each granularity period comprises:

and respectively querying the target data meeting the target data query quantity condition from the corresponding storage segment of the corresponding data source and extracting the dimension value of the target data according to each sub-query statement and the filter condition in the sub-query statement by the computing engine.

5. The method of claim 4, wherein the method for returning the query result to the user after the processing end computing comprises:

and the processing terminal returns a query result to the user after calculating according to a preset rule according to the dimension value returned by each sub-query statement.

6. The method according to claim 1, wherein the pushing down the sub-query statement containing the filter condition to the compute engine and reading the target data and the dimension value in the target data corresponding to each granularity period in step, and after the query result is returned to the user after the computation by the processing end, the method further comprises:

and counting hot data sources in the historical SQL query statement and hot data dimensions in the historical SQL query statement, and pre-extracting dimensional values corresponding to the hot data dimensions for quick calling of subsequent SQL query statements.

7. The method according to claim 1, wherein the pushing down the sub-query statement containing the filter condition to the compute engine and reading the target data and the dimension value in the target data corresponding to each granularity period in step, and after the query result is returned to the user after the computation by the processing end, the method further comprises:

and caching each sub-query statement, the corresponding target data and the dimension value in the sub-query statement in the database so as to be quickly called from the database when the same sub-query statement appears in the follow-up process.

8. An apparatus for implementing the SQL optimized query method of claim 1, comprising:

the statement analyzing unit is used for counting the number of storage segments related to target data when an SQL query statement is analyzed to be a UNION set type query statement, and the UNION set type query statement comprises a plurality of sub query statements;

the identification processing unit is used for splitting the query cycle corresponding to the sub-query statement into a plurality of granularity cycles when the number of the related storage segments is larger than a threshold value, summarizing target data dimensions corresponding to the sub-query statements and adding the summarized target data dimensions as new conditions into the filter conditions of the sub-query statements;

and the query unit is used for pushing down the sub-query statements containing the filter conditions to a calculation engine, reading the target data corresponding to each granularity period and the dimension value in the target data, calculating by the processing end and returning a query result to the user.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of the claims 1 to 7.