CN106250565A

CN106250565A - Query method and system based on a sharded relational database

Info

Publication number: CN106250565A
Application number: CN201610771058.5A
Authority: CN
Inventors: 刘德建; 邱宗铭; 陈霖; 吴拥民; 陈宏展
Original assignee: Fujian TQ Digital Co Ltd
Current assignee: Fujian TQ Digital Co Ltd
Priority date: 2016-08-30
Filing date: 2016-08-30
Publication date: 2016-12-21
Anticipated expiration: 2036-08-30
Also published as: CN106250565B

Abstract

The present invention provides a query method and system based on a sharded relational database. It addresses a problem in the prior art: a large amount of data is read into the memory of the central node, and sequential execution at that intermediate node performs so poorly that online query requirements cannot be met when the data volume is large. The method comprises the step of receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G, where column name G is a subset of column name A. The method avoids excessive memory consumption, and even memory overflow, at the central node.

Description

Query method and system based on a sharded relational database

Technical field

The present invention relates to queries over distributed data, and in particular to queries that combine count, distinct and group by.

Background art

In the era of big data, data statistics is a common business task. Statistical workloads today are usually built on relational databases such as mysql. Because such a database is limited by a single machine, statistical performance drops sharply once the data volume passes the ten-million level. Typically, the relational database is sharded horizontally so that the data are distributed across multiple machines, which removes the single-machine bottleneck.

After sharding, a statistical SQL statement is executed in two stages: statistics are computed once at each shard node, and the partial results are then aggregated once at the central node, so that the semantics of the SQL are preserved. SQL that combines grouping, deduplication and counting runs into both functional and performance problems. Such SQL looks like this: select group_col, count(distinct dist_col) from table group by group_col. Here group_col and dist_col may each be one or more field names.

Commonly, if such SQL is issued directly to every shard and the central node aggregates the results returned by the shards, the result is wrong. The same (group_col, dist_col) data may appear in more than one shard, so executing per shard and then aggregating counts that data repeatedly and inflates the result.
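The overcounting described above can be reproduced with a small sketch. The rows, the date value and the function names below are hypothetical and chosen only for illustration; the per-shard query is emulated in memory rather than sent to a real database.

```python
# Hypothetical (group_col, dist_col) rows held by two shards.
# The pair ("2016-08-01", 1) is present in both shards.
shard_1 = [("2016-08-01", 1), ("2016-08-01", 2)]
shard_2 = [("2016-08-01", 1), ("2016-08-01", 3)]


def count_distinct_per_group(rows):
    """Emulate: select group_col, count(distinct dist_col) ... group by group_col."""
    seen = {}
    for group_value, dist_value in rows:
        seen.setdefault(group_value, set()).add(dist_value)
    return {group_value: len(values) for group_value, values in seen.items()}


def naive_sum_of_shard_counts(shards):
    """The flawed approach: run the query on each shard, then add the counts."""
    totals = {}
    for rows in shards:
        for group_value, count in count_distinct_per_group(rows).items():
            totals[group_value] = totals.get(group_value, 0) + count
    return totals


naive = naive_sum_of_shard_counts([shard_1, shard_2])
correct = count_distinct_per_group(shard_1 + shard_2)
print(naive)    # {'2016-08-01': 4}  (value 1 is counted twice)
print(correct)  # {'2016-08-01': 3}
```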

Alternatively, to guarantee a correct result, such SQL can be executed as follows.

1. Read the field values of group_col and dist_col from each shard node. The concrete SQL is: select group_col, dist_col from table;

2. Traverse all the data sequentially: first determine the group each record belongs to according to group_col, then deduplicate within that group according to dist_col.

3. After all the data have been traversed, calculate the number of deduplicated dist_col values in each group.

The problem with this method is that a large amount of data is read into the memory of the central node, and sequential execution at the intermediate node performs so poorly that online query requirements cannot be met when the data volume is large.

Summary of the invention

A simplified summary of one or more aspects is given below to provide a basic understanding of those aspects. This summary is not an extensive overview of all contemplated aspects, and it is intended neither to identify the key or critical elements of all aspects nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in simplified form as a prelude to the more detailed description given later.

The present invention provides a query method and system based on a sharded relational database to solve the prior-art problem that a large amount of data is read into the memory of the central node and sequential execution at the intermediate node performs so poorly that online query requirements cannot be met when the data volume is large.

To achieve the above object, the inventors provide a query method based on a sharded relational database, comprising the steps of:

S101, receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G; column name G is a subset of column name A;

S102, executing SELECT column name C FROM table name T at each shard node, where column name C is the union of column name A and column name B; processing every record in the query result of each shard node, that is, taking the hash value of the b value of each record and putting records with the same hash value into the same data pipe, where the b value is the value in the record corresponding to column name B;

S103, for the records in the same data pipe, grouping them by the value of column G and, within each group, calculating the number count of different b values that occur in that group; the result for each group is one correspondence (g, count);

S104, merging the correspondences (G, COUNT) obtained by the grouped calculation in each pipe, that is, merging the records with the same g value according to column name G, the COUNT value of a merged record being the sum of the COUNT values of all the correspondences merged into it; the merged result is the query result.
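Steps S102 to S104 can be sketched in memory as follows. This is only an illustration under stated assumptions: the function name `sharded_count_distinct`, the accessor parameters `g_of` and `b_of`, and the use of Python's built-in `hash` are not from the patent, and each shard is represented by a plain list of records instead of a database connection.

```python
from collections import defaultdict


def sharded_count_distinct(shards, num_pipes, g_of, b_of):
    """Sketch of S102-S104 over in-memory shards (hypothetical helper).

    shards    -- one list of records per shard (result of SELECT C FROM T)
    num_pipes -- number of data pipes
    g_of      -- function returning the G value of a record
    b_of      -- function returning the b value of a record
    """
    # S102: route every record to the pipe selected by the hash of its b value.
    pipes = [[] for _ in range(num_pipes)]
    for shard_rows in shards:
        for record in shard_rows:
            pipes[hash(b_of(record)) % num_pipes].append(record)

    # S103: inside each pipe, group by g and count the distinct b values.
    partials = []
    for pipe in pipes:
        distinct = defaultdict(set)
        for record in pipe:
            distinct[g_of(record)].add(b_of(record))
        partials.append({g: len(bs) for g, bs in distinct.items()})

    # S104: merge the (g, count) correspondences by adding counts per g.
    result = defaultdict(int)
    for partial in partials:
        for g, count in partial.items():
            result[g] += count
    return dict(result)


# Hypothetical records of the form (g, b) spread over two shards.
example_shards = [
    [("d1", 1), ("d1", 2), ("d2", 1)],
    [("d1", 1), ("d1", 3), ("d2", 4)],
]
example_result = sharded_count_distinct(
    example_shards, num_pipes=3, g_of=lambda r: r[0], b_of=lambda r: r[1]
)
print(example_result)
```

Because equal b values always hash to the same pipe, no b value can be counted in two different pipes for the same g, which is why adding the per-pipe counts in S104 does not double count.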

Further, column name G is the same as column name A.

Further, the number N of data pipes is twice the number of CPU cores of the machine. The step "take the hash value of the b value of each record and put records with the same hash value into the same data pipe" comprises: computing the hash value of the b value of each record, taking this hash value modulo N, and assigning the record to the corresponding data pipe according to the result, each of the values 0 to N-1 corresponding to exactly one data pipe.
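A minimal sketch of this pipe assignment is shown below. The helper names `pipe_count` and `pipe_index` are hypothetical, and the choice of CRC32 over the string form of the b value is an assumption made so that the hash is stable across runs; the patent does not name a hash function. Note also that `os.cpu_count()` reports logical CPUs, which may differ from the physical core count.

```python
import os
import zlib


def pipe_count():
    """N = twice the CPU count reported by the operating system."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return 2 * cores


def pipe_index(b_value, n):
    """Map a b value to one of the pipe numbers 0 .. n-1.

    Assumption: CRC32 of the UTF-8 string form is used as the hash.
    """
    digest = zlib.crc32(str(b_value).encode("utf-8"))
    return digest % n


n = pipe_count()
for value in ("alice", "bob", 42):
    print(value, "->", pipe_index(value, n))
```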

Further, before the step "receive a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G", the method further comprises the steps of:

reading a database query statement;

judging whether the query statement conforms to the semantics;

if it does not conform, returning;

if it conforms, performing the above step of receiving the query statement.
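One way to picture the semantic check is a pattern match against the expected statement shape. The sketch below is an assumption, not the patent's mechanism: the regular expression, the function name `parse_query` and the returned dictionary layout are all hypothetical, and a production system would more likely use a real SQL parser.

```python
import re

# Hypothetical pattern for:
#   SELECT <A>, COUNT(DISTINCT <B>) FROM <T> GROUP BY <G>
QUERY_PATTERN = re.compile(
    r"^\s*select\s+(?P<a>.+?)\s*,\s*count\s*\(\s*distinct\s+(?P<b>.+?)\s*\)"
    r"\s*from\s+(?P<t>\w+)\s+group\s+by\s+(?P<g>.+?)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def split_columns(text):
    """Split a comma-separated column list into trimmed names."""
    return [name.strip() for name in text.split(",") if name.strip()]


def parse_query(sql):
    """Return the parsed parts if the statement fits the semantics, else None."""
    match = QUERY_PATTERN.match(sql)
    if match is None:
        return None  # does not conform: the caller simply returns
    a = split_columns(match.group("a"))
    b = split_columns(match.group("b"))
    g = split_columns(match.group("g"))
    if not set(g) <= set(a):
        return None  # column name G must be a subset of column name A
    return {"a": a, "b": b, "table": match.group("t"), "g": g}


parsed = parse_query(
    "select date, count (distinct user_id) from table group by date"
)
print(parsed)
```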

Further, steps S101 and S102 are executed in batches according to the size of table T in the shard node, and step S103 is executed at the shard node.

A distributed database system is also provided, comprising at least a first shard node and a second shard node, a receiving module, a first processing module, a second processing module and a third processing module; the first shard node and the second shard node each store horizontally sharded data of the database;

The receiving module is configured to receive a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G; column name G is a subset of column name A. The first processing module is configured to send the query statement SELECT column name C FROM table name T to the first shard node and the second shard node respectively, where column name C is the union of column name A and column name B, and to process every record in the query result of each shard node, that is, to take the hash value of the b value of each record and put records with the same hash value into the same data pipe, where the b value is the value in the record corresponding to column name B;

The second processing module is configured to group the records in the same data pipe by the value of column G and, within each group, calculate the number count of different b values that occur in that group; the result for each group is one correspondence (g, count);

The third processing module is configured to merge the correspondences (G, COUNT) obtained by the grouped calculation in each pipe, that is, to merge the records with the same g value according to column name G, the COUNT value of a merged record being the sum of the COUNT values of all the correspondences merged into it; the merged result is the query result.

Further, column name G is the same as column name A.

Further, the number N of data pipes is twice the number of CPU cores of the machine, and the first processing module is configured to compute the hash value of the b value of each record, take this hash value modulo N, and assign the record to the corresponding data pipe according to the result, each of the values 0 to N-1 corresponding to exactly one data pipe.

Further, before "receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G", the receiving module is also configured to read a database query statement and judge whether the query statement conforms to the semantics; if it does not conform, it returns; if it conforms, execution continues.

Further, the first processing module and the second processing module process the data of table T in batches according to the size of table T in the shard node; the third processing module is located at the shard node.

Unlike the prior art, the above technical solution reads data in a streaming (small-batch) manner, pulling batches from each shard into memory on demand. This avoids a large amount of data flooding into the central node at once and causing excessive memory consumption or even memory overflow.

To the accomplishment of the foregoing and related ends, the one or more aspects comprise the features fully described hereinafter and particularly pointed out in the claims. The following description and the drawings set forth certain illustrative features of the one or more aspects. These features indicate only a few of the various ways in which the principles of the various aspects may be employed, and this description is intended to include all such aspects and their equivalents.

Brief description of the drawings

The disclosed aspects are described below with reference to the accompanying drawings, which are provided to illustrate and not to limit the disclosed aspects, in which like labels denote like elements, and in which:

Fig. 1 illustrates, with concrete data, the query method based on a sharded relational database described herein;

Fig. 2 shows the query method based on a sharded relational database described herein.

The labels in Fig. 1 refer to the tables they point to.

Detailed description of the invention

To explain the technical content, structural features, objects and effects of the technical solution in detail, a detailed description is given below with reference to specific embodiments and the accompanying drawings. In the following description, numerous details are set forth for purposes of explanation in order to provide a thorough understanding of one or more aspects. It will be evident, however, that such aspects may be practiced without these details.

The present invention discloses a query method based on a sharded relational database, where the sharded relational database is a horizontally sharded distributed database. The method includes the following steps:

Steps S101 to S103 may be executed in the shard nodes. Alternatively, the data of the shard nodes, or the data produced by the corresponding steps, may be read as a data stream into a third node (for example, the central node) and the steps executed there. The central node is a node with strong data-processing capability; it may also be a master node, with the other shard nodes acting as slave nodes.

Preferably, steps S102 and S103 are executed in the shard nodes. This reduces the load on the central node to some extent and, because the computation is distributed, improves processing efficiency and speed.

To make step S104 easier to understand, an example follows. In some embodiments there are 4 data pipes. In each data pipe, the records are grouped by the value of column G, and the number of different b values occurring in each group is calculated. Take the group whose G value is 2: the data pipes compute the correspondences (2, 3), (2, 4), (2, 1) and (2, 2) respectively, and merging the correspondences whose g value is 2 gives (2, 3+4+1+2), that is, (2, 10). Merging all the correspondences of every data pipe in this way yields the query result for the query statement entered by the user; this query result has the form (G, COUNT).
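The arithmetic of this merge can be checked with a few lines of code. The function name `merge_correspondences` is hypothetical; the input pairs are the four correspondences from the example above.

```python
from collections import Counter


def merge_correspondences(pairs):
    """Add up the count of every (g, count) pair that shares the same g."""
    merged = Counter()
    for g, count in pairs:
        merged[g] += count
    return dict(merged)


# One (g, count) pair from each of the 4 data pipes, for the group g = 2.
pipe_results = [(2, 3), (2, 4), (2, 1), (2, 2)]
merged_result = merge_correspondences(pipe_results)
print(merged_result)  # {2: 10}
```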

It will be understood that, in the above steps, the data in a shard node can be read in small batches, SELECT column name C FROM table name T can be executed on the data as they are read, and the records in the query result of each small batch can be processed one by one (in the shard node or at the central node). There is no need to read or process a large amount of data at once, which keeps any single node from being overloaded in computation or memory.

Unlike the prior art, the above steps read data in a streaming (small-batch) manner, pulling batches from each shard into memory on demand. This avoids a large amount of data flooding into the central node at once and causing excessive memory consumption or even memory overflow.
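A small-batch read of this kind might look like the sketch below. It is illustrative only: an in-memory SQLite database stands in for a shard (the background section mentions mysql), and the table name `t`, the sample rows, the batch size and the function name `stream_rows` are assumptions.

```python
import sqlite3


def stream_rows(connection, sql, batch_size):
    """Yield rows one by one while fetching at most batch_size rows at a time."""
    cursor = connection.execute(sql)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


# Stand-in for one shard: a throwaway in-memory SQLite database.
shard = sqlite3.connect(":memory:")
shard.execute("CREATE TABLE t (id INTEGER, user_id INTEGER, date TEXT)")
shard.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, i % 3, "2016-08-30") for i in range(5)],
)

streamed = list(stream_rows(shard, "SELECT user_id, date FROM t", batch_size=2))
print(len(streamed))  # 5
```

Because `stream_rows` is a generator, the consumer decides how fast rows are pulled, so only one batch per shard needs to be held in memory at a time.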

It will be understood that the B field in "DISTINCT column name B" is not necessarily numeric; it may be a string or a time, or even a combination of multiple fields.

For numeric, string and time types, taking the hash value yields a numeric result. For the multi-field case, the hash values of the individual fields can be added together to give the final hash result. In short, the purpose of hashing is to compute a single numeric result for cases where the B field is non-numeric or is a combination of multiple fields (it must be guaranteed that two identical B field values produce the same hash result).
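The multi-field case can be sketched as follows. The per-field hash used here (CRC32 of the value's `repr`) is an assumption chosen for determinism, and the names `field_hash` and `b_hash` are hypothetical; the patent only requires that identical B values give identical results.

```python
import zlib
from datetime import date


def field_hash(value):
    """Numeric hash of one field (assumption: CRC32 of the value's repr)."""
    return zlib.crc32(repr(value).encode("utf-8"))


def b_hash(b_fields):
    """Hash of a B value made of one or more fields: the per-field hashes are added."""
    return sum(field_hash(value) for value in b_fields)


single = b_hash((1001,))
combined = b_hash(("alice", date(2016, 8, 30)))
print(single, combined)
```

Adding the per-field hashes is insensitive to field order, so ("a", "b") and ("b", "a") land in the same pipe. That affects only how evenly records are spread, not correctness, since deduplication inside a pipe still compares the actual b values.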

The query method based on a sharded relational database described herein is illustrated below with concrete data (see Fig. 1 and Fig. 2):

Step1. The original data are stored in two shards, 111 and 121. Each shard holds 3 rows of records, and each row has 3 columns: id, user_id and date.

Step2. The SQL to be executed is: select date, count(distinct user_id) from table group by date.

Step3. According to the number of shards, the data query module uses 2 threads to read data from the two shards respectively. The SQL used to read the data is: select user_id, date from table, and the read results (that is, the query results of step S102) are 112 and 122 respectively. Step2 and Step3 correspond to the data query in Fig. 2;

Step4. During data distribution, the distinct_col field of each record (the user_id field in this example) is traversed. Taking the hash value of this field modulo 3 (there are 3 data pipes, numbered 0, 1 and 2) yields one of the three values 0, 1 or 2. The data pipe with the corresponding number is selected and the record is placed in it. In this example the hash algorithm simply takes the value of user_id, so the two records with user_id 1 and 4 are assigned to the data pipe with index=1, since 1 % 3 = 1 and 4 % 3 = 1. In the figure, the records of 113 and 123 are assigned, according to the modulo result, to the different pipes 214, 224 and 234. Step4 corresponds to the data distribution in Fig. 2;

Step5. Three threads are responsible for data computation here (matching the number of data pipes), and each computing module corresponds to one data pipe. During computation, the data are grouped by the group_col field (the date field here): in Fig. 1, 214 is grouped by the date field to obtain 216, 224 to obtain 226, and 234 to obtain 236 (215, 225 and 235 denote the intermediate results of this process). The data are then deduplicated by the distinct_col field (user_id here).

Step6. After a computing module has consumed all the data in the data pipe it is responsible for, it obtains a mapping from date to a linked list of user_id values (after deduplication). Counting each user_id linked list yields a mapping from date to the number of user_id values (after deduplication); in Fig. 1, merging the tables labelled 216, 226 and 236 gives the table labelled 300. Step5 and Step6 correspond to the data computation in Fig. 2;

Step7. Result merging is done by the main thread. After the results of all the data computation processes have been collected, the values of count(distinct user_id) are added for records with the same date to obtain the final result. Step7 corresponds to the result merging in Fig. 2.
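The walkthrough above can be replayed end to end with a short multithreaded sketch. The row values are hypothetical, since the contents of Fig. 1 are not reproduced in the text; only the shape (two shards of 3 rows with columns id, user_id, date, and 3 data pipes keyed by user_id % 3) follows the description. Thread pools from the standard library stand in for the reader and computing threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows (id, user_id, date); Fig. 1's real values are not in the text.
shard_111 = [(1, 1, "2016-08-01"), (2, 2, "2016-08-01"), (3, 4, "2016-08-02")]
shard_121 = [(4, 1, "2016-08-01"), (5, 3, "2016-08-02"), (6, 4, "2016-08-02")]
NUM_PIPES = 3


def read_shard(rows):
    """Step3: emulate 'select user_id, date from table' on one shard."""
    return [(user_id, day) for _, user_id, day in rows]


def distribute(read_results):
    """Step4: the hash is the user_id itself; the pipe number is user_id % 3."""
    pipes = [[] for _ in range(NUM_PIPES)]
    for rows in read_results:
        for user_id, day in rows:
            pipes[user_id % NUM_PIPES].append((user_id, day))
    return pipes


def compute(pipe):
    """Step5 and Step6: group by date, deduplicate user_id, then count."""
    users_by_day = {}
    for user_id, day in pipe:
        users_by_day.setdefault(day, set()).add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}


def merge(partials):
    """Step7: add the counts of records that share the same date."""
    final = {}
    for partial in partials:
        for day, count in partial.items():
            final[day] = final.get(day, 0) + count
    return final


with ThreadPoolExecutor(max_workers=2) as readers:        # one reader per shard
    read_results = list(readers.map(read_shard, [shard_111, shard_121]))
pipes = distribute(read_results)
with ThreadPoolExecutor(max_workers=NUM_PIPES) as workers:  # one worker per pipe
    partials = list(workers.map(compute, pipes))
final_result = merge(partials)
print(final_result)
```

In this sketch distribution finishes before computation starts, which keeps the example short; a streaming variant would let the computing threads consume from queues while the readers are still producing.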

It will be understood that Fig. 2 describes the invention in simplified form; the data query, data distribution and data computation may each be executed by multiple threads.

A distributed database system is also provided for implementing the above query method based on a sharded relational database. It comprises at least a first shard node and a second shard node, a receiving module, a first processing module, a second processing module and a third processing module; the first shard node and the second shard node each store horizontally sharded data of the database. The receiving module, the first processing module and the second processing module may be located in a shard node or in the central node.

The receiving module is connected to the first processing module, the first processing module is connected to the second processing module, and the second processing module is connected to the third processing module.

The receiving module is configured to receive a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G; column name G is a subset of column name A. The first processing module is configured to send the query statement SELECT column name C FROM table name T to the first shard node and the second shard node respectively, where column name C is the union of column name A and column name B, and to process every record in the query result of each shard node, that is, to take the hash value of the b value of each record and put records with the same hash value into the same data pipe, where the b value is the value in the record corresponding to column name B;

In some embodiments, column name G is the same as column name A.

In some embodiments, the number N of data pipes is twice the number of CPU cores of the machine, and the first processing module is configured to compute the hash value of the b value of each record, take this hash value modulo N, and assign the record to the corresponding data pipe according to the result, each of the values 0 to N-1 corresponding to exactly one data pipe.

In some embodiments, before "receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G", the receiving module is also configured to read a database query statement and judge whether the query statement conforms to the semantics; if it does not conform, it returns; if it conforms, execution continues.

In some embodiments, the first processing module and the second processing module process the data of table T in batches according to the size of table T in the shard node; the third processing module is located at the central node.

It should be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "includes", "comprises" and any variants thereof are intended to be non-exclusive, so that a process, method, article or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or terminal device. Without further limitation, an element introduced by "including ..." or "comprising ..." does not exclude the presence of additional elements in the process, method, article or terminal device that includes it. In addition, herein "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it.

Those skilled in the art should understand that the above embodiments may be provided as a method, an apparatus or a computer program product. These embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. All or part of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware; the program may be stored in a storage medium readable by a computer device and is used to execute all or part of the steps of the methods of the above embodiments. The computer device includes but is not limited to: a personal computer, server, general-purpose computer, special-purpose computer, network device, embedded device, programmable device, intelligent mobile terminal, smart home device, wearable smart device, in-vehicle smart device and the like. The storage medium includes but is not limited to: RAM, ROM, magnetic disk, magnetic tape, optical disc, flash memory, USB flash drive, portable hard drive, memory card, memory stick, network server storage, network cloud storage and the like.

The above embodiments are described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a computer device to produce a machine, so that the instructions executed by the processor of the computer device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a memory readable by a computer device that can direct the computer device to operate in a particular manner, so that the instructions stored in that memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer device, so that a series of operational steps are performed on the computer device to produce a computer-implemented process, whereby the instructions executed on the computer device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although the above embodiments have been described, those skilled in the art, once they have grasped the basic inventive concept, can make other changes and modifications to these embodiments. The foregoing is therefore only an embodiment of the present invention and does not limit the scope of patent protection of the invention. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, likewise falls within the scope of patent protection of the present invention.

Claims

1. A query method based on a sharded relational database, characterized by comprising the steps of:

S101, receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G; column name G is a subset of column name A;

S102, executing SELECT column name C FROM table name T at each shard node, where column name C is the union of column name A and column name B; processing every record in the query result of each shard node, that is, taking the hash value of the b value of each record and putting records with the same hash value into the same data pipe, where the b value is the value in the record corresponding to column name B;

S103, for the records in the same data pipe, grouping them by the value of column G and, within each group, calculating the number count of different b values that occur in that group; the result for each group is one correspondence (g, count);

S104, merging the correspondences (G, COUNT) obtained by the grouped calculation in each pipe, that is, merging the records with the same g value according to column name G, the COUNT value of a merged record being the sum of the COUNT values of all the correspondences merged into it; the merged result is the query result.

2. The query method based on a sharded relational database according to claim 1, characterized in that column name G is the same as column name A.

3. The query method based on a sharded relational database according to claim 1, characterized in that the number N of data pipes is twice the number of CPU cores of the machine, and the step "take the hash value of the b value of each record and put records with the same hash value into the same data pipe" comprises: computing the hash value of the b value of each record, taking this hash value modulo N, and assigning the record to the corresponding data pipe according to the result, each of the values 0 to N-1 corresponding to exactly one data pipe.

4. The query method based on a sharded relational database according to claim 1, characterized in that, before the step "receive a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G", the method further comprises the steps of:

reading a database query statement;

judging whether the query statement conforms to the semantics;

if it does not conform, returning;

5. The query method based on a sharded relational database according to claim 1, characterized in that steps S101 and S102 are executed in batches according to the size of table T in the shard node, and step S103 is executed at the shard node.

6. A distributed database system, comprising at least a first shard node and a second shard node, a receiving module, a first processing module, a second processing module and a third processing module, the first shard node and the second shard node each storing horizontally sharded data of the database, characterized in that:

the receiving module is configured to receive a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G; column name G is a subset of column name A; the first processing module is configured to send the query statement SELECT column name C FROM table name T to the first shard node and the second shard node respectively, and to process every record in the query result of each shard node, that is, to take the hash value of the b value of each record and put records with the same hash value into the same data pipe, where the b value is the value in the record corresponding to column name B;

the second processing module is configured to group the records in the same data pipe by the value of column G and, within each group, calculate the number count of different b values that occur in that group; the result for each group is one correspondence (g, count);

the third processing module is configured to merge the correspondences (G, COUNT) obtained by the grouped calculation in each pipe, that is, to merge the records with the same g value according to column name G, the COUNT value of a merged record being the sum of the COUNT values of all the correspondences merged into it; the merged result is the query result.

7. The distributed database system according to claim 6, characterized in that column name G is the same as column name A.

8. The distributed database system according to claim 6, characterized in that the number N of data pipes is twice the number of CPU cores of the machine, and the first processing module is configured to compute the hash value of the b value of each record, take this hash value modulo N, and assign the record to the corresponding data pipe according to the result, each of the values 0 to N-1 corresponding to exactly one data pipe.

9. The distributed database system according to claim 6, characterized in that, before "receiving a query statement with the following semantics: SELECT column name A, COUNT(DISTINCT column name B) FROM table name T GROUP BY column name G", the receiving module is also configured to read a database query statement and judge whether the query statement conforms to the semantics; if it does not conform, it returns; if it conforms, execution continues.

10. The distributed database system according to claim 6, characterized in that the first processing module and the second processing module process the data of table T in batches according to the size of table T in the shard node; the third processing module is located at the shard node.