Disclosure of Invention
The invention aims to provide a data query method, a data query device and electronic equipment, so as to reduce memory and CPU (central processing unit) usage and to reduce network bandwidth consumption.
In a first aspect, the invention provides a data query method, which is applied to a computing node of a distributed database; the computing node is in communication connection with a plurality of storage nodes; the plurality of storage nodes store data of the first logic table and data of the second logic table; the method comprises the following steps: receiving a data query instruction sent by a user side; the data query instruction carries a first identifier of a first logic table, a second identifier of a second logic table and a query condition; pulling data of the first logic table from the plurality of storage nodes, and establishing a hash table corresponding to the first logic table in a designated memory according to the pulled data; wherein the hash table of the specified memory is used for: providing data to be queried for each storage node; acquiring first data meeting the query condition from data of a second logic table stored by a plurality of storage nodes; acquiring second data meeting the query condition from the hash table; and determining a query result corresponding to the data query instruction based on the first data and the second data, and returning the query result to the user side.
In an optional implementation manner, the step of determining a query result corresponding to the data query instruction based on the first data and the second data includes: determining a plurality of target information meeting the query condition from the first data and the second data; splicing the target information to obtain a splicing result; and determining the splicing result as a query result corresponding to the data query instruction.
In an optional implementation manner, the step of splicing the plurality of target information to obtain a splicing result includes: splicing the plurality of target information by using multiple threads to obtain the splicing result.
In an optional implementation manner, the step of obtaining first data satisfying the query condition from data of the second logical table stored in the plurality of storage nodes includes: for each storage node of the plurality of storage nodes, performing the following operations in parallel: pulling data of the second logic table from the storage node; and querying the first data meeting the query condition from the data of the second logic table.
In an optional implementation manner, after the step of querying the first data satisfying the query condition from the data of the second logical table, the method further includes: and deleting the data of the pulled second logic table.
In an optional implementation manner, after the step of returning the query result to the user side, the method further includes: and releasing the hash table from the specified memory.
In a second aspect, the present invention provides a data query apparatus, which is disposed at a computing node of a distributed database; the computing node is in communication connection with a plurality of storage nodes; the plurality of storage nodes store data of the first logic table and data of the second logic table; the device includes: the instruction receiving module is used for receiving a data query instruction sent by a user side; the data query instruction carries a first identifier of a first logic table, a second identifier of a second logic table and a query condition; the hash table establishing module is used for pulling the data of the first logic table from the plurality of storage nodes and establishing a hash table corresponding to the first logic table in the appointed memory according to the pulled data; wherein the hash table of the specified memory is used for: providing data to be queried for each storage node; the data query module is used for acquiring first data meeting query conditions from data of a second logic table stored by a plurality of storage nodes; acquiring second data meeting the query condition from the hash table; and the result returning module is used for determining a query result corresponding to the data query instruction based on the first data and the second data and returning the query result to the user side.
In an optional implementation manner, the result returning module is further configured to: determine a plurality of target information meeting the query condition from the first data and the second data; splice the plurality of target information to obtain a splicing result; and determine the splicing result as the query result corresponding to the data query instruction.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory, the memory storing machine executable instructions capable of being executed by the processor, the processor executing the machine executable instructions to implement the data query method of any one of the preceding embodiments.
In a fourth aspect, the present invention provides a machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to carry out the data query method of any one of the preceding embodiments.
The embodiment of the invention has the following beneficial effects:
according to the data query method, the data query device and the electronic equipment, a computing node firstly receives a data query instruction sent by a user side, wherein the data query instruction carries a first identifier of a first logic table, a second identifier of a second logic table and query conditions; pulling data of the first logic table from a plurality of storage nodes connected with the computing node, and establishing a hash table corresponding to the first logic table in a designated memory according to the pulled data; acquiring first data meeting the query conditions from data of a second logic table stored by a plurality of storage nodes, and acquiring second data meeting the query conditions from a hash table; and then determining a query result corresponding to the data query instruction based on the first data and the second data, and returning the query result to the user side. The computing node of the method constructs a hash table in the appointed memory which can be accessed by each storage node, and realizes data query through the hash table and the data stored in the plurality of storage nodes.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a distributed database, a computing node is connected with a plurality of storage nodes, and the computing node interacts with each storage node through a hash join algorithm. As shown in fig. 1, the computing node is responsible for parsing SQL (Structured Query Language) and for parallel computing. Fig. 1 includes two storage nodes (or two storage node clusters), Group0 and Group1, and each storage node stores different inner table data and outer table data (in fig. 1, T1 is the inner table data and T2 is the outer table data; T1 and T2 are both logical tables; T1_0, T1_1, T1_2 and T1_3 are the physical sub-tables corresponding to T1, and T2_0, T2_1, T2_2 and T2_3 are the physical sub-tables corresponding to T2).
Specifically, the computing node may pull the inner table data stored in all the storage nodes and establish a full hash table for each storage node; that is, the hash tables corresponding to the storage nodes are identical. The hash table corresponding to each storage node is stored in a region of the computing node's memory that can be accessed only by that storage node. Because fig. 1 includes two storage nodes, two hash tables are constructed in the computing node to facilitate parallel processing of data between the computing node and the storage nodes. However, storing multiple copies of the hash table in the computing node occupies a large amount of memory and CPU; in addition, each piece of data in the multiple copies needs to be pulled from the storage nodes for hash table query, which occupies a large amount of network bandwidth.
Based on the above problems, embodiments of the present invention provide a data query method, apparatus, and electronic device; the technology can be applied to scenes of data access, data query and the like of the distributed database. In order to facilitate understanding of the embodiment, a detailed description is first given of a data query method disclosed in the embodiment of the present invention, where the method is applied to a computing node of a distributed database; the computing node is in communication connection with a plurality of storage nodes; the plurality of storage nodes store data of the first logic table and data of the second logic table; as shown in fig. 2, the method comprises the following specific steps:
step S202, receiving a data query instruction sent by a user side; the data query instruction carries a first identifier of the first logic table, a second identifier of the second logic table and a query condition.
The data query instruction may be sent by a user through a user side, where the user side may be a mobile terminal (e.g., a mobile phone or a tablet computer), a computer, or the like. The data query instruction carries the identifiers of the first logical table and the second logical table to be queried, together with the query condition. The data in the first logical table may be inner table data stored in the storage nodes, and the data in the second logical table may be outer table data stored in the storage nodes. Each storage node may store part of the data of the first logical table and part of the data of the second logical table, and the data stored in each storage node is different; in other words, the complete data of the first logical table and of the second logical table can be obtained by combining the data stored in the plurality of storage nodes.
In a specific implementation, a unique identifier is set for each logical table when the logical table is constructed, so that the corresponding logical table can be found. Usually, a logical table is constructed by an SQL statement, and the inner table and the outer table are defined according to the position of the logical table's identifier in the SQL statement. For example, for the SQL statement: select * from T2 left join T1 on T2.id = T1.id; T2 is located on the left and T1 is located on the right, so T2 is the outer table and T1 is the inner table, where T1 and T2 can also be understood as the identifiers of the inner table and the outer table.
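The position rule described above can be illustrated with a minimal Python sketch. It is not the parser of the embodiment: the function name classify_tables and the regular expression are hypothetical, the sketch only handles a plain two-table left join, and the column list `*` in the sample statement is an assumption.

```python
import re

def classify_tables(sql):
    """Return (outer_table, inner_table) for a simple two-table LEFT JOIN.

    Hypothetical helper: the table left of LEFT JOIN is treated as the
    outer table and the table right of it as the inner table.
    """
    match = re.search(r"\bfrom\s+(\w+)\s+left\s+join\s+(\w+)", sql, re.IGNORECASE)
    if match is None:
        raise ValueError("expected a statement of the form FROM <a> LEFT JOIN <b>")
    return match.group(1), match.group(2)

# Sample statement; the "*" column list is assumed for illustration.
outer, inner = classify_tables("select * from T2 left join T1 on T2.id = T1.id;")
```

With this statement, outer holds the identifier T2 and inner holds the identifier T1.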
Step S204, pulling data of the first logic table from the plurality of storage nodes, and establishing a hash table corresponding to the first logic table in a designated memory according to the pulled data; wherein the hash table of the specified memory is used for: providing each storage node with data to be queried.
After receiving the data query instruction, the computing node pulls the data of the first logical table (equivalent to the inner table data) from each of the plurality of storage nodes connected to it, and then establishes a full hash table corresponding to the first logical table in a pre-allocated specified memory according to the pulled data. The specified memory is a global memory in the computing node. Storing the hash table corresponding to the first logical table in the global memory ensures that each storage node connected to the computing node can access the specified memory and obtain the data to be queried from it.
In the invention, only one copy of the inner table data is stored in the computing node and only one full hash table needs to be established, rather than one full hash table for each storage node, which reduces memory usage.
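As a rough illustration of step S204, the following Python sketch merges the inner table fragments pulled from every storage node into one shared hash table. The function name build_hash_table, the row layout and the sample rows are assumptions made for the sketch; an ordinary in-process dictionary stands in for the specified memory.

```python
from collections import defaultdict

def build_hash_table(inner_shards, key_index):
    """Build ONE hash table from inner-table rows pulled from all storage nodes.

    inner_shards: one list of rows per storage node (hypothetical layout).
    key_index:    position of the join column inside each row.
    """
    table = defaultdict(list)
    for shard in inner_shards:        # one shard per storage node
        for row in shard:
            table[row[key_index]].append(row)
    return dict(table)

# Hypothetical inner-table fragments held by two storage nodes.
shard_group0 = [(0, "xiaoming", "beijing", 20), (1, "xiaohong", "tianjin", 18)]
shard_group1 = [(2, "xiaojing", "xian", 21), (3, "xiaowei", "chengdu", 22)]

# A single table is built regardless of how many storage nodes exist.
shared_table = build_hash_table([shard_group0, shard_group1], key_index=2)
```

Each key maps to a list of rows so that duplicate join keys in the inner table are not lost; whether the embodiment permits duplicate keys is not stated, so this is a design choice of the sketch.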
Step S206, acquiring first data meeting the query condition from the data of the second logic table stored in the plurality of storage nodes; and acquiring second data meeting the query condition from the hash table.
After the hash table is constructed, the computing node acquires the data meeting the query condition in the data query instruction (equivalent to the first data) from the data of the second logical table stored in the plurality of storage nodes; in other words, the computing node acquires the first data meeting the query condition from the outer table data stored in the storage nodes. The computing node then acquires the second data meeting the query condition from the hash table stored in the specified memory.
Step S208, based on the first data and the second data, determining a query result corresponding to the data query instruction, and returning the query result to the user side.
In specific implementation, after the first data and the second data are queried, the first data and the second data need to be integrated and processed according to query conditions to determine a final query result, and the query result is returned to the user side, so that the query of the data is completed.
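The integration of the first data and the second data in steps S206 to S208 might look like the following Python sketch. It assumes that the query condition is an equality join on a single column with left join semantics; the function name probe and the sample rows are hypothetical.

```python
def probe(outer_rows, hash_table, key_index):
    """Pair each outer-table row with its matching inner-table rows.

    Assumes left join semantics: an outer row without a match is kept
    and paired with None.
    """
    joined = []
    for outer in outer_rows:
        matches = hash_table.get(outer[key_index], [])
        if matches:
            joined.extend((outer, inner) for inner in matches)
        else:
            joined.append((outer, None))
    return joined

# Hypothetical hash table (key -> list of inner rows) and outer rows.
hash_table = {"tianjin": [(1, "xiaohong", "tianjin", 18)]}
outer_rows = [(1, "xiaowang", "tianjin", 25), (9, "xiaoli", "shanghai", 30)]

pairs = probe(outer_rows, hash_table, key_index=2)
```

Here the first outer row is paired with the inner row for tianjin, while the second outer row has no match and is paired with None.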
In the data query method provided by the embodiment of the invention, a computing node firstly receives a data query instruction sent by a user side, wherein the data query instruction carries a first identifier of a first logic table, a second identifier of a second logic table and a query condition; pulling data of the first logic table from a plurality of storage nodes connected with the computing node, and establishing a hash table corresponding to the first logic table in a designated memory according to the pulled data; acquiring first data meeting the query conditions from data of a second logic table stored by a plurality of storage nodes, and acquiring second data meeting the query conditions from a hash table; and then determining a query result corresponding to the data query instruction based on the first data and the second data, and returning the query result to the user side. The computing node of the method constructs a hash table in the appointed memory which can be accessed by each storage node, and realizes data query through the hash table and the data stored in the plurality of storage nodes.
An embodiment of the invention also provides another data query method, which is implemented on the basis of the method of the above embodiment. This method focuses on the specific process of acquiring first data meeting the query condition from the data of the second logical table stored in the plurality of storage nodes (implemented by step S306 below), and on the specific process of determining the query result corresponding to the data query instruction based on the first data and the second data (implemented by steps S310 to S312 below). As shown in fig. 3, the method comprises the following specific steps:
step S302, receiving a data query instruction sent by a user side; the data query instruction carries a first identifier of the first logic table, a second identifier of the second logic table and a query condition.
Step S304, pulling data of the first logic table from the plurality of storage nodes, and establishing a hash table corresponding to the first logic table in the designated memory according to the pulled data.
Step S306, for each storage node of the plurality of storage nodes, performing the following operations in parallel: pulling data of the second logic table from the storage node; and querying the first data meeting the query condition from the data of the second logic table.
The computing node may concurrently pull the data of the second logical table (equivalent to the outer table data) from each of the plurality of storage nodes, and then query the first data meeting the query condition from the pulled data of the second logical table.
In a specific implementation, after the first data meeting the query condition is queried from the data of the second logic table, the pulled data of the second logic table needs to be deleted to release the memory, so that the memory occupation is reduced.
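A minimal Python sketch of step S306 follows, assuming a thread pool with one worker per storage node. The dictionary STORAGE_NODES, the helper names and the sample rows are hypothetical stand-ins for real network pulls, and the query condition is modelled simply as membership of the join key in a set of wanted keys.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the storage nodes: node name -> outer-table rows.
STORAGE_NODES = {
    "Group0": [(0, "xiaoli", "shanghai", 30), (1, "xiaowang", "tianjin", 25)],
    "Group1": [(2, "xiaozhao", "xian", 27)],
}

def pull_outer_rows(node):
    """Simulate pulling the second logical table's data from one node."""
    return list(STORAGE_NODES[node])

def scan_node(node, wanted_keys, key_index):
    pulled = pull_outer_rows(node)
    try:
        # Keep only rows whose join key satisfies the (assumed) condition.
        return [row for row in pulled if row[key_index] in wanted_keys]
    finally:
        pulled.clear()  # delete the pulled copy once it has been queried

def scan_all_nodes(nodes, wanted_keys, key_index):
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = pool.map(lambda n: scan_node(n, wanted_keys, key_index), nodes)
        return [row for rows in per_node for row in rows]

first_data = scan_all_nodes(["Group0", "Group1"], {"tianjin", "xian"}, key_index=2)
```

Clearing the pulled list in the finally clause mirrors the deletion of the pulled data described above, so only the rows meeting the condition remain in memory.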
In step S308, second data satisfying the query condition is obtained from the hash table.
In step S310, a plurality of target information satisfying the query condition is determined from the first data and the second data.
Step S312, splicing the target information to obtain a splicing result; and determining the splicing result as a query result corresponding to the data query instruction.
During specific implementation, a plurality of target information meeting the query conditions can be determined from the first data and the second data, then the target information needs to be spliced, and a splicing result obtained after splicing is used as a query result corresponding to the data query instruction.
In some embodiments, in order to return data meeting the query condition as soon as possible, multiple threads may be used to splice the plurality of target information to obtain the splicing result. In other words, one thread may be used to pull the outer table data from the plurality of storage nodes, and multiple threads may then be used to splice the data meeting the query condition (equivalent to the above target information).
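The multithreaded splicing can be sketched in Python as follows. The output layout (native place, inner table name, outer table name), the function names and the second sample pair are assumptions made for the sketch rather than details fixed by the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor

def splice(pair):
    """Combine one matched (outer_row, inner_row) pair into an output row."""
    outer_row, inner_row = pair
    # Assumed layout: native place, inner-table name, outer-table name.
    return (outer_row[2], inner_row[1], outer_row[1])

def splice_all(matched_pairs, workers=4):
    """Splice all pairs with a pool of worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(splice, matched_pairs))

# Hypothetical matched pairs (outer row first, inner row second).
pairs = [
    ((1, "xiaowang", "tianjin", 25), (1, "xiaohong", "tianjin", 18)),
    ((2, "xiaozhao", "xian", 27), (2, "xiaojing", "xian", 21)),
]
results = splice_all(pairs)
```

Note that in CPython, threads mainly help when the work involves waiting on I/O; for purely CPU-bound splicing the global interpreter lock limits the speed-up, so the sketch shows the structure of the approach rather than its performance.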
Step S314, returning the query result to the user side.
In a specific implementation, after the query result is returned to the user side, the hash table needs to be released from the specified memory to free its space, so that sufficient memory is available for storing inner table data and constructing the hash table in the next data query.
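One way to guarantee that the hash table is released once the query finishes is a scoped resource, sketched below in Python. The dictionary GLOBAL_MEMORY and the context manager shared_hash_table are hypothetical names; the dictionary merely stands in for the specified memory, and the sketch assumes each join key occurs once in the inner table.

```python
from contextlib import contextmanager

GLOBAL_MEMORY = {}  # hypothetical stand-in for the specified (global) memory

@contextmanager
def shared_hash_table(name, rows, key_index):
    """Build a hash table in GLOBAL_MEMORY and release it on exit."""
    GLOBAL_MEMORY[name] = {row[key_index]: row for row in rows}
    try:
        yield GLOBAL_MEMORY[name]
    finally:
        # Release the table after the query result has been produced,
        # even if the query raised an error.
        del GLOBAL_MEMORY[name]

with shared_hash_table("h1", [(1, "xiaohong", "tianjin", 18)], key_index=2) as h1:
    answer = h1.get("tianjin")
```

After the with block ends, GLOBAL_MEMORY no longer contains the entry h1, so the space is available for the next query.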
For ease of understanding, the embodiment of the present invention is described in detail below, taking as an example a query condition that looks up the data common to the two logical tables. Suppose the computing node receives the following SQL data query request sent by the user side: select t2.birthplace, t2.name, t1.name from t2 left join t1 on t2.birthplace = t1.birthplace; where t1 represents the first identifier of the first logical table, t2 represents the second identifier of the second logical table, and birthplace represents the native place. After receiving the data query request, the computing node pulls the stored data of the first logical table from the plurality of storage nodes connected to it according to the first identifier, and constructs a hash table corresponding to the first logical table from the pulled data. The data of the first logical table and the hash table are both stored in a global memory (equivalent to the specified memory). For example, a hash table h1 is constructed that uses the third column of the data (the native place) as the key, so that the data common to the two tables can be looked up; the hash table contains the following contents:
beijing:0,xiaoming,beijing,20;
tianjin:1,xiaohong,tianjin,18;
xian:2,xiaojing,xian,21;
chengdu:3,xiaowei,chengdu,22;
where the colon is preceded by a key and followed by a value.
Then, the computing node concurrently pulls the data of the second logical table from each storage node according to the second identifier and determines the first data meeting the query condition from that data; it then queries the hash table h1 stored in the global memory to obtain the second data. Assume that the first data determined from the second logical table is 1, xiaowang, tianjin, 25, from which the native place tianjin is retrieved. Querying the hash table with the native place tianjin then yields the second data 1, xiaohong, tianjin, 18. That is, the data meeting the condition in the two tables are 1, xiaohong, tianjin, 18 in the hash table and 1, xiaowang, tianjin, 25 in the second logical table. Assuming that the information to be acquired is only the native place, the inner table name (the name in the hash table) and the outer table name (the name in the second logical table), the data meeting the condition needs to be processed and spliced (multithreaded concurrent processing may be adopted). The splicing result is tianjin, xiaohong, xiaowang, and the splicing result is then returned to the caller.
In the data query method, the computing node first receives a data query instruction that is sent by a user side and carries a first identifier of a first logical table, a second identifier of a second logical table and a query condition; then, according to the data of the first logical table pulled from the plurality of storage nodes, a hash table corresponding to the first logical table is established in the specified memory; then, for each storage node of the plurality of storage nodes, the following operations are performed in parallel: pulling data of the second logical table from the storage node, and querying first data meeting the query condition from the data of the second logical table; then second data meeting the query condition is acquired from the hash table; and a plurality of target information meeting the query condition is determined from the first data and the second data, the plurality of target information is spliced to obtain a splicing result, the splicing result is determined as the query result corresponding to the data query instruction, and the query result is returned to the user side. In the method, only one copy of the inner table data is kept in the computing node to construct a single hash table, which saves memory, CPU and network resources and improves hash join efficiency.
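The worked example above can be reproduced with a few lines of Python. The dictionary h1 copies the hash table contents listed earlier, and the variable names are chosen only for this sketch.

```python
# Hash table h1 from the example: key is the native place (third column).
h1 = {
    "beijing": (0, "xiaoming", "beijing", 20),
    "tianjin": (1, "xiaohong", "tianjin", 18),
    "xian": (2, "xiaojing", "xian", 21),
    "chengdu": (3, "xiaowei", "chengdu", 22),
}

# First data: the row found in the second logical table (outer table).
first_data = (1, "xiaowang", "tianjin", 25)

# Retrieve the native place and look it up in the hash table.
native_place = first_data[2]
second_data = h1[native_place]

# Splice: native place, inner-table name, outer-table name.
spliced = (native_place, second_data[1], first_data[1])
print(", ".join(spliced))  # prints: tianjin, xiaohong, xiaowang
```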
For the embodiment of the data query method, the embodiment of the invention provides a data query device, which is arranged at a computing node of a distributed database; the computing node is in communication connection with a plurality of storage nodes; the plurality of storage nodes store data of the first logic table and data of the second logic table; as shown in fig. 4, the apparatus includes:
the instruction receiving module 40 is configured to receive a data query instruction sent by a user side; the data query instruction carries a first identifier of the first logic table, a second identifier of the second logic table and a query condition.
A hash table establishing module 41, configured to pull data of the first logic table from the multiple storage nodes, and establish a hash table corresponding to the first logic table in the specified memory according to the pulled data; wherein the hash table of the specified memory is used for: providing each storage node with data to be queried.
A data query module 42, configured to obtain first data that meets a query condition from data in the second logical table stored in the plurality of storage nodes; and acquiring second data meeting the query condition from the hash table.
And the result returning module 43 is configured to determine a query result corresponding to the data query instruction based on the first data and the second data, and return the query result to the user side.
In the data query device, the computing node firstly receives a data query instruction sent by a user side, wherein the data query instruction carries a first identifier of a first logic table, a second identifier of a second logic table and a query condition; pulling data of the first logic table from a plurality of storage nodes connected with the computing node, and establishing a hash table corresponding to the first logic table in a designated memory according to the pulled data; acquiring first data meeting the query conditions from data of a second logic table stored by a plurality of storage nodes, and acquiring second data meeting the query conditions from a hash table; and then determining a query result corresponding to the data query instruction based on the first data and the second data, and returning the query result to the user side. The computing node of the method constructs a hash table in the appointed memory which can be accessed by each storage node, and realizes data query through the hash table and the data stored in the plurality of storage nodes.
Further, the result returning module 43 is further configured to: determining a plurality of target information meeting the query condition from the first data and the second data; splicing the target information to obtain a splicing result; and determining the splicing result as a query result corresponding to the data query instruction.
In a specific implementation, the result returning module 43 is further configured to: and splicing the target information by adopting a plurality of threads to obtain a splicing result.
Further, the data query module 42 is further configured to: for each storage node of the plurality of storage nodes, performing the following operations in parallel: pulling data of the second logic table from the storage node; and querying the first data meeting the query condition from the data of the second logic table.
Specifically, the apparatus further includes a data deleting module, configured to: and deleting the pulled data of the second logic table after querying the first data meeting the query condition from the data of the second logic table.
In some embodiments, the apparatus further includes a table deleting module, configured to release the hash table from the specified memory after returning the query result to the user side.
The data query apparatus provided in the embodiment of the present invention has the same implementation principle and technical effect as those of the foregoing method embodiments, and for the sake of brief description, reference may be made to corresponding contents in the foregoing method embodiments for parts of embodiments that are not mentioned in the apparatus embodiments.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, the electronic device includes a processor 101 and a memory 100, where the memory 100 stores machine executable instructions capable of being executed by the processor 101, and the processor 101 executes the machine executable instructions to implement the data query method.
Further, the electronic device shown in fig. 5 further includes a bus 102 and a communication interface 103, and the processor 101, the communication interface 103, and the memory 100 are connected through the bus 102.
The memory 100 may include a high-speed Random Access Memory (RAM) and may further include a non-volatile memory (non-volatile memory), such as at least one disk memory. The communication connection between the network element of the system and at least one other network element is realized through at least one communication interface 103 (which may be wired or wireless), and the internet, a wide area network, a local network, a metropolitan area network, and the like can be used. The bus 102 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 5, but this does not indicate only one bus or one type of bus.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 101. The processor 101 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 100, and the processor 101 reads the information in the memory 100, and completes the steps of the method of the foregoing embodiment in combination with the hardware thereof.
The embodiment of the present invention further provides a machine-readable storage medium, where the machine-readable storage medium stores machine-executable instructions, and when the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the data query method.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present invention, and they shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.