CN104102710A

CN104102710A - Massive data query method

Info

Publication number: CN104102710A
Application number: CN201410336964.3A
Authority: CN
Inventors: 赵仁明; 辛国茂; 亓开元; 房体盈
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2014-07-15
Filing date: 2014-07-15
Publication date: 2014-10-15

Abstract

The invention discloses a massive data query method, comprising: establishing an index mapping between HBase non-rowkey query fields and the rowkey; at query time, looking up in SolrCloud, according to the index mapping, the rowkey corresponding to the query field; and using that rowkey to look up the data in HBase and displaying the query results in pages.

Description

A Massive Data Query Method

Technical Field

The invention relates to the field of big data, and in particular to a massive data query method based on SolrCloud and HBase.

Background

Big data usually describes the large volumes of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires frameworks such as MapReduce and HBase to distribute work across tens, hundreds, or even thousands of computers. Compared with traditional data warehouse applications, big data analysis involves larger data volumes and more complex queries and analysis. Big data requires special techniques to process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

Solr is a standalone enterprise search server that exposes a web-service-like API. Users can submit XML files in a prescribed format to the search server over HTTP to build an index, and can issue search requests with HTTP GET and receive the results in XML or JSON format. SolrCloud is the distributed search solution, based on Solr and ZooKeeper, available since Solr 4.0; it is a ZooKeeper-based deployment mode of Solr.

HBase is a distributed, column-oriented open source database derived from the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system with which a large-scale structured storage cluster can be built on inexpensive PC servers. While HBase supports highly concurrent reads and writes, it has some significant shortcomings: because HBase sorts only by rowkey (row key value), it cannot quickly search or retrieve by fields other than the rowkey. HBase also cannot provide query-based paged display or page-by-page queries. A massive data query method based on SolrCloud and HBase can effectively solve these problems.

Summary of the Invention

To solve the above technical problems, the present invention provides a massive data query method and device that support flexible multi-condition queries over massive data, fuzzy queries, and pagination of query results.

A massive data query method, comprising:

establishing an index mapping between HBase non-rowkey query fields and the rowkey;

at query time, looking up in SolrCloud, according to the index mapping, the rowkey corresponding to the query field;

using the rowkey to look up the data in HBase, and displaying the query results in pages.

Preferably, when the data in HBase changes, the index mapping in SolrCloud is updated periodically.

Preferably, the index mapping is stored in a distributed manner;

when a master server receives an update of the index mapping, it sends the updated index mapping to the other replica servers of the same shard;

when a replica server receives an update of the index mapping, it sends the updated index mapping to the master server to which it belongs.

Preferably, the MapReduce model is used to speed up the building of the index mapping.

A massive data query device, comprising:

a mapping module, which establishes an index mapping between HBase non-rowkey query fields and the rowkey;

a query module, which, according to the index mapping, first looks up in SolrCloud the HBase rowkey corresponding to the query field and then uses that rowkey to query the required data in HBase;

a display module, which displays the query results to the user in pages.

Preferably, the device further comprises an update module, which periodically updates the index mapping in SolrCloud when the data in HBase changes.

Preferably, the device further comprises a synchronization module, which, when the device acts as a master server, sends the updated index mapping to the other replica servers of the same shard.

Preferably, when the device acts as a replica server, the synchronization module sends the updated index mapping to the master server to which the device belongs after the update module has updated the index mapping.

The technical solution of this application uses SolrCloud to store and maintain the index mapping from the non-rowkey fields of HBase that need to be queried to the rowkey. The corresponding rowkey is found according to the query conditions and then used to look up the data in HBase, which enables flexible multi-condition queries over massive data, fuzzy queries, and pagination of query results. In addition, SolrCloud is deployed in a distributed manner, which provides centralized information storage, automatic fault tolerance, near-real-time search, and automatic load balancing.

Brief Description of the Drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of a Solr+HBase query according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the HBase rowkey index mapping according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the SolrCloud cluster layout according to an embodiment of the present invention;

Fig. 4 is a flow chart of the massive data query method according to an embodiment of the present invention;

Fig. 5 is a structural diagram of the massive data query device according to an embodiment of the present invention.

Detailed Description

The present invention adopts a method based on SolrCloud+HBase that can establish an index mapping between specified non-rowkey fields in HBase and the rowkey. At query time, the rowkey corresponding to the queried field is found first and then used for the lookup in HBase, which avoids the limitation that direct HBase queries support only a single kind of query condition. The invention can display query results in pages, and so provides multi-condition queries and result pagination that are convenient and easy to implement, together with full-text indexing and fuzzy query capabilities that traditional HBase storage lacks. Requests for count statistics can be answered directly from the Solr index mapping without querying HBase at all.

The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

Every record in HBase is indexed in sorted order by rowkey. As shown in Fig. 2, the index is a multi-level index whose lookup resembles a three-level B+ tree. First, the location of the root region is obtained from ZooKeeper and the -ROOT- region is loaded. The -ROOT- region is the first region of the .META. table and stores the location information of all other regions of the .META. table. The .META. table resides in the memory of all RegionServers and stores the region location information of all data tables. A rowkey query in HBase therefore consults -ROOT- and .META., locates the region that holds the data, and then reads the matching data from that region. Because the index at each step is already built and sorted, rowkey-based queries in HBase are very efficient. For non-rowkey queries, however, efficiency drops significantly.
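The three-level lookup described above can be illustrated with a toy model. This is only a sketch, not HBase code: the region names, start keys, sample rows, and the `locate`, `get_row`, and `scan_by_field` helpers are all hypothetical. Each level is modeled as a sorted list of start keys that is searched with binary search, and the non-rowkey case is shown as a scan over every region for contrast.

```python
from bisect import bisect_right

# Toy model of the -ROOT- -> .META. -> region lookup.
# All names, keys and rows below are hypothetical illustration data.
# Each level is a sorted list of (start_key, target) pairs; a key belongs
# to the last entry whose start_key is <= the key.

def locate(level, key):
    """Return the target of the entry that covers `key` in a sorted level."""
    starts = [start for start, _ in level]
    idx = bisect_right(starts, key) - 1
    return level[idx][1]

# -ROOT- maps rowkey ranges to .META. regions.
ROOT = [("", "meta-1"), ("n", "meta-2")]

# Each .META. region maps rowkey ranges to data regions.
META_REGIONS = {
    "meta-1": [("", "region-A"), ("g", "region-B")],
    "meta-2": [("n", "region-C"), ("t", "region-D")],
}

# Data regions hold the actual rows, keyed by rowkey.
DATA_REGIONS = {
    "region-A": {"apple": {"color": "red"}},
    "region-B": {"kiwi": {"color": "green"}},
    "region-C": {"plum": {"color": "purple"}},
    "region-D": {"yam": {"color": "brown"}},
}

def get_row(rowkey):
    """Rowkey lookup: two sorted-index hops, then one read in a single region."""
    meta_region = locate(ROOT, rowkey)                        # step 1: -ROOT-
    data_region = locate(META_REGIONS[meta_region], rowkey)   # step 2: .META.
    return data_region, DATA_REGIONS[data_region].get(rowkey) # step 3: region

def scan_by_field(field, value):
    """Non-rowkey lookup: every row in every region has to be inspected."""
    return [rowkey
            for rows in DATA_REGIONS.values()
            for rowkey, row in rows.items()
            if row.get(field) == value]
```

In this model `get_row` touches exactly one data region, while `scan_by_field` visits all of them, which is the efficiency gap the paragraph above describes.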

To solve this problem, the present invention uses SolrCloud to build, in advance, a set of index mappings from the non-rowkey query fields to their corresponding rowkeys. The query process is shown in Fig. 1: the HBase rowkey corresponding to the query condition is first looked up in SolrCloud, that rowkey is then used to query the data in HBase, and finally the query result is returned to the client. This approach can greatly improve query efficiency.
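The two-step query flow can be sketched with in-memory stand-ins. The code below does not call Solr or HBase: `HBASE` and `INDEX` are plain dictionaries, the sample rows are invented, and the wildcard matching with `fnmatchcase` is only an assumed stand-in for the fuzzy matching a search server would provide.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory stand-ins:
# HBASE maps rowkey -> row; INDEX plays the role of the SolrCloud index
# and maps field -> value -> set of rowkeys.
HBASE = {
    "r1": {"name": "alice", "city": "beijing"},
    "r2": {"name": "alina", "city": "jinan"},
    "r3": {"name": "bob", "city": "beijing"},
}

def build_index(table, fields):
    """Build the non-rowkey field -> rowkey mapping for the given fields."""
    index = {field: {} for field in fields}
    for rowkey, row in table.items():
        for field in fields:
            if field in row:
                index[field].setdefault(row[field], set()).add(rowkey)
    return index

INDEX = build_index(HBASE, ["name", "city"])

def find_rowkeys(index, conditions):
    """Multi-condition lookup; patterns may use * and ? for fuzzy matching."""
    result = None
    for field, pattern in conditions.items():
        matched = set()
        for value, rowkeys in index.get(field, {}).items():
            if fnmatchcase(value, pattern):
                matched |= rowkeys
        # All conditions must hold, so intersect the per-field matches.
        result = matched if result is None else result & matched
    return sorted(result or [])

def query(conditions):
    rowkeys = find_rowkeys(INDEX, conditions)    # step 1: look up rowkeys in the index
    return [(rk, HBASE[rk]) for rk in rowkeys]   # step 2: fetch the rows by rowkey

def count(conditions):
    # Count statistics need only the index; the row store is never read.
    return len(find_rowkeys(INDEX, conditions))
```

For example, `query({"name": "ali*", "city": "beijing"})` combines a fuzzy condition with an exact one and reads only the matching row from the row store.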

SolrCloud is deployed in a distributed manner, which provides centralized information storage, automatic fault tolerance, near-real-time search, and automatic load balancing. Fig. 3 shows a SolrCloud cluster with 6 nodes (servers). The index mapping is distributed across two Shards, and each Shard contains three Solr nodes: one Leader (master) node and two Replica nodes. Each Shard therefore exists in 3 copies, so the system keeps working normally even when 2 nodes are down at the same time. All state information of the cluster is maintained centrally by the ZooKeeper cluster. Any of the 6 nodes can accept an update request for the index mapping, which achieves load balancing. For example, when node Server4 receives an update request for the index mapping of Shard1, it forwards the request to the Leader node that owns that index mapping, namely Server1. After Server1 finishes the update, it sends the version number and the index mapping to the other Replica nodes of the same Shard, namely Server2 and Server3, to complete the synchronization.
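The forwarding and synchronization path in this example can be modeled in a few lines. The sketch below is a simulation only and does not reflect SolrCloud internals: the `Node`, `Shard`, and `Cluster` classes, the byte-sum routing rule, and the assumption that Server4 leads the second shard are all hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.index = {}    # rowkey -> indexed fields
        self.version = 0   # version number of the last applied update

class Shard:
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas

class Cluster:
    def __init__(self, shards):
        self.shards = shards
        self.log = []      # records the path each update takes

    def shard_for(self, rowkey):
        # Hypothetical routing rule: byte sum of the rowkey modulo shard count.
        return self.shards[sum(rowkey.encode()) % len(self.shards)]

    def update(self, receiving_node, rowkey, fields):
        """Any node may receive an update; the owning leader applies it first."""
        shard = self.shard_for(rowkey)
        leader = shard.leader
        if receiving_node is not leader:
            self.log.append(f"{receiving_node.name} -> {leader.name} (forward)")
        leader.index[rowkey] = fields
        leader.version += 1
        # The leader pushes the version number and the entry to its replicas.
        for replica in shard.replicas:
            replica.index[rowkey] = fields
            replica.version = leader.version
            self.log.append(f"{leader.name} -> {replica.name} (sync)")

def build_demo_cluster():
    """Six servers in two shards, each with one leader and two replicas."""
    servers = [Node(f"Server{i}") for i in range(1, 7)]
    shard1 = Shard(leader=servers[0], replicas=servers[1:3])
    shard2 = Shard(leader=servers[3], replicas=servers[4:6])  # leader is assumed
    return Cluster([shard1, shard2]), servers
```

Calling `cluster.update(servers[3], "r2", {...})` on the demo cluster reproduces the path in the text: Server4 forwards to Server1, which then synchronizes Server2 and Server3.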

The massive data query method provided by the present invention, as shown in Fig. 4, includes:

Step 401: establish an index mapping between HBase non-rowkey query fields and the rowkey.

When the HBase data is created, SolrCloud is used, according to the configured query conditions, to build the index mapping between non-rowkey fields and the rowkey. The query conditions are defined on HBase non-rowkey fields.

While the Solr index mapping is being built, the MapReduce model can be used to speed up the process.
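How index building decomposes into map and reduce steps can be shown with a single-process sketch. This is plain Python rather than a Hadoop job; the function names and the sample table used to exercise them are hypothetical, and in a real MapReduce job the map and reduce calls would run in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(rowkey, row, indexed_fields):
    """Emit one ((field, value), rowkey) pair per indexed non-rowkey field."""
    for field in indexed_fields:
        if field in row:
            yield (field, row[field]), rowkey

def shuffle(pairs):
    """Group the emitted rowkeys by their (field, value) key."""
    grouped = defaultdict(list)
    for key, rowkey in pairs:
        grouped[key].append(rowkey)
    return grouped

def reduce_phase(key, rowkeys):
    """Collapse each group into a sorted, duplicate-free rowkey list."""
    return key, sorted(set(rowkeys))

def build_index(table, indexed_fields):
    """Run map, shuffle and reduce over an in-memory table (rowkey -> row)."""
    pairs = (pair
             for rowkey, row in table.items()
             for pair in map_phase(rowkey, row, indexed_fields))
    return dict(reduce_phase(key, rowkeys)
                for key, rowkeys in shuffle(pairs).items())
```

Because each row is mapped independently and each (field, value) group is reduced independently, both phases can be split across machines, which is where the speed-up would come from.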

Step 402: at query time, look up the corresponding rowkey in SolrCloud according to the index mapping.

When a query is needed, the HBase rowkey corresponding to the query condition is first looked up in SolrCloud according to the index mapping, and that rowkey is then used to query the required data in HBase.

Step 403: display the query results to the user in pages.

After the data has been obtained from HBase by rowkey, it is displayed to the user according to the configured paging scheme.

In HBase's native approach, query results cannot be displayed in pages, and users can only view the full result set at once. The improvement of the present invention is to display the query results in pages, for example 20 items per page, so that the user can take in the displayed items at a glance.
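Paging over the rowkeys returned by the index can be sketched as simple slicing. The helper names below are hypothetical, the default page size of 20 follows the example in the text, and `table` is assumed to be any mapping from rowkey to row.

```python
def paginate(rowkeys, page, page_size=20):
    """Return the rowkeys that belong on the given 1-based page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    return rowkeys[start:start + page_size]

def total_pages(result_count, page_size=20):
    """Number of pages needed to show result_count items."""
    return (result_count + page_size - 1) // page_size

def fetch_page(table, rowkeys, page, page_size=20):
    """Fetch only the rows of one page; other rows are never read."""
    return [table[rowkey] for rowkey in paginate(rowkeys, page, page_size)]
```

Only the rowkeys of the requested page are resolved against the row store, so the cost of showing a page does not grow with the total number of matches. In Solr the same offset and page size would typically be expressed through the `start` and `rows` query parameters.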

Preferably, the method may further include: periodically updating the index mapping in SolrCloud when the data in HBase changes.
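One possible shape for such a periodic update is to record which rowkeys changed and reindex only those on each run. The text does not specify the mechanism, so everything in the sketch below is an assumption: the `IndexSynchronizer` class, its dirty-set bookkeeping, and the in-memory table and index are hypothetical, and the timer that would call `sync` is left out.

```python
class IndexSynchronizer:
    """Tracks changed rows and refreshes their index entries in batches."""

    def __init__(self, table, indexed_fields):
        self.table = table            # rowkey -> row (stand-in for HBase)
        self.fields = indexed_fields
        self.index = {}               # (field, value) -> set of rowkeys
        self.indexed = {}             # rowkey -> {field: value} as last indexed
        self.dirty = set()            # rowkeys changed since the last sync

    def put(self, rowkey, row):
        self.table[rowkey] = row
        self.dirty.add(rowkey)

    def delete(self, rowkey):
        self.table.pop(rowkey, None)
        self.dirty.add(rowkey)

    def sync(self):
        """Meant to be called periodically; returns the number of rows refreshed."""
        for rowkey in self.dirty:
            # Remove the entries created the last time this row was indexed.
            for item in self.indexed.pop(rowkey, {}).items():
                self.index[item].discard(rowkey)
                if not self.index[item]:
                    del self.index[item]
            # Re-add entries for the current row, unless it was deleted.
            row = self.table.get(rowkey)
            if row is not None:
                entry = {f: row[f] for f in self.fields if f in row}
                for item in entry.items():
                    self.index.setdefault(item, set()).add(rowkey)
                self.indexed[rowkey] = entry
        changed = len(self.dirty)
        self.dirty.clear()
        return changed
```

Between two `sync` calls the index is stale, which is the trade-off a periodic update accepts in exchange for not reindexing on every write.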

The present invention also provides a corresponding massive data query device, as shown in Fig. 5, including:

a query module, which, according to the index mapping, first looks up in SolrCloud the HBase rowkey corresponding to the query condition and then uses that rowkey to query the required data in HBase.

Preferably, the device further includes an update module, which periodically updates the index mapping in SolrCloud when the data in HBase changes.

The massive data query device of the present invention can serve as a server node. As shown in Fig. 3, it can be installed on every server, and the servers together form a cluster.

Preferably, the device further includes a synchronization module. When the device acts as a replica server, the synchronization module sends the updated index mapping to the master server to which the device belongs after the update module has updated the index mapping.

Application Example

1. Definition and configuration of the Solr schema file

Modify the schema.xml file and add the fields that need to be indexed. Also change the original uniqueKey so that the rowkey of the HBase table becomes Solr's uniqueKey.

2. Building the index mapping

Build the Solr index for the data in HBase either with a full table scan (Scan) through the HBase API or with MapReduce.

3. Implementing query and paging

At query time, the rowkey or group of rowkeys matching the query conditions is found in Solr. Once these rowkeys have been obtained, they are used in groups to query HBase, which retrieves the actual results and implements paged lookup.
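Querying by rowkey in groups can be sketched as lazy batch fetching, where each group is read from the row store only when it is requested. The `FakeTable` class and its `get_many` method are hypothetical stand-ins for a table that supports batched gets; they are not part of any real client library.

```python
def chunked(rowkeys, group_size):
    """Split a rowkey list into consecutive groups of at most group_size."""
    for start in range(0, len(rowkeys), group_size):
        yield rowkeys[start:start + group_size]

class FakeTable:
    """Hypothetical stand-in for an HBase table that counts batch lookups."""

    def __init__(self, rows):
        self.rows = rows
        self.batch_calls = 0

    def get_many(self, rowkeys):
        self.batch_calls += 1
        return [self.rows[rk] for rk in rowkeys if rk in self.rows]

def iter_pages(table, rowkeys, group_size):
    """Yield one group of rows at a time; each group costs one batch lookup."""
    for group in chunked(rowkeys, group_size):
        yield table.get_many(group)
```

Because `iter_pages` is a generator, a caller that only shows the first page triggers a single batch lookup, and later groups are fetched only if the user pages forward.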

Those of ordinary skill in the art will understand that all or some of the steps of the above method can be carried out by a program that instructs the relevant hardware, and that the program can be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, each module or unit in the above embodiments may be implemented in hardware or as a software function module. This application is not limited to any specific combination of hardware and software.

The above are only preferred examples of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A massive data query method, characterized by comprising:

establishing an index mapping between HBase non-rowkey query fields and the rowkey;

at query time, looking up in SolrCloud, according to the index mapping, the rowkey corresponding to the query field;

using the rowkey to look up the data in HBase, and displaying the query results in pages.

2. The method of claim 1, characterized in that

when the data in HBase changes, the index mapping in SolrCloud is updated periodically.

3. The method of claim 2, characterized in that

the index mapping is stored in a distributed manner;

when a master server receives an update of the index mapping, it sends the updated index mapping to the other replica servers of the same shard;

when a replica server receives an update of the index mapping, it sends the updated index mapping to the master server to which it belongs.

4. The method of claim 1, characterized in that

the MapReduce model is used to speed up the building of the index mapping.

5. A massive data query device, characterized by comprising:

a mapping module, which establishes an index mapping between HBase non-rowkey query fields and the rowkey;

a query module, which, according to the index mapping, first looks up in SolrCloud the HBase rowkey corresponding to the query field and then uses that rowkey to query the required data in HBase;

a display module, which displays the query results to the user in pages.

6. The device of claim 5, characterized by further comprising:

an update module, which periodically updates the index mapping in SolrCloud when the data in HBase changes.

7. The device of claim 5, characterized by further comprising:

a synchronization module, which, when the device acts as a master server, sends the updated index mapping to the other replica servers of the same shard.

8. The device of claim 7, characterized in that

when the device acts as a replica server, the synchronization module sends the updated index mapping to the master server to which the device belongs after the update module has updated the index mapping.