CN110109920B

CN110109920B - Data comparison method and server

Info

Publication number: CN110109920B
Application number: CN201910207220.4A
Authority: CN
Inventors: 贾仁庆; 任晓静; 丁以樵
Original assignee: MIGU Culture Technology Co Ltd
Current assignee: MIGU Culture Technology Co Ltd
Priority date: 2019-03-19
Filing date: 2019-03-19
Publication date: 2022-03-22
Anticipated expiration: 2039-03-19
Also published as: CN110109920A

Abstract

The embodiment of the invention relates to the technical field of data processing, and discloses a data comparison method and a server. The data comparison method comprises the following steps: reading two data in parallel; wherein each of the data comprises a plurality of records and each of the records forms a hash node; attempting to insert each of the hash nodes into a hash tree; and in the process of trying to insert, if determining that the node matched with the hash node exists in the hash tree, rejecting the node. The embodiment of the invention also provides a server. The technical scheme of the embodiment of the invention can efficiently process comparison of large data volume and effectively solve the bottleneck problem of the storage capacity of the Hash table in the comparison scheme.

Description

Data comparison method and server

技术领域technical field

本发明实施例涉及数据处理技术领域，特别涉及一种数据比对方法及服务器。The embodiments of the present invention relate to the technical field of data processing, and in particular, to a data comparison method and a server.

背景技术Background technique

在局数据同步的过程中，由于各种原因经常出现各系统之间的局数据不一致的情况，比如内容接口信息局数据由子公司同步给局数据系统，再由局数据系统同步给安全管控，经常会出现子公司与局数据系统之间、局数据系统与安全管控之间接口数据不一致的场景，因此需要对各网元数据是否一致进行比对。In the process of office data synchronization, there are often inconsistencies in office data among various systems due to various reasons. For example, the content interface information office data is synchronized by the subsidiary to the office data system, and then synchronized by the office data system to the security management and control. There will be situations where the interface data between the subsidiary and the data system of the office is inconsistent, and the data of the interface between the data system of the office and the security management and control system is inconsistent. Therefore, it is necessary to compare whether the metadata of each network is consistent.

目前，对各网元数据是否一致进行比对时，会把第一比对数据插入第一Hash表A中，然后逐条读取第二比对数据，并判断在第一Hash表A中是否存在与当前读取数据相同的第一数据，如果存在就把第一相同数据从第一Hash表A中删除，如果不存在就把当前数据存储在第二Hash表B中。当数据读取完毕之后，第一差异数据存储在第一Hash表A中、第二差异数据存储在第二Hash表B中。At present, when comparing whether the data of each network element is consistent, the first comparison data is inserted into the first Hash table A, and then the second comparison data is read one by one, and it is judged whether it exists in the first Hash table A. If the first data identical to the current read data exists, delete the first identical data from the first Hash table A, and store the current data in the second Hash table B if it does not exist. After the data is read, the first difference data is stored in the first Hash table A, and the second difference data is stored in the second Hash table B.

发明人发现现有技术中至少存在如下问题：在比对开始前，第一比对数据需要先被存入Hash表中；在第一比对数据的数据量非常大的情况下，由于Hash表容量的限制，会出现瓶颈，处理速度由于碰撞严重而快速下滑的问题。The inventor found that there are at least the following problems in the prior art: before the comparison starts, the first comparison data needs to be stored in the Hash table; Due to the limitation of capacity, there will be bottlenecks, and the processing speed will drop rapidly due to serious collisions.

发明内容SUMMARY OF THE INVENTION

本发明实施方式的目的在于提供一种数据比对方法及服务器，可以高效地处理大数据量的比对，有效解决比对方案中Hash表存储容量的瓶颈问题。The purpose of the embodiments of the present invention is to provide a data comparison method and a server, which can efficiently handle the comparison of a large amount of data, and effectively solve the bottleneck problem of the storage capacity of the Hash table in the comparison scheme.

为解决上述技术问题，本发明的实施方式提供了一种数据比对方法，包括：并行读取两份数据；其中，每份所述数据包含多个记录且每个所述记录形成一个哈希节点；将每个所述哈希节点尝试插入哈希树中；在尝试插入的过程中，若确定所述哈希树中存在与所述哈希节点匹配的节点，剔除所述节点。In order to solve the above technical problem, an embodiment of the present invention provides a data comparison method, comprising: reading two copies of data in parallel; wherein, each copy of the data includes multiple records and each of the records forms a hash node; try to insert each hash node into the hash tree; in the process of trying to insert, if it is determined that there is a node matching the hash node in the hash tree, the node is eliminated.

本发明的实施方式还提供了一种服务器，包括：至少一个处理器；以及，与所述至少一个处理器通信连接的存储器；其中，所述存储器存储有可被所述至少一个处理器执行的指令，所述指令被所述至少一个处理器执行，以使所述至少一个处理器能够执行上述数据比对方法。Embodiments of the present invention also provide a server, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a program executable by the at least one processor instructions, the instructions being executed by the at least one processor to enable the at least one processor to perform the above-described data comparison method.

本发明的实施方式还提供了一种计算机可读存储介质，存储有计算机程序，所述计算机程序被处理器执行时实现上述数据比对方法。Embodiments of the present invention further provide a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the above-mentioned data comparison method is implemented.

本发明实施方式相对于现有技术而言，并行读取两份数据，并当确定哈希树中存在与哈希节点匹配的节点时，剔除该节点。两份数据采用并行读取方式，可以避免现有技术中由于先要容纳完整的一份数据而可能导致的由于哈希树容量限制造成的瓶颈问题；剔除与哈希节点匹配的节点，即剔除数据比对中不需要的记录，仅保留需要的记录，可以进一步节省存储空间；因此，本发明实施例可以高效地处理大数据量的比对，有效解决比对方案中Hash表存储容量的瓶颈问题。Compared with the prior art, the embodiment of the present invention reads two pieces of data in parallel, and when it is determined that there is a node matching the hash node in the hash tree, the node is eliminated. The two copies of data are read in parallel, which can avoid the bottleneck problem caused by the capacity limitation of the hash tree that may be caused by accommodating a complete copy of the data in the prior art; the node that matches the hash node is eliminated, that is, the elimination of For the records that are not needed in the data comparison, only the required records are kept, which can further save storage space; therefore, the embodiment of the present invention can efficiently handle the comparison of a large amount of data, and effectively solve the bottleneck of the storage capacity of the Hash table in the comparison scheme question.

另外，所述剔除所述节点，包括：若所述节点为子节点，选择所述节点的叶子节点替换所述节点并删除选择的所述叶子节点，若所述节点为叶子节点，删除所述节点。本实施例提出了剔除节点的一种具体方式。In addition, the removing the node includes: if the node is a child node, selecting a leaf node of the node to replace the node and deleting the selected leaf node; if the node is a leaf node, deleting the node node. This embodiment proposes a specific way of removing nodes.

另外，所述节点与所述哈希节点匹配，包括：所述节点与所述哈希节点的关键词相同且数据来源不同。本实施例提出了匹配的一种具体方式。In addition, matching the node with the hash node includes: the node and the hash node have the same keyword and different data sources. This embodiment proposes a specific way of matching.

另外，在确定所述哈希树存在与所述哈希节点匹配的所述节点后，还包括：判断所述节点与所述哈希节点的记录内容是否相同；若所述记录内容不同，输出并保存所述节点与所述哈希节点，并进入剔除所述节点的步骤；若所述记录内容相同，直接进入所述剔除所述节点的步骤。本实施例中，还可以识别出数据源不同、关键字相同且记录内容不同的记录，即可以识别出更多种类的差异数据。In addition, after determining that the hash tree has the node that matches the hash node, the method further includes: judging whether the record content of the node and the hash node is the same; if the record content is different, outputting And save the node and the hash node, and enter the step of eliminating the node; if the record content is the same, directly enter the step of eliminating the node. In this embodiment, records with different data sources, the same keywords, and different record contents can also be identified, that is, more types of different data can be identified.

另外，所述哈希树为基于质数的哈希树。本实施例基于质数的哈希树来实现数据比对方法，容量更大且比对效率更高。In addition, the hash tree is a prime number-based hash tree. This embodiment implements a data comparison method based on a hash tree of prime numbers, which has a larger capacity and higher comparison efficiency.

附图说明Description of drawings

一个或多个实施例通过与之对应的附图中的图片进行示例性说明，这些示例性说明并不构成对实施例的限定，附图中具有相同参考数字标号的元件表示为类似的元件，除非有特别申明，附图中的图不构成比例限制。One or more embodiments are exemplified by the pictures in the corresponding drawings, and these exemplifications do not constitute limitations of the embodiments, and elements with the same reference numerals in the drawings are denoted as similar elements, Unless otherwise stated, the figures in the accompanying drawings do not constitute a scale limitation.

图1是根据本发明第一实施例的数据比对方法的流程图；1 is a flowchart of a data comparison method according to a first embodiment of the present invention;

图2是根据本发明第一实施例中插入hash节点的示意图；2 is a schematic diagram of inserting a hash node according to the first embodiment of the present invention;

图3是根据本发明第一实施例中剔除节点的示意图；FIG. 3 is a schematic diagram of culling a node according to the first embodiment of the present invention;

图4是根据本发明第二实施例的数据比对方法的流程图；4 is a flowchart of a data comparison method according to a second embodiment of the present invention;

图5是根据本发明第三实施例的服务器的示意图。FIG. 5 is a schematic diagram of a server according to a third embodiment of the present invention.

具体实施方式Detailed ways

为使本发明实施例的目的、技术方案和优点更加清楚，下面将结合附图对本发明的各实施方式进行详细的阐述。然而，本领域的普通技术人员可以理解，在本发明各实施方式中，为了使读者更好地理解本申请而提出了许多技术细节。但是，即使没有这些技术细节和基于以下各实施方式的种种变化和修改，也可以实现本申请所要求保护的技术方案。In order to make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the various embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art can appreciate that, in the various embodiments of the present invention, many technical details are set forth in order for the reader to better understand the present application. However, even without these technical details and various changes and modifications based on the following embodiments, the technical solutions claimed in the present application can be realized.

本发明的第一实施方式涉及一种数据比对方法，包括：并行读取两份数据；其中，每份数据包含多个记录且每个记录形成一个哈希节点；将每个哈希节点尝试插入哈希树中；在尝试插入的过程中，若确定哈希树中存在与哈希节点匹配的节点，剔除该节点。The first embodiment of the present invention relates to a data comparison method, which includes: reading two pieces of data in parallel; wherein each piece of data includes multiple records and each record forms a hash node; Insert into the hash tree; in the process of trying to insert, if it is determined that there is a node matching the hash node in the hash tree, remove the node.

下面对本实施方式的数据比对方法的实现细节进行具体的说明，以下内容仅为方便理解提供的实现细节，并非实施本方案的必须。The implementation details of the data comparison method of the present embodiment will be specifically described below. The following contents are only provided for the convenience of understanding, and are not necessary for implementing the present solution.

本实施方式中的数据比对方法可以应用于服务器，服务器用于对两份数据进行一致性比对，以识别出两份数据中的差异数据。例如，子公司要将一份数据A同步至局数据系统，为了确定局数据系统接收到的数据B与子公司发送的数据A是否完全一致，要对子公司内的这份数据A与局数据系统接收到的这份数据B进行比对。因此，服务器直接从子公司获取数据A，并直接从局数据系统获取数据B；然后对这两份数据A、B进行比对，并得到这两份数据A、B之间的差异数据。The data comparison method in this embodiment can be applied to a server, and the server is configured to perform consistency comparison between two pieces of data, so as to identify the difference data in the two pieces of data. For example, a subsidiary needs to synchronize a copy of data A to the data system of the bureau. In order to determine whether the data B received by the data system of the bureau is completely consistent with the data A sent by the subsidiary, it is necessary to compare the data A in the subsidiary with the data of the bureau. The data B received by the system is compared. Therefore, the server obtains data A directly from the subsidiary and data B directly from the bureau data system; then compares the two data A and B, and obtains the difference data between the two data A and B.

如图1所示为第一实施例的数据比对方法的流程图，具体包括以下步骤：1 is a flowchart of the data comparison method of the first embodiment, which specifically includes the following steps:

步骤101，并行读取两份数据。In step 101, two copies of data are read in parallel.

服务器并行读取这两份数据A、B，并将这两份数据A、B尝试插入哈希树中。具体的，每份数据包含多个记录且每个记录形成一个哈希hash节点。以其中一份数据A为例，服务器依次读取数据A中的各条记录，并将各条记录(即hash节点)插入hash树中。在读取各记录的过程中，每个记录形成一个哈希hash节点，每个hash节点包括关键字key、记录内容word以及数据源标志flag；其中，flag表示来自哪份数据，如记录1来自数据A，那么记录1的flag＝A；word表示该条记录的具体内容；key可以唯一表征该条记录的特征，如该条记录的word为合作伙伴局数据，则key可以为合作伙伴代码。本实施例中，服务器尝试插入每条记录时，将一个记录处理完成后，再处理下一个记录。The server reads the two pieces of data A and B in parallel, and tries to insert the two pieces of data A and B into the hash tree. Specifically, each piece of data contains multiple records and each record forms a hash node. Taking one piece of data A as an example, the server reads each record in data A in turn, and inserts each record (ie, a hash node) into the hash tree. In the process of reading each record, each record forms a hash node, and each hash node includes the keyword key, the record content word, and the data source flag flag; among them, the flag indicates which data is from, such as record 1 from data A, then the flag=A of record 1; the word represents the specific content of the record; the key can uniquely characterize the feature of the record, if the word of the record is the partner office data, the key can be the partner code. In this embodiment, when the server attempts to insert each record, it processes one record and then processes the next record.

本实施例中根据质数分辨定理构造hash树，以实现本实施例所述的数据对比方法。设计人员可以根据存储容量的需要选择多个质数来构建hash树。本实施例中，选择质数序列PRIM＝{3、5、7、11、13、17、19、23、29}，即hash树的根节点下有3个节点，作为第一层，第一层中的每个节点下有5个节点，作为第二层，第二层中的每个节点下有7个节点，作为第三层；以此类推。根据质数分辨定理可知，这9个质数可以容纳3*5*7*11*13*17*19*23*29，即30亿的数据；本实施例中选择上述9个质数构建hash树，可以兼顾存储容量需求以及复杂度。需要说明的是，本实施例对选择的hash树的类型不作限定，任何适用于本申请的数据比对方法的hash树类型均属于本申请保护范围。In this embodiment, a hash tree is constructed according to the prime number resolution theorem, so as to realize the data comparison method described in this embodiment. The designer can choose multiple prime numbers to construct the hash tree according to the storage capacity. In this embodiment, the prime number sequence PRIM={3, 5, 7, 11, 13, 17, 19, 23, 29} is selected, that is, there are 3 nodes under the root node of the hash tree, which are used as the first layer. There are 5 nodes under each node in the second layer, and there are 7 nodes under each node in the second layer, which is the third layer; and so on. According to the prime number resolution theorem, these 9 prime numbers can accommodate 3*5*7*11*13*17*19*23*29, that is, 3 billion data. In this embodiment, the above 9 prime numbers are selected to construct a hash tree, which can be Take into account storage capacity requirements and complexity. It should be noted that this embodiment does not limit the type of the selected hash tree, and any type of hash tree applicable to the data comparison method of the present application falls within the protection scope of the present application.

以下步骤102～107为将记录(hash节点)尝试插入基于质数的hash树的过程。The following steps 102 to 107 are the process of trying to insert a record (hash node) into a hash tree based on prime numbers.

步骤102，对哈希节点的关键字进行哈希运算并得到随机数。Step 102, perform hash operation on the key of the hash node and obtain a random number.

以其中一份数据A中的一个记录为例进行说明，该记录形成hash节点，对hash节点中的关键字key进行hash运算并得到一个随机数，以确保向将各hash节点插入到hash树的插入位置满足或近似满足随机分布。计算出随机数后进入位置搜索步骤，即进入步骤103。Take a record in one of the data A as an example to illustrate. The record forms a hash node, and the keyword key in the hash node is hashed and a random number is obtained to ensure that each hash node is inserted into the hash tree. The insertion position satisfies or approximately satisfies a random distribution. After the random number is calculated, the location search step is entered, that is, step 103 is entered.

步骤103，根据随机数与哈希树中当前搜索层对应的质数，确定哈希节点对应的节点位置。Step 103: Determine the node position corresponding to the hash node according to the random number and the prime number corresponding to the current search layer in the hash tree.

具体的，利用步骤102中得到的随机数对当前搜索层对应的质数作取余数运算，并得到余数i，即，hash节点对应的节点位置为当前搜索层的第i个节点位置。Specifically, use the random number obtained in step 102 to perform a remainder operation on the prime number corresponding to the current search layer, and obtain the remainder i, that is, the node position corresponding to the hash node is the ith node position of the current search layer.

搜索hash节点对应的hash树中的节点位置，是从hash树的根节点开始查找的；因此，首先从第一层开始查找，即当前搜索层的初始值为第一层，第一层对应的质数为PRIM[0]＝3；即，本次搜索中，hash节点对应的节点位置为第一层的第i个子节点位置。Searching for the position of the node in the hash tree corresponding to the hash node starts from the root node of the hash tree; therefore, the search starts from the first layer, that is, the initial value of the current search layer is the first layer, and the first layer corresponds to the first layer. The prime number is PRIM[0]=3; that is, in this search, the position of the node corresponding to the hash node is the position of the ith child node of the first layer.

步骤104，判断节点位置是否被占据；若未被占据，进入步骤105；若已被占据，进入步骤106。Step 104, determine whether the node position is occupied; if it is not occupied, go to step 105; if it is occupied, go to step 106.

如果该hash节点对应的节点位置尚未有其他hash节点插入，则表示该节点位置未被占据，此时该hash节点可以插入该节点位置，即步骤105；如果该hash节点对应的节点位置已经有其他hash节点插入，则表示该节点位置已经被占据。If the node position corresponding to the hash node has not been inserted by other hash nodes, it means that the node position is not occupied, and the hash node can be inserted into the node position at this time, that is, step 105; if the node position corresponding to the hash node already has other hash nodes If the hash node is inserted, it means that the node position has been occupied.

步骤105，将哈希节点插入节点位置。Step 105, insert the hash node into the node position.

步骤106，确定节点位置的节点与哈希节点是否匹配；若是，进入步骤107；若否，则进入步骤108，然后再返回步骤103。Step 106 , determine whether the node at the node position matches the hash node; if yes, go to step 107 ; if not, go to step 108 , and then return to step 103 .

当该hash节点对应的节点位置已经被占据时，服务器返回该节点位置的节点，其中，该节点位置的节点也是一个hash节点，包括key、word、flag；本实施例中将待插入的记录称之为hash节点，将已经被插入在节点位置的记录称之为节点，只是为了从名字上区分这两者。服务器判断该hash节点和该节点是否匹配。本实施例中，节点位置中的节点与待插入的hash节点匹配，包括，该节点与该hash节点的关键字相同且数据来源不同。即，当该节点的key与该hash节点的key相同，且该节点的flag与该hash节点的flag不同时，可以认为该节点和该hash节点是分别来自上述两份数据的同一个记录，此时要剔除该节点，即进入步骤107；当该节点的key和该hash节点的key不相同或者该节点的flag和该hash节点的flag相同，则确定该节点和该hash节点不匹配，此时，需要往hash树的下一层继续搜索该hash节点对应的节点位置。When the node position corresponding to the hash node has been occupied, the server returns the node at the node position, wherein the node at the node position is also a hash node, including key, word, and flag; in this embodiment, the record to be inserted is called It is a hash node, and the record that has been inserted at the node position is called a node, just to distinguish the two from the name. The server determines whether the hash node matches the node. In this embodiment, the node in the node position matches the hash node to be inserted, including that the node and the hash node have the same keyword and different data sources. That is, when the key of the node is the same as the key of the hash node, and the flag of the node is different from the flag of the hash node, it can be considered that the node and the hash node are the same record from the above two pieces of data respectively. When the node is to be eliminated, enter step 107; when the key of the node is not the same as the key of the hash node or the flag of the node is the same as the flag of the hash node, it is determined that the node does not match the hash node, at this time , you need to continue to search for the node position corresponding to the hash node to the next layer of the hash tree.

需要说明的是，本实施例中所述的节点位置中的节点与待插入的hash节点匹配的具体条件，仅仅是举例说明，并不以此为限；本领域技术人员可以根据实际的数据比对需要设定匹配的具体条件。It should be noted that the specific conditions for matching the nodes in the node positions with the hash nodes to be inserted described in this embodiment are merely illustrative, and not limited thereto; those skilled in the art can compare the actual data Set specific conditions for matching.

步骤107，剔除节点。Step 107, remove the node.

当该节点和该hash节点被确定为匹配时，就会把该节点从hash树中剔除。本实施例中剔除节点的方式包括：若该节点为子节点，选择该节点的叶子节点替换该节点并删除选择的该叶子节点，若该节点为叶子节点，删除该节点。When the node and the hash node are determined to match, the node is removed from the hash tree. The method of removing a node in this embodiment includes: if the node is a child node, selecting a leaf node of the node to replace the node and deleting the selected leaf node; if the node is a leaf node, deleting the node.

步骤108，根据预设规则更新当前搜索层。Step 108: Update the current search layer according to preset rules.

在本次搜索中，当节点位置的节点与hash节点不匹配时，需要向下一层继续搜索该hash节点对应的节点位置。本实施例中的预设规则为，将当前搜索层更新为当前搜索层的下一层。如当前搜索层为第一层，则更新后的当前搜索层为第二层；如当前搜索层为第二层，则更新后的当前搜索层为第三层，以此类推。步骤108之后再返回步骤103。In this search, when the node at the node position does not match the hash node, it is necessary to continue searching for the node position corresponding to the hash node to the next layer. The preset rule in this embodiment is to update the current search layer to the next layer of the current search layer. If the current search layer is the first layer, the updated current search layer is the second layer; if the current search layer is the second layer, the updated current search layer is the third layer, and so on. After step 108, return to step 103.

如图2所示为当hash节点对应的节点位置未被占据时，hash节点插入其对应的节点位置的示意图。图2中所示为将记录1和记录2尝试插入hash树的过程。记录1中，key＝10，flag＝A，表示记录1来自数据A且key为10，记录2中，key＝1，flag＝B，表示记录2来自数据B且key为1；需要说明的是，由于记录内容的数据量较大，图2中未示出。As shown in Figure 2, when the node position corresponding to the hash node is not occupied, the hash node is inserted into its corresponding node position. Figure 2 shows the process of trying to insert record 1 and record 2 into the hash tree. In record 1, key=10, flag=A, indicating that record 1 comes from data A and key is 10, in record 2, key=1, flag=B, indicating that record 2 comes from data B and key is 1; it should be noted that , which is not shown in FIG. 2 due to the large amount of data recorded.

如图2中标有记录1的虚线所指示的，记录1对应于hash树第一层中的节点位置P1已经被一个节点(key＝0，flag＝A)占据且记录1与该节点不匹配，故再次搜索第二层中的节点位置P12，记录1对应于第二层中的节点位置P12未被占据，图中标志为null的表示未被占据；故将记录1插入到第二层中对应的节点位置P12；同理，如图2中标有记录2的虚线所指示的，记录2对应于hash树第一层中的节点位置P3已经被一个节点(key＝1，flag＝B)占据且记录2与该节点不匹配，故再次搜索第二层中的节点位置P35，记录2对应于第二层中的节点位置P35未被占据，故将记录2插入到第二层中对应的节点位置P35。图2的例子中，记录2对应于hash树第一层中的节点位置P3上的节点(key＝1，flag＝B)，其key与flag与该hash节点相同，表示数据B中有重复的记录。As indicated by the dotted line marked record 1 in Figure 2, record 1 corresponds to node position P1 in the first layer of the hash tree that has been occupied by a node (key=0, flag=A) and record 1 does not match this node, Therefore, the node position P12 in the second layer is searched again, and record 1 corresponds to the node position P12 in the second layer that is not occupied, and the mark marked as null in the figure indicates that it is not occupied; therefore, record 1 is inserted into the second layer corresponding to Similarly, as indicated by the dotted line marked with record 2 in Figure 2, record 2 corresponds to the node position P3 in the first layer of the hash tree that has been occupied by a node (key=1, flag=B) and Record 2 does not match this node, so the node position P35 in the second layer is searched again. Record 2 corresponds to the node position P35 in the second layer that is not occupied, so record 2 is inserted into the corresponding node position in the second layer. P35. In the example of FIG. 2, record 2 corresponds to the node (key=1, flag=B) at the node position P3 in the first layer of the hash tree, and its key and flag are the same as the hash node, indicating that there are duplicates in data B Record.

图3所示为当hash节点对应的节点位置已被占据，且节点位置的节点与该hash节点匹配时，剔除该节点的示意图。图3中所示为将记录3尝试插入hash树的过程。记录3中，key＝0，flag＝B，表示记录3来自数据B且key为0。如图3中标有记录3的虚线所指示的，记录3对应于hash树第一层中的节点位置P1已经被一个节点(key＝0，flag＝A)占据，记录3与该节点的key相同且flag不同，表示记录3与该节点匹配，两者是重复的记录，所以要将该节点(key＝0，flag＝A)剔除；此时，获取该节点(key＝0，flag＝A)的一个叶子节点node1，用该叶子节点node1替换该节点，并删除该叶子节点node1。需要说明的是，本实施例对剔除节点的方式不作任何限定，在其他例子中，也可以用该节点下层的各节点依次向上替换；例如图3中，可以先用node2替换该节点，再用node1替换node2，最后再删除node1。Figure 3 shows a schematic diagram of removing the node when the node position corresponding to the hash node is occupied and the node at the node position matches the hash node. Figure 3 shows the process of trying to insert record 3 into the hash tree. In record 3, key=0, flag=B, indicating that record 3 comes from data B and the key is 0. As indicated by the dotted line marked record 3 in Figure 3, record 3 corresponds to the node position P1 in the first layer of the hash tree that has been occupied by a node (key=0, flag=A), and record 3 has the same key as the node And the flag is different, indicating that record 3 matches the node, and the two are duplicate records, so the node (key=0, flag=A) should be removed; at this time, the node (key=0, flag=A) is obtained a leaf node node1, replace the node with the leaf node node1, and delete the leaf node node1. It should be noted that this embodiment does not impose any limitation on the method of removing a node. In other examples, the nodes at the lower layer of the node may be replaced in sequence upward; for example, in FIG. 3, the node may be replaced by node1 replaces node2, and finally deletes node1.

本实施例中，在各记录插入hash树的过程中，可以将来自不同数据且关键字相同的记录剔除掉，即在将所有的记录处理完毕后，保存在hash树中的节点是两份数据中的差异数据，该差异数据为仅在其中一份数据中存在的记录(仅在数据A中存在，或仅在数据B中存在)；遍历hash树即可得到两份数据中的这类差异数据。In this embodiment, in the process of inserting each record into the hash tree, records from different data and with the same keyword can be eliminated, that is, after all records are processed, the nodes stored in the hash tree are two copies of data The difference data in the data, the difference data is a record that exists only in one of the data (only exists in data A, or only exists in data B); traversing the hash tree can get such differences in the two data data.

本发明实施方式相对于现有技术而言，并行读取两份数据，并当确定哈希树中存在与哈希节点匹配的节点时，剔除该节点。两份数据采用并行读取方式，可以避免现有技术中由于先要容纳完整的一份数据而可能导致的由于哈希树容量限制造成的瓶颈问题，当数据量较大时，哈希树容量限制造成的瓶颈问题尤其严重；并且，剔除与哈希节点匹配的节点，即剔除数据比对中不需要的记录，仅保留需要的记录，可以进一步节省存储空间；因此，本发明实施例可以高效地处理大数据量的比对，有效解决比对方案中Hash表存储容量的瓶颈问题。Compared with the prior art, the embodiment of the present invention reads two pieces of data in parallel, and when it is determined that there is a node matching the hash node in the hash tree, the node is eliminated. Two copies of data are read in parallel, which can avoid the bottleneck problem caused by the capacity limitation of the hash tree that may be caused by accommodating a complete copy of the data in the prior art. When the amount of data is large, the capacity of the hash tree The bottleneck problem caused by the restriction is particularly serious; and, removing the nodes matching the hash nodes, that is, removing the unnecessary records in the data comparison, and retaining only the required records, can further save the storage space; therefore, the embodiment of the present invention can efficiently It can handle the comparison of large data volume effectively, and effectively solve the bottleneck problem of Hash table storage capacity in the comparison scheme.

本发明的第二实施方式涉及一种数据比对方法。第二实施方式与第一实施方式大致相同，主要区别之处在于：在本发明第二实施方式中，还可以识别出关键字相同但记录内容不同的节点。The second embodiment of the present invention relates to a data comparison method. The second embodiment is substantially the same as the first embodiment, and the main difference is that in the second embodiment of the present invention, nodes with the same keyword but different recorded contents can also be identified.

如图4所示是根据本发明第二实施例的数据比对方法的流程图，包括如下步骤：4 is a flowchart of a data comparison method according to a second embodiment of the present invention, including the following steps:

步骤201，并行读取两份数据；其中，每份数据包含多个记录且每个记录形成一个哈希节点；Step 201, read two pieces of data in parallel; wherein, each piece of data includes multiple records and each record forms a hash node;

步骤202，对哈希节点的关键字进行哈希运算并得到随机数，并进入位置搜索步骤；Step 202, perform hash operation on the keyword of the hash node and obtain a random number, and enter the location search step;

步骤203，根据随机数与哈希树中当前搜索层对应的质数，确定哈希节点对应的节点位置；Step 203, according to the random number and the prime number corresponding to the current search layer in the hash tree, determine the node position corresponding to the hash node;

步骤204，判断节点位置是否被占据；若未被占据，进入步骤105；若已被占据，进入步骤106；Step 204, determine whether the node position is occupied; if it is not occupied, go to step 105; if it is occupied, go to step 106;

步骤205，将哈希节点插入节点位置Step 205, insert the hash node into the node position

步骤206，确定节点位置的节点与哈希节点是否匹配；若是，进入步骤207；若否，则进入步骤210后再返回步骤203；Step 206, determine whether the node at the node position matches the hash node; if so, go to Step 207; if not, go to Step 210 and then return to Step 203;

其中，当确定节点位置的节点与哈希节点匹配时，即确定哈希树中存在与哈希节点匹配的节点。Wherein, when the node whose position is determined matches the hash node, it is determined that there is a node matching the hash node in the hash tree.

步骤207，判断节点与哈希节点的记录内容是否相同；若否，则进入步骤208；若是，则进入步骤209；Step 207, determine whether the record content of the node and the hash node are the same; if not, go to step 208; if so, go to step 209;

步骤208，输出并保存节点与哈希节点；并进入步骤209。Step 208 , output and save the node and the hash node; and go to step 209 .

步骤209，剔除节点。Step 209, remove the node.

步骤210，根据预设规则更新当前搜索层。Step 210, update the current search layer according to preset rules.

其中，步骤201～206、209～210与第一实施例中的步骤101～108大致相同，此处不再赘述；不同之处在于，还包括步骤207、208。Among them, steps 201 to 206 and 209 to 210 are substantially the same as steps 101 to 108 in the first embodiment, and will not be repeated here; the difference is that steps 207 and 208 are also included.

当步骤206中判断出节点位置的节点与hash节点匹配时，还要对节点与hash节点的记录内容是否相同进行判断；如果记录内容相同，表示该节点与hash节点完全一致，则直接进入步骤209；如果记录内容不相同，表示该节点与该hash节点并不完全一致，则进入步骤209，输出并保存该节点与该hash节点，可以将该节点与该hash节点保存在一张预设的差异数据表里，该差异数据表专门用于保存flag不同、key相同且word不同的记录。本实施例中，key相同的记录中的word存在不同，可能是由于网络传输中的各种不确定因素引起的，如数据从子公司同步给局数据系统的过程中，记录内容中的某些部分发生了丢失或改变。When it is determined in step 206 that the node at the node position matches the hash node, it is also necessary to judge whether the record content of the node and the hash node are the same; if the record content is the same, it means that the node is completely consistent with the hash node, then directly enter step 209 ; If the record content is not the same, it means that the node and the hash node are not completely consistent, then go to step 209, output and save the node and the hash node, the node and the hash node can be saved in a preset difference In the data table, the difference data table is specially used to save records with different flags, the same key and different words. In this embodiment, the words in the records with the same key are different, which may be caused by various uncertain factors in network transmission. Parts are lost or altered.

本实施例相对于第一实施例而言，还可以识别出数据源不同、关键字相同且记录内容不同的记录，即可以识别出更多种类的差异数据。Compared with the first embodiment, this embodiment can also identify records with different data sources, the same keywords, and different record contents, that is, more types of difference data can be identified.

上面各种方法的步骤划分，只是为了描述清楚，实现时可以合并为一个步骤或者对某些步骤进行拆分，分解为多个步骤，只要包括相同的逻辑关系，都在本专利的保护范围内；对算法中或者流程中添加无关紧要的修改或者引入无关紧要的设计，但不改变其算法和流程的核心设计都在该专利的保护范围内。The steps of the above various methods are divided only for the purpose of describing clearly. During implementation, they can be combined into one step or some steps can be split and decomposed into multiple steps. As long as the same logical relationship is included, they are all within the protection scope of this patent. ;Adding insignificant modifications to the algorithm or process or introducing insignificant designs, but not changing the core design of the algorithm and process are all within the scope of protection of this patent.

本发明第三实施方式涉及一种服务器，如图5所示，包括：The third embodiment of the present invention relates to a server, as shown in FIG. 5 , including:

至少一个处理器501；以及，at least one processor 501; and,

与所述至少一个处理器501通信连接的存储器502；其中，a memory 502 in communication with the at least one processor 501; wherein,

所述存储器502存储有可被所述至少一个处理器501执行的指令，所述指令被所述至少一个处理器501执行，以使所述至少一个处理器501能够执行上述方法实施例。The memory 502 stores instructions executable by the at least one processor 501, the instructions being executed by the at least one processor 501 to enable the at least one processor 501 to perform the above-described method embodiments.

该服务器还可以包括：输入装置和输出装置。The server may also include: an input device and an output device.

其中，存储器502和处理器501采用总线方式连接，总线可以包括任意数量的互联的总线和桥，总线将一个或多个处理器501和存储器502的各种电路连接在一起。总线还可以将诸如外围设备、稳压器和功率管理电路等之类的各种其他电路连接在一起，这些都是本领域所公知的，因此，本文不再对其进行进一步描述。总线接口在总线和收发机之间提供接口。收发机可以是一个元件，也可以是多个元件，比如多个接收器和发送器，提供用于在传输介质上与各种其他装置通信的单元。经处理器501处理的数据通过天线在无线介质上进行传输，进一步，天线还接收数据并将数据传送给处理器501。The memory 502 and the processor 501 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 501 and various circuits of the memory 502 together. The bus may also connect together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. The bus interface provides the interface between the bus and the transceiver. A transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium. The data processed by the processor 501 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 501 .

处理器501负责管理总线和通常的处理，还可以提供各种功能，包括定时，外围接口，电压调节、电源管理以及其他控制功能。而存储器502可以被用于存储处理器501在执行操作时所使用的数据。Processor 501 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interface, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 when performing operations.

上述产品可执行本申请实施例所提供的方法，具备执行方法相应的功能模块和有益效果。未在本实施例中详尽描述的技术细节，可参见本申请实施例所提供的方法实施例。The above product can execute the method provided by the embodiments of the present application, and has functional modules and beneficial effects corresponding to the execution method. For technical details not described in detail in this embodiment, reference may be made to the method embodiments provided in the embodiments of this application.

本发明第四实施方式涉及一种计算机可读存储介质，存储有计算机程序。计算机程序被处理器执行时实现上述方法实施例。A fourth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The above method embodiments are implemented when the computer program is executed by the processor.

即，本领域技术人员可以理解，实现上述实施例方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成，该程序存储在一个存储介质中，包括若干指令用以使得一个设备(可以是单片机，芯片等)或处理器(processor)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-OnlyMemory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。That is, those skilled in the art can understand that all or part of the steps in the method for implementing the above embodiments can be completed by instructing the relevant hardware through a program, and the program is stored in a storage medium and includes several instructions to make a device ( It may be a single chip microcomputer, a chip, etc.) or a processor (processor) to execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes: U disk, removable hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes.

本领域的普通技术人员可以理解，上述各实施方式是实现本发明的具体实施例，而在实际应用中，可以在形式上和细节上对其作各种改变，而不偏离本发明的精神和范围。Those skilled in the art can understand that the above-mentioned embodiments are specific examples for realizing the present invention, and in practical applications, various changes in form and details can be made without departing from the spirit and the spirit of the present invention. scope.

Claims

1. A method of data alignment, comprising:

reading two data in parallel; wherein each of the data comprises a plurality of records and each of the records forms a hash node;

attempting to insert each of the hash nodes into a hash tree; in the process of trying to insert, if the node matched with the hash node exists in the hash tree, the node is removed;

wherein the node matched with the hash node comprises: the nodes are the same as the keywords of the hash nodes and have different data sources;

after determining that the node matching the hash node exists in the hash tree, the method further includes:

judging whether the recorded contents of the node and the hash node are the same; if the recorded contents are different, outputting and storing the node and the hash node, and entering a step of eliminating the node; if the recorded contents are the same, directly entering the step of removing the nodes;

the eliminating the node comprises: and if the node is a child node, selecting a leaf node of the node to replace the node and deleting the selected leaf node, and if the node is a leaf node, deleting the node.

2. The data comparison method of claim 1, wherein the hash tree is a prime number based hash tree.

3. The method of claim 2, wherein the trial insertion process comprises:

carrying out Hash operation on the keywords of the Hash nodes to obtain random numbers, and entering a position searching step;

the position searching step is that the node position corresponding to the hash node is determined according to the random number and the prime number corresponding to the current searching layer in the hash tree;

inserting the hash node into the node location when the node location is unoccupied;

when the node position is occupied, if the node of the node position is matched with the hash node, determining that the node matched with the hash node exists in the hash tree; and if the node at the node position is not matched with the Hash node, updating the current search layer according to a preset rule, and entering the position search step.

4. The method of claim 2, wherein the prime number sequences used to construct the hash tree are 3, 5, 7, 11, 13, 17, 19, 23, 29.

5. A server, comprising:

at least one processor; and the number of the first and second groups,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the data alignment method of any one of claims 1 to 4.

6. A computer-readable storage medium storing a computer program, wherein the computer program is configured to implement the data matching method according to any one of claims 1 to 4 when executed by a processor.