[go: up one dir, main page]

CN101442731A - Method and apparatus for removing call ticket repeat - Google Patents

Method and apparatus for removing call ticket repeat Download PDF

Info

Publication number
CN101442731A
CN101442731A CNA2008101832739A CN200810183273A CN101442731A CN 101442731 A CN101442731 A CN 101442731A CN A2008101832739 A CNA2008101832739 A CN A2008101832739A CN 200810183273 A CN200810183273 A CN 200810183273A CN 101442731 A CN101442731 A CN 101442731A
Authority
CN
China
Prior art keywords
bill
string
binary tree
balanced binary
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008101832739A
Other languages
Chinese (zh)
Other versions
CN101442731B (en
Inventor
刘喜
姜晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Anhui Co Ltd
Original Assignee
China Mobile Group Anhui Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Anhui Co Ltd filed Critical China Mobile Group Anhui Co Ltd
Priority to CN2008101832739A priority Critical patent/CN101442731B/en
Publication of CN101442731A publication Critical patent/CN101442731A/en
Application granted granted Critical
Publication of CN101442731B publication Critical patent/CN101442731B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种话单剔重方法和装置,主要应用于通信计费领域,包括以下步骤:从话单中提取关键域信息;使用MD5算法对关键域信息进行计算,生成MD5特征串;将该MD5特征串与索引文件中存储的、正常话单对应的MD5特征串进行比较,若发现相同的MD5特征串,则该话单为重单,进行剔除,否则将该话单对应的MD5特征串保存至索引文件中,并确认该话单为正常话单。本发明中,索引文件中MD5特征串以哈希表的方式存储,该哈希表的冲突通过链表或平衡二叉树来解决。本发明综合哈希表、平衡二叉树和MD5算法之长,实现了对任意业务、任意类型话单的有效处理,通过系统处理之后形成唯一的剔重索引,且长度唯一,提高了剔重运行效率和可扩展性,极大的节约了存储空间。

Figure 200810183273

The present invention discloses a method and device for deduplication of bills, which are mainly used in the field of communication billing, comprising the following steps: extracting key domain information from the bill; using MD5 algorithm to calculate the key domain information, and generating MD5 characteristic strings; Compare the MD5 feature string with the MD5 feature string stored in the index file and corresponding to the normal bill, if the same MD5 feature string is found, then the bill is a duplicate and should be eliminated, otherwise the MD5 corresponding to the bill Save the feature string to the index file, and confirm that the bill is a normal bill. In the present invention, the MD5 characteristic string in the index file is stored in the form of a hash table, and the conflict of the hash table is resolved through a linked list or a balanced binary tree. The present invention integrates the strengths of the hash table, balanced binary tree and MD5 algorithm, realizes the effective processing of any business and any type of bills, and forms a unique de-duplication index after the system processing, and has a unique length, which improves the de-duplication operation efficiency and scalability, greatly saving storage space.

Figure 200810183273

Description

一种话单剔重方法和装置 A method and device for deduplicating phone calls

技术领域 technical field

本发明涉及通信计费领域,特别涉及一种广泛应用于通信领域中计费、结算系统的话单剔重方法和装置。The invention relates to the field of communication charging, in particular to a method and device for deduplication of bills widely used in charging and settlement systems in the communication field.

背景技术 Background technique

计费系统在对话单进行批价前,必须先对从各网元或业务平台采集到的每条话单进行剔重处理,因此对于每天需要处理海量话单的计费系统来说,剔重系统的性能将直接影响到话单处理的及时性。The billing system must deduplicate each call bill collected from each network element or service platform before appraising the dialogue bill. Therefore, for a billing system that needs to process a large number of bills every day The performance of the system will directly affect the timeliness of bill processing.

一般话单剔重过程均是在内存中进行的,基本原理如下:首先从话单中提取相应的关键域信息组合(例如手机号码、通话时间、对方号码、SP代码等,不同类的话单关键域组合不同),再到内存中查找是否已存在该组合字段的信息,如果已有该话单关键域信息组合,则判断为重单,否则将该话单关键域信息保存在内存中。而内存中只能保存一定量的关键域信息,对于过期的数据,将输出到磁盘,以文件形式存放。当处理延迟话单时,必须先将其重新加载到内存中。因此,如何管理、存放这些话单关键域信息,直接关系到系统需要的存储大小以及剔重的效率。Generally, the call list deduplication process is carried out in the memory, and the basic principle is as follows: First, extract the corresponding combination of key field information (such as mobile phone number, call time, party number, SP code, etc., different types of call list keys) from the call list. Field combination is different), and then search whether the information of the combination field exists in the memory, if there is a combination of the key field information of the bill, it is judged as a duplicate order, otherwise the key field information of the bill is stored in the memory. However, only a certain amount of key domain information can be stored in the memory, and the expired data will be output to the disk and stored in the form of a file. When processing delayed bills, they must first be reloaded into memory. Therefore, how to manage and store the key field information of these bills is directly related to the storage size required by the system and the efficiency of deduplication.

现有的查重技术一般都是提取话单中的关键域信息组合,直接存放到内存的页面中,并使用哈希链表或者平衡二叉树方法检索话单关键域组合信息存放页面,判断是否为重复信息。Existing duplication check technology generally extracts the combination of key field information in the bill, directly stores it in the page of the memory, and uses hash linked list or balanced binary tree method to retrieve the page where the key field combination information of the bill is stored, and judges whether it is a duplicate information.

在一篇申请号为03145603.0的中国专利文件中公开了一种基于内存方式的话单剔重方法,包括:提取话单中的关键域信息,简单组合之后存放到内存的页面中,使用HASH链表方法或者平衡二叉树方法检索话单关键域组合信息存放页面,并进行判断是否为重复信息。A Chinese patent document with the application number 03145603.0 discloses a memory-based call list deduplication method, including: extracting the key field information in the call list, storing them in the page of the memory after simple combination, and using the HASH linked list method Alternatively, the balanced binary tree method can retrieve the combined information storage page of the bill key field, and judge whether it is repeated information.

在一篇申请号为200610036536.4的中国专利文件中公开了一种消除文件存储系统中冗余文件的系统及方法,该消除文件存储系统中冗余文件的方法包括:通过扫描存储服务器模块获取文件的相关信息,包括文件的大小、文件引用数以及文件的ID,并计算文件内容的MD5值,并将文件的MD5值进行哈希运算后,通过内存哈希映射表找到相应的哈希表。如果该文件引用数超过阈值,则根据文件的MD5值以及文件的大小通过哈希表找出存储系统中冗余文件并进行删除。A Chinese patent document with the application number 200610036536.4 discloses a system and method for eliminating redundant files in the file storage system. The method for eliminating redundant files in the file storage system includes: obtaining files by scanning the storage server module Relevant information, including the size of the file, the number of file references, and the ID of the file, and calculate the MD5 value of the file content, and after the MD5 value of the file is hashed, the corresponding hash table is found through the memory hash map. If the number of file references exceeds the threshold, find redundant files in the storage system through the hash table based on the MD5 value of the file and the size of the file and delete them.

现有的剔重技术存在的缺点可以概括为以下两点:The shortcomings of the existing de-weighting technology can be summarized as the following two points:

1、各类话单的关键域信息不一样,字段长度及相应的组合之后的长度均不一样,不易管理,可扩展性差,且一般关键域信息组合字段都比较长,占用存储大,同时也会在一定程序上影响剔重比较的效率。1. The key field information of various bills is different, the length of the field and the length of the corresponding combination are different, it is not easy to manage, and the scalability is poor. In addition, the combination fields of the key field information are usually relatively long, occupying a large amount of storage, and at the same time It will affect the efficiency of deduplication comparison to a certain extent.

2、简单的哈希链表或者平衡二叉树方法在处理海量数据(每天超过亿次话单,峰值每小时可达上千万条)时就显得效率依然不够,不能满足日益增长的通信业务需求。2. The simple hash list or balanced binary tree method is still not efficient enough when processing massive amounts of data (more than 100 million call records per day, and the peak value can reach tens of millions per hour), and cannot meet the growing needs of communication services.

发明内容 Contents of the invention

本发明的目的是提供一种话单剔重方法和装置,以解决现有的各类话单的关键域信息不一样、不易管理、可扩展性差,处理海量数据效率不够的问题。The purpose of the present invention is to provide a method and device for deduplication of bills to solve the problems of different key field information of various existing bills, difficult management, poor scalability, and insufficient efficiency in processing massive data.

为了实现以上目的,本发明提供了一种话单剔重方法,包括以下步骤:In order to achieve the above object, the present invention provides a method for eliminating duplicate calls, comprising the following steps:

步骤a:从话单中提取关键域信息;Step a: extract key field information from the bill;

步骤b:使用MD5算法对该关键域信息进行计算,生成该话单对应的MD5特征串;Step b: use the MD5 algorithm to calculate the key field information, and generate the MD5 feature string corresponding to the bill;

步骤c:将所述MD5特征串,与索引文件中存储的、正常话单对应的MD5特征串进行比较,如果发现相同的MD5特征串,则该话单为重单,对该话单进行剔除,否则将所述MD5特征串保存至索引文件中,并确认该MD5特征串对应的话单为正常话单。Step c: comparing the MD5 characteristic string with the MD5 characteristic string stored in the index file and corresponding to the normal bill, if the same MD5 characteristic string is found, the bill is a duplicate, and the bill is eliminated , otherwise, save the MD5 characteristic string into the index file, and confirm that the bill corresponding to the MD5 characteristic string is a normal bill.

上述技术方案中,所述正常话单对应的MD5特征串以哈希表的方式存储于索引文件中;所述步骤c具体包括:In the technique scheme, the MD5 feature string corresponding to the normal call list is stored in the index file in the form of a hash table; the step c specifically includes:

步骤c1:对话单对应的MD5特征串,根据设定的哈希函数进行哈希运算;Step c1: The MD5 characteristic string corresponding to the dialogue ticket is hashed according to the set hash function;

步骤c2:根据哈希运算得到的函数值,找到所述哈希表中的存储节点:Step c2: According to the function value obtained by the hash operation, find the storage node in the hash table:

步骤c3:如果在该存储节点上找到与该话单对应的MD5特征串相同的MD5特征串,则该话单为重单,剔除该话单;否则,将该话单对应的MD5特征串插入到该存储节点中,并确认该话单为正常话单。Step c3: If the MD5 characteristic string identical to the MD5 characteristic string corresponding to the bill is found on the storage node, then the bill is a duplicate bill, and the bill is removed; otherwise, the MD5 characteristic string corresponding to the bill is inserted Go to the storage node, and confirm that the bill is a normal bill.

上述技术方案中,所述哈希表中的存储节点中的MD5特征串以链表的方式或者平衡二叉树的方式存储。In the above technical solution, the MD5 characteristic strings in the storage nodes in the hash table are stored in a linked list or a balanced binary tree.

上述技术方案中,当所述哈希表中的存储节点中的MD5特征串以平衡二叉树的方式存储时,将所述话单对应的MD5特征串插入到该话单存储节点中的步骤包括:将该MD5特征串插入到该存储节点上的平衡二叉树上,若插入的MD5特征串使得所述平衡二叉树失去平衡时,则通过旋转进行调整。In the above technical solution, when the MD5 characteristic string in the storage node in the hash table is stored in the form of a balanced binary tree, the step of inserting the corresponding MD5 characteristic string of the bill into the bill storage node includes: The MD5 characteristic string is inserted into the balanced binary tree on the storage node, and if the inserted MD5 characteristic string makes the balanced binary tree unbalanced, adjustment is performed by rotation.

优选地,当所述话单为正常话单时,则输出该话单,并输出更新后的索引增量文件。Preferably, when the bill is a normal bill, the bill is output, and an updated index increment file is output.

优选地,所述索引文件及索引增量文件,存储于内存或磁盘中;若内存占用量超过制定数值,则合并索引增量文件,将索引文件中时间较早的部分存储到磁盘中,并自动释放内存;若需要将磁盘中的索引文件重新加载到内存时,则重新加载。Preferably, the index file and the index increment file are stored in the memory or the disk; if the memory usage exceeds the specified value, the index increment file is merged, and the earlier part of the index file is stored in the disk, and Automatically release the memory; if the index file in the disk needs to be reloaded into the memory, reload it.

优选地,所述关键域信息,包括但不限于:由主叫号码、被叫号码、通话时间、SP代码组成的组合字段中的任意一种或集中的组合。Preferably, the key field information includes, but is not limited to: any combination of calling number, called number, call time, and SP code or a combination thereof.

为了实现以上目的,本发明还提供了一种话单剔重装置,包括:提取模块,用于从话单中提取关键域信息;特征串生成模块,用于使用MD5算法对关键域信息进行计算,生成MD5特征串;比较剔重模块,用于将所述MD5特征串与索引文件中存储的、正常话单对应的MD5特征串进行比较;如果发现相同的MD5特征串,则该话单为重单,对该话单进行剔除;否则将该话单对应的MD5特征串保存至索引文件中,并确认该话单为正常话单。In order to achieve the above object, the present invention also provides a bill deduplication device, including: an extraction module, used to extract key domain information from the bill; a feature string generation module, used to use the MD5 algorithm to calculate the key domain information , generate the MD5 feature string; compare and remove duplicate modules, for comparing the MD5 feature string stored in the index file with the MD5 feature string corresponding to the normal call list; if the same MD5 feature string is found, then the call list is If the bill is repeated, delete the bill; otherwise, save the MD5 characteristic string corresponding to the bill to the index file, and confirm that the bill is a normal bill.

优选地,所述比较剔重模块包括:计算定位单元,用于根据设定的哈希函数对所述MD5特征串进行哈希运算,并根据该运算所得的哈希函数值,找到所述哈希表中的存储节点;查找单元,用于在所述存储节点上查找与所述MD5特征串相同的特征串;剔重单元,用于根据所述查找的结果确定所述话单是否为重单,如果在所述哈希表的存储节点上找到与所述特征串相同的MD5特征串,所述话单为重单,进行剔除,否则,所述话单为正常话单。Preferably, the comparison and deduplication module includes: a calculation and positioning unit, configured to perform a hash operation on the MD5 characteristic string according to a set hash function, and find the hash function value according to the hash function value obtained by the operation. A storage node in the table; a search unit, which is used to search for the same feature string as the MD5 feature string on the storage node; a duplicate unit, which is used to determine whether the bill is heavy according to the result of the search If the MD5 signature string identical to the signature string is found on the storage node of the hash table, the bill is a duplicate bill, and is removed; otherwise, the bill is a normal bill.

优先地,所述哈希表的存储节点为一个链表或一颗平衡二叉树。Preferably, the storage node of the hash table is a linked list or a balanced binary tree.

优先地,当所述哈希表中的存储节点为一颗平衡二叉树时,所述比较剔重模块还包括:插入单元,用于将所述正常话单对应的MD5特征串插入到所述平衡二叉树上,同时输出该正常话单及其索引增量文件;旋转单元,用于如果所述MD5特征串插入所述平衡二叉树时,使所述平衡二叉树失去平衡,则通过该旋转单元旋转平衡二叉树进行调整。Preferably, when the storage node in the hash table is a balanced binary tree, the comparison and deduplication module further includes: an insertion unit for inserting the MD5 feature string corresponding to the normal bill into the balanced On the binary tree, output the normal call list and its index incremental file at the same time; the rotation unit is used to make the balanced binary tree unbalanced if the MD5 feature string is inserted into the balanced binary tree, then rotate the balanced binary tree by the rotation unit Make adjustments.

优先地,该话单剔重装置还包括加载模块,用于根据需要将磁盘中的索引文件重新加载到内存。Preferably, the bill deduplication device further includes a loading module, which is used to reload the index file in the disk to the memory as required.

本发明提供的话单剔重方法和装置,综合哈希表、平衡二叉树和MD5算法之长,实现了对任意业务、任意类型话单的有效处理,通过系统处理之后形成唯一的剔重索引,且长度唯一,提高了剔重运行效率和可扩展性,极大的节约了存储空间。The method and device for deduplicating bills provided by the present invention combine the strengths of hash tables, balanced binary trees and MD5 algorithms to realize effective processing of any business and any type of bills, and form a unique deduplication index after processing by the system, and The length is unique, which improves the efficiency and scalability of deduplication operation, and greatly saves storage space.

附图说明 Description of drawings

图1是本发明一种话单剔重方法实施例一的流程图;Fig. 1 is the flow chart of embodiment 1 of a kind of bill deduplication method of the present invention;

图2是采用平衡二叉树作为哈希表解决冲突的方案示意图;Fig. 2 is a schematic diagram of a scheme for resolving conflicts using a balanced binary tree as a hash table;

图3是不平衡二叉树示意图;Fig. 3 is a schematic diagram of an unbalanced binary tree;

图4是平衡二叉树示意图;Fig. 4 is a schematic diagram of a balanced binary tree;

图5是本发明一种话单剔重方法实施例二的流程图;Fig. 5 is a flow chart of Embodiment 2 of a method for deduplicating bills of the present invention;

图6是平衡二叉树旋转示意图;Fig. 6 is a schematic diagram of balanced binary tree rotation;

图7是在平衡二叉树上查找MD5特征串的示意图;Fig. 7 is a schematic diagram of searching for MD5 characteristic strings on a balanced binary tree;

图8是本发明一种话单剔重装置实施例一的结构图;Fig. 8 is a structural diagram of Embodiment 1 of a bill deduplication device according to the present invention;

图9是本发明一种话单剔重装置实施例二的结构图。Fig. 9 is a structural diagram of Embodiment 2 of a bill deduplication device according to the present invention.

具体实施方式 Detailed ways

下面结合附图,对本发明的具体实施方式进行详细描述。The specific implementation manners of the present invention will be described in detail below in conjunction with the accompanying drawings.

方法实施例method embodiment

如图1所示公开了一种话单剔重方法实施例一的流程图,下面将本实施例中提到的技术术语介绍如下:As shown in Figure 1, a flow chart of Embodiment 1 of a method for removing duplicates from a bill is disclosed, and the technical terms mentioned in this embodiment are introduced as follows:

MD5(Message-Digest Algorithm 5,信息-摘要算法5)是一种用于产生数字签名的单项散列算法,它可以将一个任意长度的“字节串”通过一个不可逆的字符串变换算法变换成一个16字节的串。利用MD5算法可以规范任意业务、任意类型的话单关键域信息组合,将其压缩为一个唯一的16字节的串。这样,任意话单的关键域信息均可以统一管理,统一存储。而且,一般话单的关键域信息均远多于16字节,因此将其变换成16字节串进行管理和存储,查找和比较时效率将明显提高,同时也能大大节约存储成本。另外,通过MD5算法产生的16字节串是均匀分布的,这非常有利于构造一个简单而最优的哈希函数进行定位,从而有效的提高了处理效率。MD5 (Message-Digest Algorithm 5, Information-Digest Algorithm 5) is a single-item hash algorithm for generating digital signatures, which can convert a "byte string" of any length into A 16-byte string. The MD5 algorithm can be used to standardize the combination of any business and any type of bill key field information, and compress it into a unique 16-byte string. In this way, the key field information of any bill can be managed and stored in a unified manner. Moreover, the key field information of general bills is much more than 16 bytes, so converting it into a 16-byte string for management and storage will significantly improve the efficiency of search and comparison, and can also greatly save storage costs. In addition, the 16-byte strings generated by the MD5 algorithm are evenly distributed, which is very conducive to constructing a simple and optimal hash function for positioning, thereby effectively improving the processing efficiency.

哈希函数是一个多对一的映射,假设为y=F(x),可能会存在多个x值对应一个y值,这称之为哈希表的冲突。因此在使用哈希表时,必须同时设计一个解决此冲突的方法,即在内存中如何存储同一y值对应的多个x值,本发明不仅可使用链表来存储这些数据,而且使用了平衡二叉树来解决哈希表冲突的问题。对于图2的数据,哈希函数将1、9、17、25四个数值对应到函数值1,11对应到函数值3等,对于函数值为1的地址来说,1、9、17、25这四个数据再以平衡二叉树这种数据结构来存储。The hash function is a many-to-one mapping, assuming that y=F(x), there may be multiple x values corresponding to one y value, which is called a collision of the hash table. Therefore, when using a hash table, a method for resolving this conflict must be designed at the same time, that is, how to store multiple x values corresponding to the same y value in the memory. The present invention can not only store these data using a linked list, but also use a balanced binary tree To solve the problem of hash table collision. For the data in Figure 2, the hash function maps the four values 1, 9, 17, and 25 to the function value 1, and 11 corresponds to the function value 3, etc. For the address with the function value 1, 1, 9, 17, 25 These four data are then stored in a data structure such as a balanced binary tree.

哈希表是指根据一个设定哈希函数,使得每个元素的关键字都与一个哈希函数值相对应,并将其作为该关键字在哈希表中的存储位置,即哈希地址,从而消除了通过遍历比较带来的时间浪费,其关键在于设计一个最优的哈希函数和处理冲突的方法。平衡二叉树是一棵空树,或者是一棵具有下列性质的二叉树:它的左子树和右子树都是平衡二叉树,且左子树和右子树的深度之差的绝对值不超过1,如图3所示为一棵不平衡二叉树、图4所示为一棵平衡二叉树,时间复杂度在平均和最坏情况下都是0(log n),查找效率非常高。在插入或删除节点后,可能会失去平衡,此时需要通过一次或多次旋转来进行调整,使其保持平衡。Hash table means that according to a set hash function, the keyword of each element corresponds to a hash function value, and it is used as the storage location of the keyword in the hash table, that is, the hash address , thus eliminating the waste of time caused by traversal comparison, the key lies in designing an optimal hash function and a method of dealing with conflicts. A balanced binary tree is an empty tree, or a binary tree with the following properties: its left and right subtrees are both balanced binary trees, and the absolute value of the difference between the depths of the left and right subtrees does not exceed 1 , as shown in Figure 3 is an unbalanced binary tree, and Figure 4 is a balanced binary tree, the time complexity is 0(log n) on average and in the worst case, and the search efficiency is very high. After inserting or removing nodes, it may become out of balance and need to be adjusted by one or more rotations to keep it balanced.

本发明一种话单剔重方法实施例一包括以下步骤:Embodiment 1 of a method for removing duplicates from a bill of the present invention comprises the following steps:

步骤S102,从话单中提取关键域信息,例如主叫号码、被叫号码、通话时间、SP代码等字段组合的任意一个或几个的组合;Step S102, extracting key field information from the call list, such as any one or a combination of field combinations such as calling number, called number, call time, SP code;

步骤S104,使用MD5算法对关键域信息进行计算,生成MD5特征串;Step S104, using the MD5 algorithm to calculate the key domain information to generate an MD5 feature string;

步骤S106,根据设定的哈希函数对在步骤S104中生成的MD5特征串进行哈希运算,并根据计算所得哈希函数值将所述MD5特征串定位到作为所述哈希表存储节点的平衡二叉树上;Step S106, perform a hash operation on the MD5 characteristic string generated in step S104 according to the set hash function, and locate the MD5 characteristic string to the storage node of the hash table according to the calculated hash function value on a balanced binary tree;

步骤S108,在已完成定位的平衡二叉树上查找在步骤S104中生成的MD5特征串,如果在平衡二叉树上找到与所述MD5特征串相同的MD5特征串,跳转到步骤S110,否则跳转到步骤S112;Step S108, search for the MD5 signature string generated in step S104 on the balanced binary tree that has been positioned, if the MD5 signature string identical to the MD5 signature string is found on the balanced binary tree, jump to step S110, otherwise jump to Step S112;

步骤S110,如果在平衡二叉树上找到与所述MD5特征串相同的MD5特征串,则确定该话单为重单,将重单剔除;Step S110, if the MD5 signature string identical to the MD5 signature string is found on the balanced binary tree, then it is determined that the bill is a duplicate bill, and the duplicate bill is eliminated;

步骤S112,如果在平衡二叉树上没有找到与所述MD5特征串相同的MD5特征串,则确定话单不是重单,步骤结束。Step S112, if no MD5 signature string identical to the MD5 signature string is found on the balanced binary tree, then it is determined that the bill is not a duplicate bill, and the step ends.

本发明一种话单剔重方法的实施例一综合哈希表、平衡二叉树和MD5算法之长,实现了对任意业务、任意类型话单的有效处理,通过剔重系统处理之后形成唯一的剔重索引,且长度唯一,提高了剔重运行效率和可扩展性,极大的节约了存储空间。The embodiment of a method for deduplication of bills of the present invention, the advantages of comprehensive hash table, balanced binary tree and MD5 algorithm, realizes the effective processing of any business and any type of bills, and forms a unique deduplication after being processed by the deduplication system. Re-indexing, with a unique length, improves the efficiency and scalability of de-duplication operations, and greatly saves storage space.

如图5所示公开了一种话单剔重方法实施例二的流程图,在实施例一基础上对方案进行了细化,包括以下步骤:As shown in Figure 5, a flow chart of Embodiment 2 of a method for eliminating duplicate calls is disclosed. On the basis of Embodiment 1, the scheme is refined, including the following steps:

步骤S202,将话单读入系统,话单剔重的步骤开始;Step S202, reading the call list into the system, and the step of removing duplicate calls starts;

步骤S204,从话单中提取关键域信息,例如主叫号码、被叫号码、通话时间、SP代码等字段组合的任意一种或几种的组合,正常话单所对应的MD5特征串以哈希表的方式存储于索引文件中,该索引文件存储于内存或磁盘中,该哈希表的冲突则通过平衡二叉树来解决;Step S204, extracting key field information from the call list, such as any one or combination of field combinations such as calling number, called number, call time, SP code, etc., the MD5 characteristic string corresponding to the normal call list is represented by hash The hash table is stored in the index file, and the index file is stored in memory or disk, and the conflict of the hash table is resolved by balancing the binary tree;

步骤S206,判断内存中是否有存储正常话单所对应的索引文件,如果有,跳转到步骤S210,否则,跳转到步骤S208;Step S206, judging whether there is an index file corresponding to the normal call list stored in the internal memory, if yes, jump to step S210, otherwise, jump to step S208;

步骤S208,如果内存中没有存储已处理话单的索引文件,将磁盘中的索引文件重新加载到内存,继续执行步骤S210;Step S208, if the index file of the processed bill is not stored in the memory, reload the index file in the disk to the memory, and continue to execute step S210;

步骤S210,否则,使用MD5算法对关键域信息进行计算,生成MD5特征串,MD5特征串为一个唯一的16字节串,每字节为8位二进制值,如:00000000 00000001 00000010 00000011 00000100 00000101 0000011000000111 00001000 00001001 00001010 00001011 00001100 0000110100001110 00010000;Step S210, otherwise, use the MD5 algorithm to calculate the key field information to generate an MD5 feature string, the MD5 feature string is a unique 16-byte string, and each byte is an 8-bit binary value, such as: 00000000 00000001 00000010 00000011 00000100 00000101 0000011000000111 00001000 00001001 00001010 00001011 00001100 0000110100001110 00010000;

步骤S212,由于步骤S210中生成的MD5特征串是均匀分布的,因此哈希函数可以直接取MD5特征串的第一位或前两位的值来将其定位到对应的作为哈希表存储节点的平衡二叉树上,这样一次哈希函数运算就可以定位到原数据量的1/256或1/65536。在此,说明一次哈希函数运算定位到原数据量的1/256或1/65536的原因:MD5算法生成的是一个16字节的特征串,如0123456789ABCDEF,其中每一位都为一个8位字节,即由8个二进制值组成,如01010101,因每个二进制位均有两种可能的取值,所以每字节可能的取值为256,即28,因此,如果取该16字节串的第一位作为哈希函数值直接定址,则遍历查找仅需要在后15字节串进行,对于均匀分布的数据来说,仅为16字节串取值的1/256,如果取前两位作为哈希函数值直接定址,则所需查找的数据则为16字节串的1/65536。Step S212, since the MD5 characteristic string generated in step S210 is evenly distributed, the hash function can directly take the value of the first or first two digits of the MD5 characteristic string to locate it to the corresponding storage node as a hash table On the balanced binary tree, such a hash function operation can locate 1/256 or 1/65536 of the original data volume. Here, explain the reason why a hash function operation locates 1/256 or 1/65536 of the original data volume: the MD5 algorithm generates a 16-byte characteristic string, such as 0123456789ABCDEF, each of which is an 8-bit Byte, which is composed of 8 binary values, such as 01010101, because each binary bit has two possible values, so the possible value of each byte is 256, that is, 2 8 , therefore, if the 16 words are taken The first bit of the section string is directly addressed as the hash function value, and the traversal search only needs to be performed in the last 15 byte strings. For evenly distributed data, it is only 1/256 of the value of the 16-byte string. If The first two digits are directly addressed as the hash function value, and the data to be searched is 1/65536 of the 16-byte string.

步骤S214,在哈希函数定位的节点是一棵平衡二叉树,在此平衡二叉树上查找MD5特征串,平衡二叉树任意节点的左子树为空或小于该节点值,而右子树为空或大于该节点值;如果该平衡二叉树的根节点值为空,则跳转到步骤S216,如果该平衡二叉树的根节点值等于该MD5特征串,则该话单为重单,跳转到步骤S226,如果该平衡二叉树的根节点值不等于该MD5特征串,则跳转到步骤S224;Step S214, the node positioned by the hash function is a balanced binary tree, and the MD5 characteristic string is searched on this balanced binary tree, the left subtree of any node of the balanced binary tree is empty or less than the node value, and the right subtree is empty or greater than The node value; if the root node value of the balanced binary tree is empty, then jump to step S216, if the root node value of the balanced binary tree is equal to the MD5 characteristic string, then the bill is a heavy single, jump to step S226, If the root node value of the balanced binary tree is not equal to the MD5 characteristic string, then jump to step S224;

步骤S216,如果该平衡二叉树的根节点值为空,则该话单不是重单,将该MD5特征串插入到此平衡二叉树中,同时实施步骤S218;Step S216, if the root node value of the balanced binary tree is empty, then the bill is not a duplicate, the MD5 character string is inserted into the balanced binary tree, and step S218 is implemented simultaneously;

步骤S218,如果该话单不是重单,将该MD5特征串插入到此平衡二叉树上的同时,输出正确话单和索引增量文件,跳转到步骤S220;Step S218, if the bill is not a duplicate, when inserting the MD5 characteristic string into the balanced binary tree, output the correct bill and index incremental file, and jump to step S220;

步骤S220,判断插入该MD5特征串是否导致该平衡二叉树失去平衡,如果失去平衡,则跳转到步骤S222,否则跳转到步骤S228;Step S220, judging whether inserting the MD5 characteristic string causes the balanced binary tree to lose balance, if it loses balance, then jump to step S222, otherwise jump to step S228;

步骤S222,如果该平衡二叉树失去平衡,则通过旋转进行调整;平衡二叉树这种数据结构,任一节点值大于其左子树而小于其右子树的值,而其左、右子树的高度差不大于1,当在其中插入某数值时,可能导致其左右子树之差大于1,从而不再是平衡二叉树,因此必须进行旋转使之保持平衡,例如图3中节点6下面的最简单的平衡二叉树,如图6所示,根节点14的左子树高度为0,右子树高度为1,而当插入节点17时,根节点14的左子树高度为0,右子树高度为2,不再为平衡二叉树,需对14旋转为17的左子树,而17为根节点,旋转之后仍然是平衡的。对于不平衡的二叉树,其查找次数多于平衡二叉树,上述示例说明:对于旋转之前的不平衡的二叉树,查找到22的比较次数为3,而对于旋转后的平衡二叉树则为2,当左右子树高度差较大时,其差异是很明显的;Step S222, if the balanced binary tree is out of balance, it is adjusted by rotation; in the data structure of a balanced binary tree, the value of any node is greater than the value of its left subtree and less than the value of its right subtree, and the height of its left and right subtrees The difference is not greater than 1. When a certain value is inserted in it, the difference between its left and right subtrees may be greater than 1, so it is no longer a balanced binary tree, so it must be rotated to keep it balanced. For example, the simplest one under node 6 in Figure 3 As shown in Figure 6, the height of the left subtree of the root node 14 is 0, and the height of the right subtree is 1. When inserting node 17, the height of the left subtree of the root node 14 is 0, and the height of the right subtree is 0. It is 2, and it is no longer a balanced binary tree. It is necessary to rotate 14 to the left subtree of 17, and 17 is the root node, which is still balanced after rotation. For an unbalanced binary tree, the number of searches is more than that of a balanced binary tree. The above example shows that for an unbalanced binary tree before rotation, the number of comparisons to find 22 is 3, and for a balanced binary tree after rotation. It is 2, when the left and right child When the tree height difference is large, the difference is obvious;

步骤S224,如果MD5特征串小于平衡二叉树的根节点值,则继续查找根节点的左子树,如果大于平衡二叉树的根节点值,则继续查找根节点的右子树,直到找到相同的节点值或某节点无相应的左子树或右子树为止;如果找到相同的节点值,则该话单是重单,跳转到步骤S226,如果无相应的左子树或右子树,则该话单不是重单,跳转到步骤S218;例如,某条话单生成的MD5特征串,假设为1122000000000003,如图7所示,首先定位到哈希地址为11的平衡二叉树,将MD5特征串与其根节点的值相比较,例如根节点A0为1123000000000000,MD5特征串小于根节点,则再查找其左子树的根节点A01,如为1122000000000000,则继续查找A01的右子树的根节点A012,如果A012的值为空该话单不是重单,跳转到步骤S218,如果A012的值为1122000000000003,该话单为重单,跳转到步骤S226,否则继续查找其左子树或右子树;Step S224, if the MD5 characteristic string is less than the root node value of the balanced binary tree, then continue to search the left subtree of the root node, if it is greater than the root node value of the balanced binary tree, then continue to search the right subtree of the root node until the same node value is found Or a certain node has no corresponding left subtree or right subtree; if the same node value is found, the bill is a duplicate order, and jumps to step S226, if there is no corresponding left subtree or right subtree, then the The bill is not a duplicate, jump to step S218; for example, the MD5 characteristic string generated by a certain bill is assumed to be 1122000000000003, as shown in Figure 7, first locate the balanced binary tree whose hash address is 11, and convert the MD5 characteristic string Compare with the value of its root node, for example, the root node A0 is 1123000000000000, and the MD5 characteristic string is smaller than the root node, then search for the root node A01 of its left subtree, if it is 11220000000000000, then continue to search for the root node A012 of the right subtree of A01 , if the value of A012 is empty, the bill is not a duplicate bill, jump to step S218, if the value of A012 is 1122000000000003, the bill is a duplicate bill, jump to step S226, otherwise continue to search for its left subtree or right subtree Tree;

步骤S226,如果该话单是重单,剔除该话单;Step S226, if the bill is a duplicate, delete the bill;

步骤S228,在执行完剔重话单的过程以后,话单处理结束,当话单剔重过程的内存占有量超过指定数值时,系统自动释放内存,并合并索引增量文件,在继续处理下一个话单文件时,将根据需要重新加载索引文件。Step S228, after executing the process of deduplicating bills, the bill processing ends, and when the memory occupation of the bill deduplicating process exceeds the specified value, the system automatically releases the memory, and merges the index incremental files, and continues processing When a CDR file is created, the index file will be reloaded as needed.

本发明一种话单剔重方法的实施例二利用MD5算法对话单关键域信息进行运算,得到长度统一且唯一的MD5特征串,对正常话单其对应的MD5特征串采用哈希表的方式存储,哈希表的冲突则通过平衡二叉树来解决,将MD5特征串与平衡二叉树的节点值进行比较,由此来判断话单是否是重单,实现了对任意业务、任意类型话单的剔重处理,提高了剔重系统的剔重运行效率和可扩展性,极大的节约了存储空间,节约的存储空间可达50%以上,通过哈希表和平衡二叉树的组合来存储已处理的话单,具有效率高的特点。Embodiment 2 of a method for eliminating duplication of bills of the present invention utilizes the MD5 algorithm to perform calculations on dialogue single key domain information to obtain a uniform and unique MD5 feature string, and adopts a hash table for the corresponding MD5 feature strings of normal bills Storage and hash table conflicts are resolved through a balanced binary tree, and the MD5 characteristic string is compared with the node value of the balanced binary tree to judge whether the bill is a duplicate order, and realize the picking of any business and any type of bill Heavy processing improves the operation efficiency and scalability of the deduplication system, and greatly saves storage space. The saved storage space can reach more than 50%. The processed words are stored through the combination of hash table and balanced binary tree. Single, with the characteristics of high efficiency.

除上述实施例外,本发明一种话单剔重方法利用MD5算法对话单关键域信息进行运算,得到MD5特征串,该MD5特征串采用哈希表的方式存储,哈希表的冲突也可通过链表来解决,将MD5特征串与链表的节点值进行比较,由此来判断话单是否为重单,与上述实施例相似的方法也可实现话单的剔重。Except above-mentioned embodiment, a kind of method of removing duplicates of call list of the present invention utilizes MD5 algorithm dialogue single key field information to carry out operation, obtains MD5 characteristic string, and this MD5 characteristic string adopts the mode storage of hash table, and the conflict of hash table also can pass through Linked list is used to solve, the node value of MD5 character string and linked list is compared, thereby judges whether bill is duplicate, and the method similar to above-mentioned embodiment also can realize the deduplication of bill.

装置实施例Device embodiment

如图8所示,公开了本发明一种话单剔重装置实施例一的结构图,一种话单剔重装置实施例一包括提取模块1、特征串生成模块2、比较剔重模块3:提取模块1用于从话单中提取关键域信息;特征串生成模块2用于使用MD5算法对关键域信息进行计算,生成MD5特征串;比较剔重模块3:用于将所述MD5特征串与索引文件中存储的、正常话单对应的MD5特征串进行比较,如果发现相同的MD5特征串,则该话单为重单,对该话单进行剔除;否则将该话单对应的MD5特征串保存至索引文件中,并确认该话单为正常话单。As shown in FIG. 8 , a structural diagram of Embodiment 1 of a bill deduplication device of the present invention is disclosed. Embodiment 1 of a bill deduplication device includes an extraction module 1 , a feature string generation module 2 , and a comparison deduplication module 3 : the extraction module 1 is used to extract the key field information from the bill; the feature string generation module 2 is used to use the MD5 algorithm to calculate the key field information to generate the MD5 feature string; the comparison module 3 is used to extract the MD5 feature String is compared with the MD5 characteristic string corresponding to the normal bill stored in the index file, if the same MD5 characteristic string is found, the bill is a duplicate bill, and the bill is eliminated; otherwise, the MD5 character string corresponding to the bill is Save the feature string to the index file, and confirm that the bill is a normal bill.

比较剔重模块3包括:计算定位单元4,用于根据设定的哈希函数对所述MD5特征串进行哈希运算,并根据该运算所得的函数值,找到所述哈希表中的存储节点;查找单元5,用于在所述存储节点上查找与所述MD5特征串相同的特征串;剔重单元6,用于根据所述查找的结果确定所述话单是否为重单,如果在所述哈希表的存储节点上找到与所述特征串相同的MD5特征串,所述话单为重单,进行剔除,否则,所述话单为正常话单。Compare and remove heavy module 3 and comprise: calculate and locate unit 4, be used for carrying out hash operation to described MD5 characteristic string according to the hash function of setting, and according to the function value gained by this operation, find the storage in the hash table Node; search unit 5, used to search the same feature string as the MD5 feature string on the storage node; duplicate unit 6, used to determine whether the bill is a heavy bill according to the result of the search, if Find the same MD5 feature string as the feature string on the storage node of the hash table, and the bill is a duplicate, and is removed; otherwise, the bill is a normal bill.

下面将一种话单剔重装置实施例一的工作流程描述如下:The workflow of Embodiment 1 of a call list deduplication device is described as follows:

提取模块1从话单中提取关键域信息,特征串生成模块2根据MD5算法对提取模块1中生成的关键域信息进行计算,生成MD5特征串,比较剔重模块3中的计算定位单元4根据设定的哈希函数对特征串生成模块2生成的MD5特征串进行哈希运算,根据计算所得哈希函数值将该MD5特征串定位到存储索引文件的平衡二叉树上,查找单元5在已完成定位的平衡二叉树上查找MD5特征串,查找单元5如果在平衡二叉树上找到与所述MD5特征串相同的MD5特征串,剔重单元6则确定话单为重单,将重单剔除,查找单元5如果在平衡二叉树上没有找到与所述MD5特征串相同的MD5特征串,剔重单元6则确定话单不是重单。Extraction module 1 extracts the key field information from bill, and feature string generation module 2 calculates the key field information generated in extraction module 1 according to MD5 algorithm, generates MD5 feature string, and compares the calculation location unit 4 in the deduplication module 3 according to The set hash function carries out the hash operation to the MD5 characteristic string generated by the characteristic string generation module 2, and locates the MD5 characteristic string on the balanced binary tree storing the index file according to the calculated hash function value, and the search unit 5 has completed Search the MD5 characteristic string on the balanced binary tree of location, if search unit 5 finds the MD5 characteristic string identical with described MD5 characteristic string on the balanced binary tree, remove the duplicate unit 6 and then determine that the bill is a heavy single, and the repeated single is eliminated, and the search unit 5. If no MD5 signature string identical to the MD5 signature string is found on the balanced binary tree, the deduplication unit 6 determines that the bill is not a duplicate bill.

本发明一种话单剔重装置实施例一综合了哈希表、平衡二叉树和MD5算法的优点,有效实现了对不同业务、不同类型话单的处理,生成了长度统一且唯一的MD5特征串,提高了话单剔重的效率,极大的节约了存储空间。Embodiment 1 of a bill deduplication device of the present invention combines the advantages of hash table, balanced binary tree and MD5 algorithm, effectively realizes the processing of different services and different types of bills, and generates a uniform and unique MD5 characteristic string , which improves the efficiency of deduplicating call lists and greatly saves storage space.

如图9所示,公开了本发明一种话单剔重装置实施例二的结构图,本发明一种话单剔重装置的实施例二包括了一种话单剔重装置实施例一的全部模块,此外,比较剔重模块还包括插入单元7、旋转单元8,话单剔重装置还包括加载模块9:插入单元7用于如果剔重单元6确定话单不是重单,将MD5特征串插入到平衡二叉树上,同时输出正确话单和索引增量文件,旋转单元8用于如果插入单元7插入的MD5特征串使平衡二叉树失去平衡,则通过旋转单元8旋转平衡二叉树进行调整,加载模块9用于根据需要将磁盘中的索引文件重新加载到内存。As shown in Fig. 9, the structural diagram of Embodiment 2 of a bill deduplication device of the present invention is disclosed. Embodiment 2 of a bill deduplication device of the present invention includes the first embodiment All modules. In addition, the comparison and deduplication module also includes an insertion unit 7 and a rotation unit 8. The bill deduplication device also includes a loading module 9: the insertion unit 7 is used to convert the MD5 feature if the deduplication unit 6 determines that the bill is not a duplicate. The string is inserted into the balanced binary tree, and the correct call list and index incremental file are output simultaneously. The rotation unit 8 is used to adjust the balanced binary tree by rotating the balanced binary tree through the rotation unit 8 if the MD5 characteristic string inserted by the insertion unit 7 makes the balanced binary tree unbalanced. Module 9 is used to reload the index file from disk to memory as needed.

下面将一种话单剔重装置实施例二的工作流程描述如下:The workflow of Embodiment 2 of a kind of call list deduplication device is described as follows:

提取模块1从话单中提取关键域信息,例如主叫号码、被叫号码、通话时间、SP代码等字段组合的任意一个或几个的组合,正常话单所对应的MD5特征串以哈希表的方式存储于索引文件中,该索引文件存储于内存或磁盘中,哈希表的冲突则通过平衡二叉树来解决;特征串生成模块2根据MD5算法对提取模块1中生成的关键域信息进行计算,生成MD5特征串,比较剔重模块3中的计算定位单元4根据设定的哈希函数对特征串生成模块2生成的MD5特征串进行哈希运算,并根据计算所得哈希函数值将所述MD5特征串定位到作为所述哈希表存储节点的平衡二叉树上,查找单元5在已完成定位的平衡二叉树上查找MD5特征串,查找单元5如果在平衡二叉树上找到与该MD5特征串相同的MD5特征串,剔重单元6则确定话单为重单,将重单剔除,查找单元5如果在平衡二叉树上没有找到与该MD5特征串相同的MD5特征串,剔重单元6则确定话单不是重单,查找单元5在平衡二叉树上查找MD5特征串的过程具体如方法实施例步骤S214、步骤S216、步骤S224所述;如果剔重单元6确定话单不是重单,插入单元7将MD5特征串插入到平衡二叉树上,同时输出正确话单和索引增量文件,如果插入单元7插入的MD5特征串使平衡二叉树失去平衡,旋转单元8通过旋转平衡二叉树进行调整,在执行完剔重话单的过程以后,话单处理结束,当话单剔重过程的内存占有量超过指定数值时,话单剔重装置自动释放内存,并合并索引增量文件,在继续处理下一个话单文件时,加载模块9将根据需要重新加载索引文件。Extraction module 1 extracts key field information from bill, such as calling number, called number, call time, SP code and other field combinations of any one or several combinations, and the corresponding MD5 feature string of normal bill is hashed The table is stored in the index file, and the index file is stored in the memory or disk, and the conflict of the hash table is solved by balancing the binary tree; the feature string generation module 2 performs the key domain information generated in the extraction module 1 according to the MD5 algorithm. Compute, generate MD5 feature string, compare the calculation positioning unit 4 in the deduplication module 3 to carry out the hash operation to the MD5 feature string generated by the feature string generating module 2 according to the hash function set, and according to the calculated hash function value will The MD5 characteristic string is positioned on the balanced binary tree as the storage node of the hash table, and the search unit 5 searches for the MD5 characteristic string on the balanced binary tree that has been positioned. If the search unit 5 finds the MD5 characteristic string on the balanced binary tree Identical MD5 characteristic strings, the repeated unit 6 then determines that the bill is a heavy single, and the repeated singles are removed, if the search unit 5 does not find the MD5 characteristic strings identical with the MD5 characteristic strings on the balanced binary tree, the repeated unit 6 then determines The bill is not a duplicate, and the search unit 5 searches the process of the MD5 characteristic string on the balanced binary tree specifically as described in method embodiment step S214, step S216, and step S224; if the duplicate unit 6 determines that the bill is not a duplicate, the insertion unit 7 The MD5 feature string is inserted into the balanced binary tree, and the correct call list and index incremental file are output simultaneously. If the MD5 feature string inserted by the insertion unit 7 makes the balanced binary tree unbalanced, the rotation unit 8 adjusts the balanced binary tree by rotating it. After the process of repeating the bill, the bill processing ends. When the memory usage of the bill deduplication process exceeds the specified value, the bill deduplication device automatically releases the memory and merges the index incremental files, and continues to process the next bill file, the load module 9 will reload the index file as needed.

本发明一种话单剔重装置实施例二利用MD5算法对话单关键域信息进行运算,得到长度统一且唯一的MD5特征串,正常话单所对应的MD5特征串采用哈希表的方式存储,哈希表的冲突则通过平衡二叉树来解决,将MD5特征串与平衡二叉树的节点值进行比较,由此来判断话单是否是重单,实现了对任意业务、任意类型话单的剔重处理,提高了剔重系统的剔重运行效率和可扩展性,极大的节约了存储空间。Embodiment 2 of a bill deduplication device of the present invention uses the MD5 algorithm to perform calculations on dialogue single key field information to obtain a uniform and unique MD5 feature string, and the MD5 feature string corresponding to a normal bill is stored in a hash table. The conflict of the hash table is solved by balancing the binary tree, and comparing the MD5 characteristic string with the node value of the balanced binary tree, so as to judge whether the bill is a duplicate order, and realize the duplicate processing of any business and any type of bill , which improves the operation efficiency and scalability of the deduplication system, and greatly saves storage space.

除上述实施例外,本发明一种话单剔重装置,比较剔重模块中的计算定为单元中也可用链表来作为哈希表的存储节点,将MD5特征串与链表的节点值进行比较,由此来判断话单是否为重单,与上述实施例相似的结构也可实现话单的剔重功能。Except above-mentioned embodiment, a kind of phone list of the present invention removes heavy device, compares the calculation in the heavy module and determines that in the unit, linked list can also be used as the storage node of hash table, and the node value of MD5 characteristic string and linked list is compared, From this, it is judged whether the bill is a duplicate bill, and the structure similar to the above embodiment can also realize the function of removing duplicate bills.

以上公开的仅为本发明的几个具体实施例,但是,本发明并非局限于此,任何本领域的技术人员能思之的变化都应落入本发明的保护范围。The above disclosures are only a few specific embodiments of the present invention, however, the present invention is not limited thereto, and any changes conceivable by those skilled in the art shall fall within the protection scope of the present invention.

Claims (13)

1.一种话单剔重方法,其特征在于,该方法包括以下步骤:1. A method for removing duplicates from a bill, is characterized in that the method may further comprise the steps: 步骤a:从话单中提取关键域信息;Step a: extract key field information from the bill; 步骤b:使用MD5算法对所述关键域信息进行计算,生成该话单对应的MD5特征串;Step b: use the MD5 algorithm to calculate the key field information, and generate the MD5 feature string corresponding to the bill; 步骤c:将所述MD5特征串,与索引文件中存储的、正常话单对应的MD5特征串进行比较;Step c: comparing the MD5 feature string with the MD5 feature string stored in the index file and corresponding to the normal bill; 如果发现相同的MD5特征串,则该话单为重单,对该话单进行剔除;否则将该话单对应的MD5特征串保存至索引文件中,并确认该话单为正常话单。If identical MD5 characteristic strings are found, the bill is a duplicate, and the bill is removed; otherwise, the MD5 characteristic string corresponding to the bill is saved in the index file, and the bill is confirmed to be a normal bill. 2.根据权利要求1所述的话单剔重方法,其特征在于,所述正常话单对应的MD5特征串,以哈希表的方式存储于索引文件中;所述步骤c具体包括:2. according to the described method of claim 1, it is characterized in that, the MD5 characteristic string corresponding to the normal bill is stored in the index file in the mode of hash table; Described step c specifically comprises: 步骤c1:对该话单对应的MD5特征串,根据设定的哈希函数进行哈希运算;Step c1: perform a hash operation on the MD5 characteristic string corresponding to the bill according to a set hash function; 步骤c2:根据哈希运算得到的函数值,找到所述哈希表中的存储节点;Step c2: Find the storage node in the hash table according to the function value obtained by the hash operation; 步骤c3:如果在该存储节点上找到与该话单对应的MD5特征串相同的MD5特征串,Step c3: If the MD5 characteristic string identical to the MD5 characteristic string corresponding to the bill is found on the storage node, 则该话单为重单,对该话单进行剔除;否则,将该话单对应的MD5特征串插入到该存储节点中,并确认该话单为正常话单。Then the bill is a duplicate bill, and the bill is eliminated; otherwise, the MD5 characteristic string corresponding to the bill is inserted into the storage node, and the bill is confirmed to be a normal bill. 3.根据权利要求2所述的话单剔重方法,其特征在于,所述哈希表中的存储节点中的MD5特征串,以链表的方式或者平衡二叉树的方式存储。3. the method for eliminating duplicates according to claim 2, is characterized in that, the MD5 characteristic string in the storage node in the said hash table is stored in the mode of linked list or the mode of balanced binary tree. 4.根据权利要求3所述的话单剔重方法,其特征在于,当所述哈希表中的存储节点中的MD5特征串,以平衡二叉树的方式存储时,所述步骤c3中,所述将该话单对应的MD5特征串插入到该存储节点中的步骤具体包括:4. according to the described method of claim 3, it is characterized in that, when the MD5 characteristic string in the storage node in the said hash table is stored in the mode of balanced binary tree, in said step c3, said The step of inserting the MD5 characteristic string corresponding to the bill into the storage node specifically includes: 将该MD5特征串插入到该存储节点上的平衡二叉树上,若插入的MD5特征串使得所述平衡二叉树失去平衡时,则通过旋转进行调整。The MD5 characteristic string is inserted into the balanced binary tree on the storage node, and if the inserted MD5 characteristic string makes the balanced binary tree unbalanced, adjustment is performed by rotation. 5.根据权利要求1-4中任一所述的话单剔重方法,其特征在于,当所述话单为正常话单时,则输出该话单、并输出更新后的索引增量文件。5. The method for deduplicating bills according to any one of claims 1-4, characterized in that, when the bills are normal bills, the bills are output, and an updated index increment file is output. 6.根据权利要求5所述的话单剔重方法,其特征在于,所述索引文件以及索引增量文件,存储于内存或磁盘中。6. The method for removing duplicates from the bill according to claim 5, wherein the index file and the index increment file are stored in memory or disk. 7.根据权利要求6所述的话单剔重方法,其特征在于,若内存占用量超过指定数值,则合并索引增量文件,将索引文件中时间较早的部分存储到磁盘中,并自动释放内存;若需要将磁盘中的索引文件重新加载到内存时,则重新加载。7. according to the described method of claim 6, it is characterized in that, if the memory usage exceeds the specified value, then merge the index increment file, store the earlier part in the index file in the disk, and release automatically memory; if the index file in the disk needs to be reloaded into the memory, reload it. 8.根据权利要1-4中任一所述的话单剔重方法,其特征在于,所述关键域信息包括:由主叫号码、被叫号码、通话时间、SP代码组成的组合字段中的任意一种或几种的组合。8. According to the method for eliminating duplicates according to any one of claims 1-4, it is characterized in that, the key domain information includes: in the combination field made up of calling number, called number, call time, SP code Any one or combination of several. 9、一种话单剔重装置,其特征在于,包括:9. A bill-removing device, characterized in that it comprises: 提取模块,用于从话单中提取关键域信息;An extraction module, used to extract key field information from the bill; 特征串生成模块,用于使用MD5算法对所述关键域信息进行计算,生成MD5特征串;A feature string generation module, configured to use the MD5 algorithm to calculate the key field information to generate an MD5 feature string; 比较剔重模块:用于将所述MD5特征串与索引文件中存储的、正常话单对应的MD5特征串进行比较;Compare duplicate module: be used for comparing described MD5 feature string with the MD5 feature string corresponding to the normal call list stored in the index file; 若如果发现相同的MD5特征串,则该话单为重单,对该话单进行剔除;否则将该话单对应的MD5特征串保存至索引文件中,并确认该话单为正常话单。If find identical MD5 characteristic string, then this bill is duplicate bill, and this bill is removed; Otherwise the MD5 characteristic string corresponding to this bill is saved in the index file, and confirms that this bill is normal bill. 10、根据权利要求9所述的话单剔重装置,其特征在于,所述的比较剔重模块包括:10. The bill deduplication device according to claim 9, characterized in that, the comparison deduplication module includes: 计算定位单元,用于根据设定的哈希函数对所述MD5特征串进行哈希运算,并根据该运算所得的函数值,找到所述哈希表中的存储节点;Calculation and positioning unit, used to carry out hash operation on the MD5 characteristic string according to the set hash function, and find the storage node in the hash table according to the function value obtained by the operation; 查找单元,用于在所述存储节点上查找与所述MD5特征串相同的特征串;A search unit, configured to search for a signature string identical to the MD5 signature string on the storage node; 剔重单元,用于根据所述查找的结果确定所述话单是否为重单,如果在所述哈希表的存储节点上找到与所述MD5特征串相同的MD5特征串,所述话单为重单,进行剔除,否则,所述话单为正常话单。A duplicate removal unit is used to determine whether the bill is a duplicate bill according to the result of the search, if the MD5 characteristic string identical to the MD5 characteristic string is found on the storage node of the hash table, the bill If it is a duplicate bill, remove it; otherwise, the bill is a normal bill. 11、根据权利要求10所述的话单剔重装置,其特征在于,所述哈希表的存储节点为一个链表或一颗平衡二叉树。11. The bill deduplication device according to claim 10, wherein the storage node of the hash table is a linked list or a balanced binary tree. 12、根据权利要求10或11所述的话单剔重装置,其特征在于,当所述哈希表中的存储节点为一颗平衡二叉树时,所述比较剔重模块还包括:12. The bill deduplication device according to claim 10 or 11, wherein when the storage node in the hash table is a balanced binary tree, the comparison deduplication module further includes: 插入单元,用于将所述正常话单对应的MD5特征串插入到所述平衡二叉树上,同时输出该正常话单及其索引增量文件;Insertion unit, for inserting the MD5 feature string corresponding to the normal bill into the balanced binary tree, and simultaneously output the normal bill and its index incremental file; 旋转单元,用于如果所述MD5特征串插入所述平衡二叉树时,使所述平衡二叉树失去平衡,则通过所述旋转单元旋转平衡二叉树进行调整。A rotation unit is configured to rotate the balanced binary tree through the rotation unit to adjust if the balanced binary tree is unbalanced when the MD5 character string is inserted into the balanced binary tree. 13、根据权利要求9所述的话单剔重装置,其特征在于,还包括加载模块,用于根据需要将磁盘中的索引文件重新加载到内存。13. The phone bill deduplication device according to claim 9, further comprising a loading module, configured to reload the index file in the disk to the memory as required.
CN2008101832739A 2008-12-12 2008-12-12 A method and device for deduplicating phone calls Expired - Fee Related CN101442731B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101832739A CN101442731B (en) 2008-12-12 2008-12-12 A method and device for deduplicating phone calls

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101832739A CN101442731B (en) 2008-12-12 2008-12-12 A method and device for deduplicating phone calls

Publications (2)

Publication Number Publication Date
CN101442731A true CN101442731A (en) 2009-05-27
CN101442731B CN101442731B (en) 2010-07-14

Family

ID=40726941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101832739A Expired - Fee Related CN101442731B (en) 2008-12-12 2008-12-12 A method and device for deduplicating phone calls

Country Status (1)

Country Link
CN (1) CN101442731B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102065192A (en) * 2009-11-18 2011-05-18 中国移动通信集团安徽有限公司 Call ticket de-duplication method and device
CN102156744A (en) * 2011-04-18 2011-08-17 北京神州数码思特奇信息技术股份有限公司 Method for eliminating repetition of memory dialog list
CN102298633A (en) * 2011-09-08 2011-12-28 厦门市美亚柏科信息股份有限公司 Method and system for investigating repeated data in distributed mass data
WO2012079460A1 (en) * 2010-12-14 2012-06-21 成都市华为赛门铁克科技有限公司 Method, device, and system to check for data repetitiveness
CN102541918A (en) * 2010-12-30 2012-07-04 阿里巴巴集团控股有限公司 Method and equipment for identifying repeated information
CN102591855A (en) * 2012-01-13 2012-07-18 广州从兴电子开发有限公司 Data identification method and data identification system
CN102591792A (en) * 2012-01-13 2012-07-18 广州从兴电子开发有限公司 Storage method for memory data
CN103037344A (en) * 2012-12-06 2013-04-10 亚信联创科技(中国)有限公司 Call bill repetition removing method and call bill repetition removing device
CN103207878A (en) * 2012-01-17 2013-07-17 阿里巴巴集团控股有限公司 Inspection method and device of published information
CN105930396A (en) * 2016-04-15 2016-09-07 北京思特奇信息技术股份有限公司 Database based duplicate removal method and system
US9785666B2 (en) 2010-12-28 2017-10-10 Microsoft Technology Licensing, Llc Using index partitioning and reconciliation for data deduplication
CN107357862A (en) * 2017-06-30 2017-11-17 中国联合网络通信集团有限公司 Calling list rearrangement method and device
CN108650429A (en) * 2018-04-08 2018-10-12 国网辽宁省电力有限公司信息通信分公司 A kind of calling list rearrangement method and re-scheduling system
CN108990001A (en) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 Calling list rearrangement method, device, storage medium and computer equipment
CN109976896A (en) * 2019-04-09 2019-07-05 中国联合网络通信集团有限公司 Business re-scheduling treating method and apparatus
CN111209272A (en) * 2019-12-26 2020-05-29 杭州亚信云信息科技有限公司 Ticket repetition checking method, device and system
CN112069510A (en) * 2020-07-24 2020-12-11 北京思特奇信息技术股份有限公司 Data encryption and de-duplication method
CN113934685A (en) * 2020-07-13 2022-01-14 中国石油化工股份有限公司 Method and device for seismic data repeatability inspection, electronic equipment and storage medium
CN114245330A (en) * 2021-11-17 2022-03-25 中国联合网络通信集团有限公司 Ticket combination method, device, equipment, computer readable storage medium and product

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102065192A (en) * 2009-11-18 2011-05-18 中国移动通信集团安徽有限公司 Call ticket de-duplication method and device
WO2012079460A1 (en) * 2010-12-14 2012-06-21 成都市华为赛门铁克科技有限公司 Method, device, and system to check for data repetitiveness
US9785666B2 (en) 2010-12-28 2017-10-10 Microsoft Technology Licensing, Llc Using index partitioning and reconciliation for data deduplication
CN102541918A (en) * 2010-12-30 2012-07-04 阿里巴巴集团控股有限公司 Method and equipment for identifying repeated information
CN102156744A (en) * 2011-04-18 2011-08-17 北京神州数码思特奇信息技术股份有限公司 Method for eliminating repetition of memory dialog list
CN102298633A (en) * 2011-09-08 2011-12-28 厦门市美亚柏科信息股份有限公司 Method and system for investigating repeated data in distributed mass data
CN102591792B (en) * 2012-01-13 2015-04-29 从兴技术有限公司 Storage method for memory data
CN102591855A (en) * 2012-01-13 2012-07-18 广州从兴电子开发有限公司 Data identification method and data identification system
CN102591792A (en) * 2012-01-13 2012-07-18 广州从兴电子开发有限公司 Storage method for memory data
CN103207878A (en) * 2012-01-17 2013-07-17 阿里巴巴集团控股有限公司 Inspection method and device of published information
CN103207878B (en) * 2012-01-17 2016-05-04 阿里巴巴集团控股有限公司 The inspection method releasing news and device
CN103037344B (en) * 2012-12-06 2016-04-20 亚信科技(中国)有限公司 A kind of ticket De-weight method and device
CN103037344A (en) * 2012-12-06 2013-04-10 亚信联创科技(中国)有限公司 Call bill repetition removing method and call bill repetition removing device
CN105930396A (en) * 2016-04-15 2016-09-07 北京思特奇信息技术股份有限公司 Database based duplicate removal method and system
CN105930396B (en) * 2016-04-15 2019-04-09 北京思特奇信息技术股份有限公司 A kind of repetition removing method and system based on database
CN108990001B (en) * 2017-06-05 2021-04-20 中兴通讯股份有限公司 Ticket repetition eliminating method, device, storage medium and computer equipment
CN108990001A (en) * 2017-06-05 2018-12-11 中兴通讯股份有限公司 Calling list rearrangement method, device, storage medium and computer equipment
CN107357862A (en) * 2017-06-30 2017-11-17 中国联合网络通信集团有限公司 Calling list rearrangement method and device
CN107357862B (en) * 2017-06-30 2020-03-13 中国联合网络通信集团有限公司 Method and device for arranging repeated voice messages
CN108650429A (en) * 2018-04-08 2018-10-12 国网辽宁省电力有限公司信息通信分公司 A kind of calling list rearrangement method and re-scheduling system
CN109976896A (en) * 2019-04-09 2019-07-05 中国联合网络通信集团有限公司 Business re-scheduling treating method and apparatus
CN109976896B (en) * 2019-04-09 2021-06-29 中国联合网络通信集团有限公司 Method and device for re-arranging services
CN111209272A (en) * 2019-12-26 2020-05-29 杭州亚信云信息科技有限公司 Ticket repetition checking method, device and system
CN111209272B (en) * 2019-12-26 2023-04-18 杭州亚信云信息科技有限公司 Method, device and system for checking call ticket
CN113934685A (en) * 2020-07-13 2022-01-14 中国石油化工股份有限公司 Method and device for seismic data repeatability inspection, electronic equipment and storage medium
CN113934685B (en) * 2020-07-13 2025-02-25 中国石油化工股份有限公司 Method, device, electronic device and storage medium for checking repeatability of seismic data
CN112069510A (en) * 2020-07-24 2020-12-11 北京思特奇信息技术股份有限公司 Data encryption and de-duplication method
CN112069510B (en) * 2020-07-24 2024-01-30 北京思特奇信息技术股份有限公司 Data encryption and duplication elimination method
CN114245330A (en) * 2021-11-17 2022-03-25 中国联合网络通信集团有限公司 Ticket combination method, device, equipment, computer readable storage medium and product
CN114245330B (en) * 2021-11-17 2024-04-02 中国联合网络通信集团有限公司 Call record merging method, device, equipment, computer-readable storage medium and product

Also Published As

Publication number Publication date
CN101442731B (en) 2010-07-14

Similar Documents

Publication Publication Date Title
CN101442731A (en) Method and apparatus for removing call ticket repeat
CN102906751B (en) A kind of method of data storage, data query and device
JP6088506B2 (en) Managing data storage for range-based searches
US8255398B2 (en) Compression of sorted value indexes using common prefixes
US7739288B2 (en) Systems and methods of directory entry encodings
CN104462141B (en) Method, system and the storage engines device of a kind of data storage and inquiry
CN102027457B (en) Managing storage of individually accessible data units
CN100571317C (en) A kind of calling list rearrangement method and device
CN104077423A (en) Consistent hash based structural data storage, inquiry and migration method
CN110147204A (en) Method, device, system, and computer-readable storage medium for metadata storage
CN106250476B (en) A method, device and system for updating and synchronizing a whitelist
CN113535803B (en) An Efficient Retrieval and Reliability Verification Method of Blockchain Based on Keyword Index
CN102541925A (en) Method and device for rapidly storing and retrieving detailed tickets
CN111966654A (en) Mixed filter based on Trie dictionary tree
US8566324B1 (en) Inverted index and inverted list process for storing and retrieving information
US8949282B1 (en) Efficient storage of non-searchable attributes
CN113806803B (en) Data storage method, system, terminal equipment and storage medium
CN102065192B (en) Call ticket de-duplication method and device
US8930336B2 (en) Retrieval of searchable and non-searchable attributes
CN110971393B (en) Method and device for keyword query and verification based on blockchain dynamic social outsourcing data
US20120185514A1 (en) Optimized fetching for customization object attributes
CN102917341B (en) The storage of mobile phone and its cell-phone number attaching information and lookup method
CN107315806B (en) Embedded storage method and device based on file system
CN111309677A (en) File management method and device of distributed file system
Peng et al. A hive-based retrieval optimization scheme for long-term storage of massive call detail records

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100714

Termination date: 20211212