CN104050291A

CN104050291A - Parallel processing method and system for account balance data

Info

Publication number: CN104050291A
Application number: CN201410306448.6A
Authority: CN
Inventors: 赵仁明; 辛国茂; 亓开元; 房体盈
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Shanghai Wave Cloud Computing Service Co Ltd
Priority date: 2014-06-30
Filing date: 2014-06-30
Publication date: 2014-09-17
Anticipated expiration: 2034-06-30
Also published as: CN104050291B

Abstract

The invention discloses a method for parallel processing of account balance data. The method includes: one or more Map nodes executing the first task read different slice data of account balance detailed data, and generate the data in the read slice data. The first output parameter and the second output parameter of each balance record; wherein, the first output parameter includes at least an account ID, and the second output parameter is set as account status information, and the account status information includes at least: balance value, transaction date, and transaction number of the day; one or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task, and according to the first output parameter and the second output parameter of the balance record The two output parameters respectively generate the daily average balance value records of each account; among them, the balance records with the same first output parameter are read by the same Reduce node. The invention can quickly acquire the statistical results of the daily average balance of the account under a large amount of data. The invention also discloses a parallel processing system for account balance data.

Description

A parallel processing method and system for account balance data

技术领域technical field

本发明涉及大数据处理技术领域，尤其涉及的是一种大数据量下的帐户余额数据的并行处理方法和系统。The invention relates to the technical field of big data processing, in particular to a method and system for parallel processing of account balance data under a large amount of data.

背景技术Background technique

数据是企业生产、经营、战略等几乎所有经营活动所依赖的、不可或缺的信息。数据就犹如企业经营者的眼睛一样，通过数据可以反映出经营的问题，就犹如舵手依赖导航一样。随着人类社会全面进入信息时代，数据更是成为与水、石油同等重要的战略资源。目前企业面临着数据量的大规模增长。例如，IDC最近的报告预测称，到2020年，全球数据量将扩大50倍。目前，大数据的规模尚是一个不断变化的指标，单一数据集的规模范围从几十TB到数PB不等。此外，各种意想不到的来源都能产生数据。Data is an indispensable information that almost all business activities such as production, operation, and strategy rely on. Data is like the eyes of business operators. Data can reflect business problems, just like the helmsman relies on navigation. As human society enters the information age, data has become a strategic resource as important as water and oil. Enterprises are currently facing massive growth in data volume. For example, a recent IDC report predicts that by 2020, the amount of data worldwide will expand 50-fold. At present, the size of big data is still a changing indicator, and the size of a single data set ranges from tens of terabytes to several petabytes. In addition, data can be generated from a variety of unexpected sources.

传统业务数据随时间演变已拥有标准的格式，能够被标准的商务智能软件识别。相较传统的业务数据，大数据具有多层结构，这意味着大数据会呈现出多变的形式和类型。由于大数据存在不规则和模糊不清的特性，造成很难甚至无法使用传统的应用软件进行分析。Traditional business data has evolved over time in a standard format that can be recognized by standard business intelligence software. Compared with traditional business data, big data has a multi-layered structure, which means that big data will present various forms and types. Due to the irregular and ambiguous nature of big data, it is difficult or even impossible to analyze it using traditional application software.

目前，企业面临的挑战是从各种形式的复杂数据中挖掘价值。Today, the challenge for enterprises is to extract value from various forms of complex data.

发明内容Contents of the invention

本发明所要解决的技术问题是提供一种账户余额数据的并行处理方法和系统，快速获取大数据量下的帐户日均余额的统计结果。The technical problem to be solved by the present invention is to provide a parallel processing method and system for account balance data, which can quickly obtain statistical results of account daily average balance under a large amount of data.

为了解决上述技术问题，本发明提供了一种账户余额数据的并行处理方法，该方法包括：In order to solve the above technical problems, the present invention provides a method for parallel processing of account balance data, the method comprising:

一个或多个执行第一任务的Map节点读取账户余额明细数据的不同分片数据，生成所读取的分片数据中每一条余额记录的第一输出参数和第二输出参数；其中，所述第一输出参数至少包括账户ID，所述第二输出参数设定为账户状态信息，所述账户状态信息至少包括：余额值、交易日期和当天交易序号；One or more Map nodes performing the first task read different fragmented data of the account balance detailed data, and generate the first output parameter and the second output parameter of each balance record in the read fragmented data; wherein, the The first output parameter includes at least an account ID, and the second output parameter is set as account status information, and the account status information includes at least: balance value, transaction date and transaction number of the day;

一个或多个执行第一任务的Reduce节点读取所述执行第一任务的Map节点处理完毕的不同余额记录，根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录；其中，第一输出参数相同的余额记录由同一个Reduce节点读取。One or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task, and generate the daily balance records of each account according to the first output parameter and the second output parameter of the balance record respectively. Average balance value records; where the balance records with the same first output parameter are read by the same Reduce node.

进一步地，该方法还包括下述特点：Further, the method also includes the following features:

所述根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录，包括：The generating of the daily average balance value records of each account according to the first output parameter and the second output parameter of the balance record includes:

按照所述第一输出参数中的账户ID，遍历同一账户的各条余额记录，根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，将每一天的余额值在所述查询起止时间范围内取平均得到该账户的日均余额值，生成该账户的日均余额值记录。According to the account ID in the first output parameter, traverse each balance record of the same account, determine the balance value of the account for each day within the query start and end time range according to the second output parameter of the balance record, and calculate each day The balance value of the account is averaged within the start and end time range of the query to obtain the daily average balance value of the account, and a record of the daily average balance value of the account is generated.

在所述执行第一任务的Reduce节点根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录后，还包括：After the Reduce node performing the first task respectively generates the daily average balance value records of each account according to the first output parameter and the second output parameter of the balance record, it also includes:

一个或多个执行第二任务的Map节点读取不同账户的日均余额值记录，生成所读取的日均余额值记录的第一输出参数和第二输出参数；其中，所述日均余额值记录的第一输出参数设定为所述日均余额值所在的区间，所述日均余额值记录的第二输出参数设定为1；One or more Map nodes performing the second task read the daily average balance value records of different accounts, and generate the first output parameter and the second output parameter of the read daily average balance value records; wherein, the daily average balance value The first output parameter of the value record is set to the interval where the daily average balance value is located, and the second output parameter of the daily average balance value record is set to 1;

一个或多个执行第二任务的Reduce节点读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录，根据所述日均余额值记录的第一输出参数和第二输出参数统计各日均余额值区间的账户数，包括：按照所述第一输出参数中的日均余额值区间，遍历同一日均余额值区间的各条日均余额值记录，将每一条日均余额值记录的第二输出参数进行累加，获得该日均余额值区间的账户数；其中，第一输出参数相同的日均余额值记录由同一个Reduce节点读取。One or more Reduce nodes executing the second task read the different daily average balance value records processed by the Map node executing the second task, and according to the first output parameter and the second output parameter of the daily average balance value record Counting the number of accounts in each daily average balance value interval, including: according to the daily average balance value interval in the first output parameter, traversing each daily average balance value record in the same daily average balance value interval, and recording each daily average balance value The second output parameter of the value record is accumulated to obtain the number of accounts in the daily average balance value interval; wherein, the daily average balance value records with the same first output parameter are read by the same Reduce node.

根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，包括：According to the second output parameter of the balance record, determine the balance value of the account for each day within the query start and end time range, including:

从查询起始时间的当日至查询终止时间的当日，判断每一天是否存在余额记录，如存在，将当日交易序号最大的余额记录的余额值作为当日的最终余额值，如不存在，追溯早于当日且具有余额记录的最近日期，将所述最近日期那一天的交易序号最大的余额记录的余额值作为当日的最终余额值。From the day of the query start time to the day of the query end time, judge whether there is a balance record for each day. If it exists, use the balance value of the balance record with the largest transaction number on that day as the final balance value of the day. If it does not exist, trace back to earlier than For the latest date with a balance record on that day, the balance value of the balance record with the largest transaction sequence number on the day of the latest date is taken as the final balance value of the day.

在一个或多个执行第一任务的Reduce节点读取所述执行第一任务的Map节点处理完毕的不同余额记录之前，还包括：Before one or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task, it also includes:

计算每一条余额记录的第一参数的哈希值，建立第一参数的哈希值与所述执行第一任务的Reduce节点的映射关系；其中，所述映射关系用于供所述执行第一任务的Reduce节点根据所述映射关系读取对应的余额记录。Calculate the hash value of the first parameter of each balance record, and establish a mapping relationship between the hash value of the first parameter and the Reduce node that executes the first task; wherein, the mapping relationship is used for the execution of the first task. The Reduce node of the task reads the corresponding balance record according to the mapping relationship.

在一个或多个执行第二任务的Reduce节点读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录之前，还包括：Before one or more Reduce nodes executing the second task read the different daily average balance value records processed by the Map node executing the second task, it also includes:

计算每一条日均余额值记录的第一参数的哈希值，建立第一参数的哈希值与执行第二任务的Reduce节点的映射关系；其中，所述映射关系用于供所述执行第二任务的Reduce节点根据所述映射关系读取对应的日均余额值记录。Calculate the hash value of the first parameter of each daily average balance value record, and establish a mapping relationship between the hash value of the first parameter and the Reduce node executing the second task; wherein, the mapping relationship is used for the execution of the second task The Reduce node of the second task reads the corresponding daily average balance value record according to the mapping relationship.

在一个或多个执行第一任务的Map节点读取账户余额明细数据的不同分片数据之前，还包括：Before one or more Map nodes performing the first task read different fragmented data of the account balance detailed data, it also includes:

根据查询起止时间确定账户余额明细数据的读取范围，包括：将全量余额明细数据和截止到查询终止时间当天的增量余额明细数据确定为账户余额明细数据的读取范围；Determine the reading range of account balance detailed data according to the start and end time of the query, including: determining the full balance detailed data and the incremental balance detailed data up to the day when the query ends as the reading range of account balance detailed data;

将属于该读取范围内的账户余额明细数据分片，建立每一个分片与执行第一任务的Map节点的映射关系；其中，所述映射关系用于供所述执行第一任务的Map节点根据所述映射关系读取对应的分片数据。Fragmenting the account balance detail data belonging to the reading range, and establishing a mapping relationship between each fragment and the Map node performing the first task; wherein, the mapping relationship is used for the Map node performing the first task The corresponding fragment data is read according to the mapping relationship.

为了解决上述技术问题，本发明还提供了一种账户余额数据的并行处理系统，该系统包括：In order to solve the above technical problems, the present invention also provides a parallel processing system for account balance data, the system includes:

Map处理模块，包括一个或多个执行第一任务的Map节点；各执行第一任务的Map节点用于读取账户余额明细数据的不同分片数据，生成所读取的分片数据中每一条余额记录的第一输出参数和第二输出参数；其中，所述第一输出参数至少包括账户ID，所述第二输出参数设定为账户状态信息，所述账户状态信息至少包括：余额值、交易日期和当天交易序号；The Map processing module includes one or more Map nodes that execute the first task; each Map node that executes the first task is used to read different fragment data of the account balance detailed data, and generates each item in the read fragment data. The first output parameter and the second output parameter of the balance record; wherein, the first output parameter includes at least an account ID, and the second output parameter is set as account status information, and the account status information includes at least: balance value, Transaction date and transaction serial number of the day;

Reduce处理模块，包括一个或多个执行第一任务的Reduce节点；各执行第一任务的Reduce节点用于读取所述执行第一任务的Map节点处理完毕的不同余额记录，根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录；其中，第一输出参数相同的余额记录由同一个Reduce节点读取。The Reduce processing module includes one or more Reduce nodes that perform the first task; each Reduce node that performs the first task is used to read the different balance records that the Map node that performs the first task has processed, according to the balance record The first output parameter and the second output parameter respectively generate the daily average balance value records of each account; wherein, the balance records with the same first output parameter are read by the same Reduce node.

进一步地，该系统还包括下述特点：Further, the system also includes the following features:

所述执行第一任务的Reduce节点用于根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录，包括：按照所述第一输出参数中的账户ID，遍历同一账户的各条余额记录，根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，将每一天的余额值在所述查询起止时间范围内取平均得到该账户的日均余额值，生成该账户的日均余额值记录。The Reduce node performing the first task is used to generate the daily average balance value record of each account according to the first output parameter and the second output parameter of the balance record, including: according to the account ID in the first output parameter , traverse the balance records of the same account, determine the balance value of the account for each day within the query start and end time range according to the second output parameter of the balance record, and calculate the balance value of each day within the query start and end time range Take the average to obtain the daily average balance value of the account, and generate the daily average balance value record of the account.

所述Map处理模块还包括一个或多个执行第二任务的Map节点，所述Reduce处理模块还包括一个或多个执行第二任务的Reduce节点；The Map processing module also includes one or more Map nodes that execute the second task, and the Reduce processing module also includes one or more Reduce nodes that execute the second task;

各执行第二任务的Map节点用于读取不同账户的日均余额值记录，生成所读取的日均余额值记录的第一输出参数和第二输出参数；其中，所述日均余额值记录的第一输出参数设定为所述日均余额值所在的区间，所述日均余额值记录的第二输出参数设定为1；Each Map node performing the second task is used to read the daily average balance value records of different accounts, and generate the first output parameter and the second output parameter of the read daily average balance value records; wherein, the daily average balance value The first output parameter of the record is set to the interval where the daily average balance value is located, and the second output parameter of the record of the daily average balance value is set to 1;

各执行第二任务的Reduce节点用于读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录，根据所述日均余额值记录的第一输出参数和第二输出参数统计各日均余额值区间的账户数，包括：按照所述第一输出参数中的日均余额值区间，遍历同一日均余额值区间的各条日均余额值记录，将每一条日均余额值记录的第二输出参数进行累加，获得该日均余额值区间的账户数；其中，第一输出参数相同的日均余额值记录由同一个Reduce节点读取。Each Reduce node executing the second task is used to read the different daily average balance value records processed by the Map node executing the second task, and make statistics according to the first output parameter and the second output parameter of the daily average balance value record The number of accounts in each daily average balance value interval, including: according to the daily average balance value interval in the first output parameter, traverse each daily average balance value record in the same daily average balance value interval, and record each daily average balance value The second output parameter of the record is accumulated to obtain the number of accounts in the daily average balance value interval; wherein, the daily average balance value records with the same first output parameter are read by the same Reduce node.

所述执行第一任务的Reduce节点用于根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，包括：从查询起始时间的当日至查询终止时间的当日，判断每一天是否存在余额记录，如存在，将当日交易序号最大的余额记录的余额值作为当日的最终余额值，如不存在，追溯早于当日且具有余额记录的最近日期，将所述最近日期那一天的交易序号最大的余额记录的余额值作为当日的最终余额值。The Reduce node executing the first task is used to determine the balance value of the account for each day within the query start and end time range according to the second output parameter of the balance record, including: from the day of the query start time to the query end time On the current day, judge whether there is a balance record for each day. If it exists, take the balance value of the balance record with the largest transaction number on the current day as the final balance value of the current day. The balance value of the balance record with the largest transaction sequence number on the most recent date mentioned above is taken as the final balance value of the day.

所述Reduce处理模块还包括第一任务路由模块：The Reduce processing module also includes the first task routing module:

所述第一任务路由模块，用于在一个或多个执行第一任务的Reduce节点读取所述执行第一任务的Map节点处理完毕的不同余额记录之前，计算每一条余额记录的第一参数的哈希值，建立第一参数的哈希值与所述执行第一任务的Reduce节点的映射关系；其中，所述映射关系用于供所述执行第一任务的Reduce节点根据所述映射关系读取对应的余额记录。The first task routing module is used to calculate the first parameter of each balance record before one or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task Hash value, establishes the mapping relationship between the hash value of the first parameter and the Reduce node performing the first task; wherein, the mapping relationship is used for the Reduce node performing the first task according to the mapping relationship Read the corresponding balance record.

所述Reduce处理模块还包括第二任务路由模块：The Reduce processing module also includes a second task routing module:

所述第二任务路由模块，用于在一个或多个执行第二任务的Reduce节点读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录之前，计算每一条日均余额值记录的第一参数的哈希值，建立第一参数的哈希值与执行第二任务的Reduce节点的映射关系；其中，所述映射关系用于供所述执行第二任务的Reduce节点根据所述映射关系读取对应的日均余额值记录。The second task routing module is used to calculate each daily average balance value record before one or more Reduce nodes executing the second task read the different daily average balance value records processed by the Map node executing the second task The hash value of the first parameter of the value record, establishes the mapping relationship between the hash value of the first parameter and the Reduce node performing the second task; wherein, the mapping relationship is used for the Reduce node performing the second task according to The mapping relationship reads the corresponding daily average balance value records.

进一步地，该系统还包括下述特点：所述Map处理模块还包括分片模块：Further, the system also includes the following features: the Map processing module also includes a fragmentation module:

所述分片模块，用于在一个或多个执行第一任务的Map节点读取账户余额明细数据的不同分片数据之前，根据查询起止时间确定账户余额明细数据的读取范围，包括：将全量余额明细数据和截止到查询终止时间当天的增量余额明细数据确定为账户余额明细数据的读取范围；将属于该读取范围内的账户余额明细数据分片，建立每一个分片与执行第一任务的Map节点的映射关系；其中，所述映射关系用于供所述执行第一任务的Map节点根据所述映射关系读取对应的分片数据。The sharding module is used to determine the reading range of the account balance detail data according to the query start and end time before one or more Map nodes performing the first task read different shard data of the account balance detail data, including: The full balance detailed data and the incremental balance detailed data up to the day when the query is terminated are determined as the reading range of the account balance detailed data; the account balance detailed data belonging to the reading range is fragmented, and each fragment is established and executed A mapping relationship of the Map node of the first task; wherein, the mapping relationship is used for the Map node executing the first task to read corresponding fragment data according to the mapping relationship.

与现有技术相比，本发明提供的一种账户余额数据的并行处理方法和系统，基于MapReduce将大规模的帐户余额明细数据分成若干份交给Map节点并行处理，Map阶段对数据按照帐户进行了分类，处理完成后根据帐户id分组并路由给多个Reduce节点并行处理，从而快速获取大数据量下的帐户日均余额的统计结果，处理效率高、可扩展性强。Compared with the prior art, the present invention provides a method and system for parallel processing of account balance data. Based on MapReduce, the large-scale account balance detailed data is divided into several parts and handed over to Map nodes for parallel processing. In the Map stage, the data is processed according to the account. After the processing is completed, it is grouped according to the account id and routed to multiple Reduce nodes for parallel processing, so as to quickly obtain the statistical results of the average daily balance of the account under a large amount of data, with high processing efficiency and strong scalability.

附图说明Description of drawings

图1为本发明实施例的一种账户余额数据的并行处理方法中获取各用户的日均余额值的流程图。FIG. 1 is a flow chart of obtaining the daily average balance value of each user in a parallel processing method of account balance data according to an embodiment of the present invention.

图2为本发明实施例的一种账户余额数据的并行处理方法中统计各日均余额值区间的账户数的流程图。Fig. 2 is a flow chart of counting the number of accounts in each daily average balance value interval in a parallel processing method of account balance data according to an embodiment of the present invention.

图3为本发明实施例的一种账户余额数据的并行处理系统的结构示意图。FIG. 3 is a schematic structural diagram of a parallel processing system for account balance data according to an embodiment of the present invention.

图4为本发明应用示例的基于MapReduce的账户余额数据的处理架构示意图。FIG. 4 is a schematic diagram of a processing architecture of account balance data based on MapReduce in an application example of the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚明白，下文中将结合附图对本发明的实施例进行详细说明。需要说明的是，在不冲突的情况下，本申请中的实施例及实施例中的特征可以相互任意组合。In order to make the purpose, technical solution and advantages of the present invention more clear, the embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings. It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined arbitrarily with each other.

如图1所示，本发明实施例提供了一种账户余额数据的并行处理方法，该方法包括：As shown in Figure 1, an embodiment of the present invention provides a method for parallel processing of account balance data, the method comprising:

S10，一个或多个执行第一任务的Map节点读取账户余额明细数据的不同分片数据，生成所读取的分片数据中每一条余额记录的第一输出参数和第二输出参数；其中，所述第一输出参数至少包括账户ID，所述第二输出参数设定为账户状态信息，所述账户状态信息至少包括：余额值、交易日期和当天交易序号；S10, one or more Map nodes performing the first task read different fragmented data of the account balance detailed data, and generate the first output parameter and the second output parameter of each balance record in the read fragmented data; wherein , the first output parameter includes at least an account ID, the second output parameter is set as account status information, and the account status information includes at least: balance value, transaction date and transaction number of the day;

S20，一个或多个执行第一任务的Reduce节点读取所述执行第一任务的Map节点处理完毕的不同余额记录，根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录；其中，第一输出参数相同的余额记录由同一个Reduce节点读取；S20, one or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task, and generate each account according to the first output parameter and the second output parameter of the balance record The daily average balance value record; wherein, the balance records with the same first output parameter are read by the same Reduce node;

该方法还可以包括下述特点：The method may also include the following features:

优选地，在一个或多个执行第一任务的Map节点读取账户余额明细数据的不同分片数据之前，还包括：Preferably, before one or more Map nodes performing the first task read different fragmented data of account balance detailed data, it also includes:

优选地，所述根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录，包括：Preferably, the generating of the daily average balance value records of each account according to the first output parameter and the second output parameter of the balance record includes:

优选地，根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，包括：Preferably, according to the second output parameter of the balance record, the balance value of the account for each day within the query start and end time range is determined, including:

优选地，在一个或多个执行第一任务的Reduce节点读取所述执行第一任务的Map节点处理完毕的不同余额记录之前，还包括：Preferably, before one or more Reduce nodes executing the first task read the different balance records processed by the Map node executing the first task, further comprising:

优选地，每一条余额记录的第一参数的哈希值为将所述第一参数对执行第一任务的Reduce节点总个数取模；Preferably, the hash value of the first parameter of each balance record is the modulus of the first parameter to the total number of Reduce nodes executing the first task;

优选地，如图2所示，步骤S20后还包括：Preferably, as shown in Figure 2, after step S20, it also includes:

S30，确定每个账户的日均余额值所在的区间，统计每个区间内的账户数；S30, determining the interval in which the daily average balance of each account is located, and counting the number of accounts in each interval;

优选地，确定每个账户的日均余额值所在的区间，统计每个区间内的账户数，包括：Preferably, determine the interval in which the daily average balance of each account is located, and count the number of accounts in each interval, including:

S301，一个或多个执行第二任务的Map节点读取不同账户的日均余额值记录，生成所读取的日均余额值记录的第一输出参数和第二输出参数；其中，所述日均余额值记录的第一输出参数设定为所述日均余额值所在的区间，所述日均余额值记录的第二输出参数设定为1；S301, one or more Map nodes executing the second task read the daily average balance value records of different accounts, and generate the first output parameter and the second output parameter of the read daily average balance value records; wherein, the daily average balance value records The first output parameter of the average balance value record is set to the interval where the daily average balance value is located, and the second output parameter of the daily average balance value record is set to 1;

S302，一个或多个执行第二任务的Reduce节点读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录，根据所述日均余额值记录的第一输出参数和第二输出参数统计各日均余额值区间的账户数，包括：按照所述第一输出参数中的日均余额值区间，遍历同一日均余额值区间的各条日均余额值记录，将每一条日均余额值记录的第二输出参数进行累加，获得该日均余额值区间的账户数；其中，第一输出参数相同的日均余额值记录由同一个Reduce节点读取。S302. One or more Reduce nodes executing the second task read the different daily average balance value records processed by the Map node executing the second task, and according to the first output parameter and the second output parameter of the daily average balance value record, The output parameter counts the number of accounts in each average daily balance value interval, including: according to the daily average balance value interval in the first output parameter, traverse each daily average balance value record in the same daily average balance value interval, and record each daily average balance value interval The second output parameter of the average balance value record is accumulated to obtain the number of accounts in the daily average balance value interval; wherein, the daily average balance value records with the same first output parameter are read by the same Reduce node.

优选地，在一个或多个执行第二任务的Reduce节点读取所述执行第二任务的Map节点处理完毕的不同日均余额值记录之前，还包括：Preferably, before one or more Reduce nodes executing the second task read the different daily average balance value records processed by the Map node executing the second task, it also includes:

优选地，每一条日均余额值记录的第一参数的哈希值为将所述第一参数对执行第二任务的Reduce节点总个数取模；Preferably, the hash value of the first parameter of each daily average balance value record takes the modulus of the first parameter to the total number of Reduce nodes executing the second task;

其中，所述执行第二任务的Map节点与执行第一任务的Map节点是同一批节点或不同批的节点，也即，Map节点在执行完第一任务后，才可以执行第二任务。同理，所述执行第二任务的Reduce节点与执行第一任务的Reduce节点是同一批节点或不同批的节点，也即，Reduce节点在执行完第一任务后，才可以执行第二任务。Wherein, the Map nodes executing the second task and the Map nodes executing the first task are the same batch of nodes or different batches of nodes, that is, the Map nodes can execute the second task only after executing the first task. Similarly, the Reduce nodes executing the second task and the Reduce nodes executing the first task are the same batch of nodes or different batches of nodes, that is, the Reduce nodes can execute the second task only after executing the first task.

如图3所示，本发明实施例提供了一种账户余额数据的并行处理系统，该系统包括：As shown in Figure 3, an embodiment of the present invention provides a parallel processing system for account balance data, the system includes:

该系统还可以包括下述特点：The system may also include the following features:

优选地，所述执行第一任务的Reduce节点用于根据所述余额记录的第一输出参数和第二输出参数分别生成各账户的日均余额值记录，包括：按照所述第一输出参数中的账户ID，遍历同一账户的各条余额记录，根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，将每一天的余额值在所述查询起止时间范围内取平均得到该账户的日均余额值，生成该账户的日均余额值记录。Preferably, the Reduce node executing the first task is used to generate the daily average balance value records of each account according to the first output parameter and the second output parameter of the balance record, including: according to the first output parameter Account ID of the same account, traverse each balance record of the same account, determine the balance value of the account for each day within the query start and end time range according to the second output parameter of the balance record, and calculate the balance value of each day at the start and end of the query Take the average within the time range to obtain the daily average balance value of the account, and generate the daily average balance value record of the account.

优选地，所述Map处理模块还包括一个或多个执行第二任务的Map节点，所述Reduce处理模块还包括一个或多个执行第二任务的Reduce节点；Preferably, the Map processing module further includes one or more Map nodes executing the second task, and the Reduce processing module further includes one or more Reduce nodes executing the second task;

优选地，所述执行第一任务的Reduce节点用于根据所述余额记录的第二输出参数确定该账户在查询起止时间范围内的每一天的余额值，包括：从查询起始时间的当日至查询终止时间的当日，判断每一天是否存在余额记录，如存在，将当日交易序号最大的余额记录的余额值作为当日的最终余额值，如不存在，追溯早于当日且具有余额记录的最近日期，将所述最近日期那一天的交易序号最大的余额记录的余额值作为当日的最终余额值。Preferably, the Reduce node executing the first task is used to determine the balance value of the account for each day within the query start and end time range according to the second output parameter of the balance record, including: from the current day of the query start time to On the day of the query termination time, determine whether there is a balance record for each day. If it exists, use the balance value of the balance record with the largest transaction number on that day as the final balance value of the day. If it does not exist, trace back to the latest date with a balance record that is earlier than the current day , taking the balance value of the balance record with the largest transaction sequence number on the most recent date as the final balance value of the day.

优选地，所述Reduce处理模块还包括第一任务路由模块：Preferably, the Reduce processing module also includes a first task routing module:

优选地，所述Reduce处理模块还包括第二任务路由模块：Preferably, the Reduce processing module also includes a second task routing module:

优选地，所述Map处理模块还包括分片模块：Preferably, the Map processing module also includes a fragmentation module:

应用示例Application example

下面给出一个应用例子：统计各账户今年1月1日到1月8日的日均余额，以及各账户日均余额的分布区间。(假设有两个账户：分别是id001和id002)An application example is given below: statistics of the average daily balance of each account from January 1 to January 8 this year, and the distribution range of the average daily balance of each account. (Suppose there are two accounts: id001 and id002 respectively)

两个账户的原始明细数据如表1所示，包括去年(2013年)的部分全量数据以及今年(2014年)1月1日至1月8日的增量数据。The original detailed data of the two accounts are shown in Table 1, including part of the full data of last year (2013) and incremental data from January 1 to January 8 of this year (2014).

日期date 账户IDAccount ID 余额(元)balance (yuan) 序号serial number 2014010120140101 002002 2020 11 2014010220140102 002002 1010 11 2014010220140102 002002 5050 22 2014010320140103 002002 3030 11 2014010320140103 002002 1515 22

2014010620140106 002002 4040 11 2014010620140106 002002 5050 22 2014010620140106 002002 6060 33 2014010820140108 002002 3030 11 2013123020131230 002002 9090 11 2014010620140106 001001 6060 11 2014010820140108 001001 3030 11 2013122520131225 001001 9090 11

表1Table 1

如图4所示，基于MapReduce启动2个任务，对帐户信息进行并行处理和计算。As shown in Figure 4, two tasks are started based on MapReduce to process and calculate account information in parallel.

第一个任务将输入的大规模数据文件分成若干分片交给Map节点并行处理，Map阶段将数据按照<帐户id，账户信息〉的形式进行输出，该步操作对海量的数据进行了分类预处理，将相同帐户的明细数据定向到同一个Reduce节点中合并处理，比如，根据账号id哈希Hash值(针对Reduce节点数量取模)分组并路由给多个Reduce并行处理；Reduce阶段对Map阶段输出的每条帐户的明细数据进行处理，包括：The first task divides the input large-scale data file into several fragments and sends them to the Map node for parallel processing. In the Map stage, the data is output in the form of <account id, account information>. This step classifies and pre-classifies the massive data Processing, directing the detailed data of the same account to the same Reduce node for combined processing, for example, grouping and routing to multiple Reduce nodes for parallel processing according to the account id hash Hash value (modulo the number of Reduce nodes); The output detailed data of each account is processed, including:

a)读取从去年全量文件到查询终止时间当天增量文件；a) Read the incremental files from the full amount of files last year to the query termination time;

b)Map阶段对所有的数据进行分类，输出<帐户id，帐户明细数据信息〉b) The Map stage classifies all the data and outputs <account id, account detail data information>

c)Reduce阶段选出同一个帐户id中明细数据进行组织和处理，确定每一个账户在查询起始时间到终止时间的统计周期内每一天的余额值，然后计算出每个帐户在该统计周期内的日均余额。c) In the Reduce stage, the detailed data in the same account id is selected for organization and processing, and the balance value of each account is determined for each day in the statistical cycle from the query start time to the end time, and then calculated for each account in the statistical cycle average daily balance.

帐户id001和帐户id002的所有数据分别被两个Reduce节点拉取处理。对每一个账户，Reduce节点从查询起始时间的当日至查询终止时间的当日，判断每一天是否存在余额记录，如存在，将当日交易序号最大的余额记录的余额值作为当日的最终余额值，如不存在，追溯早于当日且具有余额记录的最近日期，将所述最近日期那一天的交易序号最大的余额记录的余额值作为当日的最终余额值。All data of account id001 and account id002 are fetched and processed by two Reduce nodes respectively. For each account, the Reduce node judges whether there is a balance record for each day from the day of the query start time to the day of the query end time. If there is, the balance value of the balance record with the largest transaction number on the day is taken as the final balance value of the day. If it does not exist, trace back to the latest date that is earlier than the current day and has a balance record, and use the balance value of the balance record with the largest transaction sequence number on the day of the latest date as the final balance value of the current day.

比如，查询起始时间当天(今年1月1日)，002账户有余额记录，则将002账户在2014年1月1日交易序号为1的余额记录的余额值“20元”作为002账户在在2014年1月1日的余额值；001账户在2014年1月1日没有余额记录，则将2013年的全量数据中距离今年最近的一条余额记录(2013年12月25日，交易序号为1的余额记录)的余额值“90元”作为001账户在2014年1月1日的余额值。For example, on the day when the query starts (January 1 this year), and there is a balance record in account 002, the balance value "20 yuan" in the balance record of account 002 on January 1, 2014 with transaction number 1 is taken as the balance value of account 002. The balance value on January 1, 2014; if the 001 account has no balance record on January 1, 2014, the balance record closest to this year in the full amount of data in 2013 (on December 25, 2013, the transaction number is 1) The balance value "90 yuan" is used as the balance value of the 001 account on January 1, 2014.

对002账户，第一Reduce节点确定的002账户在查询起止时间范围内每一天的余额值，如表2所示；For the 002 account, the balance value of the 002 account determined by the first Reduce node within the query start and end time range for each day, as shown in Table 2;

查询日期query date 余额值(元)Balance value (yuan) 2014010120140101 2020 2014010220140102 5050 2014010320140103 1515 2014010420140104 1515 2014010520140105 4040 2014010620140106 6060 2014010720140107 6060 2014010820140108 3030

表2Table 2

对001账户，第一Reduce节点确定的002账户在查询起止时间范围内每一天的余额值，如表3所示；For account 001, the balance value of account 002 determined by the first Reduce node within the query start and end time range for each day, as shown in Table 3;

日期date 余额值balance value 2014010120140101 9090 2014010220140102 9090

2014010320140103 9090 2014010420140104 9090 2014010520140105 9090 2014010620140106 6060 2014010720140107 6060 2014010820140108 3030

表3table 3

帐户002截止到2014年1月8日的日均余额计算如下：(20+50+15+15+40+60+60+30)/8＝36.25(元)；The daily average balance of account 002 as of January 8, 2014 is calculated as follows: (20+50+15+15+40+60+60+30)/8=36.25 (yuan);

帐户001截止到2014年1月8日的日均余额计算如下：(90+90+90+90+90+60+60+30)/8＝75；The average daily balance of account 001 as of January 8, 2014 is calculated as follows: (90+90+90+90+90+60+60+30)/8=75;

第二个任务根据第一个任务的计算结果，统计出各账户日均余额值的分布情况。The second task calculates the distribution of the daily average balance of each account based on the calculation results of the first task.

根据第一个任务所输出的日均余额，在第二个任务的Map阶段对日均余额进行判断。例如，设置区间[0,15)，[15,50)，[50,100]。则帐户002的日均余额所属区间为[15,50)，帐户001的日均余额所属区间为[50,100]。对账户002，以区间[15,50)作为第一输出参数(key)，1作为第二输出参数(value)；对账户001，以区间[50,100]作为第一输出参数(key)，1作为第二输出参数(value)。在第二个任务的Reduce端，对每一个区间中的value进行累加并输出，结果即为各区间的账户分布数。According to the daily average balance output by the first task, the daily average balance is judged in the Map stage of the second task. For example, set intervals [0,15), [15,50), [50,100]. Then the interval of the average daily balance of account 002 is [15,50), and the interval of the average daily balance of account 001 is [50,100]. For account 002, the interval [15,50] is used as the first output parameter (key), and 1 is used as the second output parameter (value); for account 001, the interval [50,100] is used as the first output parameter (key), and 1 is used as the second output parameter (value). The second output parameter (value). At the Reduce end of the second task, the value in each interval is accumulated and output, and the result is the account distribution number in each interval.

上述实施例提供的一种账户余额数据的并行处理方法和系统，基于MapReduce将大规模的帐户余额明细数据分成若干份交给Map节点并行处理，Map阶段对数据按照帐户进行了分类，处理完成后根据帐户id分组并路由给多个Reduce节点并行处理，从而快速获取大数据量下的帐户日均余额的统计结果，处理效率高、可扩展性强。The method and system for parallel processing of account balance data provided by the above embodiments, based on MapReduce, divides large-scale account balance detailed data into several parts and hands them to Map nodes for parallel processing. In the Map stage, the data is classified according to accounts. After the processing is completed According to the account id, it is grouped and routed to multiple Reduce nodes for parallel processing, so as to quickly obtain the statistical results of the average daily balance of the account under a large amount of data, with high processing efficiency and strong scalability.

本领域普通技术人员可以理解上述方法中的全部或部分步骤可通过程序来指令相关硬件完成，所述程序可以存储于计算机可读存储介质中，如只读存储器、磁盘或光盘等。可选地，上述实施例的全部或部分步骤也可以使用一个或多个集成电路来实现，相应地，上述实施例中的各模块/单元可以采用硬件的形式实现，也可以采用软件功能模块的形式实现。本发明不限制于任何特定形式的硬件和软件的结合。Those skilled in the art can understand that all or part of the steps in the above method can be completed by instructing relevant hardware through a program, and the program can be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disk, and the like. Optionally, all or part of the steps in the above embodiments can also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the above embodiments can be implemented in the form of hardware, or can be implemented in the form of software function modules. The form is realized. The present invention is not limited to any specific combination of hardware and software.

需要说明的是，本发明还可有其他多种实施例，在不背离本发明精神及其实质的情况下，熟悉本领域的技术人员可根据本发明作出各种相应的改变和变形，但这些相应的改变和变形都应属于本发明所附的权利要求的保护范围。It should be noted that the present invention can also have other various embodiments, without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and deformations according to the present invention, but these Corresponding changes and deformations should belong to the scope of protection of the appended claims of the present invention.

Claims

1. a method for parallel processing for account balance data, the method comprises:

The Map node of one or more execution first tasks reads the different fragment datas of account balance detailed data, generates the first output parameter and second output parameter of each remaining sum record in the fragment data reading; Wherein, described the first output parameter at least comprises account ID, and described the second output parameter is set as account status information, and described account status information at least comprises: remaining sum value, trade date and daylight trading sequence number;

The Reduce node of one or more execution first tasks reads the complete different remaining sum records of Map node processing of described execution first task, and the first output parameter recording according to described remaining sum and the second output parameter generate respectively the average daily remaining sum value record of each account; Wherein, the remaining sum record that the first output parameter is identical is read by same Reduce node.

2. the method for claim 1, is characterized in that:

Described the first output parameter recording according to described remaining sum and the second output parameter generate respectively the average daily remaining sum value record of each account, comprising:

According to the account ID in described the first output parameter, travel through each remaining sum record of same account, determine the remaining sum value of every day within inquiry beginning and ending time account according to the second output parameter of described remaining sum record, the remaining sum value of every day is averaged to the average daily remaining sum value that obtains the account within the described inquiry beginning and ending time, generates the average daily remaining sum value record of the account.

3. the method for claim 1, is characterized in that, generates respectively after the average daily remaining sum value record of each account according to the first output parameter and second output parameter of described remaining sum record at the Reduce of described execution first task node, also comprises:

The Map node of one or more execution the second tasks reads the average daily remaining sum value record of different accounts, generates the first output parameter and second output parameter of the average daily remaining sum value record reading; Wherein, the first output parameter of described average daily remaining sum value record is set as the interval at described average daily remaining sum value place, and the second output parameter of described average daily remaining sum value record is set as 1;

The Reduce node of one or more execution the second tasks reads the complete average daily remaining sum value record of difference of Map node processing of described execution the second task, according to the account number in the first output parameter of described average daily remaining sum value record and the second output parameter statistics each average daily remaining sum value interval, comprise: according to the average daily remaining sum value interval in described the first output parameter, travel through each the average daily remaining sum value record in same average daily remaining sum value interval, the second output parameter of each average daily remaining sum value record is added up, obtain the account number in this average daily remaining sum value interval; Wherein, the average daily remaining sum value record that the first output parameter is identical is read by same Reduce node.

4. method as claimed in claim 2, is characterized in that:

The remaining sum value of determining every day within inquiry beginning and ending time account according to the second output parameter of described remaining sum record, comprising:

From the same day of inquiry initial time to the same day of inquiring about the termination time, judge and whether have remaining sum record every day, as existed, resulting balance value using the remaining sum value of the remaining sum record of day trade transaction sequence number maximum as the same day, if do not existed, reviewed early than the same day and there is the nearest date that remaining sum records, the resulting balance value using the remaining sum value of the remaining sum record of the transaction sequence number maximum of described nearest that day on date as the same day.

5. the method for claim 1, is characterized in that:

Before the Reduce of one or more execution first tasks node reads the complete different remaining sums records of the Map node processing of described execution first task, also comprise:

Calculate the cryptographic hash of the first parameter of each remaining sum record, set up the mapping relations of the cryptographic hash of the first parameter and the Reduce node of described execution first task; Wherein, described mapping relations read corresponding remaining sum record for the Reduce node for described execution first task according to described mapping relations.

6. method as claimed in claim 3, is characterized in that:

Before the Reduce node of one or more execution the second tasks reads the complete average daily remaining sum value record of difference of the Map node processing of described execution the second task, also comprise:

Calculate the cryptographic hash of the first parameter of each average daily remaining sum value record, set up the cryptographic hash of the first parameter and the mapping relations of the Reduce node of second task of execution; Wherein, described mapping relations read corresponding average daily remaining sum value record for the Reduce node for described execution the second task according to described mapping relations.

7. the method for claim 1, is characterized in that:

Read the different fragment datas of account balance detailed data at the Map of one or more execution first tasks node before, also comprise:

Determine and comprise the read range of account balance detailed data according to the inquiry beginning and ending time: by full dose balance detail data be defined as the read range of account balance detailed data by the end of the increment balance detail data on inquiry same day termination time;

By the account balance detailed data burst belonging in this read range, set up the mapping relations of the Map node of each burst and execution first task; Wherein, described mapping relations read corresponding fragment data for the Map node for described execution first task according to described mapping relations.

8. a parallel processing system (PPS) for account balance data, this system comprises:

Map processing module, comprises the Map node of one or more execution first tasks; Each Map node of carrying out first task, for reading the different fragment datas of account balance detailed data, generates the first output parameter and second output parameter of each remaining sum record in the fragment data reading; Wherein, described the first output parameter at least comprises account ID, and described the second output parameter is set as account status information, and described account status information at least comprises: remaining sum value, trade date and daylight trading sequence number;

Reduce processing module, comprises the Reduce node of one or more execution first tasks; Each Reduce node of carrying out first task is for reading the complete different remaining sum records of Map node processing of described execution first task, and the first output parameter recording according to described remaining sum and the second output parameter generate respectively the average daily remaining sum value record of each account; Wherein, the remaining sum record that the first output parameter is identical is read by same Reduce node.

9. system as claimed in claim 8, is characterized in that:

The Reduce node of described execution first task is for generating respectively the average daily remaining sum value record of each account according to the first output parameter of described remaining sum record and the second output parameter, comprise: according to the account ID in described the first output parameter, travel through each remaining sum record of same account, determine the remaining sum value of every day within inquiry beginning and ending time account according to the second output parameter of described remaining sum record, the remaining sum value of every day is averaged to the average daily remaining sum value that obtains the account within the described inquiry beginning and ending time, generate the average daily remaining sum value record of the account.

10. system as claimed in claim 8, is characterized in that, described Map processing module also comprises the Map node of one or more execution the second tasks, and described Reduce processing module also comprises the Reduce node of one or more execution the second tasks;

Each Map node of carrying out the second task, for reading the average daily remaining sum value record of different accounts, generates the first output parameter and second output parameter of the average daily remaining sum value record reading; Wherein, the first output parameter of described average daily remaining sum value record is set as the interval at described average daily remaining sum value place, and the second output parameter of described average daily remaining sum value record is set as 1;

Each Reduce node of carrying out the second task is for reading the complete average daily remaining sum value record of difference of Map node processing of described execution the second task, according to the account number in the first output parameter of described average daily remaining sum value record and the second output parameter statistics each average daily remaining sum value interval, comprise: according to the average daily remaining sum value interval in described the first output parameter, travel through each the average daily remaining sum value record in same average daily remaining sum value interval, the second output parameter of each average daily remaining sum value record is added up, obtain the account number in this average daily remaining sum value interval; Wherein, the average daily remaining sum value record that the first output parameter is identical is read by same Reduce node.

11. systems as claimed in claim 9, is characterized in that:

The Reduce node of described execution first task is for determining the remaining sum value of every day within inquiry beginning and ending time account according to the second output parameter of described remaining sum record, comprise: from the same day of inquiry initial time to the same day of inquiring about the termination time, judge and whether have remaining sum record every day, as existed, resulting balance value using the remaining sum value of the remaining sum record of day trade transaction sequence number maximum as the same day, if do not existed, reviewed early than the same day and there is the nearest date that remaining sum records, resulting balance value using the remaining sum value of the remaining sum record of the transaction sequence number maximum of described nearest that day on date as the same day.

12. systems as claimed in claim 8, is characterized in that, described Reduce processing module also comprises first task routing module:

Described first task routing module, for read the complete different remaining sums records of the Map node processing of described execution first task at the Reduce of one or more execution first tasks node before, calculate the cryptographic hash of the first parameter of each remaining sum record, set up the mapping relations of the cryptographic hash of the first parameter and the Reduce node of described execution first task; Wherein, described mapping relations read corresponding remaining sum record for the Reduce node for described execution first task according to described mapping relations.

13. systems as claimed in claim 10, is characterized in that, described Reduce processing module also comprises the second task routing module:

Described the second task routing module, for read the complete average daily remaining sum value record of difference of the Map node processing of described execution the second task at the Reduce node of one or more execution the second tasks before, calculate the cryptographic hash of the first parameter of each average daily remaining sum value record, set up the cryptographic hash of the first parameter and the mapping relations of the Reduce node of second task of execution; Wherein, described mapping relations read corresponding average daily remaining sum value record for the Reduce node for described execution the second task according to described mapping relations.

14. systems as claimed in claim 8, is characterized in that, described Map processing module also comprises burst module:

Described burst module, for before the Map of one or more execution first tasks node reads the different fragment datas of account balance detailed data, determine and comprise the read range of account balance detailed data according to the inquiry beginning and ending time: by full dose balance detail data be defined as the read range of account balance detailed data by the end of the increment balance detail data on inquiry same day termination time; By the account balance detailed data burst belonging in this read range, set up the mapping relations of the Map node of each burst and execution first task; Wherein, described mapping relations read corresponding fragment data for the Map node for described execution first task according to described mapping relations.