CN104391748A

CN104391748A - MapReduce calculation process optimization method

Info

Publication number: CN104391748A
Application number: CN201410673548.2A
Authority: CN
Inventors: 刘晶; 杨晋博; 黄敏
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: IEIT Systems Co Ltd
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2015-03-04

Abstract

The invention discloses a method for optimizing the MapReduce calculation process. First, the original data file is divided into several files, and one file is selected from the set of unprocessed files as the input of a sub-job. The method then determines whether there is a file that needs to be merged. If there is none, the task is submitted, Map tasks that all use the same processing logic are started, the Map operation is executed, the Map output is sorted, merged and partitioned, the Map output is received, the Reduce operation is executed, and the output is saved. If there is a file that needs to be merged, the task is submitted, Map tasks with several different kinds of processing are started, each kind of input data is sent to its corresponding Map, the Map operation is executed, and the multiple outputs are sorted, merged and partitioned. Finally, the method checks whether any data file in the original data file set is still unprocessed. If none is, the program ends; otherwise, the process is executed again on the partitioned data files. The invention spreads the output over time, reduces instantaneous network traffic, reduces local disk occupancy, and thereby improves the MapReduce calculation process.

Description

A MapReduce calculation process optimization method

Technical field

The invention relates to the technical field of computer software and parallel computing, and specifically to a method for optimizing the MapReduce calculation process by reducing the amount of intermediate data stored on local disks while a program runs, thereby lowering disk load.

Background art

With the rapid development of computer and Internet technology, network penetration and the number of Internet users have risen year by year. The combined pressure of a growing user base and rapidly growing data volumes has brought new challenges to Internet applications. Massive data requires storage resources on a huge scale, and the increasing dependence of network applications on data creates ever stronger demand for the ability to compute over and process massive data. The cost of maintaining data storage and data processing for these applications keeps rising. Hadoop has developed vigorously in the few short years since it appeared, which demonstrates its technical capability and application value. However, Hadoop is still young and incomplete in many respects, and many companies have begun research on improving and optimizing it. Further improving MapReduce performance is therefore both necessary and meaningful.

MapReduce is a software framework proposed by Google for parallel computation over large data sets (larger than 1 TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, along with features borrowed from vector programming languages. [1] In current software implementations, the user specifies a Map function, which maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function, which ensures that all of the mapped key-value pairs that share the same key are grouped together.

Summary of the invention

The technical problem to be solved by the present invention is as follows: to improve MapReduce task processing capability, a MapReduce calculation process optimization method is provided that addresses the high memory usage, the concentrated consumption of network resources and resulting network congestion, and the resource shortage caused by high disk load during MapReduce task processing.

The technical solution adopted by the present invention is:

A MapReduce calculation process optimization method: first, the original data file is divided into several files, and one file is selected from the set of unprocessed files as the input of a sub-job. The method then determines whether there is a file that needs to be merged. If there is none, the task is submitted, Map tasks that all use the same processing logic are started, the Map operation is executed, the Map output is sorted, merged and partitioned, the Map output is received, the Reduce operation is executed, and the output is saved. If there is a file that needs to be merged, the task is submitted, Map tasks with several different kinds of processing are started, each kind of input data is sent to its corresponding Map, the Map operation is executed, and the multiple outputs are sorted, merged and partitioned. Finally, the method checks whether any data file in the original data file set is still unprocessed. If none is, the program ends; otherwise, the process is executed again on the partitioned data files.
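
The control flow described above can be illustrated with a minimal in-memory sketch. It is not the patented implementation and does not use Hadoop: word count is assumed as the application, the sort, merge and partition steps are collapsed into one grouping step, and the names `map_raw`, `map_previous`, `shuffle`, `reduce_counts`, `run_subjob` and `run_all` are hypothetical.

```python
from collections import defaultdict


def map_raw(line):
    """Map for raw input (assumed word count): emit (word, 1) per word."""
    for word in line.split():
        yield word, 1


def map_previous(record):
    """Map for the previous sub-job's output: pass (word, count) through."""
    word, count = record
    yield word, count


def shuffle(pairs):
    """Group values by key; stands in for sort, merge and partition."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_subjob(part, previous=None):
    """Run one sub-job on one part, merging the previous result if present."""
    pairs = [kv for line in part for kv in map_raw(line)]
    if previous is not None:
        # A file needs merging: use a second kind of Map for it.
        pairs += [kv for rec in previous.items() for kv in map_previous(rec)]
    return reduce_counts(shuffle(pairs))


def run_all(parts):
    """Process the parts one at a time until none is left unprocessed."""
    unprocessed = list(parts)
    result = None
    while unprocessed:
        part = unprocessed.pop(0)
        result = run_subjob(part, result)
    return result
```

For example, `run_all([["a b a"], ["b c"], ["a"]])` returns `{"a": 3, "b": 2, "c": 1}`: the first sub-job takes the branch without merging, and every later sub-job takes the merging branch.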

The architecture involved in the method comprises an input data processing module, a data result merging module, and a module that combines the calculation process with the merging process, wherein:

The input data processing module is responsible for splitting the raw data into several parts. This split is not the system's own splitting of the input data into input splits; it must be done manually, and each part is far larger than an input split block. One job is started for each part, and the sub-jobs are executed one after another in order. At any moment while the program runs, only one sub-job is running on the system and only part of the raw data is being operated on. As a result, each sub-job produces relatively little intermediate data, the running time of a single job is shortened, and intermediate data is deleted promptly;
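
As a rough illustration of the manual split, the following sketch groups input lines into parts of bounded size without breaking a line. The function name `split_into_parts` and the `max_bytes` parameter are assumptions made for this example; the source does not say how the split is performed or how large each part should be.

```python
def split_into_parts(lines, max_bytes):
    """Yield lists of lines whose combined UTF-8 size stays within max_bytes.

    A single line larger than max_bytes still gets a part of its own.
    """
    part, size = [], 0
    for line in lines:
        n = len(line.encode("utf-8"))
        if part and size + n > max_bytes:
            # The current part is full: hand it out and start a new one.
            yield part
            part, size = [], 0
        part.append(line)
        size += n
    if part:
        yield part
```

In practice `max_bytes` would be chosen much larger than the framework's input split size, in line with the statement that each part is far larger than an input split block.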

Because the output of each sub-job is computed from only part of the data rather than from the entire raw data, the partial results overlap heavily. The data result merging module is responsible for merging the overlapping results and reducing redundant intermediate data;

The module that combines the calculation process with the merging process is responsible for handling sub-job calculation and merging together, merging while calculating. Except for the first sub-job, each sub-job receives both raw input data and the calculation result of the previous sub-job. Depending on the type of application, the map function for the merged data is adjusted accordingly, so that after the Map phase ends, all intermediate data is a set of key-value pairs in the same format and the subsequent Reduce tasks are not affected.
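
To show why the map function for merged data depends on the application, here is a small sketch for an assumed averaging job. The record formats (`sensor,value` for raw input and tab-separated `sensor`, `sum`, `count` for the previous result) and all function names are hypothetical. The point is only that both map functions emit key-value pairs of the same shape, so a single Reduce can consume them.

```python
def map_raw_reading(line):
    """Raw input 'sensor,value' -> (sensor, (value, 1))."""
    sensor, value = line.strip().split(",")
    return sensor, (float(value), 1)


def map_previous_output(line):
    """Previous result 'sensor<TAB>sum<TAB>count' -> (sensor, (sum, count))."""
    sensor, total, count = line.strip().split("\t")
    return sensor, (float(total), int(count))


def reduce_partial(values):
    """Combine (sum, count) pairs for one key into a single (sum, count)."""
    total = sum(v for v, _ in values)
    count = sum(c for _, c in values)
    return total, count
```

Carrying `(sum, count)` instead of a finished average is a design choice of this example: it keeps the partial result mergeable in the next sub-job, whereas a stored average could not be combined correctly with new readings.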

The method is implemented as follows:

1) Build a Hadoop cluster, and provide input data files of 2 GB and 20 GB to each of the unoptimized single-job MapReduce program and the optimized multi-job MapReduce program;

2) Split the 2 GB and 20 GB input data into several data files and run the sub-jobs iteratively. One job run is started for each sub-job, and the sub-jobs are executed one after another in order. A multi-input operation is used to handle sub-job calculation and merging together, merging while calculating. Except for the first sub-job, each sub-job receives both raw input data and the calculation result of the previous sub-job. Depending on the type of application, the map function for the merged data is adjusted accordingly.
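
The chaining of sub-jobs in step 2) can be sketched as a plan that pairs each input path with the kind of Map that should read it. This is only an illustration: the path layout, the labels `raw_mapper` and `merge_mapper`, and the function `plan_subjobs` are assumptions, and no Hadoop API is called.

```python
def plan_subjobs(part_paths, output_dir):
    """Build an ordered list of sub-job descriptions.

    Each sub-job reads one raw part; every sub-job after the first also
    reads the output of the sub-job before it.
    """
    plans = []
    previous_output = None
    for index, part in enumerate(part_paths):
        inputs = [(part, "raw_mapper")]
        if previous_output is not None:
            inputs.append((previous_output, "merge_mapper"))
        output = f"{output_dir}/subjob-{index}"
        plans.append({"inputs": inputs, "output": output})
        previous_output = output
    return plans
```

The output of the last entry in the plan is the final result, because each sub-job folds the previous partial result into its own.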

Beneficial effects of the present invention: because the input data is spread across multiple jobs, the Map phase of each job outputs relatively little intermediate data and each job runs for a short time, so the Map-phase output of a job can be deleted from the local disk promptly instead of occupying it for a long time. In fact, the optimized MapReduce program does not produce less intermediate data than the unoptimized program. It only spreads the output over time, which spreads out disk I/O operations and reduces instantaneous network traffic. The Reduce operation of the next stage is applied to the intermediate data promptly, which reduces local disk occupancy and thereby improves the MapReduce calculation process.
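
The distinction between total and peak intermediate data can be made concrete with a deliberately simplified model, which is an assumption of this example rather than something stated in the source: sub-jobs run one at a time, and each sub-job's intermediate data is deleted before the next one starts. The sizes passed in are hypothetical and unitless.

```python
def disk_profile(intermediate_sizes):
    """Return (total, peak) intermediate data under the simplified model.

    total: everything written over the whole run.
    peak:  the most that is on local disk at any one time, given that
           only one sub-job's intermediate data exists at a time.
    """
    total = sum(intermediate_sizes)
    peak = max(intermediate_sizes, default=0)
    return total, peak
```

With hypothetical sizes, a single job `disk_profile([20])` gives `(20, 20)`, while four sub-jobs `disk_profile([5, 6, 6, 6])` give `(23, 6)`: the total is not smaller, since later sub-jobs also re-map the previous result, but the peak is.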

Description of the drawings

Fig. 1 is a flow chart of the optimized MapReduce calculation process of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the accompanying drawing and a specific embodiment:

A MapReduce calculation process optimization method: first, the original data file needs to be divided into several files, and one file is selected from the set of unprocessed files as the input of a sub-job. The method then determines whether there is a file that needs to be merged. If there is none, the task is submitted, Map tasks that all use the same processing logic are started, the Map operation is executed, the Map output is sorted, merged and partitioned, the Map output is received, the Reduce operation is executed, and the output is saved. If there is a file that needs to be merged, the task is submitted, Map tasks with several different kinds of processing are started, each kind of input data is sent to its corresponding Map, the Map operation is executed, and the multiple outputs are sorted, merged and partitioned. Finally, the method checks whether any data file in the original data file set is still unprocessed. If none is, the program ends; otherwise, the process is executed again on the partitioned data files.

After the calculation process finished, we found that in the optimized MapReduce program the input data is spread across multiple jobs, the Map phase of each job outputs relatively little intermediate data, and each job runs for a short time, so the Map-phase output of a job can be deleted from the local disk promptly instead of occupying it for a long time. The total amount of intermediate data actually produced by the optimized MapReduce program is not less than that of the unoptimized program. We only spread the output over time, which spreads out disk I/O operations and reduces instantaneous network traffic. The Reduce operation of the next stage is applied to the intermediate data promptly, which reduces local disk occupancy. The optimized MapReduce program therefore performed better in testing.

Claims

1. A MapReduce calculation process optimization method, characterized in that: first, the original data file is divided into several files, and one file is selected from the set of unprocessed files as the input of a sub-job; it is determined whether there is a file that needs to be merged; if there is none, the task is submitted, Map tasks that all use the same processing logic are started, the Map operation is executed, the Map output is sorted, merged and partitioned, the Map output is received, the Reduce operation is executed, and the output is saved; if there is a file that needs to be merged, the task is submitted, Map tasks with several different kinds of processing are started, each kind of input data is sent to its corresponding Map, the Map operation is executed, and the multiple outputs are sorted, merged and partitioned; finally, it is checked whether any data file in the original data file set is still unprocessed; if none is, the program ends; otherwise, the process is executed again on the partitioned data files.

2. The MapReduce calculation process optimization method according to claim 1, characterized in that the architecture involved in the method comprises an input data processing module, a data result merging module, and a module that combines the calculation process with the merging process, wherein:

the input data processing module is responsible for splitting the raw data into several parts; this split is not the system's own splitting of the input data into input splits, but must be done manually, and each part is far larger than an input split block; one job is started for each part, and the sub-jobs are executed one after another in order; at any moment while the program runs, only one sub-job is running on the system and only part of the raw data is being operated on;

the data result merging module is responsible for merging overlapping results and reducing redundant intermediate data;

the module that combines the calculation process with the merging process is responsible for handling sub-job calculation and merging together, merging while calculating; except for the first sub-job, each sub-job receives both raw input data and the calculation result of the previous sub-job; depending on the type of application, the map function for the merged data is adjusted accordingly.

3. The MapReduce calculation process optimization method according to claim 1 or 2, characterized in that the method is implemented by the following steps:

1) a Hadoop cluster is built, and input data files of 2 GB and 20 GB are provided to each of the unoptimized single-job MapReduce program and the optimized multi-job MapReduce program;

2) the 2 GB and 20 GB input data are split into several data files and the sub-jobs are run iteratively; one job run is started for each sub-job, and the sub-jobs are executed one after another in order; a multi-input operation is used to handle sub-job calculation and merging together, merging while calculating; except for the first sub-job, each sub-job receives both raw input data and the calculation result of the previous sub-job; depending on the type of application, the map function for the merged data is adjusted accordingly.