CN101833438A - A general data processing method based on multiple parallelism - Google Patents
- Publication number
- CN101833438A CN101833438A CN201010150549A CN201010150549A CN101833438A CN 101833438 A CN101833438 A CN 101833438A CN 201010150549 A CN201010150549 A CN 201010150549A CN 201010150549 A CN201010150549 A CN 201010150549A CN 101833438 A CN101833438 A CN 101833438A
- Authority
- CN
- China
- Prior art keywords: data, task, execution, application program, parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Description
Technical Field
The present invention relates to the field of parallel computing, and in particular to a general data-parallel processing method based on a heterogeneous multi-core architecture.
Background Art
With the rapid development of science and technology, high-performance computing has become a research instrument of strategic importance. Together with traditional theoretical research and laboratory experiment, it forms one of the three complementary, interrelated "pillars" of 21st-century scientific research. High-performance computers are applied chiefly in scientific R&D, telecommunications, finance, and government, and their contribution to national development is substantial; to accelerate the pace of informatization, an ever-growing number of fields now rely on high-performance computing technology. High-performance computing dramatically speeds up calculation and shortens development and production cycles; its application has greatly broadened research capability and driven the progress of modern science and engineering. Accelerating its development is of great strategic significance to China for strengthening independent innovation in science and technology, enhancing national competitiveness, safeguarding national security, promoting national economic construction, and building an innovative country.
During the evolution of high-performance computing, RISC-based minicomputers once dominated the market; later, the far cheaper X86 architecture, deployed in cluster form, displaced them. Although distributed systems can solve some large-scale computing problems, they suffer from high communication overhead and failure rates, complex and costly data access structures, and data security and confidentiality that are difficult to control. With the rapid growth in processor performance, particularly of the GPU (Graphics Processing Unit), and falling prices, high-performance computing has gradually reached the desktop (low-end) market, putting what amounts to a personal supercomputer within reach of every researcher, scientist, and engineer, so that problems are solved faster and the pace of science accelerates. A current GPU contains hundreds of processing units, delivers about 1 TFLOPS for single-precision and over 80 GFLOPS for double-precision floating-point arithmetic, and may carry 4 GB of video memory with more than 100 GB/s of bandwidth.
Although the GPU was originally a processor designed for graphics computation, its strong computing performance, low energy consumption, low price, and small footprint have quickly made it particularly attractive for large-scale parallel computation in many non-graphics high-performance applications. Today, many important scientific projects are adding GPU computing to their codes, and software engineers eagerly expect substantial performance gains from it.
However, most current applications ported directly to the GPU see no immediate speedup, and performance may even degrade. This is mainly because their code and data structures were not designed around the characteristics of the GPU architecture and therefore cannot exploit its full computing power. Achieving efficient data processing with parallel applications is usually complex and time-consuming work.
Summary of the Invention
The present invention provides a multiple-parallelism data processing method that combines data parallelism, task parallelism, and pipeline parallelism, enabling an application to make maximally effective use of the hardware's computing and storage resources during data processing.
A general data processing method based on multiple parallelism, executed on a computer equipped with both GPU and CPU processors:
(1) Divide the data-processing application into a number of execution behaviors.
Each execution behavior completes at least one basic operation on data, such as a data access or a data store, or a computation instruction.
(2) According to the type of basic data operation and computation instruction each execution behavior performs, group all execution behaviors into a number of tasks, i.e., place similar execution behaviors in the same computing task.
Similar execution behaviors are those that perform the same computation or similar storage operations; storage operations are similar when their data accesses stay within a local region of the storage area.
This grouping matches the hardware's SIMD (Single Instruction, Multiple Data) execution model and the locality characteristics of its memory hierarchy.
Each task completes a designated computation; tasks should be kept as short and single-purpose as possible, and may run in parallel or serially depending on the situation.
(3) Divide the data the application must process into static data and dynamic data, and partition a storage space (storage pool) in the video memory of the computer that runs the application, with one region of the pool reserved for static data and the remainder for dynamic data.
Static data is data that does not change while the application runs; dynamic data is new data generated during execution. All of this information is recorded in advance in a configuration file.
(4) According to how they process data, divide the tasks of step (2) into computation tasks and logic-judgment tasks; run computation tasks on the GPU and logic-judgment tasks on the CPU. The invention completes the application's execution with a multiple-parallel execution scheme that combines pipeline parallelism, data parallelism, and task parallelism.
A pipeline is a producer-consumer execution pattern suited to the computation flow of most applications, and by reorganizing data it can balance the workload effectively, preventing one stage from producing so much output that the whole computation flow becomes unbalanced. The data-parallel execution model, as in current mainstream programming models such as CUDA (Compute Unified Device Architecture), can fully exploit the hardware's SIMD characteristics on large, regular, homogeneous data sets and hide memory access latency. The task execution model is an extensible execution mode that can explicitly express the interdependencies among units and the dynamic execution behavior of a program as it runs. To exploit the characteristics of the heterogeneous multi-core architecture and use hardware resources sensibly, the invention adopts a pipeline-parallel data processing mode: the computation-heavy application execution pipeline runs on the GPU, while the logic-heavy data/task scheduling pipeline runs on the CPU. The two pipelines execute asynchronously in parallel, with the scheduling pipeline running ahead of the execution pipeline. This pipeline-parallel execution mode guarantees the independence and parallelism of the computation while avoiding expensive synchronization primitives such as atomics and locks.
Within the application execution pipeline, execution behaviors inside the same computing task run in a data-parallel fashion, while behaviors of different computing tasks run asynchronously in a task-parallel fashion.
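The two asynchronous pipelines can be sketched as a producer-consumer pair: a scheduling thread that orders work and runs ahead, feeding a bounded queue drained by the execution side (here a plain worker thread standing in for the GPU). This is a minimal sketch of the execution pattern under illustrative names, not the patented implementation.

```python
import threading
import queue

def run_pipelines(tasks):
    """Producer-consumer sketch of the two pipelines: the scheduling
    pipeline (producer) runs ahead of the execution pipeline
    (consumer); the queue is the only synchronization needed."""
    work = queue.Queue(maxsize=4)   # bounded: the scheduler runs ahead, but not unboundedly
    results = []

    def scheduling_pipeline():      # CPU side: logic-heavy ordering decisions
        for t in sorted(tasks, key=lambda t: -t["priority"]):
            work.put(t)
        work.put(None)              # sentinel: no more tasks

    def execution_pipeline():       # stand-in for the GPU side: computation-heavy
        while (t := work.get()) is not None:
            results.append((t["name"], t["payload"] * 2))  # dummy "kernel"

    p = threading.Thread(target=scheduling_pipeline)
    c = threading.Thread(target=execution_pipeline)
    p.start(); c.start()
    p.join(); c.join()
    return results

if __name__ == "__main__":
    tasks = [{"name": "shade", "priority": 1, "payload": 10},
             {"name": "intersect", "priority": 3, "payload": 7}]
    print(run_pipelines(tasks))  # the higher-priority task is dispatched first
```

Because the producer enqueues tasks in priority order and a single consumer drains them first-in first-out, the result order is deterministic even though the two pipelines run concurrently.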
Some programs may generate unpredictable amounts of data and unbalance the whole execution pipeline. Data-parallel execution can easily make one task produce a large volume of new data that is hard to store and consume at once; that data may quickly exhaust the limited video memory, and because its generation is random and unpredictable it also complicates data management. Therefore, before step (4) is executed, the method checks whether the new data the pending task will produce exceeds the remaining space of the storage pool; if it does, the task's input data is partitioned into groups so that it is processed in batches. This greatly reduces the storage and bandwidth burden imposed by algorithms that touch massive data sets, keeps the video memory filled only with data that the running threads actually need, and thereby further improves thread-level parallel efficiency and the effective use of the hardware.
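The pre-execution check and batching can be sketched as follows; `estimate_output_bytes` stands in for the application-specific estimator the text mentions, and all names and sizes are illustrative.

```python
def plan_batches(items, estimate_output_bytes, pool_remaining):
    """Split a task's input into groups so that the estimated new
    data of each group fits in the storage pool's remaining space;
    a single group means the task runs unsplit."""
    batches, current, current_bytes = [], [], 0
    for item in items:
        b = estimate_output_bytes(item)
        if b > pool_remaining:
            raise ValueError("one item alone exceeds the pool")
        if current_bytes + b > pool_remaining:  # would overflow: close this batch
            batches.append(current)
            current, current_bytes = [], 0
        current.append(item)
        current_bytes += b
    if current:
        batches.append(current)
    return batches
```

For example, five items estimated at 40 bytes each against 100 bytes of remaining pool space yield three batches of at most two items, so no batch's output overflows the pool.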
Because unpredictable new tasks may appear while tasks are running, the invention uses priority-based dynamic scheduling and manages the corresponding data transfers according to the tasks scheduled.
Every task (both the tasks of step (2) and new tasks that appear at run time) is given a priority state; when a new task appears, tasks are run in order of priority, highest first, according to the priority states of all tasks.
A task's priority is determined mainly by where its required data sits in the storage hierarchy, the type of processor it needs, and the size of its data set. Scheduling is data-driven: tasks are dispatched according to the type of processor that has become idle and the tasks corresponding to the static data currently in the storage pool. Specifically, the following rules are combined, from highest priority to lowest:
(1) the task's execution requires no static data;
(2) the required data is in the cache;
(3) the task has sufficiently similar data, or its output can raise the execution priority of other tasks, or several tasks execute similarly;
(4) the required data is in GPU video memory;
(5) the required data is in CPU memory;
(6) the required data is being transferred from the hard disk to memory;
(7) the required data set is too small to make full use of the hardware's computing power.
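The seven rules above can be encoded as an ordered rank, as in this sketch; the task attributes (`needs_static`, `data_location`, and so on) are hypothetical names introduced for illustration, not terms from the patent.

```python
def task_priority(task):
    """Return a rank for a task (lower rank = scheduled earlier),
    applying the seven rules in descending priority order. `task`
    is a dict with hypothetical keys: needs_static (bool),
    data_location ('cache'|'gpu'|'cpu'|'disk'), similar (bool),
    dataset_too_small (bool)."""
    if not task["needs_static"]:
        return 0   # (1) no static data required
    if task["data_location"] == "cache":
        return 1   # (2) data already in the cache
    if task["similar"]:
        return 2   # (3) sufficiently similar data, or helps other tasks
    if task["data_location"] == "gpu":
        return 3   # (4) data in GPU video memory
    if task["data_location"] == "cpu":
        return 4   # (5) data in CPU memory
    if task["data_location"] == "disk":
        return 5   # (6) data still in transit from disk
    if task["dataset_too_small"]:
        return 6   # (7) too small to use the hardware well
    return 7       # fallback: lowest priority
```

A scheduler can then simply pick the pending task with the smallest rank, e.g. `min(pending, key=task_priority)`.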
The method is implemented on mature heterogeneous multi-core architectures, such as NVIDIA's recently released Fermi architecture or Intel's forthcoming Larrabee architecture, which generally offer more than 1 TFLOPS of floating-point throughput, more than 20 cores, hundreds of hardware threads, and a complex storage hierarchy.
The data processing method of the invention is specifically optimized for complex algorithms with dynamic execution behavior and irregular data structures. During processing it manages data dynamically according to the principle of storage locality and the SIMD operation mechanism, so that the application makes maximally effective use of the hardware's computing and storage resources. With this method, high-performance parallel applications can be developed quickly and conveniently, which will greatly accelerate development and reduce R&D cost.
Brief Description of the Drawings
Figure 1 shows the performance of CUDA and of the model of the invention as scene complexity increases.
Detailed Description of the Embodiments
To verify the feasibility of the invention, a PC with a quad-core 3.7 GHz Intel Xeon CPU and an Nvidia GTX 285 (1 GB video memory) was selected. A programming interface implementing the method described above was built on the PTX instruction set, and the ray tracing algorithm, a graphics algorithm with a great deal of dynamic, irregular behavior, was redesigned and rewritten following the proposed method. Its results were compared with code written with Nvidia's CUDA programming model and analyzed as follows.
The application is divided into several computing tasks. To match the hardware's SIMD/SIMT operation and local memory access characteristics, computations with similar execution or memory access behavior are encapsulated in one computing task for efficient processing; each task is kept as short and single-purpose as possible, and tasks may run in parallel or serially as the situation requires. Within a computing task computation is data-parallel; across tasks it is asynchronous and task-parallel. Each computing task carries a state used to sequence the execution of tasks that may depend on one another.
According to the characteristics of the computation in the ray tracing algorithm, six computing tasks were created in the application execution pipeline, including ray generation, acceleration-structure traversal, primitive intersection, shading, and shadowing, while ray sorting and ray-packet assembly run in the data/task scheduling pipeline. All of these tasks parallelize well, i.e., they have a wide effective SIMD width, but the recursive nature of rays means that SIMD utilization is likely to fall sharply as recursion deepens. The implementation therefore uses delayed computation to raise SIMD utilization further: if the shading task cannot produce enough rays to form a complete ray packet, intersection is delayed until a complete packet has formed; likewise, if the intersection task cannot produce enough rays for shading, shading is delayed.
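The delayed-computation rule, holding a stage back until a full ray packet has accumulated, can be sketched with a simple packet buffer; the class name and packet width are illustrative, not from the patent.

```python
class PacketBuffer:
    """Accumulates rays destined for a downstream stage; the stage
    stays 'delayed' (flush returns nothing) until at least one full
    packet of `width` rays has formed, which keeps SIMD lanes full."""
    def __init__(self, width=32):
        self.width = width
        self.rays = []

    def add(self, ray):
        self.rays.append(ray)

    def flush_full_packets(self):
        packets = []
        while len(self.rays) >= self.width:       # only complete packets are released
            packets.append(self.rays[:self.width])
            self.rays = self.rays[self.width:]
        return packets                            # [] => downstream stage stays delayed

buf = PacketBuffer(width=4)
for r in range(6):
    buf.add(r)
packets = buf.flush_full_packets()  # one full packet; two rays remain buffered
```

The shading and intersection stages would each drain such a buffer, so neither launches underfilled SIMD work.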
Data is divided into static data, which does not change during the application's execution, and dynamic data, the continually changing new data produced during execution. A storage pool is set up at initialization; a certain amount of video memory is allocated to static data according to the specific application, and the rest is occupied by dynamic data. All of this information is recorded in a configuration file.
The static data an application needs may exceed the size of video memory, in which case static data must be staged in dynamically during execution; since each load is not necessarily the same size as the previous one, fragments can arise in both the static and the dynamic regions. To avoid fragmentation and use video memory effectively, a bidirectional allocation scheme can be used in video memory: static data is placed at the low-address end of the storage pool and dynamic data at the high-address end.
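The two-ended allocation can be sketched with bump pointers growing toward each other. A real implementation would manage GPU memory; here a plain offset range stands in, and all names are illustrative.

```python
class BidirectionalPool:
    """Storage pool with static data bump-allocated from the low
    end and dynamic data from the high end; the two cursors grow
    toward each other, so neither region fragments internally."""
    def __init__(self, size):
        self.size = size
        self.static_top = 0          # next free byte for static data
        self.dynamic_bottom = size   # one past the next free byte for dynamic data

    def remaining(self):
        return self.dynamic_bottom - self.static_top

    def alloc_static(self, n):
        if n > self.remaining():
            raise MemoryError("pool exhausted")
        off = self.static_top
        self.static_top += n
        return off                   # offset of the new static block

    def alloc_dynamic(self, n):
        if n > self.remaining():
            raise MemoryError("pool exhausted")
        self.dynamic_bottom -= n
        return self.dynamic_bottom   # offset of the new dynamic block

    def reset_dynamic(self):
        # dynamic data from a finished batch can be discarded wholesale
        self.dynamic_bottom = self.size
```

Because each region is a contiguous bump allocation, releasing all dynamic data between batches is a single pointer reset rather than a free-list walk.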
As stated above, to exploit the characteristics of the heterogeneous multi-core architecture and use hardware resources sensibly, the invention designs a pipeline-parallel execution mode combining an application execution pipeline with a data/task scheduling pipeline: the computation-heavy execution pipeline runs on the GPU, the logic-heavy scheduling pipeline runs on the CPU, the two execute asynchronously in parallel, and the scheduling pipeline runs ahead of the execution pipeline. This pipeline-parallel execution mode guarantees the independence and parallelism of the computation while avoiding expensive synchronization primitives such as atomics and locks.
In the implementation, the data/task scheduling pipeline is designed around three principles: (1) keep accesses to static data in the fastest layer of the hardware storage hierarchy (cache, shared memory, etc.) whenever possible, and delay data accesses until they are unavoidable; (2) give priority to tasks with sufficiently similar data, to tasks whose output can raise the execution priority of other tasks, and to groups of tasks whose execution is similar; (3) schedule tasks in a data-driven manner, according to the type of processor that has become idle and the tasks corresponding to the static data currently in the storage pool.
1) A data analyzer is designed to control data use dynamically, addressing the load imbalance that arises when a program generates unpredictable volumes of data. Data-parallel execution can make a computing task produce a large volume of new data that is hard to store and consume at once; that data may quickly exhaust the limited video memory, and its random, unpredictable generation complicates data management. The data analyzer therefore checks, before each execution of a computing task, whether the new data the task will generate exceeds the remaining video memory (the estimation method depends on the application); if it does, the input data is grouped and processed in batches. This greatly reduces the storage and bandwidth burden of algorithms that touch massive data sets, keeps video memory filled only with data the running threads need, and thereby further improves thread-level parallel efficiency and the effective use of the hardware.
A data buffer is established for each computing task to manage the data it produces and consumes. Because data production and consumption in some complex algorithms is dynamic and irregular, the data must be reorganized to satisfy the locality principle and the SIMD operation characteristics, concentrating computation on local data sets, so that computation can continue efficiently on the hardware. The strong bandwidth of current hardware and the strong logic-processing capability of the CPU make such dynamic data reorganization entirely feasible.
2) A task scheduler is designed to perform priority-based dynamic scheduling of the unpredictable task execution sequence, managing the corresponding data transfers for the tasks it schedules. Scheduling is on demand: when a processor becomes available, (i) a semaphore is set to lock the scheduler; (ii) the entire pending-task sequence is scanned and the highest-priority task is selected and marked; (iii) the scheduler is unlocked.
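The three-step dispatch, lock, scan-and-mark, unlock, can be sketched as follows; a mutex stands in for the semaphore in the text, and any numeric `priority` field (such as one derived from the rules below) serves as the ordering. The names are illustrative.

```python
import threading

class OnDemandScheduler:
    """When a processor becomes available it calls next_task():
    the scheduler is locked, the whole pending sequence is scanned
    for the highest-priority unmarked task, that task is marked as
    dispatched, and the scheduler is unlocked."""
    def __init__(self, tasks):
        self._lock = threading.Lock()  # stands in for the semaphore
        self._tasks = list(tasks)      # each: {"name": ..., "priority": ..., "done": False}

    def next_task(self):
        with self._lock:               # step (i): lock the scheduler
            pending = [t for t in self._tasks if not t["done"]]
            if not pending:
                return None
            best = max(pending, key=lambda t: t["priority"])  # step (ii): scan + select
            best["done"] = True        # mark as dispatched
            return best                # step (iii): lock released on exit
```

Holding the lock for the whole scan keeps the selection atomic when several processors become idle at once, which is exactly what the lock/unlock bracket in the text provides.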
Priority determination is the core of this scheduler. Given the mix of processing resources, a task's priority is based mainly on where its required data sits in the storage hierarchy, the type of processor it needs, and the size of its data set. Specifically, the following rules are combined from highest priority to lowest: (1) the task's execution requires no static data; (2) the required data is in the cache; (3) the task has sufficiently similar data, or its output can raise the execution priority of other tasks, or several tasks execute similarly; (4) the required data is in GPU video memory; (5) the required data is in CPU memory; (6) the required data is being transferred from the hard disk to memory; (7) the required data set is too small to make full use of the hardware's computing power.
Test scenes of different geometric complexity, Bunny, Fairy, and BART Kitchen, were selected as test models. Fairy is a dynamic scene with two reflection bounces, rendered at 1024×1024 resolution. The scenes were tested with both the method of the invention and the CUDA programming model; as Table 1 shows, the method of the invention achieved better performance than CUDA. By using hardware computing and storage resources sensibly, the pipeline-parallel mechanism performs priority-based task scheduling that balances load across the processing cores.
Table 1. Frames per second rendered with CUDA and with this model for the Bunny, Fairy, and BART Kitchen scenes at 1024×1024 resolution. (The table itself is not reproduced in this extraction.)
To verify the method's ability to use hardware in parallel, scalar-processor utilization was measured; it directly reflects whether the data and task scheduling and organization can maximize the algorithm's parallel execution on the hardware. Note that ALU usage was not used as the metric, because even when thread slots are occupied the ALUs may be underused owing to memory access latency or low SIMD utilization. As Table 2 shows, the method of the invention uses the GPU's computing resources more effectively than the CUDA programming model.
Table 2. Comparison of GPU utilization between CUDA and this model. (The table itself is not reproduced in this extraction.)
To show that the method organizes data and schedules tasks dynamically for scenes of varying complexity without load imbalance, Figure 1 shows that under the CUDA programming model, as scene complexity grows extreme, obvious load imbalance appears and processing resources are underused, ultimately degrading performance, whereas the method of the invention maintains comparatively stable performance.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010150549A CN101833438A (en) | 2010-04-19 | 2010-04-19 | A general data processing method based on multiple parallelism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101833438A true CN101833438A (en) | 2010-09-15 |
Family
ID=42717518
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010150549A Pending CN101833438A (en) | 2010-04-19 | 2010-04-19 | A general data processing method based on multiple parallelism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101833438A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102567084A (en) * | 2010-12-31 | 2012-07-11 | Xin'aote (Beijing) Video Technology Co., Ltd. | Multi-task parallel scheduling mechanism |
CN103197976A (en) * | 2013-04-11 | 2013-07-10 | Huawei Technologies Co., Ltd. | Method and device for processing tasks of heterogeneous system |
CN104040500A (en) * | 2011-11-15 | 2014-09-10 | Intel Corporation | Scheduling thread execution based on thread affinity |
CN104102476A (en) * | 2014-08-04 | 2014-10-15 | Inspur (Beijing) Electronic Information Industry Co., Ltd. | High-dimensional data stream canonical correlation parallel computation method and device for irregular streams |
CN104331271A (en) * | 2014-11-18 | 2015-02-04 | Li Hua | Parallel computing method and system for CFD (Computational Fluid Dynamics) |
CN104699461A (en) * | 2013-12-10 | 2015-06-10 | Arm Limited | Configuring thread scheduling on a multi-threaded data processing apparatus |
CN102567084B (en) * | 2010-12-31 | 2016-12-14 | Xin'aote (Beijing) Video Technology Co., Ltd. | Multi-task parallel scheduling mechanism |
CN106537863A (en) * | 2013-10-17 | 2017-03-22 | Marvell World Trade Ltd. | Processing concurrency in a network device |
CN106886503A (en) * | 2017-02-08 | 2017-06-23 | Wuxi Shiyue Zhongchen Technology Co., Ltd. | Heterogeneous system, data processing method and device |
CN106941522A (en) * | 2017-03-13 | 2017-07-11 | Guangzhou Wuzhou Technology Co., Ltd. | Lightweight distributed computing platform and its data processing method |
CN108595211A (en) * | 2018-01-05 | 2018-09-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for outputting data |
CN110334049A (en) * | 2019-07-02 | 2019-10-15 | Shanghai United Imaging Healthcare Co., Ltd. | Data processing method, device, computer equipment and storage medium |
CN110580527A (en) * | 2018-06-08 | 2019-12-17 | Shanghai Cambricon Information Technology Co., Ltd. | Method, device and storage medium for generating a general machine learning model |
CN110688327A (en) * | 2019-09-30 | 2020-01-14 | Baidu Online Network Technology (Beijing) Co., Ltd. | Video memory management method and device, electronic equipment and computer-readable storage medium |
US11726754B2 (en) | 2018-06-08 | 2023-08-15 | Shanghai Cambricon Information Technology Co., Ltd. | General machine learning model, and model file generation and parsing method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1538296A (en) * | 2003-02-18 | 2004-10-20 | Multithreaded kernel for graphics processing unit | |
CN101091175A (en) * | 2004-09-16 | 2007-12-19 | NVIDIA Corporation | Load balancing |
CN101354780A (en) * | 2007-07-26 | 2009-01-28 | LG Electronics Inc. | Graphic data processing apparatus and method |
CN101526934A (en) * | 2009-04-21 | 2009-09-09 | Inspur Electronic Information Industry Co., Ltd. | Construction method of GPU and CPU combined processor |
Non-Patent Citations (5)
Title |
---|
Tang Ming: "Processors: hoping for a bumper year" (处理器:期待丰收"大年"), Computer World (《计算机世界》), 2010-01-11 * |
LEI WANG, et al.: "Task Scheduling of Parallel Processing in CPU-GPU Collaborative Environment", International Conference on Computer Science and Information Technology 2008 * |
Unattributed: "CPU+GPU: hybrid processors improve internal performance and interconnect efficiency" (CPU+GPU:混合处理器提高内部性能连接效率), New Computer (《新电脑》) * |
Tang Ming: "Processors: hoping for a bumper year" (处理器:期待丰收"大年"), Computer World (《计算机世界》) * |
Qian Yue: "Applied research on the CUDA programming model for graphics processors" (图形处理器CUDA编程模型的应用研究), Computer & Digital Engineering (《计算机与数字工程》) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101833438A (en) | A general data processing method based on multiple parallelism | |
Wang et al. | Kernel fusion: An effective method for better power efficiency on multithreaded GPU | |
Li et al. | Performance modeling in CUDA streams—A means for high-throughput data processing | |
CN105487838B (en) | A task-level parallel scheduling method and system for a dynamically reconfigurable processor | |
CN112306678A (en) | Method and system for parallel processing of algorithms based on heterogeneous many-core processor | |
CN110704360A (en) | Graph calculation optimization method based on heterogeneous FPGA data flow | |
CN101799762B (en) | Quick parallelization programming template method for remote sensing image processing algorithm | |
CN106991011A (en) | A method for processing big data tasks based on CPU multi-threading and GPU multi-granularity parallelism and collaborative optimization | |
CN101923492B (en) | Method for executing dynamic allocation command on embedded heterogeneous multi-core | |
CN102902512A (en) | Multi-thread parallel processing method based on multi-thread programming and message queue | |
CN104375805A (en) | Method for simulating parallel computation process of reconfigurable processor through multi-core processor | |
CN101387952A (en) | Single chip multiprocessor task scheduling management method | |
CN102193830B (en) | Many-core environment-oriented division mapping/reduction parallel programming model | |
CN112446471B (en) | Convolution acceleration method based on heterogeneous many-core processor | |
CN103049241A (en) | Method for improving computation performance of CPU (Central Processing Unit) +GPU (Graphics Processing Unit) heterogeneous device | |
CN105512088A (en) | Processor architecture capable of being reconstructed and reconstruction method thereof | |
CN105373367B (en) | The vectorial SIMD operating structures for supporting mark vector to cooperate | |
CN101840329B (en) | Data parallel processing method based on graph topological structure | |
CN114816529A (en) | Apparatus and method for configuring cooperative thread bundle in vector computing system | |
CN117608847A (en) | An edge reasoning framework optimization method and system for ARM platform | |
CN111008042B (en) | Efficient general-purpose processor execution method and system based on heterogeneous pipeline | |
CN105045564A (en) | Front end dynamic sharing method in graphics processor | |
CN112148361B (en) | Method and system for transplanting encryption algorithm of processor | |
Moustafa et al. | 3D cartesian transport sweep for massively parallel architectures with PARSEC | |
CN105653243A (en) | Method for distributing tasks by general purpose graphic processing unit in multi-task concurrent execution manner |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C53 | Correction of patent of invention or patent application | ||
CB03 | Change of inventor or designer information |
Inventor after: Xu Duanqing; Yang Xin; Zhao Lei; Fang Yingming. Inventor before: Xu Duanqing; Yang Xin; Zhao Lei |
|
COR | Change of bibliographic data |
Free format text: CORRECT: INVENTOR; FROM: XU DUANQING YANG XIN ZHAO LEI TO: XU DUANQING YANG XIN ZHAO LEI FANG YINGMING |
|
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20100915 |