CN103218304B

CN103218304B - Off-chip distribution method in a kind of embedded memory data slice

Info

Publication number: CN103218304B
Application number: CN201310114684.3A
Authority: CN
Inventors: 姚英彪; 陈越佳; 王璇; 曾宪彬
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2013-04-03
Filing date: 2013-04-03
Publication date: 2016-07-20
Anticipated expiration: 2033-04-03
Also published as: CN103218304A

Abstract

The present invention relates to off-chip distribution method in a kind of embedded memory data slice.On sheet, internal memory is as the key component of embedded system, directly affects the overall performance of system.Present invention firstly provides TCG model as weighing the new standard that data object causes Cache to lack, consider most key factor, such as data object size, life cycle, access times, temporal locality and spatial locality etc..Secondly propose SPM/Cache data distributing method and the data object being easiest to clash (TCG value is big) is assigned to SPM.Then data object big for TCG value is mapped to different Cache groups and avoids conflict by proposition fixing Cache data layout method.The inventive method makes internal memory hardware and the software run on it on sheet more mate, and reduces the time of routine access storage system, thus improving systematic entirety energy.

Description

A method for on-chip and off-chip allocation of embedded memory data

技术领域 technical field

本发明属于嵌入式内存技术领域。尤其涉及一种嵌入式内存数据片上片外分配方法。本发明能取得针对具体应用在具体内存配置上的性能最优，尤其适用于多媒体应用程序在便笺式存贮器/高速缓存混合内存结构上的性能优化。 The invention belongs to the technical field of embedded memory. In particular, it relates to a method for on-chip and off-chip allocation of embedded memory data. The present invention can achieve optimal performance on specific memory configuration for specific applications, and is especially suitable for performance optimization of multimedia application programs on the hybrid memory structure of scratch pad memory/cache cache.

背景技术 Background technique

由于制造工艺和电路逻辑结构的差异，处理器执行部件的速度一直高于存储器读写速度，并且随着半导体工艺技术的发展，这一速度差距造成的性能差异在逐步加大。解决处理器与外存速度失配的一个重要技术就是存储系统采用分层设计，在片上集成一个小的、但速度更快的存储器来提高系统存储访问性能。 Due to differences in manufacturing process and circuit logic structure, the speed of processor execution components has always been higher than the read and write speed of memory, and with the development of semiconductor process technology, the performance difference caused by this speed gap is gradually increasing. An important technology to solve the speed mismatch between the processor and the external memory is that the storage system adopts a layered design and integrates a small but faster memory on the chip to improve the system storage access performance.

片上内存结构作为嵌入式系统的重要部分，直接影响着系统的性能、功耗、成本等关键参数。片上内存结构有高速缓存Cache和便笺式存贮器SPM（Scratch-PadMemory）两种类型。SPM相比Cache存储每位花费更少面积和功耗，因而嵌入式系统片上内存结构采取SPM/Cache的混合结构渐渐成为一种趋势。然而，SPM的容量很小和专用性，使得如何有效使用片上内存资源成为嵌入式系统设计的关键问题。 As an important part of the embedded system, the on-chip memory structure directly affects key parameters such as system performance, power consumption, and cost. There are two types of on-chip memory structures: cache Cache and scratch pad memory SPM (Scratch-PadMemory). Compared with Cache, SPM consumes less area and power consumption per bit. Therefore, it is gradually becoming a trend to adopt a hybrid structure of SPM/Cache for the on-chip memory structure of embedded systems. However, the small capacity and specificity of SPM make how to effectively use on-chip memory resources a key issue in embedded system design.

现有的软件数据存储优化研究主要集中在如何增加Cache命令率，或者如何增加SPM访问次数，缺乏对采用Cache和SPM混合片上内存结构的数据内存访问优化研究。 Existing research on software data storage optimization mainly focuses on how to increase the command rate of Cache, or how to increase the number of SPM accesses, but there is a lack of research on data memory access optimization using the hybrid on-chip memory structure of Cache and SPM.

数据片上片外分配技术是一种嵌入式系统存储优化技术，利用该技术得到片上片外分配策略决定哪些数据通过SPM(称为片上)访问，哪些数据通过Cache(称为片外)访问。数据片上片外分配技术优化了数据在SPM和Cache之间的分配，可以取得对具体应用的性能最优，已经成为了嵌入式系统存储优化研究的热点。 Data on-chip and off-chip allocation technology is an embedded system storage optimization technology. Using this technology, the on-chip and off-chip allocation strategy determines which data is accessed through SPM (called on-chip) and which data is accessed through Cache (called off-chip). Data on-chip and off-chip allocation technology optimizes the allocation of data between SPM and Cache, and can achieve the best performance for specific applications. It has become a hot spot in the research of embedded system storage optimization.

发明内容 Contents of the invention

本发明的目的是针对现有技术的不足，提供一种嵌入式内存数据片上片外分配方法，能够实现具体应用程序在具体内存配置上的性能最优。 The object of the present invention is to provide a method for on-chip and off-chip allocation of embedded memory data in view of the deficiencies in the prior art, which can realize the optimal performance of specific application programs in specific memory configurations.

为了解决上述技术问题，本发明采用的技术方案包括如下步骤： In order to solve the problems of the technologies described above, the technical solution adopted in the present invention comprises the following steps:

步骤1.利用编译器和仿真器工具提取具体应用程序的信息； Step 1. Utilize compiler and emulator tools to extract specific application information;

步骤2.对这些信息建立TCG模型； Step 2. Build a TCG model on these information;

步骤3.提出数据分配方法将TCG值大的数据对象分配到SPM； Step 3. Propose a data allocation method to allocate data objects with large TCG values to SPM;

步骤4.提出数据布局方法将TCG值大的数据对象映射到不同的Cache组以避免冲突。 Step 4. A data layout method is proposed to map data objects with large TCG values to different Cache groups to avoid conflicts.

步骤1所述的具体应用程序的信息，包括数据对象的大小、生命周期、访问次数、时间局部性和空间局部性；所述的时间局部性是由时间关系图TRG（TemporalRelationshipGraph）来表示；空间局部性是由最大连续访问次数来表示。 The specific application information described in step 1 includes the size, life cycle, number of visits, temporal locality, and spatial locality of the data object; the temporal locality is represented by the temporal relationship graph TRG (TemporalRelationshipGraph); the spatial Locality is represented by the maximum number of consecutive accesses.

步骤2所述的TCG模型，其内容包括步骤1提取的数据对象的大小、生命周期、访问次数、时间局部性和空间局部性因素，其模型公式如下： The TCG model described in step 2 includes the size, life cycle, number of visits, temporal locality and spatial locality factors of the data object extracted in step 1, and its model formula is as follows:

TCG=（访问次数*生命周期*TRG值）/（最大连续访问次数*对象大小）。 TCG=(number of visits*lifetime*TRG value)/(maximum number of consecutive visits*object size).

步骤3所述的数据分配方法，具体包括如下步骤： The data distribution method described in step 3 specifically includes the following steps:

3-1.将全部数据对象按照TCG值降序排列，并初始化，然后分配到片外内存，作为待分配数据对象； 3-1. Arrange all data objects in descending order of TCG values, initialize them, and then allocate them to off-chip memory as data objects to be allocated;

3-2.在所有待分配数据对象中，依降序顺序选择第一个满足容量小于或等于便笺式存贮器剩余容量的数据对象，将该数据对象分配到片上便笺式存贮器； 3-2. Among all the data objects to be allocated, select the first data object whose capacity is less than or equal to the remaining capacity of the pad memory in descending order, and allocate the data object to the on-chip scratch pad memory;

3-3.重复步骤3-2，直到所有待分配数据对象容量均大于便笺式存贮器剩余容量，则结束。 3-3. Step 3-2 is repeated until the capacity of all data objects to be allocated is greater than the remaining capacity of the scratch pad, then the process ends.

步骤4所述的数据布局方法包含如下步骤： The data layout method described in step 4 includes the following steps:

4-1.计算剩余待分配数据对象中数据对象需要的高速缓存组数，计算公式如下： 4-1. Calculate the number of cache groups required by the data objects in the remaining data objects to be allocated. The calculation formula is as follows:

组数=数据对象大小/高速缓存组大小； Number of groups = data object size / cache group size;

4-2.将高速缓存当前组号分配给数据对象，并将缓存当前组号加一，同时数据对象所需组数减一； 4-2. Assign the current group number of the cache to the data object, add one to the current group number of the cache, and decrease the number of groups required by the data object by one;

4-3.重复步骤4-2，直到数据对象所需组数为零； 4-3. Repeat step 4-2 until the number of groups required for the data object is zero;

4-4.重复步骤4-1、4-2和4-3，直到剩余待分配数据对象全部分配完成。 4-4. Repeat steps 4-1, 4-2 and 4-3 until all remaining data objects to be allocated are allocated.

本发明的有益效果如下： The beneficial effects of the present invention are as follows:

本发明方法利用TCG模型对应用程序信息进行建模，综合考虑了SPM的合理利用与片外内存数据对象的合理布局，优化了SPM和Cache之间的数据分配，减少了程序消耗在数据存储访问时间和降低数据存储访问能耗，实现了具体应用程序在具体内存配置上的性能最优。 The method of the present invention utilizes the TCG model to model the application program information, comprehensively considers the reasonable utilization of the SPM and the reasonable layout of the off-chip memory data objects, optimizes the data allocation between the SPM and the Cache, and reduces program consumption in data storage access time and reduce data storage access energy consumption, and realize the optimal performance of specific applications on specific memory configurations.

附图说明 Description of drawings

图1为本发明方法的流程图； Fig. 1 is the flowchart of the inventive method;

图2为本发明方法提出的TCG模型结构图； Fig. 2 is the TCG model structural diagram that the inventive method proposes;

图3为本发明方法中的SPM/Cache数据分配方法流程图； Fig. 3 is the flow chart of the SPM/Cache data allocation method in the method of the present invention;

图4为本发明方法中的固定Cache数据布局方法流程图。 FIG. 4 is a flow chart of the fixed Cache data layout method in the method of the present invention.

具体实施方式 detailed description

下面结合具体实施方式和附图对本发明进行详细描述。 The present invention will be described in detail below in conjunction with specific embodiments and accompanying drawings.

如图1所示，本实施方式中首先利用编译器和仿真器工具提取具体应用程序的信息：①选择GCC-2.7.1-MIPS编译器的-O3优化选项静态编译应用程序得到MIPS汇编代码；②选择MIPS仿真器对片上内存进行配置，包括容量大小、访问延迟和组织方式（替换策略、写策略、写缺失策略和关联方式）等，开启性能统计工具，进行程序的数据存储访问性能的仿真。其次对这些信息建立TCG模型；然后利用SPM/Cache数据分配方法将TCG值大的数据对象分配到SPM；最后利用固定Cache数据布局方法将TCG值大的数据对象映射到不同的Cache组以避免冲突。 As shown in Figure 1, at first utilize compiler and emulator tool to extract the information of concrete application program in the present embodiment: 1. select the -O3 optimization option static compilation application program of GCC-2.7.1-MIPS compiler to obtain MIPS assembly code; ②Select the MIPS emulator to configure the on-chip memory, including capacity, access delay, and organization (replacement strategy, write strategy, write-miss strategy, and association method), etc., and open the performance statistics tool to simulate the data storage access performance of the program . Secondly, establish a TCG model for these information; then use the SPM/Cache data allocation method to allocate data objects with large TCG values to SPM; finally use the fixed Cache data layout method to map data objects with large TCG values to different Cache groups to avoid conflicts .

所述的具体应用程序的信息，包括数据对象的大小、生命周期、访问次数、时间局部性和空间局部性；所述的时间局部性是由时间关系图TRG（TemporalRelationshipGraph）来表示；空间局部性是由最大连续访问次数来表示。 The specific application information includes the size, life cycle, access times, temporal locality and spatial locality of the data object; the temporal locality is represented by the temporal relationship graph TRG (TemporalRelationshipGraph); spatial locality It is represented by the maximum number of consecutive visits.

如图2所示，所述的TCG模型，其内容包括步骤1提取的数据对象的大小、生命周期、访问次数、时间局部性和空间局部性因素，其模型公式如下： As shown in Figure 2, the described TCG model includes the size, life cycle, number of visits, temporal locality and spatial locality factors of the data object extracted in step 1, and its model formula is as follows:

其中，时间关系图TRG（TemporalRelationshipGraph）和TRG值可参看N.Gloy,T.Blockwell,M.D.Zorn论文题目Procedureplacementusingtemporalorderinginformation后参照文本。 Among them, the temporal relationship graph TRG (TemporalRelationshipGraph) and TRG values can be found in the reference text after the title of N.Gloy, T.Blockwell, M.D.Zorn paper title Procedure placement using temporal ordering information.

如图3所示，本实施方式中的SPM/Cache数据分配的目的是将最容易发生冲突的数据对象分配到SPM中，包括如下步骤： As shown in Figure 3, the purpose of the SPM/Cache data allocation in this embodiment is to allocate the most conflict-prone data objects to the SPM, including the following steps:

Step1、将全部数据对象按照TCG值降序排列，并初始化，然后分配到片外内存，作为待分配数据对象； Step1. Arrange all data objects in descending order of TCG values, initialize them, and then allocate them to off-chip memory as data objects to be allocated;

Step2、在所有待分配数据对象中，依降序顺序选择第一个满足容量小于或等于便笺式存贮器剩余容量的数据对象，将该数据对象分配到片上便笺式存贮器SPM； Step2, among all data objects to be distributed, select the first data object satisfying capacity less than or equal to the remaining capacity of the pad memory in descending order, and distribute the data object to the on-chip scratch pad memory SPM;

Step3、重复步骤Step2，直到所有待分配数据对象容量均大于便笺式存贮器剩余容量，则结束。 Step 3. Repeat Step 2 until the capacity of all data objects to be allocated is greater than the remaining capacity of the scratch pad, then end.

如图4所示，本实施方式中的固定Cache数据布局具有两个目标：①减少Cache缺失的次数；②减少片外内存空间（即数据布局完后减少片外内存中的洞），包括如下步骤： As shown in Figure 4, the fixed Cache data layout in this embodiment has two goals: ① reduce the number of Cache misses; ② reduce the off-chip memory space (that is, reduce the holes in the off-chip memory after the data layout is completed), including the following step:

Step4、计算剩余的i个待分配数据对象中每个数据对象需要的高速缓存组数j，计算公式如下： Step4. Calculate the number j of cache groups required for each data object in the remaining i data objects to be allocated. The calculation formula is as follows:

组数j=数据对象大小/高速缓存组大小； Number of groups j = data object size / cache group size;

Step5、将高速缓存当前组号setNO分配给数据对象，并将缓存当前组号setNO加一，同时数据对象所需组数j减一； Step5. Assign the current group number setNO of the cache to the data object, and add one to the current group number setNO of the cache, and decrease the number of groups j required by the data object by one;

Step6、重复步骤4-5，直到数据对象所需组数j为零； Step6, repeat steps 4-5 until the required number of groups j of the data object is zero;

Step7、重复步骤4-4、4-5和4-6，直到剩余的i个待分配数据对象全部分配完成。 Step7. Repeat steps 4-4, 4-5 and 4-6 until the remaining i data objects to be allocated are all allocated.

Claims

1. off-chip distribution method in an embedded memory data slice, it is characterised in that comprise the steps:

Step 1. utilizes compiler and simulator tool to extract the information of concrete application program；

These information are set up TCG model by step 2.；

Step 3. proposes data distributing method and data object big for TCG value is assigned to SPM；

Step 4. proposes data layout's method and data object big for TCG value is mapped to different Cache groups to avoid conflict；

The information of the concrete application program described in step 1, including the size of data object, life cycle, access times, temporal locality and spatial locality；Described temporal locality is to be represented by time chart TRG；Spatial locality is to be represented by maximum connected reference number of times；

TCG model described in step 2, its content includes the size of data object, life cycle, access times, temporal locality and the spatial locality factor that step 1 is extracted, and its model formation is as follows:

TCG=(access times * life cycle * TRG value)/(maximum connected reference number of times * object size)；

Data distributing method described in step 3, specifically includes following steps:

Total data object according to TCG value descending, and is initialized by 3-1., is then dispensed for off-chip internal memory, as data to be distributed object；

3-2., in all data to be distributed objects, selects first according to descending order and meets the capacity data object less than or equal to scratch-pad memory residual capacity, be assigned on sheet by this data object scratch-pad memory；

3-3. repeats step 3-2, until all data to be distributed object capacities are all higher than scratch-pad memory residual capacity, then terminates；

Data layout's method described in step 4 comprises the steps of:

4-1. calculates the cache set number that in residue data to be distributed object, data object needs, and computing formula is as follows:

Group number=data object size/cache set size；

Current for high-speed cache group number is distributed to data object by 4-2., and adds one by current for buffer memory group number, and data object required group number subtracts one simultaneously；

4-3. repeats step 4-2, until data object required group number is zero；

4-4. repeats step 4-1,4-2 and 4-3, until residue data to be distributed object is all assigned.