CN110376503A

CN110376503A - AI accelerator chip performance test method and device

Info

Publication number: CN110376503A
Application number: CN201910565843.9A
Authority: CN
Inventors: 陈坚; 汪玉; 林峰; 葛广君; 梁爽
Original assignee: Fuzhou Institute Of Data Technology Co Ltd
Current assignee: Fuzhou Institute Of Data Technology Co Ltd
Priority date: 2019-06-27
Filing date: 2019-06-27
Publication date: 2019-10-25
Anticipated expiration: 2039-06-27
Also published as: CN110376503B

Abstract

The present invention discloses an AI accelerator chip performance test method and device. The start time and end time of every instruction in each module of the chip are sampled and recorded to form data records; the data records are organized into lists, and the lists are then processed to obtain the instruction run durations and related parameters of each chip module. In addition, the parallel instructions for a specified module or a specified time can be looked up from the lists and shown as printed text or as a graphical display for parallelism analysis. The invention provides performance analysis not only for the computing components but also for the degree of parallelism between the communication components and the computing components. The invention can search for the parallel instructions within a range defined by set conditions and analyze instruction parallelism, thereby providing better support for chip performance optimization.

Description

An AI accelerator chip performance test method and device

Technical field

The present invention relates to the field of chip testing, and in particular to an AI accelerator chip performance test method and device.

Background art

An AI accelerator chip generally contains modules for instruction scheduling, convolution, pooling, activation function computation, data loading and data write-out. Because current mainstream AI algorithms have a huge number of parameters, data must be moved repeatedly between on-chip storage and off-chip memory. The match between the system's compute bandwidth and its data bandwidth is therefore a major factor affecting AI accelerator chip performance. In addition, the diversity of AI network models makes the operating efficiency of each module highly uncertain. The performance of an AI chip thus depends not only on the individual performance of these instruction modules, but also on how efficiently data movement is scheduled against the other instruction modules. A test method is needed that analyzes both the performance of each instruction module and the degree of parallelism among the instruction modules, to facilitate subsequent chip performance improvement.

Because chips serve different purposes, existing performance test methods emphasize different things. Some focus on chip performance under different utilization rates, which helps with chip selection; some focus on the running performance of a particular internal module, which helps with subsequent improvement; some focus on the degree of parallelism when multiple modules of the chip run, in order to predict the final performance of the chip during development. The typical methods at this stage are mainly the following:

1) Control the chip utilization rate, obtain chip performance test results under different utilization rates, and take the geometric mean. For example, in the patent with application No. 201310161217.6, "CPU performance evaluation method and device", the utilization rate of the CPU is controlled by control instructions; a benchmark test is run on the central processing unit (CPU) to obtain its performance test result under each utilization rate, where each performance test result represents the performance of the CPU under one load; the geometric mean of the multiple performance test results is then calculated to obtain the final CPU performance evaluation result. That patent can only estimate performance from test instruction segments; it cannot precisely measure the run time of each internal module or the degree of parallelism among different internal modules, and it cannot identify the module that causes a performance bottleneck.

2) Design bypass circuits for the different modules inside the chip, so that the performance of the module under test can be tested once other modules are bypassed. For example, in the patent with application No. 201910103596.0, "Unit performance test method and system chip for an artificial intelligence module", for multiple AI processing units arranged in a two-dimensional array, each processing unit includes an enable input for receiving an enable signal and suspends or starts its operation according to that signal; among the processing units, those sharing dimension 1 and/or dimension 2 with the processing unit under test can be configured into a bypass state so that the performance of the unit under test can be tested; giving processing units a bypass function makes testing the AI module more convenient. That patent tests the module under test through bypassing; it cannot estimate overall chip performance, and it cannot observe the degree of parallelism among different modules in a real scenario.

3) Estimate the overall performance of the chip from the run times of the computing blocks inside the chip. For example, in the patent with application No. 201510387995.6, "A method for predicting GPU performance and corresponding computer system", a set of test applications is run on the GPU chip to be evaluated; a set of scalar performance counters and vector performance counters is captured; based on the captured scalar and vector performance counters, a model for evaluating and predicting GPU performance is created for different chip configurations; and the performance score of the GPU chip is predicted and bottlenecks in the GPU pipeline are identified. That patent builds a performance model for the parallelism of the GPU's computing modules, but a chip contains both communication components and computing components, and the communication components sometimes have a considerable effect on overall performance.

Summary of the invention

The purpose of the present invention is to provide an AI accelerator chip performance test method and device that provides performance analysis not only for the computing components but also for the degree of parallelism between the communication components and the computing components. The invention can search for the parallel instructions within a range defined by set conditions and analyze instruction parallelism, thereby providing better support for chip performance optimization.

The technical solution adopted by the present invention is as follows:

An AI accelerator chip performance test method, comprising the following steps:

Step 1: start the global test and distribute the test instructions to the modules of the AI accelerator chip;

Step 2: sample the start time and end time of every instruction run by each module to form data records, and upload the data records to an external performance analyzer;

Step 3: the performance analyzer organizes the data records into lists by module;

Step 4: process the lists with a scripting language to obtain the run duration of every instruction;

Step 5: sum the run durations of all instructions of each module to obtain the total run time of each module, i.e. compile the total instruction run time per module, in order to identify the instructions that take up the most run time;

Step 6: process the lists with a scripting language to obtain the interval between adjacent instructions in each module;

Step 7: look up from the lists the parallel instruction run results within the specified range to be analyzed;

Step 8: output and display the run results of the parallel instructions that were found.

Further, in step 1 the modules of the AI accelerator chip include specific computing modules or communication modules responsible for data movement.

Further, in step 2 the AI accelerator chip obtains a time base from a timer and uses that time base to record the start time and end time of each instruction.

Further, the data records in step 2 are written to a volatile memory and uploaded to the performance analyzer once the memory occupancy reaches a watermark or the instructions finish running.

Further, the data records in step 2 are uploaded directly to the performance analyzer through a communication interface located on the AI accelerator chip.

Further, the scripting language in step 4 or 6 is Python.

Further, the list lookup in step 7 has three modes, as follows:

Mode 1: look up the parallel instructions within a time range according to a set time;

Mode 2: look up the parallel instructions within a time range according to specified instruction sequence numbers of a module;

Mode 3: find the index of the instruction with the longest instruction interval in a given module, and look up the parallel instructions within the corresponding time range.

Further, the specific steps of Mode 3 are:

Step 7-1: sort the instruction intervals in the specified module to obtain the index of the instruction with the largest interval;

Step 7-2: starting from that index, take the start time of the instruction q instructions earlier as Time min and the start time of the instruction p instructions later as Time max, where q and p are set by the user;

Step 7-3: look up the parallel instructions using Time min and Time max as the time range.
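By way of non-limiting illustration, steps 7-1 and 7-2 of Mode 3 might be sketched in Python as follows. The function name `widest_gap_window`, the `(start, end)` row layout and the clamping of q and p at the ends of the list are assumptions made for this sketch and are not specified by the invention. Taking the maximum is used here in place of a full sort, since only the largest interval is needed.

```python
def widest_gap_window(rows, q, p):
    """Return (index, time_min, time_max) around the widest gap in one module.

    rows: list of (start, end) tuples for one module, in instruction order.
    q, p: user-chosen instruction counts before and after the widest gap.
    """
    if len(rows) < 2:
        raise ValueError("need at least two instructions to have a gap")
    # Step 7-1: the gap before instruction n is Start(n) - End(n-1).
    gaps = {n: rows[n][0] - rows[n - 1][1] for n in range(1, len(rows))}
    widest = max(gaps, key=gaps.get)
    # Step 7-2: clamp to the list bounds (an assumption, not in the source).
    first = max(widest - q, 0)
    last = min(widest + p, len(rows) - 1)
    return widest, rows[first][0], rows[last][0]


# Hypothetical timer values for one module.
rows = [(0, 10), (12, 20), (25, 30), (90, 100), (102, 110), (115, 120)]
widest, time_min, time_max = widest_gap_window(rows, q=1, p=1)
print(widest, time_min, time_max)
```

The returned Time min and Time max would then be used as the time range of step 7-3.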

The invention also discloses an AI accelerator chip performance test device, comprising an on-chip data record generation circuit and a performance analyzer. The on-chip data record generation circuit includes a test control circuit, a timer, and an instruction time record aggregation and communication circuit. The test control circuit is connected to the timer and to each module in the chip, and controls the chip to start or end a performance test. The timer generates a time base and supplies it to each module in the chip. Each module of the AI accelerator chip is provided with a time sampling circuit, which samples the run start time and end time of every instruction in its module. The instruction time record aggregation and communication circuit aggregates the generated instruction run times and uploads them to the performance analyzer.

Further, the instruction time record aggregation and communication circuit includes an arbitration and record-holding circuit, an internal RAM and a communication interface.

The arbitration and record-holding circuit writes records into the internal RAM according to a fair round-robin principle; when the AI accelerator chip has X modules, all records are written within X clock cycles.

One set of arbitration and record-holding circuit is used when the number of run cycles of every instruction is greater than the number of AI accelerator chip modules; alternatively, several sets of arbitration and record-holding circuits are used in rotation when the number of run cycles of an instruction is not greater than the number of modules, with the rotation period of each set shorter than the instruction run period.

The internal RAM stores the record data and provides storage status information.

The communication interface is communicatively connected to the performance analyzer, and the records are uploaded to the performance analyzer through the communication interface once the occupancy of the internal RAM reaches a watermark or the instructions finish running.

The performance analyzer is a PC, a tablet, a smartphone or a cloud server.

With the above technical solution, the invention samples and records the start time and end time of every instruction in each module of the chip to form data records; the data records are organized into lists, and the lists are then processed to obtain the instruction run durations and related parameters of each chip module. In addition, the parallel instructions for a specified module or a specified time can be looked up from the lists and shown as printed text or as a graphical display for parallelism analysis. The invention provides performance analysis not only for the computing components but also for the degree of parallelism between the communication components and the computing components. The invention can search for the parallel instructions within a range defined by set conditions and analyze instruction parallelism, thereby providing better support for chip performance optimization.

Brief description of the drawings

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a schematic diagram of the test principle of the invention;

Fig. 2 is a schematic diagram of the principle of the instruction time record aggregation and communication circuit of the invention;

Fig. 3 is a schematic flow diagram of performance test data generation in the invention;

Fig. 4 is a schematic flow diagram of the analysis performed by the performance analyzer of the invention;

Fig. 5 is a schematic diagram of the lists organized by the performance analyzer of the invention;

Fig. 6 is a schematic diagram of the principle of Mode 1 of the list lookup of the invention;

Fig. 7 is a schematic diagram of the principle of Mode 2 of the list lookup of the invention;

Fig. 8 is a schematic diagram of the principle of Mode 3 of the list lookup of the invention;

Fig. 9 is a schematic diagram of the total instruction run time results of the invention;

Fig. 10 is a schematic diagram of the lookup results for parallel instructions within a set time range;

Fig. 11 is a schematic diagram of the lookup results for parallel instructions within the time range corresponding to instructions with specified indexes.

Detailed description of the embodiments

As shown in any of Figs. 1 to 11, the invention discloses an AI accelerator chip performance test device comprising an on-chip data record generation circuit and a performance analyzer. The on-chip data record generation circuit includes a test control circuit, a timer, and an instruction time record aggregation and communication circuit.

The test control circuit is connected to the timer and to each module in the chip, and controls the chip to start or end a performance test. The timer generates a time base and supplies it to each module in the chip. Each module of the AI accelerator chip is provided with a time sampling circuit, which samples the run start time and end time of every instruction in its module. The modules of the AI accelerator chip may be specific computing modules or communication modules responsible for data movement, so the invention can test communication modules as well as computing modules. The instruction time record aggregation and communication circuit aggregates the sampled instruction run time data and uploads it to the performance analyzer.

Specifically, as shown in Fig. 2, the instruction time record aggregation and communication circuit includes an arbitration and record-holding circuit. When several records arrive at the same time, it writes them into the small-capacity internal RAM according to a fair round-robin principle. With X modules, all records can be written in X clock cycles, so the circuit has an applicability condition: the number of run cycles of every instruction must be greater than the number of modules. If this condition is not met, several sets of such circuits are needed, with the number of modules served in rotation by each set smaller than the instruction run period.
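The fair round-robin behavior described above can be modeled in software. The following Python sketch is a behavioral model only, not a description of the actual circuit; the function name `round_robin_drain` and the representation of pending records as a list of booleans are assumptions made for illustration. It shows that when all X modules present a record in the same cycle, the records are drained in X grant cycles.

```python
def round_robin_drain(pending, start=0):
    """Grant one pending module per clock cycle in rotating order.

    pending: list of booleans, True if that module holds a record to write.
    start:   module index that has priority in the first cycle.
    Returns the module indexes in the order they were granted.
    """
    pending = list(pending)          # do not modify the caller's list
    count = len(pending)
    order = []
    pointer = start
    while any(pending):
        # Scan from the priority pointer to the next requesting module.
        for offset in range(count):
            candidate = (pointer + offset) % count
            if pending[candidate]:
                order.append(candidate)            # one RAM write per cycle
                pending[candidate] = False
                pointer = (candidate + 1) % count  # rotate the priority
                break
    return order


grants = round_robin_drain([True, True, True, True], start=2)
print(grants, len(grants))
```

Because each cycle retires exactly one record, a module whose instructions finish more often than once every X cycles could produce a new record before its previous one is written, which is the applicability condition stated above.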

The small-capacity internal RAM stores the record data and provides storage status information. Each written record contains the instruction type, the sequence number within that instruction type (i.e. which instruction it is), and the start and end times of the instruction.

When the memory occupancy reaches a certain watermark (chosen so that the RAM does not overflow), or when all instructions have finished running, the communication interface is started and the records are uploaded to the performance analyzer. The communication interface can be an external interface of the chip, such as an SPI interface or a gigabit interface. Note that the transmission bandwidth of the communication interface must be sufficient to ensure that the small-capacity internal RAM does not overflow.
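The watermark-triggered upload can likewise be illustrated with a small behavioral model in Python. The class name `RecordBuffer`, its parameters and the `upload` callback standing in for the communication interface are all hypothetical; the model also assumes that an upload completes instantly, which a real interface would not.

```python
class RecordBuffer:
    """Behavioral model of the small on-chip RAM with an upload watermark."""

    def __init__(self, capacity, watermark, upload):
        if not 0 < watermark <= capacity:
            raise ValueError("watermark must be between 1 and capacity")
        self.capacity = capacity
        self.watermark = watermark
        self.upload = upload      # callable standing in for the interface
        self.records = []

    def write(self, record):
        """Store one record and upload if the watermark is reached."""
        self.records.append(record)
        if len(self.records) >= self.watermark:
            self.flush()

    def flush(self):
        """Send everything currently buffered to the analyzer."""
        if self.records:
            self.upload(list(self.records))
            self.records.clear()


uploaded = []
buffer = RecordBuffer(capacity=8, watermark=4, upload=uploaded.append)
for record in range(6):
    buffer.write(record)
buffer.flush()  # all instructions finished: send the remainder
print(uploaded)
```

Setting the watermark below the capacity leaves headroom for records that arrive while an upload is still in progress.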

Specifically, as one embodiment, the performance analyzer is a PC, a tablet or a smartphone, and has the following functions:

A) performing data analysis on the uploaded data records;

B) displaying or printing the run time of each instruction;

C) compiling the total run time of each type of instruction;

D) searching for and displaying the instructions running in parallel within a range defined by set conditions.

Further, the invention also discloses an AI accelerator chip performance test method, whose specific steps are as follows.

Steps 1 and 2 of the invention form the performance test data generation flow, shown in Fig. 3.

Step 1: on receiving the instruction run start signal, start the global timer to provide the time base, start the global test, and distribute the test instructions to the modules of the AI accelerator chip. Further, in step 1 the modules of the AI accelerator chip include specific computing modules or communication modules responsible for data movement.

Step 2: sample the start time and end time of every instruction run by each module to form data records, and upload the data records to an external performance analyzer.

Specifically, the start time and end time of each instruction are saved as one record.

Further, as one embodiment, the data records in step 2 are written to a volatile memory and uploaded to the performance analyzer once the memory occupancy reaches a watermark or the instructions finish running. Further, as another embodiment, the data records in step 2 are uploaded directly to the performance analyzer through a communication interface located on the AI accelerator chip, to reduce the use of internal memory.

Steps 3 to 8 of the invention form the analysis flow of the performance analyzer, shown in Fig. 4, where Start(n) denotes the start time of the nth instruction in a module and End(n) denotes the end time of the nth instruction in a module.

Step 3: the performance analyzer organizes the data records into lists by module. The instruction time records of each module are arranged in the list format shown in Fig. 5 to facilitate subsequent analysis, where Start denotes the start time of an instruction and End denotes the end time of an instruction.
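As a minimal sketch of step 3, the Python fragment below groups uploaded records into one list per module. The raw record layout `(module, index, start, end)`, the module names and the function name `group_by_module` are assumptions made for illustration; the actual record format is the one described for the on-chip RAM, and the list layout is the one shown in Fig. 5.

```python
from collections import defaultdict

# Hypothetical uploaded records: (module, instruction index, start, end).
# Times are timer ticks; records may arrive interleaved across modules.
raw_records = [
    ("conv", 0, 120, 480),
    ("load", 0, 10, 100),
    ("load", 1, 110, 200),
    ("save", 0, 500, 560),
    ("conv", 1, 490, 900),
]


def group_by_module(records):
    """Return {module: [(start, end), ...]} ordered by instruction index."""
    by_module = defaultdict(list)
    for module, index, start, end in records:
        by_module[module].append((index, start, end))
    return {
        module: [(start, end) for _, start, end in sorted(rows)]
        for module, rows in by_module.items()
    }


tables = group_by_module(raw_records)
for module, rows in tables.items():
    print(module, rows)
```

Keeping each module's rows in instruction order lets the later steps address the nth instruction of a module by its list position.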

Step 4: process the above lists with a scripting language to calculate the run duration of every instruction. The formula is:

Module_x_Inst_cycle(n) = End(n) - Start(n), 1 ≤ n ≤ index_max

where Start(n) is the start time of the nth instruction in module x; End(n) is the end time of the nth instruction in module x; Module_x_Inst_cycle(n) is the run duration of the nth instruction of module x; and index_max is the largest instruction number.
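The formula of step 4 translates directly into a few lines of Python. In this sketch the function name `instruction_durations` and the sample timer values are hypothetical, and each row is assumed to be a `(Start, End)` pair for one instruction of a single module.

```python
def instruction_durations(rows):
    """Module_x_Inst_cycle(n) = End(n) - Start(n) for every instruction."""
    return [end - start for start, end in rows]


# Hypothetical (Start, End) timer values for three instructions of one module.
load_rows = [(10, 100), (110, 200), (260, 300)]
durations = instruction_durations(load_rows)
print(durations)  # [90, 90, 40]
```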

Step 5: sum Module_x_Inst_cycle(n), the run durations of all instructions of each module, to obtain the total run time of each module, i.e. compile the total instruction run time per module, in order to identify the instructions that take up the most run time.
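A possible Python sketch of step 5 follows. The per-module table layout, the module names and the function name `total_run_time` are assumptions for illustration only.

```python
def total_run_time(tables):
    """Sum End(n) - Start(n) over all instructions of each module."""
    return {
        module: sum(end - start for start, end in rows)
        for module, rows in tables.items()
    }


# Hypothetical per-module lists of (Start, End) timer values.
tables = {
    "load": [(10, 100), (110, 200)],
    "conv": [(120, 480), (490, 900)],
    "save": [(500, 560)],
}
totals = total_run_time(tables)
busiest = max(totals, key=totals.get)  # module with the largest total
print(totals, busiest)
```

Comparing the totals across modules points to the instruction type that dominates the run time, as in the result of Fig. 9.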

Step 6: process the lists with a scripting language to obtain the interval between adjacent instructions in each module. The scripting language is Python, and the formula is:

Module_x_gap_cycle(n) = Start(n) - End(n-1), 2 ≤ n ≤ index_max

where Module_x_gap_cycle(n) is the instruction interval between the nth instruction and the (n-1)th instruction in module x.
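The interval formula of step 6 could be written in Python as below. The function name `instruction_gaps` and the sample values are hypothetical; note that Python lists are zero-based, so list position 1 in the code corresponds to the second instruction.

```python
def instruction_gaps(rows):
    """Module_x_gap_cycle(n) = Start(n) - End(n-1) for adjacent instructions."""
    return [rows[n][0] - rows[n - 1][1] for n in range(1, len(rows))]


# Hypothetical (Start, End) timer values for three instructions of one module.
load_rows = [(10, 100), (110, 200), (260, 300)]
gaps = instruction_gaps(load_rows)
print(gaps)  # [10, 60]
```

A large gap indicates that the module sat idle between two instructions, which is the situation Mode 3 of the list lookup is meant to investigate.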

Step 7: look up from the lists the parallel instruction run results within the specified range to be analyzed. Further, the list lookup in step 7 has three modes, as follows:

Mode 1: look up the parallel instructions within a time range according to a set time. The principle is shown in Fig. 6, where Time min and Time max are the display time range set by the user.
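Mode 1 might be sketched in Python as follows. The source does not define precisely when an instruction counts as lying within the range; this sketch assumes that any instruction whose run interval overlaps [Time min, Time max] qualifies. The function name `instructions_in_window` and the sample tables are hypothetical.

```python
def instructions_in_window(tables, time_min, time_max):
    """Return {module: (first index, last index)} of overlapping instructions.

    An instruction is counted if its (start, end) interval overlaps the
    window; this overlap rule is an assumption, not taken from the source.
    Modules with no instruction in the window are omitted.
    """
    result = {}
    for module, rows in tables.items():
        hits = [
            n for n, (start, end) in enumerate(rows)
            if start <= time_max and end >= time_min
        ]
        if hits:
            result[module] = (hits[0], hits[-1])
    return result


# Hypothetical per-module lists of (Start, End) timer values.
tables = {
    "load": [(10, 100), (110, 200), (260, 300)],
    "conv": [(120, 480), (490, 900)],
    "save": [(500, 560)],
}
parallel = instructions_in_window(tables, time_min=150, time_max=495)
print(parallel)
```

The result gives, for each module, the range of instruction indexes that ran during the chosen window, which is the form of the display results reported in the test case.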

Mode 2: look up the parallel instructions within a time range according to specified instruction sequence numbers of a module. The principle is shown in Fig. 7, where Time min and Time max are the display time range set by the user.
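For Mode 2, one possible reading is that the specified instruction indexes of a module are first converted into a time range, which is then used for the same kind of lookup as in Mode 1. The Python sketch below assumes that Time min is the start of the first specified instruction and Time max is the end of the last one; this derivation, the overlap rule and all names are assumptions made for illustration.

```python
def window_for_indices(rows, first, last):
    """Time range spanned by instructions first..last of one module."""
    return rows[first][0], rows[last][1]


def instructions_in_window(tables, time_min, time_max):
    """Return {module: (first index, last index)} overlapping the window."""
    result = {}
    for module, rows in tables.items():
        hits = [
            n for n, (start, end) in enumerate(rows)
            if start <= time_max and end >= time_min
        ]
        if hits:
            result[module] = (hits[0], hits[-1])
    return result


# Hypothetical per-module lists of (Start, End) timer values.
tables = {
    "load": [(10, 100), (110, 200), (260, 300)],
    "conv": [(120, 480), (490, 900)],
    "save": [(500, 560)],
}
# Which instructions of the other modules ran alongside conv instruction 0?
time_min, time_max = window_for_indices(tables["conv"], first=0, last=0)
parallel = instructions_in_window(tables, time_min, time_max)
print(time_min, time_max, parallel)
```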

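Mode 3: find the index of the instruction with the longest instruction interval in a given module, and look up the parallel instructions within the corresponding time range. Further, as shown in Fig. 8, the specific steps of Mode 3 are:

Step 7-1: sort the instruction intervals in the specified module to obtain the index of the instruction with the largest interval;

Step 7-2: starting from that index, take the start time of the instruction q instructions earlier as Time min and the start time of the instruction p instructions later as Time max, where q and p are set by the user;

Step 7-3: look up the parallel instructions using Time min and Time max as the time range.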

Step 8: output and display the run results of the parallel instructions that were found. Based on the above lookup results, the parallel run results of the instructions are shown as a graphical display or as printed text.

The effect of the invention is illustrated below with a test case on an AI accelerator chip.

1) Fig. 9 shows the displayed total instruction run times of the AI accelerator chip under test, where Load is the total run time of data load instructions, Save is the total run time of result save instructions, Conv is the total run time of convolution instructions, and Pooling is the total run time of pooling instructions.

2) As shown in Fig. 10, the parallel instructions within a set time range of the AI accelerator chip are tested further. The display results are as follows:

Time from 1000 to 15000000 inst seq is:

Load inst seq: [2 : 7141]

save inst seq: [0 : 1749]

conv inst seq: [0 : 311]

misc inst seq: [0 : 271]

3) As shown in Fig. 11, the parallel instructions within the time range corresponding to instructions with specified indexes are tested further (instructions with indexes 1000 to 1200 of module 1 are specified). The display results are as follows:

Time from 2841401 to 3155305 inst seq is:

Load inst seq: [1000 : 1200]

save inst seq: [536 : 639]

conv inst seq: [74 : 81]

misc inst seq: [110 : 130]
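The text output of step 8 can be reproduced with a short formatting routine. The Python sketch below is illustrative only; the function name `format_report` and the dictionary layout are assumptions, while the line format and the sample numbers are taken from the display results above.

```python
def format_report(time_min, time_max, ranges):
    """Render a lookup result in the text layout shown in the test case."""
    lines = [f"Time from {time_min} to {time_max} inst seq is:"]
    for module, (first, last) in ranges.items():
        lines.append(f"{module} inst seq: [{first} : {last}]")
    return "\n".join(lines)


# Values copied from the display results of test 3) above.
ranges = {
    "Load": (1000, 1200),
    "save": (536, 639),
    "conv": (74, 81),
    "misc": (110, 130),
}
report = format_report(2841401, 3155305, ranges)
print(report)
```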

Claims

1. An AI accelerator chip performance test method, characterized in that it comprises the following steps:

Step 3: the performance analyzer organizes the data records into lists by module;

2. The AI accelerator chip performance test method according to claim 1, characterized in that: in step 1 the modules of the AI accelerator chip include specific computing modules or communication modules responsible for data movement.

3. The AI accelerator chip performance test method according to claim 1, characterized in that: in step 2 the AI accelerator chip obtains a time base from a timer and uses that time base to record the start time and end time of each instruction.

4. The AI accelerator chip performance test method according to claim 1, characterized in that: the data records in step 2 are written to a volatile memory and uploaded to the performance analyzer once the memory occupancy reaches a watermark or the instructions finish running.

5. The AI accelerator chip performance test method according to claim 1, characterized in that: the data records in step 2 are uploaded directly to the performance analyzer through a communication interface located on the AI accelerator chip.

6. The AI accelerator chip performance test method according to claim 1, characterized in that: the scripting language in step 4 or 6 is Python.

7. The AI accelerator chip performance test method according to claim 1, characterized in that: the list lookup in step 7 has three modes, as follows:

Mode 3: find the index of the instruction with the longest instruction interval in a given module, and look up the parallel instructions within the corresponding time range.

8. The AI accelerator chip performance test method according to claim 7, characterized in that: the specific steps of Mode 3 are:

9. An AI accelerator chip performance test device, characterized in that: it comprises an on-chip data record generation circuit and a performance analyzer; the on-chip data record generation circuit includes a test control circuit, a timer, and an instruction time record aggregation and communication circuit; the test control circuit is connected to the timer and to each module in the chip, and controls the chip to start or end a performance test; the timer generates a time base and supplies it to each module in the chip; each module of the AI accelerator chip is provided with a time sampling circuit, which samples the run start time and end time of every instruction in its module; and the instruction time record aggregation and communication circuit aggregates the sampled instruction run time data and uploads it to the performance analyzer.

10. The AI accelerator chip performance test device according to claim 9, characterized in that: the instruction time record aggregation and communication circuit includes an arbitration and record-holding circuit, an internal RAM and a communication interface,

The performance analyzer is a PC, a tablet, a smartphone or a cloud server.