
CN118331542B - Data processing method and device - Google Patents


Info

Publication number
CN118331542B
CN118331542B (application number CN202410418712.9A)
Authority
CN
China
Prior art keywords
data
input data
thread
input
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410418712.9A
Other languages
Chinese (zh)
Other versions
CN118331542A (en)
Inventor
Name not disclosed at the applicant's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410418712.9A priority Critical patent/CN118331542B/en
Publication of CN118331542A publication Critical patent/CN118331542A/en
Application granted granted Critical
Publication of CN118331542B publication Critical patent/CN118331542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/60Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7821Tightly coupled to memory, e.g. computational memory, smart memory, processor in memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Processing (AREA)

Abstract

The disclosure relates to the technical field of artificial intelligence, and in particular to a data processing method and device. For each first data block in the output tensor, a target logical address of the target input data corresponding to the positioning data in the input tensor is calculated from the logical coordinates of the positioning data in the first data block. For the loading of each piece of input data, the target logical address is converted into a physical memory address, the input data is loaded from that address, and it is stored into the shared memory in the order indicated by the corresponding logical coordinates, so that a second data block comprising a plurality of input data occupies a continuous space in the shared memory. For the calculation of each piece of output data, the corresponding four input data are obtained in parallel from the second data block according to the corresponding offset information and the target size to serve as interpolation reference points, and bilinear interpolation is carried out with interpolation weights that were calculated and stored in advance to obtain the corresponding output data. The overall time consumption is low, and the method suits application scenarios of various memory architectures that support tensor layout.

Description

Data processing method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a data processing method and device.
Background
Upsampling is widely used in tensor operation processing; it refers to sampling and interpolating elements in two or more dimensions of an input tensor (tensor) to obtain an output tensor that is enlarged in those dimensions, for example bilinear upsampling (Bilinear Upsampling). Two key steps are involved in this process. The first is coordinate mapping: depending on the logical coordinates of the data to be generated on the output tensor and on the sampling mode, the number of source data points to be selected on the input tensor and the mapped coordinates of each source data point differ. The second is sampling calculation: the calculation performed on the selected source data set differs with the sampling mode. In bilinear upsampling, generating one piece of output data requires 4 adjacent pieces of input data. In the related art, the coordinate mapping from output to input is computationally complex, and loading from the resulting addresses of misaligned input data is usually slow, which lowers the running efficiency of the generated assembly-code kernel and in turn makes the performance of computer vision models that use large numbers of bilinear upsampling operators fall far below expectations. How to improve the efficiency and speed of the whole bilinear upsampling process is a technical problem to be solved.
Disclosure of Invention
In view of this, the present disclosure proposes a data processing method and apparatus.
According to an aspect of the present disclosure, there is provided a data processing method applied to a processor of a memory architecture supporting tensor layout, the method including:
For each first data block in the output tensor, calculating a target logical address of target input data corresponding to positioning data in the input tensor according to the logical coordinates of the positioning data in the first data block, wherein the positioning data is one of a plurality of output data included in the first data block;
for loading of each input data, storing the input data acquired according to the target logical address into a shared memory according to the sequence indicated by the corresponding logical coordinates, so as to store second data blocks comprising a plurality of input data in a continuous space of the shared memory, wherein the sizes of the second data blocks and the first data blocks are target sizes;
And for calculation of each output data, according to the corresponding offset information and the target size, acquiring four input data corresponding to the output data from the second data block, and carrying out bilinear interpolation calculation by taking the four input data as interpolation reference points and combining pre-calculated and stored interpolation weights to obtain the corresponding output data.
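The loading step above (storing input data into a continuous space of the shared memory in the order indicated by the logical coordinates) can be illustrated with a minimal single-threaded Python sketch. The function name `load_second_block`, the flat-list stand-in for shared memory, and the border clamping are illustrative assumptions, not details taken from the patent.

```python
def load_second_block(inp, x0, y0, w, h):
    """Copy an h-by-w region of the input tensor whose top-left element
    sits at the target logical address (x0, y0) into a flat buffer in
    row-major (logical-coordinate) order, emulating a second data block
    stored in a continuous space of shared memory.

    Reads past the tensor border are clamped to the last row/column;
    the real loading modes handle this case differently (see the
    second and third loading modes below).
    """
    H, W = len(inp), len(inp[0])
    shared = []
    for dy in range(h):
        for dx in range(w):
            shared.append(inp[min(y0 + dy, H - 1)][min(x0 + dx, W - 1)])
    return shared
```

Because the block is contiguous and row-major, a consumer can later locate the four interpolation reference points for any output point by pure index arithmetic on `shared` instead of recomputing addresses into the input tensor.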
In one possible implementation, the method further includes:
outputting and storing a first data block formed according to the corresponding logical coordinate sequence according to the output data, or
And calculating each output data, and outputting and storing the output data according to the corresponding logic coordinate sequence after obtaining the output data.
In a possible implementation manner, a plurality of threads in an execution unit in the processor are utilized to load input data, the number of output data in the first data block is the same as the number of threads in the execution unit, and each input data in the second data block is from the input tensor;
The method for storing the input data acquired according to the target logical address into the shared memory according to the sequence indicated by the corresponding logical coordinates comprises the following steps:
Controlling a first thread of the plurality of threads to acquire input data from a target physical address corresponding to the target logical address and store the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates, or
Controlling each thread which is different from the first thread in the plurality of threads, determining a logic address of input data to be acquired according to the thread offset of the thread and the target logic address, further acquiring the input data from a physical address determined according to the logic address, and storing the input data into a shared memory according to a sequence indicated by corresponding logic coordinates;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
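The first loading mode above (one thread loads from the target logical address; the others derive their own logical addresses from their thread offsets plus that base) can be sketched as follows. The mapping of a flat thread index to a 2-D displacement, the flat row-major logical-address model, and all names (`thread_load_addresses`, `tensor_w`) are illustrative assumptions.

```python
def thread_load_addresses(target_logical_addr, num_threads, target_w, tensor_w):
    """Compute, for each thread, the flat logical address it loads from.

    Thread 0's displacement is (0, 0), so it loads from the target
    logical address itself; every other thread treats its index as its
    thread offset, maps it to a 2-D displacement within the
    target-size block, and adds that to the base address.
    """
    addrs = []
    for tid in range(num_threads):
        dy, dx = divmod(tid, target_w)  # thread offset as 2-D displacement
        addrs.append(target_logical_addr + dy * tensor_w + dx)
    return addrs
```

Since every thread's address is the base plus a fixed displacement, only one coordinate-mapping computation is needed per first data block rather than one per output point, which is the source of the claimed savings in address-conversion time.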
In a possible implementation manner, the plurality of threads in the execution unit in the processor are utilized to load input data, the number of output data in the first data block is the same as the number of threads in the execution unit, and only input data serving as interpolation reference points in the plurality of input data in the second data block is needed to be from the input tensor;
The method for storing the input data acquired according to the target logical address into the shared memory according to the sequence indicated by the corresponding logical coordinates comprises the following steps:
Controlling each thread to acquire input data from a target physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is required to be acquired according to the thread offset of the thread, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to the logical address, and storing the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates;
Controlling each thread to acquire input data from a target physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is not required to be acquired according to the thread offset of the thread, and storing the input data into a shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
In a possible implementation manner, the plurality of threads in the execution unit in the processor are utilized to load input data, the number of output data in the first data block is the same as the number of threads in the execution unit, and only input data serving as interpolation reference points in the plurality of input data in the second data block is needed to be from the input tensor;
The method for storing the input data acquired according to the target logical address into the shared memory according to the sequence indicated by the corresponding logical coordinates comprises the following steps:
controlling each thread to acquire input data from a physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is required to be acquired according to the thread offset of the thread, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to the logical address, and storing the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates;
controlling each thread to determine a preset value obtained from a preset physical address as input data or directly determine the preset value as input data under the condition that the input data of an interpolation reference point is not required to be obtained according to the thread offset of the thread, and storing the input data into a shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
In one possible implementation, the computation of the output data is performed using a plurality of threads in an execution unit within the processor,
According to the corresponding offset information and the target size, four input data corresponding to the output data are obtained from the second data block, including:
Controlling each thread to acquire input data of the upper left coordinate position in four input data corresponding to the output data from the second data block according to the thread offset of the thread;
Controlling each thread to acquire the remaining three of the four input data corresponding to the output data from the second data block according to the thread offset of the thread and the target size;
The offset information corresponding to the output data to be calculated by each thread is the thread offset of the thread.
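The gather step above can be sketched with simple index arithmetic: with the second data block stored row-major and the row stride equal to the target size's width, the top-left reference point sits at the thread offset and the other three follow from the stride. The function name `gather_refs` and the omission of clamping for the last row/column are illustrative assumptions.

```python
def gather_refs(shared, thread_offset, target_w):
    """Fetch the four interpolation reference points for one thread.

    shared is the flat second data block (row-major, row stride =
    target_w).  The upper-left reference s00 is read directly at the
    thread offset; s01, s10 and s11 are derived from the offset and
    the target size, so no further address computation is needed.
    """
    t = thread_offset
    s00 = shared[t]                  # upper-left, from the thread offset
    s01 = shared[t + 1]              # right neighbour
    s10 = shared[t + target_w]       # neighbour one row below
    s11 = shared[t + target_w + 1]   # diagonal neighbour
    return s00, s01, s10, s11
```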
In a possible implementation manner, in a case where the input tensor includes multiple layers, the output tensor where the first data block is located is a layer 0 input tensor, and the number of the first data blocks in each layer of input tensor is a target number, where the method further includes:
under the condition that the input tensor comprises a plurality of layers, a target physical address corresponding to the target logical address is also determined;
For loading each input data, under the condition that the input data required to be acquired does not belong to the 0 th layer input tensor, storing the input data acquired from the physical address determined according to the target physical address, the first offset corresponding to the layer number of the input tensor where the input data required to be acquired is located and the offset information corresponding to the input data into a shared memory according to the sequence indicated by the corresponding logic coordinates, or storing the input data acquired from the physical address determined according to the target logical address, the first offset corresponding to the layer number of the input tensor where the input data required to be acquired is located and the offset information corresponding to the input data into the shared memory according to the sequence indicated by the corresponding logic coordinates.
In one possible implementation, the positioning data is output data at an upper left position on a boundary among the plurality of output data of the first data block, and/or
And, where the differences between the storage positions of the input data in the shared memory correspond to the differences between the indexes of the threads, the thread offset is the index of the thread.
In one possible implementation, the method further includes:
And storing the determined interpolation weight into a preset register or the shared memory.
According to another aspect of the present disclosure, there is provided a data processing apparatus for use in a processor of a memory architecture supporting tensor placement, the apparatus comprising a control unit and at least one execution unit, each of the execution units comprising a plurality of threads;
The control unit calculates a target logical address of target input data corresponding to positioning data in an input tensor according to the logical coordinates of the positioning data in the first data blocks for each first data block in the output tensor, wherein the positioning data is one of a plurality of output data included in the first data block;
Each thread stores the input data required to be obtained by the thread according to the target logical address into a shared memory according to the sequence indicated by the corresponding logical coordinates for the input data required to be loaded by the thread, so as to store second data blocks comprising a plurality of input data in the continuous space of the shared memory, wherein the sizes of the second data blocks and the first data blocks are both target sizes;
And each thread acquires four input data corresponding to the output data from the second data block according to the corresponding offset information and the target size aiming at the output data required to be calculated by the thread, and carries out bilinear interpolation calculation by taking the four input data as interpolation reference points and combining the pre-calculated and stored interpolation weights to obtain the corresponding output data.
In one possible implementation, the method further includes:
the control unit outputs and stores the first data block formed according to the corresponding logic coordinate sequence according to the output data, or
And after the threads obtain the output data to be calculated, outputting and storing the output data according to the corresponding logic coordinate sequence.
In a possible implementation manner, the number of output data in the first data block is the same as the number of threads in the execution unit, and each input data in the second data block is from the input tensor;
Each thread stores, for input data to be loaded by the thread, the input data to be acquired by the thread acquired according to the target logical address into a shared memory according to an order indicated by corresponding logical coordinates, including:
A first thread in the plurality of threads acquires input data from a target physical address corresponding to the target logical address, and stores the input data into a shared memory according to an order indicated by corresponding logical coordinates;
Each thread which is different from the first thread in the plurality of threads determines a logic address of input data to be acquired according to the thread offset of the thread and the target logic address, acquires the input data from a physical address determined according to the logic address, and stores the input data into a shared memory according to a sequence indicated by corresponding logic coordinates;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
In a possible implementation manner, the number of output data in the first data block is the same as the number of threads in the execution unit, and only input data serving as interpolation reference points in the plurality of input data in the second data block is needed to be from the input tensor;
Each thread stores, for input data to be loaded by the thread, the input data to be acquired by the thread acquired according to the target logical address into a shared memory according to an order indicated by corresponding logical coordinates, including:
Under the condition that a thread determines, according to its thread offset, that input data of an interpolation reference point needs to be acquired, acquiring the input data from a target physical address corresponding to the target logical address, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to that logical address, and storing the input data into the shared memory according to the sequence indicated by the corresponding logical coordinates;
Under the condition that the input data of the interpolation reference point is not required to be acquired according to the thread offset of the thread, acquiring the input data from a target physical address corresponding to the target logical address, and storing the input data to a shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
In a possible implementation manner, the number of output data in the first data block is the same as the number of threads in the execution unit, and only input data serving as interpolation reference points in the plurality of input data in the second data block is needed to be from the input tensor;
Each thread stores, for input data to be loaded by the thread, the input data to be acquired by the thread acquired according to the target logical address into a shared memory according to an order indicated by corresponding logical coordinates, including:
under the condition that a thread determines, according to its thread offset, that input data of an interpolation reference point needs to be acquired, acquiring the input data from a physical address corresponding to the target logical address, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to that logical address, and storing the input data into the shared memory according to the sequence indicated by the corresponding logical coordinates;
Under the condition that a thread determines, according to its thread offset, that input data of an interpolation reference point does not need to be acquired, determining a preset value acquired from a preset physical address as the input data or directly determining the preset value as the input data, and storing the input data into the shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
In one possible implementation manner, for each thread, according to the corresponding offset information and the target size, four input data corresponding to the output data are obtained from the second data block, where the output data are calculated for the thread, and the four input data include:
for output data required to be calculated by the thread, acquiring input data of an upper left coordinate position in four input data corresponding to the output data from the second data block according to the thread offset of the thread;
For output data required to be calculated by the thread, acquiring the rest three input data in four input data corresponding to the output data from the second data block according to the thread offset and the target size of the thread;
The offset information corresponding to the output data to be calculated by each thread is the thread offset of the thread.
In a possible implementation manner, in a case that the input tensor includes a plurality of layers, the output tensor where the first data block is located is a layer 0 input tensor, the number of the first data blocks in each layer input tensor is a target number,
The control unit further determines a target physical address corresponding to the target logical address when determining that the input tensor comprises a plurality of layers;
And each thread stores the input data acquired from the physical address determined according to the target physical address, the first offset corresponding to the layer number of the input tensor where the input data is located and the offset information corresponding to the input data into a shared memory according to the sequence indicated by the corresponding logic coordinates, or stores the input data acquired from the physical address determined according to the target logic address, the first offset corresponding to the layer number of the input tensor where the input data is located and the offset information corresponding to the input data into the shared memory according to the sequence indicated by the corresponding logic coordinates when determining that the input data which is required to be acquired does not belong to the input tensor of the layer 0.
In one possible implementation, the positioning data is output data at an upper left position on a boundary among the plurality of output data of the first data block, and/or
And, where the differences between the storage positions of the input data in the shared memory correspond to the differences between the indexes of the threads, the thread offset is the index of the thread.
In one possible implementation manner, the control unit is further configured to store the determined interpolation weight to a preset register or store the determined interpolation weight to the shared memory.
According to another aspect of the disclosure, there is provided an electronic device comprising a processor, a memory for storing processor-executable instructions, wherein the processor is configured to implement the above-described method when executing the instructions stored in the memory.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
According to the data processing method and device provided by the embodiment of the disclosure, for each first data block in the output tensor, a target logical address of target input data corresponding to positioning data in the input tensor is calculated according to the logical coordinates of the positioning data in the first data block, and the positioning data is one of a plurality of output data included in the first data block. For loading of each input data, the input data acquired according to the target logical address is stored in the shared memory according to the sequence indicated by the corresponding logical coordinates, so that second data blocks comprising a plurality of input data are stored in the continuous space of the shared memory, and the size of the second data blocks and the size of the first data blocks are both the target sizes. For calculation of each output data, according to the corresponding offset information and the target size, four input data corresponding to the output data are obtained from the second data block, and the four input data are used as interpolation reference points to perform bilinear interpolation calculation by combining pre-calculated and stored interpolation weights, so that the corresponding output data are obtained. The loading time of input data is reduced, the time occupied by address conversion is reduced, and the method is suitable for various application scenes of memory architectures supporting tensor layout.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic diagram of an upsampling process in the related art.
Fig. 2 shows a flow chart of a data processing method according to an embodiment of the present disclosure.
Fig. 3 shows a block diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram illustrating a correspondence relationship between output data and interpolation reference points in a one-dimensional tensor scene in a data processing method according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a second data block obtained by loading in the first loading mode in a data processing method according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of a second data block obtained by loading in the second loading mode in the data processing method according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a second data block obtained by loading in a third loading mode in a data processing method according to an embodiment of the present disclosure.
Fig. 8 is a schematic diagram illustrating a correspondence relationship between output data and interpolation reference points in a multi-dimensional tensor scene in a data processing method according to an embodiment of the disclosure.
Fig. 9 is a block diagram illustrating an apparatus 1900 for data processing according to an example embodiment.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
An operator (Operator) is a calculation unit in an artificial intelligence neural network model that performs a corresponding mathematical calculation on the tensors transmitted through the network model. The kinds of operators vary with the operational requirements. An upsampling operator samples and interpolates the input tensor based on a set sampling mode (using interpolation methods such as nearest-neighbor interpolation, bilinear interpolation, or bicubic interpolation) to obtain an enlarged output tensor.
For the upsampling operator based on bilinear interpolation, one output data must be generated by interpolation calculation based on 4 input data; the whole upsampling process is shown in fig. 1. Specifically, the logical coordinates (Ox, Oy) of the output data at a certain point in the output tensor are obtained, and the logical coordinates of the corresponding 4 interpolation reference points are then calculated from them. Assuming the four interpolation reference points are s00, s01, s10 and s11, their logical coordinates can be expressed as (x, y), (x+1, y), (x, y+1) and (x+1, y+1) respectively. The memory addresses (i.e. physical addresses) of the corresponding 4 input data are then calculated from the logical coordinates of the interpolation reference points, and the input data at those memory addresses are loaded as the values of the interpolation reference points. After the values of the 4 interpolation reference points are obtained, the interpolation coefficients (i.e. interpolation weights) are calculated from the logical coordinates of the output data and the magnification factor, and the value of the corresponding output data is then calculated from the values of the interpolation reference points and the interpolation coefficients. Sampling and interpolation calculation of the next output data then proceed until all output data are calculated. The detailed calculation process is as follows:
x_ratio=[(Ox+0.5)/scale_factor_x-0.5]-floor((Ox+0.5)/scale_factor_x-0.5)
y_ratio=[(Oy+0.5)/scale_factor_y-0.5]-floor((Oy+0.5)/scale_factor_y-0.5)
result=(1-y_ratio)*(1-x_ratio)*s00+y_ratio*(1-x_ratio)*s10+x_ratio*(1-y_ratio)*s01+x_ratio*y_ratio*s11
Where scale_factor_x and scale_factor_y represent the magnification of the current up-sampling operation in the x and y directions; x_ratio and y_ratio represent the interpolation coefficients used when calculating the interpolation for the output data at that point; result represents the interpolation calculation result to be stored at the point with logical coordinates (Ox, Oy) on the output tensor (which is also the output data of that point); and s00, s01, s10 and s11 represent the values of the four interpolation reference points, respectively.
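The formulas above can be sketched in Python as follows. Here `load` is a hypothetical callback standing in for the "load input data by memory address" step, and all names are illustrative, not part of the claimed implementation:

```python
import math

def bilinear_sample(out_x, out_y, scale_x, scale_y, load):
    """Compute one output value from 4 interpolation reference points.

    `load(x, y)` is a hypothetical callback returning the input value at
    logical coordinates (x, y); symbol names follow the formulas above.
    """
    # Map the output coordinate back to input space (half-pixel convention).
    src_x = (out_x + 0.5) / scale_x - 0.5
    src_y = (out_y + 0.5) / scale_y - 0.5
    x, y = math.floor(src_x), math.floor(src_y)
    x_ratio, y_ratio = src_x - x, src_y - y
    # The 4 interpolation reference points s00, s01, s10, s11.
    s00, s01 = load(x, y), load(x + 1, y)
    s10, s11 = load(x, y + 1), load(x + 1, y + 1)
    return ((1 - y_ratio) * (1 - x_ratio) * s00
            + y_ratio * (1 - x_ratio) * s10
            + x_ratio * (1 - y_ratio) * s01
            + x_ratio * y_ratio * s11)
```

Note that the four interpolation weights always sum to 1, so a constant input reproduces itself exactly.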
A GPGPU (General-Purpose Graphics Processing Unit) chip software stack refers to a software architecture for performing general-purpose computing tasks on a graphics processor (GPU). The hardware implementation generally used in the related art is a memory architecture with tensor layout, in which the most efficient way to access data is usually an "aligned" layout. That is, if the memory layout is N-byte aligned, it can be regarded as being divided into a plurality of blocks of N bytes each; a memory access command whose size is N or an integer multiple of N, and whose starting point is also the starting point of a block, allows the hardware to achieve faster memory access than accesses of other sizes and starting coordinates. Such accesses are hereinafter referred to as "per-block memory accesses". Correspondingly, hardware in the related art generally accesses a string of memory addresses more directly, i.e. linearly, in buffer (cache) form. By contrast, the efficiency of per-block memory access is generally far better than that of per-buffer memory access.
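As a minimal illustration of the "aligned" layout just described (the helper names are assumptions for illustration, not any real API):

```python
def is_per_block_access(addr, size, n):
    """True when an access of `size` bytes starting at `addr` qualifies as a
    "per-block" access under an N-byte-aligned layout: it starts at a block
    boundary and covers a whole number of N-byte blocks."""
    return addr % n == 0 and size % n == 0

def block_of(addr, n):
    """Which N-byte block an address falls in, and the offset inside it."""
    return addr // n, addr % n
```

For example, with 64-byte blocks, a 128-byte access at address 256 is per-block, while a 64-byte access at address 260 is not, because it straddles a block boundary.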
In the related art, the bilinear upsampling operator is implemented by the direct calculation process of fig. 1 for sampling and interpolation. Because the algorithms are not identical, the address mapping to the input data is often not block-aligned; even under different logical coordinates, the same algorithm will use different mapping rules or parameters, so the result of this address mapping is not contiguous. Thus, the "load interpolation reference point by address" step of the bilinear upsampling operator in the related art cannot directly "load per block"; it can only load, point by point, the input data in the input tensor required by the calculation. A large number of "per-buffer loads" are therefore used: address conversion is performed directly from the logical coordinates of the output points to obtain the corresponding physical addresses, and the required point-by-point loads are then executed based on those physical addresses.
Since 4 input data are needed as interpolation reference points in the bilinear interpolation up-sampling process, and the bilinear interpolation algorithm must consider multiple reordering modes at the boundary because of the corner-alignment (align-corners) rule, the performance loss from warp divergence in the execution units during computation is excessive and execution efficiency suffers. Furthermore, the output-to-input coordinate mapping is complex to compute, and loading based on the acquired addresses of misaligned input data is typically slow. As a result, the bilinear upsampling operator contains a large number of "per-buffer loads" and a large number of "logical address to physical address" translation computations. All of this lowers the running efficiency of the generated assembly code kernel, so the performance of computer vision models that make heavy use of this operator falls far below expectations.
To address this, the related art uses a shared memory to reorder the input data, so that accesses to both the input data and the output data are "per block" as far as possible. However, due to the complexity and breadth of the algorithm, and being limited by the specific hardware platform's form of shared memory support, there is no effective general optimization method for this process; it can only be optimized specifically for particular scenarios covering the operator's inputs and outputs, the algorithm mode, and the characteristics of the hardware platform. In the linear interpolation up-sampling process, data is generally loaded into shared memory per buffer; but because the interpolation reference points required by adjacent output data overlap, a large number of repeated loads and repeated address calculations occur, and since per-buffer loading and buffer address calculation are slower under a memory architecture with tensor layout, performance is low. The related art also pre-computes the whole address calculation result on the host side and places it in a preloaded tensor, converting the device-side calculation into reading that tensor; but the pre-calculation takes a long time and host-side compilation is slow. Meanwhile, this method needs to import the time-consuming rearranged data to the device side and occupies additional device memory, so it is not a feasible solution in business scenarios.
Therefore, how to reduce the number of per-buffer loads and the number of buffer address calculations as much as possible in the linear interpolation up-sampling process is a technical problem that still needs to be solved.
To solve the above technical problem, an embodiment of the present disclosure provides a data processing method and apparatus. For each first data block in the output tensor, the target logical address in the input tensor of the target input data corresponding to the positioning data is calculated according to the logical coordinates of the positioning data in the first data block, where the positioning data is one of the plurality of output data included in the first data block. For the loading of each input data, the input data acquired according to the target logical address is stored into the shared memory in the order indicated by the corresponding logical coordinates, so that a second data block comprising a plurality of input data is stored in a continuous space of the shared memory, the size of the second data block and the size of the first data block both being the target size. For the calculation of each output data, four input data corresponding to the output data are obtained from the second data block according to the corresponding offset information and the target size, and bilinear interpolation calculation is performed using the four input data as interpolation reference points together with the pre-calculated and stored interpolation weights, to obtain the corresponding output data. This reduces the loading time of input data and the time occupied by address conversion, and suits various application scenarios of memory architectures supporting tensor layout.
The embodiment of the disclosure provides a data processing method, applied to a processor with a memory architecture supporting tensor layout, which, as shown in fig. 2, includes steps S101-S103. The method may be performed by the data processing apparatus shown in fig. 3, which comprises a control unit 11 and at least one execution unit (warp) 12, each execution unit 12 comprising a plurality of threads. The number of execution units 12 in the apparatus may be one or more; only one is schematically depicted in fig. 3 for simplicity. The apparatus can be applied to a processor with a memory architecture supporting tensor layout. In this embodiment, the whole data processing process can be implemented in a Single Instruction, Multiple Threads (SIMT) manner to complete the bilinear upsampling. The data processing method is thus applicable to all SIMT architectures with tensor memory layout and efficient per-block memory access capability; only the amount of loaded data (i.e. the target size described below), and the multiplexing of interpolation weights and memory address calculations, need to be adjusted according to the size of the memory block and the mapping relationship between memory data blocks and logical data blocks in a specific hardware scenario.
In step S101, for each first data block in the output tensor, the target logical address in the input tensor of the target input data corresponding to the positioning data is calculated according to the logical coordinates of the positioning data in the first data block, where the positioning data is one of the plurality of output data included in the first data block. In some embodiments, to simplify the address calculation for subsequent input data loading, the positioning data is the output data corresponding to the upper-left boundary position among the input data required for bilinear interpolation calculation of the plurality of output data of the first data block.
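A sketch of step S101 under the half-pixel convention of the earlier formulas; the clamp to 0 at the tensor boundary and all parameter names are assumptions for illustration:

```python
import math

def block_target_coord(block_x, block_y, block_w, block_h, scale_x, scale_y):
    """Map the positioning data (here taken as the top-left output of a first
    data block) to the logical coordinates of the target input data, i.e. the
    top-left of the corresponding second data block."""
    ox = block_x * block_w          # logical x of the positioning data
    oy = block_y * block_h          # logical y of the positioning data
    # Same output-to-input mapping as the x_ratio/y_ratio formulas, floored,
    # clamped at the tensor edge.
    ix = max(0, math.floor((ox + 0.5) / scale_x - 0.5))
    iy = max(0, math.floor((oy + 0.5) / scale_y - 0.5))
    return ix, iy
```

For a 4*8 block at block position (1, 0) under 2x magnification, the positioning data has logical x = 8, which maps back to input column floor(3.75) = 3.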
In step S102, for the loading of each input data, the input data acquired according to the target logical address is stored into the shared memory in the order indicated by the corresponding logical coordinates, so that a second data block comprising a plurality of input data is stored in a continuous space of the shared memory, where the size of the second data block and the size of the first data block are both the target size. In this way, the 4 per-buffer loads of the 4 interpolation reference points of each output data in the related art are reduced to 1 load, greatly reducing time consumption.
In step S103, for the calculation of each output data, four input data corresponding to the output data are obtained from the second data block according to the corresponding offset information and the target size, and bilinear interpolation calculation is performed using the four input data as interpolation reference points together with the interpolation weights calculated and stored in advance, to obtain the corresponding output data. In this way, each input data is stored at a separate location in the shared memory, keeping the shared memory aligned so as to avoid possible bank conflicts.
In this embodiment, step S101 may be performed by the control unit 11 shown in fig. 3, and the loading of each input data in step S102 and the calculation of each output data in step S103 may be performed by the corresponding thread in the execution unit 12.
In some embodiments, the interpolation weights may be determined by the control unit 11 in advance before the step S103 is performed, and stored in a preset register or stored in the shared memory.
In some embodiments the method may further comprise storing the first data block formed from each of said output data in the order of their corresponding logical coordinates, which step may be performed by the control unit 11. Alternatively, in the calculation of each output data in step S103, after the output data is obtained, it is output and stored in the order of its corresponding logical coordinates. Thus each output data in the first data block can be stored per block, improving overall efficiency and speed.
For a more intuitive and clear illustration of the working principle and procedure of the data processing method and apparatus, the method and apparatus provided by the embodiments of the present disclosure are schematically illustrated below with reference to fig. 4-8 by the example of a 2-fold magnification bilinear upsampling procedure using an apparatus including an execution unit 12 provided with 32 threads.
As shown in fig. 4, in the process of performing bilinear upsampling on the input tensor to obtain an output tensor enlarged 2 times according to the method of the embodiment of the present disclosure, when the number of output data in the first data block is the same as the number of threads in the execution unit 12, each thread may perform the loading of one designated input data and the calculation of one output data, so that overall efficiency and speed may be improved; each input data is stored at an independent location of the shared memory, keeping the shared memory aligned with the number of threads so as to avoid possible memory bank conflicts. The target size may then be set according to the number of threads; for example, the number of data in the target size may be the same as the number of threads, and the length and width of the target size may be set according to the needs of the interpolation reference points, so as to ensure that the interpolation reference points (i.e. the input data) for calculating each output data in the first data block are all loaded into the second data block. The output tensor may be divided into a plurality of first data blocks according to the target size, for example into a plurality of 4*8 first data blocks. The 4 input data of the input tensor serving as interpolation reference points for each output data of any one first data block can be calculated, and the second data block corresponding to that first data block can then be obtained from the input tensor according to the calculation result and stored in the shared memory in a 4*8 arrangement.
For example, the 4 input data serving as interpolation reference points of output data 1 in the first data block are input data 1, 2, 9 and 10 in the second data block; the 4 input data serving as interpolation reference points of output data 2 in the first data block are also input data 1, 2, 9 and 10 in the second data block; the 4 input data serving as interpolation reference points of output data 3 in the first data block are input data 2, 3, 10 and 11 in the second data block; and so on. Through calculation, as shown in fig. 4, it can be determined that with 2x magnification and the output data of the first data block arranged as 4*8, the interpolation reference points corresponding to all output data in the first data block follow the rule shown in table 1 below; that is, the interpolation reference points required to calculate each output data are only the input data within the upper-left 3*5 range of the second data block.
Table 1 Interpolation reference point example

All upper-left interpolation points: 1, 2, 3, 4, 9, 10, 11, 12
All upper-right interpolation points: 2, 3, 4, 5, 10, 11, 12, 13
All lower-left interpolation points: 9, 10, 11, 12, 17, 18, 19, 20
All lower-right interpolation points: 10, 11, 12, 13, 18, 19, 20, 21
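The pattern in Table 1 can be checked with a short sketch. The 2x magnification, 4*8 block shape, half-pixel convention, and clamping at the block origin are assumptions inferred from the example above:

```python
import math

def reference_indices(rows=4, cols=8, scale=2):
    """For every output in a rows×cols first data block, collect the 1-based
    indices of its 4 interpolation reference points within the equally sized
    second data block, grouped by corner (ul/ur/ll/lr)."""
    def src(o):
        # Local coordinate of the upper-left reference point, clamped at 0.
        return max(0, math.floor((o + 0.5) / scale - 0.5))
    sets = {"ul": set(), "ur": set(), "ll": set(), "lr": set()}
    for r in range(rows):
        for c in range(cols):
            x, y = src(c), src(r)
            sets["ul"].add(y * cols + x + 1)
            sets["ur"].add(y * cols + x + 2)
            sets["ll"].add((y + 1) * cols + x + 1)
            sets["lr"].add((y + 1) * cols + x + 2)
    return sets
```

Running it reproduces the four rows of Table 1, and every referenced index indeed falls inside the upper-left 3*5 region of the second data block.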
It can be seen that, for a given magnification and number of threads, obtaining a second data block of the same size as the first data block ensures that all interpolation reference points for calculating the output data in the first data block are loaded. In the process of loading input data using the multiple threads of the execution unit in the processor, when the number of output data in the first data block is the same as the number of threads in the execution unit, the following different implementations exist for the loading of each input data in step S102.
Loading mode one:
In one setting, each input data in the second data block comes from the input tensor. For example, input data 1-32 in fig. 4 are all data from the input tensor. The loading of each input data in step S102 may then include:
Controlling a first thread of the plurality of threads to acquire input data from a target physical address corresponding to the target logical address and store the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates, or
Controlling each thread which is different from the first thread in the plurality of threads, determining a logic address of input data to be acquired according to the thread offset of the thread and the target logic address, further acquiring the input data from a physical address determined according to the logic address, and storing the input data into a shared memory according to a sequence indicated by corresponding logic coordinates;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
For example, for the example shown in figs. 4 and 5, when input data is loaded in loading mode one, thread 1 acts as the first thread to load input data 1: it obtains input data 1 from the target physical address corresponding to the target logical address (e.g. (0, 0)) and stores input data 1 into the shared memory in the order indicated by the corresponding logical coordinates. Thread 2 determines the logical address (1, 0) of the input data to be acquired according to its thread offset (e.g. 2) and the target logical address (0, 0), acquires input data 2 from the physical address determined from logical address (1, 0), and stores input data 2 into the shared memory in the order indicated by the corresponding logical coordinates; and so on, threads 3-32 also finish loading their corresponding input data. Finally, the second data block shown in fig. 5 is obtained.
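Loading mode one can be simulated serially in Python. The flat row-major layout, the 0-based thread index used as the thread offset, and the clamping at the tensor edge are assumptions for illustration:

```python
def load_second_block(target, block_w, block_h, input_tensor, in_w, in_h):
    """Simulate loading mode one: thread t loads one input datum at logical
    coordinates target + (t % block_w, t // block_w) and stores it at slot t
    of the shared buffer, so the whole second data block lands in a
    contiguous region of "shared memory"."""
    tx, ty = target
    shared = [0.0] * (block_w * block_h)
    for t in range(block_w * block_h):       # one iteration per thread
        x = min(tx + t % block_w, in_w - 1)  # logical address from thread offset
        y = min(ty + t // block_w, in_h - 1)
        shared[t] = input_tensor[y * in_w + x]
    return shared
```

With an 8x8 input tensor holding its own flat indices and target (0, 0), slot 9 of the 8x4 second block receives the value at input row 1, column 1.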
And a loading mode II:
In another setting, only those of the plurality of input data in the second data block that are needed as interpolation reference points come from the input tensor. For example, in the example shown in fig. 4, only the input data in the upper-left 3*5 area of the second data block need actually come from the input tensor; since the input data in the remaining area are not used in step S103, they may be loaded according to a setting — for example, to improve efficiency and speed, the target input data in the input tensor may be designated as the input data for all of them. The loading of each input data in step S102 may then include:
Controlling each thread to acquire input data from a target physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is required to be acquired according to the thread offset of the thread, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to the logical address, and storing the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates;
Controlling each thread to acquire input data from a target physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is not required to be acquired according to the thread offset of the thread, and storing the input data into a shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
For example, for the example shown in figs. 4 and 6, when input data is loaded in loading mode two, thread 1 acts as the first thread to load input data 1: it obtains input data 1 from the target physical address corresponding to the target logical address (e.g. (0, 0)) and stores input data 1 into the shared memory in the order indicated by the corresponding logical coordinates. Thread 2 determines the logical address (1, 0) of the input data to be acquired according to its thread offset (e.g. 2) and the target logical address (0, 0), acquires input data 2 from the physical address determined from logical address (1, 0), and stores input data 2 into the shared memory in the order indicated by the corresponding logical coordinates; and so on, threads 3-5, 9-13 and 17-21 finish loading their corresponding input data. Each of the other threads can, like thread 1, acquire input data 1 from the target physical address corresponding to the target logical address (e.g. (0, 0)) and store input data 1 into the shared memory in the order indicated by the logical coordinates corresponding to that thread. Finally, the second data block shown in fig. 6 is obtained.
And a loading mode III:
Loading mode three likewise applies when only those of the plurality of input data in the second data block needed as interpolation reference points come from the input tensor. For example, in the example shown in fig. 4, only the input data in the upper-left 3*5 area of the second data block need actually come from the input tensor; since the input data in the remaining area are not used in step S103, they may be loaded according to a setting — for example, a certain preset value may be loaded as the input data and stored in the shared memory. The preset value may be stored in the shared memory or a designated register for each thread to acquire and store, or a specific preset value, for example "0", may be agreed in advance and stored directly by each thread. The loading of each input data in step S102 may then include:
controlling each thread to acquire input data from a physical address corresponding to the target logical address under the condition that the input data of an interpolation reference point is required to be acquired according to the thread offset of the thread, or determining the logical address of the input data to be acquired according to the thread offset of the thread and the target logical address, further acquiring the input data from the physical address determined according to the logical address, and storing the input data into a shared memory according to the sequence indicated by the corresponding logical coordinates;
controlling each thread to determine a preset value obtained from a preset physical address as input data or directly determine the preset value as input data under the condition that the input data of an interpolation reference point is not required to be obtained according to the thread offset of the thread, and storing the input data into a shared memory according to a storage position corresponding to the thread offset;
the offset information corresponding to the input data to be loaded by each thread is the thread offset of the thread.
For example, for the example shown in figs. 4 and 7, when input data is loaded in loading mode three, thread 1 acts as the first thread to load input data 1: it obtains input data 1 from the target physical address corresponding to the target logical address (e.g. (0, 0)) and stores input data 1 into the shared memory in the order indicated by the corresponding logical coordinates. Thread 2 determines the logical address (1, 0) of the input data to be acquired according to its thread offset (e.g. 2) and the target logical address (0, 0), acquires input data 2 from the physical address determined from logical address (1, 0), and stores input data 2 into the shared memory in the order indicated by the corresponding logical coordinates; and so on, threads 3-5, 9-13 and 17-21 finish loading their corresponding input data. Each of the other threads stores the preset value, as input data, into the shared memory in the order indicated by its corresponding logical coordinates. Finally, the second data block shown in fig. 7 is obtained.
In some embodiments, if there is a correspondence between a difference in storage locations of input data in the shared memory and a difference in index between the threads, the thread offset is the index of the thread. The difference between the storage positions of the input data in the shared memory may be the same as the difference between the indexes of the threads, and the indexes of the threads may be the serial numbers of the threads, so that in the implementation process of the loading modes one, two and three, address calculation may be directly performed based on the indexes of the threads.
Continuing with the example of "32 threads for 2x magnification bilinear upsampling" shown in fig. 4: to increase efficiency and speed, each thread may be designated to perform the calculation of one output data, and the input data load and output data calculation performed by each thread may be designated as data with the same logical coordinates; for example, thread 1 performs the load of input data 1 and the calculation of output data 1 in fig. 4. When the number of output data in the first data block is the same as the number of threads in the execution unit 12, step S103 may comprise:
Controlling each thread to acquire input data of the upper left coordinate position in four input data corresponding to the output data from the second data block according to the thread offset of the thread;
acquiring the rest three input data of the four input data corresponding to the output data from the second data block according to the thread offset of the thread and the target size;
The offset information corresponding to the output data to be calculated by each thread is the thread offset of the thread.
For example, as shown in fig. 4, thread 1 calculates output data 1: it calculates the address in the shared memory of the input data at the upper-left coordinate position according to its thread index and the base address of the second data block in the shared memory, and thereby obtains input data 1 at the upper-left coordinate position; it then calculates the addresses in the shared memory of the other three input data according to the target size, the offsets between the other interpolation reference points and the upper-left coordinate position, and the base address of the second data block in the shared memory, thereby obtaining the remaining three input data 2, 9 and 10, and then calculates output data 1. Similarly, the other threads synchronously acquire the input data of their interpolation reference points and calculate the corresponding output data.
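Step S103 for a single thread can be sketched as follows. The clamped 2x upper-left slot mapping, the flat row-major shared buffer, and all names are assumptions carried over from the earlier sketches:

```python
import math

def compute_output(t, shared, block_w, weights):
    """Thread t fetches the upper-left reference point at the slot derived
    from its thread offset, derives the other three slots from the target
    size (block_w), and performs the bilinear interpolation."""
    c, r = t % block_w, t // block_w
    x = max(0, math.floor((c + 0.5) / 2 - 0.5))   # local col of s00
    y = max(0, math.floor((r + 0.5) / 2 - 0.5))   # local row of s00
    s00 = shared[y * block_w + x]
    s01 = shared[y * block_w + x + 1]              # right neighbour
    s10 = shared[(y + 1) * block_w + x]            # one row below
    s11 = shared[(y + 1) * block_w + x + 1]        # below-right
    xr, yr = weights                               # precomputed x_ratio, y_ratio
    return ((1 - yr) * (1 - xr) * s00 + yr * (1 - xr) * s10
            + xr * (1 - yr) * s01 + xr * yr * s11)
```

With weights (0, 0) the result collapses to s00, which makes the slot arithmetic easy to check against fig. 4.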
In this way, the above steps S101 to S103 are performed in a loop for each first data block, and the calculation stops once every output data in the output tensor has been calculated.
The above describes performing bilinear upsampling on one layer of input tensor to obtain one layer of output tensor. When the input tensor includes multiple layers, the output tensor where the first data block is located corresponds to the layer-0 input tensor, and the number of first data blocks in each layer of input tensor is the target number. The number of layers of a tensor may refer to its dimension; as shown in fig. 8, for a bilinear interpolation up-sampling process on a plane composed of multiple dimensions, the points at all coordinates (called equivalent enlarged positions) on an axis perpendicular to that plane have the same interpolation weights, so the tensors of layers other than layer 0 can multiplex the logical coordinate calculation process of layer 0. In hardware memory with tensor layout, the approach of adding an offset to the target physical address can be multiplexed directly, thereby accelerating the calculation process of converting the logical coordinates of different blocks into physical addresses. Then:
In step S101, in the case where it is determined that the input tensor includes a plurality of layers, a target physical address corresponding to the target logical address is also determined.
In step S102, if the physical addresses of the input tensors of different layers can be calculated by an offset, then for the loading of each input data: when the input data to be acquired is determined to belong to the layer-0 input tensor, the input data acquired according to the target logical address is stored into the shared memory in the order indicated by the corresponding logical coordinates; when the input data to be acquired does not belong to the layer-0 input tensor, the input data acquired from the physical address determined according to the target physical address, the first offset corresponding to the layer number of the input tensor where the input data is located, and the offset information corresponding to the input data, is stored into the shared memory in the order indicated by the corresponding logical coordinates. That is, based on the target physical address addr_mem_o1(Ox0, Oy0), the physical address of each equivalent enlarged position is obtained as addr_mem_o1(Ox0, Oy0) + layer number × total number of first data blocks in a single layer × data amount corresponding to the target size, so that the loading of each input data of the non-zero layers is realized.
If the physical addresses of the input tensors of different layers cannot be calculated through an offset, then for the loading of each input data: when the input data to be acquired is determined to belong to the layer-0 input tensor, the input data acquired according to the target logical address is stored into the shared memory in the order indicated by the corresponding logical coordinates; when the input data to be acquired does not belong to the layer-0 input tensor, the input data acquired from the physical address determined according to the target logical address, the first offset corresponding to the layer number of the input tensor where the input data is located, and the offset information corresponding to the input data, is stored into the shared memory in the order indicated by the corresponding logical coordinates.
In this way, the time-consuming mapping of coordinates to physical addresses in the computation of each output data during bilinear-interpolation upsampling can be reused across all equivalent enlargement positions (i.e., the corresponding positions on different tensor layers). This maximizes the data reuse rate and reduces the overall workload of physical address calculation.
With the data processing method provided by the embodiments of the present disclosure, the target sizes of the first data block and the second data block can be adjusted for different application scenarios, such as the hardware architecture on which the method actually executes and the interpolation weights, and the calculation mode of the address mapping relation can be set adaptively, so that bilinear upsampling can be adapted to different application scenarios.
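To make the per-thread computation concrete, the following sketch shows how one output value is derived from the four interpolation reference points in the second data block together with pre-computed interpolation weights. It is an illustrative simplification (function and parameter names are assumptions, and the block is modeled as a flat Python list rather than shared memory):

```python
def bilinear_output(block, row, col, stride, wx, wy):
    """Compute one output value from the four interpolation reference points.

    block: flat list holding the second data block in shared-memory order
    row, col: logical coordinates of the upper-left reference point
    stride: row stride of the block, i.e. the target-size width
    wx, wy: pre-computed horizontal and vertical interpolation weights in [0, 1]
    """
    tl = block[row * stride + col]            # upper-left reference point
    tr = block[row * stride + col + 1]        # upper-right
    bl = block[(row + 1) * stride + col]      # lower-left
    br = block[(row + 1) * stride + col + 1]  # lower-right
    # Interpolate horizontally along the two rows, then vertically between them.
    top = tl * (1 - wx) + tr * wx
    bottom = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bottom * wy
```

Because the weights wx and wy depend only on the scaling factor and the output position within a block, they can be computed once and stored (e.g., in a preset register or in the shared memory, as the disclosure notes) rather than recomputed for every output.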
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform the methods described in the foregoing method embodiments; for their specific implementations, reference may be made to the descriptions of those method embodiments, which are not repeated here for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
Embodiments of the present disclosure also provide an electronic device, which includes a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to implement the above method when executing the instructions stored in the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
Fig. 9 is a block diagram illustrating an apparatus 1900 for data processing according to an example embodiment. For example, the apparatus 1900 may be provided as a server or electronic device. Referring to fig. 9, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that are executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the scenario involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, which electronic circuitry can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A data processing method, applied to a processor with a memory architecture supporting tensor layout, the method comprising:
for each first data block in an output tensor, calculating, according to logical coordinates of positioning data in the first data block, a target logical address of target input data in an input tensor corresponding to the positioning data, the positioning data being one of a plurality of output data included in the first data block;
for the loading of each input data, storing the input data acquired according to the target logical address into a shared memory in the order indicated by the corresponding logical coordinates, so as to store, in a continuous space of the shared memory, a second data block including a plurality of input data, the size of the second data block and the size of the first data block both being a target size;
for the calculation of each output data, acquiring, from the second data block according to the corresponding offset information and the target size, four input data corresponding to the output data, and performing a bilinear interpolation calculation using the four input data as interpolation reference points in combination with pre-calculated and stored interpolation weights, to obtain the corresponding output data;
wherein each input data is loaded by a respective one of a plurality of threads in an execution unit of the processor, and the number of output data in the first data block is equal to the number of threads in the execution unit.
2. The method according to claim 1, further comprising:
outputting and storing a first data block formed from the output data in the order of the corresponding logical coordinates; or
for the calculation of each output data, outputting and storing the output data, after it is obtained, in the order of the corresponding logical coordinates.
3. The method according to claim 1, wherein each input data in the second data block comes from the input tensor;
wherein storing the input data acquired according to the target logical address into the shared memory in the order indicated by the corresponding logical coordinates comprises:
controlling a first thread of the plurality of threads to acquire input data from a target physical address corresponding to the target logical address, and to store the input data into the shared memory in the order indicated by the corresponding logical coordinates; or
controlling each thread of the plurality of threads other than the first thread to determine, according to the thread offset of that thread and the target logical address, the logical address of the input data to be acquired, to acquire the input data from the physical address determined from that logical address, and to store the input data into the shared memory in the order indicated by the corresponding logical coordinates;
wherein the offset information corresponding to the input data to be loaded by each thread is the thread offset of that thread.
4. The method according to claim 1, wherein, among the plurality of input data in the second data block, only the input data needed as interpolation reference points come from the input tensor;
wherein storing the input data acquired according to the target logical address into the shared memory in the order indicated by the corresponding logical coordinates comprises:
controlling each thread, in the case that it is determined according to the thread offset of that thread that input data of an interpolation reference point needs to be acquired, to acquire the input data from the target physical address corresponding to the target logical address, or to determine the logical address of the input data to be acquired according to the thread offset of that thread and the target logical address and then acquire the input data from the physical address determined from that logical address, and to store the input data into the shared memory in the order indicated by the corresponding logical coordinates;
controlling each thread, in the case that it is determined according to the thread offset of that thread that no input data of an interpolation reference point needs to be acquired, to acquire input data from the target physical address corresponding to the target logical address, and to store the input data into the shared memory at the storage location corresponding to that thread offset;
wherein the offset information corresponding to the input data to be loaded by each thread is the thread offset of that thread.
5. The method according to claim 1, wherein, among the plurality of input data in the second data block, only the input data needed as interpolation reference points come from the input tensor;
wherein storing the input data acquired according to the target logical address into the shared memory in the order indicated by the corresponding logical coordinates comprises:
controlling each thread, in the case that it is determined according to the thread offset of that thread that input data of an interpolation reference point needs to be acquired, to acquire the input data from the physical address corresponding to the target logical address, or to determine the logical address of the input data to be acquired according to the thread offset of that thread and the target logical address and then acquire the input data from the physical address determined from that logical address, and to store the input data into the shared memory in the order indicated by the corresponding logical coordinates;
controlling each thread, in the case that it is determined according to the thread offset of that thread that no input data of an interpolation reference point needs to be acquired, to take as input data a preset value acquired from a preset physical address, or to take the preset value directly as input data, and to store the input data into the shared memory at the storage location corresponding to that thread offset;
wherein the offset information corresponding to the input data to be loaded by each thread is the thread offset of that thread.
6. The method according to any one of claims 3 to 5, wherein the output data are calculated by the plurality of threads in the execution unit of the processor,
wherein acquiring, from the second data block according to the corresponding offset information and the target size, the four input data corresponding to the output data comprises:
controlling each thread to acquire, from the second data block according to the thread offset of that thread, the input data at the upper-left coordinate position among the four input data corresponding to the output data;
controlling each thread to acquire, from the second data block according to the thread offset of that thread and the target size, the remaining three of the four input data corresponding to the output data;
wherein the offset information corresponding to the output data to be calculated by each thread is the thread offset of that thread.
7. The method according to claim 1, wherein, in the case that the input tensor includes a plurality of layers, the output tensor in which the first data block is located is associated with the layer-0 input tensor, and the number of first data blocks in each layer of the input tensor is a target number, the method further comprising:
in the case that it is determined that the input tensor includes a plurality of layers, further determining the target physical address corresponding to the target logical address;
for the loading of each input data, in the case that it is determined that the input data to be acquired does not belong to the layer-0 input tensor, storing into the shared memory, in the order indicated by the corresponding logical coordinates, the input data acquired from the physical address determined according to the target physical address, the first offset corresponding to the layer number of the input tensor in which the input data to be acquired is located, and the offset information corresponding to that input data; or storing into the shared memory, in the order indicated by the corresponding logical coordinates, the input data acquired from the physical address determined according to the target logical address, the first offset corresponding to the layer number of the input tensor in which the input data to be acquired is located, and the offset information corresponding to that input data.
8. The method according to claim 6, wherein the positioning data is the output data at the upper-left boundary position among the plurality of output data of the first data block, and/or,
if there is a correspondence between the differences of the storage locations of the input data in the shared memory and the differences of the indices of the threads, the thread offset is the index of the thread.
9. The method according to claim 1, further comprising:
storing the determined interpolation weights in a preset register or in the shared memory.
10. A data processing apparatus, applied to a processor with a memory architecture supporting tensor layout, the apparatus comprising a control unit and at least one execution unit, each execution unit including a plurality of threads;
the control unit is configured to, for each first data block in an output tensor, calculate, according to logical coordinates of positioning data in the first data block, a target logical address of target input data in an input tensor corresponding to the positioning data, the positioning data being one of a plurality of output data included in the first data block;
each thread is configured to, for the input data that the thread needs to load, store the input data acquired by the thread according to the target logical address into a shared memory in the order indicated by the corresponding logical coordinates, so as to store, in a continuous space of the shared memory, a second data block including a plurality of input data, the size of the second data block and the size of the first data block both being a target size;
each thread is further configured to, for the output data that the thread needs to calculate, acquire, from the second data block according to the corresponding offset information and the target size, four input data corresponding to the output data, and perform a bilinear interpolation calculation using the four input data as interpolation reference points in combination with pre-calculated and stored interpolation weights, to obtain the corresponding output data;
wherein the number of output data in the first data block is equal to the number of threads in the execution unit.
11. An electronic device, comprising:
a processor; and
a memory for storing instructions executable by the processor;
wherein the processor is configured to implement the method according to any one of claims 1 to 9 when executing the instructions stored in the memory.
12. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 9.
13. A computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the method according to any one of claims 1 to 9.
CN202410418712.9A 2024-04-08 2024-04-08 Data processing method and device Active CN118331542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410418712.9A CN118331542B (en) 2024-04-08 2024-04-08 Data processing method and device


Publications (2)

Publication Number Publication Date
CN118331542A CN118331542A (en) 2024-07-12
CN118331542B true CN118331542B (en) 2025-05-13

Family

ID=91767467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410418712.9A Active CN118331542B (en) 2024-04-08 2024-04-08 Data processing method and device

Country Status (1)

Country Link
CN (1) CN118331542B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116385714A (en) * 2023-04-10 2023-07-04 寒武纪(西安)集成电路有限公司 Image processing method and related products
CN117575888A (en) * 2023-10-19 2024-02-20 北京地平线信息技术有限公司 Methods, devices, media and equipment for accelerating calculations on feature data

Family Cites Families (10)

Publication number Priority date Publication date Assignee Title
US12142016B2 (en) * 2021-06-17 2024-11-12 Nvidia Corporation Fused processing of a continuous mathematical operator
US11704067B2 (en) * 2021-08-02 2023-07-18 Nvidia Corporation Performing multiple point table lookups in a single cycle in a system on chip
CN113836049B (en) * 2021-09-17 2023-08-08 海飞科(南京)信息技术有限公司 Memory access method and electronic device
KR102733032B1 (en) * 2021-12-27 2024-11-20 서울대학교산학협력단 Apparatus and method for address generation of multi-dimensional tensor
CN116360858B (en) * 2023-05-26 2023-08-29 摩尔线程智能科技(北京)有限责任公司 Data processing method, graphics processor, electronic device and storage medium
CN117056096A (en) * 2023-08-04 2023-11-14 苏州浪潮智能科技有限公司 Shared memory management method, device, electronic equipment and storage medium
CN117075918B (en) * 2023-10-13 2024-01-09 之江实验室 Model deployment method and device, storage medium and electronic equipment
CN117271136B (en) * 2023-10-20 2025-03-18 上海壁仞科技股份有限公司 Data processing method, device, equipment and storage medium
CN117539798B (en) * 2023-11-10 2025-04-18 上海壁仞科技股份有限公司 Data processing method, computing device and computer readable storage medium
CN117478816B (en) * 2023-12-22 2024-03-15 北京壁仞科技开发有限公司 Bilinear interpolation upsampling method, system, device and medium



Similar Documents

Publication Publication Date Title
CN110046702B (en) Neural Network Computing Accelerator and Method of Execution
KR102492477B1 (en) Matrix multiplier
US10140251B2 (en) Processor and method for executing matrix multiplication operation on processor
CN117785480B (en) Processor, reduction calculation method and electronic device
WO2019127731A1 (en) Convolutional neural network hardware acceleration device, convolutional calculation method and storage medium
US10268451B2 (en) Method and processing apparatus for performing arithmetic operation
US10642622B2 (en) Arithmetic processing device and control method of the arithmetic processing device
US9910714B2 (en) Scriptable dynamic load balancing in computer systems
US20170206089A1 (en) Information processing apparatus and computational method
US20220171828A1 (en) Selection of pauli strings for variational quantum eigensolver
US10078596B2 (en) Data processing method, computer readable medium and data processing device
JP2017078934A (en) Calculation method of convolution neural network, calculation program, and information processor
JP7012766B2 (en) How to reduce parameter table storage space, appliances, devices, and computer readable storage media
US11580193B2 (en) Computation device, computation method, and program
GB2619919A (en) Hardware implementation of an attention-based neural network
CN118331542B (en) Data processing method and device
CN116089786B (en) Method, device and medium for triangular matrix inversion
CN111552478A (en) Apparatus, method and storage medium for generating CUDA programs
CN110060195B (en) Data processing method and device
US11657324B2 (en) Method, electronic device, and computer program product for processing data
JP2020123125A (en) Computation processing device, computation processing method, and program
US11221851B2 (en) Method executed by computing device, apparatus, device and computer-readable storage medium
EP4202774A1 (en) Runtime predictors for neural network computation reduction
CN118520920B (en) Method and device for optimizing valued class operator on NPU
US20220405204A1 (en) Computer-readable recording medium storing data placement program, processor, and data placement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant