RU2830044C1 - Vector computing device

Info

Publication number: RU2830044C1
Application number: RU2024125062A
Authority: RU
Inventors: Алексей Михайлович Попов; Кирилл Алексеевич Королев; Елена Сергеевна Молочко; Сергей Рудольфович Шевцов; Данила Игоревич Хайдуков; Андрей Александрович Шишпанов; Артем Сергеевич Сенин
Original assignee: Акционерное Общество "Софит"
Filing date: 2024-08-27
Publication date: 2024-11-11

Abstract

FIELD: microprocessors.

SUBSTANCE: the invention relates to the field of microprocessors, and in particular to a vector computing device. The device is placed on a chip and comprises: scalar devices, each of which consists of at least two scalar modules forming at least one scalar line, at least one first demultiplexer, a complex arithmetic operations unit, and at least one second demultiplexer; a multi-port shared memory; and a horizontal operations unit.

EFFECT: higher operating speed of the vector processor.

15 cl, 4 dwg

Description

TECHNICAL FIELD

[0001] The claimed technical solution relates generally to the field of microprocessors, and in particular to a vector computing device.

BACKGROUND ART

[0002] A large volume of data operations is performed today. As neural networks have been developed and adopted in all areas of activity, a need has arisen for devices that can work with them efficiently. General-purpose processors, because of their architecture, are not suited to running machine learning workloads, such as training neural networks, at high speed. For this reason, the development of vector processors has become especially important in recent years.

[0003] A vector processor is a type of processor that can process multiple data elements simultaneously, which gives higher performance than scalar processors for a number of operations. Such a processor can perform operations on a vector of data elements in parallel. Vector processors are particularly useful for tasks such as neural network computation and image processing, that is, for computations in which large amounts of data must be processed in parallel.

[0004] A vector computing device is known from the prior art (see https://cloud.google.com/tpu/docs/system-architecture-tpu-vm). This vector processor, developed by Google™, is intended to increase data throughput (computing performance) for machine learning workloads.

[0005] The elements of that vector processor are located on an integrated circuit die. To speed up computation, the computing units are segmented into several lanes, each of which performs a computing operation on one vector element, which parallelizes the operations. Each lane contains a vector memory, which may include several memory banks, each with several memory address locations. In more detail, each lane (line) includes a multi-dimensional data/file register configured to store a plurality of vector elements, and an arithmetic logic unit (ALU) configured to perform arithmetic operations on the vector elements stored in and accessible from the data register. All processor elements are configured by means of VLIW instructions.

[0006] The disadvantages of this solution are low performance caused by architectural features of the lane organization and by the way elements are configured when arithmetic operations are performed. In addition, because operations are parallelized across many lines, some computing operations that require intermediate results are executed in several processor cycles (passes), which increases the execution time of the operation and, as a consequence, reduces performance. Furthermore, the solution does not provide for configuring computing units by means of a command stream (dataflow). Another disadvantage is the low bandwidth of the vector memory, caused by the need to connect the lines containing computing units to each other by means of data buses.

[0007] Accordingly, the objective of the present technical solution is to create a vector computing device with high performance. In addition, the solution should allow stream configuration of the processor's computing elements, including the vector memory, and should reduce die area and improve energy efficiency through the features of the claimed processor architecture.

DISCLOSURE OF THE INVENTION

[0008] The claimed technical solution proposes a new approach to the architecture of a vector processor that provides high performance.

[0009] The technical problem solved is that of increasing the efficiency and speed of a vector processor by increasing the number of operations performed per processor cycle.

[0010] The technical result achieved by solving this problem is an increase in the operating speed of the vector processor.

[0011] An additional technical result obtained by solving the above problem is a reduction in the die area of the vector processor.

[0012] These technical results are achieved by a vector processor placed on a die and comprising:

• scalar devices, each of which is configured to receive a vector element from a matrix multiplication device and consists of:

at least two scalar modules forming at least one scalar line, configured to perform arithmetic operations on the vector element;

at least one first demultiplexer configured to redirect data from the first scalar module in the line either to the multi-port shared memory or to the second scalar module in the line;

a complex arithmetic operations unit connected to at least one scalar module in the line and configured to execute mathematical functions on the vector element;

at least one second demultiplexer configured to redirect data from the multi-port memory either to a scalar module or to the matrix multiplication device;

• a multi-port shared memory connected at least to an external memory, to the second demultiplexer of each scalar device, and to the matrix multiplication device, and configured to buffer and store the results of intermediate calculations on the vector elements of each scalar device;

• a horizontal operations unit connected to the scalar devices and configured to:

perform arithmetic operations on the combined vector elements;

broadcast the result of an arithmetic operation on the combined vector elements to the scalar devices.

[0013] In one particular embodiment, the scalar devices are arranged on the die as tiles.

[0014] In another particular embodiment, each scalar device comprises at least four scalar modules.

[0015] In another particular embodiment, the at least four scalar modules form at least two scalar lines.

[0016] In another particular embodiment, a scalar line consists of at least two scalar modules connected in series.

[0017] In another particular embodiment, each of the at least two scalar modules is a configurable stream processor comprising a summation unit, a multiplication unit, a function calculation module, and a local memory.

[0018] In another particular embodiment, the routing of data from the first scalar module in the line, performed by each first demultiplexer in each scalar device, depends on the processor operation being executed.

[0019] In another particular embodiment, the processor elements are interconnected via an AXI4-Stream interface.

[0020] In another particular embodiment, the mathematical functions applied to the vector element are at least:

• the exponential;

• the natural logarithm;

• the reciprocal;

• the inverse square root;

• mathematical functions implemented with lookup tables (LUTs).
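The functions listed above can be pictured with a minimal Python sketch of a per-element dispatcher. This is an illustration under stated assumptions, not the circuit described in the patent: the operation names, the `special_math` function, and the LUT contents (a sigmoid sampled at 256 points over [-8, 8]) are all hypothetical, and ordinary double-precision floats stand in for the hardware number format.

```python
import math

# Hypothetical lookup table: a sigmoid sampled at 256 points over [-8, 8].
# The table contents, size, and input range are assumptions for illustration.
LUT_SIZE = 256
LUT_MIN, LUT_MAX = -8.0, 8.0
SIGMOID_LUT = [
    1.0 / (1.0 + math.exp(-(LUT_MIN + i * (LUT_MAX - LUT_MIN) / (LUT_SIZE - 1))))
    for i in range(LUT_SIZE)
]


def lut_lookup(x):
    """Return the nearest table entry for x, clamping x to the table range."""
    clamped = min(max(x, LUT_MIN), LUT_MAX)
    index = round((clamped - LUT_MIN) / (LUT_MAX - LUT_MIN) * (LUT_SIZE - 1))
    return SIGMOID_LUT[index]


# One entry per function named in the list above (operation names are hypothetical).
SPECIAL_MATH = {
    "exp": math.exp,
    "ln": math.log,
    "reciprocal": lambda x: 1.0 / x,
    "rsqrt": lambda x: 1.0 / math.sqrt(x),
    "lut": lut_lookup,
}


def special_math(op, x):
    """Apply one special function to a single vector element."""
    return SPECIAL_MATH[op](x)
```

For example, `special_math("rsqrt", 4.0)` evaluates the inverse square root of one element; a table-based function trades accuracy for a fixed-cost lookup, which is why the sketch rounds to the nearest entry rather than interpolating.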

[0021] In another particular embodiment, the arithmetic operations performed by the horizontal operations unit are at least:

• adding the vector elements received from the scalar devices;

• finding the maximum among the received scalar elements;

• finding the minimum among the received scalar elements;

• finding the reciprocal of scalar elements;

• finding the inverse square root of scalar elements.
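As a rough illustration of a horizontal operation, the Python sketch below combines one value from every scalar device and then hands the same result back to all of them, following the two capabilities claimed for the horizontal operations unit. The function name and the operation names are hypothetical. Only the three reductions are modelled; the reciprocal and inverse square root entries are left out because the text does not say whether they apply to the combined value or to each element.

```python
def horizontal_op(op, lane_values):
    """Combine one value per scalar device, then broadcast the result.

    lane_values holds the element contributed by each scalar device.
    The returned list has the same length: every device receives the
    same combined value.
    """
    if op == "sum":
        combined = sum(lane_values)
    elif op == "max":
        combined = max(lane_values)
    elif op == "min":
        combined = min(lane_values)
    else:
        raise ValueError(f"unsupported horizontal operation: {op}")
    # Broadcast: each scalar device gets its own copy of the combined value.
    return [combined] * len(lane_values)
```

A call such as `horizontal_op("max", [1.0, 4.0, 2.0])` returns the maximum replicated three times, which is the shape of result a per-element follow-up step (for instance, subtracting the maximum from every element) would need.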

[0022] In another particular embodiment, the operation executed by the processor is specified by at least one configuration stream.

[0023] In another particular embodiment, each configuration defines the parameters and the type of vector processing.

[0024] In another particular embodiment, one configuration is distributed in parallel to each of the scalar devices.
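The relationship between a configuration stream and the scalar devices can be sketched in Python as follows. This is a simplified model under assumptions: the `Configuration` fields, the `"mul"` and `"add"` processing types, and the rule that configurations are applied one after another in stream order are hypothetical and are not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    kind: str       # type of processing; "mul" and "add" are hypothetical names
    operand: float  # parameter of the processing


def apply_configuration(cfg, element):
    """What a single scalar device does to its own element."""
    if cfg.kind == "mul":
        return element * cfg.operand
    if cfg.kind == "add":
        return element + cfg.operand
    raise ValueError(f"unknown configuration kind: {cfg.kind}")


def run_configuration_stream(stream, vector):
    """Apply each configuration in the stream to every element of the vector."""
    for cfg in stream:
        # The same configuration reaches every scalar device; each device
        # processes one element independently of the others.
        vector = [apply_configuration(cfg, x) for x in vector]
    return vector
```

With the stream `[Configuration("mul", 2.0), Configuration("add", 1.0)]`, every element x of the input becomes 2x + 1 without any per-element instruction being issued.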

[0025] In another particular embodiment, the processor elements are connected to one another by a bus (interconnect).

[0026] In another particular embodiment, the bus is configured to provide communication between the devices of the vector processor and to distribute configurations and data.

[0027] In another particular embodiment, the multi-port shared memory comprises:

• a write port for each scalar module;

• a read port for each scalar module;

• ports for communication with the external memory;

• ports for communication with the permutation device.

BRIEF DESCRIPTION OF THE DRAWINGS

[0028] The features of the claimed technical solution and its detailed description are presented below with reference to the accompanying drawings.

[0029] Fig. 1 illustrates an example block diagram of a vector processor with one scalar line.

[0030] Fig. 2 illustrates an example block diagram of a vector processor with two scalar lines.

[0031] Fig. 3 illustrates a block diagram of one scalar line.

[0032] Fig. 4 illustrates an example of the arrangement of elements on a die.

IMPLEMENTATION OF THE INVENTION

[0033] The concepts and terms needed to understand this technical solution are described below.

[0034] A vector processor is a processor in which the operands of some instructions can be ordered arrays of data, that is, vectors. It differs from scalar processors, which can work with only one operand per unit of time.

[0035] The claimed technical solution offers a new approach to creating a vector processor with high performance. In addition, the claimed solution allows stream configuration of the processor's computing elements, which further increases the processor's operating speed because a set of instructions no longer has to be sent for every executed command of the same type. Thus, the proposed architecture makes it possible to perform streaming operations on data without reconfiguring the processor elements after each operation, which further increases the processor's speed.

[0036] The term "instructions" as used in this application may refer generally to software instructions or software commands written in a given programming language to perform a specific function, such as configuring the computing elements of a vector processor to perform an operation on a vector. Instructions may be implemented in a variety of ways, including, for example, stream configuration of machine instructions, object-oriented methods, and so on. Instructions that carry out the processes described in this solution may be transmitted over wired channels from control devices.

[0037] Fig. 1 shows a block diagram of a vector processor 100. The processor 100 includes: a plurality of scalar devices 110, each of which consists of scalar modules 210-1 and 210-2, a first demultiplexer 220, a complex arithmetic operations unit 230, and a second demultiplexer 240; a multi-port shared memory 120; and a horizontal operations unit 130.

[0038] The architecture of the claimed vector processor is aimed primarily at massively parallel computation, including neural network and artificial intelligence workloads, video stream processing, and so on. It should be noted, however, that the claimed vector processor is not limited to machine learning tasks or neural-network-based computation; it can also be used for computation in other areas of technology that require such processors, for example image processing.

[0039] As indicated above, the claimed vector processor 100 (Vector Engine, VE) is intended to perform a number of mathematical operations on vectors, for example vectors 128 words long in FP24 format, with the ability to buffer input and intermediate values. The main application of the processor 100 is to carry out the computations of convolutional neural network layers after the feature and weight matrices have been multiplied in the matrix multiplication device (Matrix Engine, ME). The vector processor is typically part of a neural network accelerator and is intended to complete the computation of the neural network layers. The vector processor 100 is controlled by configuration streams received from a control device that is external to the processor. The vector processor 100 also has a connection to an external memory for sending data to it and receiving data from it. In another particular embodiment, the vector processor 100 is configured to interact with a permutation device, which is external to the processor 100.

[0040] The elements of the vector processor 100 are located on an integrated circuit (IC) die. In one particular embodiment, the IC die may be part of another IC die, for example that of a neural network accelerator, which also contains other components associated with the processor 100 (the matrix multiplication device ME, the control unit CU, the permutation device, and so on).

[0041] The scalar devices (Scalar Unit, SU) 110 are scalar computing units, each of which is configured to process one 24-bit word (a scalar) out of the total number of words in the vector. The number of devices 110 therefore determines the length of the vector that the processor 100 can process in one pass. In one particular embodiment, the processor 100 is divided into 128 scalar devices 110, each of which processes one 24-bit word (scalar) of a 128-word vector. Accordingly, in other particular embodiments the processor 100 may contain a larger or smaller number of devices 110.
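The link between the number of scalar devices and the vector length handled per pass can be illustrated with a short Python sketch. It assumes, beyond what the text states, that a vector longer than the number of devices is simply split into consecutive passes; the function name and the per-element operation passed in are hypothetical.

```python
NUM_SCALAR_DEVICES = 128  # the embodiment described in the text


def process_vector(vector, op, num_devices=NUM_SCALAR_DEVICES):
    """Process a vector in passes of at most num_devices elements.

    Returns the processed vector and the number of passes used.
    Splitting a long vector into consecutive passes is an assumption
    made for this illustration.
    """
    result = []
    passes = 0
    for start in range(0, len(vector), num_devices):
        chunk = vector[start:start + num_devices]
        # Within one pass, each scalar device handles exactly one element.
        result.extend(op(x) for x in chunk)
        passes += 1
    return result, passes
```

Under this model a 128-element vector needs one pass with 128 devices, while a 300-element vector needs three, the last pass leaving most devices idle.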

[0042] The devices 110 are laid out on the die as tiles. An example of the tiled arrangement of the devices 110 is shown in Fig. 4: on the left, one device 110 containing four modules 210; on the right, 128 devices 110. In one particular embodiment, the devices 110 can be combined into clusters, for example of 16 or 8 devices 110. The devices 110 are connected to a common data bus (interconnect). The devices 110 are independent and operate in parallel: a configuration supplied to the bus by the control device is distributed in parallel to every device 110, for example to 128 devices 110, and each device 110 independently performs the operation on its own vector element (scalar). Interaction with elements external to the processor 100 also takes place over this bus.

[0043] In turn, to perform arithmetic operations on a vector element, each device 110 consists of: at least two scalar modules 210 (Vector Subunit, VS) forming at least one scalar line (210-1, 210-2) and configured to perform arithmetic operations on the vector element; at least one first demultiplexer 220 configured to redirect data from the first scalar module 210-1 (VS) in the line either to the multi-port shared memory 120 or to the second scalar module 210-2 in the line; a complex arithmetic operations unit 230 (Special Math Operations, SMO) connected to at least one scalar module in the line, for example module 210-2, and configured to execute mathematical functions on the vector element; and at least one second demultiplexer configured to redirect data from the multi-port memory 120 either to a scalar module, for example module 210-1, or to the matrix multiplication device (ME).
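The routing role of the first demultiplexer can be sketched in Python as below. This is a behavioural illustration only: the `Route` names, the `run_line` function, the dictionary standing in for the multi-port shared memory, and the choice to store the second module's result in that memory are assumptions made for the example.

```python
from enum import Enum


class Route(Enum):
    TO_MEMORY = "memory"         # first module's result goes to shared memory
    TO_NEXT_MODULE = "next"      # first module's result feeds the second module


def run_line(x, first_op, second_op, route, shared_memory, address):
    """Push one vector element through a two-module scalar line.

    first_op and second_op stand in for the operations configured on the
    first and second scalar modules; shared_memory is a plain dict used
    as a stand-in for the multi-port shared memory.
    """
    y = first_op(x)
    if route is Route.TO_MEMORY:
        # The first demultiplexer sends the result straight to memory,
        # bypassing the second module.
        shared_memory[address] = y
        return y
    # The first demultiplexer forwards the result to the second module.
    z = second_op(y)
    shared_memory[address] = z
    return z
```

For instance, with `first_op` adding 1 and `second_op` doubling, an input of 3.0 yields 4.0 when routed to memory and 8.0 when routed through the second module.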

[0044] It should be noted that a line means a group of modules 210 connected in series within each device 110. In one particular embodiment, the number of lines in each device 110 depends on the number of matrix multiplication units contained in the higher-level device (for example, a neural network accelerator). In another particular embodiment, each device 110 can contain two lines formed by four modules 210, as shown in Fig. 2. Line 0: (ME0 →) module 210-1 → module 210-2 → memory 120. Line 1: (ME1 →) module 210-3 → module 210-4 → memory 120. In yet another particular embodiment, a line can contain only one module 210. Such an embodiment is possible when the mathematical operations to be executed do not require complex calculations.

[0045] The modules 210 in different lines operate independently and are connected to each other only through the memory 120. That is, the lines are identical to each other, and modules 210 that occupy the same position in a line have the same functional capabilities. For example, the second modules 210 in each line (210-2 and 210-4) are connected to the unit 230 and, through it, to the horizontal operations unit 130.

[0046] Returning to Fig. 1, the scalar modules 210 execute arithmetic operations on data from the matrix multiplication unit (ME) or from the memory 120. Both modules 210-1 and 210-2 in a line are identical, and each consists of the following elements, shown in Fig. 3: a summation unit 310, a multiplication unit 320, a local memory 330, and an interface unit for communication with the complex mathematical operations unit 230. Fig. 3 shows an example of one line consisting of two scalar modules 210, for example modules 210-1 and 210-2. The modules 210 have two inputs (the first from the matrix multiplication unit (ME), the second from the vector processor memory) and two outputs (to the vector processor memory). The output of the first module 210 in the line is fed to a demultiplexer, such as the demultiplexer 220, which is controlled by the instruction of that module. If the instruction specifies the operation OUTPUT_VM (write to the vector processor memory), the data from module 210-1 are redirected for writing to the memory 120; if it specifies the operation OUTPUT_VS, the data are fed to the input of the next module 210 in the line, for example module 210-2.
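The routing performed by the first demultiplexer 220 can be illustrated with a small behavioral sketch. Only the operation names OUTPUT_VM and OUTPUT_VS come from the description; the function name `route_first_module_output` and the use of Python lists as stand-ins for the memory 120 and the input of module 210-2 are assumptions made for illustration.

```python
from enum import Enum


class OutputOp(Enum):
    OUTPUT_VM = "write to vector processor memory (memory 120)"
    OUTPUT_VS = "forward to the next module 210 in the line"


def route_first_module_output(value, op, memory_120, next_module_input):
    """Behavioral model of demultiplexer 220: each value goes to exactly one sink.

    memory_120 and next_module_input are plain lists used as hypothetical
    stand-ins for the real hardware ports.
    """
    if op is OutputOp.OUTPUT_VM:
        memory_120.append(value)
    elif op is OutputOp.OUTPUT_VS:
        next_module_input.append(value)
    else:
        raise ValueError(f"unsupported output operation: {op!r}")


memory_120, module_210_2_input = [], []
route_first_module_output(1.5, OutputOp.OUTPUT_VM, memory_120, module_210_2_input)
route_first_module_output(2.5, OutputOp.OUTPUT_VS, memory_120, module_210_2_input)
```

In this sketch the selector is passed with every value, whereas in the description it is set by the instruction of module 210-1 and therefore stays fixed while that instruction is in effect.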

[0047] This principle of organizing the architecture of the processor 100 reduces the number of instructions required to execute an operation in the processor 100 and makes streaming execution of operations possible. It also increases the performance of the processor 100 by reducing the number of times the processor 100 must be accessed for configuration. That is, connecting the modules 210 in a line in series reduces the number of operations the processor needs to perform a calculation that requires an intermediate result. As an example, consider the operation (A+B)*C. With parallel processing, such an operation would require configuring a module to execute the first operation, namely (A+B), storing that result in memory, and then reading the result back from memory to multiply it by C. In the claimed solution, owing to the serial line of arithmetic units (modules 210), module 210-1 performs the operation A+B and passes the result to module 210-2 for multiplication. That is, this architecture allows operations to be executed one after another while remaining within the computing pipeline. In addition, support for a streaming (DataFlow) configuration of the processor further increases the speed of the processor 100, because the computing elements no longer need to be configured repeatedly to perform operations of the same type. With a streaming configuration, the operation to be executed is specified once and is executed for the entire incoming data packet. This example is given to convey the essence of the claimed technical solution more precisely and should not limit the complexity of the operations that can be executed. For instance, modules 210-1 and 210-2 can execute more complex operations, for example exp[([(A+B)*C]+D)*E], where [(A+B)*C] is executed by the first module in the line and the remainder by the second.
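The difference between the two execution styles for (A+B)*C can be sketched in software. This is only an analogy under stated assumptions: the function names, the dictionary standing in for the shared memory, and the use of a generator to model streaming between modules are illustrative and do not appear in the description.

```python
import math


def with_memory_round_trip(a, b, c, shared_memory):
    """Two separately configured steps; the intermediate result goes through memory."""
    shared_memory["tmp"] = [x + y for x, y in zip(a, b)]          # step 1: A+B, store
    return [t * z for t, z in zip(shared_memory["tmp"], c)]       # step 2: reload, *C


def with_serial_line(a, b, c):
    """Module 210-1 adds, module 210-2 multiplies; the intermediate is streamed."""
    stage_210_1 = (x + y for x, y in zip(a, b))   # lazily yields A+B, element by element
    return [t * z for t, z in zip(stage_210_1, c)]


def exp_example(a, b, c, d, e):
    """exp[([(A+B)*C]+D)*E]: first module does (A+B)*C, second does the rest."""
    stage_210_1 = ((x + y) * z for x, y, z in zip(a, b, c))
    return [math.exp((t + p) * q) for t, p, q in zip(stage_210_1, d, e)]


A, B, C = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
memory = {}
assert with_memory_round_trip(A, B, C, memory) == with_serial_line(A, B, C)
```

Both variants return the same values; the serial variant simply never materializes the intermediate vector, which mirrors the saved store and reload described above.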

[0048] The same applies to modules 210-3 and 210-4, shown in Fig. 2, which form the second line. As indicated above, the lines are identical to each other. The output of the last module 210 in a line is always written to the memory 120 (VE_MEM). Each module 210 also has read access to the memory 120. It is worth noting that adding a second and further lines (third, fourth, etc.) increases the degree of parallel computation but also increases the area and power consumption of the processor 100.

[0049] Each of the arithmetic units that make up a module 210 may be a configurable stream processor and supports its own unique set of commands. For the summation unit 310 these are primarily addition and subtraction; for the multiplication unit 320, multiplication. In addition, as indicated above, to execute functions on operands the last module in the line is connected to the unit 230 through the interface unit for communication with the complex mathematical operations unit. In more detail, the summation unit 310 is configured to execute simple arithmetic operations, such as summation, finding the minimum or maximum value, accumulation, comparison operations (greater than/less than, equality), etc. The unit 320 is configured to execute operations such as multiplication, exponentiation, etc.

[0050] Accordingly, as can be seen from the diagram, the second module 210 in the line is connected, through the interface unit (SMO_IO), to the complex mathematical operations unit 230 (Special Math Operations, SMO), which is shared by the lines. The unit 230 is intended for executing rare and complex arithmetic operations in accordance with the configuration specified for the unit 230.

[0051] The unit 230 makes it possible to execute rare and complex arithmetic operations. Since the unit 230 has no instructions of its own, SMO_IO (the interface unit of the last module 210 in the line, shown in Fig. 3) adds the number of the operation to be executed to the USER bits of the AXI packet before sending it to the unit 230. Only the last (for example, the second) modules 210 in a line are connected to the unit 230. Accordingly, the first modules 210 in a line have no external connections and cannot be configured to execute the operations of the unit 230. The unit 230 is described in more detail below.

[0052] In addition, it is worth noting that, although this solution describes examples with one and two lines consisting of two and four modules 210, the number of lines may be larger or smaller. The number of modules 210 in each line may also differ and depends on the complexity of the operation being executed. In one particular embodiment, three or more modules 210 may be located in each line. In such a line, the first and last modules have connections identical to those of the first and second modules in the line described above, while the remaining modules in the line are intermediate and provide the required computing power.

[0053] Continuing the description of the modules 210, each arithmetic unit can be configured to execute only one of its operations at a time. The current operation of each unit is specified by the instruction for the module 210. The arithmetic units can be connected in a chain whose composition and order are specified dynamically by the instruction of the module 210. The chain specifies only the source of the first operand for each arithmetic unit. However, the summation unit 310 and the multiplication unit 320 also use a second operand for their binary operations, which can either be taken from VS_MEM (memory 330) or read from the memory 120 (VE_MEM). In addition, the local memory 330 makes it possible to store an intermediate calculation result directly inside the module 210, provided that the result does not exceed the size of the local memory 330. This approach also allows the result to be used in subsequent calculations without accessing the shared memory 120.

[0054] Returning to the unit 230 (Special Math Operations (SMO)), it executes complex, rare arithmetic operations that are not executed in the arithmetic units of the modules 210, namely: the exponential function EXP; the natural logarithm LN; the reciprocal 1/x (RCP); the reciprocal square root 1/√x (RSQRT); mathematical functions using lookup tables (LUT), ReLU and other arbitrary activations; and horizontal operations through the XLU unit.

[0055] One unit 230 serves both lines of modules 210, receiving data from the last modules 210 in the lines, for example modules 210-2 and 210-4. The unit 230 has a separate 24-bit input AXI bus for each of these modules 210. The data to be processed by the unit 230 are formed and sent by the SMO_IO unit inside the modules 210. The path of the data within the unit 230 depends on which operation is to be executed. The unit 230 has no instructions of its own, so the operation to be executed is specified by its number in the USER bits of the incoming AXI packet. The operation number for the unit 230 is added to the AXI packet by the SMO_IO unit on the basis of the current instruction of the module 210. The unit 230 is connected to the horizontal operations unit 130. It is worth noting that, in one particular embodiment, the unit 230 is configured, for example by means of a task scheduler, to send the operations that must be executed in the unit 130 one at a time, in accordance with their priority. That is, the unit 230 forwards to the unit 130, in turn, those operations that arrive independently from the modules 210.
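A minimal sketch of how an operation number carried alongside the data can select the function executed by a unit that has no instructions of its own. The numeric codes in `SMO_OPS`, the dictionary-based packet layout, and the function names are hypothetical; the description only states that the number travels in the USER bits of the AXI packet.

```python
import math

# Hypothetical operation codes; the description does not disclose the real encoding.
SMO_OPS = {
    0: math.exp,                          # EXP
    1: math.log,                          # LN
    2: lambda x: 1.0 / x,                 # RCP
    3: lambda x: 1.0 / math.sqrt(x),      # RSQRT
}


def smo_io_pack(data, op_number):
    """Model of SMO_IO: attach the operation number to the outgoing packet."""
    if op_number not in SMO_OPS:
        raise ValueError(f"unknown SMO operation number: {op_number}")
    return {"data": data, "user": op_number}


def smo_execute(packet):
    """Model of unit 230: decode the USER field and apply the selected function."""
    return SMO_OPS[packet["user"]](packet["data"])


result = smo_execute(smo_io_pack(4.0, 3))   # reciprocal square root of 4.0
```

The design point illustrated here is that the sender, not the shared unit, decides what is computed, so one unit 230 can serve several modules without holding per-module configuration.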

[0056] The horizontal operations unit 130 (Cross-Lane Unit (XLU)) is connected to the scalar devices 110 through the unit 230 of each scalar device and is configured to: execute arithmetic operations on the combined elements of a vector; and broadcast the result of the arithmetic operation on the combined elements of the vector to the scalar devices.

[0057] All horizontal operations executed by the unit 130 take an entire vector as input, for example a vector of 128 24-bit words. Each 24-bit scalar is sent from the unit 230 of one of, for example, 128 scalar modules 210. The unit 130 executes the operation only after all 128 scalars have arrived. The data are supplied to the unit 130 from independent devices 110, for example from 128 devices 110, over a data bus (interconnect). That is, all devices 110 send the results of their calculations to the unit 130 independently over the data bus.

[0058] The unit 130 supports the following operations: ADD_TREE; ADD_TREE+RCP; ADD_TREE+RSQRT; MAX_TREE; MAX_TREE+RCP; MAX_TREE+RSQRT; MIN_TREE; MIN_TREE+RCP; MIN_TREE+RSQRT. Here ADD_TREE adds 128 numbers, MAX_TREE finds the maximum of 128 numbers, MIN_TREE finds the minimum of 128 numbers, RCP = 1/x, and RSQRT = 1/√x.

[0059] The result of an operation of the unit 130 is a single 24-bit scalar. A copy of the resulting 24-bit number is sent to the unit 230 of each of the 128 devices 110.
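The horizontal operations listed above amount to a reduction over the whole vector, an optional RCP or RSQRT applied to the reduced value, and a broadcast of the single result to every device. The sketch below models that behavior with Python floats; the function name `xlu` and the string-based operation selector are assumptions, and the sequential `sum` does not reproduce the summation order or the 24-bit arithmetic of a hardware adder tree.

```python
import math

REDUCTIONS = {"ADD_TREE": sum, "MAX_TREE": max, "MIN_TREE": min}
POST_OPS = {"RCP": lambda x: 1.0 / x, "RSQRT": lambda x: 1.0 / math.sqrt(x)}


def xlu(operation, vector, lanes=128):
    """Reduce a full vector, optionally post-process, and broadcast the scalar."""
    if len(vector) != lanes:
        # The unit executes only after all scalars have arrived.
        raise ValueError(f"expected {lanes} scalars, got {len(vector)}")
    reduce_name, _, post_name = operation.partition("+")
    result = REDUCTIONS[reduce_name](vector)
    if post_name:
        result = POST_OPS[post_name](result)
    return [result] * lanes   # one copy for each device 110


mean_weight = xlu("ADD_TREE+RCP", [1.0] * 128)[0]   # 1 / 128
```

A combination such as MAX_TREE followed by ADD_TREE+RCP is the kind of sequence a softmax normalization would need, although the description does not name specific uses.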

[0060] The multiport shared memory 120 (VMEM) is the main memory for buffering the intermediate results of neural network computation. It is worth noting that the term shared memory means memory to which all elements have access. Physically, however, the memory 120 is divided into a number of pieces corresponding to the number of devices 110. The memory can store data in various formats used to configure different nodes, for example integer permutation indices for the permutation device and tables for computing activation functions. The memory 120 stores the results produced by the modules 210, from where they are passed either back to the modules 210, or to the matrix multipliers as input data (features), or to the permutation device, or to external memory, for example HBM-type memory. Thus, the memory 120 is connected at least to the external memory (HBM), to the second demultiplexer of each scalar device, and to the matrix multiplication unit, and is configured to buffer and store the results of intermediate calculations on the vector elements of each scalar device.

[0061] In one particular embodiment, the memory 120 has the following ports: a write port for each scalar module 210; a read port for each scalar module 210; ports for communication with the external memory (HBM); and ports for communication with the permutation device (Shuffler).

[0062] In another particular embodiment, the memory 120 may be a shared memory with 6 read ports and 6 write ports. In total, therefore, the memory 120 may have 12 ports: four ports for writing, one from each module 210; four ports for reading, one into each module 210, where the read ports of the first modules 210 in a line are split by the second demultiplexers 240 so that data can also be supplied to the ME unit in the same line; two ports (read and write) for the memory controller devices (DMA); and two ports (read and write) for the permutation device (Shuffler). Accordingly, when the memory 120 is organized for a different number of lines (one, three, etc.), the number of ports for writing from each module 210 and for reading into each module 210 can be changed (decreased or increased) in accordance with the number of modules.
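The port arithmetic in this embodiment can be written out as a short calculation. The function below is a sketch that assumes, as the paragraph suggests, one read and one write port per module 210 plus one read/write pair each for DMA and the Shuffler; the name `memory_port_count` and the default of two modules per line are illustrative.

```python
def memory_port_count(lines, modules_per_line=2):
    """Return (read_ports, write_ports, total) for memory 120.

    Assumes one read and one write port per module 210, plus one
    read/write pair for DMA and one for the Shuffler.
    """
    modules = lines * modules_per_line
    read_ports = modules + 1 + 1    # modules + DMA read + Shuffler read
    write_ports = modules + 1 + 1   # modules + DMA write + Shuffler write
    return read_ports, write_ports, read_ports + write_ports


two_line_ports = memory_port_count(lines=2)   # the 12-port embodiment
```

For two lines of two modules this reproduces the 6 read and 6 write ports given above; the values for other line counts follow only from the stated assumption.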

[0063] All ports of the memory 120 are configurable. The configuration of every port, except the Shuffler ports, specifies a sequence of access addresses; the access addresses on the Shuffler ports are generated by the Shuffler module itself.

[0064] All elements inside the processor 100 (VE) use the AXI4-Stream interface to exchange data with each other and with external devices, as well as to receive instructions from the control unit (Control Unit (CU)). AXI makes it possible to control data transfer automatically in an arbitrarily long chain of devices without distortion or loss. For example, if a device in the chain cannot accept new data, it notifies the sending device by deasserting the ready signal. The sending device, in turn, cannot send data without a ready signal from the receiver, which means it cannot itself accept new data in place of the data it has not yet sent, so it stops asserting ready to its own sender. In this way, the transfer stalls along the entire chain back to the data source (for example, an SHMEM read port) until the cause of the stall in the very first stalled device is resolved.
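The stall propagation described here can be reproduced with a small cycle-based model. This is a behavioral sketch, not an AXI4-Stream implementation: each stage is assumed to hold at most one item, and the function name `simulate_chain` and its arguments are invented for the example.

```python
def simulate_chain(items, n_stages, sink_ready_per_cycle):
    """Simulate a chain of single-slot stages with ready-based backpressure.

    Returns (received_by_sink, stage_slots, items_left_in_source).
    """
    slots = [None] * n_stages
    source = list(items)
    received = []
    for sink_ready in sink_ready_per_cycle:
        # Ready propagates from the sink back toward the source: a stage is
        # ready if it is empty or if it can hand its item on this cycle.
        ready = [False] * n_stages
        downstream_ready = sink_ready
        for i in reversed(range(n_stages)):
            ready[i] = slots[i] is None or downstream_ready
            downstream_ready = ready[i]
        # Data moves forward, last stage first, only into ready receivers.
        if slots[-1] is not None and sink_ready:
            received.append(slots[-1])
            slots[-1] = None
        for i in reversed(range(n_stages - 1)):
            if slots[i] is not None and ready[i + 1]:
                slots[i + 1], slots[i] = slots[i], None
        if source and ready[0]:
            slots[0] = source.pop(0)
    return received, slots, source


stalled = simulate_chain([1, 2, 3, 4, 5], n_stages=2, sink_ready_per_cycle=[False] * 4)
flowing = simulate_chain([1, 2, 3, 4, 5], n_stages=2, sink_ready_per_cycle=[True] * 7)
```

With the sink never ready, the two stages fill up and the source keeps its remaining three items, so nothing is lost or overwritten; once the sink is ready every cycle, all five items arrive in their original order.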

[0065] The length of the input sequence of scalars is not specified and may be arbitrarily long. The end of the sequence is marked by the signal last=1: the signal enters the start of the device chain together with the last data packet of the sequence and is passed along the entire chain.

[0066] The processor 100 is configured to exchange data with devices that operate on vector data: the matrix multiplier (ME) and the permutation device (Shuffler). For instance, during the operation of a device such as a neural network accelerator, the vector processor 100, which may be part of that accelerator, receives a vector consisting of 128 scalars in FP24 format. As indicated above, the vector may come, for example, from the matrix multiplication unit. Each scalar is passed to its own device 110. The index of the device 110 equals the index of the scalar in the vector. Each transfer takes place over an independent 24-bit AXI bus (the interconnect bus). Once all 128 scalars have been received, a vector is formed on the receiving side (the index of a scalar in the vector equals the index of the device 110 that transmitted it).

[0067] The memory 120 exchanges data with the external memory (HBM, High Bandwidth Memory). For this purpose, a pair of memory controller devices (DMA, Direct Memory Access) is allocated for writing to the HBM and a pair of DMA devices for reading from the HBM.

[0068] The operation of the vector processor 100 is controlled by a set of configuration (instruction) streams, one for each element it contains. Each configuration defines the parameters for processing a certain number of vectors on one element, for example a device 110. During operation, configurations replace one another as their processing completes. The operation of each device 110 is, in general, independent; synchronization is achieved through data dependencies. Internal connections between devices are implemented using the AXI4-Stream protocol.

[0069] To configure the devices that make up the processor 100, the control unit (CU) may have several instruction queues (two or more). Each instruction queue may be transmitted over its own bus. Consider a variant with four instruction queues (Threads). In this case, the control unit is connected to the processor 100 by 4 buses for transmitting instructions to the vector processor 100. Each bus is 64 bits wide. All instructions are divided into 32-bit words. For example, an instruction for a module 210 consists of 20 words (640 bits). The words of one instruction are transmitted sequentially over one of the 4 buses. The device 110 is configured after the last word of the current instruction has been received.
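The word-by-word delivery of an instruction can be sketched as follows, using the figures from the paragraph (32-bit words, 20 words for a 640-bit module 210 instruction). The least-significant-word-first ordering, the class name `InstructionReceiver`, and the representation of an instruction as a Python integer are assumptions; the description does not specify how words are ordered or packed on the 64-bit bus.

```python
WORD_BITS = 32


def split_into_words(instruction, n_words):
    """Split an instruction (given as an int) into 32-bit words.

    Least-significant word first; this ordering is an assumption.
    """
    mask = (1 << WORD_BITS) - 1
    return [(instruction >> (WORD_BITS * i)) & mask for i in range(n_words)]


class InstructionReceiver:
    """Collects words and applies the configuration only after the last one."""

    def __init__(self, n_words):
        self.n_words = n_words
        self.words = []
        self.configuration = None   # stays None until a full instruction arrives

    def push(self, word):
        self.words.append(word)
        if len(self.words) == self.n_words:
            self.configuration = sum(
                w << (WORD_BITS * i) for i, w in enumerate(self.words)
            )
            self.words = []


instruction = (1 << 639) | 0xDEADBEEF          # arbitrary 640-bit test pattern
words = split_into_words(instruction, n_words=20)
receiver = InstructionReceiver(n_words=20)
for word in words:
    receiver.push(word)
```

Deferring the configuration until the final word arrives means a partially received instruction never changes the behavior of the device.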

[0070] As an example, consider a processor 100 with two lines. The instructions for the various elements of the processor 100 are distributed among the four queues of the control unit (CU) as follows. Thread_0 (queue_0): modules 210-1 and 210-2; ports of memory 120 for reading into modules 210-1/210-2 and for writing from module 210-2; demultiplexer 240 (within the memory 120 port instruction); demultiplexer 220 (within the module 210-1 instruction).

[0071] Thread_1 (queue_1): modules 210-3 and 210-4; ports of memory 120 for reading into modules 210-3/210-4 and for writing from module 210-4; the second demultiplexer 240 (within the memory 120 port instruction); the second demultiplexer 220 (within the module 210-4 instruction).

[0072] Thread_2 (queue_2): ports of memory 120 for reading from and writing to external memory (HBM).

[0073] Thread_3 (queue_3): ports of memory 120 for reading from and writing to the permutation device (Shuffler).

[0074] All instructions are written to their queues in program order (the order in which the instructions are stored in the HBM).
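The distribution of instructions among the four queues can be sketched as follows. This is an illustrative Python model only: the target labels, the `dispatch` function, and the sample program are hypothetical, and the table simply restates the queue assignment given above for a two-line processor.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Queue assignment restated from the description; the target names are illustrative labels.
THREAD_OF_TARGET: Dict[str, int] = {
    "module_210-1": 0, "module_210-2": 0, "mem_port_line_0": 0,
    "module_210-3": 1, "module_210-4": 1, "mem_port_line_1": 1,
    "mem_port_hbm": 2,
    "mem_port_shuffler": 3,
}


def dispatch(program: List[Tuple[str, str]]) -> Dict[int, Deque[str]]:
    """Append each instruction to its queue while preserving program order."""
    queues: Dict[int, Deque[str]] = {thread: deque() for thread in range(4)}
    for target, payload in program:
        queues[THREAD_OF_TARGET[target]].append(payload)
    return queues


# Hypothetical program, listed in the order it would be stored in external memory.
program = [
    ("mem_port_hbm", "load A"),
    ("module_210-1", "mul"),
    ("module_210-3", "add"),
    ("module_210-2", "exp"),
    ("mem_port_shuffler", "permute"),
    ("mem_port_hbm", "store B"),
]
queues = dispatch(program)
```

Within each queue the instructions keep their relative program order, while instructions in different queues are not ordered with respect to one another in this model.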

[0075] The 128 identical devices 110 (scalar devices) are configured by a single instruction replicated 128 times. For example, all 128 modules 210-1 receive the same instruction. The same applies to instructions for the ports of memory 120: for example, the first write port of memory 120 in every device 110 is configured by the same instruction. The instruction is propagated to all devices 110 in parallel through a tree of Broadcast units; however, because the devices 110 are widely dispersed across the core die, a given instruction reaches different devices 110 after different numbers of clock cycles. After executing the instruction, each device 110 issues a "done" signal: the "done" signals are first collected from all 128 devices 110, and only then is a single common "done" sent to the CU.
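The collection of "done" signals can be modeled with a short Python sketch. The class `DoneCollector`, the function `simulate`, and the latency values are assumptions chosen for illustration; the sketch also ignores any delay in the collection path itself, which the description does not specify.

```python
from typing import List, Optional

N_DEVICES = 128  # from the text: 128 scalar devices 110


class DoneCollector:
    """Hypothetical aggregator: signals a common done once every device has reported."""

    def __init__(self, n_devices: int) -> None:
        self.n_devices = n_devices
        self.reported = set()

    def report(self, device_id: int) -> bool:
        """Record a per-device done; return True once all devices have reported."""
        self.reported.add(device_id)
        return len(self.reported) == self.n_devices


def simulate(arrival_latency: List[int], exec_cycles: int) -> Optional[int]:
    """Return the cycle at which the common done would be sent to the CU."""
    collector = DoneCollector(len(arrival_latency))
    # Each device finishes exec_cycles after the broadcast instruction reaches it.
    events = sorted((lat + exec_cycles, dev) for dev, lat in enumerate(arrival_latency))
    for cycle, dev in events:
        if collector.report(dev):
            return cycle
    return None


# Assumed, purely illustrative latencies: the instruction reaches devices after 1 to 8 cycles.
latencies = [1 + (dev % 8) for dev in range(N_DEVICES)]
common_done_cycle = simulate(latencies, exec_cycles=10)
```

Under these assumptions the common "done" is determined by the device that the instruction reaches last, so differences in broadcast latency are absorbed before the CU is notified.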

[0076] The instruction for module 210-1 also configures demultiplexer 220, selecting where the data from module 210-1 is sent: to the input of module 210-2 or to memory 120 for writing. Similarly, the instruction for module 210-3 configures the second demultiplexer 220. The demultiplexers 240 (one per line) are configured by the instructions for the read ports of memory 120. Each of the two states of a demultiplexer 240 corresponds to a separate memory 120 port instruction: one for reading into the modules 210, the other for reading into the ME.
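The role of demultiplexer 220 within one line can be sketched in Python as follows. The names `Demux220`, `Module2101Instruction`, and `run_line`, as well as the representation of memory 120 as a list, are hypothetical; the sketch assumes only what is stated above, namely that the module 210-1 instruction carries the selector for demultiplexer 220.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Demux220(Enum):
    TO_MODULE_210_2 = "module 210-2"
    TO_MEMORY_120 = "memory 120"


@dataclass
class Module2101Instruction:
    """Hypothetical instruction: an operation plus the demultiplexer 220 selector."""
    op: Callable[[float], float]
    demux_220: Demux220


def run_line(x: float,
             instr: Module2101Instruction,
             second_op: Callable[[float], float],
             memory_120: List[float]) -> float:
    """Run one element through the line and write the result to memory."""
    y = instr.op(x)  # module 210-1
    if instr.demux_220 is Demux220.TO_MODULE_210_2:
        y = second_op(y)  # module 210-2 processes the intermediate result
    memory_120.append(y)  # write port of memory 120
    return y


memory: List[float] = []
short_path = run_line(3.0, Module2101Instruction(lambda v: v * 2, Demux220.TO_MEMORY_120),
                      lambda v: v + 1, memory)
long_path = run_line(3.0, Module2101Instruction(lambda v: v * 2, Demux220.TO_MODULE_210_2),
                     lambda v: v + 1, memory)
```

With the selector set to memory 120 the second operation is bypassed; with the selector set to module 210-2 the intermediate result is processed further before being written.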

[0077] In another particular embodiment, looping of the operations of modules 210 is also supported by means of a dedicated configuration field. Blocks 230 (SMO) and 130 (XLU) have no configurations of their own.

[0078] Consider one operating scenario of the processor 100. A vector, for example 128 numbers long, arrives at the input of the processor 100. This vector comes from the matrix multiplication device (ME), which may be implemented, for example, as a systolic array. All 128 numbers are then distributed over the data bus (interconnect) to the 128 devices 110 according to their index in the vector. In each device 110, an arithmetic operation is performed on the vector element, depending on the configuration. Depending on the complexity of the operation, and hence on the configuration, block 230 and the second module 210-2 in the line of a device 110 may or may not be used. In accordance with the configuration being executed, the result computed by module 210-1 is passed through demultiplexer 220 either to memory 120 or to the next module 210 in the line, block 230, block 130, and so on. In addition, in one particular embodiment, when processing must begin in block 230, the configuration may contain instructions that pass the vector element from the ME through module 210-1 to module 210-2 for subsequent execution of the operation in block 230 and/or block 130. After each of the devices 110 has executed its arithmetic operations, the results are transferred to memory 120. As noted above, memory 120 acts as a synchronizing device for the 128 asynchronous devices 110 and stores the results, for example, as a vector. If further operations on the computed vector are required, the result can be routed through demultiplexer 240 either to the matrix multiplier (as a vector) or to each of the devices 110 (as scalars, according to their index in the vector).
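The data flow of this scenario can be summarized in a simplified Python sketch. The function `process_vector`, its parameters, and the sample per-element operation are assumptions made for illustration; the devices are modeled sequentially, although in the described processor they operate independently and memory 120 synchronizes them.

```python
from typing import Callable, List, Tuple, Union

N_LANES = 128  # from the text: a 128-element vector and 128 scalar devices 110

Routed = Union[Tuple[str, List[float]], Tuple[str, List[Tuple[int, float]]]]


def process_vector(me_output: List[float],
                   lane_op: Callable[[float], float],
                   to_matrix_engine: bool) -> Routed:
    """Distribute a vector by index, process each element, then route the stored result."""
    if len(me_output) != N_LANES:
        raise ValueError("expected one element per scalar device")
    memory_120 = [0.0] * N_LANES
    for index, element in enumerate(me_output):
        # Element `index` goes to scalar device `index`; its result lands in the same slot.
        memory_120[index] = lane_op(element)
    if to_matrix_engine:
        return ("ME", memory_120)  # demultiplexer 240: whole vector to the matrix multiplier
    # Demultiplexer 240: scalars back to the devices, keyed by their index in the vector.
    return ("devices", list(enumerate(memory_120)))


vector_from_me = [float(i) for i in range(N_LANES)]
dest_a, as_vector = process_vector(vector_from_me, lambda v: 2 * v + 1, to_matrix_engine=True)
dest_b, as_scalars = process_vector(vector_from_me, lambda v: 2 * v + 1, to_matrix_engine=False)
```

In both cases the stored values are identical; only the form in which they leave memory 120 differs, as a vector for the matrix multiplier or as indexed scalars for the devices 110.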

[0079] Thus, the application materials describe various embodiments of a vector processor that achieves high operating speed by organizing its computing elements as scalar devices 110 with lines, which in turn contain sequential arithmetic blocks (modules 210). This makes it possible to perform operations on vector elements that require working with an intermediate result, i.e., to execute operations in a pipelined manner. In addition, the claimed solution enables streaming configuration of the processor's computing elements, including the vector memory, which further increases processor speed by eliminating the need to send a command to every processor element for each operation of the same type.

[0080] The specific choice of elements of the processor 100 for implementing various software, hardware, and/or architectural solutions may vary, provided that the required functionality is preserved.

[0081] The application materials disclose preferred examples of implementing the technical solution and should not be interpreted as limiting other, particular embodiments that do not go beyond the scope of the legal protection sought and that are obvious to those skilled in the relevant art. Accordingly, the scope of the present technical solution is limited only by the scope of the appended claims.

Claims

1. An on-chip vector processor containing:

• scalar devices, each of which is configured to obtain a vector element from a matrix multiplication device and consists of:

at least two scalar modules forming at least one scalar line, configured to perform arithmetic operations on a vector element;

at least one first demultiplexer configured to redirect data from the first scalar module in the line to the multiport shared memory or the second scalar module in the line;

a block of complex arithmetic operations connected to at least one scalar module in a line and configured to perform mathematical functions on a vector element;

at least one second demultiplexer configured to redirect data from the multi-port shared memory to a scalar module or to the matrix multiplication device;

• a multi-port shared memory connected to at least an external memory, a second demultiplexer of each scalar device and a matrix multiplication device, configured to buffer and store the results of intermediate calculations over the vector elements of each scalar device;

• a horizontal operations block connected to the scalar devices and configured to:

perform arithmetic operations on combined vector elements;

transmit the result of an arithmetic operation on the combined vector elements to the scalar devices.

2. The vector processor according to claim 1, characterized in that the scalar devices are arranged on the die in the form of a tile.

3. The vector processor according to claim 1, characterized in that each scalar device contains at least four scalar modules.

4. The vector processor according to claim 3, characterized in that at least four scalar modules form at least two scalar lines.

5. The vector processor according to claim 1, characterized in that the scalar line is at least two scalar modules connected in series.

6. The vector processor according to claim 1, characterized in that each of the at least two scalar modules is a configurable stream processor containing: an adder unit, a multiplier unit, a function calculation module, and local memory.

7. The vector processor according to claim 1, characterized in that the data switching performed by each first demultiplexer in each scalar device from the first scalar module in the line depends on the executed processor operation.

8. The vector processor according to claim 1, characterized in that the mathematical functions on the vector element are at least:

• exponential function;

• natural logarithm;

• reciprocal;

• inverse square root;

• mathematical functions using lookup tables (LUTs).

9. The vector processor according to claim 1, characterized in that the arithmetic operations performed by the horizontal operations block are at least:

• addition of vector elements obtained from scalar devices;

• finding the maximum element among the obtained scalar elements;

• finding the minimum element among the obtained scalar elements;

• finding the reciprocal of scalar elements;

• finding the inverse square root of scalar elements.

10. The vector processor according to claim 1, characterized in that the executable operation performed by the processor is specified by at least one configuration stream.

11. The vector processor according to claim 10, characterized in that each configuration determines the parameters and type of vector processing.

12. The vector processor according to claim 10, characterized in that one configuration is distributed in parallel to each of the scalar devices.

13. The vector processor according to claim 1, characterized in that the processor elements are connected to each other by a bus.

14. The vector processor according to claim 13, characterized in that the bus is configured to provide communication between the devices of the vector processor and to distribute configurations and data.

15. The vector processor according to claim 1, characterized in that the multiport shared memory contains:

• a write port for each scalar module;

• a read port for each scalar module;

• ports for communication with external memory;

• ports for communication with the permutation device.