
WO2018184570A1 - Arithmetic Device and Method - Google Patents

Arithmetic Device and Method

Info

Publication number
WO2018184570A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
power
neuron
unit
neural network
Prior art date
Application number
PCT/CN2018/081929
Other languages
English (en)
French (fr)
Original Assignee
上海寒武纪信息科技有限公司
Priority date
Filing date
Publication date
Priority claimed from CN201710222232.5A external-priority patent/CN108694181B/zh
Priority claimed from CN201710227493.6A external-priority patent/CN108694441B/zh
Priority claimed from CN201710256444.5A external-priority patent/CN108733625B/zh
Priority claimed from CN201710266052.7A external-priority patent/CN108734280A/zh
Priority claimed from CN201710312415.6A external-priority patent/CN108805271B/zh
Priority to CN201880001242.9A priority Critical patent/CN109219821B/zh
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Priority to CN201811423295.8A priority patent/CN109409515B/zh
Priority to EP19199526.5A priority patent/EP3633526A1/en
Priority to EP18780474.5A priority patent/EP3579150B1/en
Priority to EP19199528.1A priority patent/EP3624018B1/en
Priority to EP19199521.6A priority patent/EP3620992B1/en
Priority to EP19199524.0A priority patent/EP3627437B1/en
Priority to EP24168317.6A priority patent/EP4372620A3/en
Publication of WO2018184570A1 publication Critical patent/WO2018184570A1/zh
Priority to US16/283,711 priority patent/US10896369B2/en
Priority to US16/520,082 priority patent/US11010338B2/en
Priority to US16/520,041 priority patent/US11551067B2/en
Priority to US16/520,615 priority patent/US10671913B2/en
Priority to US16/520,654 priority patent/US11049002B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162Delete operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30083Power or thermal control instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, and more particularly to an arithmetic device and method.
  • Multi-layer neural networks are widely used in tasks such as classification and recognition. In recent years, due to their high recognition rate and high parallelism, they have received extensive attention from academia and industry.
  • neural networks with better performance are usually very large, which means that these neural networks require a large amount of computing resources and storage resources.
  • the overhead of a large amount of computing and storage resources reduces the operation speed of the neural network and, at the same time, greatly increases the demands on the transmission bandwidth of the hardware and on the computing device.
  • the present disclosure provides an arithmetic apparatus and method to at least partially solve the above-mentioned technical problems.
  • a neural network computing device including:
  • an operation module for performing a neural network operation; and
  • a power conversion module coupled to the operation module for converting input neuron data and/or output neuron data of the neural network operation into power neuron data.
  • the power conversion module includes:
  • a first power conversion unit configured to convert the neuron data output by the operation module into power neuron data
  • a second power conversion unit configured to convert the neuron data input to the operation module into power neuron data.
  • the operation module further includes: a third power conversion unit configured to convert the power neuron data into non-power neuron data.
  • the neural network computing device further includes:
  • a storage module for storing data and operation instructions
  • control module configured to control interaction between the data and the operation instruction, receive the data and the operation instruction sent by the storage module, and decode the operation instruction into the operation micro instruction;
  • the operation module includes an operation unit, configured to receive data and operation micro-instructions sent by the control module, and perform neural network operations on the weight data and the neuron data received according to the operation micro-instruction.
  • control module includes: an operation instruction buffer unit, a decoding unit, an input neuron buffer unit, a weight buffer unit, and a data control unit;
  • An operation instruction buffer unit connected to the data control unit, for receiving an operation instruction sent by the data control unit;
  • a decoding unit coupled to the operation instruction buffer unit, for reading an operation instruction from the operation instruction buffer unit, and decoding the operation instruction into an operation micro instruction
  • a neuron buffer unit coupled to the data control unit, for acquiring corresponding power neuron data from the data control unit
  • a weight buffer unit connected to the data control unit, for acquiring corresponding weight data from the data control unit;
  • a data control unit coupled to the storage module, configured to implement the interaction of data and operation instructions between the storage module and the operation instruction buffer unit, the weight buffer unit, and the input neuron buffer unit; wherein,
  • the operation unit is respectively connected to the decoding unit, the input neuron buffer unit, and the weight buffer unit, receives the operation microinstructions, the power neuron data, and the weight data, and performs the corresponding neural network operations on the received power neuron data and weight data according to the operation microinstructions.
  • the neural network computing device further includes: an output module, comprising: an output neuron buffer unit, configured to receive neuron data output by the operation module;
  • the power conversion module includes:
  • a first power conversion unit coupled to the output neuron buffer unit, configured to convert the neuron data output by the output neuron buffer unit into power neuron data
  • a second power conversion unit connected to the storage module, configured to convert the neuron data input to the storage module into power neuron data
  • the operation module further includes: a third power conversion unit connected to the operation unit for converting the power neuron data into non-power neuron data.
  • the first power conversion unit is further connected to the data control unit, and the neuron data output by the operation module is converted into power neuron data and sent to the data control unit as the input data of the next layer of the neural network operation.
  • the power neuron data includes a sign bit and a power bit, the sign bit being used to represent the sign of the power neuron data and the power bit being used to represent the power bit data of the power neuron data; the sign bit includes one or more bits of data, the power bit includes m bits of data, and m is a positive integer greater than 1.
  • the neural network computing device further includes a storage module prestored with a coding table that includes power bit data and exponent values, the coding table being used to obtain, for each power bit data of the power neuron data, the exponent value corresponding to that power bit data.
  • the coding table further includes one or more zero-setting power bit data, and the power neuron data corresponding to the zero-setting power bit data is 0.
  • the largest power bit data corresponds to power neuron data of 0, or the smallest power bit data corresponds to power neuron data of 0.
  • the correspondence relationship of the coding table is that the highest bit of the power bit data represents a zero-setting bit, and the other m−1 bits of the power bit data correspond to an exponent value.
  • the correspondence relationship of the coding table is a positive correlation: the storage module prestores an integer value x and a positive integer value y, and the smallest power bit data corresponds to an exponent value of x, where x represents an offset value and y represents a step size.
  • the smallest power bit data corresponds to an exponent value of x, the largest power bit data corresponds to power neuron data of 0, and each power bit data other than the smallest and the largest corresponds to an exponent value of (power bit data + x) * y.
  • the correspondence relationship of the coding table is a negative correlation: the storage module prestores an integer value x and a positive integer value y, and the largest power bit data corresponds to an exponent value of x, where x represents an offset value and y represents a step size.
  • the largest power bit data corresponds to an exponent value of x, the smallest power bit data corresponds to power neuron data of 0, and each power bit data other than the smallest and the largest corresponds to an exponent value of (power bit data − x) * y.
  • in a preferred case, y is 1 and the value of x is equal to 2^(m−1).
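  • For illustration, the coding-table mapping described above can be sketched as follows (a minimal Python sketch; the choice m = 4, x = −8, y = 1, the function names, and the use of None as the zero flag are illustrative assumptions, not part of the disclosure):

```python
# Build a positive-correlation coding table: the smallest power bit data maps to
# exponent x, the largest power bit data is the zero flag, and every other
# power bit value p maps to the exponent (p + x) * y.
def make_coding_table(m: int, x: int, y: int) -> dict:
    table = {}
    for p in range(2 ** m):
        if p == 2 ** m - 1:       # largest power bit data: zero flag
            table[p] = None       # the represented power neuron value is 0
        elif p == 0:              # smallest power bit data: exponent x
            table[p] = x
        else:
            table[p] = (p + x) * y
    return table

def decode(sign_bit: int, power_bits: int, table: dict) -> float:
    """Recover the neuron value from its sign bit and m-bit power bit data."""
    exponent = table[power_bits]
    if exponent is None:
        return 0.0
    return (-1) ** sign_bit * 2.0 ** exponent

tbl = make_coding_table(m=4, x=-8, y=1)   # exponents -8 .. 6, plus a zero flag
print(decode(0, 0b0011, tbl))             # +2**(-5) = 0.03125
```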
  • converting the neuron data into power neuron data includes one of the following, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ = d_in × s_in is the positive part of the input data, and d_out+ = d_out × s_out is the positive part of the output data:
  • s_out = s_in and d_out+ = ⌊log2(d_in+)⌋, where ⌊x⌋ denotes the floor operation on the data x; or,
  • s_out = s_in and d_out+ = ⌈log2(d_in+)⌉, where ⌈x⌉ denotes the ceiling operation on the data x; or,
  • s_out = s_in and d_out+ = [log2(d_in+)], where [x] denotes the rounding operation on the data x.
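  • A minimal Python sketch of the three conversion variants above, assuming the power bit stores the floor, ceiling, or rounded value of log2 of the input magnitude (the function name, the mode argument, and returning None for zero are illustrative assumptions):

```python
import math

def to_power(d_in: float, mode: str = "floor"):
    """Convert non-power data d_in to (sign, exponent) power form."""
    s_in = 0 if d_in >= 0 else 1        # sign bit of the input data
    d_in_pos = abs(d_in)                # positive part of the input (d_in+ above)
    if d_in_pos == 0:
        return s_in, None               # zero is expressed via the zero-flag code
    log = math.log2(d_in_pos)
    if mode == "floor":
        exponent = math.floor(log)
    elif mode == "ceil":
        exponent = math.ceil(log)
    else:                               # "round"
        exponent = round(log)
    return s_in, exponent               # sign is kept, exponent goes to the power bits

print(to_power(6.0, "floor"))   # (0, 2): represents +2**2
print(to_power(-6.0, "round"))  # (1, 3): represents -2**3
```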
  • a neural network operation method including:
  • the step of converting the input neuron data of the neural network operation into power neuron data prior to performing the neural network operation comprises:
  • the operation instruction, the power neuron data, and the weight data are received and stored.
  • the method further includes:
  • the operation instruction is read and decoded into each operation micro instruction.
  • a neural network operation is performed on the weight data and the power neuron data according to the operation microinstruction.
  • the step of converting the output neuron data of the neural network operation to the power neuron data after performing the neural network operation comprises:
  • the non-power neuron data in the neuron data obtained after the neural network operation is converted into power neuron data.
  • the non-power neuron data in the neuron data obtained after the neural network operation is converted into power neuron data and sent to the data control unit as an input to the next layer of the neural network operation.
  • the neural network operation step and the step of converting non-power neuron data into power neuron data are repeated until the operation of the last layer of the neural network is completed.
  • an integer value x and a positive integer value y are prestored in the storage module, where x represents an offset value and y represents a step size; by changing the integer value x and the positive integer value y prestored in the storage module, the range of power neuron data that the neural network computing device can represent is adjusted.
  • a method of using the neural network computing device, wherein the integer value x and the positive integer value y prestored in the storage module are changed to adjust the range of power neuron data that the neural network computing device can represent.
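  • As a small worked example of this range adjustment (a sketch under the positive-correlation mapping above, with illustrative values of m, x, and y):

```python
# With m power bits and exponent (p + x) * y, changing x shifts the representable
# exponent range and changing y stretches it; the largest code is the zero flag.
def exponent_range(m: int, x: int, y: int):
    exponents = [(p + x) * y for p in range(2 ** m - 1)]
    return min(exponents), max(exponents)

print(exponent_range(m=5, x=-16, y=1))   # (-16, 14)
print(exponent_range(m=5, x=-16, y=2))   # (-32, 28)
```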
  • a neural network computing device including:
  • an operation module for performing a neural network operation; and
  • a power conversion module coupled to the operation module for converting input data and/or output data of the neural network operation into power data.
  • the input data includes input neuron data and input weight data;
  • the output data includes output neuron data and output weight data;
  • the power data includes power neuron data and power weight data.
  • the power conversion module includes:
  • a first power conversion unit for converting output data of the operation module into power data
  • a second power conversion unit configured to convert input data of the operation module into power data.
  • the operation module further includes: a third power conversion unit configured to convert the power data into non-power data.
  • the neural network computing device further includes:
  • a storage module for storing data and operation instructions
  • control module configured to control interaction between the data and the operation instruction, receive the data and the operation instruction sent by the storage module, and decode the operation instruction into the operation micro instruction;
  • the operation module includes an operation unit, configured to receive data and operation micro-instructions sent by the control module, and perform neural network operations on the weight data and the neuron data received according to the operation micro-instruction.
  • control module includes: an operation instruction buffer unit, a decoding unit, an input neuron buffer unit, a weight buffer unit, and a data control unit;
  • An operation instruction buffer unit connected to the data control unit, for receiving an operation instruction sent by the data control unit;
  • a decoding unit coupled to the operation instruction buffer unit, for reading an operation instruction from the operation instruction buffer unit, and decoding the operation instruction into an operation micro instruction
  • a neuron buffer unit coupled to the data control unit, for acquiring corresponding power neuron data from the data control unit
  • a weight buffer unit connected to the data control unit, for acquiring corresponding power weight data from the data control unit;
  • a data control unit coupled to the storage module, configured to implement the interaction of data and operation instructions between the storage module and the operation instruction buffer unit, the weight buffer unit, and the input neuron buffer unit; wherein,
  • the operation unit is respectively connected to the decoding unit, the input neuron buffer unit, and the weight buffer unit, receives the operation microinstructions, the power neuron data, and the power weight data, and performs the corresponding neural network operations on the received power neuron data and power weight data according to the operation microinstructions.
  • the neural network computing device further includes: an output module, comprising: an output neuron buffer unit, configured to receive neuron data output by the operation module;
  • the power conversion module includes:
  • a first power conversion unit coupled to the output neuron buffer unit and the operation unit, configured to convert the neuron data output by the output neuron buffer unit into power neuron data and to convert the weight data output by the operation unit into power weight data;
  • a second power conversion unit is connected to the storage module, and configured to convert the neuron data and the weight data input to the storage module into power neuron data and power weight data respectively;
  • the operation module further includes: a third power conversion unit connected to the operation unit, configured to convert the power neuron data and the power weight data into non-power neuron data and non-power weight data, respectively.
  • the first power conversion unit is further connected to the data control unit, and the neuron data and the weight data output by the operation module are respectively converted into power neuron data and power weight data and sent to the data control unit as the input data of the next layer of the neural network operation.
  • the power neuron data includes a sign bit and a power bit, the sign bit being used to represent the sign of the power neuron data and the power bit being used to represent the power bit data of the power neuron data; the sign bit includes one or more bits of data, the power bit includes m bits of data, and m is a positive integer greater than 1;
  • the power weight data indicates that the value of the weight data is represented by its power exponent value, wherein the power weight data includes a sign bit and a power bit, the sign bit uses one or more bits to represent the sign of the weight data, the power bit uses m bits to represent the power bit data of the weight data, and m is a positive integer greater than 1.
  • the neural network computing device further includes a storage module prestored with a coding table that includes power bit data and exponent values, the coding table being used to obtain, for each power bit data of the power neuron data and the power weight data, the exponent value corresponding to that power bit data.
  • the coding table further includes one or more zero-setting power bit data, and the power neuron data and the power weight data corresponding to the zero-setting power bit data are 0.
  • the largest power bit data corresponds to power neuron data and power weight data of 0, or the smallest power bit data corresponds to power neuron data and power weight data of 0.
  • the correspondence relationship of the coding table is that the highest bit of the power bit data represents a zero-setting bit, and the other m−1 bits of the power bit data correspond to an exponent value.
  • the correspondence relationship of the coding table is a positive correlation: the storage module prestores an integer value x and a positive integer value y, and the smallest power bit data corresponds to an exponent value of x, where x represents an offset value and y represents a step size.
  • the smallest power bit data corresponds to an exponent value of x, the largest power bit data corresponds to power neuron data and power weight data of 0, and each power bit data other than the smallest and the largest corresponds to an exponent value of (power bit data + x) * y.
  • the correspondence relationship of the coding table is a negative correlation: the storage module prestores an integer value x and a positive integer value y, and the largest power bit data corresponds to an exponent value of x, where x represents an offset value and y represents a step size.
  • the largest power bit data corresponds to an exponent value of x, the smallest power bit data corresponds to power neuron data and power weight data of 0, and each power bit data other than the smallest and the largest corresponds to an exponent value of (power bit data − x) * y.
  • in a preferred case, y is 1 and the value of x is equal to 2^(m−1).
  • converting the neuron data and the weight data into power neuron data and power weight data, respectively, includes one of the following, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ = d_in × s_in is the positive part of the input data, and d_out+ = d_out × s_out is the positive part of the output data:
  • s_out = s_in and d_out+ = ⌊log2(d_in+)⌋, where ⌊x⌋ denotes the floor operation on the data x; or,
  • s_out = s_in and d_out+ = ⌈log2(d_in+)⌉, where ⌈x⌉ denotes the ceiling operation on the data x; or,
  • s_out = s_in and d_out+ = [log2(d_in+)], where [x] denotes the rounding operation on the data x.
  • a neural network operation method including:
  • the input data of the neural network operation is converted into power data before the neural network operation is performed; and/or the output data of the neural network operation is converted into power data after performing the neural network operation.
  • the input data includes input neuron data and input weight data;
  • the output data includes output neuron data and output weight data;
  • the power data includes power neuron data and power weight data.
  • the step of converting the input data of the neural network operation to power data prior to performing the neural network operation comprises:
  • the operation instruction and the power data are received and stored.
  • the method further includes:
  • the operation instruction is read and decoded into each operation micro instruction.
  • a neural network operation is performed on the power weight data and the power neuron data according to the operation microinstruction.
  • the step of converting the output data of the neural network operation to power data after performing the neural network operation comprises:
  • the non-power data in the data obtained by the neural network operation is converted into power data.
  • the non-power data in the data obtained after the neural network operation is converted into power data and sent to the data control unit as the input data of the next layer of the neural network operation; the neural network operation step and the step of converting non-power data into power data are repeated until the operation of the last layer of the neural network is completed.
  • an integer value x and a positive integer value y are prestored in the storage module, where x represents an offset value and y represents a step size; by changing the integer value x and the positive integer value y prestored in the storage module, the range of power data that the neural network computing device can represent is adjusted.
  • a method of using the neural network computing device, wherein the integer value x and the positive integer value y prestored in the storage module are changed to adjust the range of power data that the neural network computing device can represent.
  • an arithmetic device comprising:
  • An operation control module configured to determine the block information
  • the operation module is configured to perform block, transpose, and merge operations on the operation matrix according to the block information to obtain a transposed matrix of the operation matrix.
  • the computing device further includes:
  • An address storage module configured to store address information of the operation matrix
  • a data storage module configured to store the operation matrix, and store the transposed matrix after the operation
  • the operation control module is configured to extract the address information of the operation matrix from the address storage module and to obtain the block information according to the address information of the operation matrix; the operation module is configured to obtain the address information and the block information of the operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information of the operation matrix, perform block, transpose, and merge operations on the operation matrix according to the block information to obtain the transposed matrix of the operation matrix, and feed back the transposed matrix of the operation matrix to the data storage module.
  • the operation module includes a matrix block unit, a matrix operation unit, and a matrix merge unit, wherein:
  • a matrix block unit configured to acquire the address information and the block information of the operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information of the operation matrix, and divide the operation matrix into n block matrices according to the block information;
  • a matrix operation unit configured to acquire the n block matrices, and perform transposition operations on the n block matrices to obtain a transposed matrix of the n block matrices
  • a matrix merging unit configured to acquire and merge the transposed matrices of the n block matrices, obtain the transposed matrix of the operation matrix, and feed back the transposed matrix of the operation matrix to the data storage module, where n is a natural number.
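  • A minimal NumPy sketch of the block / transpose / merge flow performed by these three units (the tile size, function name, and use of NumPy are illustrative assumptions; the hardware operates on the block matrices described above):

```python
import numpy as np

def blocked_transpose(a: np.ndarray, block: int = 2) -> np.ndarray:
    """Cut the operation matrix into tiles, transpose each tile, and merge the
    transposed tiles at mirrored positions to obtain the full transposed matrix."""
    rows, cols = a.shape
    out = np.empty((cols, rows), dtype=a.dtype)
    for i in range(0, rows, block):                  # matrix block unit: cut tiles
        for j in range(0, cols, block):
            tile = a[i:i + block, j:j + block]       # one of the n block matrices
            out[j:j + block, i:i + block] = tile.T   # transpose, then merge in place
    return out

a = np.arange(12).reshape(3, 4)
assert np.array_equal(blocked_transpose(a), a.T)
```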
  • the operation module further includes a buffer unit for buffering the n block matrices for acquisition by the matrix operation unit.
  • the arithmetic control module includes an instruction processing unit, an instruction cache unit, and a matrix determination unit, wherein:
  • An instruction cache unit configured to store a matrix operation instruction to be executed
  • An instruction processing unit configured to obtain a matrix operation instruction from the instruction cache unit, decode the matrix operation instruction, and obtain address information of the operation matrix from the address storage module according to the decoded matrix operation instruction ;
  • the matrix determining unit is configured to analyze address information of the operation matrix to obtain the block information.
  • the operation control module further includes a dependency processing unit, configured to determine whether the decoded matrix operation instruction and the address information of the operation matrix conflict with a previous operation; if there is a conflict, the decoded matrix operation instruction and the address information of the operation matrix are temporarily stored; if there is no conflict, the decoded matrix operation instruction and the address information of the operation matrix are transmitted to the matrix determination unit.
  • the operation control module further includes an instruction queue memory for buffering the conflicting decoded matrix operation instruction and the address information of the operation matrix; when the conflict is eliminated, the buffered decoded matrix operation instruction and the address information of the operation matrix are transmitted to the matrix determination unit.
  • the instruction processing unit includes an instruction fetch unit and a decoding unit, wherein:
  • An instruction fetching unit configured to acquire a matrix operation instruction from the instruction cache unit, and transmit the matrix operation instruction to the decoding unit;
  • a decoding unit configured to decode the matrix operation instruction, extract the address information of the operation matrix from the address storage module according to the decoded matrix operation instruction, and transmit the decoded matrix operation instruction and the extracted address information of the operation matrix to the dependency processing unit.
  • the apparatus further includes an input and output module, configured to input the operation matrix data to the data storage module, and further configured to acquire the transposed matrix obtained by the operation from the data storage module and to output the transposed matrix.
  • the address storage module includes a scalar register file or a general-purpose memory unit; the data storage module includes a scratchpad memory or a general-purpose memory unit; the address information of the operation matrix is the starting address information and the matrix size information of the matrix.
  • an arithmetic method comprising the steps of:
  • the operation control module determines the block information
  • the operation module performs block, transpose, and merge operations on the operation matrix according to the block information to obtain a transposed matrix of the operation matrix.
  • the step of the operation control module determining the block information includes:
  • the operation control module extracts address information of the operation matrix from the address storage module
  • the operation control module obtains the block information according to the address information of the operation matrix.
  • the step of extracting the address information of the operation matrix from the address storage module by the operation control module includes:
  • the fetching unit extracts the operation instruction and sends the operation instruction to the decoding unit;
  • the decoding unit decodes the operation instruction, acquires the address information of the operation matrix from the address storage module according to the decoded operation instruction, and sends the decoded operation instruction and the address information of the operation matrix to the dependency processing unit;
  • the dependency processing unit analyzes whether the decoded operation instruction has a data dependency on a previous instruction that has not finished executing; if there is a dependency, the decoded operation instruction and the address information of the corresponding operation matrix wait in the instruction queue memory until the data dependency on the previous unexecuted instruction no longer exists;
  • the step of the operation module performing block, transpose, and merge operations on the operation matrix according to the block information to obtain the transposed matrix of the operation matrix includes:
  • the operation module extracts an operation matrix from the data storage module according to the address information of the operation matrix; and divides the operation matrix into n block matrices according to the block information;
  • the operation module performs a transposition operation on the n block matrices to obtain a transposed matrix of the n block matrices;
  • the operation module merges the transposed matrix of the n block matrices, obtains a transposed matrix of the operation matrix, and feeds back to the data storage module;
  • n is a natural number.
  • the computing module combines the transposed matrices of the n block matrices to obtain a transposed matrix of the operational matrix, and the step of feeding back the transposed matrix to the data storage module includes:
  • the matrix merging unit receives the transposed matrix of each block matrix; after the number of received block-matrix transposed matrices reaches the total number of blocks, a matrix merging operation is performed on all of the blocks to obtain the transposed matrix of the operation matrix, and the transposed matrix is fed back to a specified address of the data storage module;
  • the input/output module directly accesses the data storage module, and reads the transposed matrix of the operation matrix obtained by the operation from the data storage module.
  • a data screening apparatus comprising:
  • a storage unit for storing data
  • a register unit for storing a data address in the storage unit
  • the data screening module acquires a data address in the register unit, acquires corresponding data in the storage unit according to the data address, and performs a filtering operation on the obtained data to obtain a data screening result.
  • the data screening module includes a data screening unit that performs a screening operation on the acquired data.
  • the data screening module further includes: an I/O unit, an input data buffer unit, and an output data buffer unit;
  • the I/O unit carries the data stored by the storage unit to the input data buffer unit;
  • the input data buffer unit stores data carried by the I/O unit
  • the data screening unit takes the data transmitted from the input data buffer unit as input data, performs a filtering operation on the input data, and transmits the output data to the output data buffer unit;
  • the output data buffer unit stores output data.
  • the input data includes data to be screened and location information data, the output data comprising filtered data, or filtered data and related information.
  • the data to be filtered is a vector or an array
  • the location information data is a binary code, a vector or an array
  • the related information includes a vector length, an array size, and an occupied space.
  • the data screening unit scans each component of the location information data: if a component is 0, the corresponding data component to be filtered is deleted, and if it is 1, the corresponding data component to be filtered is retained; alternatively, if a component is 1, the corresponding data component to be filtered is deleted, and if it is 0, it is retained. After the scan is completed, the filtered data is obtained and output.
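  • A minimal Python sketch of this screening operation, assuming the location information data is a 0/1 mask aligned with the data to be filtered (the function name and the related-information fields are illustrative assumptions):

```python
def screen(data, mask, keep_ones: bool = True):
    """Keep the components whose mask value matches the retained value
    (1 by default, or 0 when the opposite convention is used)."""
    keep = 1 if keep_ones else 0
    filtered = [d for d, m in zip(data, mask) if m == keep]
    related_info = {"vector_length": len(filtered)}   # e.g. length of the result
    return filtered, related_info

data = [1.5, 0.0, -2.0, 3.0, 0.0]
mask = [1, 0, 1, 1, 0]
print(screen(data, mask))   # ([1.5, -2.0, 3.0], {'vector_length': 3})
```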
  • the data screening module further includes a structural deformation unit that modifies the storage structure of the input data and/or the output data.
  • a data screening method is provided, and data filtering is performed by using the data screening device, including:
  • Step A The data screening module acquires a data address in the register unit
  • Step B obtaining corresponding data in the storage unit according to the data address
  • Step C Perform a screening operation on the obtained data to obtain a data screening result.
  • the step A includes: the data screening unit acquires an address of the data to be filtered and an address of the location information data from the register unit;
  • the step B includes:
  • Sub-step B1 the I/O unit transfers the data to be filtered and the location information data in the storage unit to the input data buffer unit;
  • Sub-step B2 the input data buffer unit transfers the data to be filtered and the location information data to the data filtering unit;
  • the step C includes: the data screening unit performs a filtering operation on the to-be-filtered data according to the location information data, and delivers the output data to the output data buffer unit.
  • between the sub-step B1 and the sub-step B2, the method further includes:
  • Sub-step B3 The input data buffer unit transmits the data to be filtered to the structural deformation unit, and the structural deformation unit performs deformation of the storage structure, and returns the deformed data to be filtered back to the input data buffer unit, and then performs sub-step B2.
  • a neural network processor includes: a memory, a scratch pad memory, and a heterogeneous kernel
  • the memory is configured to store data and instructions of a neural network operation
  • the scratch pad memory is connected to the memory through a memory bus
  • the heterogeneous kernel is connected to the scratch pad memory through a scratch pad memory bus, reads the data and instructions of the neural network operation through the scratch pad memory, completes the neural network operation, returns the operation result to the scratch pad memory, and controls the scratch pad memory to write the operation result back to the memory.
  • the heterogeneous kernel comprises:
  • a plurality of operational cores having at least two different types of operational cores for performing neural network operations or neural network layer operations;
  • one or more logic control cores for determining, based on the data of the neural network operation, whether the neural network operation or the neural network layer operation is performed by the dedicated cores and/or the general-purpose cores.
  • the plurality of operational cores includes x general-purpose cores and y dedicated cores, wherein the dedicated cores are dedicated to performing specified neural network/neural network layer operations, and the general-purpose cores are used to perform arbitrary neural network/neural network layer operations.
  • the general-purpose core is a CPU and the dedicated core is an NPU.
  • the scratch pad memory comprises a shared scratch pad memory and/or a non-shared scratch pad memory; wherein the shared scratch pad memory is correspondingly connected to at least two cores in the heterogeneous kernel through a scratch pad memory bus, and the non-shared scratch pad memory is correspondingly connected to one core in the heterogeneous kernel through a scratch pad memory bus.
  • the logic control core is connected to the scratch pad memory through a scratch pad memory bus, reads the data of the neural network operation through the scratch pad memory, and determines, according to the type and parameters of the neural network model in the data of the neural network operation, whether the dedicated kernel and/or the general-purpose kernel acts as the target kernel to perform the neural network operation and/or the neural network layer operation.
  • the logic control core sends a signal directly to the target core via the control bus or to the target core via the scratchpad memory to control the target core to perform neural network operations and / or neural network layer operations.
  • a neural network operation method wherein performing neural network operations using the neural network processor includes:
  • a logic control core in a heterogeneous kernel reads data and instructions of a neural network operation from memory through a scratch pad memory;
  • the logic control kernel in the heterogeneous kernel determines whether the neural network operation or the neural network layer operation is performed by the dedicated kernel and/or the general kernel according to the type and parameters of the neural network model in the data of the neural network operation.
  • the step in which the logic control kernel in the heterogeneous kernel determines, according to the type and parameters of the neural network model in the data of the neural network operation, whether the neural network layer operation is performed by the dedicated kernel and/or the general-purpose kernel includes:
  • the logic control kernel in the heterogeneous kernel determines whether there is a qualified dedicated kernel according to the type and parameters of the neural network model in the data of the neural network operation;
  • if the dedicated kernel m meets the condition, the dedicated kernel m is used as the target kernel, and the logic control kernel in the heterogeneous kernel sends a signal to the target kernel, and sends the data corresponding to the neural network operation and the address corresponding to the instruction to the target kernel;
  • the target kernel acquires data and instructions of the neural network operation from the memory through the shared or non-shared scratch pad memory according to the address, performs a neural network operation, and outputs the operation result to the memory through the shared or non-shared scratch pad memory, and the operation is completed;
  • if no dedicated kernel meets the condition, the logic control kernel in the heterogeneous kernel sends a signal to the general-purpose kernel, and sends the data corresponding to the neural network operation and the address corresponding to the instruction to the general-purpose kernel;
  • the general-purpose kernel acquires data and instructions of the neural network operation from the memory through the shared or non-shared scratch pad memory according to the address, performs neural network operations, and outputs the operation result to the memory through the shared or non-shared scratch pad memory, and the operation is completed.
  • the qualifying dedicated kernel refers to a dedicated kernel that supports specified neural network operations and that can perform the specified neural network operations.
  • the step in which the logic control kernel in the heterogeneous kernel determines, according to the type and parameters of the neural network model in the data of the neural network operation, whether the neural network operation is performed by the dedicated kernel and/or the general-purpose kernel includes:
  • the logic control kernel in the heterogeneous kernel parses the types and parameters of the neural network model in the data, determines whether there is a qualified dedicated kernel for each neural network layer, and assigns a corresponding general kernel to each neural network layer. Or a dedicated kernel to obtain a kernel sequence corresponding to the neural network layer;
  • the logic control kernel in the heterogeneous kernel sends the data corresponding to the neural network layer operation and the address corresponding to the instruction to the dedicated kernel or the general kernel corresponding to the neural network layer, and sends the kernel sequence to the dedicated kernel or the general core corresponding to the neural network layer.
  • the dedicated kernel and the general kernel corresponding to the neural network layer read the data and instructions of the neural network layer operation from the address, perform the neural network layer operation, and transfer the operation result to the designated address of the shared and/or non-shared scratch pad memory;
  • the logic control core controls the shared and/or non-shared scratch pad memory to write back the operation results of the neural network layer back to the memory, and the operation is completed.
  • the qualifying dedicated kernel refers to a dedicated kernel that supports a specified neural network layer operation and that can complete the scale of the specified neural network layer operation.
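  • A minimal Python sketch of this layer-by-layer dispatch by the logic control core (the core class, the supports check, and the layer descriptors are illustrative assumptions; the actual processor sends data and instruction addresses to the chosen cores):

```python
class DedicatedCore:
    """A dedicated core that supports certain layer types up to a given scale."""
    def __init__(self, name, supported_types, max_size):
        self.name = name
        self.supported_types = supported_types
        self.max_size = max_size

    def supports(self, layer_type, size):
        return layer_type in self.supported_types and size <= self.max_size

def build_kernel_sequence(layers, dedicated_cores, general_core_name):
    """Assign each layer to a qualified dedicated core, else to a general-purpose core."""
    sequence = []
    for layer in layers:
        target = next((c.name for c in dedicated_cores
                       if c.supports(layer["type"], layer["size"])), general_core_name)
        sequence.append((layer["name"], target))
    return sequence

npu = DedicatedCore("npu0", {"conv", "fc"}, max_size=1 << 20)
layers = [{"name": "conv1", "type": "conv", "size": 4096},
          {"name": "bn1", "type": "batchnorm", "size": 256}]
print(build_kernel_sequence(layers, [npu], "cpu0"))
# [('conv1', 'npu0'), ('bn1', 'cpu0')]
```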
  • the neural network operation comprises a spiking neural network operation; the neural network layer operation comprises a convolution operation, a fully connected layer operation, a concatenation operation, an element-wise addition/multiplication operation, a ReLU operation, a pooling operation, and/or a Batch Norm operation of a neural network layer.
  • the power data representation method is used to store the neuron data, which reduces the storage space required for storing network data; at the same time, this data representation simplifies the multiplication of neuron data and weight data, lowers the design requirements on the operation unit, and speeds up the neural network operation.
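  • For example, because a power neuron carries only a sign bit and an exponent, multiplying a weight by a power neuron reduces to a sign change and a binary shift of the weight; a minimal sketch (math.ldexp stands in for the hardware shift, and the zero-flag handling is an illustrative assumption):

```python
import math

def mul_weight_by_power_neuron(weight: float, sign_bit: int, exponent) -> float:
    """Multiply a weight by a power neuron (sign bit + exponent) without a multiplier."""
    if exponent is None:                       # zero-flag code: the neuron is 0
        return 0.0
    shifted = math.ldexp(weight, exponent)     # weight * 2**exponent (a shift in hardware)
    return -shifted if sign_bit else shifted

print(mul_weight_by_power_neuron(0.75, 0, 3))    # 0.75 * 2**3 = 6.0
print(mul_weight_by_power_neuron(0.75, 1, -1))   # -0.375
```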
  • the input data can first be subjected to power conversion and then input into the neural network computing device, thereby further reducing the overhead of the neural network storage resources and computing resources and improving the speed of the neural network operation.
  • the data screening apparatus and method of the present disclosure temporarily stores data and instructions participating in the screening operation on a dedicated cache, and can perform data filtering operations on data of different storage structures and different sizes more efficiently.
  • FIG. 1A is a schematic structural diagram of a neural network computing device according to an embodiment of the present disclosure.
  • FIG. 1B is a schematic structural diagram of a neural network computing device according to another embodiment of the present disclosure.
  • FIG. 1C is a schematic diagram of functions of an arithmetic unit according to an embodiment of the present disclosure.
  • FIG. 1D is another schematic diagram of the operation unit of the embodiment of the present disclosure.
  • FIG. 1E is a functional diagram of a main processing circuit according to an embodiment of the present disclosure.
  • FIG. 1F is another schematic structural diagram of a neural network computing device according to an embodiment of the present disclosure.
  • FIG. 1G is another schematic structural diagram of a neural network computing device according to an embodiment of the present disclosure.
  • FIG. 1H is a flowchart of a neural network operation method according to an embodiment of the present disclosure.
  • FIG. 1I is a schematic diagram of a coding table in accordance with an embodiment of the present disclosure.
  • FIG. 1J is another schematic diagram of a coding table in accordance with an embodiment of the present disclosure.
  • FIG. 1K is another schematic diagram of a coding table in accordance with an embodiment of the present disclosure.
  • FIG. 1L is another schematic diagram of a coding table in accordance with an embodiment of the present disclosure.
  • FIG. 1M is a schematic diagram of a representation method of power data according to an embodiment of the present disclosure.
  • FIG. 1N is a schematic diagram of multiplication operations of weights and power neurons in accordance with an embodiment of the present disclosure.
  • FIG. 1O is a schematic diagram of multiplication operations of weights and power neurons in accordance with an embodiment of the present disclosure.
  • FIG. 2A is a schematic structural diagram of a neural network computing device according to an embodiment of the present disclosure.
  • FIG. 2B is a flow chart of a neural network operation method in accordance with an embodiment of the present disclosure.
  • FIG. 2C is a schematic diagram showing a method of representing power data according to an embodiment of the present disclosure.
  • FIG. 2D is a schematic diagram of a multiplication operation of a neuron and a power weight according to an embodiment of the present disclosure.
  • FIG. 2E is a schematic diagram of a multiplication operation of a neuron and a power weight according to an embodiment of the present disclosure.
  • FIG. 2F is a flow chart of a neural network operation method in accordance with an embodiment of the present disclosure.
  • FIG. 2G is a schematic diagram of a method for representing power data according to an embodiment of the present disclosure.
  • FIG. 2H is a schematic diagram of a multiplication operation of a power neuron and a power weight according to an embodiment of the present disclosure.
  • FIG. 3A is a schematic structural diagram of an arithmetic device proposed by the present disclosure.
  • FIG. 3B is a schematic diagram of information flow of an arithmetic device proposed by the present disclosure.
  • FIG. 3C is a schematic structural diagram of an arithmetic module in the computing device proposed by the present disclosure.
  • FIG. 3D is a schematic diagram of matrix operations performed by the arithmetic module proposed by the present disclosure.
  • FIG. 3E is a schematic structural diagram of an arithmetic control module in an arithmetic device according to the present disclosure.
  • FIG. 3F is a detailed structural diagram of an arithmetic device according to an embodiment of the present disclosure.
  • FIG. 3G is a flowchart of an operation method according to another embodiment of the present disclosure.
  • FIG. 4A is a schematic diagram of the overall structure of a data screening apparatus according to an embodiment of the present disclosure.
  • FIG. 4B is a schematic diagram of functions of a data screening unit according to an embodiment of the present disclosure.
  • FIG. 4C is a schematic diagram of a specific structure of a data screening device according to an embodiment of the present disclosure.
  • FIG. 4D is another schematic structural diagram of a data screening apparatus according to an embodiment of the present disclosure.
  • FIG. 4E is a flow chart of a data screening method according to an embodiment of the present disclosure.
  • FIG. 5A schematically illustrates a heterogeneous multi-core neural network processor in accordance with an embodiment of the present disclosure.
  • FIG. 5B schematically illustrates a heterogeneous multi-core neural network processor in accordance with another embodiment of the present disclosure.
  • FIG. 5C is a flowchart of a neural network operation method according to another embodiment of the present disclosure.
  • FIG. 5D is a flowchart of a neural network operation method according to another embodiment of the present disclosure.
  • FIG. 5E schematically illustrates a heterogeneous multi-core neural network processor in accordance with another embodiment of the present disclosure.
  • FIG. 5F schematically illustrates a heterogeneous multi-core neural network processor in accordance with another embodiment of the present disclosure.
  • the computing device includes: an arithmetic module 1-1 for performing a neural network operation; and a power conversion module 1-2 connected to the operation module, for converting the input neuron data and/or output neuron data of the neural network operation into power neuron data.
  • the computing device includes:
  • the storage module 1-4 is configured to store data and an operation instruction
  • the control module 1-3 is connected to the storage module and configured to control the interaction between the data and the operation instruction, receive the data and the operation instruction sent by the storage module, and decode the operation instruction into the operation micro instruction;
  • the operation module 1-1 is connected to the control module, receives data and operation micro-instructions sent by the control module, and performs neural network operations on the weight data and the neuron data received according to the operation micro-instruction;
  • the power conversion module 1-2 is coupled to the operation module for converting input neuron data and/or output neuron data of the neural network operation into power neuron data.
  • the storage module can be integrated inside the computing device or can be external to the computing device as an off-chip memory.
  • the storage module includes: a storage unit 1-41 for storing data and operation instructions.
  • the control module includes:
  • An operation instruction buffer unit 1-32 connected to the data control unit, for receiving an operation instruction sent by the data control unit;
  • the decoding unit 1-33 is connected to the operation instruction buffer unit, and is configured to read an operation instruction from the operation instruction buffer unit and decode the operation instruction into each operation micro instruction;
  • the input neuron buffer unit 1-34 is connected to the data control unit and configured to receive the neuron data sent by the data control unit;
  • the weight buffer unit 1-35 is connected to the data control unit and configured to receive the weight data sent from the data control unit;
  • the data control unit 1-31 is connected to the storage module, and is configured to implement interaction between the storage module and the data and operation instructions between the operation instruction cache unit, the weight buffer unit, and the input neuron cache unit.
  • the operation module includes: an operation unit 1-11, respectively connected to the decoding unit, the input neuron buffer unit, and the weight buffer unit, and receives each operation micro instruction, neuron data, and weight data, for each The arithmetic microinstruction performs a corresponding operation on the neuron data and the weight data received.
  • the arithmetic unit includes, but is not limited to: one or more multipliers in a first part; one or more adders in a second part (more specifically, the adders of the second part may also form an addition tree); an activation function unit in a third part; and/or a vector processing unit in a fourth part. More specifically, the vector processing unit can process vector operations and/or pooling operations.
  • the first part multiplies the input data 1 (in1) and the input data 2 (in2) to obtain the multiplied output (out).
  • the input data in1 is added step by step through the addition tree to obtain output data (out), where in1 is a vector of length N, and N is greater than 1,
  • the output data (out) can be obtained by performing the operation (f) on the input data (in).
  • the operations of the above several parts can freely select one or more parts to be combined in different orders, thereby realizing the operation of various functions.
  • the corresponding computing units constitute a two-stage, three-stage, or four-stage pipeline architecture.
  • the arithmetic unit may include a main processing circuit and a plurality of slave processing circuits.
  • the main processing circuit is configured to divide the input data into a plurality of data blocks, and send at least one of the plurality of data blocks and at least one of a plurality of operation instructions to the slave processing circuits;
  • the plurality of slave processing circuits are configured to perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transmit the intermediate results to the main processing circuit;
  • the main processing circuit is configured to process a plurality of intermediate results sent from the processing circuit to obtain a result of the operation instruction, and send the result of the operation instruction to the data control unit.
  • the operation unit may include a branch processing circuit
  • the main processing circuit is connected to the branch processing circuit, and the branch processing circuit is connected to the plurality of slave processing circuits;
  • a branch processing circuit for performing forwarding of data or instructions between the main processing circuit and the slave processing circuit.
  • the arithmetic unit may include a main processing circuit and a plurality of slave processing circuits.
  • a plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the adjacent other slave processing circuits, and the main processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, where the k slave processing circuits are: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column.
  • the k slave processing circuits are used for forwarding data and instructions between the main processing circuit and the plurality of slave processing circuits.
  • the main processing circuit may further include: one or any combination of a conversion processing circuit, an activation processing circuit, and an addition processing circuit;
  • a conversion processing circuit, for performing, on the data block or the intermediate result received by the main processing circuit, an exchange between a first data structure and a second data structure (for example, conversion between continuous data and discrete data), or an exchange between a first data type and a second data type (for example, conversion between a fixed-point type and a floating-point type);
  • An addition processing circuit for performing an addition operation or an accumulation operation.
  • the slave processing circuit includes:
  • a multiplication processing circuit configured to perform a product operation on the received data block to obtain a product result
  • a forwarding processing circuit (optional) for forwarding the received data block or product result.
  • an accumulation processing circuit, configured to perform an accumulation operation on the product result to obtain the intermediate result.
  • the operation instruction may be an operation instruction of a matrix multiplied by a matrix, an accumulation instruction, an activation instruction, and the like.
  • the output module 1-5 includes: an output neuron buffer unit 1-51 connected to the operation unit for receiving neuron data output by the operation unit;
  • the power conversion module includes:
  • a first power conversion unit 1-21 coupled to the output neuron buffer unit for converting neuron data output by the output neuron buffer unit into power neuron data;
  • the second power conversion unit 1-22 is connected to the storage module and configured to convert the neuron data input to the storage module into power neuron data. For the power neuron data in the neural network input data, it is directly stored in the storage module.
  • the first and second power conversion units may also be disposed between the I/O module and the computing module, so as to convert the input neuron data and/or the output neuron data of the neural network operation into power neuron data.
  • the computing device may include: a third power conversion unit 1-23, configured to convert the power neuron data into non-power neuron data.
  • the non-power neuron data is converted into power neuron data by the second power conversion unit and then input into the operation unit to perform the operation.
  • the third power conversion unit is optional.
  • the third power conversion unit may be disposed outside the operation module (as shown in FIG. 1F) or may be disposed inside the operation module (FIG. 1G).
  • the non-power neuron data output after the operation can be converted into power neuron data by the first power conversion unit and then fed back to the data control unit to participate in subsequent operations, so as to speed up the operation; a closed loop can thereby be formed.
  • the data output by the arithmetic module can also be directly sent to the output neuron buffer unit, and sent by the output neuron buffer unit to the data control unit without passing through the power conversion unit.
  • the storage module can receive data and operation instructions from the external address space, and the data includes neural network weight data, neural network input data, and the like.
  • The first power conversion method: s_out = s_in, d_out+ = ⌊log₂(d_in+)⌋, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌊x⌋ indicates the flooring (round-down) operation on the data x.
  • The second power conversion method: s_out = s_in, d_out+ = ⌈log₂(d_in+)⌉, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌈x⌉ indicates the ceiling (round-up) operation on the data x.
  • The third power conversion method: s_out = s_in, d_out+ = [log₂(d_in+)], where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and [x] indicates the rounding operation on the data x.
  • the power conversion method of the present disclosure may also use rounding to odd, rounding to even, rounding toward zero, or random rounding. Among these, rounding to nearest, rounding toward zero, and random rounding are preferred so as to reduce the loss of precision.
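  • As an illustrative software sketch only (the function names and the (sign, exponent, is_zero) tuple encoding are assumptions, not the claimed hardware format), the following Python code shows how the conversion rules above behave under the floor, ceiling, and round-to-nearest modes.

```python
import math

def to_power(value, mode="floor"):
    """Convert a real value into a (sign, exponent) power representation.

    The value is approximated by (-1)**sign * 2**exponent.
    mode selects the rounding applied to log2(|value|): "floor", "ceil",
    or "round", matching the three conversion methods described above.
    Returns (sign, exponent, is_zero); is_zero flags the reserved zero code.
    """
    if value == 0:
        return (0, 0, True)            # zero is marked by a dedicated code word
    sign = 0 if value > 0 else 1       # s_out = s_in
    d_in_pos = abs(value)              # d_in+ : positive part of the input
    log2_val = math.log2(d_in_pos)
    if mode == "floor":
        exponent = math.floor(log2_val)
    elif mode == "ceil":
        exponent = math.ceil(log2_val)
    else:
        exponent = round(log2_val)
    return (sign, exponent, False)

def from_power(sign, exponent, is_zero=False):
    """Decode the power representation back to a real value."""
    if is_zero:
        return 0.0
    return (-1.0) ** sign * 2.0 ** exponent

if __name__ == "__main__":
    for v in (0.75, -5.0, 512.0):
        s, e, z = to_power(v, mode="floor")
        print(v, "->", (s, e), "decodes to", from_power(s, e, z))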
  • an embodiment of the present disclosure further provides a neural network operation method, including: performing a neural network operation; and converting the input neuron data of the neural network operation into a power before performing the neural network operation. Secondary neuron data; and/or after performing a neural network operation, converting the output neuron data of the neural network operation into power neuron data.
  • converting the input neuron data of the neural network operation into the power neuron data comprises: converting the non-power neuron data in the input data into the power neuron data; And receiving and storing the operation instruction, the power neuron data, and the weight data.
  • the method further includes: reading the operation instruction, and decoding the operation into each operation Microinstructions.
  • the neural network operation is performed on the weight data and the power neuron data according to the operation micro-instruction.
  • the step of converting the output neuron data of the neural network operation into the power neuron data comprises: outputting the neuron data obtained after the neural network operation; and obtaining the neural network operation The non-power neuron data in the neuron data is converted into power neuron data.
  • the non-power neuron data in the neuron data obtained by the neural network operation is converted into power neuron data and sent to the data control unit, as the input power neuron data of the next layer of the neural network operation.
  • the neural network of the embodiment of the present disclosure is a multi-layer neural network.
  • each layer of the neural network may be operated according to the operation method shown in FIG. 1H. The first-layer input neuron data of the neural network can be read in from the external address through the storage module; if the data read from the external address is already power data, it is directly transferred to the storage module, otherwise it is first converted into power data by the power conversion unit.
  • thereafter, the input power neuron data of each layer may be provided by the output power neuron data of one or more preceding layers of the neural network.
  • the single layer neural network operation method of this embodiment includes:
  • step S1-1 an operation instruction, weight data, and neuron data are acquired.
  • the step S1-1 includes the following sub-steps:
  • the data control unit receives the operation instruction, the power neuron data, and the weight data sent by the storage module;
  • the operation instruction buffer unit, the input neuron buffer unit, and the weight buffer unit respectively receive the operation instruction, the power neuron data, and the weight data sent by the data control unit, and distribute the data to the decoding unit or the operation unit.
  • the power neuron data indicates that the value of the neuron data is represented by its power exponential value.
  • the power neuron data includes a sign bit and a power bit; the sign bit represents the sign of the neuron data with one or more bits, and the power bit represents the power bit data of the power neuron data with m bits, where m is a positive integer greater than 1.
  • the storage unit of the storage module prestores a coding table, and provides an exponential value corresponding to each power bit data of the power neuron data.
  • the coding table sets one or more power-bit data (ie, zero-powered bit data) to specify that the corresponding power neuron data is zero.
  • the coding table may have a flexible storage manner, and may be stored in a form of a table or a mapping by a function relationship.
  • the correspondence of the coding tables can be arbitrary.
  • the correspondence of the coding tables may be out of order.
  • in a part of the coding table with m being 5: when the power bit data is 00000, the corresponding exponent value is 0; when the power bit data is 00001, the corresponding exponent value is 3; when the power bit data is 00010, the corresponding exponent value is 4; when the power bit data is 00011, the corresponding exponent value is 1; when the power bit data is 00100, the corresponding power neuron data is 0.
  • the correspondence relationship of the coding table may also be positively correlated. In this case, the storage module prestores an integer value x and a positive integer value y; the smallest power bit data corresponds to an exponent value of x, and one or more other power bit data correspond to power neuron data of 0. Here x denotes an offset value and y denotes a step size. In one embodiment, the smallest power bit data corresponds to an exponent value of x, the largest power bit data corresponds to power neuron data of 0, and the other power bit data (other than the minimum and maximum) correspond to an exponent value of (power bit data + x) * y.
  • the representation range of the powers becomes configurable and can be applied to different application scenarios requiring different ranges of values. Therefore, the neural network computing device has a wider application range, is more flexible to use, and can be adjusted according to user requirements.
  • y is 1, and the value of x equals -2^(m-1).
  • the exponent range of the numerical values represented by the power neuron data is -2^(m-1) to 2^(m-1) - 1.
  • in a part of the coding table in which m is 5, x is 0, and y is 1: when the power bit data is 00000, the corresponding exponent value is 0; when the power bit data is 00010, the corresponding exponent value is 2; when the power bit data is 00011, the corresponding exponent value is 3; when the power bit data is 11111, the corresponding power neuron data is 0.
  • FIG. 1K shows another part of a coding table, in which m is 5, x is 0, and y is 2: when the power bit data is 00000, the corresponding exponent value is 0; when the power bit data is 00001, the corresponding exponent value is 2; when the power bit data is 00010, the corresponding exponent value is 4; when the power bit data is 00011, the corresponding exponent value is 6; when the power bit data is 11111, the corresponding power neuron data is 0.
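  • As an illustrative sketch only (the function name is an assumption), the positively correlated coding table described above can be modelled in software as follows, using the formula exponent = (power bit data + x) * y together with a reserved zero code word; the printed values reproduce the m = 5 examples just given.

```python
def decode_power_bits(power_bits, m, x=0, y=1, zero_code=None):
    """Decode m-bit power bit data under a positively correlated coding table.

    exponent = (power_bits + x) * y, except that the reserved zero_code
    (by default the largest m-bit pattern) means the power neuron data is 0.
    Returns (is_zero, exponent).
    """
    if zero_code is None:
        zero_code = (1 << m) - 1       # e.g. 11111 for m = 5
    if power_bits == zero_code:
        return (True, None)
    return (False, (power_bits + x) * y)

# Examples matching the coding tables in the text (m = 5):
print(decode_power_bits(0b00010, m=5, x=0, y=1))   # (False, 2)
print(decode_power_bits(0b00011, m=5, x=0, y=2))   # (False, 6)
print(decode_power_bits(0b11111, m=5, x=0, y=1))   # (True, None)
```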
  • the correspondence relationship of the coding table may also be negatively correlated. In this case, the storage module prestores an integer value x and a positive integer value y; the largest power bit data corresponds to an exponent value of x, and one or more other power bit data correspond to power neuron data of 0. Here x denotes an offset value and y denotes a step size. In one embodiment, the largest power bit data corresponds to an exponent value of x, the smallest power bit data corresponds to power neuron data of 0, and the other power bit data (other than the minimum and maximum) correspond to an exponent value of (power bit data - x) * y.
  • the representation range of the powers becomes configurable and can be applied to different application scenarios requiring different ranges of values. Therefore, the neural network computing device has a wider application range, is more flexible to use, and can be adjusted according to user requirements.
  • y is 1, and the value of x equals 2^(m-1).
  • the exponent range of the numerical values represented by the power neuron data is -2^(m-1) - 1 to 2^(m-1).
  • in a part of such a coding table: when the power bit data is 11111, the corresponding exponent value is 0; when the power bit data is 11110, the corresponding exponent value is 1; when the power bit data is 11101, the corresponding exponent value is 2; when the power bit data is 11100, the corresponding exponent value is 3; when the power bit data is 00000, the corresponding power neuron data is 0.
  • the correspondence relationship of the coding table may also be that the highest bit of the power bit data indicates whether the value is zero, while the other m-1 bits of the power bit data correspond to an exponent value. When the highest bit of the power bit data is 0, the corresponding power neuron data is 0; when the highest bit of the power bit data is 1, the corresponding power neuron data is not 0. The reverse convention is also possible, that is, when the highest bit of the power bit data is 1, the corresponding power neuron data is 0, and when the highest bit is 0, the corresponding power neuron data is not 0. In other words, one bit of the power bit of the power neuron data is used to indicate whether the power neuron data is zero.
  • the sign bit is 1 bit
  • the power bit data bit is 7 bits, that is, m is 7.
  • in the coding table, when the power bit data is 1111111, the corresponding power neuron data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • when the power neuron data sign bit is 0 and the power bit is 0001001, the represented value is 2^9, i.e., 512; when the power neuron data sign bit is 1 and the power bit is 1111101, the represented value is -2^-3, i.e., -0.125.
  • power-only data retains only the power of the data, greatly reducing the storage space required to store the data.
  • the storage space required to store the neuron data can be reduced.
  • in this example the power data is 8-bit data; it should be recognized that the data length is not fixed, and on different occasions different data lengths are adopted according to the data range of the neuron data.
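  • As a hedged software sketch of the two's-complement convention above (the field widths and the helper name are illustrative), the following code decodes a power neuron consisting of a 1-bit sign and an m-bit power, with the all-ones power pattern reserved for zero, and reproduces the 2^9 = 512 and -2^-3 = -0.125 examples.

```python
def decode_power_neuron(sign_bit, power_bits, m=7):
    """Decode a power neuron: 1-bit sign + m-bit two's-complement exponent.

    The all-ones power pattern is the reserved zero code word.
    """
    if power_bits == (1 << m) - 1:      # e.g. 1111111 for m = 7 means value 0
        return 0.0
    # Interpret the m-bit pattern as a two's-complement exponent.
    exponent = power_bits - (1 << m) if power_bits >= (1 << (m - 1)) else power_bits
    return (-1.0) ** sign_bit * 2.0 ** exponent

print(decode_power_neuron(0, 0b0001001))   # 512.0   (2**9)
print(decode_power_neuron(1, 0b1111101))   # -0.125  (-2**-3)
```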
  • Step S1-2 performing neural network operations on the weight data and the neuron data according to the operation microinstruction.
  • the step S1-2 includes the following sub-steps:
  • the decoding unit reads the operation instruction from the operation instruction buffer unit, and decodes the operation instruction into each operation micro instruction;
  • the operation unit respectively receives the operation micro-instructions, the power neuron data, and the weight data sent by the decoding unit, the input neuron buffer unit, and the weight buffer unit, and performs the neural network operation on the weight data and the power neuron data according to the operation micro-instructions.
  • the multiplication of a power neuron and a weight is specifically performed as follows: the sign bit of the power neuron data and the sign bit of the weight data are XORed; if the correspondence relationship of the coding table is out of order, the coding table is searched to find the exponent value corresponding to the power bit of the power neuron data; if the correspondence relationship of the coding table is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to find the exponent value corresponding to the power bit of the power neuron data; if the correspondence relationship is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to find the exponent value corresponding to the power bit of the power neuron data; the exponent value is then added to the power (exponent) bits of the weight data, while the significant bits of the weight data remain unchanged.
  • for example, the weight data is 16-bit floating-point data with the sign bit being 0, the power (exponent) bits being 10101, and the significant bits being 0110100000, so the actual value it represents is 1.40625*2^6.
  • the power neuron data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5. In the coding table, when the power bit data is 11111, the corresponding power neuron data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • the power neuron is 000110, and the actual value it represents is 64, that is, 2^6.
  • the sum of the power bits of the weight and the power bits of the power neuron is 11011, so the actual value of the result is 1.40625*2^12, which is the product of the neuron and the weight.
  • the multiplication operation becomes an addition operation, and the amount of calculation required for the calculation is reduced.
  • as another example, the weight data is 32-bit floating-point data with the sign bit being 1, the power (exponent) bits being 10000011, and the significant bits being 10010010000000000000000, so the actual value it represents is -1.5703125*2^4.
  • the power neuron data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5. The power neuron is 111100, and the actual value it represents is -2^-4.
  • the sum of the power bits of the weight and the power bits of the power neuron is 01111111, so the actual value of the result is 1.5703125*2^0, which is the product of the neuron and the weight.
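  • The following Python sketch is an illustration rather than the hardware datapath: it mimics the multiplication described above by XORing the sign bits and adding the power neuron's exponent to the weight's exponent (made visible in software through math.ldexp), and reproduces the two worked examples.

```python
import math

def multiply_weight_by_power_neuron(weight, neuron_sign, neuron_exponent):
    """Multiply a floating-point weight by a power neuron (-1)**s * 2**e.

    The mantissa of the weight is untouched; only its exponent grows by
    neuron_exponent, and the sign bits are XORed.
    """
    sign = (0 if weight >= 0 else 1) ^ neuron_sign
    magnitude = math.ldexp(abs(weight), neuron_exponent)  # |weight| * 2**e
    return -magnitude if sign else magnitude

# Weight 1.40625 * 2**6 (= 90.0) times power neuron 2**6 (= 64):
w = 1.40625 * 2 ** 6
print(multiply_weight_by_power_neuron(w, neuron_sign=0, neuron_exponent=6))
# 5760.0 == 1.40625 * 2**12

# Weight -1.5703125 * 2**4 times power neuron -2**-4:
w2 = -1.5703125 * 2 ** 4
print(multiply_weight_by_power_neuron(w2, neuron_sign=1, neuron_exponent=-4))
# 1.5703125 == 1.5703125 * 2**0
```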
  • in step S1-3, the first power conversion unit converts the neuron data obtained after the neural network operation into power neuron data.
  • the step S1-3 includes the following sub-steps:
  • the output neuron buffer unit receives the neuron data obtained by the neural network operation sent by the computing unit;
  • the first power conversion unit receives the neuron data sent by the output neuron buffer unit, and converts the non-power neuron data therein into power neuron data.
  • The first power conversion method: s_out = s_in, d_out+ = ⌊log₂(d_in+)⌋, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌊x⌋ indicates the flooring (round-down) operation on the data x.
  • The second power conversion method: s_out = s_in, d_out+ = ⌈log₂(d_in+)⌉, where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌈x⌉ indicates the ceiling (round-up) operation on the data x.
  • The third power conversion method: s_out = s_in, d_out+ = [log₂(d_in+)], where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and [x] indicates the rounding operation on the data x.
  • the power neuron data obtained by the power conversion unit can serve as the input power neurons of the next layer of the neural network operation, and steps S1-1 to S1-3 are then repeated until the operation of the last layer of the neural network ends.
  • the power neuron data range that can be represented by the neural network operation device can be adjusted.
  • the present disclosure further provides a method of using the neural network computing device, in which the range of the power neuron data representable by the neural network computing device can be adjusted by changing the integer value x and the positive integer value y prestored in the storage module.
  • the power conversion module of the computing device is connected to the computing module, and is used for converting the input data and/or the output data of the neural network operation into power data.
  • the input data includes input neuron data, input weight data
  • the output data includes output neuron data and output weight data
  • the power data includes power neuron data and power weight data.
  • the power conversion module here can perform power conversion not only on the neuron data but also on the weight data; after the weight data in the operation result is converted into power weight data, it can be directly sent to the data control unit to participate in subsequent operations.
  • the remaining modules, unit components, functional uses, and connection relationships of the computing device are similar to the previous embodiments.
  • the neural network computing device of this embodiment includes a storage module 2-4, a control module 2-3, an arithmetic module 2-1, an output module 2-5, and a power conversion module 2-2.
  • the storage module includes: a storage unit 2-41 for storing data and instructions;
  • the control module includes:
  • a data control unit 2-31 connected to the storage unit for data and instruction interaction between the storage unit and each cache unit;
  • An operation instruction buffer unit 2-32 connected to the data control unit for receiving an instruction sent by the data control unit;
  • the decoding unit 2-33 is connected to the instruction cache unit for reading an instruction from the instruction cache unit and decoding the instruction into each operation instruction;
  • the input neuron buffer unit 2-34 is connected to the data control unit for receiving the neuron data sent by the data control unit;
  • the weight buffer unit 2-35 is connected to the data control unit for receiving weight data transmitted from the data control unit.
  • the operation module includes: an operation unit 2-11, is connected to the control module, receives data and an operation instruction sent by the control module, and performs a neural network operation on the neuron data and the weight data received by the operation instruction according to the operation instruction;
  • the output module includes: an output neuron buffer unit 2-51 connected to the operation unit for receiving neuron data output by the operation unit; and transmitting the same to the data control unit. This can be used as input data for the next layer of neural network operations;
  • the power conversion module can include:
  • a first power conversion unit 2-21, connected to the output neuron buffer unit and the operation unit, configured to convert the neuron data output by the output neuron buffer unit into power neuron data and to convert the weight data output by the operation unit into power weight data;
  • the second power conversion unit 2-22 is connected to the storage module, and is configured to convert the neuron data and the weight data input to the storage module into power neuron data and power weight data respectively;
  • the computing device further includes: a third power conversion unit 2-23, connected to the operation unit, configured to convert the power neuron data and the power weight data into non-power neuron data and non-power weight data, respectively.
  • the power conversion module includes the first power conversion unit, the second power conversion unit, and the third power conversion unit as an example, but in fact, the power conversion module Any of the first power conversion unit, the second power conversion unit, and the third power conversion unit may be included, as in the embodiment illustrated in FIGS. 1B, 1F, and 1G.
  • the non-powered neuron data and the weight data are converted into power neuron data and power weight data by the second power conversion unit, and then input into the operation unit to perform an operation.
  • the third power conversion unit may optionally be provided to convert the power neuron data and the power weight data into non-power neuron data and non-power weight data; it may be disposed outside the operation module or inside the operation module. The non-power neuron data output after the operation can be converted into power neuron data by the first power conversion unit and then fed back to the data control unit to participate in subsequent operations, so as to speed up the operation; a closed loop can thereby be formed.
  • the neural network is a multi-layer neural network, and each layer of the neural network may be operated according to the operation method shown in FIG. 2B, wherein the first-layer input power weight data of the neural network is read in from the external address through the storage unit; if the weight data read from the external address is already power weight data, it is directly transferred to the storage unit, otherwise it is first converted into power weight data by the power conversion unit.
  • the single layer neural network operation method of this embodiment includes:
  • step S2-1 the instruction, the neuron data, and the power weight data are acquired.
  • the step S2-1 includes the following sub-steps:
  • the data control unit receives the instruction, the neuron data, and the power weight data sent by the storage unit;
  • the instruction cache unit, the input neuron buffer unit, and the weight buffer unit respectively receive the instruction, the neuron data, and the power weight data sent by the data control unit, and distribute the data to the decoding unit or the operation unit.
  • the power weight data indicates that the value of the weight data is represented by its power exponential value.
  • the power weight data includes a sign bit and a power bit; the sign bit represents the sign of the weight data with one or more bits, and the power bit represents the power bit data of the weight data with m bits, where m is a positive integer greater than 1.
  • the storage unit prestores a coding table, and provides an exponential value corresponding to each power bit data of the power weight data.
  • the coding table sets one or more power-bit data (ie, zero-powered bit data) to specify that the corresponding power weight data is zero. That is to say, when the power bit data of the power weight data is the zero power bit data in the code table, it indicates that the power weight data is 0.
  • the correspondence between the coding tables is similar to that in the foregoing embodiment, and details are not described herein again.
  • for example, the sign bit is 1 bit and the power bit data is 7 bits, i.e., m is 7. In the coding table, when the power bit data is 1111111, the corresponding power weight data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • when the power weight data sign bit is 0 and the power bit is 0001001, the represented value is 2^9, i.e., 512; when the power weight data sign bit is 1 and the power bit is 1111101, the represented value is -2^-3, i.e., -0.125.
  • power-only data retains only the power of the data, greatly reducing the storage space required to store the data.
  • the storage space required to store the weight data can be reduced.
  • in this example the power data is 8-bit data; it should be recognized that the data length is not fixed, and on different occasions different data lengths are adopted according to the data range of the weight data.
  • Step S2-2 performing neural network operations on the neuron data and the power weight data according to the operation instruction.
  • the step S2-2 includes the following sub-steps:
  • the decoding unit reads the instruction from the instruction cache unit, and decodes the instruction into each operation instruction;
  • the operation unit respectively receives the operation instruction, the power weight data, and the neuron data sent by the decoding unit, the input neuron buffer unit, and the weight buffer unit, and performs the neural network operation on the neuron data and the power weight data according to the operation instruction.
  • the multiplication of a neuron and a power weight is specifically performed as follows: the sign bit of the neuron data and the sign bit of the power weight data are XORed; if the correspondence relationship of the coding table is out of order, the coding table is searched to find the exponent value corresponding to the power bit of the power weight data; if the correspondence relationship of the coding table is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to find the exponent value corresponding to the power bit of the power weight data; if the correspondence relationship is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to find the exponent value corresponding to the power bit of the power weight data; the exponent value is then added to the power (exponent) bits of the neuron data, while the significant bits of the neuron data remain unchanged.
  • for example, the neuron data is 16-bit floating-point data with the sign bit being 0, the power (exponent) bits being 10101, and the significant bits being 0110100000, so the actual value it represents is 1.40625*2^6.
  • the power weight data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5. In the coding table, when the power bit data is 11111, the corresponding power weight data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • the power weight is 000110, and the actual value it represents is 64, that is, 2^6.
  • the sum of the power bits of the power weight and the power bits of the neuron is 11011, so the actual value of the result is 1.40625*2^12, which is the product of the neuron and the power weight.
  • the multiplication operation becomes an addition operation, and the amount of calculation required for the calculation is reduced.
  • as another example, the neuron data is 32-bit floating-point data with the sign bit being 1, the power (exponent) bits being 10000011, and the significant bits being 10010010000000000000000, so the actual value it represents is -1.5703125*2^4.
  • the power weight data sign bit is 1 bit, and the power bit data is 5 bits, that is, m is 5. In the coding table, when the power bit data is 11111, the corresponding power weight data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • the power weight is 111100, and the actual value it represents is -2^-4.
  • the sum of the power bits of the neuron and the power bits of the power weight is 01111111, so the actual value of the result is 1.5703125*2^0, which is the product of the neuron and the power weight.
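  • Where the coding table is out of order, the exponent must first be looked up before the addition. The sketch below (the table contents, zero code, and function name are illustrative assumptions) combines such a lookup with the exponent addition for a floating-point neuron multiplied by a power weight.

```python
import math

# An illustrative out-of-order coding table for m = 5 power bits.
# The all-ones pattern is the reserved zero code word.
CODING_TABLE = {0b00000: 0, 0b00001: 3, 0b00010: 4, 0b00011: 1}
ZERO_CODE = 0b11111

def multiply_neuron_by_power_weight(neuron, weight_sign, weight_power_bits):
    """Multiply a floating-point neuron by a power weight given as code bits."""
    if weight_power_bits == ZERO_CODE:
        return 0.0                                   # zero code word: weight is 0
    exponent = CODING_TABLE[weight_power_bits]       # table lookup for the exponent
    sign = (0 if neuron >= 0 else 1) ^ weight_sign   # XOR of the sign bits
    magnitude = math.ldexp(abs(neuron), exponent)    # mantissa kept, exponent added
    return -magnitude if sign else magnitude

print(multiply_neuron_by_power_weight(2.5, 0, 0b00010))  # 2.5 * 2**4 = 40.0
print(multiply_neuron_by_power_weight(2.5, 1, 0b11111))  # 0.0 (zero weight)
```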
  • the method further includes the step S2-3, and outputting the neuron data after the neural network operation as the input data of the next layer neural network operation.
  • the step S2-3 may include the following sub-steps:
  • the output neuron buffer unit receives the neuron data obtained by the neural network operation sent by the computing unit.
  • in step S2-32, the neuron data received by the output neuron buffer unit is transmitted to the data control unit; the neuron data obtained by the output neuron buffer unit can serve as the input neurons of the next layer of the neural network operation, and steps S2-1 to S2-3 are then repeated until the operation of the last layer of the neural network ends.
  • the power neuron data obtained by the power conversion unit can be used as the input power neuron of the next layer of the neural network operation, and then the steps S2-1 to S2-3 are repeated until the last layer operation of the neural network ends.
  • the power neuron data range that can be represented by the neural network operation device can be adjusted.
  • the neural network is a multi-layer neural network, and each layer of the neural network can be operated according to the operation method shown in FIG. 2F.
  • the first-layer input power weight data of the neural network can be read in from the external address through the storage unit; if the data read from the external address is already power weight data, it is directly transferred to the storage unit, otherwise it is first converted into power weight data by the power conversion unit.
  • the first-layer input power neuron data of the neural network can likewise be read in from the external address through the storage unit; if the data read from the external address is already power data, it is directly transferred to the storage unit, otherwise it is first converted into power neuron data by the power conversion unit.
  • the single layer neural network operation method in this embodiment includes:
  • Step S2-4 acquiring instructions, power neuron data, and power weight data.
  • the step S2-4 includes the following sub-steps:
  • S2-41: the instruction, the neuron data, and the weight data are input into the storage unit; the power neuron data and the power weight data are directly input into the storage unit, while the non-power neuron data and the non-power weight data are first converted into power neuron data and power weight data by the first power conversion unit and then input into the storage unit;
  • the data control unit receives the instruction sent by the storage unit, the power neuron data, and the power weight data;
  • S2-43 the instruction cache unit, the input neuron buffer unit, and the weight buffer unit respectively receive the instruction, the power neuron data, and the power weight data sent by the data control unit, and distribute the data to the decoding unit or the operation unit.
  • the power neuron data and the power weight data represent that the values of the neuron data and the weight data are represented by their power exponential values.
  • the power neuron data and the power weight data each include a sign bit and a power bit; the sign bit represents the sign of the neuron data or the weight data with one or more bits, and the power bit represents the power bit data of the neuron data or the weight data with m bits, where m is a positive integer greater than 1.
  • the storage unit prestores a coding table, which provides an exponent value corresponding to each power bit data of the power neuron data and the power weight data.
  • the coding table sets one or more power-bit data (ie, zero-powered bit data) to specify that the corresponding power neuron data and the power weight data are zero. That is to say, when the power bit data of the power neuron data and the power weight data is the zero-powered bit data in the coding table, it indicates that the power neuron data and the power weight data are 0.
  • for example, the sign bit is 1 bit and the power bit data is 7 bits, i.e., m is 7. In the coding table, when the power bit data is 1111111, the corresponding power neuron data and power weight data are 0; for other values, the power bit data corresponds to the respective two's complement code.
  • when the sign bit of the power neuron data or the power weight data is 0 and the power bit is 0001001, the represented value is 2^9, i.e., 512; when the sign bit is 1 and the power bit is 1111101, the represented value is -2^-3, i.e., -0.125. Compared with floating-point data, power data retains only the exponent of the data, which greatly reduces the storage space required to store the data.
  • the storage space required to store the neuron data and the weight data can be reduced.
  • in this example the power data is 8-bit data; it should be recognized that the data length is not fixed, and on different occasions different data lengths are adopted according to the data ranges of the neuron data and the weight data.
  • step S2-5 the neural network operation is performed on the power neuron data and the power weight data according to the operation instruction.
  • the step S2-5 includes the following sub-steps:
  • the decoding unit reads the instruction from the instruction cache unit and decodes the instruction into each operation instruction;
  • the operation unit respectively receives the operation instruction, the power neuron data, and the power weight data sent by the decoding unit, the input neuron buffer unit, and the weight buffer unit, and performs the neural network operation on the power neuron data and the power weight data according to the operation instruction.
  • the multiplication of a power neuron and a power weight is specifically performed as follows: the sign bit of the power neuron data and the sign bit of the power weight data are XORed; if the correspondence relationship of the coding table is out of order, the coding table is searched to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; if the correspondence relationship of the coding table is positively correlated, the minimum exponent value of the coding table is recorded and an addition is performed to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; if the correspondence relationship is negatively correlated, the maximum value of the coding table is recorded and a subtraction is performed to find the exponent values corresponding to the power bits of the power neuron data and the power weight data; the exponent value corresponding to the power neuron data and the exponent value corresponding to the power weight data are then added.
  • for example, the sign bits of the power neuron data and the power weight data are 1 bit, and the power bit data is 4 bits, that is, m is 4. In the coding table, when the power bit data is 1111, the corresponding power weight data is 0; for other values, the power bit data corresponds to the respective two's complement code.
  • the power neuron data is 00010, and the actual value it represents is 2^2. The power weight is 00110, and the actual value it represents is 64, that is, 2^6.
  • the product of the power neuron data and the power weight data is 01000, and the actual value it represents is 2^8.
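  • For two power operands the product reduces to an XOR of the sign bits and an addition of the exponent values, as the minimal sketch below (field names assumed) shows for the 2^2 × 2^6 = 2^8 example above.

```python
def multiply_power_operands(sign_a, exp_a, sign_b, exp_b):
    """Multiply two power-represented operands (-1)**s * 2**e.

    The result is again a power operand: signs are XORed, exponents added.
    """
    return (sign_a ^ sign_b, exp_a + exp_b)

sign, exp = multiply_power_operands(0, 2, 0, 6)   # 2**2 * 2**6
print(sign, exp, 2.0 ** exp)                      # 0 8 256.0 == 2**8
```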
  • the method of this embodiment may further include step S2-6: outputting the neuron data after the neural network operation as the input data of the next layer of the neural network operation.
  • the step S2-6 includes the following sub-steps:
  • the output neuron buffer unit receives the neuron data obtained by the neural network operation sent by the calculating unit.
  • in step S2-62, the neuron data received by the output neuron buffer unit is transmitted to the data control unit; the neuron data obtained by the output neuron buffer unit can serve as the input neurons of the next layer of the neural network operation, and steps S2-4 to S2-6 are then repeated until the operation of the last layer of the neural network ends.
  • since the power data are more compact, the bandwidth required for transmitting them to the data control unit is greatly reduced compared with the bandwidth required for floating-point data, thereby further reducing the overhead of neural network storage resources and computing resources and increasing the operation speed of the neural network.
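  • As a rough, hedged illustration of the saving (the 8-bit power width and 32-bit float width are taken from the examples above; the counts are arbitrary), the transfer size per value shrinks by a factor of four:

```python
float_bits = 32          # 32-bit floating-point neuron data
power_bits = 8           # 1 sign bit + 7 power bits, as in the examples above
neurons = 1_000_000      # arbitrary number of values to transfer

print(neurons * float_bits // 8, "bytes as floating-point data")  # 4000000
print(neurons * power_bits // 8, "bytes as power data")           # 1000000
print(float_bits / power_bits, "x bandwidth reduction")           # 4.0
```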
  • All of the elements of the disclosed embodiments may be hardware structures including, but not limited to, physical devices including, but not limited to, transistors, memristors, DNA computers.
  • An embodiment of the present disclosure provides an arithmetic device, including:
  • An operation control module 3-2 configured to determine the block information
  • the operation module 3-3 is configured to perform block, transpose, and merge operations on the operation matrix according to the block information to obtain a transposed matrix of the operation matrix.
  • the block information may include at least one of block size information, block mode information, and block merge information.
  • the block size information indicates size information of each block matrix obtained after the operation matrix is divided into blocks.
  • the block mode information indicates the manner in which the operation matrix is partitioned into blocks.
  • the block merge information indicates the manner in which the block matrices, after each has been transposed, are re-merged to obtain the transposed matrix of the operation matrix.
  • the operation device of the present disclosure can partition the operation matrix into blocks, obtain the transposed matrices of the plurality of block matrices by transposing the plurality of block matrices respectively, and finally merge the transposed matrices of the plurality of block matrices to obtain the transposed matrix of the operation matrix, so that the transposition of a matrix of arbitrary size can be completed within constant time complexity using a single instruction.
  • the present disclosure also makes the use of the matrix transposition operation simpler and more efficient while reducing the operation time complexity.
  • the computing device further includes:
  • An address storage module 3-1 configured to store address information of the operation matrix
  • the data storage module 3-4 is configured to store original matrix data, the original matrix data includes the operation matrix, and store the calculated transposed matrix;
  • the operation control module is configured to extract address information of the operation matrix from the address storage module, and obtain block information according to address information of the operation matrix; the operation module is configured to obtain address information of the operation matrix from the operation control module and Blocking information, extracting the operation matrix from the data storage module according to the address information of the operation matrix, and performing block, transpose and merge operations on the operation matrix according to the block information, obtaining the transposed matrix of the operation matrix, and turning the operation matrix The matrix is fed back to the data storage module.
  • the foregoing operation module includes a matrix blocking unit, a matrix operation unit, and a matrix merging unit, wherein:
  • the matrix blocking unit 3-31 is used to obtain the address information and the block information of the operation matrix from the operation control module, extract the operation matrix from the data storage module according to the address information of the operation matrix, and partition the operation matrix into n block matrices according to the block information;
  • the matrix operation unit 3-32 is configured to acquire n block matrices, and perform transposition operations on the n block matrices respectively to obtain a transposed matrix of n block matrices;
  • the matrix merging unit 3-33 is configured to acquire and merge the transposed matrices of the n block matrices to obtain a transposed matrix of the operation matrix, where n is a natural number.
  • the matrix blocking unit of the operation module extracts the operation matrix X from the data storage module and, according to the block information, partitions the operation matrix X into four block matrices, which are output to the matrix operation unit;
  • the matrix operation unit acquires the four block matrices from the matrix blocking unit and performs a transposition operation on each of them, obtaining the transposed matrices X1^T, X2^T, X3^T, and X4^T of the four block matrices, which are output to the matrix merging unit;
  • the matrix merging unit acquires the transposed matrices of the four block matrices from the matrix operation unit and merges them to obtain the transposed matrix X^T of the operation matrix; the transposed matrix X^T may further be output to the data storage module.
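  • A minimal software sketch of this block–transpose–merge flow (pure Python with list-of-lists matrices; the function names and the block size are illustrative, not the hardware implementation) partitions the operation matrix into block matrices no larger than a maximum size, transposes each block, and places each transposed block at the mirrored block position to assemble the transposed matrix.

```python
def transpose_block(block):
    """Transpose a single block matrix given as a list of rows."""
    return [list(col) for col in zip(*block)]

def blocked_transpose(matrix, max_block=2):
    """Transpose `matrix` by blocking, per-block transposition, and merging."""
    rows, cols = len(matrix), len(matrix[0])
    # The result has shape (cols, rows); fill it block by block.
    result = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, max_block):
        for c0 in range(0, cols, max_block):
            block = [row[c0:c0 + max_block] for row in matrix[r0:r0 + max_block]]
            t_block = transpose_block(block)
            # Block (r0, c0) of X becomes block (c0, r0) of X^T.
            for i, t_row in enumerate(t_block):
                for j, value in enumerate(t_row):
                    result[c0 + i][r0 + j] = value
    return result

X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
print(blocked_transpose(X))
# [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```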
  • the foregoing operation module further includes a buffer unit 3-34 for buffering n block matrices for acquisition by the matrix operation unit.
  • the matrix merging unit may further include a memory for temporarily storing the transposed matrix of the obtained block matrix. After the matrix operation unit completes all the operations of the block matrix, the matrix merging unit may The transposed matrix of all the block matrices is obtained, and then the transposed matrices of the n block matrices are combined to obtain the transposed matrix, and the output result is written back to the data storage module.
  • the matrix blocking unit, the matrix operation unit, and the matrix merging unit can be implemented in the form of hardware or in the form of software program modules.
  • the matrix blocking unit and the matrix merging unit may include one or more control elements, and the matrix operation unit may include one or more control elements and computing elements.
  • the above operation control module includes an instruction processing unit 3-22, an instruction buffer unit 3-21, and a matrix determination unit 3-23, wherein:
  • An instruction cache unit configured to store a matrix operation instruction to be executed
  • An instruction processing unit is configured to obtain a matrix operation instruction from the instruction cache unit, decode the matrix operation instruction, and extract address information of the operation matrix from the address storage module according to the decoded matrix operation instruction;
  • the matrix determining unit is configured to determine, according to the address information of the operation matrix, whether the block needs to be performed, and obtain the block information according to the judgment result.
  • the operation control module further includes a dependency processing unit 3-24, configured to determine whether the decoded matrix operation instruction and the address information of the operation matrix conflict with a previous operation; if there is a conflict, the decoded matrix operation instruction and the address information of the operation matrix are temporarily stored; if there is no conflict, the decoded matrix operation instruction and the address information of the operation matrix are transmitted to the matrix determination unit.
  • the operation control module further includes an instruction queue memory 3-25 for buffering address information of the decoded matrix operation instruction and the operation matrix in conflict, and when the conflict is eliminated, The buffered decoded matrix operation instruction and the address information of the operation matrix are transmitted to the matrix determination unit.
  • the front and rear instructions may access the same block of storage space.
  • in order to ensure the correctness of the instruction execution result, if the current instruction is detected to have a dependency relationship with the data of a previous instruction, the instruction must wait in the instruction queue memory until the dependency is eliminated.
  • the instruction processing unit includes an instruction fetch unit 3-221 and a decoding unit 3-222, where:
  • the fetching unit is configured to obtain a matrix operation instruction from the instruction cache unit, and transmit the matrix operation instruction to the decoding unit;
  • a decoding unit, configured to decode the matrix operation instruction, extract the address information of the operation matrix from the address storage module according to the decoded matrix operation instruction, and transmit the decoded matrix operation instruction and the extracted address information of the operation matrix to the dependency processing unit.
  • the computing device further includes an input and output module, configured to input the operation matrix to the data storage module, and further, obtain the calculated transposed matrix from the data storage module, and output the calculated operation matrix. Transpose matrix.
  • the address information of the operation matrix is the start address information and the matrix size information of the matrix.
  • the address information of the operational matrix is the storage address of the matrix in the data storage module.
  • the address storage module is a scalar register file or a general-purpose memory unit; the data storage module is a scratch pad memory or a general-purpose memory unit.
  • the address storage module may be a scalar register file that provides the scalar registers required for the operation.
  • the scalar registers not only store matrix addresses, but also store scalar data.
  • the scalar data in the scalar register can be used to record the number of matrix blocks when a block operation is performed on transposing a large scale matrix.
  • the data storage module may be a scratch pad memory capable of supporting matrix data of different sizes.
  • the matrix determining unit is configured to determine the size of the matrix. If the specified maximum size M is exceeded, the matrix is subjected to a blocking operation, and the matrix determining unit analyzes the obtained block information according to the determination result.
  • an instruction cache unit is configured to store matrix operation instructions to be executed.
  • the instruction is also cached in the instruction cache unit during execution.
  • the instruction cache unit can be a reordering cache.
  • the matrix operation instruction is a matrix transposition operation instruction including an operation code and an operation field, wherein the operation code is used to indicate the function of the matrix transposition operation instruction, and the operation control module confirms that a matrix transposition operation is to be performed by recognizing the operation code;
  • the operation field is used to indicate the data information of the matrix transposition operation instruction, wherein the data information may be an immediate number or a register number, for example, when a matrix is to be acquired, according to the register number, the corresponding The matrix start address and matrix size are obtained in the register, and the matrix stored in the corresponding address is obtained in the data storage module according to the matrix start address and the matrix size.
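  • As an illustrative (non-normative) software model of such an instruction, the sketch below encodes a matrix-transpose instruction as an opcode plus an operation field holding a register number, and resolves the matrix start address and size from a scalar register file before the data is fetched; the opcode mnemonic, field layout, and register contents are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class MatrixTransposeInstr:
    opcode: str          # identifies the instruction as a matrix transpose
    reg_no: int          # operation field: scalar register holding address info

# Hypothetical scalar register file: register -> (start_address, rows, cols)
SCALAR_REGISTERS = {3: (0x1000, 128, 64)}

def resolve_operands(instr):
    """Fetch the matrix start address and size for a transpose instruction."""
    assert instr.opcode == "MTRANS"
    start_addr, rows, cols = SCALAR_REGISTERS[instr.reg_no]
    return start_addr, (rows, cols)

print(resolve_operands(MatrixTransposeInstr("MTRANS", 3)))
# (4096, (128, 64))
```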
  • the present disclosure uses a new arithmetic structure to implement a transposition operation on a matrix simply and efficiently, which reduces the time complexity of this operation.
  • the present disclosure also discloses an operation method, including the following steps:
  • Step 1 The operation control module extracts address information of the operation matrix from the address storage module.
  • Step 2 The operation control module obtains the block information according to the address information of the operation matrix, and transmits the address information and the block information of the operation matrix to the operation module;
  • Step 3 The operation module extracts an operation matrix from the data storage module according to the address information of the operation matrix; and divides the operation matrix into n block matrices according to the block information;
  • Step 4 The arithmetic module performs transposition operations on the n block matrices respectively to obtain a transposed matrix of n block matrices;
  • Step 5 The operation module combines the transposed matrices of the n block matrices to obtain a transposed matrix of the operation matrix and feeds back to the data storage module;
  • n is a natural number.
  • the embodiment provides an arithmetic device, including an address storage module, an operation control module, an operation module, a data storage module, and an input and output module 3-5, wherein
  • the operation control module includes an instruction cache unit, an instruction processing unit, a dependency processing unit, an instruction queue memory, and a matrix determination unit, wherein the instruction processing unit further includes an instruction fetch unit and a decoding unit;
  • the operation module includes a matrix block unit, a matrix buffer unit, a matrix operation unit, and a matrix merging unit;
  • the address storage module is a scalar register file
  • the data storage module is a high speed temporary storage memory; and the input/output module is an IO direct memory access module.
  • a fetch unit, which is responsible for fetching the next operation instruction to be executed from the instruction cache unit and transmitting the operation instruction to the decoding unit;
  • a decoding unit, which is responsible for decoding the operation instruction, sending the decoded operation instruction to the scalar register file, obtaining the address information of the operation matrix fed back by the scalar register file, and transmitting the decoded operation instruction and the obtained address information of the operation matrix to the dependency processing unit;
  • a dependency processing unit that handles the storage dependencies that an arithmetic instruction may have with the previous instruction.
  • the matrix operation instructions access the scratchpad memory, and preceding and succeeding instructions may access the same block of memory.
  • In order to ensure the correctness of the execution result of an instruction, if the current operation instruction is detected to have a dependency on the data of a previous operation instruction, the operation instruction must be buffered in the instruction queue memory until the dependency is eliminated; if the current operation instruction has no dependency on previous operation instructions, the dependency processing unit directly transfers the address information of the operation matrix and the decoded operation instruction to the matrix determination unit.
  • the instruction queue memory: considering that different operation instructions may have dependencies on the corresponding/specified scalar registers, it is used for buffering the conflicting decoded operation instruction and the address information of the corresponding operation matrix, and for transmitting the decoded operation instruction and the address information of the corresponding operation matrix to the matrix determination unit once the dependency is satisfied;
  • the matrix determination unit is configured to determine the size of the matrix according to the address information of the operation matrix; if the matrix exceeds the specified maximum size M, a block operation is applied to the matrix, and the matrix determination unit derives the block information from this judgment and transmits the address information of the operation matrix and the obtained block information to the matrix block unit.
  • a matrix block unit which is responsible for extracting an operation matrix to be transposed from the cache according to the address information of the operation matrix, and segmenting the operation matrix according to the block information to obtain n block matrices .
  • a matrix buffer unit, which is configured to buffer the n block matrices after blocking and transmit them sequentially to the matrix operation unit for the transposition operation;
  • the matrix operation unit is responsible for sequentially extracting the block matrix from the matrix buffer unit for transposition operation, and transmitting the transposed block matrix to the matrix merging unit;
  • the matrix merging unit is responsible for receiving and temporarily buffering the transposed block matrices; after all the block matrices have been transposed, the transposed matrices of the n block matrices are combined to obtain the transposed matrix of the operation matrix.
  • the scalar register file provides a scalar register required by the device during the operation to provide address information of the operation matrix for the operation;
  • the scratchpad is a temporary storage device dedicated to matrix data and can support matrix data of different sizes.
  • an IO direct memory access module, which is used to directly access the scratchpad memory and is responsible for reading data from or writing data to the scratchpad memory.
  • the embodiment provides an operation method for performing a transposition operation of a large-scale matrix, which specifically includes the following steps:
  • Step 1 The operation control module extracts the address information of the operation matrix from the address storage module, and specifically includes the following steps:
  • Step 1-1 the fetching unit extracts the operation instruction, and sends the operation instruction to the decoding unit;
  • Step 1-2 The decoding unit decodes the operation instruction, obtains the address information of the operation matrix from the address storage module according to the decoded operation instruction, and sends the decoded operation instruction and the address information of the operation matrix to the dependency processing unit;
  • Step 1-3 The dependency processing unit analyzes whether the decoded operation instruction has a data dependency on preceding instructions that have not yet finished executing; specifically, the dependency processing unit may check, according to the address of the register to be read by the operation instruction, whether that register is still waiting to be written; if so, a dependency exists and the operation instruction can only be executed after the data has been written back.
  • If a dependency exists, the decoded operation instruction and the address information of the corresponding operation matrix need to wait in the instruction queue memory until they no longer have a data dependency on the preceding unfinished instructions;
  • Step 2 The operation control module obtains the block information according to the address information of the operation matrix;
  • specifically, the instruction queue memory transmits the decoded operation instruction and the address information of the corresponding operation matrix to the matrix determination unit, which determines whether the matrix needs to be divided into blocks; the matrix determination unit derives the block information from this judgment and transmits the block information and the address information of the operation matrix to the matrix block unit;
  • Step 3 The operation module extracts an operation matrix from the data storage module according to the address information of the operation matrix; and divides the operation matrix into n block matrices according to the block information;
  • the matrix block unit extracts the required operation matrix from the data storage module according to the incoming address information of the operation matrix, divides the operation matrix into n block matrices according to the incoming block information, and, after the blocking is completed, passes each block matrix sequentially to the matrix buffer unit;
  • Step 4 The arithmetic module respectively performs transposition operations on the n block matrices to obtain a transposed matrix of n block matrices;
  • the matrix operation unit sequentially extracts the block matrix from the matrix buffer unit, and performs a transposition operation on each extracted block matrix, and then passes the obtained transposed matrix of each block matrix to the matrix merging unit.
  • Step 5 The operation module combines the transposed matrices of the n block matrices to obtain a transposed matrix of the operation matrix, and feeds the transposed matrix to the data storage module, specifically including the following steps:
  • Step 5-1 The matrix merging unit receives the transposed matrix of each block matrix; once the number of received transposed block matrices reaches the total number of blocks, a matrix merging operation is performed on all the blocks to obtain the transposed matrix of the operation matrix, and the transposed matrix is fed back to a specified address of the data storage module;
  • Step 5-2 The input/output module directly accesses the data storage module, and reads the transposed matrix of the operation matrix obtained by the operation from the data storage module.
  • the vector mentioned in the present disclosure may be a 0-dimensional vector, a 1-dimensional vector, a 2-dimensional vector or a multi-dimensional vector.
  • the 0-dimensional vector can also be called a scalar
  • the 2-dimensional vector can also be called a matrix.
  • One embodiment of the present disclosure proposes a data screening apparatus, see FIG. 4A, comprising: a storage unit 4-3 for storing data and instructions, wherein the data includes data to be screened and location information data.
  • the register unit 4-2 is configured to store the data address in the storage unit.
  • the data filtering module 4-1 includes a data filtering unit 4-11.
  • the data filtering module acquires a data address from the register unit according to the instruction, obtains the corresponding data from the storage unit according to the data address, and performs a filtering operation on the acquired data to obtain the data screening results.
  • the function of the data screening unit is shown in FIG. 4B.
  • the input data is the data to be filtered and the location information data.
  • the output data may include only the filtered data, and may also include related information of the filtered data, for example the vector length, the array size, the space occupied, and so on.
  • the data screening apparatus of this embodiment specifically includes:
  • the storage unit 4-3 is configured to store data to be filtered, location information data, and instructions.
  • the register unit 4-2 is configured to store the data address in the storage unit.
  • the data screening module 4-1 includes:
  • the instruction cache unit 4-12 is configured to store instructions.
  • the control unit 4-13 reads the instruction from the instruction cache unit and decodes it into a specific operation microinstruction.
  • the I/O unit 4-16 transfers the instructions in the storage unit to the instruction cache unit, carries the data in the storage unit to the input data buffer unit and the output buffer unit, and can also carry the output data from the output buffer unit to the storage unit.
  • the input data buffer unit 4-14 stores data carried by the I/O unit, including data to be filtered and location information data.
  • the data filtering unit 4-11 is configured to receive the microinstruction sent by the control unit, and obtain a data address from the register unit, and use the data to be filtered and the location information data sent from the input data buffer unit as input data to filter the input data. After the operation is completed, the filtered data is transmitted to the output data buffer unit.
  • the output data buffer unit 4-15 stores the output data, and the output data may only include the filtered data, and may also include related information of the filtered data, such as vector length, array size, occupied space, and the like.
  • the data screening apparatus of this embodiment is applicable to a plurality of screening objects.
  • the data to be filtered may be a vector or a high-dimensional array
  • the location information data may be a binary code, a vector or a high-dimensional array, etc., each component of which is 0 or 1.
  • the components of the data to be filtered and the components of the location information data may be in one-to-one correspondence. It should be understood by those skilled in the art that having each component of the location information data be 1 or 0 is only an exemplary representation of the location information, and the representation of the location information is not limited to this representation.
  • when 0 or 1 is used for each component in the location information data, the data filtering unit performs the filtering operation on the input data as follows: the data filtering unit scans each component of the location information data; if the component is 0, it deletes the data component to be filtered corresponding to that component, and if the component is 1, it retains the data component to be filtered corresponding to that component; or, alternatively, if the component of the location information data is 1, the corresponding data component to be filtered is deleted, and if the component is 0, the corresponding data component to be filtered is retained.
  • when the scan is completed, the screening is completed, and the filtered data is obtained and output.
  • the data filtering unit may further configure a screening operation corresponding to the representation manner.
  • for example, suppose the data to be filtered is the vector (1 0 101 34 243), the components to be retained are those whose value is less than 100, and the input location information data is also a vector, namely (1 1 0 1 0).
  • the filtered data can still maintain the vector structure, and the vector length of the filtered data can be output at the same time.
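As a minimal sketch of the screening just described (the function and variable names are illustrative, not the device's units): the location information vector acts as a 0/1 mask over the data to be filtered, and the output is the retained components plus their count as the related information.

```python
def screen(data, location_info):
    """Keep data[i] where location_info[i] == 1; also return the length
    of the filtered result as the 'related information' mentioned above."""
    filtered = [d for d, keep in zip(data, location_info) if keep == 1]
    return filtered, len(filtered)

data = [1, 0, 101, 34, 243]   # data to be filtered
mask = [1, 1, 0, 1, 0]        # location information (1 = component < 100)
print(screen(data, mask))     # ([1, 0, 34], 3)
```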
  • the location information vector may be externally input or internally generated.
  • the apparatus of the present disclosure may further include a location information generating module, where the location information generating module may be configured to generate a location information vector, where the location information generating module is connected to the data filtering unit.
  • the location information generating module may generate a location information vector by using a vector operation, where the vector operation may be a vector comparison operation, that is, by comparing the components of the vector to be filtered with the preset values one by one.
  • the location information generating module may also select another vector operation to generate the location information vector according to a preset condition. In this example, if a component of the location information data is 1, the corresponding data component to be filtered is retained, and if the component is 0, the corresponding data component to be filtered is deleted.
  • the data filtering module may further include a structural deformation unit 4-17, which can deform the storage structure of the input data in the input data buffer unit and of the output data in the output data buffer unit, for example expanding a high-dimensional array into a vector, changing a vector into a high-dimensional array, and so on.
  • the method of expanding the high-dimensional data may be row-major or column-major, and other expansion modes may be selected according to the specific situation.
  • for example, suppose the data to be filtered is a four-dimensional array and the even values need to be retained; the input location information array is constructed accordingly, the filtered data has a vector structure, and no related information is output. In this example, if a component of the location information data is 1, the corresponding data component to be filtered is retained, and if the component is 0, the corresponding data component to be filtered is deleted.
  • the data screening unit reads the data of the input data buffer unit, scans the (1,1)-th component of the location information array, finds that its value is 0, and deletes the value of the (1,1)-th component of the array to be filtered;
  • the structural deformation unit changes the retained values into a vector, that is, the filtered data is the vector (4 22), which is stored by the output data buffer unit.
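A minimal sketch of the structural deformation plus screening flow described above; the 2x2 array values below are made-up stand-ins (the source's own four-dimensional array values are not reproduced in the text), and row-major flattening is one of the expansion orders mentioned.

```python
import numpy as np

def screen_array(array, location_info):
    """Flatten a high-dimensional array row-major (structural deformation),
    then keep the elements whose location-information component is 1."""
    flat = array.flatten(order="C")              # row-major expansion
    mask = np.asarray(location_info).flatten(order="C")
    return flat[mask == 1]

arr = np.array([[1, 4], [7, 22]])                # hypothetical data to be filtered
mask = (arr % 2 == 0).astype(int)                # 1 where the value is even
print(screen_array(arr, mask))                   # [ 4 22]
```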
  • the data screening module may further include a computing unit 4-18. Therefore, the device of the present disclosure can perform data screening and processing simultaneously, that is, a data screening and processing device can be obtained.
  • the specific structure of the computing unit is the same as the foregoing embodiment, and details are not described herein again.
  • the present disclosure provides a method for data screening using the data screening device, including:
  • the data screening module obtains a data address from the register unit, and obtains the corresponding data from the storage unit according to the data address;
  • the obtained data is subjected to a screening operation to obtain a data screening result.
  • the step of the data screening module acquiring the data address in the register unit comprises: the data screening unit obtaining an address of the data to be filtered and an address of the location information data from the register unit.
  • the step of obtaining corresponding data in the storage unit according to the data address comprises the following sub-steps:
  • the I/O unit transfers the data to be filtered and the location information data in the storage unit to the input data buffer unit;
  • the input data buffer unit passes the data to be filtered and the location information data to the data filtering unit.
  • optionally, between the sub-step in which the I/O unit passes the data to be filtered and the location information data from the storage unit to the input data buffer unit and the sub-step in which the input data buffer unit passes the data to be filtered and the location information data to the data filtering unit, there is a further sub-step of determining whether the storage structure needs to be deformed.
  • if so, the input data buffer unit transmits the data to be filtered to the structural deformation unit, the structural deformation unit performs the storage structure deformation and returns the deformed data to be filtered to the input data buffer unit, and then the sub-step in which the input data buffer unit passes the data to be filtered and the location information data to the data filtering unit is executed; if not, the sub-step in which the input data buffer unit passes the data to be filtered and the location information data to the data filtering unit is executed directly.
  • the step of performing a filtering operation on the acquired data, and obtaining the data screening result includes: the data filtering unit performs a filtering operation on the screening data according to the location information data, and delivers the output data to the output data buffer unit.
  • the steps of the data screening method are specifically as follows:
  • Step S4-1 the control unit reads a data filtering instruction from the instruction buffer unit, and decodes the data filtering instruction into a specific operation micro instruction, and transmits the data to the data filtering unit;
  • Step S4-2 the data screening unit acquires an address of the data to be filtered and the location information data from the register unit;
  • Step S4-3 the control unit reads an I/O instruction from the instruction cache unit, decodes it into a specific operation micro instruction, and transmits it to the I/O unit;
  • Step S4-4 the I/O unit transfers the data to be filtered and the location information data in the storage unit to the input data buffer unit;
  • if the storage structure needs to be deformed, step S4-5 is performed; if not, step S4-6 is directly executed.
  • Step S4-5 the input data buffer unit transmits the data to the structural deformation unit, and performs corresponding storage structure deformation, and then transfers the deformed data back to the input data buffer unit, and then proceeds to step S4-6;
  • Step S4-6 the input data buffer unit transfers the data to the data filtering unit, and the data filtering unit performs a filtering operation on the screening data according to the location information data;
  • Step S4-7 the output data is delivered to the output data buffer unit, wherein the output data may only include the filtered data, or may also include related information of the filtered data, such as vector length, array size, occupied space, and the like.
  • An embodiment of the present disclosure provides a neural network processor, including a memory, a scratchpad memory, and a heterogeneous kernel; the memory is used to store the data and instructions of a neural network operation; the scratchpad memory is connected to the memory through a memory bus; the heterogeneous kernel is connected to the scratchpad memory through a scratchpad memory bus, reads the data and instructions of the neural network operation through the scratchpad memory to complete the neural network operation, sends the operation result back to the scratchpad memory, and controls the scratchpad memory to write the operation result back to the memory.
  • the heterogeneous kernel refers to a processor including at least two different types of cores, that is, cores of two different structures.
  • the heterogeneous kernel includes: a plurality of operational cores, of at least two different types, for performing neural network operations or neural network layer operations; and one or more logic control cores, which decide, according to the data of the neural network operation, whether a neural network operation or neural network layer operation is to be performed by a dedicated kernel and/or a general-purpose kernel.
  • the plurality of operational cores includes m general-purpose kernels and n dedicated kernels; the dedicated kernels are dedicated to performing specified neural network/neural network layer operations, and the general-purpose kernels are used for performing arbitrary neural network/neural network layer operations.
  • the general-purpose kernel may be a CPU
  • the dedicated core may be an NPU.
  • the scratchpad memory comprises a shared scratchpad memory and/or a non-shared scratchpad memory; the shared scratchpad memory is correspondingly connected to at least two cores of the heterogeneous kernel through a scratchpad memory bus, and the non-shared scratchpad memory is correspondingly connected to one core of the heterogeneous kernel through a scratchpad memory bus.
  • the scratchpad memory may include only one or more shared scratchpad memories, each of which is connected to multiple cores in the heterogeneous kernel (logic control cores, dedicated cores, or general-purpose cores).
  • the scratchpad memory may also include only one or more non-shared scratchpad memories, each of which is connected to one core of the heterogeneous kernel (a logic control core, a dedicated core, or a general-purpose core).
  • the scratchpad memory may also include one or more shared scratchpad memories and one or more non-shared scratchpad memories, wherein each shared scratchpad memory is connected to multiple cores in the heterogeneous kernel (logic control cores, dedicated cores, or general-purpose cores), and each non-shared scratchpad memory is connected to one core of the heterogeneous kernel (a logic control core, a dedicated core, or a general-purpose core).
  • the logic control core is connected to the scratchpad memory through a scratchpad memory bus, reads the data of the neural network operation through the scratchpad memory, and, according to the type and parameters of the neural network model contained in the data of the neural network operation, determines which dedicated kernel and/or general-purpose kernel will act as the target kernel to perform the neural network operation and/or neural network layer operation.
  • optionally, a path may be added between the cores, so that the logic control core can send a signal to the target core directly through a control bus or via the scratchpad memory, in order to control the target kernel to perform the neural network operation and/or neural network layer operation.
  • One embodiment of the present disclosure proposes a heterogeneous multi-core neural network processor, see FIG. 5A, including: a memory 11, a non-shared scratch pad memory 12, and a heterogeneous kernel 13.
  • the memory 11 is configured to store data and instructions of the neural network operation, and the data includes offsets, weights, input data, output data, and types and parameters of the neural network model.
  • the output data may not be stored in the memory;
  • the instructions include various instructions corresponding to the neural network operation, such as a CONFIG instruction, a COMPUTE instruction, an IO instruction, a NOP instruction, a JUMP instruction, a MOVE instruction, and the like.
  • the data and instructions stored in the memory 11 can be transferred to the heterogeneous kernel 13 through the non-shared scratch pad memory 12.
  • the non-shared scratchpad memory 12 includes a plurality of scratchpad memories 121. Each scratchpad memory 121 is connected to the memory 11 through a memory bus and to the heterogeneous kernel 13 through a scratchpad memory bus, so as to implement data exchange between the heterogeneous kernel 13 and the non-shared scratchpad memory 12 and between the non-shared scratchpad memory 12 and the memory 11. When the neural network operation data or instructions required by the heterogeneous kernel 13 are not stored in the non-shared scratchpad memory 12, the non-shared scratchpad memory 12 first reads the required data or instructions from the memory 11 through the memory bus and then feeds them into the heterogeneous kernel 13 through the scratchpad memory bus.
  • the heterogeneous kernel 13 includes a logic control core 131, a general purpose core 132, and a plurality of dedicated cores 133.
  • the logic control core 131, the general-purpose core 132, and each dedicated core 133 are each correspondingly connected to one scratchpad memory 121 through the scratchpad memory bus.
  • the heterogeneous kernel 13 is configured to read the instructions and data of the neural network operation from the non-shared scratchpad memory 12, complete the neural network operation, send the operation result back to the non-shared scratchpad memory 12, and control the non-shared scratchpad memory 12 to write the operation result back to the memory 11.
  • the logic control core 131 reads the data and instructions of the neural network operation from the non-shared scratchpad memory 12 and determines, according to the type and parameters of the neural network model in the data, whether there is a dedicated core 133 that supports the neural network operation and can complete the operation scale of the neural network.
  • if such a dedicated kernel 133 exists, the neural network operation is handed over to the corresponding dedicated kernel 133; if not, the general-purpose core 132 performs the neural network operation.
  • optionally, a kernel information table may be used to find a suitable dedicated kernel; the kernel numbers in the table can be scanned once when the network processor is initialized, so that dynamically configurable heterogeneous kernels can be supported (that is, the type and number of dedicated cores in the heterogeneous kernel can be changed at any time, and the kernel information table is scanned and updated after the change); alternatively, dynamic configuration of the heterogeneous kernel may not be supported, in which case the kernel numbers in the table only need to be fixed and no repeated scanning and updating is required; alternatively, the numbers of each type of dedicated kernel may always be kept continuous.
  • optionally, a decoder can be set in the logic control kernel to determine the type of the network layer according to the instruction and to determine whether it is an instruction for the general-purpose core or an instruction for a dedicated kernel; parameters, data addresses, and the like can also be parsed from the instruction; optionally, the data can also include a data header containing the number and size of each network layer as well as the addresses of the corresponding calculation data and instructions, and a special parser (implemented in software or hardware) parses this information; optionally, the parsed information is stored in a specified area.
  • optionally, a content addressable memory (CAM) can be set in the logic control kernel, and its content can be made configurable, which requires the logic control kernel to provide some instructions to configure/write this CAM.
  • the content in the CAM includes the network layer number, the maximum size that each dimension can support, the address of the dedicated kernel information table supporting the layer, and the address of the general kernel information table.
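The CAM content described above can be pictured as a small lookup structure; the following sketch uses hypothetical field names (layer_type, max_dims, and the two table addresses) purely to illustrate the layout, not an actual register map of the device.

```python
from dataclasses import dataclass

@dataclass
class CamEntry:
    layer_type: int             # network layer number / type identifier
    max_dims: tuple             # maximum size supported in each dimension
    dedicated_table_addr: int   # address of the dedicated kernel information table for this layer
    general_table_addr: int     # address of the general kernel information table

# a hypothetical configurable CAM with two entries
cam = {
    "convolution": CamEntry(1, (256, 256, 64), 0x1000, 0x2000),
    "fully_connected": CamEntry(2, (4096, 4096), 0x1100, 0x2000),
}
```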
  • Each dedicated core 133 can independently perform a neural network operation, such as a specified operation like a spiking neural network (SNN) operation, write the operation result back to its correspondingly connected scratchpad memory 121, and control
  • that scratchpad memory 121 to write the operation result back to the memory 11.
  • the general-purpose kernel 132 can independently perform neural network operations that exceed the operation scale supported by the dedicated cores or that are not supported by any of the dedicated kernels 133, write the operation result back to its correspondingly connected scratchpad memory 121, and control
  • that scratchpad memory 121 to write the operation result back to the memory 11.
  • One embodiment of the present disclosure proposes a heterogeneous multi-core neural network processor, see FIG. 5B, including: memory 21, shared scratchpad memory 22, and heterogeneous kernel 23.
  • the memory 21 is configured to store data and instructions of the neural network operation, and the data includes offsets, weights, input data, output data, and types and parameters of the neural network model, and the instructions include various instructions corresponding to the neural network operations. Data and instructions stored in the memory are transferred to the heterogeneous kernel 23 through the shared scratchpad memory 22.
  • the shared scratchpad memory 22 is connected to the memory 21 through a memory bus and to the heterogeneous kernel 23 through a shared scratchpad memory bus, so as to realize data exchange between the heterogeneous kernel 23 and the shared scratchpad memory 22 and between the shared scratchpad memory 22 and the memory 21.
  • when the neural network operation data or instructions required by the heterogeneous kernel 23 are not stored in the shared scratchpad memory 22, the shared scratchpad memory 22 first reads the required data or instructions from the memory 21 through the memory bus and then feeds them into the heterogeneous kernel 23 through the scratchpad memory bus.
  • the heterogeneous kernel 23 includes a logic control core 231, a plurality of general cores 232, and a plurality of dedicated cores 233.
  • the logic control core 231, the general-purpose cores 232, and the dedicated cores 233 are all connected to the shared scratchpad memory 22 through the scratchpad memory bus.
  • the heterogeneous kernel 23 is configured to read the data and instructions of the neural network operation from the shared scratchpad memory 22, complete the neural network operation, send the operation result back to the shared scratchpad memory 22, and control the shared scratchpad memory 22 to write the operation result back to the memory 21.
  • when data needs to be transferred between cores, the core transmitting the data can first transfer the data through the shared scratchpad memory bus to the shared scratchpad memory 22, and the data is then transferred to the core receiving the data without going through the memory 21.
  • the neural network model generally includes multiple neural network layers; each neural network layer uses the operation result of the previous neural network layer to perform its corresponding operation and outputs its operation result to the next neural network layer, and the operation result of the last neural network layer is the result of the entire neural network operation.
  • both the general-purpose kernel 232 and the dedicated kernel 233 can perform a neural network layer operation, and the logical control kernel 231, the general-purpose kernel 232, and the dedicated kernel 233 are used to complete the neural network operation.
  • the neural network layer is simply referred to as a layer.
  • Each dedicated core 233 can independently perform a layer operation, such as a convolution operation of a neural network layer, a fully connected layer operation, a concatenation operation, an element-wise addition/multiplication operation, a ReLU operation, a pooling operation, a Batch Norm operation, and the like; and
  • the scale of the neural network layer operation must not be too large, that is, it must not exceed the scale of the neural network layer operation supported by the corresponding dedicated kernel; in other words, the dedicated kernel limits the number of neurons and synapses in the layer.
  • the general-purpose kernel 232 is used to execute a layer operation that exceeds the operation scale supported by the dedicated cores 233 or that is not supported by any dedicated core, writes the operation result back to the shared scratchpad memory 22, and controls the shared scratchpad memory 22 to write the operation result back to the memory 21.
  • the logic control core 231 sends a start operation signal to the dedicated core or general-purpose core that performs the next layer operation, to notify it to start the next layer operation.
  • the dedicated core or general-purpose kernel then starts that operation.
  • when a dedicated core 233 or general-purpose kernel 232 receives the start operation signal sent by the dedicated core or general-purpose core that performed the previous layer operation, it starts the layer operation if no layer operation is currently being performed; if a layer operation is currently being performed, it starts the new layer operation only after the current layer operation is completed and its operation result has been written back to the shared scratchpad memory 22.
  • the logic control kernel 231 reads the data of the neural network operation from the shared scratchpad memory 22, analyzes each layer of the neural network model according to the type and parameters of the neural network model therein, and determines, for each layer, whether there is a
  • dedicated kernel 233 that supports the operation of that layer and can complete the operation scale of that layer; if such a dedicated kernel exists, the operation of that layer is handed over to the corresponding dedicated kernel 233, and if not, the operation of that layer is performed by the general-purpose kernel 232.
  • the logic control core 231 also sets the corresponding addresses of the data and instructions required for the layer operations of the general core 232 and the dedicated core 233.
  • the general core 232 and the dedicated core 233 read the data and instructions of the corresponding addresses, and perform layer operations.
  • For the dedicated core 233 or the general-purpose kernel 232 that performs the first layer operation, the logic control core 231 sends a start operation signal to that dedicated core 233 or general-purpose kernel 232 at the start of the operation; after the neural network operation ends, the dedicated core 233 or
  • general-purpose kernel 232 that performed the last layer operation sends a start operation signal to the logic control core 231, and after receiving the start operation signal, the logic control core 231 controls the shared scratchpad memory 22 to write the operation result back to the memory 21.
  • An embodiment of the present disclosure provides a method for performing neural network operation by using the heterogeneous multi-core neural network processor according to the first embodiment. Referring to FIG. 5C, the steps are as follows:
  • Step S5-11 The logic control core 131 in the heterogeneous kernel 13 reads the data and instructions of the neural network operation from the memory 11 through the non-shared scratch pad memory 12;
  • Step S5-12 The logic control kernel 131 in the heterogeneous kernel 13 determines, according to the type and parameters of the neural network model in the data, whether there is a dedicated kernel that meets the condition, the condition being that the dedicated kernel supports the neural network operation and can complete the scale of the neural network operation (the scale limitation can be inherent to the dedicated kernel, in which case the kernel design manufacturer can be consulted; or it can be specified artificially, for example when experiments show that the general-purpose kernel is more effective beyond a certain scale; the size limit can be set when configuring the CAM). If the dedicated kernel m meets the condition, the dedicated kernel m is used as the target kernel and step S5-13 is performed; otherwise, step S5-15 is performed, where m is a dedicated kernel number, 1 ≤ m ≤ M, and M is the number of dedicated kernels.
  • Step S5-13 The logic control core 131 in the heterogeneous kernel 13 sends a signal to the target core, activates the target kernel, and transmits the data corresponding to the neural network operation to be performed and the address corresponding to the instruction to the target kernel.
  • Step S5-14 The target kernel acquires the data and instructions of the neural network operation from the memory 11 through the non-shared scratchpad memory 12 according to the acquired addresses, performs the neural network operation, and then outputs the operation result to the memory 11 through the non-shared scratchpad memory 12, and the operation is completed.
  • If no dedicated kernel meets the condition, steps S5-15 to S5-16 are performed next.
  • Step S5-15 The logic control core 131 in the heterogeneous kernel 13 sends a signal to the general-purpose core 132, activates the general-purpose core 132, and transmits the data corresponding to the neural network operation to be performed and the address corresponding to the instruction to the general-purpose core 132.
  • Step S5-16 The general-purpose kernel 132 acquires the data and instructions of the neural network operation from the memory 11 through the non-shared scratchpad memory 12 according to the acquired addresses, performs the neural network operation, and then outputs the operation result to the memory 11 through the non-shared scratchpad memory 12, and the operation is completed.
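The decision in steps S5-12 to S5-16 can be summarized by the following sketch; the kernel-information dictionaries and the "supported types / maximum scale" checks are assumptions made only for illustration of the dispatch logic.

```python
def choose_target_kernel(model_type, model_scale, dedicated_kernels, general_kernel):
    """Return the kernel that will run the whole neural network operation:
    a dedicated kernel that supports the model type and scale if one exists
    (S5-13/S5-14), otherwise the general-purpose kernel (S5-15/S5-16)."""
    for kernel in dedicated_kernels:
        if model_type in kernel["supported_types"] and model_scale <= kernel["max_scale"]:
            return kernel
    return general_kernel

dedicated = [
    {"id": 1, "supported_types": {"SNN"}, "max_scale": 10_000},
    {"id": 2, "supported_types": {"CNN"}, "max_scale": 50_000},
]
general = {"id": 0, "supported_types": "any", "max_scale": float("inf")}
print(choose_target_kernel("CNN", 20_000, dedicated, general)["id"])  # 2
print(choose_target_kernel("RNN", 20_000, dedicated, general)["id"])  # 0 (general-purpose)
```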
  • An embodiment of the present disclosure provides a method for performing neural network operation by using the heterogeneous multi-core neural network processor according to the second embodiment. Referring to FIG. 5D, the steps are as follows:
  • Step S5-21 The logic control kernel 231 in the heterogeneous kernel 23 reads the data and instructions of the neural network operation from the memory 21 through the shared scratchpad memory 22.
  • Step S5-22 The logic control kernel 231 in the heterogeneous kernel 23 parses the type and parameters of the neural network model in the data. For the first to the I-th layers of the neural network model, where I is the number of layers of the neural network model, it determines whether there is a dedicated kernel that meets the condition, the condition being that the dedicated kernel supports the layer operation and can complete the operation scale of the layer, and it allocates a corresponding general-purpose kernel or dedicated kernel for each layer operation.
  • For the i-th layer operation of the neural network model, 1 ≤ i ≤ I: if a dedicated kernel m meets the condition, the dedicated kernel m is selected to perform the i-th layer operation of the neural network model, where m is a dedicated kernel number, 1 ≤ m ≤ M, and M is the number of dedicated kernels. If no dedicated kernel meets the condition, the general-purpose kernel M+n is selected to perform the i-th layer operation of the neural network model,
  • where M+n is the number of the general-purpose kernel, 1 ≤ n ≤ N, and N is the number of general-purpose kernels. Here the dedicated kernels 233 and the general-purpose kernels 232 are numbered uniformly (that is, the dedicated kernels and the general-purpose kernels are numbered together; for example, with x dedicated kernels and y general-purpose kernels, they can be numbered from 1 to x+y, and each dedicated kernel and each general-purpose kernel corresponds to a number in 1 to x+y; of course, the dedicated kernels and the general-purpose kernels may also be numbered separately, for example, with x dedicated kernels and y general-purpose kernels, the dedicated kernels can be numbered from 1 to x and the general-purpose kernels from 1 to y, with each dedicated kernel and each general-purpose kernel corresponding to a number; in that case a dedicated kernel may have the same number as a general-purpose kernel, but only the logical numbers are the same, and the kernels can still be addressed according to their physical addresses). In this way a kernel sequence corresponding to the neural network model is finally obtained;
  • the kernel sequence has I elements in total, and each element is a dedicated kernel or a general-purpose kernel, sequentially corresponding to the first to the I-th layer operations of the neural network model.
  • Step S5-23 The logic control kernel 231 in the heterogeneous kernel 23 transmits the data corresponding to the layer operation to be performed and the address corresponding to the instruction to the dedicated core or general-purpose kernel that performs the layer operation, and sends to that dedicated core or general-purpose kernel the number of the next dedicated core or general-purpose kernel in the kernel sequence, where the number of the logic control kernel is sent to the dedicated core or general-purpose kernel that performs the last layer operation.
  • Step S5-24 The logic control kernel 231 in the heterogeneous kernel 23 sends a start operation signal to the first core in the kernel sequence. After receiving the start operation signal, the first dedicated core 233 or general-purpose core 232 completes any currently unfinished operation, and then reads data and instructions from the addresses corresponding to the data and the instruction to perform the current layer operation.
  • Step S5-25 After completing the current layer operation, the first dedicated core 233 or general-purpose kernel 232 transfers the operation result to the designated address of the shared scratchpad memory 22 and simultaneously sends a start operation signal to the second core in the kernel sequence.
  • Step S5-26 By analogy, after each core in the kernel sequence receives the start operation signal, it completes any currently unfinished operation, then reads data and instructions from the addresses corresponding to the data and the instruction, performs the corresponding layer operation, transfers the operation result to the designated address of the shared scratchpad memory 22, and sends the start operation signal to the next core in the kernel sequence. The last core in the kernel sequence sends a start operation signal to the logic control core 231.
  • Step S5-27 After receiving the start operation signal, the logic control core 231 controls the shared scratch pad memory 22 to write the operation result of each neural network layer back to the memory 21, and the calculation is completed.
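Steps S5-22 to S5-27 amount to building a per-layer kernel sequence and chaining start signals through it; the following sketch models only the sequence construction under assumed layer descriptors and kernel capability fields, not the hardware signalling itself.

```python
def build_kernel_sequence(layers, dedicated_kernels, general_kernel):
    """Assign one kernel per layer (S5-22): a dedicated kernel that supports
    the layer type and scale if one exists, otherwise the general-purpose kernel.
    Each entry also records which kernel to signal next (S5-23)."""
    chosen = []
    for layer in layers:
        kernel = next((k for k in dedicated_kernels
                       if layer["type"] in k["supported_types"]
                       and layer["scale"] <= k["max_scale"]),
                      general_kernel)
        chosen.append(kernel["id"])
    # the core running layer i signals chosen[i+1]; the last one signals the
    # logic control core (id 0 here) so it can write the result back to memory
    next_to_signal = chosen[1:] + [0]
    return list(zip(chosen, next_to_signal))

layers = [{"type": "conv", "scale": 1_000}, {"type": "pool", "scale": 500},
          {"type": "fc", "scale": 80_000}]
dedicated = [{"id": 1, "supported_types": {"conv", "pool"}, "max_scale": 10_000},
             {"id": 2, "supported_types": {"fc"}, "max_scale": 50_000}]
general = {"id": 3, "supported_types": "any", "max_scale": float("inf")}
print(build_kernel_sequence(layers, dedicated, general))
# [(1, 1), (1, 3), (3, 0)]: conv and pool on dedicated kernel 1, the oversized fc on general kernel 3
```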
  • this embodiment is a further extension of the above first embodiment.
  • the scratchpad memory 121 is dedicated to each core.
  • the dedicated core 1 can only access the scratchpad memory 3, but cannot access other scratchpad memories.
  • the other cores are likewise restricted, so the non-shared scratchpad memory 12 has a non-shared nature overall.
  • As a result, if kernel j wants to use the result of kernel i (i ≠ j) (this result is initially stored in the scratchpad memory corresponding to kernel i),
  • kernel i must first write this result from its scratchpad memory to the memory 11, and kernel j must then read it from the memory 11 into the scratchpad memory it can access; only after that can kernel j use this result.
  • In this embodiment, an N×N data exchange network 34 is therefore introduced, which can be implemented with, for example, a crossbar, so that each core (331, 332, or 333) can access all the scratchpad memories (321); at this point the scratchpad memory 32 has a shared nature.
  • the method for performing neural network operation using the apparatus of this embodiment is as follows:
  • Step S5-31 The logic control kernel 331 in the heterogeneous kernel 33 reads the data and instructions of the neural network operation from the memory 31 through the scratchpad memory 32;
  • Step S5-32 The logic control kernel 331 in the heterogeneous kernel 33 determines, according to the type and parameters of the neural network model in the data, whether there is a dedicated kernel that meets the condition, the condition being that the dedicated kernel supports the neural network operation and can complete the scale of the neural network operation. If the dedicated kernel m meets the condition, the dedicated kernel m is used as the target kernel and step S5-33 is performed; otherwise, step S5-35 is performed, where m is a dedicated kernel number.
  • Step S5-33 The logic control kernel 331 in the heterogeneous kernel 33 sends a signal to the target core to activate the target kernel, and simultaneously transmits the data corresponding to the neural network operation to be performed and the address corresponding to the instruction to the target core.
  • Step S5-34 The target kernel acquires the data and instructions of the neural network operation according to the acquired address (from the scratch pad memory 32), performs a neural network operation, and then stores the operation result in the scratch pad memory 32, and the operation is completed.
  • Step S5-35 The logic control kernel 331 in the heterogeneous kernel 33 sends a signal to the general-purpose kernel 332, activates the general-purpose kernel 332, and transmits the data corresponding to the neural network operation to be performed and the address corresponding to the instruction to the general-purpose kernel 332.
  • Step S5-36 The general-purpose kernel 332 acquires the data and instructions of the neural network operation according to the acquired addresses (from the scratchpad memory 32), performs the neural network operation, and then stores the operation result in the scratchpad memory 32, and the operation is completed.
  • The embodiment of FIG. 5F differs from the embodiment described in FIG. 5E in the manner of connection between the memory 41 and the scratchpad memory 42.
  • In FIG. 5E a bus connection is used, and when data from the memory 31 is written into the plurality of scratchpad memories 321 the transfers have to queue, which is not efficient.
  • In FIG. 5F this structure is abstracted into a 1-input-N-output data exchange network that can be implemented using a variety of topologies, such as a star structure (the memory 41 and the N scratchpad memories 421 are connected by dedicated paths), a tree structure (the memory 41 is at the root position and the scratchpad memories 421 are at the leaf positions), and the like.
  • the number of logic control cores, the number of dedicated cores, the number of general-purpose cores, the number of shared or non-shared scratchpad memories, and the number of memories are not limited, and may be appropriately adjusted according to the requirements of the neural network operation.
  • the present disclosure also provides a chip that includes the above-described computing device.
  • the present disclosure also provides a chip package structure including the above chip.
  • the present disclosure also provides a board that includes the chip package structure described above.
  • the present disclosure also provides an electronic device that includes the above-described card.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearables, vehicles, household appliances, and/or medical equipment.
  • the vehicle includes an airplane, a ship, and/or a vehicle;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical functional division;
  • in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software program module.
  • the integrated unit if implemented in the form of a software program module and sold or used as a standalone product, may be stored in a computer readable memory.
  • Such a computer readable memory includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • the foregoing memory includes: a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and the like, which can store program codes.
  • The control module of the present disclosure is not limited to the specific composition of the embodiments; any control module well known to those skilled in the art that can realize the interaction of data and operation instructions between the storage module and the operation unit can be used to implement the present disclosure.


Abstract

An operation method and device. The operation device includes: an operation module (1-1) for performing neural network operations; and a power conversion module (1-2), connected to the operation module (1-1), for converting input neuron data and/or output neuron data of the neural network operation into power neuron data. The operation method and device reduce the overhead of storage resources and computing resources and increase the operation speed.

Description

Operation Device and Method

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and more particularly to an operation device and method.

Background

Multi-layer neural networks are widely used in tasks such as classification and recognition. In recent years, owing to their high recognition rate and high degree of parallelism, they have received extensive attention from both academia and industry.

At present, neural networks with better performance are usually very large, which means that they require large amounts of computing and storage resources. The overhead of large amounts of computing and storage resources reduces the operation speed of the neural network and, at the same time, greatly increases the requirements on the transmission bandwidth of the hardware and on the arithmetic units.

Summary

(1) Technical Problem to Be Solved

The present disclosure provides an operation device and method to at least partially solve the technical problems set forth above.

(2) Technical Solution

According to one aspect of the present disclosure, a neural network operation device is provided, including:

an operation module for performing a neural network operation; and

a power conversion module, connected to the operation module, for converting input neuron data and/or output neuron data of the neural network operation into power neuron data.
In some embodiments, the power conversion module includes:

a first power conversion unit for converting the neuron data output by the operation module into power neuron data; and

a second power conversion unit for converting the neuron data input to the operation module into power neuron data.

In some embodiments, the operation module further includes: a third power conversion unit for converting power neuron data into non-power neuron data.

In some embodiments, the neural network operation device further includes:

a storage module for storing data and operation instructions;

a control module for controlling the interaction of data and operation instructions, which receives the data and operation instructions sent by the storage module and decodes the operation instructions into operation microinstructions; wherein

the operation module includes an operation unit for receiving the data and operation microinstructions sent by the control module and performing a neural network operation on the weight data and neuron data it receives according to the operation microinstructions.

In some embodiments, the control module includes: an operation instruction cache unit, a decoding unit, an input neuron cache unit, a weight cache unit, and a data control unit; wherein

the operation instruction cache unit is connected to the data control unit and is configured to receive the operation instructions sent by the data control unit;

the decoding unit is connected to the operation instruction cache unit and is configured to read operation instructions from the operation instruction cache unit and decode them into operation microinstructions;

the input neuron cache unit is connected to the data control unit and is configured to obtain the corresponding power neuron data from the data control unit;

the weight cache unit is connected to the data control unit and is configured to obtain the corresponding weight data from the data control unit;

the data control unit is connected to the storage module and is configured to realize the interaction of data and operation instructions between the storage module and each of the operation instruction cache unit, the weight cache unit, and the input neuron cache unit; wherein

the operation unit is connected to the decoding unit, the input neuron cache unit, and the weight cache unit respectively, receives the operation microinstructions, the power neuron data, and the weight data, and performs the corresponding neural network operation on the power neuron data and weight data it receives according to the operation microinstructions.
In some embodiments, the neural network operation device further includes: an output module, which includes an output neuron cache unit for receiving the neuron data output by the operation module;

wherein the power conversion module includes:

a first power conversion unit, connected to the output neuron cache unit, for converting the neuron data output by the output neuron cache unit into power neuron data;

a second power conversion unit, connected to the storage module, for converting the neuron data input to the storage module into power neuron data;

and the operation module further includes: a third power conversion unit, connected to the operation unit, for converting power neuron data into non-power neuron data.

In some embodiments, the first power conversion unit is further connected to the data control unit, and is configured to convert the neuron data output by the operation module into power neuron data and send it to the data control unit as input data of the next layer of the neural network operation.

In some embodiments, the power neuron data includes a sign bit and power bits, the sign bit being used to represent the sign of the power neuron data and the power bits being used to represent the power-bit data of the power neuron data; the sign bit includes one or more bits of data, and the power bits include m bits of data, m being a positive integer greater than 1.
In some embodiments, the neural network operation device further includes a storage module in which an encoding table is pre-stored; the encoding table includes power-bit data and exponent values, and is used to obtain, for each item of power-bit data of the power neuron data, the exponent value corresponding to that power-bit data.

In some embodiments, the encoding table further includes one or more items of zero-setting power-bit data, and the power neuron data corresponding to the zero-setting power-bit data is 0.

In some embodiments, the power neuron data corresponding to the largest power-bit data is 0, or the power neuron data corresponding to the smallest power-bit data is 0.

In some embodiments, the correspondence of the encoding table is that the highest bit of the power-bit data represents the zero-setting bit, while the other m-1 bits of the power-bit data correspond to the exponent value.

In some embodiments, the correspondence of the encoding table is a positive correlation; the storage module pre-stores an integer value x and a positive integer value y, and the smallest power-bit data corresponds to the exponent value x, where x denotes an offset value and y denotes a step size.

In some embodiments, the smallest power-bit data corresponds to the exponent value x, the largest power-bit data corresponds to power neuron data 0, and the other power-bit data between the smallest and the largest corresponds to the exponent value (power-bit data + x) * y.

In some embodiments, y = 1 and the value of x is -2^(m-1).

In some embodiments, the correspondence of the encoding table is a negative correlation; the storage module pre-stores an integer value x and a positive integer value y, and the largest power-bit data corresponds to the exponent value x, where x denotes an offset value and y denotes a step size.

In some embodiments, the largest power-bit data corresponds to the exponent value x, the smallest power-bit data corresponds to power neuron data 0, and the other power-bit data between the smallest and the largest corresponds to the exponent value (power-bit data - x) * y.

In some embodiments, y = 1 and the value of x equals 2^(m-1).
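As a small worked illustration of the positive-correlation encoding table just described, assume m = 4 power bits, offset x = -2^(m-1) = -8, and step y = 1, with the largest power-bit code reserved for the value zero; the function and constant names below are illustrative only.

```python
M = 4                      # number of power bits
X = -2 ** (M - 1)          # offset value x = -8
Y = 1                      # step size y

def decode_power_bits(power_bits):
    """Return None for the zero-setting code (the power neuron data is 0),
    otherwise the exponent encoded under the positive-correlation table."""
    if power_bits == 2 ** M - 1:        # largest power-bit data encodes the value 0
        return None
    return (power_bits + X) * Y         # exponent value

for code in (0, 1, 8, 14, 15):
    print(code, "->", decode_power_bits(code))
# 0 -> -8, 1 -> -7, 8 -> 0, 14 -> 6, 15 -> None (value 0)
```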
In some embodiments, converting neuron data into power neuron data includes:

s_out = s_in
d_out+ = ⌊log₂(d_in+)⌋

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌊x⌋ denotes a floor operation on the data x; or,

s_out = s_in
d_out+ = ⌈log₂(d_in+)⌉

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and ⌈x⌉ denotes a ceiling operation on the data x; or,

s_out = s_in
d_out+ = [log₂(d_in+)]

where d_in is the input data of the power conversion unit, d_out is the output data of the power conversion unit, s_in is the sign of the input data, s_out is the sign of the output data, d_in+ is the positive part of the input data with d_in+ = d_in × s_in, d_out+ is the positive part of the output data with d_out+ = d_out × s_out, and [x] denotes a rounding operation on the data x.
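The three conversion rules above differ only in how the base-2 logarithm of the positive part is quantized (floor, ceiling, or rounding); the following sketch applies them to a plain floating-point neuron value, leaving the bit-level packing of the sign and power bits aside as an assumption.

```python
import math

def to_power_neuron(d_in, mode="floor"):
    """Convert a neuron value d_in into (sign, exponent), where the represented
    value is sign * 2**exponent; mode selects floor, ceil, or round of log2."""
    s_in = 1 if d_in >= 0 else -1        # s_out = s_in
    d_in_pos = d_in * s_in               # d_in+ = d_in * s_in
    if d_in_pos == 0:
        return s_in, None                # zero is handled by the zero-setting code
    log2 = math.log2(d_in_pos)
    if mode == "floor":
        exponent = math.floor(log2)      # d_out+ = floor(log2(d_in+))
    elif mode == "ceil":
        exponent = math.ceil(log2)       # d_out+ = ceil(log2(d_in+))
    else:
        exponent = round(log2)           # d_out+ = [log2(d_in+)]
    return s_in, exponent

print(to_power_neuron(-6.0, "floor"))    # (-1, 2): represents -4
print(to_power_neuron(-6.0, "ceil"))     # (-1, 3): represents -8
print(to_power_neuron(-6.0, "round"))    # (-1, 3): log2(6) is about 2.58, rounds to 3
```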
According to another aspect of the present disclosure, a neural network operation method is provided, including:

performing a neural network operation; and

before performing the neural network operation, converting the input neuron data of the neural network operation into power neuron data; and/or, after performing the neural network operation, converting the output neuron data of the neural network operation into power neuron data.

In some embodiments, the step of converting the input neuron data of the neural network operation into power neuron data before performing the neural network operation includes:

converting the non-power neuron data in the input data into power neuron data; and

receiving and storing the operation instructions, the power neuron data, and the weight data.

In some embodiments, between the step of receiving and storing the operation instructions, the power neuron data, and the weight data and the step of performing the neural network operation, the method further includes:

reading the operation instructions and decoding them into operation microinstructions.

In some embodiments, in the step of performing the neural network operation, the neural network operation is performed on the weight data and the power neuron data according to the operation microinstructions.

In some embodiments, the step of converting the output neuron data of the neural network operation into power neuron data after performing the neural network operation includes:

outputting the neuron data obtained after the neural network operation; and

converting the non-power neuron data in the neuron data obtained after the neural network operation into power neuron data.

In some embodiments, the non-power neuron data in the neuron data obtained after the neural network operation is converted into power neuron data and sent to the data control unit to serve as the input power neurons of the next layer of the neural network operation; the neural network operation step and the step of converting non-power neuron data into power neuron data are repeated until the operation of the last layer of the neural network is finished.

In some embodiments, an integer value x and a positive integer value y are pre-stored in the storage module, where x denotes an offset value and y denotes a step size; the range of power neuron data that the neural network operation device can represent is adjusted by changing the integer value x and the positive integer value y pre-stored in the storage module.

According to yet another aspect of the present disclosure, a method of using the neural network operation device is provided, wherein the range of power neuron data that the neural network operation device can represent is adjusted by changing the integer value x and the positive integer value y pre-stored in the storage module.
According to one aspect of the present disclosure, a neural network operation device is provided, including:

an operation module for performing a neural network operation; and

a power conversion module, connected to the operation module, for converting the input data and/or output data of the neural network operation into power data.

In some embodiments, the input data includes input neuron data and input weight data, the output data includes output neuron data and output weight data, and the power data includes power neuron data and power weight data.

In some embodiments, the power conversion module includes:

a first power conversion unit for converting the output data of the operation module into power data; and

a second power conversion unit for converting the input data of the operation module into power data.

In some embodiments, the operation module further includes: a third power conversion unit for converting power data into non-power data.

In some embodiments, the neural network operation device further includes:

a storage module for storing data and operation instructions;

a control module for controlling the interaction of data and operation instructions, which receives the data and operation instructions sent by the storage module and decodes the operation instructions into operation microinstructions; wherein

the operation module includes an operation unit for receiving the data and operation microinstructions sent by the control module and performing a neural network operation on the weight data and neuron data it receives according to the operation microinstructions.

In some embodiments, the control module includes: an operation instruction cache unit, a decoding unit, an input neuron cache unit, a weight cache unit, and a data control unit; wherein

the operation instruction cache unit is connected to the data control unit and is configured to receive the operation instructions sent by the data control unit;

the decoding unit is connected to the operation instruction cache unit and is configured to read operation instructions from the operation instruction cache unit and decode them into operation microinstructions;

the input neuron cache unit is connected to the data control unit and is configured to obtain the corresponding power neuron data from the data control unit;

the weight cache unit is connected to the data control unit and is configured to obtain the corresponding power weight data from the data control unit;

the data control unit is connected to the storage module and is configured to realize the interaction of data and operation instructions between the storage module and each of the operation instruction cache unit, the weight cache unit, and the input neuron cache unit; wherein

the operation unit is connected to the decoding unit, the input neuron cache unit, and the weight cache unit respectively, receives the operation microinstructions, the power neuron data, and the power weight data, and performs the corresponding neural network operation on the power neuron data and power weight data it receives according to the operation microinstructions.

In some embodiments, the neural network operation device further includes: an output module, which includes an output neuron cache unit for receiving the neuron data output by the operation module;

wherein the power conversion module includes:

a first power conversion unit, connected to the output neuron cache unit and the operation unit, for converting the neuron data output by the output neuron cache unit into power neuron data and converting the weight data output by the operation unit into power weight data;

a second power conversion unit, connected to the storage module, for converting the neuron data and weight data input to the storage module into power neuron data and power weight data respectively;

and the operation module further includes: a third power conversion unit, connected to the operation unit, for converting power neuron data and power weight data into non-power neuron data and non-power weight data respectively.

In some embodiments, the first power conversion unit is further connected to the data control unit, and is configured to convert the neuron data and weight data output by the operation module into power neuron data and power weight data respectively and send them to the data control unit as input data of the next layer of the neural network operation.

In some embodiments, the power neuron data includes a sign bit and power bits, the sign bit being used to represent the sign of the power neuron data and the power bits being used to represent the power-bit data of the power neuron data; the sign bit includes one or more bits of data, and the power bits include m bits of data, m being a positive integer greater than 1;

the power weight data indicates that the value of the weight data is represented in the form of its power exponent value, wherein the power weight data includes a sign bit and power bits, the sign bit uses one or more bits to represent the sign of the weight data, the power bits use m bits to represent the power-bit data of the weight data, and m is a positive integer greater than 1.
在一些实施例中,所述神经网络运算装置还包括存储模块,该存储模块预存有编码表,该编码表包括幂次位数据以及指数数值,所述编码表用于通过幂次神经元数据、幂次权值数据的每个幂次位数据获取与幂次位数据对应的指数数值。
在一些实施例中,所述编码表还包括一个或多个置零幂次位数据,所述置零幂次位数据对应的幂次神经元数据、幂次权值数据为0。
在一些实施例中,最大的幂次位数据对应幂次神经元数据、幂次权值数据为0,或最小的幂次位数据对应幂次神经元数据、幂次权值数据为0。
在一些实施例中,所述编码表的对应关系为幂次位数据最高位代表置零位,幂次位数据的其他m-1位对应指数数值。
在一些实施例中,所述编码表的对应关系为正相关关系,存储模块预存一个整数值x和一个正整数值y,最小的幂次位数据对应指数数值为x;其中,x表示偏置值,y表示步长。
在一些实施例中,最小的幂次位数据对应指数数值为x,最大的幂次位数据对应幂次神经元数据、幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据+x)*y。
在一些实施例中,y=1,x的数值=-2 m-1
在一些实施例中,所述编码表的对应关系为负相关关系,存储模块预存一个整数值x和一个正整数值y,最大的幂次位数据对应指数数值为x;其中,x表示偏置值,y表示步长。
在一些实施例中,最大的幂次位数据对应指数数值为x,最小的幂次位数据对应幂次神经元数据、幂次权值数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据-x)*y。
在一些实施例中，y=1，x的数值等于2^(m-1)。
在一些实施例中,将神经元数据、权值数据分别转换成幂次神经元数据、幂次权值数据包括:
s out=s in
d out+=⌊log 2(d in+)⌋
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌊x⌋表示对数据x做取下整操作；或，
s out=s in
d out+=⌈log 2(d in+)⌉
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌈x⌉表示对数据x做取上整操作；或，
s out=s in
d out+=[log 2(d in+)]
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据;s in为输入数据的符号,s out为输出数据的符号;d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out;[x]表示对数据x做四舍五入操作。
根据本公开的另一个方面,提供了一种神经网络运算方法,包括:
执行神经网络运算;以及
在执行神经网络运算之前,将神经网络运算的输入数据转换为幂次数据;和/或在执行神经网络运算之后,将神经网络运算的输出数据转换为幂次数据。
在一些实施例中,所述输入数据包括输入神经元数据、输入权值数据,所述输出数据包括输出神经元数据、输出权值数据,所述幂次数据包括幂次神经元数据、幂次权值数据。
在一些实施例中,在执行神经网络运算之前,将神经网络运算的输入数据转换为幂次数据的步骤包括:
将输入数据中的非幂次数据转换为幂次数据;以及
接收并存储运算指令、所述幂次数据。
在一些实施例中,在接收并存储运算指令、所述幂次数据的步骤和执行神经网络运算的步骤之间,还包括:
读取运算指令,并将其译码成各运算微指令。
在一些实施例中,在执行神经网络运算的步骤中,根据运算微指令对幂次权值数据和幂次神经元数据进行神经网络运算。
在一些实施例中,在执行神经网络运算之后,将神经网络运算的输出数据转换为幂次数据的步骤包括:
输出神经网络运算后得到的数据;以及
将神经网络运算后得到的数据中的非幂次数据转换为幂次数据。
在一些实施例中,将神经网络运算后得到的数据中的非幂次数据转换为幂次数据并发送至所述数据控制单元,以作为神经网络运算下一层的输入数据,重复神经网络运算步骤和非幂次数据转换成幂次数据步骤,直到神经网络最后一层运算结束。
在一些实施例中,在存储模块中预存一个整数值x和一个正整数值y;其中,x表示偏置值,y表示步长;通过改变存储模块预存的整数值x和正整数值y,以调整神经网络运算装置所能表示的幂次数据范围。
根据本公开的又一个方面,提供了一种使用所述的神经网络运算装置的方法,其中,通过改变存储模块预存的整数值x和正整数值y,以调整神经网络运算装置所能表示的幂次数据范围。
根据本公开的一个方面,提供了一种运算装置,包括:
运算控制模块,用于确定分块信息;
运算模块,用于根据所述分块信息对运算矩阵进行分块、转置及合并运算,以得到所述运算矩阵的转置矩阵。
在一些实施例中,所述的运算装置,还包括:
地址存储模块,用于存储所述运算矩阵的地址信息;以及
数据存储模块,用于存储所述运算矩阵,并存储运算后的转置矩阵;
其中,所述运算控制模块用于从所述地址存储模块提取所述运算矩阵的地址信息,并根据所述运算矩阵的地址信息分析得到分块信息;所述运算模块,用于从所述运算控制模块获取运算矩阵的地址信息及分块信息,根据所述运算矩阵的地址信息从所述数据存储模块提取运算矩阵,并根据所述分块信息对所述运算矩阵进行分块、转置及合并运算,得到所述运算矩阵的转置矩阵,并将所述运算矩阵的转置矩阵反馈至所述数据存储模块。
在一些实施例中,所述运算模块包括矩阵分块单元、矩阵运算单元和矩阵合并单元,其中:
矩阵分块单元:用于从所述运算控制模块获取运算矩阵的地址信息及分块信息,并根据所述运算矩阵的地址信息从所述数据存储模块提取运算矩阵,根据所述分块信息对所述运算矩阵进行分块,得到n个分块矩阵;
矩阵运算单元,用于获取所述n个分块矩阵,并对所述n个分块矩阵进行转置运算,得到所述n个分块矩阵的转置矩阵;
矩阵合并单元,用于获取并合并所述n个分块矩阵的转置矩阵,得到所述运算矩阵的转置矩阵,并将所述运算矩阵的转置矩阵反馈至所述数据存储模块,其中,n为自然数。
在一些实施例中,所述运算模块还包括缓存单元,用于缓存所述n个分块矩阵,以供所述矩阵运算单元获取。
在一些实施例中,所述运算控制模块包括指令处理单元、指令缓存单元和矩阵判断单元,其中:
指令缓存单元,用于存储待执行的矩阵运算指令;
指令处理单元,用于从指令缓存单元中获取矩阵运算指令,对所述矩阵运算指令进行译码,并根据所述译码后的矩阵运算指令从所述地址存储模块中获取运算矩阵的地址信息;
矩阵判断单元,用于对所述运算矩阵的地址信息进行分析,得到所述分块信息。
在一些实施例中,所述运算控制模块还包括依赖关系处理单元,用于判断所述译码后的矩阵运算指令和运算矩阵的地址信息是否与上一运算存在冲突,若存在冲突,则暂存所述译码后的矩阵运算指令和运算矩阵的地址信息;若不存在冲突,则发射所述译码后的矩阵运算指令和运算矩阵的地址信息至所述矩阵判断单元。
在一些实施例中,所述运算控制模块还包括指令队列存储器,用于缓存所述存在冲突的译码后的矩阵运算指令和运算矩阵的地址信息,当所述冲突消除后,将缓存的所述译码后的矩阵运算指令和运算矩阵的地址信息发射至所述矩阵判断单元。
在一些实施例中,所述指令处理单元包括取指单元和译码单元,其中:
取指单元,用于从所述指令缓存单元中获取矩阵运算指令,并将此矩阵运算指令传输至所述译码单元;
译码单元,用于对所述矩阵运算指令进行译码,根据该译码后的矩阵运算指令从所述地址存储模块中提取运算矩阵的地址信息,并将所述译码后的矩阵运算指令和提取的运算矩阵的地址信息传输至所述依赖关系处理单元。
在一些实施例中,所述装置还包括输入输出模块,用于向所述数据存储模块输入所述运算矩阵数据,还用于从所述数据存储模块获取运算后的转置矩阵,并输出所述运算后的转置矩阵。
在一些实施例中,所述地址存储模块包括标量寄存器堆或通用内存单元;所述数据存储模块包括高速暂存存储器或通用内存单元;所述运算矩阵的地址信息为矩阵的起始地址信息和矩阵大小信息。
根据本公开的另一个方面,提供了一种运算方法,包括以下步骤:
运算控制模块确定分块信息;
运算模块根据所述分块信息对运算矩阵进行分块、转置及合并运算,以得到所述运算矩阵的转置矩阵。
在一些实施例中,所述运算控制模块确定分块信息的步骤包括:
运算控制模块从地址存储模块提取运算矩阵的地址信息;以及
运算控制模块根据所述运算矩阵的地址信息得到分块信息。
在一些实施例中,所述运算控制模块从地址存储模块提取运算矩阵的地址信息的步骤包括:
取指单元提取运算指令,并将运算指令送至译码单元;
译码单元对运算指令进行译码,根据译码后的运算指令从地址存储模块获取运算矩阵的地址信息,并将译码后的运算指令和运算矩阵的地址信息送往依赖关系处理单元;
依赖关系处理单元分析该译码后的运算指令与前面的尚未执行结束的指令在数据上是否存在依赖关系;若存在依赖关系,该条译码后的运算指令与相应的运算矩阵的地址信息需要在指令队列存储器中等待至其与前面的未执行结束的指令在数据上不再存在依赖关系为止
在一些实施例中,所述运算模块根据所述分块信息对运算矩阵进行分块、转置及合并运算,以得到所述运算矩阵的转置矩阵的步骤包括:
运算模块根据所述运算矩阵的地址信息从数据存储模块提取运算矩阵;并根据所述分块信息将所述运算矩阵分成n个分块矩阵;
运算模块对所述n个分块矩阵分别进行转置运算,得到所述n个分块矩阵的转置矩阵;以及
运算模块合并所述n个分块矩阵的转置矩阵,得到所述运算矩阵的转置矩阵并反馈至所述数据存储模块;
其中,n为自然数。
在一些实施例中,所述运算模块合并n个分块矩阵的转置矩阵,得到运算矩阵的转置矩阵,并将该转置矩阵反馈给数据存储模块的步骤包括:
矩阵合并单元接收每个分块矩阵的转置矩阵,当接收到的分块矩阵的转置矩阵数量达到总的分块数后,对所有的分块进行矩阵合并操作,得到运算矩阵的转置矩阵;并将该转置矩阵反馈至数据存储模块的指定地址;
输入输出模块直接访问数据存储模块,从数据存储模块中读取运算得到的运算矩阵的转置矩阵。
根据本公开的一个方面,提供了一种数据筛选装置,包括:
存储单元,用于存储数据;
寄存器单元,用于存放存储单元中的数据地址;
数据筛选模块,在寄存器单元中获取数据地址,根据数据地址在存储单元中获取相应的数据,并对获取的数据进行筛选操作,得到数据筛选结果。
在一些实施例中,所述数据筛选模块包括数据筛选单元,对获取的数据进行筛选操作。
在一些实施例中,所述数据筛选模块还包括:I/O单元、输入数据缓存单元和输出数据缓存单元;
所述I/O单元,将存储单元存储的数据搬运到输入数据缓存单元;
所述输入数据缓存单元,存储I/O单元搬运的数据;
所述数据筛选单元将输入数据缓存单元传来的数据作为输入数据,并对输入数据进行筛选操作,将输出数据传给输出数据缓存单元;
所述输出数据缓存单元存储输出数据。
在一些实施例中,所述输入数据包括待筛选数据和位置信息数据,所述输出数据包含筛选后数据,或者筛选后数据及其相关信息。
在一些实施例中,所述待筛选数据为向量或数组,所述位置信息数据为二进制码、向量或数组;所述相关信息包括向量长度、数组大小、所占空间。
在一些实施例中,所述数据筛选单元扫描位置信息数据的每个分量,若分量为0,则删除该分量对应的待筛选数据分量,若分量为1,则保留该分量对应的待筛选数据分量;或者,若分量为1,则删除该分量对应的待筛选数据分量,若分量为0,则保留该分量对应的待筛选数据分量,扫描完毕后得到筛选后数据并输出。
在一些实施例中,所述数据筛选模块还包括结构变形单元,对输入数据和/或输出数据的存储结构进行变形。
根据本公开的另一个方面,提供了一种数据筛选方法,利用所述的数据筛选装置进行数据筛选,包括:
步骤A:数据筛选模块在寄存器单元中获取数据地址;
步骤B:根据数据地址在存储单元中获取相应的数据;以及
步骤C:对获取的数据进行筛选操作,得到数据筛选结果。
在一些实施例中,所述步骤A包括:数据筛选单元从寄存器单元中获取待筛选数据的地址和位置信息数据的地址;
所述步骤B包括:
子步骤B1:I/O单元将存储单元中的待筛选数据和位置信息数据传递给输入数据缓存单元;
子步骤B2:输入数据缓存单元将待筛选数据和位置信息数据传递给数据筛选单元;
所述步骤C包括:数据筛选单元根据位置信息数据,对待筛选数据进行筛选操作,将输出数据传递给输出数据缓存单元。
在一些实施例中,所述子步骤B1和子步骤B2之间还包括:
判断是否进行存储结构变形,如果是,则执行步骤子步骤B3,如果否,则直接执行子步骤B2;
子步骤B3:输入数据缓存单元将待筛选数据传给结构变形单元,结构变形单元进行存储结构变形,将变形后的待筛选数据传回输入数据缓存单元,然后执行子步骤B2。
根据本公开的一个方面,提供了一种神经网络处理器,包括:存储器、高速暂存存储器和异构内核;其中,
所述存储器,用于存储神经网络运算的数据和指令;
所述高速暂存存储器,通过存储器总线与所述存储器连接;
所述异构内核,通过高速暂存存储器总线与所述高速暂存存储器连接,通过高速暂存存储器读取神经网络运算的数据和指令,完成神经网络运算,并将运算结果送回到高速暂存存储器,控制高速暂存存储器将运算结果写回到存储器。
在一些实施例中,所述异构内核包括:
多个运算内核,其具有至少两种不同类型的运算内核,用于执行神经网络运算或神经网络层运算;以及
一个或多个逻辑控制内核,用于根据神经网络运算的数据,决定由所述专用内核和/或所述通用内核执行神经网络运算或神经网络层运算。
在一些实施例中,所述多个运算内核包括x个通用内核和y个专用内核;其中,所述专用内核专用于执行指定神经网络/神经网络层运算,所述通用内核用于执行任意神经网络/神经网络层运算。
在一些实施例中，所述通用内核为CPU，所述专用内核为NPU。
在一些实施例中，所述高速暂存存储器包括共享高速暂存存储器和/或非共享高速暂存存储器；其中，每一共享高速暂存存储器通过高速暂存存储器总线与所述异构内核中的至少两个内核对应连接；每一非共享高速暂存存储器通过高速暂存存储器总线与所述异构内核中的一个内核对应连接。
在一些实施例中,所述逻辑控制内核通过高速暂存存储器总线与所述高速暂存存储器连接,通过高速暂存存储器读取神经网络运算的数据,并根据神经网络运算的数据中的神经网络模型的类型和参数,决定由专用内核和/或通用内核作为目标内核来执行神经网络运算和/或神经网络层运算。
在一些实施例中，所述逻辑控制内核通过控制总线直接向目标内核发送信号，或者经所述高速暂存存储器向目标内核发送信号，从而控制目标内核执行神经网络运算和/或神经网络层运算。
根据本公开的另一个方面,提供了一种神经网络运算方法,其中,利用所述的神经网络处理器进行神经网络运算,包括:
异构内核中的逻辑控制内核通过高速暂存存储器从存储器读取神经网络运算的数据和指令;以及
异构内核中的逻辑控制内核根据神经网络运算的数据中的神经网络模型的类型和参数,决定由专用内核和/或通用内核执行神经网络运算或神经网络层运算。
在一些实施例中,异构内核中的逻辑控制内核根据神经网络运算的数据中的神经网络模型的类型和参数,决定由专用内核和/或通用内核执行神经网络层运算包括:
异构内核中的逻辑控制内核根据神经网络运算的数据中的神经网络模型的类型和参数,判断是否存在符合条件的专用内核;
如果专用内核m符合条件,则专用内核m作为目标内核,异构内核中的逻辑控制内核向目标内核发送信号,将神经网络运算的数据和指令对应的地址发送到目标内核;
目标内核根据地址通过共享或非共享高速暂存存储器从存储器获取神经网络运算的数据和指令,进行神经网络运算,将运算结果通过所述共享或非共享高速暂存存储器输出到存储器,运算完成;
如果不存在符合条件的专用内核,则异构内核中的逻辑控制内核向通用内核发送信号,将神经网络运算的数据和指令对应的地址发送到通用内核;
通用内核根据地址通过共享或非共享高速暂存存储器从存储器获取神经网络运算的数据和指令,进行神经网络运算,将运算结果通过共享或非共享高速暂存存储器输出到存储器,运算完成。
在一些实施例中,所述符合条件的专用内核是指支持指定神经网络运算且能完成指定神经网络运算规模的专用内核。
在一些实施例中,异构内核中的逻辑控制内核根据神经网络运算的数据中的神经网络模型的类型和参数,决定由专用内核和/或通用内核执行神经网络运算包括:
异构内核中的逻辑控制内核对数据中的神经网络模型的类型和参数进行解析,对每一神经网络层分别判断是否存在符合条件的专用内核,并为每一神经网络层分配对应的通用内核或专用内核,得到与神经网络层对应的内核序列;
异构内核中的逻辑控制内核将神经网络层运算的数据和指令对应的地址发送到神经网络层对应的专用内核或通用内核,并向神经网络层对应的专用内核或通用内核发送内核序列中下一个专用内核或通用内核的编号;
神经网络层对应的专用内核和通用内核从地址读取神经网络层运算的数据和指令,进行神经网络层运算,将运算结果传送到共享和/或非共享高速暂存存储器的指定地址;
逻辑控制内核控制共享和/或非共享高速暂存存储器将神经网络层的运算结果写回到存储器中,运算完成。
在一些实施例中,所述符合条件的专用内核是指支持指定神经网络层运算且能完成该指定神经网络层运算规模的专用内核。
在一些实施例中,所述神经网络运算包括脉冲神经网络运算;所述神经网络层运算包括神经网络层的卷积运算、全连接层、拼接运算、对位加/乘运算、Relu运算、池化运算和/或Batch Norm运算。
(三)有益效果
从上述技术方案可以看出，本公开运算装置和方法至少具有以下有益效果之一：
（1）利用幂次数据表示方法存储神经元数据，减小存储网络数据所需的存储空间，同时，该数据表示方法简化了神经元与权值数据的乘法操作，降低了对运算器的设计要求，加快了神经网络的运算速度。
（2）将运算后得到的神经元数据转换为幂次神经元数据，减小了神经网络存储资源和计算资源的开销，有利于提高神经网络的运算速度。
(3)对于非幂次神经元数据在输入到神经网络运算装置前可先经过幂次转换,再输入神经网络运算装置,进一步减小了神经网络存储资源和计算资源的开销,提高了神经网络的运算速度。
(4)本公开的数据筛选装置和方法将参与筛选操作的数据和指令暂存在专用缓存上,可以更加高效地对不同存储结构、不同大小的数据进行数据筛选操作。
(5)采用异构内核进行神经网络运算,可以根据实际神经网络的类型和规模选用不同的内核进行运算,充分利用硬件实际的运算能力,降低成本,减少功耗开销。
(6)不同的内核进行不同层的运算,不同层之间并行运算,可以充分利用神经网络的并行性,提高了神经网络运算的效率。
附图说明
附图是用来提供对本公开的进一步理解,并且构成说明书的一部分,与下面的具体实施方式一起用于解释本公开,但并不构成对本公开的限制。在附图中:
图1A为依据本公开一实施例的神经网络运算装置的结构示意图。
图1B为依据本公开另一实施例的神经网络运算装置的结构示意图。
图1C为本公开实施例的运算单元功能示意图。
图1D为本公开实施例的运算单元另一功能示意图。
图1E为本公开实施例的主处理电路的功能示意图。
图1F为依据本公开实施例的神经网络运算装置的另一结构示意图。
图1G为依据本公开实施例的神经网络运算装置的又一结构示意图。
图1H为依据本公开实施例的神经网络运算方法流程图。
图1I为依据本公开实施例的编码表的示意图。
图1J为依据本公开实施例的编码表的另一示意图。
图1K为依据本公开实施例的编码表的另一示意图。
图1L为依据本公开实施例的编码表的另一示意图。
图1M为依据本公开实施例的幂次数据的表示方法示意图。
图1N为依据本公开实施例的权值与幂次神经元的乘法操作示意图。
图1O为依据本公开实施例的权值与幂次神经元的乘法操作示意图。
图2A为依据本公开实施例的神经网络运算装置的结构示意图。
图2B为依据本公开实施例的神经网络运算方法流程图。
图2C为依据本公开实施例的幂次数据的表示方法示意图。
图2D为依据本公开实施例的神经元与幂次权值的乘法操作示意图。
图2E为依据本公开实施例的神经元与幂次权值的乘法操作示意图。
图2F为依据本公开实施例的神经网络运算方法流程图。
图2G为依据本公开实施例的幂次数据的表示方法示意图。
图2H为依据本公开实施例的幂次神经元与幂次权值的乘法操作示意图。
图3A是本公开提出的运算装置的结构示意图。
图3B是本公开提出的运算装置的信息流示意图。
图3C是本公开提出的运算装置中运算模块的结构示意图。
图3D是本公开提出的运算模块进行矩阵运算示意图。
图3E是本公开提出的运算装置中运算控制模块的结构示意图。
图3F是本公开一实施例提出的运算装置的详细结构示意图。
图3G是本公开另一实施例提出的运算方法的流程图。
图4A为本公开实施例的数据筛选装置的整体结构示意图。
图4B为本公开实施例的数据筛选单元功能示意图。
图4C是本公开实施例的数据筛选装置的具体结构示意图。
图4D是本公开实施例的数据筛选装置的另一具体结构示意图。
图4E是本公开实施例的数据筛选方法的流程图。
图5A示意性示出了本公开一实施例的异构多核神经网络处理器。
图5B示意性示出了本公开另一实施例的异构多核神经网络处理器。
图5C为本公开另一实施例的神经网络运算方法流程图。
图5D为本公开另一实施例的神经网络运算方法流程图。
图5E示意性示出了本公开另一实施例的异构多核神经网络处理器。
图5F示意性示出了本公开另一实施例的异构多核神经网络处理器。
具体实施方式
为使本公开的目的、技术方案和优点更加清楚明白,以下结合具体实施例,并参照附图,对本公开进一步详细说明。
需要说明的是,在附图或说明书描述中,相似或相同的部分都使用相同的图号。附图中未绘示或描述的实现方式,为所属技术领域中普通技术人员所知的形式。另外,虽然本文可提供包含特定值的参数的示范,但应了解,参数无需确切等于相应的值,而是可在可接受的误差容限或设计约束内近似于相应的值。实施例中提到的方向用语,例如“上”、“下”、“前”、“后”、“左”、“右”等,仅是参考附图的方向。因此,使用的方向用语是用来说明并非用来限制本公开的保护范围。
本公开的一实施例中,如图1A所示,运算装置,包括:运算模块1-1,用于执行神经网络运算;以及幂次转换模块1-2,与所述运算模块连接,用于将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据。
在另一实施例中,如图1B所示,运算装置,包括:
存储模块1-4,用于存储数据和运算指令;
控制模块1-3,与所述存储模块连接,用于控制数据和运算指令的交互,其接收该存储模块发送的数据和运算指令,并将运算指令译码成运算微指令;
运算模块1-1,与所述控制模块连接,接收该控制模块发送的数据和运算微指令,并根据运算微指令对其接收的权值数据和神经元数据执行神经网络运算;以及
幂次转换模块1-2,与所述运算模块连接,用于将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据。
本领域人员可以理解,存储模块可以集成在运算装置内部,也可以作为片外存储器,设置在运算装置外部。
具体而言,请继续参照图1B所示,所述存储模块包括:存储单元1-41,用于存储数据和运算指令。
所述控制模块包括:
运算指令缓存单元1-32,与所述数据控制单元连接,用于接收数据控制单元发送的运算指令;
译码单元1-33,与所述运算指令缓存单元连接,用于从运算指令缓存单元中读取运算指令,并将其译码成各运算微指令;
输入神经元缓存单元1-34,与所述数据控制单元连接,用于接收数据控制单元发送的神经元数据;
权值缓存单元1-35,与所述数据控制单元连接,用于接收从数据控制单元发送的权值数据;
数据控制单元1-31,与存储模块连接,用于实现存储模块分别与运算指令缓存单元、权值缓存单元以及输入神经元缓存单元之间的数据和运算指令交互。
所述运算模块包括:运算单元1-11,分别与所述译码单元、输入神经元缓存单元及权值缓存单元连接,接收各运算微指令、神经元数据及权值数据,用于根据各运算微指令对其接收的神经元数据和权值数据执行相应的运算。
在一可选的实施例中，运算单元包括但不仅限于：第一部分的一个或多个乘法器；第二部分的一个或者多个加法器（更具体的，第二部分的加法器也可以组成加法树）；第三部分的激活函数单元；和/或第四部分的向量处理单元。更具体的，向量处理单元可以处理向量运算和/或池化运算。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out)，过程为：out=in1*in2；第二部分将输入数据in1通过加法器相加得到输出数据(out)。更具体的，第二部分为加法树时，将输入数据in1通过加法树逐级相加得到输出数据(out)，其中in1是一个长度为N的向量，N大于1，过程为：out=in1[1]+in1[2]+...+in1[N]，和/或将输入数据(in1)通过加法树累加之后和输入数据(in2)相加得到输出数据(out)，过程为：out=in1[1]+in1[2]+...+in1[N]+in2，或者将输入数据(in1)和输入数据(in2)相加得到输出数据(out)，过程为：out=in1+in2；第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out)，过程为：out=active(in)，激活函数active可以是sigmoid、tanh、relu、softmax等，除了做激活操作，第三部分可以实现其他的非线性函数，可将输入数据(in)通过运算(f)得到输出数据(out)，过程为：out=f(in)。向量处理单元将输入数据(in)通过池化运算得到池化操作之后的输出数据(out)，过程为out=pool(in)，其中pool为池化操作，池化操作包括但不限于：平均值池化，最大值池化，中值池化，输入数据in是和输出out相关的一个池化核中的数据。
所述运算单元执行运算包括第一部分是将所述输入数据1和输入数据2相乘，得到相乘之后的数据；和/或第二部分执行加法运算（更具体的，为加法树运算，用于将输入数据1通过加法树逐级相加），或者将所述输入数据1和输入数据2相加得到输出数据；和/或第三部分执行激活函数运算，对输入数据通过激活函数(active)运算得到输出数据；和/或第四部分执行池化运算，out=pool(in)，其中pool为池化操作，池化操作包括但不限于：平均值池化，最大值池化，中值池化，输入数据in是和输出out相关的一个池化核中的数据。以上几个部分的运算可以自由选择一个或多个部分进行不同顺序的组合，从而实现各种不同功能的运算。计算单元相应地即组成了二级、三级或者四级流水级架构。
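为便于理解上述乘法器、加法树、激活函数单元与向量处理（池化）单元的组合方式，下面给出一段仅作示意的Python代码草图（其中multiply、adder_tree、activate、pool等函数名及数据均为说明性假设，并非本公开运算单元的实际实现）：

```python
import math

def multiply(in1, in2):
    # 第一部分：逐元素相乘，out = in1 * in2
    return [a * b for a, b in zip(in1, in2)]

def adder_tree(in1, in2=None):
    # 第二部分：加法树逐级相加，out = in1[1]+...+in1[N]（可再加上 in2）
    total = sum(in1)
    return total + in2 if in2 is not None else total

def activate(x, func="sigmoid"):
    # 第三部分：激活函数运算，out = active(in)
    if func == "sigmoid":
        return 1.0 / (1.0 + math.exp(-x))
    if func == "relu":
        return max(0.0, x)
    if func == "tanh":
        return math.tanh(x)
    raise ValueError("unsupported activation")

def pool(window, mode="max"):
    # 第四部分：池化运算，out = pool(in)，in 为一个池化核内的数据
    if mode == "max":
        return max(window)
    if mode == "avg":
        return sum(window) / len(window)
    raise ValueError("unsupported pooling")

# 组合示例：乘法 -> 加法树 -> 激活，相当于一个神经元的运算流水
neuron_in = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.125]
out = activate(adder_tree(multiply(neuron_in, weights)), "relu")
print(out)
```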
在另一可选的实施例中,所述运算单元可以包括一个主处理电路以及多个从处理电路。
所述主处理电路，用于将一个输入数据分配成多个数据块，将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述从处理电路；
所述多个从处理电路，用于依据该运算指令对接收到的数据块执行运算得到中间结果，并将中间结果传输给所述主处理电路；
所述主处理电路,用于将多个从处理电路发送的中间结果进行处理得到该运算指令的结果,将该运算指令的结果发送给所述数据控制单元。
在一种可选实施例中,运算单元如图1C所示,可以包括分支处理电路;其中,
主处理电路与分支处理电路连接,分支处理电路与多个从处理电路连接;
分支处理电路,用于执行转发主处理电路与从处理电路之间的数据或指令。
在又一可选的实施例中，运算单元如图1D所示，可以包括一个主处理电路和多个从处理电路。可选的，多个从处理电路呈阵列分布；每个从处理电路与相邻的其他从处理电路连接，主处理电路连接所述多个从处理电路中的k个从处理电路，所述k个从处理电路为：第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路。
k个从处理电路，用于转发所述主处理电路与所述多个从处理电路之间的数据以及指令。
可选的,如图1E所示,该主处理电路还可以包括:转换处理电路、激活处理电路、加法处理电路中的一种或任意组合;
转换处理电路,用于将主处理电路接收的数据块或中间结果执行第一数据结构与第二数据结构之间的互换(例如连续数据与离散数据的转换);或将主处理电路接收的数据块或中间结果执行第一数据类型与第二数据类型之间的互换(例如定点类型与浮点类型的转换);
激活处理电路,用于执行主处理电路内数据的激活运算;
加法处理电路,用于执行加法运算或累加运算。
所述从处理电路包括:
乘法处理电路,用于对接收到的数据块执行乘积运算得到乘积结果;
转发处理电路(可选的),用于将接收到的数据块或乘积结果转发。
累加处理电路,所述累加处理电路,用于对该乘积结果执行累加运算得到该中间结果。
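下面给出一段示意性的Python草图，用于说明主处理电路分发数据块、从处理电路计算中间结果、主处理电路再行合并的流程（master_dispatch、slave_compute等名称均为说明性假设，仅为帮助理解，并非实际电路行为）：

```python
def master_dispatch(data, num_slaves):
    # 主处理电路：把输入数据切分成多个数据块
    size = (len(data) + num_slaves - 1) // num_slaves
    return [data[i * size:(i + 1) * size] for i in range(num_slaves)]

def slave_compute(block, weight_block):
    # 从处理电路：乘积运算 + 累加运算，得到中间结果
    return sum(a * b for a, b in zip(block, weight_block))

def master_merge(partials):
    # 主处理电路：对各从处理电路的中间结果再做累加，得到该运算指令的结果
    return sum(partials)

data = [1, 2, 3, 4, 5, 6, 7, 8]
weights = [1, 1, 2, 2, 3, 3, 4, 4]
blocks = master_dispatch(data, 4)
wblocks = master_dispatch(weights, 4)
result = master_merge([slave_compute(b, w) for b, w in zip(blocks, wblocks)])
print(result)
```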
在再一可选的实施例中,该运算指令可以为矩阵乘以矩阵的指令、累加指令、激活指令等等运算指令。
所述输出模块1-5包括:输出神经元缓存单元1-51,与所述运算单元连接,用于接收运算单元输出的神经元数据;
所述幂次转换模块包括:
第一幂次转换单元1-21,与所述输出神经元缓存单元连接,用于将所述输出神经元缓存单元输出的神经元数据转换为幂次神经元数据;以及
第二幂次转换单元1-22,与所述存储模块连接,用于将输入所述存储模块的神经元数据转换为幂次神经元数据。而对于神经网络输入数据中的幂次神经元数据,则直接存入存储模块。
若所述神经网络运算装置利用I/O模块实现数据输入/输出,所述第一和第二幂次转换单元也可设置在I/O模块与运算模块之间,以将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据。
可选的,所述运算装置可包括:第三幂次转换单元1-23,用于将幂次神经元数据转换为非幂次神经元数据。非幂次神经元数据经第二幂次转换单元转换为幂次神经元数据后输入至运算单元中执行运算,运算过程中,为提高精度,可选择性的设置第三幂次转换单元,用于将幂次神经元数据转换为非幂次神经元数据,第三幂次转换单元可以设在所述运算模块外部(如图1F所示)也可以设在所述运算模块内部(如图1G所示),运算之后输出的非幂次神经元数据可利用第一幂次转换单元转换成幂次神经元数据,再反馈至数据控制单元,参与后续运算,以加快运算速度,由此可形成闭合循环。
当然,运算模块输出的数据也可以直接发送至输出神经元缓存单元,由输出神经元缓存单元发送至数据控制单元而不经由幂次转换单元。
其中,存储模块可以从外部地址空间接收数据和运算指令,该数据包括神经网络权值数据、神经网络输入数据等。
另外,幂次转换操作有多种可选方式。下面列举本实施例所采用的三种幂次转换操作:
第一种幂次转换方法:
s out=s in
d out+=⌊log 2(d in+)⌋
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌊x⌋表示对数据x做取下整操作。
第二种幂次转换方法:
s out=s in
d out+=⌈log 2(d in+)⌉
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌈x⌉表示对数据x做取上整操作。
第三种幂次转换方法:
s out=s in
d out+=[log 2(d in+)]
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据;s in为输入数据的符号,s out为输出数据的符号;d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out;[x]表示对数据x做四舍五入操作。
需要说明的是,本公开幂次转换方式除了四舍五入取整、向上取整、向下取整之外,还可以是向奇数取整、向偶数取整、向零取整和随机取整。其中,优选的为四舍五入取整、向零取整和随机取整以减小精度损失。
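作为上述三种幂次转换方式的一个辅助说明，下面给出一段假设性的Python草图（power_convert为说明性函数名；草图中以(符号, 指数)二元组近似表示幂次数据，仅演示取下整、取上整与四舍五入三种取整方式，并非装置的实际实现）：

```python
import math

def power_convert(d_in, mode="floor"):
    # 符号位：s_out = s_in
    s = -1 if d_in < 0 else 1
    d_pos = abs(d_in)                 # d_in+ = d_in * s_in
    if d_pos == 0:
        return s, None                # 数值0用编码表中的置零幂次位数据表示
    e = math.log2(d_pos)
    if mode == "floor":               # 幂次位指数 = ⌊log2(d_in+)⌋
        exp = math.floor(e)
    elif mode == "ceil":              # 幂次位指数 = ⌈log2(d_in+)⌉
        exp = math.ceil(e)
    else:                             # 幂次位指数 = [log2(d_in+)]，四舍五入
        exp = round(e)
    return s, exp                     # 幂次数据只保留符号与指数

# 例：9.0 按三种取整方式分别转换为 2^3、2^4、2^3
for m in ("floor", "ceil", "round"):
    print(m, power_convert(9.0, m))
```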
另外,本公开实施例还提供了一种神经网络运算方法,所述神经网络运算方法,包括:执行神经网络运算;以及在执行神经网络运算之前,将神经网络运算的输入神经元数据转换为幂次神经元数据;和/或在执行神经网络运算之后,将神经网络运算的输出神经元数据转换为幂次神经元数据。
可选的,在执行神经网络运算之前,将神经网络运算的输入神经元数据转换为幂次神经元数据的步骤包括:将输入数据中的非幂次神经元数据转换为幂次神经元数据;以及接收并存储运算指令、所述幂次神经元数据及权值数据。
可选的,在接收并存储运算指令、所述幂次神经元数据及权值数据的步骤和执行神经网络运算的步骤之间,还包括:读取运算指令,并将其译码成各运算微指令。
可选的,在执行神经网络运算的步骤中,根据运算微指令对权值数据和幂次神经元数据进行神经网络运算。
可选的,在执行神经网络运算之后,将神经网络运算的输出神经元数据转换为幂次神经元数据的步骤包括:输出神经网络运算后得到的神经元数据;以及将神经网络运算后得到的神经元数据中的非幂次神经元数据转换为幂次神经元数据。
可选的,将神经网络运算后得到的神经元数据中的非幂次神经元数据转换为幂次神经元数据并发送至所述数据控制单元,以作为神经网络运算下一层的输入幂次神经元,重复神经网络运算步骤和非幂次神经元数据转换成幂次神经元数据步骤,直到神经网络最后一层运算结束。
具体而言,本公开实施例的神经网络为多层神经网络,在一些实施例中,对于每层神经网络可按图1H所示的运算方法进行运算,其中,神经网络第一层输入幂次神经元数据可通过存储模块从外部地址读入,若外部地址读入的数据已经为幂次数据则直接传入存储模块,否则先通过幂次转换单元转换为幂次数据,此后各层神经网络的输入幂次神经元数据可由在该层之前的一层或多层神经网络的输出幂次神经元数据提供。请参照图1H,本实施例单层神经网络运算方法,包括:
步骤S1-1,获取运算指令、权值数据及神经元数据。
其中,所述步骤S1-1包括以下子步骤:
S1-11,将运算指令、神经元数据及权值数据输入存储模块;其中,对幂次神经元数据直接输入存储模块,对非幂次神经元数据经过所述第二幂次转换单元转换后输入存储模块;
S1-12,数据控制单元接收该存储模块发送的运算指令、幂次神经元数据及权值数据;
S1-13,运算指令缓存单元、输入神经元缓存单元及权值缓存单元分别接收所述数据控制单元发送的运算指令、幂次神经元数据及权值数据并分发给译码单元或运算单元。
所述幂次神经元数据表示神经元数据的数值采用其幂指数值形式表示,具体为,幂次神经元数据包括符号位和幂次位,符号位用一位或多位比特位表示幂次神经元数据的符号,幂次位用m位比特位表示幂次神经元数据的幂次位数据,m为大于1的正整数。存储模块的存储单元预存有编码表,提供幂次神经元数据的每个幂次位数据对应的指数数值。编码表设置一个或者多个幂次位数据(即置零幂次位数据)为指定对应的幂次神经元数据为0。也就是说,当幂次神经元数据的幂次位数据是编码表里的置零幂次位数据时候,表示该幂次神经元数据为0。其中,所述编码表可以有灵活的存储方式,既可以是表格形式进行存储,还可以是通过函数关系进行的映射。
编码表的对应关系可以是任意的。
例如,编码表的对应关系可以是乱序的。如图1I所示一种m为5的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为3。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为1。幂次位数据为00100的时候对应幂次神经元数据为0。
编码表的对应关系也可以是正相关的,存储模块预存一个整数值x和一个正整数值y,最小的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次神经元数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最小的幂次位数据对应指数数值为x,最大的幂次位数据对应幂次神经元数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据+x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配,可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施例里，y为1，x的数值等于-2^(m-1)。由此幂次神经元数据所表示的数值的指数范围为-2^(m-1)~2^(m-1)-1。
在一种实施例里,如图1J所示一种m为5,x为0,y为1的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为1。幂次位数据为00010的时候对应指数数值为2。幂次位数据为00011的时候对应指数数值为3。幂次位数据为11111的时候对应幂次神经元数据为0。如图1K所示另一种m为5,x为0,y为2的编码表的部分内容,幂次位数据为00000的时候对应指数数值为0。幂次位数据为00001的时候对应指数数值为2。幂次位数据为00010的时候对应指数数值为4。幂次位数据为00011的时候对应指数数值为6。幂次位数据为11111的时候对应幂次神经元数据为0。
编码表的对应关系可以是负相关的,存储模块预存一个整数值x和一个正整数值y,最大的幂次位数据对应指数数值为x,其他任意一个或多个幂次位数据对应幂次神 经元数据为0。x表示偏置值,y表示步长。在一种实施例情况下,最大的幂次位数据对应指数数值为x,最小的幂次位数据对应幂次神经元数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据-x)*y。通过预设定不同的x和y以及通过改变x和y的数值,幂次的表示范围变得可配,可以适用于需要不同数值范围的不同的应用场景。因此,本神经网络运算装置的应用范围更加广泛,使用更加灵活可变,可根据用户需求来做调整。
在一种实施例里，y为1，x的数值等于2^(m-1)。由此幂次神经元数据所表示的数值的指数范围为-2^(m-1)-1~2^(m-1)。
如图1L所示一种m为5的编码表的部分内容，幂次位数据为11111的时候对应指数数值为0。幂次位数据为11110的时候对应指数数值为1。幂次位数据为11101的时候对应指数数值为2。幂次位数据为11100的时候对应指数数值为3。幂次位数据为00000的时候对应幂次神经元数据为0。
编码表的对应关系可以是幂次位数据最高位代表置零位,幂次位数据其他m-1位对应指数数值。当幂次位数据最高位为0时,对应幂次神经元数据为0;当幂次位数据最高位为1时,对应幂次神经元数据不为0。反之亦可,即当幂次位数据最高位为1时,对应幂次神经元数据为0;当幂次位数据最高位为0时,对应幂次神经元数据不为0。用另一种语言来描述,即幂次神经元数据的幂次位被分出一个比特来指示幂次神经元数据是否为0。
在一个具体实例中，如图1M所示，符号位为1位，幂次位数据位为7位，即m为7。编码表为幂次位数据为11111111的时候对应幂次神经元数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。当幂次神经元数据符号位为0，幂次位为0001001，则其表示具体数值为2^9，即512；幂次神经元数据符号位为1，幂次位为1111101，则其表示具体数值为-2^(-3)，即-0.125。相对于浮点数据，幂次数据只保留数据的幂次位，极大减小了存储数据所需的存储空间。
通过幂次数据表示方法,可以减小存储神经元数据所需的存储空间。在本实施例所提供示例中,幂次数据为8位数据,应当认识到,该数据长度不是固定不变的,在不同场合下,根据数据神经元的数据范围采用不同的数据长度。
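为便于理解上述“符号位+幂次位”的幂次数据格式，下面给出一个与图1M设定一致（1位符号位、7位幂次位、幂次位全1表示0、其余按二进制补码解释）的Python解码草图，其中decode_power_neuron为说明性函数名，仅用于示意：

```python
def decode_power_neuron(bits, m=7):
    # bits 为长度 1+m 的0/1字符串：最高位为符号位，其余 m 位为幂次位
    sign = -1 if bits[0] == "1" else 1
    power = bits[1:]
    if power == "1" * m:              # 置零幂次位数据，对应幂次神经元数据为0
        return 0.0
    exp = int(power, 2)
    if power[0] == "1":               # 按二进制补码解释负指数
        exp -= 1 << m
    return sign * (2.0 ** exp)

print(decode_power_neuron("00001001"))  # 符号0，幂次位0001001 -> 2^9 = 512
print(decode_power_neuron("11111101"))  # 符号1，幂次位1111101 -> -2^-3 = -0.125
```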
步骤S1-2，根据运算微指令对权值数据和神经元数据进行神经网络运算。其中，所述步骤S1-2包括以下子步骤：
S1-21,译码单元从运算指令缓存单元中读取运算指令,并将其译码成各运算微指令;
S1-22,运算单元分别接收所述译码单元、输入神经元缓存单元及权值缓存单元发送的运算微指令、幂次神经元数据以及权值数据,并根据运算微指令对权值数据和幂次神经元数据进行神经网络运算。
所述幂次神经元与权值乘法操作具体为,幂次神经元数据符号位与权值数据符号位做异或操作;编码表的对应关系为乱序的情况下查找编码表找出幂次神经元数据幂次位对应的指数数值,编码表的对应关系为正相关的情况下记录编码表的指数数值最小值并做加法找出幂次神经元数据幂次位对应的指数数值,编码表的对应关系为负相 关的情况下记录编码表的最大值并做减法找出幂次神经元数据幂次位对应的指数数值;将指数数值与权值数据幂次位做加法操作,权值数据有效位保持不变。
具体实例一如图1N所示，权值数据为16位浮点数据，符号位为0，幂次位为10101，有效位为0110100000，则其表示的实际数值为1.40625*2^6。幂次神经元数据符号位为1位，幂次位数据位为5位，即m为5。编码表为幂次位数据为11111的时候对应幂次神经元数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次神经元为000110，则其表示的实际数值为64，即2^6。权值的幂次位加上幂次神经元的幂次位结果为11011，则结果的实际数值为1.40625*2^12，即为神经元与权值的乘积结果。通过该运算操作，使得乘法操作变为加法操作，减小计算所需的运算量。
具体实例二如图1O所示，权值数据为32位浮点数据，符号位为1，幂次位为10000011，有效位为10010010000000000000000，则其表示的实际数值为-1.5703125*2^4。幂次神经元数据符号位为1位，幂次位数据位为5位，即m为5。编码表为幂次位数据为11111的时候对应幂次神经元数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次神经元为111100，则其表示的实际数值为-2^(-4)。权值的幂次位加上幂次神经元的幂次位结果为01111111，则结果的实际数值为1.5703125*2^0，即为神经元与权值的乘积结果。
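下面用一段假设性的Python代码示意上述权值与幂次神经元的乘法流程（符号异或、指数相加、有效位保持不变）；其中16位浮点的指数偏置取15，函数名与变量名均为说明性假设，仅作草图参考：

```python
def multiply_weight_by_power_neuron(w_sign, w_exp_field, w_mantissa, n_sign, n_exp):
    # 符号位做异或操作
    out_sign = w_sign ^ n_sign
    # 幂次神经元的指数直接加到权值的指数域上，有效位（尾数）保持不变
    out_exp_field = w_exp_field + n_exp
    return out_sign, out_exp_field, w_mantissa

# 对应具体实例一：权值 1.40625*2^6（16位浮点，指数域 10101=21，偏置15），
# 幂次神经元 2^6（符号0，幂次位 00110，即指数6）
sign, exp_field, mant = multiply_weight_by_power_neuron(0, 0b10101, 0b0110100000, 0, 6)
value = (1 + mant / 1024) * 2 ** (exp_field - 15) * (-1 if sign else 1)
print(bin(exp_field), value)   # 0b11011, 1.40625*2^12 = 5760.0
```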
步骤S1-3,第一幂次转换单元将神经网络运算后的神经元数据转换成幂次神经元数据。
其中,所述步骤S1-3包括以下子步骤:
S1-31,输出神经元缓存单元接收所述计算单元发送的神经网络运算后得到的神经元数据;
S1-32,第一幂次转换单元接收所述输出神经元缓存单元发送的神经元数据,并将其中的非幂次神经元数据转换为幂次神经元数据。
其中,可选的幂次转换操作有多种,根据实际的应用需求选择。本实施例中列举了三种幂次转换操作:
第一种幂次转换方法:
s out=s in
d out+=⌊log 2(d in+)⌋
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌊x⌋表示对数据x做取下整操作。
第二种幂次转换方法:
s out=s in
d out+=⌈log 2(d in+)⌉
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
⌈x⌉表示对数据x做取上整操作。
第三种幂次转换方法:
s out=s in
d out+=[log 2(d in+)]
其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据;s in为输入数据的符号,s out为输出数据的符号;d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
[x]表示对数据x做四舍五入操作。
另外，通过幂次转换单元获得的幂次神经元数据可作为神经网络运算下一层的输入幂次神经元，再重复步骤S1-1至步骤S1-3直到神经网络最后一层运算结束。通过改变存储模块预存的整数值x和正整数值y，可以调整神经网络运算装置所能表示的幂次神经元数据范围。
在另一个实施例里,本公开还提供了一种使用所述神经网络运算装置的方法,通过改变存储模块预存的整数值x和正整数值y,以调整神经网络运算装置所能表示的幂次神经元数据范围。
在本公开的另一些实施例中,与前述实施例不同的是,所述运算装置的幂次转换模块,与所述运算模块连接,用于将神经网络运算的输入数据和/或输出数据转换为幂次数据。
具体的,所述输入数据包括输入神经元数据、输入权值数据,所述输出数据包括输出神经元数据、输出权值数据,所述幂次数据包括幂次神经元数据、幂次权值数据。
也就是说,在前述实施例的基础上,此处的幂次转换模块除了可以对神经元数据进行幂次转换之外,还可以对权值数据进行幂次转换,另外,运算结果中的权值数据转换为幂次权值数据之后,可直接发送至数据控制单元,参与后续运算。运算装置的其余模块、单元组成、功能用途及连接关系与前述实施例类似。
如图2A所示,本实施例神经网络运算装置中,包括存储模块2-4、控制模块2-3、运算模块2-1、输出模块2-5及幂次转换模块2-2。
所述存储模块包括:存储单元2-41,用于存储数据和指令;
所述控制模块包括:
数据控制单元2-31,与所述存储单元连接,用于存储单元和各缓存单元之间的数据和指令交互;
运算指令缓存单元2-32,与所述数据控制单元连接,用于接收数据控制单元发送的指令;
译码单元2-33,与所述指令缓存单元连接,用于从指令缓存单元中读取指令,并将其译码成各运算指令;
输入神经元缓存单元2-34,与所述数据控制单元连接,用于接收数据控制单元发 送的神经元数据;
权值缓存单元2-35,与所述数据控制单元连接,用于接收从数据控制单元发送的权值数据。
所述运算模块包括:运算单元2-11,与所述控制模块连接,接收该控制模块发送的数据和运算指令,并根据运算指令对其接收的神经元数据及权值数据执行神经网络运算;
所述输出模块包括:输出神经元缓存单元2-51,与所述运算单元连接,用于接收运算单元输出的神经元数据;并将其发送至所述数据控制单元。由此可作为下一层神经网络运算的输入数据;
所述幂次转换模块可包括:
第一幂次转换单元2-21,与所述输出神经元缓存单元及所述运算单元连接,用于将所述输出神经元缓存单元输出的神经元数据转换为幂次神经元数据以及将运算单元输出的权值数据转换为幂次权值数据;和/或
第二幂次转换单元2-22,与所述存储模块连接,用于将输入所述存储模块的神经元数据、权值数据分别转换为幂次神经元数据、幂次权值数据;
可选的,所述运算装置还包括:第三幂次转换单元2-23,与所述运算单元连接,用于将幂次神经元数据、幂次权值数据分别转换为非幂次神经元数据、非幂次权值数据。
需要说明的是,此处仅以幂次转换模块同时包括第一幂次转换单元、第二幂次转换单元和第三幂次转换单元为例进行说明,但实际上,所述幂次转换模块可以包括第一幂次转换单元、第二幂次转换单元和第三幂次转换单元的其中任一,同前述图1B、1F、1G所示的实施例。
非幂次神经元数据、权值数据经第二幂次转换单元转换为幂次神经元数据、幂次权值数据后输入至运算单元中执行运算,运算过程中,为提高精度,可通过设置第三幂次转换单元,将幂次神经元数据、幂次权值数据转换为非幂次神经元数据、非幂次权值数据,第三幂次转换单元可以设在所述运算模块外部也可以设在所述运算模块内部,运算之后输出的非幂次神经元数据可利用第一幂次转换单元转换成幂次神经元数据,再反馈至数据控制单元,参与后续运算,以加快运算速度,由此可形成闭合循环。
另外,所述权值数据幂次转换的具体操作方法与前述实施例相同,此处不再赘述。
在一些实施例中,所述神经网络为多层神经网络,对于每层神经网络可按图2B所示的运算方法进行运算,其中,神经网络第一层输入幂次权值数据可通过存储单元从外部地址读入,若外部地址读入的权值数据已经为幂次权值数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次权值数据。请参照图2B,本实施例单层神经网络运算方法,包括:
步骤S2-1,获取指令、神经元数据及幂次权值数据。
其中,所述步骤S2-1包括以下子步骤:
S2-11,将指令、神经元数据及权值数据输入存储单元;其中,对幂次权值数据直接输入存储单元,对非幂次权值数据经过幂次转换单元转换后输入存储单元;
S2-12,数据控制单元接收该存储单元发送的指令、神经元数据及幂次权值数据;
S2-13,指令缓存单元、输入神经元缓存单元及权值缓存单元分别接收所述数据控制单元发送的指令、神经元数据及幂次权值数据并分发给译码单元或运算单元。
所述幂次权值数据表示权值数据的数值采用其幂指数值形式表示,具体为,幂次权值数据包括符号位和幂次位,符号位用一位或多位比特位表示权值数据的符号,幂次位用m位比特位表示权值数据的幂次位数据,m为大于1的正整数。存储单元预存有编码表,提供幂次权值数据的每个幂次位数据对应的指数数值。编码表设置一个或者多个幂次位数据(即置零幂次位数据)为指定对应的幂次权值数据为0。也就是说,当幂次权值数据的幂次位数据是编码表里的置零幂次位数据时候,表示该幂次权值数据为0。编码表的对应关系与前述实施例中类似,此处不再赘述。
在一个具体实例中，如图2C所示，符号位为1位，幂次位数据位为7位，即m为7。编码表为幂次位数据为11111111的时候对应幂次权值数据为0，幂次位数据为其他数值的时候幂次权值数据对应相应的二进制补码。当幂次权值数据符号位为0，幂次位为0001001，则其表示具体数值为2^9，即512；幂次权值数据符号位为1，幂次位为1111101，则其表示具体数值为-2^(-3)，即-0.125。相对于浮点数据，幂次数据只保留数据的幂次位，极大减小了存储数据所需的存储空间。
通过幂次数据表示方法,可以减小存储权值数据所需的存储空间。在本实施例所提供示例中,幂次数据为8位数据,应当认识到,该数据长度不是固定不变的,在不同场合下,根据数据权值的数据范围采用不同的数据长度。
步骤S2-2，根据运算指令对神经元数据及幂次权值数据进行神经网络运算。其中，所述步骤S2-2包括以下子步骤：
S2-21,译码单元从指令缓存单元中读取指令,并将其译码成各运算指令;
S2-22,运算单元分别接收所述译码单元、输入神经元缓存单元及权值缓存单元发送的运算指令、幂次权值数据以及神经元数据,并根据运算指令对神经元数据及幂次表示的权值数据进行神经网络运算。
所述神经元与幂次权值乘法操作具体为,神经元数据符号位与幂次权值数据符号位做异或操作;编码表的对应关系为乱序的情况下查找编码表找出幂次权值数据幂次位对应的指数数值,编码表的对应关系为正相关的情况下记录编码表的指数数值最小值并做加法找出幂次权值数据幂次位对应的指数数值,编码表的对应关系为负相关的情况下记录编码表的最大值并做减法找出幂次权值数据幂次位对应的指数数值;将指数数值与神经元数据幂次位做加法操作,神经元数据有效位保持不变。
具体实例一如图2D所示，神经元数据为16位浮点数据，符号位为0，幂次位为10101，有效位为0110100000，则其表示的实际数值为1.40625*2^6。幂次权值数据符号位为1位，幂次位数据位为5位，即m为5。编码表为幂次位数据为11111的时候对应幂次权值数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次权值为000110，则其表示的实际数值为64，即2^6。幂次权值的幂次位加上神经元的幂次位结果为11011，则结果的实际数值为1.40625*2^12，即为神经元与幂次权值的乘积结果。通过该运算操作，使得乘法操作变为加法操作，减小计算所需的运算量。
具体实例二如图2E所示，神经元数据为32位浮点数据，符号位为1，幂次位为10000011，有效位为10010010000000000000000，则其表示的实际数值为-1.5703125*2^4。幂次权值数据符号位为1位，幂次位数据位为5位，即m为5。编码表为幂次位数据为11111的时候对应幂次权值数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次权值为111100，则其表示的实际数值为-2^(-4)。神经元的幂次位加上幂次权值的幂次位结果为01111111，则结果的实际数值为1.5703125*2^0，即为神经元与幂次权值的乘积结果。
可选的,还包括步骤S2-3,将神经网络运算后的神经元数据输出并作为下一层神经网络运算的输入数据。
其中,所述步骤S2-3可包括以下子步骤:
S2-31,输出神经元缓存单元接收所述计算单元发送的神经网络运算后得到的神经元数据。
S2-32,将输出神经元缓存单元接收的神经元数据传输给数据控制单元,通过输出神经元缓存单元获得的神经元数据可作为神经网络运算下一层的输入神经元,再重复步骤S2-1至步骤S2-3直到神经网络最后一层运算结束。
另外,通过幂次转换单元获得的幂次神经元数据可作为神经网络运算下一层的输入幂次神经元,再重复步骤S2-1至步骤S2-3直到神经网络最后一层运算结束。通过改变存储单元预存的整数值x和正整数值y,可以调整神经网络运算装置所能表示的幂次神经元数据范围。
在一些实施例中,所述神经网络为多层神经网络,对于每层神经网络可按图2F所示的运算方法进行运算,其中,神经网络第一层输入幂次权值数据可通过存储单元从外部地址读入,若外部地址读入的数据已经为幂次权值数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次权值数据;而神经网络第一层输入幂次神经元数据可通过存储单元从外部地址读入,若外部地址读入的数据已经为幂次数据则直接传入存储单元,否则先通过幂次转换单元转换为幂次神经元数据,此后各层神经网络的输入神经元数据可由在该层之前的一层或多层神经网络的输出幂次神经元数据提供。请参照图2F,本实施例单层神经网络运算方法,包括:
步骤S2-4,获取指令、幂次神经元数据及幂次权值数据。
其中,所述步骤S2-4包括以下子步骤:
S2-41,将指令、神经元数据及权值数据输入存储单元;其中,对幂次神经元数据及幂次权值数据直接输入存储单元,对非幂次神经元数据及非幂次权值数据则经过所述第一幂次转换单元转换为幂次神经元数据及幂次权值数据后输入存储单元;
S2-42,数据控制单元接收该存储单元发送的指令、幂次神经元数据及幂次权值数据;
S2-43,指令缓存单元、输入神经元缓存单元及权值缓存单元分别接收所述数据控制单元发送的指令、幂次神经元数据及幂次权值数据并分发给译码单元或运算单元。
所述幂次神经元数据及幂次权值数据表示神经元数据及权值数据的数值采用其幂指数值形式表示，具体为，幂次神经元数据及幂次权值数据均包括符号位和幂次位，符号位用一位或多位比特位表示神经元数据及权值数据的符号，幂次位用m位比特位表示神经元数据及权值数据的幂次位数据，m为大于1的正整数。存储单元预存有编码表，提供幂次神经元数据及幂次权值数据的每个幂次位数据对应的指数数值。编码表设置一个或者多个幂次位数据（即置零幂次位数据）为指定对应的幂次神经元数据及幂次权值数据为0。也就是说，当幂次神经元数据及幂次权值数据的幂次位数据是编码表里的置零幂次位数据时候，表示该幂次神经元数据及幂次权值数据为0。
在一个具体实例中，如图2G所示，符号位为1位，幂次位数据位为7位，即m为7。编码表为幂次位数据为11111111的时候对应幂次神经元数据及幂次权值数据为0，幂次位数据为其他数值的时候幂次神经元数据及幂次权值数据对应相应的二进制补码。当幂次神经元数据及幂次权值数据符号位为0，幂次位为0001001，则其表示具体数值为2^9，即512；幂次神经元数据及幂次权值数据符号位为1，幂次位为1111101，则其表示具体数值为-2^(-3)，即-0.125。相对于浮点数据，幂次数据只保留数据的幂次位，极大减小了存储数据所需的存储空间。
通过幂次数据表示方法,可以减小存储神经元数据及权值数据所需的存储空间。在本实施例所提供示例中,幂次数据为8位数据,应当认识到,该数据长度不是固定不变的,在不同场合下,根据神经元数据及权值数据的数据范围采用不同的数据长度。
步骤S2-5，根据运算指令对幂次神经元数据及幂次权值数据进行神经网络运算。其中，所述步骤S2-5包括以下子步骤：
S2-51,译码单元从指令缓存单元中读取指令,并将其译码成各运算指令;
S2-52,运算单元分别接收所述译码单元、输入神经元缓存单元及权值缓存单元发送的运算指令、幂次神经元数据及幂次权值数据,并根据运算指令对幂次神经元数据及幂次权值数据进行神经网络运算。
所述幂次神经元与幂次权值乘法操作具体为，幂次神经元数据符号位与幂次权值数据符号位做异或操作；编码表的对应关系为乱序的情况下查找编码表找出幂次神经元数据及幂次权值数据幂次位对应的指数数值，编码表的对应关系为正相关的情况下记录编码表的指数数值最小值并做加法找出幂次神经元数据及幂次权值数据幂次位对应的指数数值，编码表的对应关系为负相关的情况下记录编码表的最大值并做减法找出幂次神经元数据及幂次权值数据幂次位对应的指数数值；将幂次神经元数据对应的指数数值与幂次权值数据对应的指数数值做加法操作。
具体实例一如图2H所示，幂次神经元数据和幂次权值数据符号位为1位，幂次位数据位为4位，即m为4。编码表为幂次位数据为1111的时候对应幂次权值数据为0，幂次位数据为其他数值的时候幂次位数据对应相应的二进制补码。幂次神经元数据为00010，则其表示的实际数值为2^2。幂次权值为00110，则其表示的实际数值为64，即2^6。幂次神经元数据和幂次权值数据的乘积为01000，其表示的实际数值为2^8。
可以看到,幂次神经元数据和幂次权值的乘法运算相比于浮点数据的乘法以及浮点数据和幂次数据的乘法都更加的简单方便。
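对应上述幂次神经元与幂次权值的乘法，下面给出一段仅作示意的Python草图（设定同具体实例一：1位符号位、4位幂次位、幂次位全1表示0；decode、power_multiply为假设的说明性函数名，并非装置的实际实现）：

```python
def decode(bits, m=4):
    # 解析符号位与幂次位（二进制补码），返回 (符号, 指数)；置零幂次位数据返回 None
    sign = -1 if bits[0] == "1" else 1
    power = bits[1:]
    if power == "1" * m:
        return None
    exp = int(power, 2)
    if power[0] == "1":
        exp -= 1 << m
    return sign, exp

def power_multiply(a_bits, b_bits, m=4):
    a, b = decode(a_bits, m), decode(b_bits, m)
    if a is None or b is None:
        return 0.0
    sign = a[0] * b[0]          # 符号位相当于做异或操作
    exp = a[1] + b[1]           # 指数数值相加
    return sign * 2.0 ** exp

# 对应具体实例一：幂次神经元 00010 表示 2^2，幂次权值 00110 表示 2^6，乘积为 2^8 = 256
print(power_multiply("00010", "00110"))
```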
本实施例方法还可进一步包括,步骤S2-6,将神经网络运算后的神经元数据输出并作为下一层神经网络运算的输入数据。
其中,所述步骤S2-6包括以下子步骤:
S2-61,输出神经元缓存单元接收所述计算单元发送的神经网络运算后得到的神经元数据。
S2-62，将输出神经元缓存单元接收的神经元数据传输给数据控制单元，通过输出神经元缓存单元获得的神经元数据可作为神经网络运算下一层的输入神经元，再重复步骤S2-4至步骤S2-6直到神经网络最后一层运算结束。
由于神经网络运算后得到的神经元数据也为幂次数据,将其传输给数据控制单元所需带宽相比于浮点数据所需带宽大大减少,因此进一步减小了神经网络存储资源和计算资源的开销,提高了神经网络的运算速度。
另外,所述幂次转换的具体操作方法与前述实施例相同,此处不再赘述。
所公开的实施例的所有的单元都可以是硬件结构,硬件结构的物理实现包括但不局限于物理器件,物理器件包括但不局限于晶体管,忆阻器,DNA计算机。
本公开的一个实施例提供了一种运算装置,包括:
运算控制模块3-2,用于确定分块信息;以及
运算模块3-3,用于根据所述分块信息对运算矩阵进行分块、转置及合并运算,以得到所述运算矩阵的转置矩阵。
具体的,所述分块信息可以包括分块大小信息,分块方式信息,分块合并信息的至少一种。其中,分块大小信息表示将所述运算矩阵进行分块后,所获得的各个分块矩阵的大小信息。分开方式信息表示对所述运算矩阵进行分块的方式。分开合并信息表示将各个分块矩阵进行转置运算后,重新合并获得运算矩阵的转置矩阵的方式。
由于本公开运算装置可以对运算矩阵进行分块,通过对多个分块矩阵分别进行转置运算得到多个分块矩阵的转置矩阵,最终对多个分块矩阵的转置矩阵进行合并,得到运算矩阵的转置矩阵,因此可以实现使用一条单独指令在常数时间复杂度内完成任意大小矩阵的转置操作。相比较传统的矩阵转置操作实现方法,本公开在降低操作时间复杂度的同时也使矩阵转置操作的使用更为简单高效。
如图3A~3B所示,在本公开的一些实施例中,所述运算装置,还包括:
地址存储模块3-1,用于存储运算矩阵的地址信息;以及
数据存储模块3-4,用于存储原始矩阵数据,该原始矩阵数据包括所述运算矩阵,并存储运算后的转置矩阵;
其中,所述运算控制模块用于从地址存储模块提取运算矩阵的地址信息,并根据运算矩阵的地址信息分析得到分块信息;所述运算模块用于从运算控制模块获取运算矩阵的地址信息及分块信息,根据运算矩阵的地址信息从数据存储模块提取运算矩阵,并根据分块信息对运算矩阵进行分块、转置及合并运算,得到运算矩阵的转置矩阵,并将运算矩阵的转置矩阵反馈至数据存储模块。
如图3C所示,在本公开的一些实施例中,上述运算模块包括矩阵分块单元、矩阵运算单元和矩阵合并单元,其中:
矩阵分块单元3-31:用于从运算控制模块获取运算矩阵的地址信息及分块信息,并根据运算矩阵的地址信息从数据存储模块提取运算矩阵,根据分块信息对运算矩阵进行分块运算得到n个分块矩阵;
矩阵运算单元3-32,用于获取n个分块矩阵,并对n个分块矩阵分别进行转置运算,得到n个分块矩阵的转置矩阵;
矩阵合并单元3-33,用于获取并合并n个分块矩阵的转置矩阵,得到所述运算矩阵的转置矩阵,其中,n为自然数。
举例而言，如图3D所示，对于存储于数据存储模块中的一运算矩阵X，运算模块的矩阵分块单元从数据存储模块中提取所述运算矩阵X，根据分块信息对运算矩阵X进行分块运算操作，得到4个分块矩阵X1、X2、X3、X4，并输出至矩阵运算单元；矩阵运算单元从矩阵分块单元中获取这4个分块矩阵，并对这4个分块矩阵分别进行转置运算操作，得到4个分块矩阵的转置矩阵X1^T、X2^T、X3^T、X4^T，并输出至矩阵合并单元；矩阵合并单元从矩阵运算单元中获取这4个分块矩阵的转置矩阵并进行合并，得到运算矩阵的转置矩阵X^T，还可进一步将转置矩阵X^T输出至数据存储模块。
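下面给出一段示意性的Python代码，按2×2分块演示“分块—各分块转置—合并”的流程（split_blocks、merge_blocks等均为说明性函数名，仅为帮助理解，并非矩阵分块单元、矩阵运算单元与矩阵合并单元的实际实现）：

```python
def split_blocks(mat, br, bc):
    # 分块：按 br x bc 的块大小把矩阵切成块网格
    return [[[row[c:c + bc] for row in mat[r:r + br]]
             for c in range(0, len(mat[0]), bc)]
            for r in range(0, len(mat), br)]

def transpose(block):
    # 对单个分块矩阵做转置运算
    return [list(col) for col in zip(*block)]

def merge_blocks(tblocks):
    # 合并：块网格本身需要转置，块内已是各分块的转置结果
    grid = [list(col) for col in zip(*tblocks)]
    merged = []
    for block_row in grid:
        for r in range(len(block_row[0])):
            merged.append([v for blk in block_row for v in blk[r]])
    return merged

X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
blocks = split_blocks(X, 2, 2)                              # 得到 X1、X2、X3、X4
tblocks = [[transpose(b) for b in row] for row in blocks]   # 各分块分别转置
XT = merge_blocks(tblocks)                                  # 合并得到 X 的转置
print(XT == [list(col) for col in zip(*X)])                 # True
```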
在本公开的一些实施例中,上述运算模块还包括缓存单元3-34,用于缓存n个分块矩阵,以供矩阵运算单元获取。
在本公开的一些实施例中,上述矩阵合并单元还可以包括存储器,用于暂时存储获取的分块矩阵的转置矩阵,当矩阵运算单元完成所有分块矩阵的运算后,矩阵合并单元即可获取到所有分块矩阵的转置矩阵,再对n个分块矩阵的转置矩阵进行合并操作,得到转置后的矩阵,并将输出结果写回到数据存储模块中。
本领域技术人员应当可以理解的是,上述矩阵分块单元、矩阵运算单元及矩阵合并单元既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。所述矩阵分块单元及矩阵合并单元可包括一个或多个控制元件、所述矩阵运算单元可包括一个或多个控制元件、计算元件。
如图3E所示,在本公开的一些实施例中,上述运算控制模块包括指令处理单元3-22、指令缓存单元3-21和矩阵判断单元3-23,其中:
指令缓存单元,用于存储待执行的矩阵运算指令;
指令处理单元,用于从指令缓存单元中获取矩阵运算指令,对矩阵运算指令进行译码,并根据译码后的矩阵运算指令从地址存储模块中提取运算矩阵的地址信息;
矩阵判断单元,用于根据运算矩阵的地址信息判断是否需要进行分块,并根据判断结果得到分块信息。
在本公开的一些实施例中,上述运算控制模块还包括依赖关系处理单元3-24,用于判断译码后的矩阵运算指令和运算矩阵的地址信息是否与上一运算存在冲突,若存在冲突,则暂存译码后的矩阵运算指令和运算矩阵的地址信息;若不存在冲突,则发射译码后的矩阵运算指令和运算矩阵的地址信息至矩阵判断单元。
在本公开的一些实施例中,上述运算控制模块还包括指令队列存储器3-25,用于缓存存在冲突的译码后的矩阵运算指令和运算矩阵的地址信息,当所述冲突消除后,将所缓存的译码后的矩阵运算指令和运算矩阵的地址信息发射至矩阵判断单元。
具体地,矩阵运算指令访问数据存储模块时,前后指令可能会访问同一块存储空间,为了保证指令执行结果的正确性,当前指令如果被检测到与之前的指令的数据存在依赖关系,该指令必须在指令队列存储器内等待至依赖关系被消除。
在本公开的一些实施例中,上述指令处理单元包括取指单元3-221和译码单元3-222,其中:
取指单元,用于从指令缓存单元中获取矩阵运算指令,并将此矩阵运算指令传输至译码单元;
译码单元,用于对矩阵运算指令进行译码,根据该译码后的矩阵运算指令从地址存储模块中提取运算矩阵的地址信息,并将译码后的矩阵运算指令和提取的运算矩阵的地址信息传输至所述依赖关系处理单元。
在本公开的一些实施例中,上述运算装置还包括输入输出模块,用于向数据存储模块输入所述运算矩阵,还用于从数据存储模块获取运算后的转置矩阵,并输出运算后的转置矩阵。
在本公开的一些实施例中,上述运算矩阵的地址信息为矩阵的起始地址信息和矩阵大小信息。
在本公开的一些实施例中,运算矩阵的地址信息是矩阵在数据存储模块中的存储地址。
在本公开的一些实施例中,地址存储模块为标量寄存器堆或通用内存单元;数据存储模块为高速暂存存储器或通用内存单元。
在本公开的一些实施例中,地址存储模块可以是标量寄存器堆,提供运算过程中所需的标量寄存器,标量寄存器不只存放矩阵地址,还可存放有标量数据。当对大规模矩阵进行转置时进行了分块操作后,标量寄存器中的标量数据可以用于记录矩阵块的数量。
在本公开的一些实施例中,数据存储模块可以是高速暂存存储器,能够支持不同大小的矩阵数据。
在本公开的一些实施例中,矩阵判断单元用于判断矩阵大小,如果超过规定的最大规模M,则需要对矩阵进行分块操作,矩阵判断单元根据此判断结果分析得到分块信息。
在本公开的一些实施例中,指令缓存单元,用于存储待执行的矩阵运算指令。指令在执行过程中,同时也被缓存在指令缓存单元中,当一条指令执行完之后,如果该指令同时也是指令缓存单元中未被提交指令中最早的一条指令,该指令将被提交,一旦提交,该条指令进行的操作对装置状态的改变将无法撤销。该指令缓存单元可以是重排序缓存。
在本公开的一些实施例中,矩阵运算指令为矩阵转置运算指令,包括操作码和操作域,其中,操作码用于指示该矩阵转置运算指令的功能,矩阵运算控制模块通过识别该操作码确认进行矩阵转置操作,操作域用于指示该矩阵转置运算指令的数据信息,其中,数据信息可以是立即数或寄存器号,例如,要获取一个矩阵时,根据寄存器号可以在相应的寄存器中获取矩阵起始地址和矩阵规模,再根据矩阵起始地址和矩阵规模在数据存储模块中获取相应地址存放的矩阵。
本公开使用一种新的运算结构简单高效的实现对矩阵的转置运算,降低了这一运算的时间复杂度。
本公开还公开了一种运算方法,包括以下步骤:
步骤1、运算控制模块从地址存储模块提取运算矩阵的地址信息;
步骤2、运算控制模块根据运算矩阵的地址信息得到分块信息,并将运算矩阵的地址信息和分块信息传输至运算模块;
步骤3、运算模块根据运算矩阵的地址信息从数据存储模块提取运算矩阵;并根据分块信息将运算矩阵分成n个分块矩阵;
步骤4、运算模块对n个分块矩阵分别进行转置运算,得到n个分块矩阵的转置矩阵;
步骤5、运算模块合并n个分块矩阵的转置矩阵,得到运算矩阵的转置矩阵并反馈至数据存储模块;
其中,n为自然数。
以下通过具体实施例对本公开提出的运算装置及方法进行详细描述。
在一些实施例中,如图3F所示,本实施例提出一种运算装置,包括地址存储模块、运算控制模块、运算模块、数据存储模块和输入输出模块3-5,其中
可选的,所述运算控制模块包括指令缓存单元、指令处理单元、依赖关系处理单元、指令队列存储器和矩阵判断单元,其中指令处理单元又包括取指单元和译码单元;
可选的,所述运算模块包括矩阵分块单元、矩阵缓存单元、矩阵运算单元和矩阵合并单元;
可选的,所述地址存储模块为一标量寄存器堆;
可选的,所述数据存储模块为一高速暂存存储器;输入输出模块为一IO直接内存存取模块。
以下对运算装置的各组成部分进行详细说明:
取指单元,该单元负责从指令缓存单元中取出下一条将要执行的运算指令,并将该运算指令传给译码单元;
译码单元,该单元负责对运算指令进行译码,并将译码后的运算指令发送至标量寄存器堆,得到标量寄存器堆反馈的运算矩阵的地址信息,将译码后的运算指令和得到的运算矩阵的地址信息传输给依赖关系处理单元;
依赖关系处理单元,该单元处理运算指令与前一条指令可能存在的存储依赖关系。矩阵运算指令会访问高速暂存存储器,前后指令可能会访问同一块存储空间。为了保证指令执行结果的正确性,当前运算指令如果被检测到与之前的运算指令的数据存在依赖关系,该运算指令必须缓存至指令队列存储器内等待至依赖关系被消除;如当前运算指令与之前的运算指令不存在依赖关系,则依赖关系处理单元直接将运算矩阵的地址信息和译码后的运算指令传输至矩阵判断单元。
指令队列存储器,考虑到不同运算指令所对应的/指定的标量寄存器上有可能存在依赖关系,用于缓存存在冲突的译码后的运算指令和相应的运算矩阵的地址信息,当依赖关系被满足之后发射译码后的运算指令和相应的运算矩阵的地址信息至矩阵判断单元;
矩阵判断单元,用于根据运算矩阵的地址信息判断矩阵大小,如果超过规定的最大规模M,则需要对矩阵进行分块操作,矩阵判断单元根据此判断结果分析得到分块信息,并将运算矩阵的地址信息和得到的分块信息传输至矩阵分块单元。
矩阵分块单元,该单元负责根据运算矩阵的地址信息,从高速暂存器中提取需进行转置运算的运算矩阵,并根据分块信息对该运算矩阵进行分块,得到n个分块矩阵。矩阵缓存单元,该单元用于缓存经过分块后的n个分块矩阵,依次传输至矩阵运算单元进行转置运算;
矩阵运算单元,负责依次从矩阵缓存单元中提取分块矩阵进行转置运算,并将转置后的分块矩阵传输至矩阵合并单元;
矩阵合并单元,负责接收并暂时缓存转置后的分块矩阵,待所有分块矩阵都进行完转置运算后,对n个分块矩阵的转置矩阵进行合并运算,得到运算矩阵的转置矩阵。
标量寄存器堆,提供装置在运算过程中所需的标量寄存器,为运算提供运算矩阵的地址信息;
高速暂存器,该模块是矩阵数据专用的暂存存储装置,能够支持不同大小的矩阵数据。
IO内存存取模块,该模块用于直接访问高速暂存存储器,负责从高速暂存存储器中读取数据或写入数据。
在一些实施例中,如图3G所示,本实施例提出一种运算方法,用于执行大规模矩阵的转置运算,具体包括以下步骤:
步骤1、运算控制模块从地址存储模块提取运算矩阵的地址信息,具体包括以下步骤:
步骤1-1、取指单元提取运算指令,并将运算指令送至译码单元;
步骤1-2、译码单元对运算指令进行译码,根据译码后的运算指令从地址存储模块获取运算矩阵的地址信息,并将译码后的运算指令和运算矩阵的地址信息送往依赖关系处理单元;
步骤1-3、依赖关系处理单元分析该译码后的运算指令与前面的尚未执行结束的指令在数据上是否存在依赖关系;具体而言,所述依赖关系处理单元可根据运算指令所需读取的寄存器的地址,判断该寄存器是否存在待写入的情况,如有,则存在依赖关系,待数据写回后,才可执行该运算指令。
若存在依赖关系,该条译码后的运算指令与相应的运算矩阵的地址信息需要在指令队列存储器中等待至其与前面的未执行结束的指令在数据上不再存在依赖关系为止;
步骤2、运算控制模块根据运算矩阵的地址信息得到分块信息;
具体的,依赖关系不存在后,指令队列存储器发射该条译码后的运算指令与相应的运算矩阵的地址信息至矩阵判断单元,判断矩阵是否需要进行分块,矩阵判断单元根据判断结果得到分块信息,并将分块信息和运算矩阵的地址信息传输至矩阵分块单元;
步骤3、运算模块根据运算矩阵的地址信息从数据存储模块提取运算矩阵;并根据分块信息将运算矩阵分成n个分块矩阵;
具体的,矩阵分块单元根据传入的运算矩阵的地址信息从数据存储模块中取出需要的运算矩阵,再根据传入的分块信息,将运算矩阵分成n个分块矩阵,完成分块后依次将每个分块矩阵传入到矩阵缓存单元;
步骤4、运算模块分别对n个分块矩阵进行转置运算,得到n个分块矩阵的转置矩阵;
具体的,矩阵运算单元依次从矩阵缓存单元提取分块矩阵,并对每个提取的分块矩阵进行转置操作,再将得到的每个分块矩阵的转置矩阵传入到矩阵合并单元。
步骤5、运算模块合并n个分块矩阵的转置矩阵,得到运算矩阵的转置矩阵,并将该转置矩阵反馈给数据存储模块,具体包括以下步骤:
步骤5-1、矩阵合并单元接收每个分块矩阵的转置矩阵,当接收到的分块矩阵的转置矩阵数量达到总的分块数后,对所有的分块进行矩阵合并操作,得到运算矩阵的转置矩阵;并将该转置矩阵反馈至数据存储模块的指定地址;
步骤5-2、输入输出模块直接访问数据存储模块,从数据存储模块中读取运算得到的运算矩阵的转置矩阵。
本公开提到的向量可以是0维向量,1维向量,2维向量或者多维向量。其中,0维向量也可以被称为标量,2维也可以被称为矩阵。
本公开的一个实施例提出了一种数据筛选装置,参见图4A,包括:存储单元4-3,用于存储数据和指令,其中数据包括待筛选数据和位置信息数据。
寄存器单元4-2,用于存放存储单元中的数据地址。
数据筛选模块4-1,包括数据筛选单元4-11,数据筛选模块根据指令在寄存器单元中获取数据地址,根据该数据地址在存储单元中获取相应的数据,并根据获取的数据进行筛选操作,得到数据筛选结果。
数据筛选单元功能示意图如图4B所示,其输入数据为待筛选数据和位置信息数据,输出数据可以只包含筛选后数据,也可以同时包含筛选后数据的相关信息,相关信息例如是向量长度、数组大小、所占空间等。
进一步地,参见图4C,本实施例的数据筛选装置具体包括:
存储单元4-3,用于存储待筛选数据、位置信息数据和指令。
寄存器单元4-2,用于存放存储单元中的数据地址。
数据筛选模块4-1包括:
指令缓存单元4-12,用于存储指令。
控制单元4-13,从指令缓存单元中读取指令,并将其译码成具体操作微指令。
I/O单元4-16,将存储单元中的指令搬运到指令缓存单元中,将存储单元中的数据搬运到输入数据缓存单元和输出缓存单元中,也可以将输出缓存单元中的输出数据搬运到存储单元中。
输入数据缓存单元4-14,存储I/O单元搬运的数据,包括待筛选数据和位置信息数据。
数据筛选单元4-11,用于接收控制单元传来的微指令,并从寄存器单元获取数据地址,将输入数据缓存单元传来的待筛选数据和位置信息数据作为输入数据,对输入数据进行筛选操作,完成后将筛选后数据传给输出数据缓存单元。
输出数据缓存单元4-15,存储输出数据,输出数据可以只包含筛选后数据,也可以同时包含筛选后数据的相关信息,如向量长度、数组大小、所占空间等等。
本实施例的数据筛选装置适用于多种筛选对象，待筛选数据可以是向量或高维数组等，位置信息数据既可以是二进制码，也可以是向量或高维数组等，其每个分量为0或1。其中，待筛选数据的分量和位置信息数据的分量可以是一一对应的。本领域技术人员应当可以理解的是，位置信息数据的每个分量为1或0只是位置信息的一种示例性的表示方式，位置信息的表示方式并不局限在这一种表示方式。
可选的,对于位置信息数据中每个分量使用0或1表示时,数据筛选单元对输入数据进行筛选操作具体包括,数据筛选单元扫描位置信息数据的每个分量,若分量为0,则删除该分量对应的待筛选数据分量,若分量为1,则保留该分量对应的待筛选数据分量;或者,若位置信息数据的分量为1,则删除该分量对应的待筛选数据分量,若分量为0,则保留该分量对应的待筛选数据分量。当扫描完毕,则筛选完成,得到筛选后数据并输出。此外,在进行筛选操作的同时,还可以对筛选后数据的相关信息进行记录,如向量长度、数组大小、所占空间等等,根据具体情况决定是否同步进行相关信息的记录和输出。需要说明的是,位置信息数据的每个分量采用其他表示方式进行表示时,数据筛选单元还可以配置与表示方式相应的筛选操作。
以下通过实例说明数据筛选的过程。
例一:
设待筛选数据为向量(1 0 101 34 243),需要筛选的是小于100的分量,那么输入的位置信息数据也是向量,即向量(1 1 0 1 0)。经过筛选后的数据可以仍保持向量结构,并且可以同时输出筛选后数据的向量长度。
其中，位置信息向量可以是外部输入的，也可以是内部生成的。可选的，本公开装置还可以包括位置信息生成模块，位置信息生成模块可以用于生成位置信息向量，该位置信息生成模块与数据筛选单元连接。具体而言，位置信息生成模块可以通过向量运算生成位置信息向量，向量运算可以是向量比较运算，即通过对待筛选向量的分量逐一与预设数值比较大小得到。需要说明的是，位置信息生成模块还可以根据预设条件选择其他向量运算生成位置信息向量。本例中规定位置信息数据的分量为1则保留对应的待筛选数据分量，分量为0则删除对应的待筛选数据分量。
数据筛选单元初始化一个变量length=0,用以记录筛选后数据的向量长度;
数据筛选单元读取输入数据缓存单元的数据,扫描位置信息向量的第1个分量,发现其值为1,则保留待筛选向量的第1个分量的值1,length=length+1;
扫描位置信息向量的第2个分量,发现其值为1,则保留待筛选向量的第2个分量的值0,length=length+1;
扫描位置信息向量的第3个分量,发现其值为0,则删去待筛选向量的第3个分量的值101,length不变;
扫描位置信息向量的第4个分量,发现其值为1,则保留待筛选向量的第4个分量的值34,length=length+1;
扫描位置信息向量的第5个分量，发现其值为0，则删去待筛选向量的第5个分量的值243，length不变；
保留下来的值组成筛选后向量(1 0 34)及其向量长度为length=3,并由输出数据缓存单元存储。
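下面用一段假设性的Python代码示意例一中数据筛选单元的扫描过程（位置信息分量为1保留、为0删除，并同步记录筛选后向量长度等相关信息；data_filter为说明性函数名，并非装置的实际实现）：

```python
def data_filter(data, positions, keep_on=1):
    # 逐分量扫描位置信息数据，决定保留或删除对应的待筛选数据分量
    filtered = []
    for value, flag in zip(data, positions):
        if flag == keep_on:
            filtered.append(value)
    return filtered, len(filtered)   # 同时给出筛选后数据的向量长度

data = [1, 0, 101, 34, 243]
positions = [1, 1, 0, 1, 0]
result, length = data_filter(data, positions)
print(result, length)   # [1, 0, 34] 3
```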
在本实施例的数据筛选装置中,数据筛选模块还可以包括一结构变形单元4-17,可以对输入数据缓存单元的输入数据、以及输出数据缓存单元的输出数据的存储结构进行变形,如将高维数组展开为向量,将向量变为高维数组等等。可选的,将高维数据展开的方式可以是行优先,也可以是列优先,还可以根据具体情况选择其他的展开方式。
例二:
设待筛选数据为四维数组
（1 4；61 22），
需要筛选的是偶数值，那么输入的位置信息数组是
（0 1；0 1）。
筛选后数据为向量结构,不输出相关信息。本例中规定位置信息数据的分量为1,则保留对应的待筛选数据分量,分量为0则删除对应待筛选数据分量。
数据筛选单元读取输入数据缓存单元的数据,扫描位置信息数组的第(1,1)个分量,发现其值为0,则删去待筛选数组的第(1,1)个分量的值1;
扫描位置信息数组的第(1,2)个分量,发现其值为1,则保留待筛选数组的第(1,2)个分量的值4;
扫描位置信息数组的第(2,1)个分量,发现其值为0,则删去待筛选数组的第(2,1)个分量的值61;
扫描位置信息数组的第(2,2)个分量,发现其值为1,则保留待筛选数组的第(2,2)个分量的值22;
结构变形单元将保留下来的值变为向量,即筛选后数据为向量(4 22),并由输出数据缓存单元存储。
在一些实施例中，如图4D所示，所述数据筛选模块还可进一步包括：计算单元4-18。由此，本公开装置可同时实现数据筛选及处理，即可得到一种数据筛选及处理装置。所述计算单元的具体结构同前述实施例，此处不再赘述。
本公开提供了一种利用所述的数据筛选装置进行数据筛选的方法,包括:
数据筛选模块在寄存器单元中获取数据地址;
根据数据地址在存储单元中获取相应的数据;以及
对获取的数据进行筛选操作,得到数据筛选结果。
在一些实施例中,所述数据筛选模块在寄存器单元中获取数据地址的步骤包括:数据筛选单元从寄存器单元中获取待筛选数据的地址和位置信息数据的地址。
在一些实施例中,所述根据数据地址在存储单元中获取相应的数据的步骤包括以下子步骤:
I/O单元将存储单元中的待筛选数据和位置信息数据传递给输入数据缓存单元;以及
输入数据缓存单元将待筛选数据和位置信息数据传递给数据筛选单元。
可选的，所述I/O单元将存储单元中的待筛选数据和位置信息数据传递给输入数据缓存单元的子步骤和所述输入数据缓存单元将待筛选数据和位置信息数据传递给数据筛选单元的子步骤之间还包括：判断是否进行存储结构变形。
如果进行存储结构变形,则输入数据缓存单元将待筛选数据传给结构变形单元,结构变形单元进行存储结构变形,将变形后的待筛选数据传回输入数据缓存单元,之后执行所述输入数据缓存单元将待筛选数据和位置信息数据传递给数据筛选单元的子步骤;如果否,则直接执行所述输入数据缓存单元将待筛选数据和位置信息数据传递给数据筛选单元的子步骤。
在一些实施例中,所述对获取的数据进行筛选操作,得到数据筛选结果的步骤包括:数据筛选单元根据位置信息数据,对待筛选数据进行筛选操作,将输出数据传递给输出数据缓存单元。
如图4E所示,在本公开的一具体实施例中,所述数据筛选方法的步骤具体如下:
步骤S4-1,控制单元从指令缓存单元中读入一条数据筛选指令,并将其译码为具体操作微指令,传递给数据筛选单元;
步骤S4-2,数据筛选单元从寄存器单元中获取待筛选数据和位置信息数据的地址;
步骤S4-3,控制单元从指令缓存单元中读入一条I/O指令,将其译码为具体操作微指令,传递给I/O单元;
步骤S4-4,I/O单元将存储单元中的待筛选数据和位置信息数据传递给输入数据缓存单元;
判断是否进行存储结构变形,如果是,则执行步骤S4-5,如果否,则直接执行步骤S4-6。
步骤S4-5,输入数据缓存单元将数据传给结构变形单元,并进行相应存储结构变形,然后将变形后的数据传回输入数据缓存单元,然后转步骤S4-6;
步骤S4-6,输入数据缓存单元将数据传递给数据筛选单元,数据筛选单元根据位置信息数据,对待筛选数据进行筛选操作;
步骤S4-7,将输出数据传递给输出数据缓存单元,其中输出数据可以只包含筛选后数据,也可以同时包含筛选后数据的相关信息,如向量长度,数组大小、所占空间等等。
至此,已经结合附图对本公开实施例进行了详细描述。依据以上描述,本领域技术人员应当对本公开的一种数据筛选装置和方法有了清楚的认识。
本公开的一个实施例提供了一种神经网络处理器,包括:存储器、高速暂存存储器和异构内核;其中,所述存储器,用于存储神经网络运算的数据和指令;所述高速暂存存储器,通过存储器总线与所述存储器连接;所述异构内核,通过高速暂存存储器总线与所述高速暂存存储器连接,通过高速暂存存储器读取神经网络运算的数据和指令,完成神经网络运算,并将运算结果送回到高速暂存存储器,控制高速暂存存储器将运算结果写回到存储器。
其中,所述异构内核指包括至少两种不同类型的内核,也即两种不同结构的内核。
在一些实施例中,所述异构内核包括:多个运算内核,其具有至少两种不同类型的运算内核,用于执行神经网络运算或神经网络层运算;以及一个或多个逻辑控制内核,用于根据神经网络运算的数据,决定由所述专用内核和/或所述通用内核执行神经网络运算或神经网络层运算。
进一步的，所述多个运算内核包括m个通用内核和n个专用内核；其中，所述专用内核专用于执行指定神经网络/神经网络层运算，所述通用内核用于执行任意神经网络/神经网络层运算。可选的，所述通用内核可以为CPU，所述专用内核可以为NPU。
在一些实施例中，所述高速暂存存储器包括共享高速暂存存储器和/或非共享高速暂存存储器；其中，每一共享高速暂存存储器通过高速暂存存储器总线与所述异构内核中的至少两个内核对应连接；每一非共享高速暂存存储器通过高速暂存存储器总线与所述异构内核中的一个内核对应连接。
具体而言,所述高速暂存存储器可以仅包括一个或多个共享高速暂存存储器,每一个共享高速暂存存储器与异构内核中的多个内核(逻辑控制内核、专用内核或通用内核)连接。所述高速暂存存储器也可以仅包括一个或多个非共享高速暂存存储器,每一个非共享高速暂存存储器与异构内核中的一个内核(逻辑控制内核、专用内核或通用内核)连接。所述高速暂存存储器也可以同时包括一个或多个共享高速暂存存储器、以及一个或多个非共享高速暂存存储器,其中,每一个共享高速暂存存储器与异构内核中的多个内核(逻辑控制内核、专用内核或通用内核)连接,每一个非共享高速暂存存储器与异构内核中的一个内核(逻辑控制内核、专用内核或通用内核)连接。
在一些实施例中，所述逻辑控制内核通过高速暂存存储器总线与所述高速暂存存储器连接，通过高速暂存存储器读取神经网络运算的数据，并根据神经网络运算的数据中的神经网络模型的类型和参数，决定由专用内核和/或通用内核作为目标内核来执行神经网络运算和/或神经网络层运算。其中，可以在内核之间加入通路，所述逻辑控制内核可通过控制总线直接向目标内核发送信号，也可经所述高速暂存存储器向目标内核发送信号，从而控制目标内核执行神经网络运算和/或神经网络层运算。
本公开的一个实施例提出了一种异构多核神经网络处理器,参见图5A,包括:存储器11、非共享高速暂存存储器12和异构内核13。
存储器11，用于存储神经网络运算的数据和指令，数据包括偏置、权值、输入数据、输出数据、以及神经网络模型的类型和参数等。当然，所述输出数据也可以不存储在存储器；指令包括神经网络运算对应的各种指令，例如CONFIG指令、COMPUTE指令、IO指令、NOP指令、JUMP指令、MOVE指令等。存储器11中存储的数据和指令可以通过非共享高速暂存存储器12传送到异构内核13中。
非共享高速暂存存储器12,包括多个高速暂存存储器121,每一高速暂存存储器121均通过存储器总线与存储器11连接,通过高速暂存存储器总线与异构内核13连接,实现异构内核13与非共享高速暂存存储器12之间、非共享高速暂存存储器12与存储器11之间的数据交换。当异构内核13所需的神经网络运算数据或指令未存储在非共享高速暂存存储器12中时,非共享高速暂存存储器12先通过存储器总线从存储器11中读取所需的数据或指令,然后将其通过高速暂存存储器总线送入到异构内核13中。
异构内核13,包括一逻辑控制内核131、一通用内核132以及多个专用内核133,逻辑控制内核131、通用内核132以及每一专用内核133均通过高速暂存存储器总线与一高速暂存存储器121对应连接。
异构内核13用以从非共享高速暂存存储器12中读取神经网络运算的指令和数据，完成神经网络运算，并将运算结果送回到非共享高速暂存存储器12，控制非共享高速暂存存储器12将运算结果写回到存储器11。
逻辑控制内核131从非共享高速暂存存储器12中读入神经网络运算数据和指令，根据数据中的神经网络模型的类型和参数，判断是否存在支持该神经网络运算且能完成该神经网络运算规模的专用内核133，如果存在，则交由对应的专用内核133完成该神经网络运算，如果不存在，则交由通用内核132完成该神经网络运算。为了确定专用内核的位置及其是否空闲，可以为每类内核（支持同一层的专用内核属于一类，通用内核属于一类）设置一张表（称为专用/通用内核信息表），表中记录同类内核的编号（或地址）以及其当前是否空闲，初始均为空闲，之后空闲状态的更改由逻辑控制内核与内核间的直接或间接通信来维护，表中的内核编号可以在此网络处理器初始化时进行一遍扫描得到，这样即可支持动态可配置的异构内核（即可以随时更改异构内核中的专用处理器类型及数目，更改之后就会扫描更新内核信息表）；可选的，也可以不支持异构内核的动态配置，这时就只需要把表中的内核编号固定即可，不需要多次扫描更新；可选的，如果每类专用内核的编号总是连续的，可以记录一个基址，然后用连续的若干比特位即可表示这些专用内核，用比特0或1即可表示其是否处于空闲状态。为了判断出网络模型的类型和参数，可以在逻辑控制内核中设置一个译码器，根据指令判断出网络层的类型，并可以判断出是通用内核的指令还是专用内核的指令，参数、数据地址等也可以从指令中解析出来；可选的，还可以规定数据包含一个数据头，其中包含各网络层的编号及其规模，以及对应计算数据及指令的地址等，并设一个专门的解析器（软件或硬件均可）来解析这些信息；可选的，将解析出的信息存储到指定区域。为了根据解析出来的网络层编号及规模决定使用哪个内核，可以在逻辑控制内核中设置一个内容寻址存储器CAM（content addressable memory），其中的内容可以实现为可配置的，这就需要逻辑控制内核提供一些指令来配置/写这个CAM，CAM中的内容有网络层编号，各个维度所能支持的最大规模，以及支持该层的专用内核信息表的地址和通用内核信息表地址，此方案下，用解析出来的层编号来找到对应表项，并比较规模限制；若满足则取专用内核信息表的地址，去其中寻找一个空闲的专用内核，根据其编号发送控制信号，为其分配计算任务；如果在CAM中没有找到对应层，或超出了规模限制，或专用内核信息表中无空闲内核，则在通用内核信息表中寻找一个空闲的通用内核，根据其编号发送控制信号，为其分配计算任务；如果在两张表中均未找到空闲的内核，则将此任务添加到一个等待队列中，并添加一些必要的信息，一旦有一个可计算此任务的空闲内核，则将其分配给它进行计算。
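为帮助理解上述借助专用/通用内核信息表选择目标内核的流程，下面给出一段高度简化的Python草图（表结构、字段与dispatch函数均为说明性假设，与实际硬件及CAM实现无关）：

```python
# 专用内核信息表：按网络层类型分类，记录内核编号与是否空闲
special_table = {
    "conv": [{"id": 0, "idle": True}, {"id": 1, "idle": False}],
    "pool": [{"id": 2, "idle": True}],
}
# 通用内核信息表
general_table = [{"id": 100, "idle": True}, {"id": 101, "idle": True}]
wait_queue = []

def dispatch(layer_type, scale, max_scale=1024):
    # 先在支持该层且规模不超限的专用内核中寻找空闲内核
    if layer_type in special_table and scale <= max_scale:
        for core in special_table[layer_type]:
            if core["idle"]:
                core["idle"] = False
                return ("special", core["id"])
    # 否则退回到通用内核
    for core in general_table:
        if core["idle"]:
            core["idle"] = False
            return ("general", core["id"])
    # 两张表中均未找到空闲内核时加入等待队列
    wait_queue.append((layer_type, scale))
    return ("wait", None)

print(dispatch("conv", 512))   # ('special', 0)
print(dispatch("lstm", 512))   # ('general', 100)
```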
当然,确定专用内核的位置及其是否空闲可以有多种方式,上述确定专用内核的位置及其是否空闲的方式仅是作为示例说明。每一专用内核133可以独立完成一种神经网络运算,例如脉冲神经网络(SNN)运算等指定的神经网络运算,并将运算结果写回到其对应连接的高速暂存存储器121中,控制该高速暂存存储器121将运算结果写回到存储器11。
通用内核132可以独立完成超过专用内核所能支持的运算规模的或者所有专用内核133都不支持的神经网络运算,并将运算结果写回到其对应连接的高速暂存存储器 121中,控制该高速暂存存储器121将运算结果写回到存储器11。
本公开的一个实施例提出了一种异构多核神经网络处理器,参见图5B,包括:存储器21、共享高速暂存存储器22和异构内核23。
存储器21,用于存储神经网络运算的数据和指令,数据包括偏置、权值、输入数据、输出数据、和神经网络模型的类型和参数,指令包括神经网络运算对应的各种指令。存储器中存储的数据和指令通过共享高速暂存存储器22传送到异构内核23中。
共享高速暂存存储器22,通过存储器总线与存储器21相连,通过共享高速暂存存储器总线与异构内核23进行连接,实现异构内核23与共享高速暂存存储器22之间、共享高速暂存存储器22与存储器21之间的数据交换。
当异构内核23所需的神经网络运算数据或指令未存储在共享高速暂存存储器22中时,共享高速暂存存储器22先通过存储器总线从存储器21中读取所需的数据或指令,然后将其通过高速暂存存储器总线送入到异构内核23中。
异构内核23,包括一逻辑控制内核231、多个通用内核232以及多个专用内核233,逻辑控制内核231、通用内核232以及专用内核233均通过高速暂存存储器总线与共享高速暂存存储器22连接。
异构内核23用以从共享高速暂存存储器22中读取神经网络运算数据和指令，完成神经网络运算，并将运算结果送回到共享高速暂存存储器22，控制共享高速暂存存储器22将运算结果写回到存储器21。
另外当逻辑控制内核231与通用内核232之间、逻辑控制内核231与专用内核233之间、通用内核232之间以及专用内核233之间需要进行数据传输时,发送数据的内核可以将数据通过共享高速暂存存储器总线先传输到共享高速暂存存储器22,然后再将数据传输到接收数据的内核,而不需要经过存储器21。
对于神经网络运算来说,其神经网络模型一般包括多个神经网络层,每个神经网络层利用上一神经网络层的运算结果进行相应的运算,其运算结果输出给下一神经网络层,最后一个神经网络层的运算结果作为整个神经网络运算的结果。在本实施例异构多核神经网络处理器中,通用内核232和专用内核233均可以执行一个神经网络层的运算,利用逻辑控制内核231、通用内核232以及专用内核233共同完成神经网络运算,以下为了描述方便,将神经网络层简称为层。
其中,每一专用内核233可以独立执行一层运算,例如神经网络层的卷积运算、全连接层、拼接运算、对位加/乘运算、Relu运算、池化运算、Batch Norm运算等,且神经网络运算层的规模不能过大,即不能超出对应的专用内核的所能支持的神经网络运算层的规模,也就是说专用内核运算对层的神经元和突触的数量有限制,在层运算结束后,将运算结果写回到共享高速暂存存储器22中。
通用内核232用于执行超过专用内核233所能支持的运算规模的或者所有专用内核都不支持的层运算,并将运算结果写回到共享高速暂存存储器22中,控制共享高速暂存存储器22将运算结果写回到存储器21。
进一步地,专用内核233和通用内核232将运算结果写回到存储器21后,逻辑控制内核231会向执行下一层运算的专用内核或通用内核发送开始运算信号,通知执行下一层运算的专用内核或通用内核开始运算。
更进一步的,专用内核233和通用内核232在接收到执行上一层运算的专用内核或通用内核发送的开始运算信号,且当前无正在进行的层运算时开始运算,若当前正在进行层运算,则将当前层运算完成,并将运算结果写回到共享高速暂存存储器22中后开始运算。
逻辑控制内核231从共享高速暂存存储器22中读入神经网络运算数据,针对其中的神经网络模型的类型和参数,对神经网络模型的每个层进行解析,对于每一层,判断是否存在支持该层运算且能完成该层运算规模的专用内核233,如果存在,则将该层的运算交由对应的专用内核233运算,如果不存在,则将该层的运算交由通用内核232进行运算。逻辑控制内核231还设置通用内核232和专用内核233进行层运算所需的数据和指令的对应地址,通用内核232和专用内核233读取对应地址的数据和指令,执行层运算。
对于执行第一层运算的专用内核233和通用内核232,逻辑控制内核231会在运算开始时对该专用内核233或通用内核232发送开始运算信号,而在神经网络运算结束后,执行最后一层运算的专用内核233或通用内核232会对逻辑控制内核231发送开始运算信号,逻辑控制内核231在接收到开始运算信号后,控制共享高速暂存存储器22将运算结果写回到存储器21中。
本公开的一个实施例提供了一种利用第一实施例所述的异构多核神经网络处理器进行神经网络运算的方法,参见图5C,步骤如下:
步骤S5-11:异构内核13中的逻辑控制内核131通过非共享高速暂存存储器12从存储器11中读取神经网络运算的数据和指令;
步骤S5-12:异构内核13中的逻辑控制内核131根据数据中的神经网络模型的类型和参数,判断是否存在符合条件的专用内核,符合条件是指该专用内核支持该神经网络运算且能完成该神经网络运算规模(规模限制可以是专用内核固有的,此时可以询问内核设计厂家;也可以是人为规定的(比如通过实验发现,超过一定规模,通用内核更有效),在配置CAM时可以设定规模限制)。如果专用内核m符合条件,则专用内核m作为目标内核,并执行步骤S5-13,否则执行步骤S5-15,其中,m为专用内核编号,1≤m≤M,M为专用内核数量。
步骤S5-13:异构内核13中的逻辑控制内核131向目标内核发送信号,激活该目标内核,同时将所要进行的神经网络运算的数据和指令对应的地址发送到该目标内核。
步骤S5-14:目标内核根据获取的地址通过非共享高速暂存存储器12从存储器11中获取神经网络运算的数据和指令,进行神经网络运算,然后将运算结果通过非共享高速暂存存储器12输出到存储器11中,运算完成。
进一步的,接上述步骤S5-12,如果不存在符合条件的专用内核,则接下来执行步骤S5-15至S5-16。
步骤S5-15:异构内核13中的逻辑控制内核131向通用内核132发送信号,激活通用内核132,同时将所要进行的神经网络运算的数据和指令对应的地址发送到通用内核132。
步骤S5-16:通用内核132根据获取的地址通过非共享高速暂存存储器12从存储器11中获取神经网络运算的数据和指令,进行神经网络运算,然后将运算结果通过非 共享高速暂存存储器12输出到存储器中11,运算完成。
本公开的一个实施例提供了一种利用第二实施例所述的异构多核神经网络处理器进行神经网络运算的方法,参见图5D,步骤如下:
步骤S5-21:异构内核23中的逻辑控制内核231通过共享高速暂存存储器22从存储器21中读取神经网络运算的数据和指令。
步骤S5-22:异构内核23中的逻辑控制内核231对数据中的神经网络模型的类型和参数进行解析,对于神经网络模型中第1到I层,分别判断是否存在符合条件的专用内核,I为神经网络模型的层数,符合条件是指该专用内核支持该层运算且能完成该层运算规模,并为每一层运算分配对应的通用内核或专用内核。
对于神经网络模型第i层运算，1≤i≤I，若专用内核m符合条件，则选择专用内核m执行神经网络模型第i层运算，m为专用内核编号，1≤m≤M，M为专用内核数量。如果没有专用内核符合条件，则选择通用内核M+n执行神经网络模型第i层运算，M+n为通用内核的编号，1≤n≤N，N为通用内核数量，其中将专用内核233和通用内核232统一编号（即专用内核和通用内核一起进行编号，例如x个专用内核，y个通用内核，则可以从1开始编号，编至x+y，每个专用内核、通用内核对应1至x+y中的一个编号；当然也可以专用内核和通用内核单独分别编号，例如x个专用内核，y个通用内核，则专用内核可以从1开始编号，编至x，通用内核可以从1开始编号，编至y，每个专用内核、通用内核对应一个编号），此时专用内核可能与通用内核编号相同，但只是逻辑编号相同，可以根据其物理地址来寻址，最后得到与神经网络模型第1到I层运算对应的内核序列。即该内核序列共有I个元素，每一元素为一专用内核或通用内核，依序对应神经网络模型第1到I层运算。例如内核序列1a,2b,…,il，其中1、2、i表示神经网络层的编号，a、b、l表示专用内核或通用内核的编号。
步骤S5-23:异构内核23中的逻辑控制内核231将所要进行的层运算的数据和指令对应的地址发送到执行该层运算的专用内核或通用内核,并向执行该层运算的专用内核或通用内核发送内核序列中下一个专用内核或通用内核的编号,其中向执行最后一层运算的专用内核或通用内核发送的是逻辑控制内核的编号。
步骤S5-24:异构内核23中的逻辑控制内核231向内核序列中第一个内核发送开始运算信号。第一专用内核233或通用内核232在接收到开始运算信号之后,若当前存在未完成的运算,则继续完成运算,然后,继续从数据和指令对应的地址读取数据和指令,进行当前层运算。
步骤S5-25:第一专用内核233或通用内核232在完成当前层运算之后,将运算结果传送到共享高速暂存存储器22的指定地址,同时,向内核序列中第二个内核发送开始运算信号。
步骤S5-26:以此类推,内核序列中每一内核接收到开始运算信号之后,若当前存在未完成的运算,则继续完成运算,然后,继续从数据和指令对应的地址读取数据和指令,进行其对应的层运算,将运算结果传送到共享高速暂存存储器22的指定地址,同时,向内核序列中下一个内核发送开始运算信号。其中,内核序列最后一个内核是向逻辑控制内核231发送开始运算信号。
步骤S5-27:逻辑控制内核231接收到开始运算信号后,控制共享高速暂存存储器 22将各个神经网络层的运算结果写回到存储器21中,运算完成。
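下面给出步骤S5-22至S5-27所述“为每层分配内核并按内核序列依次运算”的简化Python示意（以函数调用模拟开始运算信号，以字典模拟共享高速暂存存储器；所有名称与数据结构均为假设，仅作理解之用）：

```python
def build_core_sequence(layers, dispatch):
    # 逐层判断是否存在符合条件的专用内核，得到与各层一一对应的内核序列
    return [dispatch(layer["type"], layer["scale"]) for layer in layers]

def core_compute(core, layer, value):
    # 假设的单层运算，仅做占位计算
    return value * layer.get("weight", 1)

def run_network(layers, core_sequence, input_data):
    shared_cache = {"layer_input": input_data}      # 模拟共享高速暂存存储器
    for layer, core in zip(layers, core_sequence):
        # 内核接收“开始运算信号”后，从指定地址读取数据完成本层运算，
        # 再把结果写回共享高速暂存存储器，供序列中下一个内核使用
        x = shared_cache["layer_input"]
        shared_cache["layer_input"] = [core_compute(core, layer, v) for v in x]
    return shared_cache["layer_input"]              # 最后由逻辑控制内核写回存储器

layers = [{"type": "conv", "scale": 256, "weight": 2},
          {"type": "pool", "scale": 256, "weight": 1}]
cores = build_core_sequence(layers, lambda t, s: ("special", 0))
print(run_network(layers, cores, [1, 2, 3]))        # [2, 4, 6]
```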
如图5E所示，本实施例是对上述第一实施例的进一步扩展。实施例一（图5A）中，高速暂存存储器121是每个核专用的，专用内核1只能访问高速暂存存储器3，而不能访问其它高速暂存存储器，其它内核也是一样，因此121构成的整体12就具有非共享的性质。但是，如果内核j要使用内核i(i≠j)的计算结果（此结果最初存放在内核i对应的高速暂存存储器中），内核i就必须先把此结果从高速暂存存储器写入存储器11，然后内核j还要从存储器11读入其可以访问的高速暂存存储器，这个过程后，内核j才能使用这一结果。为使这一过程更加便捷，在此基础上增加一个N×N的数据交换网络34，比如可以用crossbar实现，这样每一个内核（331或332或333）均可以访问所有高速暂存存储器（321）。此时32就具有了共享性质。
采用本实施例(对应图5E)的装置进行神经网络运算的方法的如下:
步骤S5-31:异构内核33中的逻辑控制内核331通过高速暂存存储器32从存储器31中读取神经网络运算的数据和指令;
步骤S5-32:异构内核33中的逻辑控制内核331根据数据中的神经网络模型的类型和参数,判断是否存在符合条件的专用内核,符合条件是指该专用内核支持该神经网络运算且能完成该神经网络运算规模。如果专用内核m符合条件,则专用内核m作为目标内核,并执行步骤S5-33,否则执行步骤S5-35,其中,m为专用内核编号。
步骤S5-33:异构内核33中的逻辑控制内核331向目标内核发送信号,激活该目标内核,同时将所要进行的神经网络运算的数据和指令对应的地址发送到该目标内核。
步骤S5-34:目标内核根据获取的地址(从高速暂存存储器32中)获取神经网络运算的数据和指令,进行神经网络运算,然后将运算结果存入高速暂存存储器32中,运算完成。
步骤S5-35:异构内核33中的逻辑控制内核331向通用内核332发送信号,激活通用内核332,同时将所要进行的神经网络运算的数据和指令对应的地址发送到通用内核332。
步骤S5-36:通用内核332根据获取的地址(从高速暂存存储器32中)获取神经网络运算的数据和指令,进行神经网络运算,然后将运算结果存入高速暂存存储器32中,运算完成。
进一步,可以对存储器与高速暂存存储器之间的连接方式进行更改。由此产生新的实施例,如图5F所示。其相对于图5E所述实施例的差别在于存储器41与高速暂存存储器42间的连接方式。原先采用总线连接,在多个高速暂存存储器321写存储器31时,要排队,效率不高(参见图5E)。现在将这里的结构抽象为一个1输入N输出的数据交换网络,可以使用多种多样的拓扑结构来实现这个功能,比如星型结构(存储器41与N个高速暂存存储器421均有专门的通路连接)、树状结构(存储器41在树根位置,高速暂存存储器421在树叶位置)等。
需要说明的是,本公开中对逻辑控制内核的数量、专用内核的数量、通用内核的数量、共享或非共享高速暂存存储器的数量、存储器的数量均不作限制,可根据神经网络运算的具体要求适当调整。
至此,已经结合附图对本公开实施例进行了详细描述。依据以上描述,本领域技 术人员应当对本公开的一种异构多核神经网络处理器和神经网络运算方法有了清楚的认识。
在一些实施例里,本公开还提供了一种芯片,其包括了上述运算装置。
在一些实施例里,本公开还提供了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,本公开还提供了一种板卡,其包括了上述芯片封装结构。
在一些实施例里,本公开还提供了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only  Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成,该程序可以存储于一计算机可读存储器中,存储器可以包括:闪存盘、只读存储器(英文:Read-Only Memory,简称:ROM)、随机存取器(英文:Random Access Memory,简称:RAM)、磁盘或光盘等。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本申请。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本申请的精神或范围的情况下,在其它实施例中实现。因此,本申请将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。
需要说明的是,在附图或说明书正文中,未绘示或描述的实现方式,均为所属技术领域中普通技术人员所知的形式,并未进行详细说明。此外,上述对各元件和方法的定义并不仅限于实施例中提到的各种具体结构、形状或方式,本领域普通技术人员可对其进行简单地更改或替换,例如:
本公开的控制模块并不限定于实施例的具体组成结构,所属技术领域的技术人员熟知的可实现存储模块和运算单元之间数据、运算指令交互的控制模块,均可用于实现本公开。
以上所述的具体实施例,对本公开的目的、技术方案和有益效果进行了进一步详细说明,所应理解的是,以上所述仅为本公开的具体实施例而已,并不用于限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开的保护范围之内。

Claims (26)

  1. 一种运算装置,包括:
    运算模块,用于执行神经网络运算;以及
    幂次转换模块,与所述运算模块连接,用于将神经网络运算的输入神经元数据和/或输出神经元数据转换为幂次神经元数据。
  2. 根据权利要求1所述的运算装置,其中,所述幂次转换模块包括:
    第一幂次转换单元,用于将所述运算模块输出的神经元数据转换为幂次神经元数据;以及
    第二幂次转换单元,用于将输入所述运算模块的神经元数据转换为幂次神经元数据。
  3. 根据权利要求2所述的运算装置,其中,所述运算模块还包括:第三幂次转换单元,用于将幂次神经元数据转换为非幂次神经元数据。
  4. 根据权利要求1所述的运算装置,还包括:
    存储模块,用于存储数据和运算指令;
    控制模块,用于控制数据和运算指令的交互,所述控制模块接收所述存储模块发送的数据和运算指令,并将运算指令译码成运算微指令;其中,
    所述运算模块包括运算单元,用于接收所述控制模块发送的数据和运算微指令,并根据运算微指令对其接收的数据执行神经网络运算。
  5. 根据权利要求4所述的运算装置,其中,所述控制模块包括:运算指令缓存单元、译码单元、输入神经元缓存单元、权值缓存单元以及数据控制单元;其中
    运算指令缓存单元,与所述数据控制单元连接,用于接收该数据控制单元发送的运算指令;
    译码单元,与所述运算指令缓存单元连接,用于从运算指令缓存单元中读取运算指令,并将其译码成运算微指令;
    输入神经元缓存单元,与所述数据控制单元连接,用于从该数据控制单元获取相应的幂次神经元数据;
    权值缓存单元,与所述数据控制单元连接,用于从数据控制单元获取相应的权值数据;
    数据控制单元,与所述存储模块连接,用于实现所述存储模块分别与所述运算指令缓存单元、所述权值缓存单元以及所述输入神经元缓存单元之间的数据和运算指令交互;其中,
    所述运算单元分别与所述译码单元、输入神经元缓存单元及权值缓存单元连接,接收运算微指令、幂次神经元数据及权值数据,根据所述运算微指令对运算单元接收的幂次神经元数据和权值数据执行相应的神经网络运算。
  6. 根据权利要求5所述的运算装置,还包括:输出模块,其包括输出神经元缓存单元,用于接收所述运算模块输出的神经元数据;
    其中,所述幂次转换模块包括:
    第一幂次转换单元,与所述输出神经元缓存单元连接,用于将所述输出神经元缓存单元输出的神经元数据转换为幂次神经元数据;
    第二幂次转换单元,与所述存储模块连接,用于将输入所述存储模块的神经元数据转换为幂次神经元数据;
    所述运算模块包括:
    第三幂次转换单元,与所述运算单元连接,用于将幂次神经元数据转换为非幂次神经元数据。
  7. 根据权利要求6所述的运算装置,其中,所述第一幂次转换单元还与所述数据控制单元连接,用于所述运算模块输出的神经元数据转换为幂次神经元数据并发送至所述数据控制单元,以作为下一层神经网络运算的输入数据。
  8. 根据权利要求1至7中任一项所述的运算装置,其中,所述幂次神经元数据包括符号位和幂次位,所述符号位用于表示所述幂次神经元数据的符号,所述幂次位用于表示所述幂次神经元数据的幂次位数据;所述符号位包括一位或多位比特位数据,所述幂次位包括m位比特位数据,m为大于1的正整数。
  9. 根据权利要求8所述的运算装置,其中,所述神经网络运算装置还包括存储模块,该存储模块预存有编码表,该编码表包括幂次位数据以及指数数值,所述编码表用于通过幂次神经元数据的幂次位数据获取与所述幂次位数据对应的指数数值。
  10. 根据权利要求9所述的运算装置,其中,所述编码表还包括一个或多个置零幂次位数据,所述置零幂次位数据对应的幂次神经元数据为0。
  11. 根据权利要求9所述的运算装置,其中,最大的幂次位数据对应幂次神经元数据为0或最小的幂次位数据对应幂次神经元数据为0。
  12. 根据权利要求9所述的运算装置,其中,所述编码表的对应关系为正相关关系,存储模块预存一个整数值x和一个正整数值y,所述编码表中最小的幂次位数据对应指数数值为x;其中,x表示偏置值,y表示步长。
  13. 根据权利要求12所述的运算装置,其中,所述编码表中最大的幂次位数据对应幂次神经元数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据+x)*y。
  14. 根据权利要求12或13所述的运算装置，其中，y=1，x的数值=-2^(m-1)。
  15. 根据权利要求9所述的运算装置,其中,所述编码表的对应关系为负相关关系,存储模块预存一个整数值x和一个正整数值y,最大的幂次位数据对应指数数值为x;其中,x表示偏置值,y表示步长。
  16. 根据权利要求15所述的运算装置,其中,最大的幂次位数据对应指数数值为x,最小的幂次位数据对应幂次神经元数据为0,最小和最大的幂次位数据之外的其他的幂次位数据对应指数数值为(幂次位数据-x)*y。
  17. 根据权利要求15或16所述的运算装置，其中，y=1，x的数值等于2^(m-1)。
  18. 根据权利要求1至17中任一项权利要求的运算装置,其中,神经元数据转换成幂次神经元数据包括:
    s out=s in
    d out+=⌊log 2(d in+)⌋
    其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
    ⌊x⌋表示对数据x做取下整操作；或，
    s out=s in
    d out+=⌈log 2(d in+)⌉
    其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据,s in为输入数据的符号,s out为输出数据的符号,d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out
    ⌈x⌉表示对数据x做取上整操作；或，
    s out=s in
    d out+=[log 2(d in+)]
    其中,d in为幂次转换单元的输入数据,d out为幂次转换单元的输出数据;s in为输入数据的符号,s out为输出数据的符号;d in+为输入数据的正数部分,d in+=d in×s in,d out+为输出数据的正数部分,d out+=d out×s out;[x]表示对数据x做四舍五入操作。
  19. 一种运算方法,包括:
    执行神经网络运算;以及
    在执行神经网络运算之前,将神经网络运算的输入神经元数据转换为幂次神经元数据;和/或在执行神经网络运算之后,将神经网络运算的输出神经元数据转换为幂次神经元数据。
  20. 根据权利要求19所述的运算方法,其中,在执行神经网络运算之前,将神经网络运算的输入神经元数据转换为幂次神经元数据的步骤包括:
    将输入神经元数据中的非幂次神经元数据转换为幂次神经元数据;以及
    接收并存储运算指令、所述幂次神经元数据及权值数据。
  21. 根据权利要求20所述的运算方法,其中,在接收并存储运算指令、所述幂次神经元数据及权值数据的步骤和执行神经网络运算的步骤之间,还包括:
    读取运算指令,并将其译码成各运算微指令。
  22. 根据权利要求21所述的运算方法,其中,在执行神经网络运算的步骤中,根据运算微指令对权值数据和幂次神经元数据进行神经网络运算。
  23. 根据权利要求19至22中任一项所述的运算方法,其中,在执行神经网络运算之后,将神经网络运算的输出神经元数据转换为幂次神经元数据的步骤包括:
    输出神经网络运算后得到的神经元数据;以及
    将神经网络运算后得到的神经元数据中的非幂次神经元数据转换为幂次神经元数据。
  24. 根据权利要求23所述的运算方法,其中,将幂次神经元数据并发送至所述数据控制单元,以作为神经网络运算下一层的输入幂次神经元,重复神经网络运算步骤和非幂次神经元数据转换成幂次神经元数据步骤,直到神经网络最后一层运算结束。
  25. 根据权利要求19至24中任一项所述的运算方法,其中,在存储模块中预存一个整数值x和一个正整数值y;其中,x表示偏置值,y表示步长;通过改变存储模块预存的整数值x和正整数值y,以调整神经网络运算装置所能表示的幂次神经元数据范围。
  26. 一种使用权利要求12至17中任一项所述的运算装置的方法,其中,通过改变存储模块预存的整数值x和正整数值y,以调整神经网络运算装置所能表示的幂次神经元数据范围。
PCT/CN2018/081929 2017-04-06 2018-04-04 运算装置和方法 WO2018184570A1 (zh)

Priority Applications (13)

Application Number Priority Date Filing Date Title
EP19199528.1A EP3624018B1 (en) 2017-04-06 2018-04-04 Neural network computation device and method
EP19199521.6A EP3620992B1 (en) 2017-04-06 2018-04-04 Neural network processor and neural network computation method
EP19199524.0A EP3627437B1 (en) 2017-04-06 2018-04-04 Data screening device and method
EP24168317.6A EP4372620A3 (en) 2017-04-06 2018-04-04 Neural network processor and neural network computation method
CN201880001242.9A CN109219821B (zh) 2017-04-06 2018-04-04 运算装置和方法
EP18780474.5A EP3579150B1 (en) 2017-04-06 2018-04-04 Operation apparatus and method for a neural network
CN201811423295.8A CN109409515B (zh) 2017-04-06 2018-04-04 运算装置和方法
EP19199526.5A EP3633526A1 (en) 2017-04-06 2018-04-04 Computation device and method
US16/283,711 US10896369B2 (en) 2017-04-06 2019-02-22 Power conversion in neural networks
US16/520,041 US11551067B2 (en) 2017-04-06 2019-07-23 Neural network processor and neural network computation method
US16/520,082 US11010338B2 (en) 2017-04-06 2019-07-23 Data screening device and method
US16/520,654 US11049002B2 (en) 2017-04-06 2019-07-24 Neural network computation device and method
US16/520,615 US10671913B2 (en) 2017-04-06 2019-07-24 Computation device and method

Applications Claiming Priority (10)

Application Number Priority Date Filing Date Title
CN201710222232.5 2017-04-06
CN201710222232.5A CN108694181B (zh) 2017-04-06 2017-04-06 一种数据筛选装置和方法
CN201710227493.6 2017-04-07
CN201710227493.6A CN108694441B (zh) 2017-04-07 2017-04-07 一种网络处理器和网络运算方法
CN201710256444.5 2017-04-19
CN201710256444.5A CN108733625B (zh) 2017-04-19 2017-04-19 运算装置及方法
CN201710266052.7 2017-04-21
CN201710266052.7A CN108734280A (zh) 2017-04-21 2017-04-21 一种运算装置和方法
CN201710312415.6A CN108805271B (zh) 2017-05-05 2017-05-05 一种运算装置和方法
CN201710312415.6 2017-05-05

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US16/281,711 Continuation-In-Part US10804176B2 (en) 2019-02-21 2019-02-21 Low stress moisture resistant structure of semiconductor device
US16/283,711 Continuation-In-Part US10896369B2 (en) 2017-04-06 2019-02-22 Power conversion in neural networks

Publications (1)

Publication Number Publication Date
WO2018184570A1 true WO2018184570A1 (zh) 2018-10-11

Family

ID=63712007

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/081929 WO2018184570A1 (zh) 2017-04-06 2018-04-04 运算装置和方法

Country Status (4)

Country Link
US (1) US10896369B2 (zh)
EP (6) EP4372620A3 (zh)
CN (4) CN109219821B (zh)
WO (1) WO2018184570A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446478A (zh) * 2019-09-05 2021-03-05 美光科技公司 数据存储装置中的高速缓存操作的智能优化

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11010338B2 (en) 2017-04-06 2021-05-18 Shanghai Cambricon Information Technology Co., Ltd Data screening device and method
CN111656367A (zh) * 2017-12-04 2020-09-11 优创半导体科技有限公司 神经网络加速器的系统和体系结构
US11531727B1 (en) 2018-04-20 2022-12-20 Perceive Corporation Computation of neural network node with large input values
US11210586B1 (en) 2018-04-20 2021-12-28 Perceive Corporation Weight value decoder of neural network inference circuit
US11568227B1 (en) 2018-04-20 2023-01-31 Perceive Corporation Neural network inference circuit read controller with multiple operational modes
US11886979B1 (en) 2018-04-20 2024-01-30 Perceive Corporation Shifting input values within input buffer of neural network inference circuit
US11783167B1 (en) 2018-04-20 2023-10-10 Perceive Corporation Data transfer for non-dot product computations on neural network inference circuit
US11586910B1 (en) 2018-04-20 2023-02-21 Perceive Corporation Write cache for neural network inference circuit
US12093696B1 (en) 2018-04-20 2024-09-17 Perceive Corporation Bus for transporting output values of a neural network layer to cores specified by configuration data
US10740434B1 (en) 2018-04-20 2020-08-11 Perceive Corporation Reduced dot product computation circuit
GB2577732B (en) * 2018-10-04 2022-02-23 Advanced Risc Mach Ltd Processing data in a convolutional neural network
US11995533B1 (en) 2018-12-05 2024-05-28 Perceive Corporation Executing replicated neural network layers on inference circuit
CN109711538B (zh) * 2018-12-14 2021-01-15 安徽寒武纪信息科技有限公司 运算方法、装置及相关产品
US11347297B1 (en) 2019-01-23 2022-05-31 Perceive Corporation Neural network inference circuit employing dynamic memory sleep
CN111695686B (zh) * 2019-03-15 2022-11-01 上海寒武纪信息科技有限公司 地址分配方法及装置
US12260317B1 (en) 2019-05-21 2025-03-25 Amazon Technologies, Inc. Compiler for implementing gating functions for neural network configuration
CN112114942A (zh) * 2019-06-21 2020-12-22 北京灵汐科技有限公司 一种基于众核处理器的流式数据处理方法及计算设备
US12249189B2 (en) 2019-08-12 2025-03-11 Micron Technology, Inc. Predictive maintenance of automotive lighting
US12061971B2 (en) 2019-08-12 2024-08-13 Micron Technology, Inc. Predictive maintenance of automotive engines
CN113435591B (zh) * 2019-08-14 2024-04-05 中科寒武纪科技股份有限公司 数据处理方法、装置、计算机设备和存储介质
US12210401B2 (en) 2019-09-05 2025-01-28 Micron Technology, Inc. Temperature based optimization of data storage operations
CN110991619A (zh) 2019-12-09 2020-04-10 Oppo广东移动通信有限公司 神经网络处理器、芯片和电子设备
CN111930669B (zh) * 2020-08-03 2023-09-01 中国科学院计算技术研究所 多核异构智能处理器及运算方法
CN111930668B (zh) * 2020-08-03 2023-09-26 中国科学院计算技术研究所 运算装置、方法、多核智能处理器及多核异构智能处理器
CN111985634B (zh) * 2020-08-21 2024-06-14 北京灵汐科技有限公司 神经网络的运算方法、装置、计算机设备及存储介质
CN116324809A (zh) * 2020-10-20 2023-06-23 华为技术有限公司 通信方法、装置及系统
US12217160B1 (en) 2021-04-23 2025-02-04 Amazon Technologies, Inc. Allocating blocks of unified memory for integrated circuit executing neural network
CN115185878A (zh) * 2022-05-24 2022-10-14 中科驭数(北京)科技有限公司 一种多核分组网络处理器架构及任务调度方法
KR102653745B1 (ko) * 2023-06-02 2024-04-02 라이프앤사이언스주식회사 최적화된 연산속도를 가지는 교육용 로봇제어기

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102930336A (zh) * 2012-10-29 2013-02-13 哈尔滨工业大学 电阻阵列自适应校正方法
US20140081893A1 (en) * 2011-05-31 2014-03-20 International Business Machines Corporation Structural plasticity in spiking neural networks with symmetric dual of an electronic neuron
US20140278126A1 (en) * 2013-03-15 2014-09-18 Src, Inc. High-Resolution Melt Curve Classification Using Neural Networks
CN104641385A (zh) * 2012-09-14 2015-05-20 国际商业机器公司 神经核心电路

Family Cites Families (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8619452D0 (en) * 1986-08-08 1986-12-17 Dobson V G Signal generating & processing
DE327817T1 (de) * 1988-01-11 1990-04-12 Ezel Inc., Tokio/Tokyo Assoziatives musterkonversionssystem und anpassungsverfahren dafuer.
US5931945A (en) * 1994-04-29 1999-08-03 Sun Microsystems, Inc. Graphic system for masking multiple non-contiguous bytes having decode logic to selectively activate each of the control lines based on the mask register bits
GB9509988D0 (en) * 1995-05-17 1995-07-12 Sgs Thomson Microelectronics Matrix transposition
US5629884A (en) * 1995-07-28 1997-05-13 Motorola, Inc. Log converter utilizing offset and method of use thereof
US6175892B1 (en) * 1998-06-19 2001-01-16 Hitachi America. Ltd. Registers and methods for accessing registers for use in a single instruction multiple data system
US6247114B1 (en) * 1999-02-19 2001-06-12 Advanced Micro Devices, Inc. Rapid selection of oldest eligible entry in a queue
US6330657B1 (en) * 1999-05-18 2001-12-11 Ip-First, L.L.C. Pairing of micro instructions in the instruction queue
US6581049B1 (en) * 1999-11-08 2003-06-17 Saffron Technology, Inc. Artificial neurons including power series of weights and counts that represent prior and next association
WO2001086431A1 (en) * 2000-05-05 2001-11-15 Lee Ruby B A method and system for performing subword permutation instructions for use in two-dimensional multimedia processing
AU2002331837A1 (en) 2001-09-20 2003-04-01 Microchip Technology Incorported Serial communication device with dynamic filter allocation
US7333567B2 (en) * 2003-12-23 2008-02-19 Lucent Technologies Inc. Digital detector utilizable in providing closed-loop gain control in a transmitter
CN100437520C (zh) * 2004-09-21 2008-11-26 中国科学院计算技术研究所 一种用计算机对矩阵进行运算的方法
US8280936B2 (en) * 2006-12-29 2012-10-02 Intel Corporation Packed restricted floating point representation and logic for conversion to single precision float
CN101093474B (zh) * 2007-08-13 2010-04-07 北京天碁科技有限公司 Method and processing system for implementing matrix transposition using a vector processor
WO2010080405A1 (en) * 2008-12-19 2010-07-15 The Board Of Trustees Of The University Of Illinois Detection and prediction of physiological events in people with sleep disordered breathing using a lamstar neural network
US8892485B2 (en) * 2010-07-08 2014-11-18 Qualcomm Incorporated Methods and systems for neural processor training by encouragement of correct output
CN201726420U (zh) 2010-08-06 2011-01-26 北京国科环宇空间技术有限公司 Blind equalization device
KR101782373B1 (ko) * 2010-11-10 2017-09-29 삼성전자 주식회사 Computing apparatus and method using X-Y stack memory
CN102508803A (zh) 2011-12-02 2012-06-20 南京大学 Matrix transposition memory controller
US20150356054A1 (en) * 2013-01-10 2015-12-10 Freescale Semiconductor, Inc. Data processor and method for data processing
CN103679710B (zh) * 2013-11-29 2016-08-17 杭州电子科技大学 Weak image edge detection method based on firing information of multi-layer neuron populations
US9978014B2 (en) * 2013-12-18 2018-05-22 Intel Corporation Reconfigurable processing unit
EP3035249B1 (en) 2014-12-19 2019-11-27 Intel Corporation Method and apparatus for distributed and cooperative computation in artificial neural networks
CN104598391A (zh) 2015-01-21 2015-05-06 佛山市智海星空科技有限公司 Block-based linear storage and reading method and system for a two-dimensional matrix to be transposed
US9712146B2 (en) * 2015-09-18 2017-07-18 University Of Notre Dame Du Lac Mixed signal processors
CN105426160B (zh) 2015-11-10 2018-02-23 北京时代民芯科技有限公司 Instruction classification and multi-issue method based on the SPRAC V8 instruction set
CN107578099B (zh) * 2016-01-20 2021-06-11 中科寒武纪科技股份有限公司 Computing device and method
CN105844330B (zh) * 2016-03-22 2019-06-28 华为技术有限公司 Data processing method for a neural network processor, and neural network processor
CN105930902B (zh) * 2016-04-18 2018-08-10 中国科学院计算技术研究所 Neural network processing method and system
JP2017199167A (ja) * 2016-04-27 2017-11-02 ルネサスエレクトロニクス株式会社 Semiconductor device
CN109416754B (zh) * 2016-05-26 2020-06-23 多伦多大学管理委员会 Accelerator for deep neural networks
CN106066783A (zh) * 2016-06-02 2016-11-02 华为技术有限公司 Hardware structure for neural network forward operation based on power weight quantization
EP3469522A4 (en) * 2016-06-14 2020-03-18 The Governing Council of the University of Toronto Accelerator for deep neural networks
CN106228238B (zh) * 2016-07-27 2019-03-22 中国科学技术大学苏州研究院 Method and system for accelerating deep learning algorithms on a field-programmable gate array platform
US10579583B2 (en) * 2016-08-09 2020-03-03 International Business Machines Corporation True random generator (TRNG) in ML accelerators for NN dropout and initialization
US11449745B2 (en) * 2016-09-28 2022-09-20 SK Hynix Inc. Operation apparatus and method for convolutional neural network
US10949736B2 (en) * 2016-11-03 2021-03-16 Intel Corporation Flexible neural network accelerator and methods therefor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3579150A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446478A (zh) * 2019-09-05 2021-03-05 美光科技公司 Intelligent optimization of cache operations in a data storage device

Also Published As

Publication number Publication date
EP3579150A1 (en) 2019-12-11
EP3624018B1 (en) 2022-03-23
EP3579150B1 (en) 2022-03-23
CN109409515B (zh) 2022-08-09
CN109344965A (zh) 2019-02-15
EP3627437B1 (en) 2022-11-09
EP3627437A1 (en) 2020-03-25
EP3620992A1 (en) 2020-03-11
EP3579150A4 (en) 2020-03-25
US10896369B2 (en) 2021-01-19
CN109219821A (zh) 2019-01-15
CN109409515A (zh) 2019-03-01
CN109219821B (zh) 2023-03-31
EP4372620A2 (en) 2024-05-22
US20190205739A1 (en) 2019-07-04
EP3620992B1 (en) 2024-05-29
EP4372620A3 (en) 2024-07-17
EP3624018A1 (en) 2020-03-18
EP3633526A1 (en) 2020-04-08
CN109359736A (zh) 2019-02-19

Similar Documents

Publication Publication Date Title
WO2018184570A1 (zh) Operation device and method
CN109997154B (zh) 信息处理方法及终端设备
US11551067B2 (en) Neural network processor and neural network computation method
US11442786B2 (en) Computation method and product thereof
WO2018192500A1 (zh) Processing apparatus and processing method
CN111353598B (zh) Neural network compression method, electronic device, and computer-readable medium
CN107992329A (zh) Computation method and related products
TW202321999A (zh) Computing device and method
CN111488963B (zh) Neural network computing device and method
CN111382835B (zh) Neural network compression method, electronic device, and computer-readable medium
CN111368986B (zh) Neural network computing device and method
CN111367567B (zh) Neural network computing device and method
CN111368987B (zh) Neural network computing device and method
CN111368990B (zh) Neural network computing device and method
CN111291884B (zh) Neural network pruning method and apparatus, electronic device, and computer-readable medium
CN112394991A (zh) Floating-point to half-precision floating-point instruction processing apparatus and method, and related products
CN112394990A (zh) Floating-point to half-precision floating-point instruction processing apparatus and method, and related products
CN112394993A (zh) Half-precision floating-point to short integer instruction processing apparatus and method, and related products
CN112394997A (zh) Eight-bit integer to half-precision floating-point instruction processing apparatus and method, and related products
CN112394903A (zh) Short integer to half-precision floating-point instruction processing apparatus and method, and related products
CN112394902A (zh) Half-precision floating-point to floating-point instruction processing apparatus and method, and related products
CN112394987A (zh) Short integer to half-precision floating-point instruction processing apparatus and method, and related products
WO2020125092A1 (zh) Computing device and board card
CN112394989A (zh) Unsigned to half-precision floating-point instruction processing apparatus and method, and related products
CN112394996A (zh) Eight-bit integer to half-precision floating-point instruction processing apparatus and method, and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18780474

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2018780474

Country of ref document: EP

Effective date: 20190905

NENP Non-entry into the national phase

Ref country code: DE