
AU2019201716A1 - System and method of generating a neural network architecture - Google Patents


Info

Publication number
AU2019201716A1
Authority
AU
Australia
Prior art keywords
architecture
parameters
rnn
neural network
primary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2019201716A
Inventor
Kalyan Shankar Bhattacharjee
Amit Kumar Gupta
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2019201716A priority Critical patent/AU2019201716A1/en
Publication of AU2019201716A1 publication Critical patent/AU2019201716A1/en
Abandoned legal-status Critical Current

Classifications

    • G06N 3/044 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

SYSTEM AND METHOD OF GENERATING A NEURAL NETWORK ARCHITECTURE

ABSTRACT

A system and method of generating a neural network architecture for performing a task. The method comprises receiving (110), at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining (130) a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating (150) a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.

22274490_1

Description

[Fig. 1 (sheet 1/11): flow diagram of method 100 - Start; receive primary architecture encoding (140) at the RNN (LSTM) (110); train and evaluate the primary architecture (120); sample parameters of the LSTM from the trained Gaussian process (130), subject to user constraint(s); generate secondary architecture (150); End (199).]
SYSTEM AND METHOD OF GENERATING A NEURAL NETWORK ARCHITECTURE

TECHNICAL FIELD
[0001] The present invention relates to a method of generating a neural network architecture for a given task using Gaussian processes, starting from an initial network architecture that is modified using a Recurrent Neural Network. In particular, the present invention describes a system and method for generating a deep neural network architecture that achieves user-defined objectives, given an initial deep neural network, to solve a particular problem.
BACKGROUND
[0002] A deep neural network is a type of artificial neural network comprising multiple layers between the input and output layers. In the past, designing a deep neural network architecture required significant manual effort in deciding each individual layer's type, hyperparameters and activation units, and the sequence of such layers and activations in the final neural network architecture. Examples of a layer's type include fully connected, convolution, pooling and the like. Examples of a layer's hyperparameters include the number of hidden units for fully connected layers; the number of channels, stride, kernel size and padding for convolution layers; and the kernel size and stride for pooling layers. Examples of a layer's activation units include the rectified linear unit (ReLU), sigmoid, tanh and the like.
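To make the notion of layer hyperparameters concrete, the sketch below shows one possible way to describe a small stack of layers and encode each as a fixed-length numeric vector. This is purely illustrative; the dict fields, one-hot layout and helper names are assumptions, not part of this disclosure.

```python
# Illustrative encoding of per-layer hyperparameters as fixed-length vectors.
# The field layout and defaults below are assumptions for this sketch only.

LAYER_TYPES = {"conv": 0, "fc": 1, "pool": 2}

def encode_layer(layer):
    """Map a layer-description dict to a 7-element feature vector:
    [one-hot type (3), channels, kernel size, stride, hidden units]."""
    vec = [0.0, 0.0, 0.0]
    vec[LAYER_TYPES[layer["type"]]] = 1.0
    vec.append(float(layer.get("channels", 0)))  # convolution layers
    vec.append(float(layer.get("kernel", 0)))    # convolution/pooling layers
    vec.append(float(layer.get("stride", 1)))    # convolution/pooling layers
    vec.append(float(layer.get("hidden", 0)))    # fully connected layers
    return vec

# A toy "primary architecture": convolution -> pooling -> fully connected.
primary = [
    {"type": "conv", "channels": 64, "kernel": 3, "stride": 1},
    {"type": "pool", "kernel": 2, "stride": 2},
    {"type": "fc", "hidden": 128},
]
encoded = [encode_layer(layer) for layer in primary]
```

Each resulting vector could then be supplied to a recurrent network at one time step, one vector per layer of the primary architecture.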
[0003] There exist several known deep neural network architectures such as LeNet, VGG, ResNet, DenseNet and Inception. It is known that training and inference with these architectures is memory intensive. Generating a neural network architecture involves expert tuning of hyperparameters, which is extremely time consuming for a human expert. Techniques for automated generation of efficient neural network architectures have been developed in an effort to reduce the manual effort.
[0004] Most of the existing methods for generating a neural network architecture aim only for higher performance on a given task, which might lead to an even deeper neural network architecture than the existing ones. Aiming only for higher performance poses a significant limitation in terms of achieving user-specified objectives, for example reducing the number of parameters used in the network. Typically, most of the existing methods do not start with a primary architecture. Instead, the existing methods start from random architectures, which effectively slows down convergence of the optimization process. Starting from random architectures also results in slower training of the resulting secondary network architectures when assessing their performance during the optimization. The known methods start with a small convolutional neural network unit and use a Recurrent Neural Network as a controller to devise policies for deciding the layer types, layer hyperparameters and layer sequences. The policies are optimized using a reinforcement learning strategy which only deals with a limited, discrete set of hyperparameter design choices; for example, the number of channels for a convolution may only vary within {48, 64, 128, 256, 512}. Using a limited set of design choices effectively results in a limited search space, and hence the chance of generating an efficient neural network architecture is low. Further, including more choices for individual hyperparameters makes the reinforcement learning increasingly difficult and computationally expensive to optimize.
[0005] Some known methods rely on designing specific kernel(s) to distinguish different network architectures, and use network morphism to connect various network modules or Bayesian optimization to generate potential secondary architectures. However, these methods are very sensitive to the choice of the kernel function(s) and only allow a limited number of hyperparameters to be changed during optimization. Accordingly, the training process is less efficient in terms of finding better network architectures.
[0006] Thus, a need exists for a method of generating neural network architectures that allows a large number of hyperparameters to be changed during optimization.
SUMMARY
[0007] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[0008] One aspect of the present disclosure provides a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[0009] According to another aspect, the plurality of parameters of the RNN are determined based on one or more user constraints associated with completing the task.
[00010] According to another aspect, the user constraints relate to one or more of number of multiplications, number of floating point operations, number of parameters, memory storage requirement, memory access requirement, and parameters associated with a restricted capability device.
[00011] According to another aspect, the plurality of parameters associated with the primary architecture are determined using a plurality of intermediate neural network architectures associated with the primary architecture.
[00012] According to another aspect, the Gaussian process is fitted to parameters of a plurality of intermediate neural network architectures.
[00013] According to another aspect, the Gaussian process is determined based on a Bayesian optimisation.
[00014] According to another aspect, the method further comprises measuring performance of the plurality of intermediate neural network architectures based on one or more user constraints associated with completing the task.
[00015] According to another aspect, the layers of the secondary architecture are generated using a skip flag and a layer preservation flag associated with the determined plurality of parameters of the RNN.
[00016] According to another aspect, the layers of the secondary architecture are generated based on a modification configuration comprising a linear layer.
[00017] According to another aspect, the layers of the secondary architecture are generated based on a modification configuration comprising a bi-directional LSTM for each layer of the received primary architecture.
[00018] According to another aspect, the plurality of reduction factors are determined using a linear function and a sigmoid function.
[00019] According to another aspect, the secondary neural network is generated by sharing the determined parameters of the RNN irrespective of layer type.
[00020] According to another aspect, the secondary neural network is generated by sharing the determined parameters of the RNN only within layers of the same type.
[00021] According to another aspect, the RNN is a Long Short Term Memory (LSTM) network.
[00022] According to another aspect, the Gaussian process is constructed using parameters of the RNN which are continuously valued.
[00023] Another aspect of the present disclosure provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of generating a neural network architecture for performing a task, the program comprising: code for receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; code for determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and code for generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00024] Another aspect of the present disclosure provides an apparatus, comprising: a memory; and a processor configured to execute code stored on the memory to implement a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00025] Another aspect of the present disclosure provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00026] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[00027] At least one example embodiment of the present invention will now be described with reference to the drawings and appendices, in which:
[00028] Fig. 1 shows a method of generating a secondary network architecture constructed using a recurrent neural network from a trained Gaussian process given a primary network architecture;
[00029] Fig. 2 shows a method of sampling recurrent neural network parameters from a trained Gaussian process as used in the method of Fig. 1;
[00030] Fig. 3 shows a method of training Gaussian processes as used in the method of Fig. 2;
[00031] Fig. 4 shows a method of initializing recurrent neural network parameters to construct cells as used in the method of Fig. 3;
[00032] Fig. 5 shows a method of generating an overall performance measure for a secondary architecture as used in the method of Fig. 3;
[00033] Fig. 6 shows a method of fitting a Gaussian process as used in the method of Fig. 3;
[00034] Fig. 7 shows a method of constructing an acquisition function as used in the method of Fig. 2;
[00035] Fig. 8 shows a method of optimizing the acquisition function as used in the method of Fig. 2;
[00036] Fig. 9 shows a dataflow of generating a secondary architecture from a primary architecture given the recurrent neural network; and
[00037] Figs. 10A and 10B form a schematic block diagram of a general purpose computer system upon which arrangements described can be practiced.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00038] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00039] A method and system for generating neural network architectures for performing a given task is described below.
[00040] The present disclosure relates to a method of generating neural network architectures, starting by modifying a primary network architecture using a Gaussian process built on a Recurrent Neural Network. The neural network may be generated based on one or more user-specified objectives, also referred to as user constraints, associated with completing a required task. The user constraints can relate to one or more of the number of multiplications, number of floating point operations (FLOPs), number of parameters, memory storage requirement, memory access requirement and parameters associated with a deployment device. Parameters associated with a deployment device can relate to operating parameters of a device with reduced computational capability upon which the generated network is to be used, for example the memory or number of operations permitted for a mobile device.
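As a sketch of how one such constraint might be evaluated, the snippet below counts the parameters of a small convolution/fully-connected stack and checks it against a user-supplied budget. The layer dicts and formulas are hypothetical and simplified (spatial flattening between convolutional and fully connected layers is ignored); this is not the disclosed arrangement's implementation.

```python
# Rough parameter counting for a conv/fc stack, as one example of checking a
# user constraint (here: total parameter count). Formulas include biases;
# spatial dimensions are ignored, so the fc count is a simplification.

def count_parameters(layers, in_channels=3):
    total, prev = 0, in_channels
    for layer in layers:
        if layer["type"] == "conv":
            k, c = layer["kernel"], layer["channels"]
            total += k * k * prev * c + c   # weights + biases
            prev = c
        elif layer["type"] == "fc":
            h = layer["hidden"]
            total += prev * h + h           # weights + biases
            prev = h
    return total

def satisfies_constraint(layers, max_params):
    """True if the architecture fits within the user's parameter budget."""
    return count_parameters(layers) <= max_params
```

For a single 3x3 convolution mapping 3 input channels to 16 output channels, count_parameters returns 3*3*3*16 + 16 = 448, which could then be compared against the user's budget.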
[00041] The present disclosure addresses the problem of generating a neural network architecture with a user-specified objective for a particular task, starting from a primary architecture. The structure of the primary architecture is defined using layers of configuration parameters, henceforth referred to as layer hyperparameters.
[00042] A Recurrent Neural Network such as a Long Short Term Memory (LSTM) network takes an encoding of the hyperparameters of the different layers of the primary architecture as input, one layer at each of a number of time steps. The Recurrent Neural Network outputs reduction factors and two flags (for existence and skip-connection) corresponding to each of the layers. The existence flag is also referred to as a layer preservation flag. Based on the output, each input layer of the primary network architecture is modified to generate the corresponding layer of the secondary network architecture. In order to generate an effective secondary network architecture, the parameters of the LSTM require optimisation. Therefore, a Gaussian process is constructed whose inputs are the LSTM parameters and whose outputs are the overall performance of the secondary network architectures on an input task. Bayesian optimization is utilized to optimize the Gaussian process. The examples described herein relate to an LSTM network; however, other RNN variants such as fully recurrent networks, gated recurrent units, bi-directional LSTMs and the like may be used.
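The controller step described above can be sketched as follows. This is a deliberately simplified stand-in (a plain tanh RNN cell with random, untrained weights rather than an LSTM; all shapes and names are assumptions): for each layer encoding it emits a reduction factor in (0, 1) via a linear map followed by a sigmoid, together with a layer-preservation flag and a skip-connection flag.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, D = 8, 7                        # hidden size, per-layer encoding size
W_h = rng.normal(size=(H, H + D))  # fused recurrent + input weights
W_out = rng.normal(size=(3, H))    # -> reduction, preserve, skip logits

def controller_step(h, x):
    """One time step: consume a layer encoding, emit modification outputs."""
    h = np.tanh(W_h @ np.concatenate([h, x]))  # simplified RNN cell
    logits = W_out @ h
    reduction = sigmoid(logits[0])             # linear + sigmoid -> (0, 1)
    preserve = sigmoid(logits[1]) > 0.5        # layer-preservation flag
    skip = sigmoid(logits[2]) > 0.5            # skip-connection flag
    return h, reduction, preserve, skip

h = np.zeros(H)
for x in np.eye(D)[:3]:                        # three dummy layer encodings
    h, reduction, preserve, skip = controller_step(h, x)

# The reduction factor scales a layer hyperparameter; e.g. a 64-channel
# convolution in the primary architecture would shrink accordingly:
new_channels = max(1, round(reduction * 64))
```

In the described arrangements the cell would be an LSTM whose parameters are the quantities being optimised, and a preserved layer would be emitted into the secondary architecture with its hyperparameters scaled by the reduction factor.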
[00043] The arrangements described construct the Gaussian process on the parameters of the LSTM, which are continuously valued. Additionally, the LSTM parameters are shared across the various layers of the primary network architecture. These two factors reduce the complexity of the Bayesian optimization, which remains invariant to the depth of the primary network architecture.
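A minimal sketch of this optimisation loop, using scikit-learn's GaussianProcessRegressor and a simple upper-confidence-bound acquisition over random candidates, is shown below. The evaluate() stub, the 4-dimensional parameter vectors and the candidate-sampling scheme are all assumptions for illustration; in the described arrangements the inputs would be the continuous LSTM parameters and the outputs the measured performance of the generated secondary architectures.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

def evaluate(theta):
    # Stand-in for: build the controller from theta, generate a secondary
    # architecture, then train/evaluate it under the user constraints.
    return -float(np.sum(theta ** 2))

# Observed RNN parameter vectors and their measured performance.
X = rng.normal(size=(5, 4))
y = np.array([evaluate(t) for t in X])

gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# Pick the next parameter vector by maximising an acquisition function
# over randomly drawn candidates.
candidates = rng.normal(size=(256, 4))
mu, sigma = gp.predict(candidates, return_std=True)
ucb = mu + 1.96 * sigma                 # upper confidence bound
next_theta = candidates[np.argmax(ucb)]
```

Because the Gaussian process is built over the (continuous, shared) controller parameters rather than over discrete architecture choices, the dimension of this search space does not grow with the depth of the primary architecture.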
[00044] Figs. 10A and 10B depict a general-purpose computer system 1000, upon which the various arrangements described can be practiced.
[00045] As seen in Fig. 10A, the computer system 1000 includes: a computer module 1001; input devices such as a keyboard 1002, a mouse pointer device 1003, a scanner 1026, a camera 1027, and a microphone 1080; and output devices including a printer 1015, a display device 1014 and loudspeakers 1017. An external Modulator-Demodulator (Modem) transceiver device 1016 may be used by the computer module 1001 for communicating to and from a communications network 1020 via a connection 1021. The communications network 1020 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 1021 is a telephone line, the modem 1016 may be a traditional "dial-up" modem. Alternatively, where the connection 1021 is a high capacity (e.g., cable) connection, the modem 1016 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1020.
[00046] The computer module 1001 typically includes at least one processor unit 1005, and a memory unit 1006. For example, the memory unit 1006 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1001 also includes a number of input/output (I/O) interfaces including: an audio-video interface 1007 that couples to the video display 1014, loudspeakers 1017 and microphone 1080; an I/O interface 1013 that couples to the keyboard 1002, mouse 1003, scanner 1026, camera 1027 and optionally a joystick or other human interface device (not illustrated); and an interface 1008 for the external modem 1016 and printer 1015. In some implementations, the modem 1016 may be incorporated within the computer module 1001, for example within the interface 1008. The computer module 1001 also has a local network interface 1011, which permits coupling of the computer system 1000 via a connection 1023 to a local-area communications network 1022, known as a Local Area Network (LAN). As illustrated in Fig. 10A, the local communications network 1022 may also couple to the wide network 1020 via a connection 1024, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 1011 may comprise an Ethernet circuit card, a Bluetooth® wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1011.
[00047] The I/O interfaces 1008 and 1013 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1009 are provided and typically include a hard disk drive (HDD) 1010. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1012 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1000.
[00048] The components 1005 to 1013 of the computer module 1001 typically communicate via an interconnected bus 1004 and in a manner that results in a conventional mode of operation of the computer system 1000 known to those in the relevant art. For example, the processor 1005 is coupled to the system bus 1004 using a connection 1018. Likewise, the memory 1006 and optical disk drive 1012 are coupled to the system bus 1004 by connections 1019. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.
[00049] The methods described may be implemented using the computer system 1000 wherein the processes of Figs. 1-9, to be described, may be implemented as one or more software application programs 1033 executable within the computer system 1000. In particular, the steps of the methods of Figs. 1 to 9 are effected by instructions 1031 (see Fig. 10B) in the software 1033 that are carried out within the computer system 1000. The software instructions 1031 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
[00050] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1000 from the computer readable medium, and then executed by the computer system 1000. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 1000 preferably effects an advantageous apparatus for implementing the methods described of generating a neural network architecture.
[00051] The software 1033 is typically stored in the HDD 1010 or the memory 1006. The software is loaded into the computer system 1000 from a computer readable medium, and executed by the computer system 1000. Thus, for example, the software 1033 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1025 that is read by the optical disk drive 1012. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 1000 preferably effects an apparatus for implementing the methods described of generating a neural network architecture.
[00052] In some instances, the application programs 1033 may be supplied to the user encoded on one or more CD-ROMs 1025 and read via the corresponding drive 1012, or alternatively may be read by the user from the networks 1020 or 1022. Still further, the software can also be loaded into the computer system 1000 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1000 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1001. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1001 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00053] The second part of the application programs 1033 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1014. Through manipulation of typically the keyboard 1002 and the mouse 1003, a user of the computer system 1000 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1017 and user voice commands input via the microphone 1080.
[00054] Fig. 10B is a detailed schematic block diagram of the processor 1005 and a "memory" 1034. The memory 1034 represents a logical aggregation of all the memory modules (including the HDD 1009 and semiconductor memory 1006) that can be accessed by the computer module 1001 in Fig. 10A.
[00055] When the computer module 1001 is initially powered up, a power-on self-test (POST) program 1050 executes. The POST program 1050 is typically stored in a ROM 1049 of the semiconductor memory 1006 of Fig. 10A. A hardware device such as the ROM 1049 storing software is sometimes referred to as firmware. The POST program 1050 examines hardware within the computer module 1001 to ensure proper functioning and typically checks the processor 1005, the memory 1034 (1009, 1006), and a basic input-output systems software (BIOS) module 1051, also typically stored in the ROM 1049, for correct operation. Once the POST program 1050 has run successfully, the BIOS 1051 activates the hard disk drive 1010 of Fig. 10A. Activation of the hard disk drive 1010 causes a bootstrap loader program 1052 that is resident on the hard disk drive 1010 to execute via the processor 1005. This loads an operating system 1053 into the RAM memory 1006, upon which the operating system 1053 commences operation. The operating system 1053 is a system level application, executable by the processor 1005, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00056] The operating system 1053 manages the memory 1034 (1009, 1006) to ensure that each process or application running on the computer module 1001 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1000 of Fig. 10A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 1034 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 1000 and how such is used.
[00057] As shown in Fig. 10B, the processor 1005 includes a number of functional modules including a control unit 1039, an arithmetic logic unit (ALU) 1040, and a local or internal memory 1048, sometimes called a cache memory. The cache memory 1048 typically includes a number of storage registers 1044 - 1046 in a register section. One or more internal busses 1041 functionally interconnect these functional modules. The processor 1005 typically also has one or more interfaces 1042 for communicating with external devices via the system bus 1004, using a connection 1018. The memory 1034 is coupled to the bus 1004 using a connection 1019.
[00058] The application program 1033 includes a sequence of instructions 1031 that may include conditional branch and loop instructions. The program 1033 may also include data 1032 which is used in execution of the program 1033. The instructions 1031 and the data 1032 are stored in memory locations 1028, 1029, 1030 and 1035, 1036, 1037, respectively. Depending upon the relative size of the instructions 1031 and the memory locations 1028-1030, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1030. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1028 and 1029.
[00059] In general, the processor 1005 is given a set of instructions which are executed therein. The processor 1005 waits for a subsequent input, to which the processor 1005 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1002, 1003, data received from an external source across one of the networks 1020, 1022, data retrieved from one of the storage devices 1006, 1009, or data retrieved from a storage medium 1025 inserted into the corresponding reader 1012, all depicted in Fig. 10A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1034.
[00060] The disclosed arrangements use input variables 1054, which are stored in the memory 1034 in corresponding memory locations 1055, 1056, 1057. The described arrangements produce output variables 1061, which are stored in the memory 1034 in corresponding memory locations 1062, 1063, 1064. Intermediate variables 1058 may be stored in memory locations 1059, 1060, 1066 and 1067.
[00061] Referring to the processor 1005 of Fig. 10B, the registers 1044, 1045, 1046, the arithmetic logic unit (ALU) 1040, and the control unit 1039 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1033. Each fetch, decode, and execute cycle comprises:
a fetch operation, which fetches or reads an instruction 1031 from a memory location 1028, 1029, 1030;
a decode operation in which the control unit 1039 determines which instruction has been fetched; and
an execute operation in which the control unit 1039 and/or the ALU 1040 execute the instruction.
[00062] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1039 stores or writes a value to a memory location 1032.
[00063] Each step or sub-process in the processes of Figs. 1 to 9 is associated with one or more segments of the program 1033 and is performed by the register section 1044, 1045, 1046, the ALU 1040, and the control unit 1039 in the processor 1005 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 1033.
[00064] The methods described may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of Figs. 1 to 9. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
[00065] Fig. 1 shows a method 100 of generating a secondary network architecture for performing a particular task. The method 100 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00066] Inputs to the method 100 are a primary neural network architecture 140 (also referred to as a primary architecture) and a Recurrent Neural Network 120. The Recurrent Neural Network 120 is an LSTM network in the example described. Additional inputs can include one or more of user constraints (objectives) 160, a range of K intermediate networks to be generated (not shown) and a number of iterations N for Bayesian optimization. The examples described use the user constraints 160. The inputs may be stored in a database in the memory 1006 and selected by the user. Alternatively, the inputs may be input or selected by the user using an interface executing on the display 1014 and input 1013 of the device 1001 or received from a remote device via the network connection 1021.
[00067] The method 100 starts at a training step 110. The training step 110 trains the primary architecture on the required task and performs an evaluation of the results of the trained primary architecture. Training and evaluation of the primary network architecture 140 on the particular task at 110 can be implemented using known techniques by inputting a training dataset, updating network parameters using stochastic gradient descent based methods and then evaluating performance. The training and validation datasets, not shown in Fig. 1, can be stored in the memory 1006 or input by the user of the device 1001.
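The training and evaluation of step 110 can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the function name "train_and_evaluate", the optimiser settings and the use of classification accuracy as the evaluation measure are illustrative choices, not part of the described arrangement.

```python
import torch
import torch.nn as nn


def train_and_evaluate(model, train_loader, val_loader, epochs=10, lr=0.01):
    """Train a network with stochastic gradient descent and return its
    validation accuracy, as performed at step 110."""
    criterion = nn.CrossEntropyLoss()
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            optimiser.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimiser.step()
    # Evaluate on the validation set.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.numel()
    return correct / total
```

The same routine can be reused at steps 320 and 355 to train and evaluate the generated secondary architectures.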
[00068] The method 100 continues from step 110 to a sampling step 130. The sampling step 130 receives the primary architecture associated with a plurality of hyperparameters at the Recurrent Neural Network (LSTM) 120. Parameters of the LSTM are determined by sampling parameters associated with the primary architecture 140 using a trained Gaussian process at execution of step 130. Operation of step 130 is described in relation to a method 200 below.
[00069] The method 100 continues from step 130 to a generating step 150. The step 150 operates to generate a secondary neural network architecture, being a secondary RNN, using the parameters sampled at step 130 and the primary architecture encoding 140. The secondary neural network architecture (also referred to as a secondary or final architecture) is generated using a plurality of reduction factors associated with the determined plurality of parameters to generate a set of layers of the secondary neural network architecture. Each of the generated layers corresponds to a layer of the primary architecture 140. Operation of step 150 is described hereafter with respect to Fig. 9. The secondary neural network architecture is structured to perform the task. The method 100 ends upon execution of step 150.
[00070] The step of sampling LSTM parameters 130 is further described with reference to a method 200 shown in Fig. 2. The method 200 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00071] The method 200 starts with a training step 210. The Gaussian process is trained at step 210. Operation of the step 210 is described with reference to Fig. 3 hereafter. The method 200 continues from step 210 to an obtaining step 240. Step 240 executes to obtain the "best" numerical values of the LSTM parameters. The "best" values selected are the parameters that achieve the user constraints 160, or the parameters that give a closest (for example based upon a distance or error) result to the user constraints 160. Accordingly, the user parameters are determined based on the user constraints 160 if received. Alternatively, the "best" parameters may relate to parameters generated by the Gaussian process if constraints 160 have not been provided. The step 210 is now described in further detail.
[00072] Fig. 3 shows a method 300 of training a Gaussian process, as implemented at step 210. The method 300 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00073] The method 300 starts with an initialization step 310. Step 310 operates to randomly initialize K sets of LSTM parameters of the RNN 120. A method 400 as implemented at step 310 is now described using Fig. 4. The method 400 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00074] The method 400 starts at an initializing step 410. Step 410 executes such that K instances of the RNN 120 are randomly initialized within a user specified range for a user selected value of K. Alternatively, a default range can be used. A range of [-3, 3] is considered typical for the parameters of the RNN 120. The method 400 continues to a generating step 420. Step 420 generates K rows of flattened LSTM parameters using the initialised sets of step 410. Each row of flattened parameters corresponds to the parameters of an individual LSTM cell. The number of LSTM cells represented by K is also a user selected value, for example input via an interface. In an embodiment, K = 10 is used. The range used at step 410 and the value of K can be input by the user when inputting the constraints 160. The method 400 ends after implementation of step 420.
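The random initialization of steps 410 and 420 can be sketched as below. The function name and the row length of 120 are illustrative assumptions; in practice the row length is the number of weights in one flattened LSTM cell.

```python
import numpy as np


def initialise_lstm_parameter_rows(k=10, row_length=120, low=-3.0, high=3.0, seed=0):
    """Randomly initialise K rows of flattened LSTM cell parameters.

    Each row holds every weight of one LSTM cell flattened into a vector,
    drawn uniformly from the user specified range [low, high].
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(k, row_length))


# K = 10 rows within the typical range [-3, 3]:
rows = initialise_lstm_parameter_rows()
```

Each of the K rows later seeds one intermediate architecture at step 315.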
[00075] Returning to Fig. 3, the method 300 continues from step 310 to a generating step 315. Step 315 operates to generate a set of secondary neural network architectures, also referred to as intermediate neural network architectures. Step 315 implements K instances of generating a secondary architecture, as described hereafter in relation to Fig. 9. With the input of encodings of layers of the primary network architecture 140, K secondary network architectures are generated upon execution of step 315. The method 300 continues to an evaluating step 320. Step 320 operates to train and evaluate the K secondary architectures.
[00076] Operation of step 320 is now described with reference to a method 500 shown in Fig. 5. The method 500 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00077] The method 500 receives a training dataset 510, a validation dataset 530, a set of secondary (intermediate) architectures 590 (generated by the step 315) and the user constraints 160 as inputs. The method 500 starts at a training step 520. The datasets 510 and 530 can be the datasets used at step 110. Each of the K secondary network architectures 590 is trained at step 520 for a number of epochs (2 epochs for example) using the training dataset 510. The number of epochs may be determined heuristically. After the step 520 the method 500 continues to a validation step 540. Step 540 inputs the validation dataset 530 to validate the trained secondary architecture generated at step 520.
[00078] Upon completion of the validation stage 540 the method 500 continues to a validation performance step 550. A performance (L_v) is measured against the validation dataset 530 at step 550.
[00079] The method 500 continues from step 550 to a determining step 570. Step 570 comprises measuring overall performance of the secondary (intermediate) neural network architectures. The measurement can be based on the user constraints 160. For every secondary network architecture 590, additional quantities (L_i) are measured depending on the user specified objectives (constraints) 160 such as total number of parameters, total number of layers, total FLOP count (count of floating point operations), total memory requirement and the like. At step 570, an overall performance measure (L_O) is determined for every secondary network architecture using Equation (1).

L_O = 0.6 L_v + (0.4/n) Σ_{i=1}^{n} L_i    (1)

[00080] In Equation (1) n is the number of additional performance measures (relating to the user constraints) apart from the validation performance to be measured for an individual secondary network architecture. As shown in Equation (1), the overall performance measure is based upon the measured performance and the additional quantities, and in particular is affected by the number of additional performance measures. Increasing numbers of performance measures can operate to reduce the overall performance measure. In one implementation the total number of parameters and total number of layers are measured. An alternative implementation measures FLOP count as a performance measurement. Another embodiment could relate to total memory requirement or any differentiable or non-differentiable user specified objective. This overall performance function handles all the objectives in an efficient way and provides a single performance measure L_O. The method 500 ends after execution of step 570.
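A minimal sketch of the overall performance measure follows. The 0.6/0.4 split and the (0.4/n) averaging over the additional measures are a reconstruction of Equation (1) from the surrounding description (the original equation is partially garbled in the source), and the function name is an illustrative assumption; each L_i is assumed to be a normalised score.

```python
def overall_performance(validation_performance, additional_measures):
    """Combine validation performance L_v with n additional measures L_i.

    Implements L_O = 0.6 * L_v + (0.4 / n) * sum(L_i), so each extra
    user objective contributes a smaller share as more objectives are added.
    """
    n = len(additional_measures)
    if n == 0:
        return 0.6 * validation_performance
    return 0.6 * validation_performance + (0.4 / n) * sum(additional_measures)
```

For example, with a validation performance of 0.9 and two additional normalised measures 0.5 and 0.7, the overall measure is 0.6 x 0.9 + 0.2 x 1.2 = 0.78.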
[00081] Returning to Fig. 3, once K secondary architectures are trained and overall performance measures are obtained for all the architectures at step 320, the method 300 continues to a fitting step 330. The Gaussian process is fitted at 330 using input of the K rows of flattened LSTM parameters and K overall performance measures. The Gaussian process can be implemented using tools such as Python libraries and the like. The Gaussian process represents a function that maps the LSTM parameters to the overall performance measure, in order to select parameters most suitable for the secondary network on the basis of the user constraints.
[00082] Fig. 6 shows a method 600 of fitting a Gaussian process as implemented at step 330. The method 600 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 600 receives as inputs K sets of LSTM parameters 605 (generated at step 420) and a performance measure 670 as generated at step 570.
[00083] The first step of the fitting Gaussian method 600 is an assuming step 610. A kernel is assumed or estimated at step 610. Since the input to the method 600 is continuous valued and the overall performance, i.e. the output, is typically stochastic and noisy, an additive mixture of three kernels, being a Matern kernel, a white kernel and a constant kernel, is assumed at step 610. The kernel estimation function is described in Equation (2).

Kernel = Matern + WhiteKernel + ConstantKernel    (2)
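The kernel of Equation (2) and the subsequent Gaussian process fit can be sketched with scikit-learn, which is one possible library for this purpose (the source does not name a specific library); the hyperparameter values and the toy data shapes below are illustrative. Fitting the regressor maximises the log marginal likelihood, as at steps 620-630.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Additive mixture of the three kernels assumed at step 610 (Equation (2)).
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=1.0) + ConstantKernel(1.0)

# Fitting maximises the log marginal likelihood of the observed
# (flattened LSTM parameters, overall performance measure) pairs.
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, normalize_y=True)

parameter_rows = np.random.uniform(-3, 3, size=(10, 8))   # K = 10 toy rows
performances = np.random.rand(10)                          # toy overall measures
gp.fit(parameter_rows, performances)
mean, std = gp.predict(parameter_rows, return_std=True)
```

The fitted regressor then supplies the predicted mean and standard deviation used by the acquisition function of method 700.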
[00084] Equation (2) provides a particular embodiment for estimating the kernel. Alternative embodiments will depend on the user constraints and user specified alternative kernel functions such as a squared exponential kernel, rational quadratic kernel, periodic kernel, exponential kernel or custom kernel function and the like. Another alternative embodiment could include an additive combination (K = Σ_i k_i), a multiplicative combination (K = Π_i k_i) or a combination of both (K = (Σ_i k_i) Π_j k_j) of individual kernel types from the above choices. Once the choice of kernel function is finalized, the method 600 continues to an initializing step 620.
[00085] Step 620 executes to initialize the hyperparameters of LSTM parameters using the kernel function of step 610 and construct a log of marginal likelihood of fitting using the input and resultant output samples. Step 620 is executed for each of the K sets of LSTM cells (parameters) 605.
[00086] The method 600 continues from step 620 to a maximising step 630. In step 630 the log marginal likelihood is maximized (equivalently, the negative log marginal likelihood is minimized) to obtain hyperparameters within a specified range provided by the user to provide the "best" fit of the LSTM parameters 605 and the overall performance measure 670 of the generated secondary network architectures. The step 630 is executed using each of the K sets of parameters 605. The method 600 continues from step 630 to a generating step 640. The method 600 generates an updated Gaussian process in execution of the step 640 using the results of step 630. Step 640 operates to generate a single Gaussian process based on the K sets of parameters 605. The method 600 outputs the single Gaussian process at step 640 and ends.
[00087] Returning to Fig. 3, once the Gaussian process is constructed and updated based on the K initialized input output pairs at step 330, the Gaussian process may be used in step 240 of Fig. 2. In the implementation described in Fig. 3, Bayesian optimization is used to further update the Gaussian process. Bayesian optimization is carried out in several steps as discussed below.
[00088] In the example of Fig. 3, the method 300 continues from step 330 to a check step 335. The Bayesian optimization can be carried out in a number of iterations N. The number of iterations N can be input by a user or set to a default number based upon previous results or experimentation. When the method 300 proceeds from step 330 to step 335 the current iteration number is set to zero (0).
[00089] The check step 335 checks if the number of iterations is equal to N. If not, "No" at step 335, the current iteration is incremented, the method 300 continues to a constructing step 351 and the Bayesian optimization commences. As a first step to the Bayesian optimization, an acquisition function (Acq(x)) is constructed at step 351 which acts as a surrogate to the actual overall performance measure determined at 570.
[00090] The step 351 can be implemented as the method 700 shown in Fig. 7. The method 700 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 700 receives an updated Gaussian process 740 and corresponding sampled LSTM parameters 710 as input. The Gaussian Process 740 and the sample LSTM parameters 710 are selected based on operation of step 330 or 356 depending on the current iteration. The acquisition function (Acq(x)) chosen for one implementation is an expected improvement (EI) criterion (α_EI) which can be constructed according to Equation (3).
α_EI(x; θ, D) = ∫ max(y_best − y, 0) p(y | x; θ, D) dy    (3)

α_EI(x; θ, D) = σ(x; θ, D) (γ(x) Φ(γ(x)) + φ(γ(x)))

where γ(x) = (y_best − μ(x; θ, D)) / σ(x; θ, D), Φ denotes the standard normal cdf and φ denotes the standard normal pdf.
[00091] In Equation (3) D represents the distance function for the used kernel (from implementation of step 330 or step 356 depending on the iteration N), and θ represents the hyperparameters of the kernel function used. For any new sample (x) of the flattened LSTM parameters 710, the updated Gaussian process 740 returns a predicted mean (μ) and standard deviation (σ) of the overall performance measure (y) (determined at step 320 or 355 depending on the current iteration). The method 700 starts at a determining step 720. Step 720 determines the predicted mean (μ) and standard deviation (σ) of the parameters 710 using the Gaussian process 740.
[00092] The method 700 continues from step 720 to a determining step 730. Based on the prediction determined at step 720, the expected improvement criterion is determined at step 730. The expected improvement criterion is determined by following Equation (3) above, where Φ denotes the cumulative distribution function and φ denotes the probability density function. Alternative embodiments for the acquisition function include probability of improvement, upper confidence bound, information theoretic approaches and the like. In the example of Equation (3), the expected improvement criterion is a function of the predicted mean (μ), a best result y_best, and standard deviation (σ). The method 700 outputs the acquisition function at step 730 and ends.
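The expected improvement computation of steps 720 and 730 can be sketched as below. The function name is an illustrative assumption; any model whose predict() returns a mean and standard deviation can play the role of the Gaussian process 740. The form shown follows the common convention of improvement over y_best when minimising the measure; the sign of (y_best − mean) is flipped for maximisation.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(x, gp, y_best):
    """Expected improvement criterion of Equation (3).

    x is a flattened LSTM parameter sample, gp a fitted surrogate model
    and y_best the best overall performance measure observed so far.
    """
    mean, std = gp.predict(np.atleast_2d(x), return_std=True)
    std = np.maximum(std, 1e-9)               # guard against zero variance
    gamma = (y_best - mean) / std
    return std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```

The returned value is always non-negative and grows where the surrogate predicts either a good mean or high uncertainty.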
[00093] Returning to the method 300, the next step after step 351 is step 352 to optimize the acquisition function (Acq(x)). Step 352, operates to optimize the acquisition function based on Equation (4).
x* = argmax_{x* ∈ X_p*} Acq(x*), X_p* = {argmax_{x ∈ X_p} Acq(x)}    (4)
[00094] Step 352 can be implemented by a method 800 shown in Fig. 8. The method 800 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 800 receives an acquisition function 805 constructed at step 351 and a Gaussian Process 850 (generated at step 330 if the current iteration is one (1), otherwise generated at step 356) and outputs an adjusted acquisition function.
[00095] The method 800 starts at a step 810. In step 810, p samples (rows of flattened LSTM parameters) are randomly generated within the same user specified range as used at step 410 (Xp).
[00096] The method 800 continues at a maximising step 820. The acquisition function is optimized or adjusted at step 820 using the Gaussian Process 850. Step 820 starts from each of the p samples to avoid local minima. Execution of step 820 results in p new samples (Xp*) of LSTM parameters and their corresponding values of acquisition functions. The method 800 continues from step 820 to a step 840. Step 840 selects a "best" sample based on the maximized acquisition function value.
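The multi-start maximisation of steps 810 to 840 can be sketched as follows. The function name, the default of p = 20 starts and the use of scipy's bounded minimiser are illustrative assumptions; the essential structure — random starts within the user specified range, a local maximisation from each, and selection of the best result — mirrors the method 800.

```python
import numpy as np
from scipy.optimize import minimize


def optimise_acquisition(acq, dim, p=20, low=-3.0, high=3.0, seed=0):
    """Maximise an acquisition function from p random starting points.

    p random samples are drawn within the user specified range (step 810),
    a local maximisation is run from each to avoid local optima (step 820),
    and the best resulting sample is selected (step 840).
    """
    rng = np.random.default_rng(seed)
    starts = rng.uniform(low, high, size=(p, dim))
    best_x, best_val = None, -np.inf
    for x0 in starts:
        # scipy minimises, so the acquisition is negated to maximise it.
        result = minimize(lambda x: -acq(x), x0, bounds=[(low, high)] * dim)
        if -result.fun > best_val:
            best_x, best_val = result.x, -result.fun
    return best_x, best_val


# Toy acquisition with a single maximum at x = [1, 1]:
best_x, best_val = optimise_acquisition(lambda x: -np.sum((x - 1.0) ** 2), dim=2)
```

In the described arrangement, acq would be the expected improvement criterion of Equation (3) and dim the length of one flattened LSTM parameter row.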
[00097] Returning to Fig. 3, the method 300 continues from step 352 to an obtaining step or selecting step 353. In step 353 a next sample (x*) is selected using the acquisition function
adjusted in step 352.
[00098] The method 300 continues from step 353 to a generating step 354. Using the sample of LSTM parameters selected at step 353 and the encoding of hyperparameters of every layer of the primary network architecture 140, a new intermediate (secondary) network architecture is generated at step 354. Step 354 operates in a similar manner to steps 150 and 315. The neural network architectures generated at steps 315 and 354 are referred to as "intermediate" architectures as in some instances the final architecture is generated at step 150.
[00099] Referring to Fig. 3, the method 300 continues from step 354 to 355. In step 355, the new secondary architecture is trained and evaluated on the particular task in the same manner as the method 500. From step 355 the method 300 continues to a step 356. The Gaussian process generated in the last round of iteration (or at step 330 if the current iteration is one) is updated in execution of step 356. The method 300 continues from step 356 to step 335. Step 335 determines if the number of iterations N has been reached. If not ("No" at step 335), the current iteration is incremented and the iterative loop from step 335 to step 356 is repeated. If the number of iterations has been reached ("Yes" at step 335), the method 300 continues to a selecting step 340.
[000100] Therefore, in the example of Fig. 3, all the processes after the first fitting of Gaussian process on the initialized K samples are repeated N times (where N is a user specified number) to update the Gaussian process using Bayesian optimization.
[000101] The step 340 selects the trained Gaussian process after finishing N iterations of Bayesian optimization. The architecture which has the highest value for the overall performance measure (as determined at step 330 or step 356) is selected as the final architecture. The method 300 ends after execution of step 340.
[000102] As shown in relation to steps 315 to 330 and steps 354 to 356, the sampled LSTM parameters are determined using a plurality of intermediate network architectures associated with the primary network architecture 140. Each of steps 330 and 356 operates to fit the Gaussian process to a parameter of a plurality of intermediate neural network architectures.
[000103] The step of generating a secondary architecture using an LSTM network and a teacher network is now described in detail. The generation of the secondary network architecture from the primary network architecture and a single row of flattened LSTM parameters follows several steps, as implemented at step 150, and is described with reference to a dataflow 900 shown in Fig. 9. Steps of the dataflow 900 are typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[000104] The dataflow 900 receives the architecture 140 as input. The primary network architecture 140 has several layers. Three of the layers of the architecture 140 are shown as 910a, 910b and 910c in Fig. 9. The input to an LSTM cell 930 at each time step is the layer encoding, which comprises a set of 4 values, being kernel size, stride, number of output channels and padding respectively. Kernels, stride and padding are considered to be symmetric and two dimensional in the arrangements described. Symmetric layer encoding differentiates various layers in an efficient way. For example, the encoding for layer 910a is 920a, which shows that the first layer has kernel size 3 in both dimensions with stride 1 in both dimensions, number of output channels 64 and padding 1 in both dimensions. The encoding 920a suggests that 910a is a convolution layer. In contrast, the next layer 910b has kernel size 2 in both dimensions with stride 1 in both dimensions while the number of output channels and padding are zeros. The encoding 920b suggests that the layer 910b is a pooling layer. Similarly, encoding 920c suggests that the layer 910c is a convolution layer. The modification configuration used in Fig. 9 is a single-layer modification. The encoding of individual layers acts as the input to the corresponding LSTM cell, for example input to 930 for the layer 910a, at each time step. Each LSTM is bi-directional, as indicated by arrows 940, to model the inter-layer relationship. The parameters are shared across each time step to make the complexity of the Gaussian process invariant to the depth or number of layers of the primary network architecture.
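The modification configuration of Fig. 9 can be sketched as follows, assuming a PyTorch implementation. The class name and the hidden size of 32 are illustrative assumptions; what the sketch preserves is the structure described above — a bidirectional LSTM over 4-value layer encodings, a linear layer, and a sigmoid producing 6 values per layer.

```python
import torch
import torch.nn as nn


class ModificationNetwork(nn.Module):
    """Bidirectional LSTM over per-layer encodings, as sketched in Fig. 9.

    Each time step consumes one 4-value layer encoding (kernel size, stride,
    number of output channels, padding) and emits 6 sigmoid values: four
    reduction factors, an existence flag and a skip-connection flag.
    """

    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
                            bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * hidden_size, 6)

    def forward(self, layer_encodings):
        # layer_encodings: (batch, num_layers, 4)
        outputs, _ = self.lstm(layer_encodings)
        return torch.sigmoid(self.linear(outputs))   # (batch, num_layers, 6)


encodings = torch.tensor([[[3., 1., 64., 1.],    # convolution layer (920a)
                           [2., 1., 0., 0.]]])   # pooling layer (920b)
controls = ModificationNetwork()(encodings)       # one 6-vector per layer
```

Because the LSTM parameters are shared across time steps, the same cell processes every layer encoding regardless of network depth, as noted above.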
[000105] The example shown shares the LSTM parameters irrespective of the layer type. An alternative arrangement would be to share the LSTM parameters only within layers of similar types, such as one LSTM cell shared only within convolutional layers and a different LSTM cell shared only within pooling layers and so on. The output of the LSTM cell 930 is input to a linear layer 950 of the modification configuration. The output of the linear layer 950 is input to a sigmoid activation layer 960. The sigmoid activation layer produces an output 970 consisting of 6 float values ranging between 0 and 1. The semantic meaning of the output 970, generated using, and thereby associated with, the sampled LSTM parameters, is as follows:
• the first 4 values represent reduction factors corresponding to the input encoding,
• the fifth value represents a flag value for existence, and
• the sixth value represents a flag to decide whether a skip connection will be introduced or not.
[000106] The order of the reduction factors and the existence and skip values can be varied in some implementations. Based on the output 970, the corresponding input encoding is modified to generate a corresponding layer 980 of the secondary network architecture. Certain design constraints are imposed. Examples of design constraints include: the maximum and minimum values of output channels are considered to be 512 and 16 respectively; the minimum value of kernel size for max-pooling is considered to be 2x2; subsequent layers cannot be of the same type except for convolution; activation layers cannot exist immediately after pooling layers; batchnorm layers cannot exist immediately after pooling and activation layers; the minimum value of kernel size for convolution is 1x1; and kernel size for convolution is forced to be odd numbers such as 3x3, 5x5 and the like. The design constraints used depend on the structure of the layers of the LSTM network.
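A few of the numeric design constraints above can be sketched as a clamping step. The function name and the convention of rounding an even convolution kernel up to the next odd size are illustrative assumptions; the source states the constraints but not how violations are resolved.

```python
def apply_design_constraints(kernel, stride, channels, padding, layer_type):
    """Clamp proposed layer hyperparameters to the stated design constraints.

    Convolution output channels are clamped to [16, 512] and convolution
    kernels are forced to odd sizes of at least 1 (3x3, 5x5 and the like);
    max-pooling kernels are at least 2x2.
    """
    if layer_type == "conv":
        channels = min(max(int(channels), 16), 512)
        kernel = max(int(kernel), 1)
        if kernel % 2 == 0:           # force odd kernel sizes (assumption: round up)
            kernel += 1
    elif layer_type == "pool":
        kernel = max(int(kernel), 2)  # minimum 2x2 max-pooling kernel
    return kernel, stride, channels, padding
```

For example, a proposed 4x4 convolution with 1000 output channels would be adjusted to a 5x5 convolution with 512 channels.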
[000107] Each corresponding layer (such as 980) generated is used to generate the secondary architecture at step 990. The steps followed to construct the secondary network architecture at step 990 are:
(a) The first layer of the primary architecture is kept same in the secondary architecture.
(b) Linear layers or feedforward layers in the context of the secondary network architecture are deleted (due to having too many parameters) except for the last feedforward layer.
(c) Based on the outputs of the LSTM network the second existing convolution layer is identified which is not a part of an existing skip connection in the primary network architecture. New skip connections are only introduced on the following convolution layer(s) based on the value of the last flag and if the following convolution layer is not a part of an existing skip connection in the primary network architecture. New skip connections are limited to subsequent layers only, that is, skip connections cover only one convolution layer. Two types of skip connections can exist in the current framework: (i) where the input is added to the immediate output and (ii) where the input is appended to the immediate output along the channel dimension. The type of new skip connection is chosen with equal probability.
(d) Changing the layer hyperparameters based on the output of the LSTM network follows the rules below:
1. The output size of the previous layer is identified first using a single input tensor of the same dimension as the training dataset tensors.
2. The dimensions of kernel size, stride, output channels and padding for a layer in the secondary architecture are computed based on multiplying the corresponding dimensions in the primary network architecture by the reduction factors. The product is rounded off to integers.
3. The dimensions of the above hyperparameters for each layer are again modified to match the input-output size consistency. Additionally, for a particular layer, based on the type of the skip connection introduced (if any), the dimensions of the layer's hyperparameters are again adjusted. The modification can relate to a multiplication operation or adjustment of number of channels to be compatible with the skip connection.
(e) For an existing skip connection in the primary network architecture, the type of the connection is maintained in the secondary architecture and the layers within the connection are modified or deleted based on the rules above. However, no new skip connections are introduced for any of these layers in this case.
(f) Depending on the requirements of the task, additional downsampling or upsampling layers can be added at the end of the secondary network architecture. Downsampling or upsampling layers can occur, for example, when generating a part network which is connected to another fixed network. In such conditions, requirements on output size can exist, resulting in a requirement for additional downsampling or upsampling layers. An example is generating a head network part for a single shot detection (SSD) network which has a head network part and a detector part.
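The dimension scaling of rule (d)2 above can be sketched as follows. The function name is an illustrative assumption; the subsequent size-consistency and design-constraint adjustments of rule (d)3 are applied separately.

```python
def reduce_layer(primary_dims, reduction_factors):
    """Scale a primary layer's dimensions by the sampled reduction factors.

    Each of kernel size, stride, output channels and padding in the primary
    architecture is multiplied by its reduction factor and rounded off to an
    integer, per rule (d)2.
    """
    return [round(dim * factor)
            for dim, factor in zip(primary_dims, reduction_factors)]


# A 3x3 convolution with stride 1, 64 channels and padding 1, with the
# channel count halved by its reduction factor:
reduce_layer([3, 1, 64, 1], [0.9, 1.0, 0.5, 1.0])
```

In this example the kernel size 3 x 0.9 = 2.7 rounds back to 3 while the 64 channels are reduced to 32.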
[000108] Following the above design rules, the input-output size consistency among all the generated layers is maintained, and this generates the secondary network architecture 990. Accordingly, the number of layers of the architecture generated at step 990 can be reduced.
[000109] The arrangements described are applicable to the computer and data processing industries and particularly for the machine learning industries.
[000110] The arrangements described allow a Gaussian function to be used in such a manner to improve generation of a secondary neural network architecture by reducing computational complexity in developing the neural network architecture. The "flattened" structure of the architectures generated in steps 150, 315 and 354 further decrease complexity. At the same time, a range of variation between the primary neural network architecture and the final generated neural network architecture can be increased compared to traditional, discrete solutions. Ability of the user to set constraints and the constraints to be accounted improves adaptability of the generated neural network for different practical implementations. The Bayesian optimization further operates to increase likelihood the generated neural network architecture is suitable for performing the required task.
[000111] For example, a user may want to generate a neural network architecture for implementing a particular task such as classification of an object or detection of an object in a scene. The generated neural network architecture may be intended to be deployed on a particular type of device such as a mobile device. The user can implement the method 100 providing constraints associated with operation on a mobile device such as reduced memory storage and/or reduced floating point operations. The method 100 is implemented and the resultant neural network architecture generated is suitable for implementation on the deployment device (for example transmitted to a deployment device via the network 1020).
[000112] An example of automated architecture generation using the invention is described. The problem is chosen as the CIFAR-10 classification task. The CIFAR-10 problem has 10 classes. Two different primary architectures are considered, being VGG16 with one fully connected layer and DenseNet-121. The CIFAR-10 problem has 50000 training data and 10000 test data. In the context of the arrangements described, the training data has been further divided into 40000 training data and 10000 validation data. A secondary architecture is generated for the CIFAR-10 task using the method 100.
[000113] Existing methods typically focus on increasing performance of the generated architectures only. Therefore, the generated architectures may not satisfy other constraints or objectives such as reducing the total number of parameters, reducing the total FLOP count or reducing the total memory requirement.
[000114] The arrangements described are able to generate a secondary network architecture starting from VGG16 which has 66% fewer parameters while sacrificing only 0.7% accuracy compared to VGG16, whereas in the second case the generated secondary network architecture had 72% fewer parameters with a sacrifice of 4% in accuracy compared to DenseNet-121.
[000115] A second example of use is to perform detection on Pascal VOC 2007. The dataset contains 20 classes excluding the background. There are a total of 5,011 training and validation samples and 4,952 test samples. The training and validation data has been divided into 4,760 training samples and 251 validation samples. The primary network architecture is considered to be SSD512 with VGG19. A secondary architecture is generated using the method 100. The resultant secondary architecture had almost a 30% parameter reduction with a 1-2% sacrifice in performance, in terms of mean average precision, compared to the primary architecture.
[000116] The proposed method is also useful in applications where a secondary network is required to have a relatively low number of parameters in order to meet the requirements of a lower-resourced implementation platform such as a mobile device. The methods described also allow choosing other parameters, such as the number of FLOPs, for optimizing the secondary network architecture. Optimizing the number of FLOPs allows the resultant neural network architecture to be used in applications where execution time is an important consideration, such as online object detection.
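A constraint-aware score of the kind used to rank candidate architectures (cf. the overall performance measure at step 550 of Fig. 5) could take the following form. The specification does not give a closed form; the weighted-penalty shape and the alpha/beta weights below are illustrative assumptions.

```python
def overall_performance(accuracy, n_params, n_flops, max_params, max_flops,
                        alpha=0.5, beta=0.5):
    """Combine validation accuracy with user constraints on parameter count
    and FLOPs into a single score to maximise. Candidates within budget are
    ranked purely on accuracy; over-budget candidates are penalised in
    proportion to how far they exceed the budget."""
    param_penalty = max(0.0, n_params / max_params - 1.0)
    flop_penalty = max(0.0, n_flops / max_flops - 1.0)
    return accuracy - alpha * param_penalty - beta * flop_penalty
```

Under this sketch, a candidate twice over the parameter budget loses alpha from its score, steering the optimization toward compact architectures suitable for, e.g., mobile deployment.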
[000117] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[000118] In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied meanings.

Claims:
1. A method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
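A minimal sketch of the reduction-factor step recited above (with the linear and sigmoid functions of claim 11 and the shared cell of Fig. 9) is given below. The array shapes and the variable h, which stands in for the shared RNN (LSTM) cell's per-layer hidden states, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_secondary_layers(primary_channels, W, b, h):
    """For each primary layer, map the shared cell's hidden state through a
    linear head and a sigmoid to a reduction factor in (0, 1), then scale
    that layer's channel count to produce the corresponding secondary layer."""
    secondary = []
    for channels, h_t in zip(primary_channels, h):
        factor = sigmoid(W @ h_t + b)          # reduction factor in (0, 1)
        secondary.append(max(1, int(round(channels * float(factor)))))
    return secondary
```

With zero weights the sigmoid yields 0.5, so a [64, 128] primary would map to a [32, 64] secondary; trained weights would instead produce layer-specific factors.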
2. The method according to claim 1, wherein the plurality of parameters of the RNN are determined based on one or more user constraints associated with completing the task.
3. The method according to claim 2, wherein the user constraints relate to one or more of number of multiplications, number of floating point operations, number of parameters, memory storage requirement, memory access requirement, and parameters associated with a restricted capability device.
4. The method according to claim 1, wherein the plurality of parameters associated with the primary architecture are determined using a plurality of intermediate neural network architectures associated with the primary architecture.
5. The method according to claim 1, wherein the Gaussian process is fitted to parameters of a plurality of intermediate neural network architectures.
6. The method according to claim 1, wherein the Gaussian process is determined based on a Bayesian optimisation.
7. The method according to claim 4, further comprising measuring performance of the plurality of intermediate neural network architectures based on one or more user constraints associated with completing the task.
8. The method according to claim 1, wherein the layers of the secondary architecture are generated using a skip flag and a layer preservation flag associated with the determined plurality of parameters of the RNN.
9. The method according to claim 1, wherein the layers of the secondary architecture are generated based on a modification configuration comprising a linear layer.
10. The method according to claim 9, wherein the layers of the secondary architecture are generated based on a modification configuration comprising a bi-directional LSTM for each layer of the received primary architecture.
11. The method according to claim 1, wherein the plurality of reduction factors are determined using a linear function and a sigmoid function.
12. The method according to claim 1, wherein the secondary neural network is generated by sharing the determined parameters of the RNN irrespective of layer type.
13. The method according to claim 1, wherein the secondary neural network is generated by sharing the determined parameters of the RNN only within layers of the same type.
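The difference between claims 12 and 13 (sharing the sampled RNN parameters across all layers, versus only within layers of the same type) can be made concrete with a toy selector; the dictionary keys used here are illustrative only.

```python
def select_cell_parameters(layer_type, cells, share_across_types=True):
    """Return the RNN (LSTM) cell parameters used for a layer. One shared
    set serves every layer irrespective of type (claim 12); otherwise each
    layer type (e.g. 'conv', 'fc') has its own set (claim 13)."""
    if share_across_types:
        return cells["shared"]
    return cells[layer_type]
```

For example, a convolutional and a fully connected layer would receive identical cell parameters under claim 12 but distinct parameters under claim 13.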
14. The method according to claim 1, wherein the RNN is a Long Short Term Memory (LSTM) network.
15. The method according to claim 1, wherein the Gaussian process is constructed using parameters of the RNN which are continuously valued.
16. A non-transitory computer readable medium having a computer program stored thereon to implement a method of generating a neural network architecture for performing a task, the program comprising:
code for receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
code for determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
code for generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
17. Apparatus, comprising:
a memory; and
a processor configured to execute code stored on the memory to implement a method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
18. A system, comprising:
a memory; and
a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant/Nominated Person
SPRUSON & FERGUSON
[Drawings: sheets 1/11 to 11/11, dated 13 Mar 2019, application 2019201716.]

Fig. 1: flowchart of the method 100. Train and evaluate the primary architecture (110) from the primary architecture encoding (140); sample parameters of the RNN (LSTM) from a trained Gaussian process (130), subject to user constraint(s) (160); generate the secondary architecture (150); end (199).

Fig. 2: flowchart of the method 200, implementing step 130. Train the Gaussian process (210); obtain a sample of RNN (LSTM) parameters (240), subject to user constraint(s) (160); end (299).

Fig. 3: flowchart of the method 300, implementing step 210. Randomly initialise K sets of RNN (LSTM) parameters (310); generate K secondary architectures (315) from the primary architecture encoding (140); evaluate the K secondary architectures (320); fit a Gaussian process (330); until N iterations are finished (340), construct the acquisition function (351), optimise the acquisition function (352), obtain the next sample of RNN parameters (353), generate (354) and evaluate (355) a secondary architecture, and re-fit the Gaussian process (356); output the trained Gaussian process (335); end (399).

Fig. 4: flowchart of the method 400, implementing step 310. Randomly initialise K sets of LSTM parameters (weights, biases) (410); generate K independent LSTM cells (420); end (499).

Fig. 5: flowchart of the method 500, implementing steps 320 and 355. Train the secondary architectures (590) on the problem training dataset (510, 520); determine validation performance (530, 540) on the problem validation dataset; combine the validation performance with user constraints (160) into an overall performance measure (550, 570); end (599).

Fig. 6: flowchart of the method 600, implementing steps 330 and 356. Assume kernel(s) (610) over the RNN (LSTM) parameters (605) and performance measure (670); initialise hyperparameters (620); maximise the negative log of the marginal likelihood (630); generate an updated Gaussian process (640); end (699).

Fig. 7: flowchart of the method 700, implementing step 351. For a sample of LSTM parameters (720) and the updated Gaussian process (710), obtain the mean (µ) and standard deviation (σ) of the performance measure for the sample (740); determine the acquisition function value, e.g. expected improvement (EI) (730); end (799).

Fig. 8: flowchart of the method 800, implementing step 352. Randomly initialise p samples of LSTM parameters (810); maximise the acquisition function value starting from each sample (820), using the updated Gaussian process (805); obtain the best sample of RNN (LSTM) parameters (840); end (899).

Fig. 9: flowchart of the method 900, implementing steps 150, 315 and 354. Encodings of primary layers P1, P2 and P3 from the primary architecture encoding (140) (910a-c, 920a-c) are processed by a shared RNN (LSTM) cell (930, 940) and then by linear (950) and sigmoid (960) layers to produce per-layer outputs and reduction factors (970, 980) used to generate the secondary architecture (990); end (999).

Fig. 10A: schematic block diagram of a general-purpose computer system 1000, including a processor (1005), memory (1006), storage devices (1009) including a HDD (1010), I/O interfaces (1008, 1013), an audio-video interface (1007), a local network interface (1011), an optical disk drive (1012) and an external modem (1016), connected to a local-area communications network (1022) and a wide-area communications network (1020), with peripherals including a video display (1014), keyboard (1002), scanner (1026), disk storage medium (1025), camera (1027), microphone (1080) and printer (1015).

Fig. 10B: detailed schematic block diagram of the processor 1005 and memory (1033, 1034) of Fig. 10A, showing the control unit (1039), ALU (1040), registers (1044-1046) and interface (1042) within the processor, and memory contents including instructions (1028-1031) and data (1035-1037), a ROM (1049) containing the POST (1050), BIOS (1051), bootstrap loader (1052) and operating system (1053), and input (1054-1057), output (1061-1064) and intermediate (1058-1060, 1066-1067) variables.
AU2019201716A 2019-03-13 2019-03-13 System and method of generating a neural network architecture Abandoned AU2019201716A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019201716A AU2019201716A1 (en) 2019-03-13 2019-03-13 System and method of generating a neural network architecture

Publications (1)

Publication Number Publication Date
AU2019201716A1 true AU2019201716A1 (en) 2020-10-01

Family

ID=72608225

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019201716A Abandoned AU2019201716A1 (en) 2019-03-13 2019-03-13 System and method of generating a neural network architecture

Country Status (1)

Country Link
AU (1) AU2019201716A1 (en)


Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US12020476B2 (en) 2017-03-23 2024-06-25 Tesla, Inc. Data synthesis for autonomous control systems
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US12216610B2 (en) 2017-07-24 2025-02-04 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US12086097B2 (en) 2017-07-24 2024-09-10 Tesla, Inc. Vector computational unit
US12307350B2 (en) 2018-01-04 2025-05-20 Tesla, Inc. Systems and methods for hardware-based pooling
US12455739B2 (en) 2018-02-01 2025-10-28 Tesla, Inc. Instruction set architecture for a vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US12079723B2 (en) 2018-07-26 2024-09-03 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US12346816B2 (en) 2018-09-03 2025-07-01 Tesla, Inc. Neural networks for embedded devices
US11983630B2 (en) 2018-09-03 2024-05-14 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US12367405B2 (en) 2018-12-03 2025-07-22 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US12198396B2 (en) 2018-12-04 2025-01-14 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US12136030B2 (en) 2018-12-27 2024-11-05 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US12223428B2 (en) 2019-02-01 2025-02-11 Tesla, Inc. Generating ground truth for machine learning from time series elements
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US12164310B2 (en) 2019-02-11 2024-12-10 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
US12236689B2 (en) 2019-02-19 2025-02-25 Tesla, Inc. Estimating object properties using visual image data
CN113516228A (en) * 2021-07-08 2021-10-19 哈尔滨理工大学 Network anomaly detection method based on deep neural network
US12462575B2 (en) 2021-08-19 2025-11-04 Tesla, Inc. Vision-based machine learning model for autonomous driving with adjustable virtual camera
US12522243B2 (en) 2021-08-19 2026-01-13 Tesla, Inc. Vision-based system training with simulated content
CN113743606B (en) * 2021-09-08 2024-12-17 广州文远知行科技有限公司 Searching method and device for neural network, computer equipment and storage medium
CN113743606A (en) * 2021-09-08 2021-12-03 广州文远知行科技有限公司 A neural network search method, device, computer equipment and storage medium


Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application