
AU2019201716A1 - System and method of generating a neural network architecture - Google Patents


Info

Publication number
AU2019201716A1
Authority
AU
Australia
Prior art keywords
architecture
parameters
rnn
neural network
primary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2019201716A
Inventor
Kalyan Shankar Bhattacharjee
Amit Kumar Gupta
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2019201716A priority Critical patent/AU2019201716A1/en
Publication of AU2019201716A1 publication Critical patent/AU2019201716A1/en
Abandoned legal-status Critical Current

Classifications

    • G06N 3/044 Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

SYSTEM AND METHOD OF GENERATING A NEURAL NETWORK ARCHITECTURE

ABSTRACT

A system and method of generating a neural network architecture for performing a task. The method comprises receiving (110), at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining (130) a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating (150) a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.

22274490_1

Description

[Fig. 1 (sheet 1/11): flow diagram of method 100 - Start; receive primary architecture encoding (140) at the RNN (LSTM) (110); train and evaluate the primary architecture (120); sample parameters of the LSTM from the trained Gaussian process (130), subject to user constraint(s); generate secondary architecture (150); End (199).]
SYSTEM AND METHOD OF GENERATING A NEURAL NETWORK ARCHITECTURE

TECHNICAL FIELD
[0001] The present invention relates to a method of generating a neural network architecture for a given task using Gaussian processes, starting from an initial network architecture that is modified using a Recurrent Neural Network. In particular, the present invention describes a system and method for generating a deep neural network architecture that achieves user-defined objectives, given an initial deep neural network, to solve a particular problem.
BACKGROUND
[0002] A deep neural network is a type of artificial neural network comprising multiple layers between the input and output layers. In the past, designing a deep neural network architecture required significant manual effort in deciding each individual layer's type, hyperparameters and activation units, and the sequence of such layers and activations in the final neural network architecture. Examples of a layer's type include fully connected, convolution, pooling and the like. Examples of a layer's hyperparameters include the number of hidden units for fully connected layers; the number of channels, stride, kernel size and padding for convolution layers; and the kernel size and stride for pooling layers. Examples of a layer's activation units include the rectified linear unit (ReLU), sigmoid, tanh and the like.
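To make the notion of layer hyperparameters concrete, the sketch below shows one possible way to describe a small stack of layers and encode each as a fixed-length numeric vector. This is purely illustrative; the dict fields, one-hot layout and helper names are assumptions, not part of this disclosure.

```python
# Illustrative encoding of per-layer hyperparameters as fixed-length vectors.
# The field layout and defaults below are assumptions for this sketch only.

LAYER_TYPES = {"conv": 0, "fc": 1, "pool": 2}

def encode_layer(layer):
    """Map a layer-description dict to a 7-element feature vector:
    [one-hot type (3), channels, kernel size, stride, hidden units]."""
    vec = [0.0, 0.0, 0.0]
    vec[LAYER_TYPES[layer["type"]]] = 1.0
    vec.append(float(layer.get("channels", 0)))  # convolution layers
    vec.append(float(layer.get("kernel", 0)))    # convolution/pooling layers
    vec.append(float(layer.get("stride", 1)))    # convolution/pooling layers
    vec.append(float(layer.get("hidden", 0)))    # fully connected layers
    return vec

# A toy "primary architecture": convolution -> pooling -> fully connected.
primary = [
    {"type": "conv", "channels": 64, "kernel": 3, "stride": 1},
    {"type": "pool", "kernel": 2, "stride": 2},
    {"type": "fc", "hidden": 128},
]
encoded = [encode_layer(layer) for layer in primary]
```

Each resulting vector could then be supplied to a recurrent network at one time step, one vector per layer of the primary architecture.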
[0003] There exist several known deep neural network architectures such as LeNet, VGG, ResNet, DenseNet and Inception. It is known that training and inference with these architectures is memory intensive. Generating a neural network architecture involves expert tuning of hyperparameters, which is extremely time consuming for a human expert. Techniques for automated generation of efficient neural network architectures have been developed in an effort to reduce the manual effort.
[0004] Most of the existing methods for generating a neural network architecture aim only for higher performance on a given task, which might lead to an even deeper neural network architecture than the existing ones. Aiming only for higher performance poses a significant limitation in terms of achieving user-specified objectives, for example reducing the number of parameters used in the network. Typically, most of the existing methods do not start with a primary architecture. Instead, the existing methods start from random architectures, which effectively slows down convergence of the optimization process. Starting from random architectures also results in slower training of the resulting secondary network architectures when assessing their performance during the optimization. The known methods start with a small convolutional neural network unit and use a Recurrent Neural Network as a controller to devise policies for deciding the layer types, layer hyperparameters and layer sequences. The policies are optimized using a reinforcement learning strategy which only deals with a limited, discrete set of hyperparameter design choices; for example, the number of channels for a convolution may only vary within {48, 64, 128, 256, 512}. Using a limited set of design choices effectively results in a limited search space, and hence the chance of generating an efficient neural network architecture is low. Further, including more choices for individual hyperparameters makes the reinforcement learning increasingly difficult and computationally expensive to optimize.
[0005] Some known methods rely on designing specific kernel(s) to distinguish different network architectures, and use network morphism to connect various network modules or Bayesian optimization to generate potential secondary architectures. However, these methods are very sensitive to the choice of the kernel function(s) and only allow a limited number of hyperparameters to be changed during optimization. Accordingly, the training process is less efficient in terms of finding better network architectures.
[0006] Thus, a need exists for a method of generating neural network architectures that allows a large number of hyperparameters to be changed during optimization.
SUMMARY
[0007] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[0008] One aspect of the present disclosure provides a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[0009] According to another aspect, the plurality of parameters of the RNN are determined based on one or more user constraints associated with completing the task.
[00010] According to another aspect, the user constraints relate to one or more of number of multiplications, number of floating point operations, number of parameters, memory storage requirement, memory access requirement, and parameters associated with a restricted capability device.
[00011] According to another aspect, the plurality of parameters associated with the primary architecture are determined using a plurality of intermediate neural network architectures associated with the primary architecture.
[00012] According to another aspect, the Gaussian process is fitted to parameters of a plurality of intermediate neural network architectures.
[00013] According to another aspect, the Gaussian process is determined based on a Bayesian optimisation.
[00014] According to another aspect, the method further comprises measuring performance of the plurality of intermediate neural network architectures based on one or more user constraints associated with completing the task.
[00015] According to another aspect, the layers of the secondary architecture are generated using a skip flag and a layer preservation flag associated with the determined plurality of parameters of the RNN.
[00016] According to another aspect, the layers of the secondary architecture are generated based on a modification configuration comprising a linear layer.
[00017] According to another aspect, the layers of the secondary architecture are generated based on a modification configuration comprising a bi-directional LSTM for each layer of the received primary architecture.
[00018] According to another aspect, the plurality of reduction factors are determined using a linear function and a sigmoid function.
[00019] According to another aspect, the secondary neural network is generated by sharing the determined parameters of the RNN irrespective of layer type.
[00020] According to another aspect, the secondary neural network is generated by sharing the determined parameters of the RNN only within layers of the same type.
[00021] According to another aspect, the RNN is a Long Short Term Memory (LSTM) network.
[00022] According to another aspect, the Gaussian process is constructed using parameters of the RNN which are continuously valued.
[00023] Another aspect of the present disclosure provides a non-transitory computer readable medium having a computer program stored thereon to implement a method of generating a neural network architecture for performing a task, the program comprising: code for receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; code for determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and code for generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00024] Another aspect of the present disclosure provides an apparatus, comprising: a memory; and a processor configured to execute code stored on the memory to implement a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00025] Another aspect of the present disclosure provides a system, comprising: a memory; and a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of generating a neural network architecture for performing a task, the method comprising: receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters; determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
[00026] Other aspects are also disclosed.
BRIEF DESCRIPTION OF THE DRAWINGS
[00027] At least one example embodiment of the present invention will now be described with reference to the drawings and appendices, in which:
[00028] Fig. 1 shows a method of generating a secondary network architecture constructed using a recurrent neural network from a trained Gaussian process given a primary network architecture;
[00029] Fig. 2 shows a method of sampling recurrent neural network parameters from a trained Gaussian process as used in the method of Fig. 1;
[00030] Fig. 3 shows a method of training Gaussian processes as used in the method of Fig. 2;
[00031] Fig. 4 shows a method of initializing recurrent neural network parameters to construct cells as used in the method of Fig. 3;
[00032] Fig. 5 shows a method of generating an overall performance measure for a secondary architecture as used in the method of Fig. 3;
[00033] Fig. 6 shows a method of fitting a Gaussian process as used in the method of Fig. 3;
[00034] Fig. 7 shows a method of constructing an acquisition function as used in the method of Fig. 2;
[00035] Fig. 8 shows a method of optimizing the acquisition function as used in the method of Fig. 2;
[00036] Fig. 9 shows a dataflow of generating a secondary architecture from a primary architecture given the recurrent neural network; and
[00037] Figs. 10A and 10B form a schematic block diagram of a general purpose computer system upon which arrangements described can be practiced.
DETAILED DESCRIPTION INCLUDING BEST MODE
[00038] Where reference is made in any one or more of the accompanying drawings to steps and/or features, which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.
[00039] A method and system for generating neural network architectures for performing a given task is described below.
[00040] The present disclosure relates to a method of generating neural network architectures, starting by modifying a primary network architecture using a Gaussian process built on a Recurrent Neural Network. The neural network may be generated based on one or more user-specified objectives, also referred to as user constraints, associated with completing a required task. The user constraints can relate to one or more of the number of multiplications, number of floating point operations (FLOPs), number of parameters, memory storage requirement, memory access requirement and parameters associated with a deployment device. Parameters associated with a deployment device can relate to operating parameters of a device with reduced computational capability upon which the generated network is to be used, for example the memory or number of operations permitted for a mobile device.
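As a sketch of how one such constraint might be evaluated, the snippet below counts the parameters of a small convolution/fully-connected stack and checks it against a user-supplied budget. The layer dicts and formulas are hypothetical and simplified (spatial flattening between convolutional and fully connected layers is ignored); this is not the disclosed arrangement's implementation.

```python
# Rough parameter counting for a conv/fc stack, as one example of checking a
# user constraint (here: total parameter count). Formulas include biases;
# spatial dimensions are ignored, so the fc count is a simplification.

def count_parameters(layers, in_channels=3):
    total, prev = 0, in_channels
    for layer in layers:
        if layer["type"] == "conv":
            k, c = layer["kernel"], layer["channels"]
            total += k * k * prev * c + c   # weights + biases
            prev = c
        elif layer["type"] == "fc":
            h = layer["hidden"]
            total += prev * h + h           # weights + biases
            prev = h
    return total

def satisfies_constraint(layers, max_params):
    """True if the architecture fits within the user's parameter budget."""
    return count_parameters(layers) <= max_params
```

For a single 3x3 convolution mapping 3 input channels to 16 output channels, count_parameters returns 3*3*3*16 + 16 = 448, which could then be compared against the user's budget.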
[00041] The present disclosure addresses the problem of generating a neural network architecture with a user-specified objective for a particular task, starting from a primary architecture. The structure of the primary architecture is defined using layers of configuration parameters, henceforth referred to as layer hyperparameters.
[00042] A Recurrent Neural Network such as a Long Short Term Memory (LSTM) network takes an encoding of the hyperparameters of the different layers of the primary architecture as input, one layer at each of a number of time steps. The Recurrent Neural Network outputs reduction factors and two flags (for existence and skip-connection) corresponding to each of the layers. The existence flag is also referred to as a layer preservation flag. Based on the output, each input layer of the primary network architecture is modified to generate the corresponding layer of the secondary network architecture. In order to generate an effective secondary network architecture, the parameters of the LSTM require optimisation. Therefore, a Gaussian process is constructed whose inputs are the LSTM parameters and whose outputs are the overall performance of the secondary network architectures on an input task. Bayesian optimization is utilized to optimize the Gaussian process. The examples described herein relate to an LSTM network; however, other RNN variants such as fully recurrent networks, gated recurrent units, bi-directional LSTMs and the like may be used.
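The controller step described above can be sketched as follows. This is a deliberately simplified stand-in (a plain tanh RNN cell with random, untrained weights rather than an LSTM; all shapes and names are assumptions): for each layer encoding it emits a reduction factor in (0, 1) via a linear map followed by a sigmoid, together with a layer-preservation flag and a skip-connection flag.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, D = 8, 7                        # hidden size, per-layer encoding size
W_h = rng.normal(size=(H, H + D))  # fused recurrent + input weights
W_out = rng.normal(size=(3, H))    # -> reduction, preserve, skip logits

def controller_step(h, x):
    """One time step: consume a layer encoding, emit modification outputs."""
    h = np.tanh(W_h @ np.concatenate([h, x]))  # simplified RNN cell
    logits = W_out @ h
    reduction = sigmoid(logits[0])             # linear + sigmoid -> (0, 1)
    preserve = sigmoid(logits[1]) > 0.5        # layer-preservation flag
    skip = sigmoid(logits[2]) > 0.5            # skip-connection flag
    return h, reduction, preserve, skip

h = np.zeros(H)
for x in np.eye(D)[:3]:                        # three dummy layer encodings
    h, reduction, preserve, skip = controller_step(h, x)

# The reduction factor scales a layer hyperparameter; e.g. a 64-channel
# convolution in the primary architecture would shrink accordingly:
new_channels = max(1, round(reduction * 64))
```

In the described arrangements the cell would be an LSTM whose parameters are the quantities being optimised, and a preserved layer would be emitted into the secondary architecture with its hyperparameters scaled by the reduction factor.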
[00043] The arrangements described construct the Gaussian process on the parameters of the LSTM, which are continuously valued. Additionally, the LSTM parameters are shared across the various layers of the primary network architecture. These two factors reduce the complexity of the Bayesian optimization, which remains invariant to the depth of the primary network architecture.
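A minimal sketch of this optimisation loop, using scikit-learn's GaussianProcessRegressor and a simple upper-confidence-bound acquisition over random candidates, is shown below. The evaluate() stub, the 4-dimensional parameter vectors and the candidate-sampling scheme are all assumptions for illustration; in the described arrangements the inputs would be the continuous LSTM parameters and the outputs the measured performance of the generated secondary architectures.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)

def evaluate(theta):
    # Stand-in for: build the controller from theta, generate a secondary
    # architecture, then train/evaluate it under the user constraints.
    return -float(np.sum(theta ** 2))

# Observed RNN parameter vectors and their measured performance.
X = rng.normal(size=(5, 4))
y = np.array([evaluate(t) for t in X])

gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# Pick the next parameter vector by maximising an acquisition function
# over randomly drawn candidates.
candidates = rng.normal(size=(256, 4))
mu, sigma = gp.predict(candidates, return_std=True)
ucb = mu + 1.96 * sigma                 # upper confidence bound
next_theta = candidates[np.argmax(ucb)]
```

Because the Gaussian process is built over the (continuous, shared) controller parameters rather than over discrete architecture choices, the dimension of this search space does not grow with the depth of the primary architecture.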
[00044] Figs. 10A and 10B depict a general-purpose computer system 1000, upon which the various arrangements described can be practiced.
[00045] As seen in Fig. 10A, the computer system 1000 includes: a computer module 1001; input devices such as a keyboard 1002, a mouse pointer device 1003, a scanner 1026, a camera 1027, and a microphone 1080; and output devices including a printer 1015, a display device 1014 and loudspeakers 1017. An external Modulator-Demodulator (Modem) transceiver device 1016 may be used by the computer module 1001 for communicating to and from a communications network 1020 via a connection 1021. The communications network 1020 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 1021 is a telephone line, the modem 1016 may be a traditional "dial-up" modem. Alternatively, where the connection 1021 is a high capacity (e.g., cable) connection, the modem 1016 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1020.
[00046] The computer module 1001 typically includes at least one processor unit 1005, and a memory unit 1006. For example, the memory unit 1006 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1001 also includes a number of input/output (I/O) interfaces including: an audio-video interface 1007 that couples to the video display 1014, loudspeakers 1017 and microphone 1080; an I/O interface 1013 that couples to the keyboard 1002, mouse 1003, scanner 1026, camera 1027 and optionally a joystick or other human interface device (not illustrated); and an interface 1008 for the external modem 1016 and printer 1015. In some implementations, the modem 1016 may be incorporated within the computer module 1001, for example within the interface 1008. The computer module 1001 also has a local network interface 1011, which permits coupling of the computer system 1000 via a connection 1023 to a local-area communications network 1022, known as a Local Area Network (LAN). As illustrated in Fig. 10A, the local communications network 1022 may also couple to the wide network 1020 via a connection 1024, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 1011 may comprise an Ethernet circuit card, a Bluetooth® wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1011.
[00047] The I/O interfaces 1008 and 1013 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1009 are provided and typically include a hard disk drive (HDD) 1010. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1012 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1000.
[00048] The components 1005 to 1013 of the computer module 1001 typically communicate via an interconnected bus 1004 and in a manner that results in a conventional mode of operation of the computer system 1000 known to those in the relevant art. For example, the processor 1005 is coupled to the system bus 1004 using a connection 1018. Likewise, the memory 1006 and optical disk drive 1012 are coupled to the system bus 1004 by connections 1019. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.
[00049] The methods described may be implemented using the computer system 1000 wherein the processes of Figs. 1-9, to be described, may be implemented as one or more software application programs 1033 executable within the computer system 1000. In particular, the steps of the methods of Figs. 1 to 9 are effected by instructions 1031 (see Fig. 10B) in the software 1033 that are carried out within the computer system 1000. The software instructions 1031 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules performs the described methods and a second part and the corresponding code modules manage a user interface between the first part and the user.
[00050] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1000 from the computer readable medium, and then executed by the computer system 1000. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 1000 preferably effects an advantageous apparatus for implementing the methods described of generating a neural network architecture.
[00051] The software 1033 is typically stored in the HDD 1010 or the memory 1006. The software is loaded into the computer system 1000 from a computer readable medium, and executed by the computer system 1000. Thus, for example, the software 1033 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1025 that is read by the optical disk drive 1012. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 1000 preferably effects an apparatus for implementing the methods described of generating a neural network architecture.
[00052] In some instances, the application programs 1033 may be supplied to the user encoded on one or more CD-ROMs 1025 and read via the corresponding drive 1012, or alternatively may be read by the user from the networks 1020 or 1022. Still further, the software can also be loaded into the computer system 1000 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1000 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1001. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1001 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
[00053] The second part of the application programs 1033 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1014. Through manipulation of typically the keyboard 1002 and the mouse 1003, a user of the computer system 1000 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1017 and user voice commands input via the microphone 1080.
[00054] Fig. 10B is a detailed schematic block diagram of the processor 1005 and a "memory" 1034. The memory 1034 represents a logical aggregation of all the memory modules (including the HDD 1009 and semiconductor memory 1006) that can be accessed by the computer module 1001 in Fig. 10A.
[00055] When the computer module 1001 is initially powered up, a power-on self-test (POST) program 1050 executes. The POST program 1050 is typically stored in a ROM 1049 of the semiconductor memory 1006 of Fig. 10A. A hardware device such as the ROM 1049 storing software is sometimes referred to as firmware. The POST program 1050 examines hardware within the computer module 1001 to ensure proper functioning and typically checks the processor 1005, the memory 1034 (1009, 1006), and a basic input-output systems software (BIOS) module 1051, also typically stored in the ROM 1049, for correct operation. Once the POST program 1050 has run successfully, the BIOS 1051 activates the hard disk drive 1010 of Fig. 10A. Activation of the hard disk drive 1010 causes a bootstrap loader program 1052 that is resident on the hard disk drive 1010 to execute via the processor 1005. This loads an operating system 1053 into the RAM memory 1006, upon which the operating system 1053 commences operation. The operating system 1053 is a system level application, executable by the processor 1005, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.
[00056] The operating system 1053 manages the memory 1034 (1009, 1006) to ensure that each process or application running on the computer module 1001 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1000 of Fig. 10A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 1034 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 1000 and how such is used.
[00057] As shown in Fig. 10B, the processor 1005 includes a number of functional modules including a control unit 1039, an arithmetic logic unit (ALU) 1040, and a local or internal memory 1048, sometimes called a cache memory. The cache memory 1048 typically includes a number of storage registers 1044 - 1046 in a register section. One or more internal busses 1041 functionally interconnect these functional modules. The processor 1005 typically also has one or more interfaces 1042 for communicating with external devices via the system bus 1004, using a connection 1018. The memory 1034 is coupled to the bus 1004 using a connection 1019.
[00058] The application program 1033 includes a sequence of instructions 1031 that may include conditional branch and loop instructions. The program 1033 may also include data 1032 which is used in execution of the program 1033. The instructions 1031 and the data 1032 are stored in memory locations 1028, 1029, 1030 and 1035, 1036, 1037, respectively. Depending upon the relative size of the instructions 1031 and the memory locations 1028-1030, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1030. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1028 and 1029.
[00059] In general, the processor 1005 is given a set of instructions which are executed therein. The processor 1005 waits for a subsequent input, to which the processor 1005 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1002, 1003, data received from an external source across one of the networks 1020, 1022, data retrieved from one of the storage devices 1006, 1009, or data retrieved from a storage medium 1025 inserted into the corresponding reader 1012, all depicted in Fig. 10A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1034.
[00060] The disclosed arrangements use input variables 1054, which are stored in the memory 1034 in corresponding memory locations 1055, 1056, 1057. The described arrangements produce output variables 1061, which are stored in the memory 1034 in corresponding memory locations 1062, 1063, 1064. Intermediate variables 1058 may be stored in memory locations 1059, 1060, 1066 and 1067.
[00061] Referring to the processor 1005 of Fig. 10B, the registers 1044, 1045, 1046, the arithmetic logic unit (ALU) 1040, and the control unit 1039 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1033. Each fetch, decode, and execute cycle comprises:
a fetch operation, which fetches or reads an instruction 1031 from a memory location 1028, 1029, 1030;
a decode operation in which the control unit 1039 determines which instruction has been fetched; and
an execute operation in which the control unit 1039 and/or the ALU 1040 execute the instruction.
[00062] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1039 stores or writes a value to a memory location 1032.
[00063] Each step or sub-process in the processes of Figs. 1 to 9 is associated with one or more segments of the program 1033 and is performed by the register section 1044, 1045, 1046, the ALU 1040, and the control unit 1039 in the processor 1005 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 1033.
[00064] The methods described may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub functions of Figs. 1 to 9. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.
[00065] Fig. 1 shows a method 100 of generating a secondary network architecture for performing a particular task. The method 100 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00066] Inputs to the method 100 are a primary neural network architecture 140 (also referred to as a primary architecture) and a Recurrent Neural Network 120. The Recurrent Neural Network 120 is an LSTM network in the example described. Additional inputs can include one or more of user constraints (objectives) 160, a range of K intermediate networks to be generated (not shown) and a number of iterations N for Bayesian optimization. The examples described use the user constraints 160. The inputs may be stored in a database in the memory 1006 and selected by the user. Alternatively, the inputs may be input or selected by the user using an interface executing on the display 1014 and input 1013 of the device 1001 or received from a remote device via the network connection 1021.
[00067] The method 100 starts at a training step 110. The training step 110 trains the primary architecture on the required task and performs an evaluation of the results of the trained primary architecture. Training and evaluation of the primary network architecture 140 on the particular task at 110 can be implemented using known techniques by inputting a training dataset, updating network parameters using stochastic gradient descent based methods and then evaluating performance. The training and validation datasets, not shown in Fig. 1, can be stored in the memory 1006 or input by the user of the device 1001.
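The training and evaluation of step 110 can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation; the function name "train_and_evaluate", the optimiser settings and the use of classification accuracy as the evaluation measure are illustrative choices, not part of the described arrangement.

```python
import torch
import torch.nn as nn


def train_and_evaluate(model, train_loader, val_loader, epochs=10, lr=0.01):
    """Train a network with stochastic gradient descent and return its
    validation accuracy, as performed at step 110."""
    criterion = nn.CrossEntropyLoss()
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for inputs, targets in train_loader:
            optimiser.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimiser.step()
    # Evaluate on the validation set.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            predictions = model(inputs).argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.numel()
    return correct / total
```

The same routine can be reused at steps 320 and 355 to train and evaluate the generated secondary architectures.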
[00068] The method 100 continues from step 110 to a sampling step 130. The sampling step 130 receives the primary architecture associated with a plurality of hyperparameters at the Recurrent Neural Network (LSTM) 120. Parameters of the LSTM are determined by sampling parameters associated with the primary architecture 140 using a trained Gaussian process at execution of step 130. Operation of step 130 is described in relation to a method 200 below.
[00069] The method 100 continues from step 130 to a generating step 150. The step 150 operates to generate a secondary neural network architecture, being a secondary RNN, using the parameters sampled at step 130 and the primary architecture encoding 140. The secondary neural network architecture (also referred to as a secondary or final architecture) is generated using a plurality of reduction factors associated with the determined plurality of parameters to generate a set of layers of the secondary neural network architecture. Each of the generated layers corresponds to a layer of the primary architecture 140. Operation of step 150 is described hereafter with respect to Fig. 9. The secondary neural network architecture is structured to perform the task. The method 100 ends upon execution of step 150.
[00070] The step of sampling LSTM parameters 130 is further described with reference to a method 200 shown in Fig. 2. The method 200 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00071] The method 200 starts with a training step 210. The Gaussian process is trained at step 210. Operation of the step 210 is described with reference to Fig. 3 hereafter. The method 200 continues from step 210 to an obtaining step 240. Step 240 executes to obtain the "best" numerical values of the LSTM parameters. The "best" values selected are the parameters that achieve the user constraints 160, or the parameters that give a closest (for example based upon a distance or error) result to the user constraints 160. Accordingly, the user parameters are determined based on the user constraints 160 if received. Alternatively, the "best" parameters may relate to parameters generated by the Gaussian process if constraints 160 have not been provided. The step 210 is now described in further detail.
[00072] Fig. 3 shows a method 300 of training a Gaussian process, as implemented at step 210. The method 300 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00073] The method 300 starts with an initialization step 310. Step 310 operates to randomly initialize K sets of LSTM parameters of the RNN 120. A method 400 as implemented at step 310 is now described using Fig. 4. The method 400 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00074] The method 400 starts at an initializing step 410. Step 410 executes such that K instances of the RNN 120 are randomly initialized within a user specified range for a user selected value of K. Alternatively, a default range can be used. A range of [-3, 3] is considered typical for the parameters of the RNN 120. The method 400 continues to a generating step 420. Step 420 generates K rows of flattened LSTM parameters using the initialised sets of step 410. Each row of flattened parameters corresponds to the parameters of an individual LSTM cell. The number of LSTM cells represented by K is also a user selected value, for example input via an interface. In an embodiment, K = 10 is used. The range used at step 410 and the value of K can be input by the user when inputting the constraints 160. The method 400 ends after implementation of step 420.
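The random initialization of steps 410 and 420 can be sketched as below. The function name and the row length of 120 are illustrative assumptions; in practice the row length is the number of weights in one flattened LSTM cell.

```python
import numpy as np


def initialise_lstm_parameter_rows(k=10, row_length=120, low=-3.0, high=3.0, seed=0):
    """Randomly initialise K rows of flattened LSTM cell parameters.

    Each row holds every weight of one LSTM cell flattened into a vector,
    drawn uniformly from the user specified range [low, high].
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(k, row_length))


# K = 10 rows within the typical range [-3, 3]:
rows = initialise_lstm_parameter_rows()
```

Each of the K rows later seeds one intermediate architecture at step 315.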
[00075] Returning to Fig. 3, the method 300 continues from step 310 to a generating step 315. Step 315 operates to generate a set of secondary neural network architectures, also referred to as intermediate neural network architectures. Step 315 implements K instances of generating a secondary architecture, as described hereafter in relation to Fig. 9. With the input of encodings of layers of the primary network architecture 140, K secondary network architectures are generated upon execution of step 315. The method 300 continues to an evaluating step 320. Step 320 operates to train and evaluate the K secondary architectures.
[00076] Operation of step 320 is now described with reference to a method 500 shown in Fig. 5. The method 500 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[00077] The method 500 receives a training dataset 510, a validation dataset 530, a set of secondary (intermediate) architectures 590 (generated by the step 315) and the user constraints 160 as inputs. The method 500 starts at a training step 520. The datasets 510 and 530 can be the datasets used at step 110. Each of the K secondary network architectures 590 is trained at step 520 for a number of epochs (2 epochs for example) using the training dataset 510. The number of epochs may be determined heuristically. After the step 520 the method 500 continues to a validation step 540. Step 540 inputs the validation dataset 530 to validate the trained secondary architecture generated at step 520.
[00078] Upon completion of the validation stage 540 the method 500 continues to a validation performance step 550. A performance (L_v) is measured against the validation dataset 530 at step 550.
[00079] The method 500 continues from step 550 to a determining step 570. Step 570 comprises measuring overall performance of the secondary (intermediate) neural network architectures. The measurement can be based on the user constraints 160. For every secondary network architecture 590, additional quantities (L_i) are measured depending on the user specified objectives (constraints) 160 such as total number of parameters, total number of layers, total FLOP count (count of floating point operations), total memory requirement and the like. At step 570, an overall performance measure (L_O) is determined for every secondary network architecture using Equation (1).

L_O = 0.6 L_v + (0.4/n) Σ_{i=1}^{n} L_i    (1)

[00080] In Equation (1) n is the number of additional performance measures (relating to the user constraints) apart from the validation performance to be measured for an individual secondary network architecture. As shown in Equation (1), the overall performance measure is based upon the measured performance and the additional quantities, and in particular is affected by the number of additional performance measures. Increasing numbers of performance measures can operate to reduce the overall performance measure. In one implementation the total number of parameters and total number of layers are measured. An alternative implementation measures FLOP count as a performance measurement. Another embodiment could relate to total memory requirement or any differentiable or non-differentiable user specified objective. This overall performance function handles all the objectives in an efficient way and provides a single performance measure L_O. The method 500 ends after execution of step 570.
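A minimal sketch of the overall performance measure follows. The 0.6/0.4 split and the (0.4/n) averaging over the additional measures are a reconstruction of Equation (1) from the surrounding description (the original equation is partially garbled in the source), and the function name is an illustrative assumption; each L_i is assumed to be a normalised score.

```python
def overall_performance(validation_performance, additional_measures):
    """Combine validation performance L_v with n additional measures L_i.

    Implements L_O = 0.6 * L_v + (0.4 / n) * sum(L_i), so each extra
    user objective contributes a smaller share as more objectives are added.
    """
    n = len(additional_measures)
    if n == 0:
        return 0.6 * validation_performance
    return 0.6 * validation_performance + (0.4 / n) * sum(additional_measures)
```

For example, with a validation performance of 0.9 and two additional normalised measures 0.5 and 0.7, the overall measure is 0.6 x 0.9 + 0.2 x 1.2 = 0.78.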
[00081] Returning to Fig. 3, once K secondary architectures are trained and overall performance measures are obtained for all the architectures at step 320, the method 300 continues to a fitting step 330. The Gaussian process is fitted at 330 using input of the K rows of flattened LSTM parameters and K overall performance measures. The Gaussian process can be implemented using tools such as Python libraries and the like. The Gaussian process represents a function that maps the LSTM parameters to the overall performance measure, in order to select parameters most suitable for the secondary network on the basis of the user constraints.
[00082] Fig. 6 shows a method 600 of fitting a Gaussian process as implemented at step 330. The method 600 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 600 receives as inputs K sets of LSTM parameters 605 (generated at step 420) and a performance measure 670 as generated at step 570.
[00083] The first step of the fitting Gaussian method 600 is an assuming step 610. A kernel is assumed or estimated at step 610. Since the input to the method 600 is continuous valued and the overall performance, i.e. the output, is typically stochastic and noisy, an additive mixture of three kernels, being a Matern kernel, a white kernel and a constant kernel, is assumed at step 610. The kernel estimation function is described in Equation (2).

Kernel = Matern + WhiteKernel + ConstantKernel    (2)
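The kernel of Equation (2) and the subsequent Gaussian process fit can be sketched with scikit-learn, which is one possible library for this purpose (the source does not name a specific library); the hyperparameter values and the toy data shapes below are illustrative. Fitting the regressor maximises the log marginal likelihood, as at steps 620-630.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Additive mixture of the three kernels assumed at step 610 (Equation (2)).
kernel = Matern(length_scale=1.0, nu=2.5) + WhiteKernel(noise_level=1.0) + ConstantKernel(1.0)

# Fitting maximises the log marginal likelihood of the observed
# (flattened LSTM parameters, overall performance measure) pairs.
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, normalize_y=True)

parameter_rows = np.random.uniform(-3, 3, size=(10, 8))   # K = 10 toy rows
performances = np.random.rand(10)                          # toy overall measures
gp.fit(parameter_rows, performances)
mean, std = gp.predict(parameter_rows, return_std=True)
```

The fitted regressor then supplies the predicted mean and standard deviation used by the acquisition function of method 700.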
[00084] Equation (2) provides a particular embodiment for estimating the kernel. Alternative embodiments will depend on the user constraints and user specified alternative kernel functions such as a squared exponential kernel, rational quadratic kernel, periodic kernel, exponential kernel or custom kernel function and the like. Another alternative embodiment could include an additive combination (K = Σ_i k_i), a multiplicative combination (K = Π_i k_i) or a combination of both (K = (Σ_i k_i) Π_j k_j) of individual kernel types from the above choices. Once the choice of kernel function is finalized, the method 600 continues to an initializing step 620.
[00085] Step 620 executes to initialize the hyperparameters of LSTM parameters using the kernel function of step 610 and construct a log of marginal likelihood of fitting using the input and resultant output samples. Step 620 is executed for each of the K sets of LSTM cells (parameters) 605.
[00086] The method 600 continues from step 620 to a maximising step 630. In step 630 the log marginal likelihood is maximized (equivalently, the negative log marginal likelihood is minimized) to obtain hyperparameters within a specified range provided by the user to provide the "best" fit of the LSTM parameters 605 and the overall performance measure 670 of the generated secondary network architectures. The step 630 is executed using each of the K sets of parameters 605. The method 600 continues from step 630 to a generating step 640. The method 600 generates an updated Gaussian process in execution of the step 640 using the results of step 630. Step 640 operates to generate a single Gaussian process based on the K sets of parameters 605. The method 600 outputs the single Gaussian process at step 640 and ends.
[00087] Returning to Fig. 3, once the Gaussian process is constructed and updated based on the K initialized input output pairs at step 330, the Gaussian process may be used in step 240 of Fig. 2. In the implementation described in Fig. 3, Bayesian optimization is used to further update the Gaussian process. Bayesian optimization is carried out in several steps as discussed below.
[00088] In the example of Fig. 3, the method 300 continues from step 330 to a check step 335. The Bayesian optimization can be carried out in a number of iterations N. The number of iterations N can be input by a user or set to a default number based upon previous results or experimentation. When the method 300 proceeds from step 330 to step 335 the current iteration number is set to zero (0).
[00089] The check step 335 checks if the number of iterations is equal to N. If not, "No" at step 335, the current iteration is incremented, the method 300 continues to a constructing step 351 and the Bayesian optimization commences. As a first step to the Bayesian optimization, an acquisition function (Acq(x)) is constructed at step 351 which acts as a surrogate to the actual overall performance measure determined at 570.
[00090] The step 351 can be implemented as the method 700 shown in Fig. 7. The method 700 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 700 receives an updated Gaussian process 740 and corresponding sampled LSTM parameters 710 as input. The Gaussian Process 740 and the sample LSTM parameters 710 are selected based on operation of step 330 or 356 depending on the current iteration. The acquisition function (Acq(x)) chosen for one implementation is an expected improvement (EI) criterion (α_EI) which can be constructed according to Equation (3).
α_EI(x; θ, D) = ∫ max(y_best − y, 0) p(y | x; θ, D) dy    (3)

α_EI(x; θ, D) = σ(x; θ, D) (γ(x) Φ(γ(x)) + φ(γ(x)))

where γ(x) = (y_best − μ(x; θ, D)) / σ(x; θ, D), Φ denotes the standard normal cdf and φ denotes the standard normal pdf.
[00091] In Equation (3) D represents the distance function for the used kernel (from implementation of step 330 or step 356 depending on the iteration N), and θ represents the hyperparameters of the kernel function used. For any new sample (x) of the flattened LSTM parameters 710, the updated Gaussian process 740 returns a predicted mean (μ) and standard deviation (σ) of the overall performance measure (y) (determined at step 320 or 355 depending on the current iteration). The method 700 starts at a determining step 720. Step 720 determines the predicted mean (μ) and standard deviation (σ) of the parameters 710 using the Gaussian process 740.
[00092] The method 700 continues from step 720 to a determining step 730. Based on the prediction determined at step 720, the expected improvement criterion is determined at step 730. The expected improvement criterion is determined by following Equation (3) above, where Φ denotes the cumulative distribution function and φ denotes the probability density function. Alternative embodiments for the acquisition function include probability of improvement, upper confidence bound, information theoretic approaches and the like. In the example of Equation (3), the expected improvement criterion is a function of the predicted mean (μ), a best result y_best, and standard deviation (σ). The method 700 outputs the acquisition function at step 730 and ends.
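The expected improvement computation of steps 720 and 730 can be sketched as below. The function name is an illustrative assumption; any model whose predict() returns a mean and standard deviation can play the role of the Gaussian process 740. The form shown follows the common convention of improvement over y_best when minimising the measure; the sign of (y_best − mean) is flipped for maximisation.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(x, gp, y_best):
    """Expected improvement criterion of Equation (3).

    x is a flattened LSTM parameter sample, gp a fitted surrogate model
    and y_best the best overall performance measure observed so far.
    """
    mean, std = gp.predict(np.atleast_2d(x), return_std=True)
    std = np.maximum(std, 1e-9)               # guard against zero variance
    gamma = (y_best - mean) / std
    return std * (gamma * norm.cdf(gamma) + norm.pdf(gamma))
```

The returned value is always non-negative and grows where the surrogate predicts either a good mean or high uncertainty.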
[00093] Returning to the method 300, the next step after step 351 is step 352 to optimize the acquisition function (Acq(x)). Step 352, operates to optimize the acquisition function based on Equation (4).
x* = argmax_{x* ∈ X_p*} Acq(x*), X_p* = {argmax_{x ∈ X_p} Acq(x)}    (4)
[00094] Step 352 can be implemented by a method 800 shown in Fig. 8. The method 800 is typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005. The method 800 receives an acquisition function 805 constructed at step 351 and a Gaussian Process 850 (generated at step 330 if the current iteration is one (1), otherwise generated at step 356) and outputs an adjusted acquisition function.
[00095] The method 800 starts at a step 810. In step 810, p samples (rows of flattened LSTM parameters) are randomly generated within the same user specified range as used at step 410 (Xp).
[00096] The method 800 continues at a maximising step 820. The acquisition function is optimized or adjusted at step 820 using the Gaussian Process 850. Step 820 starts from each of the p samples to avoid local minima. Execution of step 820 results in p new samples (Xp*) of LSTM parameters and their corresponding values of acquisition functions. The method 800 continues from step 820 to a step 840. Step 840 selects a "best" sample based on the maximized acquisition function value.
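The multi-start maximisation of steps 810 to 840 can be sketched as follows. The function name, the default of p = 20 starts and the use of scipy's bounded minimiser are illustrative assumptions; the essential structure — random starts within the user specified range, a local maximisation from each, and selection of the best result — mirrors the method 800.

```python
import numpy as np
from scipy.optimize import minimize


def optimise_acquisition(acq, dim, p=20, low=-3.0, high=3.0, seed=0):
    """Maximise an acquisition function from p random starting points.

    p random samples are drawn within the user specified range (step 810),
    a local maximisation is run from each to avoid local optima (step 820),
    and the best resulting sample is selected (step 840).
    """
    rng = np.random.default_rng(seed)
    starts = rng.uniform(low, high, size=(p, dim))
    best_x, best_val = None, -np.inf
    for x0 in starts:
        # scipy minimises, so the acquisition is negated to maximise it.
        result = minimize(lambda x: -acq(x), x0, bounds=[(low, high)] * dim)
        if -result.fun > best_val:
            best_x, best_val = result.x, -result.fun
    return best_x, best_val


# Toy acquisition with a single maximum at x = [1, 1]:
best_x, best_val = optimise_acquisition(lambda x: -np.sum((x - 1.0) ** 2), dim=2)
```

In the described arrangement, acq would be the expected improvement criterion of Equation (3) and dim the length of one flattened LSTM parameter row.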
[00097] Returning to Fig. 3, the method 300 continues from step 352 to an obtaining step or selecting step 353. In step 353 a next sample (x*) is selected using the acquisition function
adjusted in step 352.
[00098] The method 300 continues from step 353 to a generating step 354. Using the sample of LSTM parameters selected at step 353 and the encoding of hyperparameters of every layer of the primary network architecture 140, a new intermediate (secondary) network architecture is generated at step 354. Step 354 operates in a similar manner to steps 150 and 315. The neural network architectures generated at steps 315 and 354 are referred to as "intermediate" architectures as in some instances the final architecture is generated at step 150.
[00099] Referring to Fig. 3, the method 300 continues from step 354 to 355. In step 355, the new secondary architecture is trained and evaluated on the particular task in the same manner as the method 500. From step 355 the method 300 continues to a step 356. The Gaussian process generated in the last round of iteration (or at step 330 if the current iteration is one) is updated in execution of step 356. The method 300 continues from step 356 to step 335. Step 335 determines if the number of iterations N has been reached. If not ("No" at step 335), the current iteration is incremented and the iterative loop from step 335 to step 356 is repeated. If the number of iterations has been reached ("Yes" at step 335), the method 300 continues to a selecting step 340.
[000100] Therefore, in the example of Fig. 3, all the processes after the first fitting of Gaussian process on the initialized K samples are repeated N times (where N is a user specified number) to update the Gaussian process using Bayesian optimization.
[000101] The step 340 selects the trained Gaussian process after finishing N iterations of Bayesian optimization. The architecture which has the highest value for the overall performance measure (as determined at step 330 or step 356) is selected as the final architecture. The method 300 ends after execution of step 340.
[000102] As shown in relation to steps 315 to 330 and steps 354 to 356, the sampled LSTM parameters are determined using a plurality of intermediate network architectures associated with the primary network architecture 140. Each of steps 330 and 356 operates to fit the Gaussian process to a parameter of a plurality of intermediate neural network architectures.
[000103] The step of generating a secondary architecture using an LSTM network and a teacher network is now described in detail. The generation of the secondary network architecture from the primary network architecture and a single row of flattened LSTM parameters follows several steps, as implemented at step 150, and is described with reference to a dataflow 900 shown in Fig. 9. Steps of the dataflow 900 are typically implemented as one or more modules of the application 1033, stored in the memory 1006 and controlled under execution of the processor 1005.
[000104] The dataflow 900 receives the architecture 140 as input. The primary network architecture 140 has several layers. Three of the layers of the architecture 140 are shown as 910a, 910b and 910c in Fig. 9. The input to an LSTM cell 930 at each time step is the layer encoding, which comprises a set of 4 values, being kernel size, stride, number of output channels and padding respectively. Kernels, stride and padding are considered to be symmetric and two dimensional in the arrangements described. Symmetric layer encoding differentiates various layers in an efficient way. For example, the encoding for layer 910a is 920a, which shows that the first layer has kernel size 3 in both dimensions with stride 1 in both dimensions, number of output channels 64 and padding 1 in both dimensions. The encoding 920a suggests that 910a is a convolution layer. In contrast, the next layer 910b has kernel size 2 in both dimensions with stride 1 in both dimensions while the number of output channels and padding are zeros. The encoding 920b suggests that the layer 910b is a pooling layer. Similarly, encoding 920c suggests that the layer 910c is a convolution layer. The modification configuration used in Fig. 9 is a single-layer modification. The encoding of individual layers acts as the input to the corresponding LSTM cell, for example input to 930 for the layer 910a, at each time step. Each LSTM is bi-directional, as indicated by arrows 940, to model the inter-layer relationship. The parameters are shared across each time step to make the complexity of the Gaussian process invariant to the depth or number of layers of the primary network architecture.
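The modification configuration of Fig. 9 can be sketched as follows, assuming a PyTorch implementation. The class name and the hidden size of 32 are illustrative assumptions; what the sketch preserves is the structure described above — a bidirectional LSTM over 4-value layer encodings, a linear layer, and a sigmoid producing 6 values per layer.

```python
import torch
import torch.nn as nn


class ModificationNetwork(nn.Module):
    """Bidirectional LSTM over per-layer encodings, as sketched in Fig. 9.

    Each time step consumes one 4-value layer encoding (kernel size, stride,
    number of output channels, padding) and emits 6 sigmoid values: four
    reduction factors, an existence flag and a skip-connection flag.
    """

    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden_size,
                            bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * hidden_size, 6)

    def forward(self, layer_encodings):
        # layer_encodings: (batch, num_layers, 4)
        outputs, _ = self.lstm(layer_encodings)
        return torch.sigmoid(self.linear(outputs))   # (batch, num_layers, 6)


encodings = torch.tensor([[[3., 1., 64., 1.],    # convolution layer (920a)
                           [2., 1., 0., 0.]]])   # pooling layer (920b)
controls = ModificationNetwork()(encodings)       # one 6-vector per layer
```

Because the LSTM parameters are shared across time steps, the same cell processes every layer encoding regardless of network depth, as noted above.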
[000105] The example shown shares the LSTM parameters irrespective of the layer type. An alternative arrangement would be to share the LSTM parameters only within layers of similar types, such as one LSTM cell shared only within convolutional layers and a different LSTM cell shared only within pooling layers and so on. The output of the LSTM cell 930 is input to a linear layer 950 of the modification configuration. The output of the linear layer 950 is input to a sigmoid activation layer 960. The sigmoid activation layer produces an output 970 consisting of 6 float values ranging between 0 and 1. The semantic meaning of the output 970, generated using, and thereby associated with, the sampled LSTM parameters, is as follows:
• the first 4 values represent reduction factors corresponding to the input encoding,
• the fifth value represents a flag value for existence, and
• the sixth value represents a flag to decide whether a skip connection will be introduced or not.
[000106] The order of the reduction factors and the existence and skip values can be varied in some implementations. Based on the output 970, the corresponding input encoding is modified to generate a corresponding layer 980 of the secondary network architecture. Certain design constraints are imposed. Examples of design constraints include: the maximum and minimum values of output channels are considered to be 512 and 16 respectively; the minimum value of kernel size for max-pooling is considered to be 2x2; subsequent layers cannot be of the same type except for convolution; activation layers cannot exist immediately after pooling layers; batchnorm layers cannot exist immediately after pooling and activation layers; the minimum value of kernel size for convolution is 1x1; and kernel size for convolution is forced to be odd numbers such as 3x3, 5x5 and the like. The design constraints used depend on the structure of the layers of the LSTM network.
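A few of the numeric design constraints above can be sketched as a clamping step. The function name and the convention of rounding an even convolution kernel up to the next odd size are illustrative assumptions; the source states the constraints but not how violations are resolved.

```python
def apply_design_constraints(kernel, stride, channels, padding, layer_type):
    """Clamp proposed layer hyperparameters to the stated design constraints.

    Convolution output channels are clamped to [16, 512] and convolution
    kernels are forced to odd sizes of at least 1 (3x3, 5x5 and the like);
    max-pooling kernels are at least 2x2.
    """
    if layer_type == "conv":
        channels = min(max(int(channels), 16), 512)
        kernel = max(int(kernel), 1)
        if kernel % 2 == 0:           # force odd kernel sizes (assumption: round up)
            kernel += 1
    elif layer_type == "pool":
        kernel = max(int(kernel), 2)  # minimum 2x2 max-pooling kernel
    return kernel, stride, channels, padding
```

For example, a proposed 4x4 convolution with 1000 output channels would be adjusted to a 5x5 convolution with 512 channels.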
[000107] Each corresponding layer (such as 980) generated is used to generate the secondary architecture at step 990. The steps followed to construct the secondary network architecture at step 990 are:
(a) The first layer of the primary architecture is kept same in the secondary architecture.
(b) Linear layers or feedforward layers in the context of the secondary network architecture are deleted (due to having too many parameters) except for the last feedforward layer.
(c) Based on the outputs of the LSTM network the second existing convolution layer is identified which is not a part of an existing skip connection in the primary network architecture. New skip connections are only introduced on the following convolution layer(s) based on the value of the last flag and if the following convolution layer is not a part of an existing skip connection in the primary network architecture. New skip connections are limited to subsequent layers only, that is, skip connections cover only one convolution layer. Two types of skip connections can exist in the current framework: (i) where the input is added to the immediate output and (ii) where the input is appended to the immediate output along the channel dimension. The type of new skip connection is chosen with equal probability.
(d) Changing the layer hyperparameters based on the output of the LSTM network follows the rules below:
1. The output size of the previous layer is identified first using a single input tensor of the same dimension as the training dataset tensors.
2. The dimensions of kernel size, stride, output channels and padding for a layer in the secondary architecture are computed based on multiplying the corresponding dimensions in the primary network architecture by the reduction factors. The product is rounded off to integers.
3. The dimensions of the above hyperparameters for each layer are again modified to match the input-output size consistency. Additionally, for a particular layer, based on the type of the skip connection introduced (if any), the dimensions of the layer's hyperparameters are again adjusted. The modification can relate to a multiplication operation or adjustment of number of channels to be compatible with the skip connection.
(e) For an existing skip connection in the primary network architecture, the type of the connection is maintained in the secondary architecture and the layers within the connection are modified or deleted based on the rules above. However, no new skip connections are introduced for any of these layers in this case.
(f) Depending on the requirements of the task, additional downsampling or upsampling layers can be added at the end of the secondary network architecture. Downsampling or upsampling layers can occur, for example, when generating a part network which is connected to another fixed network. In such conditions, requirements on output size can exist, resulting in a requirement for additional downsampling or upsampling layers. An example is generating a head network part for a single shot detection (SSD) network which has a head network part and a detector part.
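The dimension scaling of rule (d)2 above can be sketched as follows. The function name is an illustrative assumption; the subsequent size-consistency and design-constraint adjustments of rule (d)3 are applied separately.

```python
def reduce_layer(primary_dims, reduction_factors):
    """Scale a primary layer's dimensions by the sampled reduction factors.

    Each of kernel size, stride, output channels and padding in the primary
    architecture is multiplied by its reduction factor and rounded off to an
    integer, per rule (d)2.
    """
    return [round(dim * factor)
            for dim, factor in zip(primary_dims, reduction_factors)]


# A 3x3 convolution with stride 1, 64 channels and padding 1, with the
# channel count halved by its reduction factor:
reduce_layer([3, 1, 64, 1], [0.9, 1.0, 0.5, 1.0])
```

In this example the kernel size 3 x 0.9 = 2.7 rounds back to 3 while the 64 channels are reduced to 32.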
[000108] Following the above design rules, the input-output size consistency among all the generated layers is maintained, and this generates the secondary network architecture 990. Accordingly, the number of layers of the architecture generated at step 990 can be reduced.
[000109] The arrangements described are applicable to the computer and data processing industries and particularly for the machine learning industries.
[000110] The arrangements described allow a Gaussian function to be used in such a manner to improve generation of a secondary neural network architecture by reducing computational complexity in developing the neural network architecture. The "flattened" structure of the architectures generated in steps 150, 315 and 354 further decrease complexity. At the same time, a range of variation between the primary neural network architecture and the final generated neural network architecture can be increased compared to traditional, discrete solutions. Ability of the user to set constraints and the constraints to be accounted improves adaptability of the generated neural network for different practical implementations. The Bayesian optimization further operates to increase likelihood the generated neural network architecture is suitable for performing the required task.
[000111] For example, a user may want to generate a neural network architecture for implementing a particular task such as classification of an object or detection of an object in a scene. The generated neural network architecture may be intended to be deployed on a particular type of device such as a mobile device. The user can implement the method 100 providing constraints associated with operation on a mobile device such as reduced memory storage and/or reduced floating point operations. The method 100 is implemented and the resultant neural network architecture generated is suitable for implementation on the deployment device (for example transmitted to a deployment device via the network 1020).
[000112] An example of automated architecture generation using the invention is described. The problem is chosen as the CIFAR-10 classification task. The CIFAR-10 problem has 10 classes. Two different primary architectures are considered, being VGG16 with one fully connected layer and DenseNet-121. The CIFAR-10 problem has 50000 training data and 10000 test data. In the context of the arrangements described, the training data has been further divided into 40000 training data and 10000 validation data. A secondary architecture is generated for the CIFAR-10 task using the method 100.
[000113] Existing methods typically focus on increasing performance of the generated architectures only. Therefore, the generated architectures may not satisfy other constraints or objectives such as reducing the total number of parameters, reducing the total FLOP count or reducing the total memory requirement.
[000114] The arrangements described are able to generate a secondary network architecture starting from VGG16 which has 66% fewer parameters while sacrificing only 0.7% accuracy compared to VGG16, whereas in the second case the generated secondary network architecture had 72% fewer parameters with a sacrifice of 4% in accuracy compared to DenseNet-121.
[000115] A second example of use is to perform detection on Pascal VOC 2007. The dataset contains 20 classes excluding the background. There are a total of 5,011 training and validation samples and 4,952 test samples. The training and validation data has been divided into 4,760 training samples and 251 validation samples. The primary network architecture is considered to be SSD512 with VGG19. A secondary architecture is generated using the method 100. The resultant secondary architecture had almost a 30% parameter reduction with a 1-2% sacrifice in performance, in terms of mean average precision, compared to the primary architecture.
[000116] The proposed method is also useful in applications where a secondary network is required to have a relatively low number of parameters in order to meet the requirements of a lower-resourced implementation platform such as a mobile device. The methods described also allow choosing other parameters, such as the number of FLOPs, for optimizing the secondary network architecture. Optimizing the number of FLOPs allows the resultant neural network architecture to be used in applications where execution time is an important consideration, such as online object detection.
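A constraint-aware score of the kind used to rank candidate architectures (cf. the overall performance measure at step 550 of Fig. 5) could take the following form. The specification does not give a closed form; the weighted-penalty shape and the alpha/beta weights below are illustrative assumptions.

```python
def overall_performance(accuracy, n_params, n_flops, max_params, max_flops,
                        alpha=0.5, beta=0.5):
    """Combine validation accuracy with user constraints on parameter count
    and FLOPs into a single score to maximise. Candidates within budget are
    ranked purely on accuracy; over-budget candidates are penalised in
    proportion to how far they exceed the budget."""
    param_penalty = max(0.0, n_params / max_params - 1.0)
    flop_penalty = max(0.0, n_flops / max_flops - 1.0)
    return accuracy - alpha * param_penalty - beta * flop_penalty
```

Under this sketch, a candidate twice over the parameter budget loses alpha from its score, steering the optimization toward compact architectures suitable for, e.g., mobile deployment.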
[000117] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
[000118] In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises" have correspondingly varied meanings.

Claims:
1. A method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
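A minimal sketch of the reduction-factor step recited above (with the linear and sigmoid functions of claim 11 and the shared cell of Fig. 9) is given below. The array shapes and the variable h, which stands in for the shared RNN (LSTM) cell's per-layer hidden states, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_secondary_layers(primary_channels, W, b, h):
    """For each primary layer, map the shared cell's hidden state through a
    linear head and a sigmoid to a reduction factor in (0, 1), then scale
    that layer's channel count to produce the corresponding secondary layer."""
    secondary = []
    for channels, h_t in zip(primary_channels, h):
        factor = sigmoid(W @ h_t + b)          # reduction factor in (0, 1)
        secondary.append(max(1, int(round(channels * float(factor)))))
    return secondary
```

With zero weights the sigmoid yields 0.5, so a [64, 128] primary would map to a [32, 64] secondary; trained weights would instead produce layer-specific factors.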
2. The method according to claim 1, wherein the plurality of parameters of the RNN are determined based on one or more user constraints associated with completing the task.
3. The method according to claim 2, wherein the user constraints relate to one or more of number of multiplications, number of floating point operations, number of parameters, memory storage requirement, memory access requirement, and parameters associated with a restricted capability device.
4. The method according to claim 1, wherein the plurality of parameters associated with the primary architecture are determined using a plurality of intermediate neural network architectures associated with the primary architecture.
5. The method according to claim 1, wherein the Gaussian process is fitted to parameters of a plurality of intermediate neural network architectures.
6. The method according to claim 1, wherein the Gaussian process is determined based on a Bayesian optimisation.
7. The method according to claim 4, further comprising measuring performance of the plurality of intermediate neural network architectures based on one or more user constraints associated with completing the task.
8. The method according to claim 1, wherein the layers of the secondary architecture are generated using a skip flag and a layer preservation flag associated with the determined plurality of parameters of the RNN.
9. The method according to claim 1, wherein the layers of the secondary architecture are generated based on a modification configuration comprising a linear layer.
10. The method according to claim 9, wherein the layers of the secondary architecture are generated based on a modification configuration comprising a bi-directional LSTM for each layer of the received primary architecture.
11. The method according to claim 1, wherein the plurality of reduction factors are determined using a linear function and a sigmoid function.
12. The method according to claim 1, wherein the secondary neural network is generated by sharing the determined parameters of the RNN irrespective of layer type.
13. The method according to claim 1, wherein the secondary neural network is generated by sharing the determined parameters of the RNN only within layers of the same type.
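The difference between claims 12 and 13 (sharing the sampled RNN parameters across all layers, versus only within layers of the same type) can be made concrete with a toy selector; the dictionary keys used here are illustrative only.

```python
def select_cell_parameters(layer_type, cells, share_across_types=True):
    """Return the RNN (LSTM) cell parameters used for a layer. One shared
    set serves every layer irrespective of type (claim 12); otherwise each
    layer type (e.g. 'conv', 'fc') has its own set (claim 13)."""
    if share_across_types:
        return cells["shared"]
    return cells[layer_type]
```

For example, a convolutional and a fully connected layer would receive identical cell parameters under claim 12 but distinct parameters under claim 13.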
14. The method according to claim 1, wherein the RNN is a Long Short Term Memory (LSTM) network.
15. The method according to claim 1, wherein the Gaussian process is constructed using parameters of the RNN which are continuously valued.
16. A non-transitory computer readable medium having a computer program stored thereon to implement a method of generating a neural network architecture for performing a task, the program comprising:
code for receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
code for determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
code for generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
17. Apparatus, comprising:
a memory; and
a processor configured to execute code stored on the memory to implement a method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
18. A system, comprising:
a memory; and
a processor, wherein the processor is configured to execute code stored on the memory for implementing a method of generating a neural network architecture for performing a task, the method comprising:
receiving, at a Recurrent Neural Network (RNN), a primary architecture associated with a first plurality of hyperparameters;
determining a plurality of parameters of the RNN by sampling parameters associated with the primary architecture using a Gaussian process; and
generating a secondary neural network architecture using a plurality of reduction factors associated with the determined plurality of parameters of the RNN to generate a set of layers of the secondary architecture, each of the generated layers corresponding to a layer of the primary architecture.
Canon Kabushiki Kaisha
Patent Attorneys for the Applicant/Nominated Person
SPRUSON & FERGUSON
[Drawings: sheets 1/11 to 11/11, dated 13 Mar 2019, application 2019201716.]

Fig. 1: flowchart of the method 100. Train and evaluate the primary architecture (110) from the primary architecture encoding (140); sample parameters of the RNN (LSTM) from a trained Gaussian process (130), subject to user constraint(s) (160); generate the secondary architecture (150); end (199).

Fig. 2: flowchart of the method 200, implementing step 130. Train the Gaussian process (210); obtain a sample of RNN (LSTM) parameters (240), subject to user constraint(s) (160); end (299).

Fig. 3: flowchart of the method 300, implementing step 210. Randomly initialise K sets of RNN (LSTM) parameters (310); generate K secondary architectures (315) from the primary architecture encoding (140); evaluate the K secondary architectures (320); fit a Gaussian process (330); until N iterations are finished (340), construct the acquisition function (351), optimise the acquisition function (352), obtain the next sample of RNN parameters (353), generate (354) and evaluate (355) a secondary architecture, and re-fit the Gaussian process (356); output the trained Gaussian process (335); end (399).

Fig. 4: flowchart of the method 400, implementing step 310. Randomly initialise K sets of LSTM parameters (weights, biases) (410); generate K independent LSTM cells (420); end (499).

Fig. 5: flowchart of the method 500, implementing steps 320 and 355. Train the secondary architectures (590) on the problem training dataset (510, 520); determine validation performance (530, 540) on the problem validation dataset; combine the validation performance with user constraints (160) into an overall performance measure (550, 570); end (599).

Fig. 6: flowchart of the method 600, implementing steps 330 and 356. Assume kernel(s) (610) over the RNN (LSTM) parameters (605) and performance measure (670); initialise hyperparameters (620); maximise the negative log of the marginal likelihood (630); generate an updated Gaussian process (640); end (699).

Fig. 7: flowchart of the method 700, implementing step 351. For a sample of LSTM parameters (720) and the updated Gaussian process (710), obtain the mean (µ) and standard deviation (σ) of the performance measure for the sample (740); determine the acquisition function value, e.g. expected improvement (EI) (730); end (799).

Fig. 8: flowchart of the method 800, implementing step 352. Randomly initialise p samples of LSTM parameters (810); maximise the acquisition function value starting from each sample (820), using the updated Gaussian process (805); obtain the best sample of RNN (LSTM) parameters (840); end (899).

Fig. 9: flowchart of the method 900, implementing steps 150, 315 and 354. Encodings of primary layers P1, P2 and P3 from the primary architecture encoding (140) (910a-c, 920a-c) are processed by a shared RNN (LSTM) cell (930, 940) and then by linear (950) and sigmoid (960) layers to produce per-layer outputs and reduction factors (970, 980) used to generate the secondary architecture (990); end (999).

Fig. 10A: schematic block diagram of a general-purpose computer system 1000, including a processor (1005), memory (1006), storage devices (1009) including a HDD (1010), I/O interfaces (1008, 1013), an audio-video interface (1007), a local network interface (1011), an optical disk drive (1012) and an external modem (1016), connected to a local-area communications network (1022) and a wide-area communications network (1020), with peripherals including a video display (1014), keyboard (1002), scanner (1026), disk storage medium (1025), camera (1027), microphone (1080) and printer (1015).

Fig. 10B: detailed schematic block diagram of the processor 1005 and memory (1033, 1034) of Fig. 10A, showing the control unit (1039), ALU (1040), registers (1044-1046) and interface (1042) within the processor, and memory contents including instructions (1028-1031) and data (1035-1037), a ROM (1049) containing the POST (1050), BIOS (1051), bootstrap loader (1052) and operating system (1053), and input (1054-1057), output (1061-1064) and intermediate (1058-1060, 1066-1067) variables.
AU2019201716A 2019-03-13 2019-03-13 System and method of generating a neural network architecture Abandoned AU2019201716A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019201716A AU2019201716A1 (en) 2019-03-13 2019-03-13 System and method of generating a neural network architecture

Publications (1)

Publication Number Publication Date
AU2019201716A1 true AU2019201716A1 (en) 2020-10-01

Family

ID=72608225

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019201716A Abandoned AU2019201716A1 (en) 2019-03-13 2019-03-13 System and method of generating a neural network architecture

Country Status (1)

Country Link
AU (1) AU2019201716A1 (en)


Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US12020476B2 (en) 2017-03-23 2024-06-25 Tesla, Inc. Data synthesis for autonomous control systems
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US12216610B2 (en) 2017-07-24 2025-02-04 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US12086097B2 (en) 2017-07-24 2024-09-10 Tesla, Inc. Vector computational unit
US12307350B2 (en) 2018-01-04 2025-05-20 Tesla, Inc. Systems and methods for hardware-based pooling
US12455739B2 (en) 2018-02-01 2025-10-28 Tesla, Inc. Instruction set architecture for a vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US12079723B2 (en) 2018-07-26 2024-09-03 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US12346816B2 (en) 2018-09-03 2025-07-01 Tesla, Inc. Neural networks for embedded devices
US11983630B2 (en) 2018-09-03 2024-05-14 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US12367405B2 (en) 2018-12-03 2025-07-22 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US12198396B2 (en) 2018-12-04 2025-01-14 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US12136030B2 (en) 2018-12-27 2024-11-05 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US12223428B2 (en) 2019-02-01 2025-02-11 Tesla, Inc. Generating ground truth for machine learning from time series elements
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US12164310B2 (en) 2019-02-11 2024-12-10 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
US12236689B2 (en) 2019-02-19 2025-02-25 Tesla, Inc. Estimating object properties using visual image data
CN113516228A (en) * 2021-07-08 2021-10-19 哈尔滨理工大学 Network anomaly detection method based on deep neural network
US12462575B2 (en) 2021-08-19 2025-11-04 Tesla, Inc. Vision-based machine learning model for autonomous driving with adjustable virtual camera
US12522243B2 (en) 2021-08-19 2026-01-13 Tesla, Inc. Vision-based system training with simulated content
CN113743606B (en) * 2021-09-08 2024-12-17 广州文远知行科技有限公司 Searching method and device for neural network, computer equipment and storage medium
CN113743606A (en) * 2021-09-08 2021-12-03 广州文远知行科技有限公司 A neural network search method, device, computer equipment and storage medium


Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application