KR20240028915A

KR20240028915A - Dropout method and system for efficiently training deep learning models

Info

Publication number: KR20240028915A
Application number: KR1020230057813A
Authority: KR
Inventors: 서지원; 이준열
Original assignee: 한양대학교 산학협력단
Priority date: 2022-08-25
Filing date: 2023-05-03
Publication date: 2024-03-05

Abstract

A dropout method and system for efficiently training deep learning models are disclosed. According to one embodiment, a dropout method performed by a dropout system may include: limiting the number of subnetworks that participate in training over the entire training period of a deep learning model; training each subnetwork selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks; and generating a dropout mask for the matrix multiplication operation of each of the selected subnetworks.

Description

Dropout method and system for efficiently training deep learning models {DROPOUT METHOD AND SYSTEM FOR EFFICIENTLY TRAINING DEEP LEARNING MODELS}

The following description relates to dropout techniques for addressing the overfitting problem in deep learning models.

Dropout is a technique that sets the values of some output neurons of each layer of a deep learning model to 0 with a certain probability, excluding them for a single training iteration only. To zero those neurons, a dropout mask consisting of 1s and 0s is generated and applied to the output neurons of each layer. Dropout avoids co-adaptation between adjacent neurons in the deep learning model and thereby mitigates overfitting, a phenomenon in which the model fits the training data too closely and fails to achieve high accuracy on evaluation data.

Because dropout randomly sets neurons in each layer of the deep learning model to 0 with a certain probability at every training iteration, a different subnetwork of the deep learning model is trained at each iteration. As a result, as many distinct subnetworks as there are training iterations in the training phase are each trained for only one iteration. In addition, because the neurons are zeroed at random, it is difficult to effectively exclude the zeroed neurons from computation on the GPU where the deep learning model is computed.
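As a non-limiting illustration of the conventional dropout described above, the following minimal sketch draws an independent 0/1 mask for every entry of a layer output and applies it element-wise. The function names `standard_dropout_mask` and `apply_mask` are hypothetical, the 6 x 8 shape is borrowed from Figure 1, and the rescaling of kept activations that many frameworks perform is deliberately omitted.

```python
import random


def standard_dropout_mask(batch_size, num_neurons, drop_prob, rng):
    """Return a batch_size x num_neurons mask of 0/1 values.

    Every entry is dropped (set to 0) independently with probability
    drop_prob, so each row usually encodes a different subnetwork.
    """
    return [[0 if rng.random() < drop_prob else 1 for _ in range(num_neurons)]
            for _ in range(batch_size)]


def apply_mask(outputs, mask):
    """Element-wise product of a layer output matrix and a 0/1 mask."""
    return [[o * m for o, m in zip(o_row, m_row)]
            for o_row, m_row in zip(outputs, mask)]


rng = random.Random(0)                      # fixed seed, only for repeatability
outputs = [[1.0] * 8 for _ in range(6)]     # 6 samples, 8 neurons
mask = standard_dropout_mask(6, 8, 0.5, rng)
dropped = apply_mask(outputs, mask)
```

Because a fresh mask is drawn on every call, calling `standard_dropout_mask` once per training iteration reproduces the behavior in which each subnetwork is seen for only one iteration.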

Gyro dropout, a dropout-based training technique for training deep learning models efficiently, may be provided.

Block-wise dropout, a GPU-friendly training technique that is a variant of gyro dropout, may be provided.

A dropout method performed by a dropout system may include: limiting the number of subnetworks that participate in training over the entire training period of a deep learning model; training each subnetwork selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks; and generating a dropout mask for the matrix multiplication operation of each of the selected subnetworks.

The limiting may include setting a plurality of hyperparameters, including a hyperparameter indicating the total number of subnetworks that participate in training over the entire training period of the deep learning model and a hyperparameter indicating the number of subnetworks to be trained concurrently in one training iteration.

The limiting may include limiting the number of subnetworks that participate in each training iteration of the deep learning model by adjusting the values of the set hyperparameters.

The training may include repeating, until the entire training period of the deep learning model ends, an operation of selecting some subnetworks to participate in training of the deep learning model based on the limited number of subnetworks, training the selected subnetworks for a preset number of training iterations, selecting other subnetworks to participate in training of the deep learning model, and training the other selected subnetworks for the preset number of training iterations.

The training may include generating a dropout mask such that each of the selected subnetworks trains on a plurality of consecutive data samples, based on the number of data samples to be trained on simultaneously in one training iteration.

The generating may include setting the values of contiguous neurons in a specific shape to 0 for each of the selected subnetworks so that those neurons are excluded from the GPU matrix multiplication operation.

A computer program stored in a non-transitory computer-readable recording medium may be provided to cause the dropout system to execute the dropout method.

A dropout system may include: a number limiting unit that limits the number of subnetworks that participate in training over the entire training period of a deep learning model; an iterative learning unit that trains each subnetwork selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks; and an optimization unit that generates a dropout mask for the matrix multiplication operation of each of the selected subnetworks.

Because a fixed number of subnetworks is preselected, implementation is simple and higher accuracy can be achieved over the training iterations.

Dropped-out computations can be pruned efficiently on the GPU, improving training throughput without loss of accuracy.

Figure 1 is a diagram for explaining the output of a layer to which dropout techniques are applied, according to one embodiment.
Figure 2 is a block diagram for explaining a dropout system, according to one embodiment.
Figure 3 is a flowchart for explaining a dropout method, according to one embodiment.
Figure 4 is a diagram for explaining a training operation using dropout, according to one embodiment.
Figure 5 is a diagram for explaining a block-wise dropout operation, according to one embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

Unlike conventional dropout, which trains a very large number of subnetworks (models consisting of a subset of the neurons of the deep learning model) each for a short training period, the embodiments train subnetworks selected from the deep learning model, based on a limited number of subnetworks, for a number of training iterations, with the aim of improving the ensemble performance of the trained subnetworks at evaluation time. Accordingly, the embodiments describe a dropout operation that improves the efficiency of neural network training.

Figure 1 is a diagram for explaining the output of a layer to which dropout techniques are applied, according to one embodiment.

Figure 1 shows each dropout technique applied to the output of a layer with 8 neurons and a batch size of 6. Figure 1(a) shows conventional dropout, Figure 1(b) shows the first dropout proposed in the embodiments (gyro dropout), and Figure 1(c) shows the second dropout proposed in the embodiments (block-wise dropout).

Among the techniques used to increase the accuracy of a trained model when training a deep learning model is dropout, which reduces overfitting, that is, the gap in inference accuracy between training and actual evaluation. As shown in Figure 1(a), conventional dropout sets some of the output neurons of each layer of the deep learning model to 0 with a specific probability, excluding them from training before they are fed as input to the next layer. Because the conventional dropout of Figure 1(a) selects and trains a new subnetwork of the deep learning model at every training iteration, a very large number of subnetworks is used for training, and each subnetwork is trained in only one training iteration.

Viewed from the perspective of ensemble learning, one of the theories of machine learning, dropout trains a different subnetwork of the deep learning model (that is, a model consisting of a subset of the neurons of the full deep learning model) for each training sample, and at evaluation time performs inference with an ensemble of all the trained subnetworks. From the ensemble learning perspective, the accuracy of the ensemble tends to increase with the diversity of the base models that participate in training, and there exists a number of base models at which the accuracy of the ensemble is highest. Accordingly, conventional dropout, which trains a very large number of subnetworks each for a short period, has room for improvement from the ensemble learning perspective.

Figure 1(b) shows a dropout technique that, instead of randomly removing neurons at every training iteration, trains a fixed number of subnetworks over the training iterations. Because each subnetwork is trained stably, the subnetworks become more diverse and the ensemble generalizes well. Table 1 shows the gyro dropout algorithm proposed in the embodiments. Before training, the dropout system preselects the subnetworks that will participate in training, trains each of the selected subnetworks for a specific number of training iterations, then selects the next subnetworks and trains them for the specific number of training iterations, repeating this until the entire training phase is complete.

Table 1:

Figure 1(c) relates to block-wise dropout, a GPU-friendly variant of the technique of Figure 1(b). Instead of dropping out individual neurons, block-wise dropout selects blocks of neurons and sets all of their outputs to 0. Block-wise dropout groups the hidden neurons into multiple groups whose members are always dropped together throughout training, so warp execution can be pruned efficiently on the GPU.

As shown, for the six samples of the mini-batch, the conventional dropout of Figure 1(a) trains six different subnetworks. In other words, the dropout pattern of each row differs from that of every other row. In Figure 1(b), the top three rows share one dropout pattern and the bottom three rows share another, so only two subnetworks are trained; in this case, the number of subnetworks trained concurrently is 2.
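The row-by-row comparison above can be expressed, as a minimal sketch, by counting the distinct rows of a mask. The two 6 x 8 masks below are hypothetical patterns written only in the spirit of Figure 1; they are not the patterns drawn in the figure, and `count_subnetworks` is an illustrative helper name.

```python
def count_subnetworks(mask):
    """Number of distinct rows (dropout patterns) in a 0/1 mask."""
    return len({tuple(row) for row in mask})


# Hypothetical conventional-dropout mask: every row has its own pattern.
conventional = [
    [1, 0, 1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 0, 1, 1, 0],
]

# Hypothetical gyro-dropout mask: top three rows share one pattern,
# bottom three rows share another.
pattern_a = [1, 0, 1, 1, 0, 1, 0, 1]
pattern_b = [0, 1, 1, 0, 1, 1, 1, 0]
gyro = [list(pattern_a) for _ in range(3)] + [list(pattern_b) for _ in range(3)]

print(count_subnetworks(conventional))  # 6
print(count_subnetworks(gyro))          # 2
```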

Experiments can confirm that when the same subnetwork is used to train on one or more training samples (that is, when neurons at the same positions are dropped out in consecutive rows, as in Figure 1(b)), thereby limiting the number of subnetworks of the deep learning model that participate in training over the entire training period, the accuracy of the trained model is higher than with conventional dropout. Building on this result, the embodiments may provide gyro dropout, a new training technique that controls the number of subnetworks generated during training and extends the training period of each subnetwork, and block-wise dropout, a GPU-friendly variant of gyro dropout.

Figure 2 is a block diagram for explaining a dropout system according to one embodiment, and Figure 3 is a flowchart for explaining a dropout method according to one embodiment.

The processor of the dropout system (100) may include a number limiting unit (210), an iterative learning unit (220), and an optimization unit (230). These components of the processor may be representations of different functions performed by the processor according to control instructions provided by program code stored in the dropout system. The processor and its components may control the dropout system to perform steps 310 to 330 of the dropout method of Figure 3. The processor and its components may be implemented to execute instructions according to the code of an operating system and the code of at least one program contained in memory.

The processor may load into memory the program code stored in a file of a program for the dropout method. For example, when the program is executed in the dropout system, the processor may control the dropout system to load the program code from the file of the program into memory under the control of the operating system. The number limiting unit (210), the iterative learning unit (220), and the optimization unit (230) may each be a different functional representation of the processor that executes the instructions of the corresponding portion of the program code loaded into memory in order to perform the subsequent steps 310 to 330.

In step 310, the number limiting unit (210) may limit the number of subnetworks that participate in training over the entire training period of the deep learning model. The number limiting unit (210) may set a plurality of hyperparameters, including a hyperparameter indicating the total number of subnetworks that participate in training over the entire training period of the deep learning model and a hyperparameter indicating the number of subnetworks to be trained concurrently in one training iteration. The number limiting unit (210) may limit the number of subnetworks that participate in each training iteration of the deep learning model by adjusting the values of the set hyperparameters.

In step 320, the iterative learning unit (220) may train each subnetwork selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks. The iterative learning unit (220) may repeat, until the entire training period of the deep learning model ends, an operation of selecting some subnetworks to participate in training of the deep learning model based on the limited number of subnetworks, training the selected subnetworks for the preset number of training iterations, selecting other subnetworks to participate in training, and training the other selected subnetworks for the preset number of training iterations. The iterative learning unit (220) may generate a dropout mask such that each of the selected subnetworks trains on a plurality of consecutive data samples, based on the number of data samples to be trained on simultaneously in one training iteration.

In step 330, the optimization unit (230) may generate a dropout mask for the matrix multiplication operation of each of the selected subnetworks. The optimization unit (230) may set the values of contiguous neurons in a specific shape to 0 for each of the selected subnetworks so that those neurons are excluded from the GPU matrix multiplication operation.

Figure 4 is a diagram for explaining a training operation using dropout, according to one embodiment.

This is an example showing the output matrix of the first layer. The height of the matrix represents the batch size, and the width of the matrix is the number of neurons. Figures 4(a) and 4(b) both illustrate training four subnetworks concurrently. In Figure 4(a), the four subnetworks are within a single batch; in Figure 4(b), the four subnetworks span two batches. The right-hand figure shows the forward computation of a three-layer deep learning model with block-wise dropout.

The dropout system may limit the number of subnetworks that participate in each training iteration by using two hyperparameters: the number of preselected subnetworks and the number of subnetworks trained concurrently in a training iteration. Figure 4(a) shows a deep learning model being trained for 40 training iterations with gyro dropout in which the number of concurrently trained subnetworks is 4 and the number of preselected subnetworks is 16. The matrices in Figure 4 all represent the output neurons of one layer of the deep learning model, and the 8 rows of each matrix mean that the deep learning model trains on 8 data samples simultaneously in one training iteration. Because the total number of subnetworks that participate in training is 16 and the number of subnetworks that participate in one training iteration is 4, a new set of subnetworks to train can be selected four times. In addition, because 8 data samples can be trained on simultaneously in one training iteration, the dropout mask can be generated so that each subnetwork trains on two consecutive data samples.
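The arithmetic of the Figure 4(a) example (16 preselected subnetworks, 4 trained concurrently, 40 iterations, 8 samples per iteration) can be illustrated with the following minimal sketch. It is not the algorithm of Table 1. The function name `gyro_dropout_masks`, its parameter names, the random preselection of patterns, and the equal split of 10 iterations per group of subnetworks are all assumptions made for illustration.

```python
import random


def gyro_dropout_masks(num_iterations, batch_size, num_neurons,
                       num_preselected, num_concurrent, drop_prob, seed=0):
    """Yield one batch_size x num_neurons 0/1 mask per training iteration."""
    if num_preselected % num_concurrent or batch_size % num_concurrent:
        raise ValueError("sizes must be multiples of num_concurrent")
    rng = random.Random(seed)
    # Preselect a fixed pool of subnetworks (one 0/1 pattern each) before training.
    pool = [[0 if rng.random() < drop_prob else 1 for _ in range(num_neurons)]
            for _ in range(num_preselected)]
    num_groups = num_preselected // num_concurrent            # 16 // 4 = 4
    # Assumption: iterations are split evenly among groups (40 // 4 = 10).
    iters_per_group = max(1, num_iterations // num_groups)
    rows_per_subnet = batch_size // num_concurrent            # 8 // 4 = 2
    for it in range(num_iterations):
        group = min(it // iters_per_group, num_groups - 1)
        active = pool[group * num_concurrent:(group + 1) * num_concurrent]
        # Consecutive rows of the batch share the same subnetwork pattern.
        yield [list(active[r // rows_per_subnet]) for r in range(batch_size)]


masks = list(gyro_dropout_masks(num_iterations=40, batch_size=8, num_neurons=8,
                                num_preselected=16, num_concurrent=4,
                                drop_prob=0.5))
```

Under these assumptions the mask stays fixed for 10 iterations, changes three times over the 40 iterations, and every pair of adjacent rows carries the same pattern.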

The dropout system may generate the dropout mask so that output neurons whose values have become 0 can be excluded from the GPU matrix multiplication operation. Block-wise dropout is gyro dropout in which every subnetwork participating in training has a form where a certain number of contiguous neurons are set to 0, as in Figure 1(c). If dropout excludes output neurons from training with a certain probability, the zeroed neurons are scattered irregularly, as in conventional dropout; the GPU matrix multiplication then cannot skip the excluded neurons and computes them anyway, so computation time is not reduced. With block-wise dropout, however, neurons are zeroed contiguously in the horizontal direction and the same subnetwork is used for consecutive rows, so the zeroed neurons form rectangles that can be effectively excluded from the GPU matrix multiplication, which reduces computation time.
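The rectangular zero regions described above can be illustrated with the following minimal sketch of a mask whose zeros come only in whole blocks. The function name `blockwise_mask` and its parameters are hypothetical, and for brevity the sketch draws each block at random rather than from a preselected pool of subnetworks.

```python
import random


def blockwise_mask(batch_size, num_neurons, block_rows, block_cols,
                   drop_prob, rng):
    """0/1 mask whose zeros form block_rows x block_cols rectangles."""
    if batch_size % block_rows or num_neurons % block_cols:
        raise ValueError("mask shape must be a multiple of the block shape")
    mask = []
    for _ in range(batch_size // block_rows):
        # One keep/drop decision per block of contiguous neurons...
        row = []
        for _ in range(num_neurons // block_cols):
            keep = 0 if rng.random() < drop_prob else 1
            row.extend([keep] * block_cols)
        # ...shared by block_rows consecutive rows of the batch.
        mask.extend(list(row) for _ in range(block_rows))
    return mask


small_mask = blockwise_mask(batch_size=4, num_neurons=8, block_rows=2,
                            block_cols=4, drop_prob=0.5, rng=random.Random(1))
```

The small 2 x 4 block shape is chosen only so the mask is easy to inspect; the description later discusses 128×32 as the block shape on the GPU.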

Figure 5 is a diagram for explaining a block-wise dropout operation, according to one embodiment.

To explain block-wise dropout, the GPU execution model and the way matrix multiplication is computed on a GPU are described first. A GPU has a plurality of streaming multiprocessors (SMs), each of which executes many hardware threads concurrently. The GPU adopts the SIMT (single instruction, multiple threads) model, in which a group of threads executes the same instruction sequence simultaneously. The execution model is supported by a programming abstraction that provides the concept of a thread block. A thread block is a group of threads that is assigned to the same streaming multiprocessor and executed in parallel. A thread block consists of hundreds to thousands of threads and is organized in a 1D, 2D, or 3D structure to represent multidimensional data in the application domain. Internally, the GPU divides each thread block into scheduling units called warps, each consisting of 32 threads that execute the same instruction sequence.

SIMT is inefficient for arbitrary sparse matrix operations because all threads of a warp must execute the same instruction sequence. Consider, for example, the multiplication of an arbitrary sparse matrix on a GPU. When a warp executes the multiply and add computations, it cannot skip the multiplications by 0 to save GPU cycles, because all threads of the warp execute the same instructions. Under the SIMT execution model, the multiplication can be skipped and execution sped up only when every thread of the warp would multiply by 0.
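The all-or-nothing condition above can be illustrated with the following minimal sketch, which counts how many groups of 32 consecutive mask values are entirely zero. It is a simplified model that assumes each of the 32 threads of a warp handles one consecutive element; the name `skippable_warps` is hypothetical and the sketch does not model an actual GEMM kernel.

```python
WARP_SIZE = 32  # threads per warp, as stated in the description


def skippable_warps(mask_row):
    """Count warps whose 32 lanes would all multiply by zero."""
    count = 0
    for start in range(0, len(mask_row), WARP_SIZE):
        lanes = mask_row[start:start + WARP_SIZE]
        if len(lanes) == WARP_SIZE and not any(lanes):
            count += 1
    return count


scattered = [i % 2 for i in range(128)]  # 64 zeros, alternating with ones
blocked = [0] * 64 + [1] * 64            # the same 64 zeros, contiguous

print(skippable_warps(scattered))  # 0
print(skippable_warps(blocked))    # 2
```

Both rows drop the same number of neurons, but only the contiguous layout leaves whole warps with nothing to compute.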

To hide DRAM latency, GEMM (general matrix multiplication) kernels typically apply tiling and pipelining. The tiling optimization computes the multiplication for a group of elements, called a tile, at once, instead of computing each element of the output matrix individually. The pipelining optimization overlaps loading tiles from DRAM with tile computation.

CUTLASS, a state-of-the-art open-source GEMM kernel, also applies both optimizations. In CUTLASS, a tile of the output matrix is computed by a single thread block, so the tile is called a thread block tile. Figure 5 shows a thread block tile of CUTLASS. The default size of a thread block tile is 128×128. The shapes of the thread block tile and the other tiles may be chosen to achieve coalesced memory access and to maximize memory bandwidth efficiency. To compute the multiplication for a tile, the input matrices are loaded into the level 1 cache (that is, shared memory) as input tiles of shape 128×8 (or 8×128), shown as the diagonally hatched boxes on the left and right. A thread block tile is processed by a single thread block consisting of 256 threads, or 8 warps (labeled W0 to W7), each of which computes four warp tiles of shape 32×16, shown on the right side of Figure 5. While an input tile is being loaded, computation on the previously loaded tile is executed, so memory loads and computation are pipelined and overlapped.
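As a quick consistency check of the tile sizes quoted above, the following minimal sketch verifies that 8 warps, each computing four 32×16 warp tiles, cover a 128×128 thread block tile exactly. The constant names are illustrative only and are not identifiers from CUTLASS.

```python
THREAD_BLOCK_TILE = (128, 128)   # output tile computed by one thread block
WARP_TILE = (32, 16)             # shape of one warp tile
WARPS_PER_BLOCK = 8              # W0 to W7
WARP_TILES_PER_WARP = 4
THREADS_PER_WARP = 32

threads_per_block = WARPS_PER_BLOCK * THREADS_PER_WARP
tile_elems = THREAD_BLOCK_TILE[0] * THREAD_BLOCK_TILE[1]
covered = (WARPS_PER_BLOCK * WARP_TILES_PER_WARP
           * WARP_TILE[0] * WARP_TILE[1])
elems_per_thread = tile_elems // threads_per_block

print(threads_per_block)    # 256
print(tile_elems, covered)  # 16384 16384
print(elems_per_thread)     # 64
```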

The dropout system may determine the tiling and the dropout block shape by considering two factors: the coalesced memory access of the GEMM kernel and the effect of subnetwork preselection.

GEMM kernels generally exploit the memory subsystem architecture of the GPU and fetch data from DRAM into shared memory at a cache-line granularity of 128 bytes. Pruning at a smaller granularity cannot fully utilize the memory subsystem and wastes memory bandwidth. For efficient memory transactions, the dropout block size should therefore be at least 32 elements (for example, 128 bytes). At the same time, the computation of an entire row or column of the thread block tile should be pruned so that loading of the corresponding input data (from shared memory to registers) can be skipped. This requires one dimension of the dropout block to be 128 elements. Accordingly, the 128×32 and 32×128 dropout block shapes are considered as candidates. The former assigns the larger value to the batch dimension (that is, the height of the thread block tile), and the latter assigns the larger value to the output neuron dimension. It can further be considered that the number of concurrently trained subnetworks that maximizes accuracy is fairly small. In addition, assigning the larger value to the output neuron dimension, that is, using 32×128 dropout blocks, greatly reduces the total number of possible subnetworks. Considering these factors, the dropout block shape may be set to 128×32.
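The size reasoning above can be illustrated with the following minimal sketch. It assumes 4-byte (32-bit) elements, which is consistent with the stated equivalence of 32 elements and 128 bytes, and it uses a hypothetical layer width of 1024 neurons to compare how many neuron blocks each candidate shape leaves available for forming subnetworks. The helper name `neuron_blocks` is illustrative.

```python
CACHE_LINE_BYTES = 128
BYTES_PER_ELEMENT = 4  # assumption: 32-bit floating-point elements

# Smallest block that still fills a whole cache line.
min_block_elems = CACHE_LINE_BYTES // BYTES_PER_ELEMENT


def neuron_blocks(hidden_width, block_neurons):
    """Number of independent neuron blocks along the neuron dimension."""
    return hidden_width // block_neurons


hidden_width = 1024  # hypothetical layer width, for illustration only
for batch_rows, neurons in [(128, 32), (32, 128)]:
    blocks = neuron_blocks(hidden_width, neurons)
    # Each block is either kept or dropped, so at most 2**blocks patterns.
    print(f"{batch_rows}x{neurons}: {blocks} neuron blocks, "
          f"up to 2**{blocks} drop patterns")
```

With this hypothetical width, the 128×32 shape leaves 32 neuron blocks to choose from while the 32×128 shape leaves only 8, which illustrates why the wider block restricts the set of possible subnetworks more strongly.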

Block-wise dropout supports two types of pruning: input-based pruning and output-based pruning. As the names suggest, the former prunes neurons that are dropped out of the input matrix and the latter prunes neurons that are dropped out of the output matrix. Of the two, input-based pruning is slightly more efficient than output-based pruning because it makes loading the input data more efficient. Input pruning may therefore be prioritized over output pruning. The right-hand figure of Figure 4 shows the forward computation of a three-layer deep learning model with block-wise dropout: output pruning is applied to the first layer, and input pruning is applied to the last two layers.
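The difference between the two pruning types can be illustrated with the following minimal sketch of a dense layer computed in plain Python. The function name `matmul_pruned` and its parameters are hypothetical, and for simplicity the same neurons are treated as dropped for every row of the batch, whereas block-wise dropout applies a pattern per block of rows. The sketch shows only which work is skipped, not how a GPU kernel skips it.

```python
def matmul_pruned(x, w, dropped_inputs=(), dropped_outputs=()):
    """Compute x @ w with dropout applied, skipping the pruned work.

    dropped_inputs:  indices of input neurons already zeroed by dropout
                     (input-based pruning skips their multiply-adds).
    dropped_outputs: indices of output neurons that dropout will zero
                     (output-based pruning never computes them).
    """
    skip_in, skip_out = set(dropped_inputs), set(dropped_outputs)
    n_in, n_out = len(w), len(w[0])
    y = [[0.0] * n_out for _ in x]
    for i, row in enumerate(x):
        for j in range(n_out):
            if j in skip_out:
                continue  # output stays 0.0, no work done
            y[i][j] = sum(row[k] * w[k][j]
                          for k in range(n_in) if k not in skip_in)
    return y


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # 2 samples, 3 input neurons
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 inputs, 2 output neurons

print(matmul_pruned(x, w))                       # [[4.0, 5.0], [10.0, 11.0]]
print(matmul_pruned(x, w, dropped_inputs=[2]))   # [[1.0, 2.0], [4.0, 5.0]]
print(matmul_pruned(x, w, dropped_outputs=[1]))  # [[4.0, 0.0], [10.0, 0.0]]
```

Input-based pruning avoids reading the dropped input values at all, which corresponds to the input-loading advantage noted in the description.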

The device described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that execute on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device; however, those of ordinary skill in the art will appreciate that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct a processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims set forth below.

Claims

A dropout method performed by a dropout system, the method comprising:
limiting the number of subnetworks that participate in training over the entire training period of a deep learning model;
training each of the subnetworks selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks; and
generating a dropout mask for the matrix multiplication operation of each of the selected subnetworks.

The dropout method of claim 1,
wherein the limiting comprises:
setting a plurality of hyperparameters, including a hyperparameter indicating the total number of subnetworks that participate in training over the entire training period of the deep learning model and a hyperparameter indicating the number of subnetworks to be trained simultaneously in one training iteration.

The dropout method of claim 2,
wherein the limiting comprises:
limiting the number of subnetworks that participate in each training iteration of the deep learning model by adjusting the values of the plurality of set hyperparameters.

The dropout method of claim 1,
wherein the training comprises:
repeating, until the entire training period of the deep learning model ends, an operation of selecting, based on the limited number of subnetworks, some subnetworks to participate in training of the deep learning model, training the selected subnetworks for a preset number of training iterations, selecting other subnetworks to participate in training of the deep learning model, and training the other selected subnetworks for the preset number of training iterations.

The dropout method of claim 1,
wherein the training comprises:
generating a dropout mask such that a plurality of consecutive data items are trained on each of the selected subnetworks, based on the number of data items to be trained simultaneously in one training iteration.

The dropout method of claim 1,
wherein the generating comprises:
excluding neurons from the GPU matrix multiplication operation by setting, through each of the selected subnetworks, the values of consecutive neurons forming a specific shape to 0.

A computer program stored in a non-transitory computer-readable recording medium for causing the dropout system to execute the dropout method of any one of claims 1 to 6.

A dropout system comprising:
a number limiter that limits the number of subnetworks that participate in training over the entire training period of a deep learning model;
an iterative learning unit that trains each of the subnetworks selected from the deep learning model for a preset number of training iterations, based on the limited number of subnetworks; and
an optimization unit that generates a dropout mask for the matrix multiplication operation of each of the selected subnetworks.

The dropout system of claim 8,
wherein the number limiter
sets a plurality of hyperparameters, including a hyperparameter indicating the total number of subnetworks that participate in training over the entire training period of the deep learning model and a hyperparameter indicating the number of subnetworks to be trained simultaneously in one training iteration.

The dropout system of claim 9,
wherein the number limiter
limits the number of subnetworks that participate in each training iteration of the deep learning model by adjusting the values of the plurality of set hyperparameters.

The dropout system of claim 8,
wherein the iterative learning unit
repeats, until the entire training period of the deep learning model ends, an operation of selecting, based on the limited number of subnetworks, some subnetworks to participate in training of the deep learning model, training the selected subnetworks for a preset number of training iterations, selecting other subnetworks to participate in training of the deep learning model, and training the other selected subnetworks for the preset number of training iterations.

The dropout system of claim 8,
wherein the iterative learning unit
generates a dropout mask such that a plurality of consecutive data items are trained on each of the selected subnetworks, based on the number of data items to be trained simultaneously in one training iteration.

The dropout system of claim 8,
wherein the optimization unit
excludes neurons from the GPU matrix multiplication operation by setting, through each of the selected subnetworks, the values of consecutive neurons forming a specific shape to 0.