Article

Simplified Routing Mechanism for Capsule Networks

by János Hollósi, Áron Ballagi and Claudiu Radu Pozna
1
Department of Informatics, Széchenyi István University, 9026 Győr, Hungary
2
Department of Automation, Széchenyi István University, 9026 Győr, Hungary
*
Authors to whom correspondence should be addressed.
Algorithms 2023, 16(7), 336; https://doi.org/10.3390/a16070336
Submission received: 11 April 2023 / Revised: 2 July 2023 / Accepted: 10 July 2023 / Published: 13 July 2023
(This article belongs to the Special Issue Deep Learning Architecture and Applications)

Abstract

Classifying digital images using neural networks is one of the most fundamental tasks in artificial intelligence. For a long time, convolutional neural networks have proven to be the most efficient solution for processing visual data, for tasks such as classification, detection, or segmentation. Their efficient operation, however, requires data augmentation and a high number of feature maps to embed object transformations, which is particularly inefficient for large datasets. In 2017, Geoffrey Hinton and his research team introduced capsule networks, which offer a solution to these problems of convolutional neural networks: sufficient efficiency can be achieved without large-scale data augmentation. However, the training time of Hinton's capsule network is much longer than that of convolutional neural networks. We have examined capsule networks and propose a modification to the routing mechanism that speeds up the algorithm. This can reduce the training time of capsule networks by almost half in some cases. Moreover, our solution also improves performance in image classification.

1. Introduction

For processing visual data, convolutional neural networks (CNNs) are currently proving to be the best solutions. The most popular applications of convolutional neural networks in the field of image processing are image classification [1,2], object detection [3,4], semantic segmentation [5,6] and instance segmentation [7,8]. However, the biggest challenge for convolutional neural networks is their inability to recognize the pose, texture and deformations of an object, which is caused by the pooling layers. Pooling layers operate on the feature maps; the most common variants are max pooling, min pooling, average pooling and sum pooling [9]. Thanks to this layer, a convolutional neural network can recognize the same object in different input images under different conditions with high efficiency. At the same time, the pooling layer reduces the size of the tensors, thus reducing the computational complexity of the network. In most cases, pooling layers are one of the best tools for feature extraction; however, they introduce spatial invariance into convolutional neural networks. Due to the nature of the pooling layer, a great amount of information is lost, which in some cases may include important features of the image. To compensate for this, a convolutional neural network needs a substantial amount of training data, and data augmentation becomes necessary.
Geoffrey Hinton and his research team introduced capsule network theory as an alternative to convolutional neural networks. Hinton et al. published the first paper in the field of capsule networks in 2011 [10], which explained the potential of the new theory, although a solution for training it effectively was not yet available. The next important milestone came in 2017, when Sabour et al. introduced the dynamic routing algorithm between capsule layers [11]. Thanks to this dynamic routing algorithm, capsule-based networks can be trained and optimized efficiently. Finally, Hinton et al. published a matrix capsule-based approach in 2018 [12]. These are the three most important results that the inventors of the theory have published in the field of capsule networks. The basic building block of convolutional neural networks is the neuron, while capsule networks are made up of so-called capsules. A capsule is a group of related neurons, where each neuron's output represents a different property of the same feature. Hence, the inputs and outputs of capsule networks are vectors (n-dimensional capsules), while neural networks work with scalar values (neurons). Instead of pooling layers, a dynamic routing algorithm was introduced in capsule networks. In this approach, lower-level features (lower-level capsules) are only sent to the higher-level capsules that match their contents. This property makes capsule networks a more effective solution than convolutional neural networks in some use cases.
However, the training process for capsule networks can be much longer than for convolutional neural networks, and due to the high number of parameters, the memory requirements of the network can be much higher. Therefore, capsule networks do not yet perform well on complex datasets (e.g., large input images, high numbers of output classes). This is due to the complexity of the dynamic routing algorithm. For this reason, we have attempted to modify the dynamic routing algorithm. Our primary aim was to reduce the training time, and our secondary aim was to achieve higher accuracy. In our method, we reduce the weight of the input capsule vector during the optimization in the routing process. We also propose a parameterizable activation function for vectors, based on the squash function. In this paper, we demonstrate the effectiveness of our proposed modified routing algorithm and compare it with other capsule network-based methods and convolutional neural network-based approaches.
This paper is structured as follows. In Section 2, we provide the theoretical background of the capsule network theory proposed by Hinton et al. [10] and Sabour et al. [11]. Section 3 describes our improved routing mechanism for capsule networks and our parameterizable squash activation function. Section 4 describes the capsule network architecture used in this research. In Section 5, we present the datasets used to compare the dynamic routing algorithm and our proposed solution. Our results are summarized in Section 6, where we compare our improved routing solution with Sabour et al.'s method and with some recently published neural network-based solutions. Finally, our conclusions are summarized in Section 7.

2. Theory of Capsule Network

The capsule network [10,11,12] (or CapsNet) is very similar to the classical neural network. The main difference is the basic building block. In the neural network, we use neurons, but in the capsule network, we can find capsules. Figure 1 and Figure 2 show the main differences between the classical artificial neurons and the capsules.
A capsule is a group of neurons that performs a multitude of internal computations and encapsulates the results into an n-dimensional vector, which is the output of the capsule. The length of this output vector represents the probability that the entity exists, while the direction of the vector encodes the properties of the entity.
In a capsule-based network, we use routing-by-agreement, where the output vector of each capsule is sent to all higher-level capsules. Each capsule output is compared with the actual output of the higher-level capsules, and where the outputs match, the coupling coefficient between the two capsules is increased.
Let $i$ be a lower-level capsule and $j$ be a higher-level capsule. The prediction vector is calculated as follows:
$$\hat{u}_{j|i} = W_{ij}\, u_i \tag{1}$$
where $W_{ij}$ is a trainable weighting matrix and $u_i$ is the output pose vector from the $i$-th capsule to the $j$-th capsule. The coupling coefficients are calculated with a simple SoftMax function, as follows:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{2}$$
where $b_{ij}$ is the log probability that capsule $i$ is coupled with capsule $j$, and it is initialized with zero values. The total input to capsule $j$ is a weighted sum over the prediction vectors, calculated as follows:
$$s_j = \sum_i c_{ij}\, \hat{u}_{j|i} \tag{3}$$
In capsule networks, we use the length of the output vector to represent the probability for the capsule. Therefore, we use a non-linear activation function, called the squashing function, defined as follows:
$$v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\, \frac{s_j}{\|s_j\|} \tag{4}$$
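For readers who prefer code, a minimal PyTorch-style sketch of this squashing function might look as follows; the tensor layout (capsule vectors along the last dimension) and the small stabilizing constant eps are our assumptions for illustration, not part of the original formulation.

```python
import torch

def squash(s: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Original squashing non-linearity (Equation (4)): shrinks short vectors
    toward zero and long vectors toward unit length, preserving direction."""
    sq_norm = (s * s).sum(dim=dim, keepdim=True)       # ||s||^2
    scale = sq_norm / (1.0 + sq_norm)                  # ||s||^2 / (1 + ||s||^2)
    return scale * s / torch.sqrt(sq_norm + eps)       # ... * s / ||s||
```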
We can use the dynamic routing algorithm (by Sabour et al. [11]) to update the $c_{ij}$ values in every iteration. In this case, the goal is to optimize the $v_j$ vector. In the dynamic routing algorithm, the $b_{ij}$ values are updated in every iteration, as follows:
$$b_{ij} = b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{5}$$
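Combining Equations (1)–(5), the routing loop of Sabour et al. [11] can be sketched as below, reusing the squash function from the sketch above. The tensor shapes (batch b, lower-level capsules i, higher-level capsules j, capsule dimension d) and variable names are our own assumptions for illustration.

```python
import torch

def dynamic_routing(u_hat: torch.Tensor, num_iterations: int = 3) -> torch.Tensor:
    """Dynamic routing by agreement (Sabour et al. [11]).
    u_hat: prediction vectors u_hat_{j|i} of shape (b, i, j, d), i.e. Equation (1)
    has already been applied. Returns the output capsules v_j of shape (b, j, d)."""
    b_ij = torch.zeros(u_hat.shape[:3], device=u_hat.device)      # log priors, shape (b, i, j)
    for _ in range(num_iterations):
        c_ij = torch.softmax(b_ij, dim=2)                          # coupling coefficients, Eq. (2)
        s_j = (c_ij.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum, Eq. (3)
        v_j = squash(s_j)                                          # Eq. (4)
        b_ij = b_ij + (u_hat * v_j.unsqueeze(1)).sum(dim=-1)       # agreement update, Eq. (5)
    return v_j
```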

3. Improved Routing Algorithm

Our experiments with capsule network theory have shown that the input tensor $\hat{u}_{j|i}$ has too large an impact on the output tensor in the dynamic routing algorithm and greatly increases the processing time. When calculating the output vector $v_j$, the formula includes the input $\hat{u}_{j|i}$ twice:
$$v_j = \mathrm{squash}\left(\sum_i \mathrm{softmax}\left(b_{ij} + \hat{u}_{j|i} \cdot v_j\right)\, \hat{u}_{j|i}\right) \tag{6}$$
To improve the routing mechanism between lower-level and higher-level capsules, the following modification to the routing algorithm is proposed:
$$v_j = \mathrm{squash}\left(\sum_i \mathrm{softmax}\left(b_{ij} + \tilde{v}_j\right)\, \hat{u}_{j|i}\right) \tag{7}$$
Let
$$v_j = \begin{bmatrix} c_{11} & \cdots & c_{1m} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nm} \end{bmatrix} \tag{8}$$
where $c_{kl}$ is the value of the $l$-th neuron of the $k$-th capsule. If $v_j$ is an intermediate capsule layer, then $n$ is the number of output capsules. If $v_j$ is an output capsule layer, then $n$ is the number of possible object categories. Let
$$\tilde{v} = \begin{bmatrix} \tilde{v}_1 & \tilde{v}_2 & \cdots & \tilde{v}_n \end{bmatrix} \tag{9}$$
where, for every $x \in \{1, 2, \ldots, n\}$,
$$\tilde{v}_x = \sum_{y=1}^{m} c_{xy} \tag{10}$$
so that $\tilde{v}_j$ in Equation (7) is the sum of the elements of the $j$-th output capsule.
This minimal modification makes the routing algorithm simpler and faster to compute. Our other proposed change concerns the squashing function. In the last capsule layer, we use a modified squashing function, as follows:
$$\mathrm{squash}_{\mathrm{our}}(s) = \frac{s\, e^{-\|s\|}\, \|s\|}{\|s\| + \varepsilon} \tag{11}$$
where $\varepsilon$ is a fine-tuning parameter. Based on our experience, we used $\varepsilon = 1 \times 10^{-7}$ in this work. Figure 3 shows a simple example of our squash function in a one-dimensional case for different values of $\varepsilon$.
Figure 4 and Figure 5 show a block diagram of the dynamic routing algorithm and our improved routing solution, where the main differences between the two methods are clearly visible.
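To make the difference concrete, the following sketch mirrors the dynamic routing loop above but applies the proposed changes as we read Equations (7)–(11): the agreement term based on the product of the prediction vectors and the output vectors is replaced by the scalar sums of Equation (10), and the modified squash of Equation (11) is used in the last capsule layer. The placement of the logit update across iterations and the sign of the exponent in squash_our follow our reconstruction of the equations and should be treated as assumptions, not as the authors' exact implementation.

```python
import torch

def squash_our(s: torch.Tensor, dim: int = -1, eps: float = 1e-7) -> torch.Tensor:
    # Our reading of Equation (11): rescale s by exp(-||s||) * ||s|| / (||s|| + eps);
    # eps is the fine-tuning parameter (1e-7 in this work).
    norm = s.norm(dim=dim, keepdim=True)
    return s * torch.exp(-norm) * norm / (norm + eps)

def simplified_routing(u_hat: torch.Tensor, num_iterations: int = 3) -> torch.Tensor:
    """Proposed routing sketch: u_hat enters the logit update only through the
    element sums of v_j, which removes one large tensor product per iteration.
    u_hat: (b, i, j, d); returns v_j: (b, j, d)."""
    b_ij = torch.zeros(u_hat.shape[:3], device=u_hat.device)
    for _ in range(num_iterations):
        c_ij = torch.softmax(b_ij, dim=2)
        s_j = (c_ij.unsqueeze(-1) * u_hat).sum(dim=1)
        v_j = squash_our(s_j)                                 # modified squash, Eq. (11)
        b_ij = b_ij + v_j.sum(dim=-1).unsqueeze(1)            # add the scalars of Eq. (10)
    return v_j
```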

4. Network Architecture

In this work, we used the network architecture proposed by Sabour et al. [11] to compare our proposed routing mechanism with other optimization solutions in the field of capsule networks. This capsule network architecture is shown in Figure 6. The original paper used a fixed 32 × 32 × 1-sized input tensor, because the network was only tested on the MNIST [13] dataset. In contrast, we trained and tested the capsule networks on six fundamentally different image classification datasets, so the shape of the input layer varies with the dataset; we used the following input shapes: 28 × 28 × 1, 48 × 48 × 1 and 32 × 32 × 3. After the input layer, the capsule network architecture consists of three main components: a convolutional layer, the primary capsule layer, and the secondary capsule layer.
The convolution layer contains 256 convolution kernels of size 9 × 9 with a stride of 1 , and a ReLU (rectified linear unit) [14] activation layer. This convolutional layer generates the main visual features based on intensities for the primary capsule layer.
The primary capsule block contains a convolutional layer, where both the input and the output have 256 channels. This capsule block also contains a squash layer; in this case, the original squash function (Equation (4)) is used in both implementations. The output of this block contains 32 capsules, where each capsule has 8 dimensions. This capsule block encodes more advanced features, which are passed on to the secondary capsule block.
The secondary capsule block has one capsule per class. As mentioned earlier, we worked with several different datasets, so the number of capsules in this block varies according to the number of classes in the dataset: 5, 10 or 43. This capsule block contains the routing mechanism, which is responsible for determining the connection weights between the lower- and higher-level capsules. Therefore, this capsule block represents the main difference between the solution of Sabour et al. and our presented method. In this block, we applied our proposed squash function (Equation (11)). The secondary capsule block contains a trainable matrix, called W (Equation (1)). The shape of the W matrix, for both solutions, is $p_c \times n \times 16 \times 8$, where $n$ is the number of output classes and $p_c$ depends on the input image shape as follows:
$$p_c = \begin{cases} 32 \times 6 \times 6, & \text{if } \mathit{imsize} = (28, 28) \\ 32 \times 8 \times 8, & \text{if } \mathit{imsize} = (32, 32) \\ 32 \times 16 \times 16, & \text{if } \mathit{imsize} = (48, 48) \end{cases} \tag{12}$$
where $\mathit{imsize}$ is the size of the input image. The routing algorithm was run for $r = 3$ iterations in both cases. The output of this capsule block is one 16-dimensional vector per class; that is, the block produces $n$ 16-dimensional capsules, where $n$ is the number of output classes. The lengths of the output capsules represent the probabilities of the corresponding classes.
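As a rough, self-contained illustration of the three blocks and tensor shapes described above, the sketch below wires them together for a 28 × 28 × 1 input and 10 classes. Kernel sizes and channel counts follow the text; the stride of 2 in the primary capsule convolution and the weight initialization are assumptions borrowed from Sabour et al. [11] (they reproduce $p_c = 32 \times 6 \times 6$ for a 28 × 28 input), and the routing function is intended to be one of the routing sketches above.

```python
import torch
import torch.nn as nn

class CapsNetSketch(nn.Module):
    """Sketch of the architecture in Figure 6 for 28x28x1 inputs and 10 classes."""

    def __init__(self, num_classes: int = 10, routing_fn=None):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9, stride=1)          # feature extraction
        self.primary = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)   # 32 capsule types x 8 dims (assumed stride 2)
        p_c = 32 * 6 * 6                                                  # primary capsules for a 28x28 input, Eq. (12)
        self.W = nn.Parameter(0.01 * torch.randn(p_c, num_classes, 16, 8))  # trainable W_ij
        self.routing_fn = routing_fn                                      # e.g. dynamic_routing or simplified_routing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv1(x))                                     # (b, 256, 20, 20)
        p = self.primary(h)                                               # (b, 256, 6, 6)
        p = p.view(x.size(0), 32, 8, 6, 6).permute(0, 1, 3, 4, 2)         # group channels into capsules
        p = squash(p.reshape(x.size(0), -1, 8))                           # (b, p_c, 8), original squash
        u_hat = torch.einsum('ijkl,bil->bijk', self.W, p)                 # prediction vectors, (b, p_c, classes, 16)
        v = self.routing_fn(u_hat)                                        # (b, classes, 16)
        return v.norm(dim=-1)                                             # capsule lengths = class probabilities
```

Such a model would be instantiated, for example, as CapsNetSketch(routing_fn=dynamic_routing) or CapsNetSketch(routing_fn=simplified_routing).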

5. Datasets

In this work, six different datasets were used, and a classification task was performed on each of them. We deliberately selected datasets with different levels of complexity, which allowed us to test as wide a range of data as possible. The selected datasets include grayscale and color images, the number of classes varies from 5 to 43, and the image sizes, although typically small, also differ. Some datasets use a fixed background, while others have variable, colored backgrounds. Table 1 shows the main properties of the six datasets used in this research.

5.1. MNIST

The MNIST dataset (Modified National Institute of Standards and Technology dataset) is a large set of handwritten digits (from 0 to 9 ) that is one of the most widely used datasets in the field of image classification. The MNIST dataset contains 60,000 training images and 10,000 testing images, where every image is grayscale with a 28 pixel width and 28 pixel height. Figure 7 shows some samples from this dataset.

5.2. Fashion-MNIST

The Fashion-MNIST (or F-MNIST) dataset is very similar to the MNIST dataset, and its main parameters are the same: it contains 60,000 training and 10,000 testing examples, and every sample is a 28 × 28 pixel grayscale image with inverted colors (an intensity value of 255 represents the darkest color). The Fashion-MNIST dataset contains 10 fashion categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot. Figure 8 shows some samples from this dataset.

5.3. SmallNORB

The SmallNORB dataset contains images of 3D objects. The specialty of this dataset is that the images were taken under several different lighting conditions and poses. This dataset contains images of toys belonging to five different categories: four-legged animals, human figures, airplanes, trucks and cars. The images were taken with two cameras under six lighting conditions, nine elevations and eighteen azimuths. All images are grayscale, with a size of 96 pixels in width by 96 pixels in height; however, in this work, we resized the images to 48 × 48 pixels. Figure 9 shows some samples from this dataset.

5.4. CIFAR10

The CIFAR10 (Canadian Institute for Advanced Research) dataset is one of the most widely used datasets in the field of machine learning-based image classification. This dataset is composed of 60,000 RGB colored images, where 50,000 images are the training samples and 10,000 images are the testing samples. Each image is 32 pixels wide and 32 pixels high. The object categories are the following: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks. Figure 10 shows some samples from this dataset.

5.5. SVHN

The SVHN (Street View House Numbers) dataset contains small, cropped digits, like the MNIST dataset. However, the SVHN dataset is slightly more complex than the MNIST dataset: it is obtained from house numbers in Google Street View, where the background of the digits is not homogeneous and an image may also include parts of adjacent digits. This property makes the SVHN dataset more difficult to classify than the MNIST dataset. The size of the images in this dataset is 32 pixels wide and 32 pixels high. The SVHN dataset contains 73,257 training samples and 26,032 testing samples. Figure 11 shows some samples from this dataset.

5.6. GTSRB

The GTSRB (German Traffic Sign Recognition Benchmark) dataset includes 43 classes of traffic signs. Each image contains one traffic sign with varying light conditions and rich backgrounds. Images are 32 pixels wide and 32 pixels high. The traffic sign classes are the following: speed limit { 20 , 30 , 50 , 60 , 70 , 80 , 100 , 120 } km/h, end of speed limit (80 km/h), no passing, no passing for vehicles over 3.5 metric tons, right-of-way at the next intersection, priority road, yield, stop, no vehicles, vehicles over 3.5 metric tons prohibited, no entry, general caution, dangerous curve to the {left, right}, double curve, bumpy road, slippery road, road narrows on the right, road work, traffic signals, pedestrians, children crossing, bicycles crossing, beware of ice/snow, wild animals crossing, end of all speed and passing limits, turn {right, left} ahead, ahead only, go straight or {right, left}, keep {right, left}, roundabout mandatory, end of no passing and end of no passing by vehicles over 3.5 metric tons. Figure 12 shows some samples from this dataset.

6. Results

In this work, the network architecture presented in Section 4 was implemented in three different ways and trained separately on the datasets presented in Section 5. The difference between the three networks is the routing algorithm used: the first is the original capsule network by Sabour et al., the second is our modified capsule network with the proposed improvements, and the third uses the efficient vector routing by Heinsen [20]. The capsule networks were trained separately on the six presented datasets. For the implementation, we used the Python 3.9.16 [21] programming language with the PyTorch 1.12.1 [22] machine learning framework and the CUDA toolkit 11.6 platform. The capsule networks were trained on the Paperspace [23] online artificial intelligence platform with an Nvidia Quadro RTX4000 series graphics processing unit.
We trained all networks for 35 epochs with the Adam [24] optimizer, with a batch size of 128 for both training and testing. We also attempted to train the networks for many more epochs, but found that the difference between the three solutions does not change significantly after 35 epochs. In this study, we used an initial learning rate of $5 \times 10^{-4}$. In each epoch, we reduced the learning rate as follows:
$$lr_i = lr_{\mathrm{init}} \times 0.97^{\,i} \tag{13}$$
where $lr_{\mathrm{init}}$ is the initial learning rate and $lr_i$ is the learning rate in the $i$-th epoch. We used the hyperparameter values $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to control the exponential decay rates of the moment estimates, and $\varepsilon = 1 \times 10^{-8}$ to prevent division by zero in the implementation. In the training process, we used the same loss function as proposed by Sabour et al.:
$$\mathcal{L}_k = T_k \times \max(0,\, m^+ - p)^2 + \lambda \times (1 - T_k) \times \max(0,\, p - m^-)^2 \tag{14}$$
where $p$ denotes the length of the output capsule belonging to class $k$, and
$$T_k = \begin{cases} 1, & \text{if an object of class } k \text{ is present} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$
$m^+$, $m^-$ and $\lambda$ are hyperparameters. In the present work, we used the same values as proposed by Sabour et al.: $m^+ = 0.9$, $m^- = 0.1$ and $\lambda = 0.5$. The original study also used reconstruction of the input image in the training process; however, we did not apply this in our work, even though training without reconstruction reduces the efficiency of the capsule network somewhat. In the long term, we want to translate our results in the field of capsule networks into real-world applications, where reconstruction of the input image is rarely a necessary step. Therefore, we explicitly investigated the capabilities of capsule networks without reconstruction.
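A condensed sketch of this training configuration (the margin loss of Equations (14) and (15), and Adam with the per-epoch exponential decay of Equation (13)) could look as follows; the helper names and the use of torch.optim.lr_scheduler.ExponentialLR are our choices for illustration, not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def margin_loss(lengths: torch.Tensor, targets: torch.Tensor,
                m_plus: float = 0.9, m_minus: float = 0.1, lam: float = 0.5) -> torch.Tensor:
    """Margin loss of Equation (14). lengths: output capsule lengths (b, classes);
    targets: integer class labels (b,)."""
    t_k = F.one_hot(targets, num_classes=lengths.size(1)).float()          # T_k, Eq. (15)
    present = t_k * torch.clamp(m_plus - lengths, min=0.0) ** 2
    absent = lam * (1.0 - t_k) * torch.clamp(lengths - m_minus, min=0.0) ** 2
    return (present + absent).sum(dim=1).mean()

# Assumed training setup: Adam with lr_i = lr_init * 0.97^i, stepped once per epoch.
# model = CapsNetSketch(routing_fn=simplified_routing)        # hypothetical, from the sketch above
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999), eps=1e-8)
# scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)
# for epoch in range(35):
#     ...  # one epoch over batches of size 128, loss = margin_loss(model(x), y)
#     scheduler.step()
```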
Figures 13–18 show the test accuracy during the training processes on the six presented datasets with different numbers of routing iterations. The value of r indicates the number of iterations for which the routing algorithm optimizes the coupling coefficients. Based on Sabour et al.'s experiments, r = 3 is a good choice; however, we also show the efficiency with r = 1 and r = 10, which makes the differences between the routing methods more visible. As can be seen, on all six datasets we achieved efficiency gains compared to the two other capsule network solutions. The difference in efficiency between our solution and that of Sabour et al. is minimal for roughly the first 10 epochs; after that, there is a noticeable difference in the learning curves. Although Heinsen's solution also proves to be effective in most cases, its performance is slightly lower than that of the other two solutions. It is also noticeable that changing the number of iterations has a much larger impact on the solutions of Sabour et al. and Heinsen; our proposed solution is less sensitive to the iteration value chosen during the optimization.
Figures 19–24 show the test losses (Equation (14)) during the training processes with different numbers of routing iterations. There is not much difference in the loss curves, but it is noticeable that our solution is less noisy and converges more smoothly.
Figure 25 shows the training times for the three capsule-based networks. It can be seen that for all six datasets, the capsule network was faster with our proposed routing algorithm than with the dynamic routing algorithm introduced by Sabour et al. The smallest improvement was achieved on the Fashion-MNIST and MNIST datasets, but this still represents an 18.60% and a 19.33% speedup, respectively. For more complex datasets, much higher speed increases were achieved: a running time reduction of 25.55% for SVHN and 26.54% for CIFAR10. The best results were observed on the SmallNORB and GTSRB datasets, with reductions of 35.28% and 48.30%, respectively. Compared to Heinsen's solution, our proposed algorithm required more training time; however, the difference in classification performance between the two solutions is significant.
The test errors during the training process are shown in Table 2, where the capsule-based solutions are compared with recently released neural network-based approaches. It can be clearly seen that our proposed modifications to the routing algorithm lead to efficiency gains. It is important to note that our capsule-based solution does not always approach the effectiveness of state-of-the-art solutions; however, the capsule network used consists of only three layers with a very small number of parameters (<5.4 M). Its architecture is quite simple, but with further improvements, higher efficiency can be achieved. Before that, we felt it necessary to improve the efficiency of the routing algorithm: experience has shown that designing deep network architectures in the area of capsule networks is too resource-intensive, so it is necessary to increase the processing speed of the routing algorithm.
Figure 26 shows the confusion matrices for the capsule network-based approaches on the six datasets used. It can be seen that for simpler datasets, such as MNIST, the difference between the three optimization algorithms is minimal, while for more complex datasets, the differences are more pronounced. Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the accuracies achieved by the capsule networks for each class, separately. These tables also show that our proposed method is in most cases able to provide a more effective solution than the other solutions tested. However, there are cases where our solution falls short of the solutions by Sabour et al. and Heinsen.
Table 9 summarizes, for each dataset, the percentage of classes for which a given solution provided the best result. From this perspective as well, our solution performed best: only in the case of the GTSRB dataset was the method proposed by Sabour et al. more efficient, while for the other five datasets our proposed method achieved the best accuracy for most classes. Table 10, Table 11 and Table 12 summarize the recall score, dice score and F1-score for the capsule-based implementations under study. It can be observed that, according to all three metrics, our proposed method performs best. The solution by Sabour et al. performs better only on the Fashion-MNIST dataset, where the two methods achieved the same test error (Table 2). The solution by Heinsen underperforms the other two solutions in the cases studied. It can also be seen that, where the method of Sabour et al. and our approach perform worse, the score of Heinsen's solution also decreases in a similar way.

7. Conclusions

Our work involved research in the field of capsule networks. We showed the main differences between classical convolutional neural networks and capsule networks, highlighting the new potential of capsule networks. We have shown that the dynamic routing algorithm for capsule networks is too complex and that the resulting training time makes it difficult to build deeper and more complex networks. At the same time, capsule networks can achieve very good efficiency, but their practical application is difficult due to the complexity of the routing. Therefore, it is important to improve the optimization algorithm and introduce novel solutions.
We proposed a modified routing algorithm for capsule networks and a parameterizable activation function for capsules, based on the dynamic routing algorithm introduced by Sabour et al. In this approach, we aimed to reduce the computational complexity of the current dynamic routing algorithm. Thanks to our proposed routing algorithm and activation function, the training time can be reduced. In our work, we demonstrated its effectiveness on six different datasets, in comparison with neural network-based and capsule-based solutions. As shown, the training time was reduced in all cases, by almost 30% on average; even in the worst case, a speedup of almost 20% was achieved, and in some cases the speedup approached 50%. Despite the increase in speed, the efficiency of the network did not decrease. Our proposed solution was compared with other capsule-based methods using several different metrics, and as we have shown, it can increase the efficiency of the routing mechanism. For all six datasets tested in this research, our solution provided the best results in almost all cases.
In the future, we would like to perform further research on routing algorithms and capsule networks in order to achieve even greater improvements. We would like to carry out more extensive studies on larger datasets, compare our method with other solutions, and further optimize it based on the test results. Our goal is to provide an efficient and fast solution for more complex tasks, such as instance segmentation or reconstruction, in the field of capsule networks. Creating much deeper and more complex capsule networks makes it important to address the issue of optimization; this will allow us to apply the theory of capsule networks to real practical applications with great efficiency.

Author Contributions

Conceptualization, J.H., Á.B. and C.R.P.; methodology, J.H., Á.B. and C.R.P.; software, J.H.; validation, J.H.; formal analysis, J.H.; investigation, J.H.; resources, J.H.; data curation, J.H.; writing—original draft preparation, J.H.; writing—review and editing, J.H.; visualization, J.H.; supervision, Á.B. and C.R.P.; project administration, J.H.; funding acquisition, J.H. and Á.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the European Union within the framework of the National Laboratory for Artificial Intelligence grant number RRF-2.3.1-21-2022-00004 and the APC was funded by RRF-2.3.1-21-2022-00004.

Data Availability Statement

Data sharing is not applicable.

Acknowledgments

The research was supported by the European Union within the framework of the National Laboratory for Artificial Intelligence (RRF-2.3.1-21-2022-00004).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, X.; Liang, C.; Huang, D.; Real, E.; Wang, K.; Liu, Y.; Pham, H.; Dong, X.; Luong, T.; Hsieh, C.; et al. Symbolic Discovery of Optimization Algorithms. arXiv 2023, arXiv:2302.06675. [Google Scholar]
  2. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
  3. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. arXiv 2023, arXiv:2211.05778. [Google Scholar]
  4. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), online, 19–25 June 2021. [Google Scholar]
  5. Su, W.; Zhu, X.; Tao, C.; Lu, L.; Li, B.; Huang, G.; Qiao, Y.; Wang, X.; Zhou, J.; Dai, J. Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information. arXiv 2022, arXiv:2211.09807. [Google Scholar]
  6. Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Online, 23–28 August 2020. [Google Scholar]
  7. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. arXiv 2022, arXiv:2211.07636. [Google Scholar]
  8. Zhang, H.; Li, F.; Zou, X.; Liu, S.; Li, C.; Gao, J.; Yang, J.; Zhang, L. A Simple Framework for Open-Vocabulary Segmentation and Detection. arXiv 2023, arXiv:2303.08131. [Google Scholar]
  9. Zafar, A.; Aamir, M.; Mohd Nawi, N.; Arshad, A.; Riaz, S.; Alruban, A.; Dutta, A.K.; Almotairi, S. A Comparison of Pooling Methods for Convolutional Neural Networks. Appl. Sci. 2022, 12, 8643. [Google Scholar] [CrossRef]
  10. Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming Auto-Encoders. International Conference on Artificial Neural Networks; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6791, pp. 44–51. [Google Scholar]
  11. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing Between Capsules. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–7 December 2017. [Google Scholar]
  12. Hinton, G.E.; Sabour, S.; Frosst, N. Matrix capsules with EM routing. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  13. LeCun, Y.; Cortes, C.; Burges, C.J.C. The MNIST Database of Handwritten Digits. 2012. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 9 April 2023).
  14. Fukushima, K. Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements. IEEE Trans. Syst. Sci. Cybern. 1969, 5, 322–333. [Google Scholar]
  15. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
  16. LeCun, Y.; Huang, F.J.; Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Washington, DC, USA, 27 June–2 July 2004; pp. 97–104. [Google Scholar]
  17. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  18. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In Proceedings of the 25th Conference on Neural Information Processing Systems, Granada, Spain, 12–17 December 2011. [Google Scholar]
  19. Stallkamp, J.; Schlipsing, M.; Salmen, J.; Igel, C. The German traffic sign recognition benchmark: A multi-class classification competition. In Proceedings of the International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 1453–1460. [Google Scholar]
  20. Heinsen, F.A. An Algorithm for Routing Vectors in Sequences. arXiv 2022, arXiv:2211.11754. [Google Scholar]
  21. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
  22. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035. [Google Scholar]
  23. Paperspace. Available online: https://www.paperspace.com/ (accessed on 9 April 2023).
  24. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  25. Goyal, P.; Duval, Q.; Seessel, I.; Caron, M.; Misra, I.; Sagun, L.; Joulin, A.; Bojanowski, P. Vision Models Are More Robust and Fair When Pretrained On Uncurated Images without Supervision. arXiv 2022, arXiv:2202.08360. [Google Scholar] [CrossRef]
  26. Taylor, L.; King, A.; Harper, N. Robust and Accelerated Single-Spike Spiking Neural Network Training with Applicability to Challenging Temporal Tasks. arXiv 2022, arXiv:2205.15286. [Google Scholar] [CrossRef]
  27. Phaye, S.S.R.; Sikka, A.; Dhall, A.; Bathula, D. Dense and Diverse Capsule Networks: Making the Capsules Learn Better. arXiv 2018, arXiv:1805.04001. [Google Scholar] [CrossRef]
  28. Remerscheid, N.W.; Ziller, A.; Rueckert, D.; Kaissis, G. SmoothNets: Optimizing CNN Architecture Design for Differentially Private Deep Learning. arXiv 2022, arXiv:2205.04095. [Google Scholar] [CrossRef]
  29. Dupont, E.; Doucet, A.; Teh, Y.W. Augmented Neural ODEs. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  30. Abad, G.; Ersoy, O.; Picek, S.; Urbieta, A. Sneaky Spikes: Uncovering Stealthy Backdoor Attacks in Spiking Neural Networks with Neuromorphic Data. arXiv 2023, arXiv:2302.06279. [Google Scholar] [CrossRef]
Figure 1. Typical structure of a neuron. (green: inputs, blue: operations, yellow: output, purple: neuron).
Figure 2. Typical structure of a capsule. (green: inputs, red: prediction vectors, blue: operations, yellow: output, purple: capsule).
Figure 3. Our squash activation function with different ε values.
Figure 4. Block diagram of the dynamic routing algorithm by Sabour et al. [11]. (green: inputs, yellow: operations, blue: activations, purple: internal tensors).
Figure 5. Block diagram of our proposed routing algorithm. (green: inputs, yellow: operations, blue: activations, purple: internal tensors).
Figure 6. The capsule network architecture used in the research, based on work by Sabour et al. [11]. (green: input, purple: convolutional layer, yellow: primary capsule layer, red: secondary capsule layer, gray: prediction).
Figure 7. Sample data from MNIST dataset.
Figure 8. Sample data from Fashion-MNIST dataset.
Figure 9. Sample data from SmallNORB dataset.
Figure 10. Sample data from CIFAR10 dataset.
Figure 11. Sample data from SVHN dataset.
Figure 12. Sample data from GTSRB dataset.
Figure 13. Classification test accuracy on MNIST dataset.
Figure 14. Classification test accuracy on Fashion-MNIST dataset.
Figure 15. Classification test accuracy on SmallNORB dataset.
Figure 16. Classification test accuracy on CIFAR10 dataset.
Figure 17. Classification test accuracy on SVHN dataset.
Figure 18. Classification test accuracy on GTSRB dataset.
Figure 19. Classification test loss on MNIST dataset.
Figure 20. Classification test loss on Fashion-MNIST dataset.
Figure 21. Classification test loss on SmallNORB dataset.
Figure 22. Classification test loss on CIFAR10 dataset.
Figure 23. Classification test loss on SVHN dataset.
Figure 24. Classification test loss on GTSRB dataset.
Figure 25. Comparison of training time on the same hardware (Nvidia Quadro RTX4000 series).
Figure 26. Confusion matrices for the capsule-based networks.
Table 1. Main properties of the datasets used.

Dataset | Image Size | Channels | Classes | Train Set | Test Set | Background
MNIST [13] | (28, 28) | 1 | 10 | 60,000 | 10,000 | false
F-MNIST [15] | (28, 28) | 1 | 10 | 60,000 | 10,000 | false
SmallNORB [16] | (48, 48) | 1 | 5 | 48,600 | 48,600 | true
CIFAR10 [17] | (32, 32) | 3 | 10 | 50,000 | 10,000 | true
SVHN [18] | (32, 32) | 3 | 10 | 73,257 | 26,032 | true
GTSRB [19] | (32, 32) | 3 | 43 | 26,640 | 12,630 | true
Table 2. Classification test errors on the different datasets, compared with other methods.

Method | MNIST | F-MNIST | S.NORB | CIFAR10 | SVHN | GTSRB
Goyal et al. [25] | 0.58% | - | - | 10% | 13.6% | 9.29%
Taylor et al. [26] | 2.09% | 10.95% | - | - | - | -
Phaye et al. [27] | - | - | 5.57% | - | - | -
Remerscheid et al. [28] | - | - | - | 26.5% | - | -
Dupont et al. [29] | 1.8% | - | - | 39.4% | 16.5% | -
Abad et al. [30] | 0.6% | - | - | 31.7% | - | -
Sabour et al. [11] | 0.45% | 8.35% | 9.15% | 29.36% | 8.05% | 2.67%
Heinsen [20] | 0.7% | 9.25% | 10.18% | 41.18% | 11.36% | 9.79%
Ours | 0.41% | 8.35% | 8.54% | 28.26% | 6.90% | 2.22%
Table 3. Classification test accuracy by class for capsule networks on the MNIST dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
0 | 0.9969 | 0.9980 | 0.9980
1 | 0.9974 | 0.9974 | 0.9965
2 | 0.9932 | 0.9961 | 0.9932
3 | 0.9941 | 0.9950 | 0.9941
4 | 0.9908 | 0.9929 | 0.9847
5 | 0.9933 | 0.9933 | 0.9944
6 | 0.9916 | 0.9948 | 0.9896
7 | 0.9971 | 0.9961 | 0.9893
8 | 0.9959 | 0.9969 | 0.9918
9 | 0.9891 | 0.9921 | 0.9851
Table 4. Classification test accuracy by class for capsule networks on the Fashion-MNIST dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
T-shirt/top | 0.8920 | 0.9060 | 0.8410
Trouser | 0.9780 | 0.9800 | 0.9780
Pullover | 0.8560 | 0.8800 | 0.8710
Dress | 0.9230 | 0.9190 | 0.8900
Coat | 0.8370 | 0.8390 | 0.8400
Sandal | 0.9810 | 0.9810 | 0.9740
Shirt | 0.7320 | 0.6770 | 0.6340
Sneaker | 0.9790 | 0.9780 | 0.9820
Bag | 0.9840 | 0.9820 | 0.9750
Ankle boot | 0.9600 | 0.9620 | 0.9530
Table 5. Classification test accuracy by class for capsule networks on the CIFAR10 dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
Airplane | 0.6280 | 0.7100 | 0.6110
Automobile | 0.7780 | 0.6810 | 0.6260
Bird | 0.4630 | 0.5180 | 0.4230
Cat | 0.4660 | 0.4440 | 0.3670
Deer | 0.6930 | 0.5970 | 0.4340
Dog | 0.5480 | 0.6450 | 0.5110
Frog | 0.8110 | 0.7530 | 0.5380
Horse | 0.6710 | 0.7300 | 0.5830
Ship | 0.8480 | 0.7590 | 0.6260
Truck | 0.7810 | 0.8600 | 0.5920
Table 6. Classification test accuracy by class for capsule networks on the SmallNORB dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
Animal | 0.8496 | 0.8645 | 0.8681
Human | 0.9295 | 0.9513 | 0.7709
Plane | 0.9206 | 0.9210 | 0.8175
Truck | 0.9958 | 0.9932 | 0.9610
Car | 0.7075 | 0.7973 | 0.8192
Table 7. Classification test accuracy by class for capsule networks on the GTSRB dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
Speed limit 20 km/h | 0.8667 | 0.8167 | 0.6500
Speed limit 30 km/h | 0.9889 | 0.9917 | 0.9458
Speed limit 50 km/h | 0.9920 | 0.9893 | 0.9707
Speed limit 60 km/h | 0.9889 | 0.9756 | 0.9333
Speed limit 70 km/h | 0.9758 | 0.9758 | 0.8727
Speed limit 80 km/h | 0.9619 | 0.9810 | 0.8000
Speed limit 100 km/h | 0.8533 | 0.8533 | 0.8067
Speed limit 120 km/h | 0.9133 | 0.8822 | 0.8289
End of speed limit (80 km/h) | 0.9444 | 0.9511 | 0.9000
No passing | 0.9958 | 1.0000 | 0.9458
No passing for vehicles over 3.5 metric t | 0.9924 | 0.9955 | 0.9697
Right-of-way at the next intersection | 0.9619 | 0.9714 | 0.8929
Priority road | 0.9812 | 0.9884 | 0.9087
Yield | 0.9972 | 0.9972 | 0.9861
Stop | 1.0000 | 1.0000 | 0.9630
No vehicles | 1.0000 | 1.0000 | 0.9762
Vehicles over 3.5 metric tons prohibited | 0.9933 | 0.9933 | 0.9533
No entry | 0.9972 | 0.9972 | 0.8806
General caution | 0.9385 | 0.9410 | 0.7231
Dangerous curve to the left | 1.0000 | 1.0000 | 0.4833
Dangerous curve to the right | 0.9889 | 0.9889 | 0.8889
Double curve | 0.7667 | 0.7000 | 0.6778
Bumpy road | 0.9833 | 0.9917 | 0.9083
Slippery road | 0.9467 | 0.9267 | 0.6667
Road narrows on the right | 0.9444 | 0.9444 | 0.5333
Road work | 0.9542 | 0.9500 | 0.9438
Traffic signals | 0.8278 | 0.8167 | 0.7833
Pedestrians | 0.5000 | 0.5667 | 0.5000
Children crossing | 0.9933 | 0.9933 | 0.9200
Bicycles crossing | 1.0000 | 1.0000 | 0.9222
Beware of ice/snow | 0.7667 | 0.7733 | 0.5000
Wild animals crossing | 0.9815 | 0.9741 | 0.9074
End of all speed and passing limits | 1.0000 | 1.0000 | 1.0000
Turn right ahead | 0.9952 | 0.9952 | 0.9476
Turn left ahead | 0.9917 | 0.9917 | 0.9833
Ahead only | 0.9923 | 0.9974 | 0.9538
Go straight or right | 0.9750 | 0.9667 | 0.9417
Go straight or left | 1.0000 | 1.0000 | 0.9000
Keep right | 0.9739 | 0.9870 | 0.9087
Keep left | 1.0000 | 1.0000 | 0.8000
Roundabout mandatory | 0.9667 | 0.9778 | 0.8000
End of no passing | 0.8000 | 0.7667 | 0.5500
End of no passing by vehicles over 3.5 t | 0.9778 | 0.9778 | 0.9778
Table 8. Classification test accuracy by class for capsule networks on the SVHN dataset.

Class | Sabour et al. [11] | Ours | Heinsen [20]
0 | 0.9151 | 0.8928 | 0.9002
1 | 0.9686 | 0.9547 | 0.9280
2 | 0.9393 | 0.9549 | 0.9152
3 | 0.8636 | 0.8550 | 0.7696
4 | 0.9080 | 0.9164 | 0.9045
5 | 0.8624 | 0.8968 | 0.8335
6 | 0.8402 | 0.8720 | 0.8255
7 | 0.8579 | 0.8742 | 0.8757
8 | 0.7861 | 0.8193 | 0.7657
9 | 0.8245 | 0.8621 | 0.8301
Table 9. Best efficiency ratio per class for the presented capsule networks.

Method | MNIST | F-MNIST | SmallNORB | CIFAR10 | SVHN | GTSRB
Sabour et al. [11] | 15% | 35% | 50% | 20% | 30% | 51.94%
Heinsen [20] | 15% | 20% | 0% | 40% | 10% | 0.77%
Ours | 70% | 45% | 50% | 40% | 60% | 47.29%
Table 10. Recall scores for the capsule networks.

Method | MNIST | F-MNIST | SmallNORB | CIFAR10 | SVHN | GTSRB
Sabour et al. [11] | 0.9939 | 0.9159 | 0.9056 | 0.6790 | 0.8958 | 0.9513
Heinsen [20] | 0.9916 | 0.8937 | 0.8419 | 0.5311 | 0.8532 | 0.8304
Ours | 0.9948 | 0.9124 | 0.9138 | 0.6925 | 0.9012 | 0.9519
Table 11. Dice scores for the capsule networks.

Method | MNIST | F-MNIST | SmallNORB | CIFAR10 | SVHN | GTSRB
Sabour et al. [11] | 0.9940 | 0.9153 | 0.9052 | 0.6748 | 0.9012 | 0.9576
Heinsen [20] | 0.9916 | 0.8945 | 0.8419 | 0.5350 | 0.8546 | 0.8396
Ours | 0.9949 | 0.9118 | 0.9131 | 0.6882 | 0.9046 | 0.9576
Table 12. F1-scores for the capsule networks.

Method | MNIST | F-MNIST | SmallNORB | CIFAR10 | SVHN | GTSRB
Sabour et al. [11] | 0.9940 | 0.9159 | 0.9056 | 0.6790 | 0.9084 | 0.9711
Heinsen [20] | 0.9916 | 0.8937 | 0.8419 | 0.5311 | 0.8667 | 0.8920
Ours | 0.9949 | 0.9124 | 0.9138 | 0.6925 | 0.9122 | 0.9727
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
