Depth estimation using Convolutional Neural Network with transfer
learning
Abstract
Predicting depth is crucial for understanding the 3-D geometry of a scene. For stereo images, local correspondence
suffices for estimation, whereas estimating depth from a single image is less straightforward and requires the
integration of both global and local information. In this paper, the authors address depth estimation using a
Convolutional Neural Network (CNN) with transfer learning. The authors propose a relatively shallow CNN that uses
transfer learning to extract low-level features. A fully convolutional architecture is employed, in which low-level
image features are first extracted by pre-trained ResNet-50 and VGG19 networks. Transfer learning is performed by
using the initial layers of ResNet-50 and VGG19, connected in parallel, without downsampling anywhere in the
proposed architecture, so that the size of the depth map can be recovered. Because it is not a very deep architecture,
the network can run in real time on images given sufficient computing power. The model is compared with a pure CNN
network to illustrate the effectiveness of transfer learning. It is demonstrated that the proposed CNN with transfer
learning yields better results than a pure CNN and converges faster. The influence of different loss functions during
training is also shown. The results are presented through qualitative visualization and quantitative metrics.
Keywords
Transfer learning, CNN, ResNet-50, VGG19.
1. Introduction
To achieve a truly humanoid robot, 3-D information about the environment is one of the critical capabilities a
robot must have. Historically, depth information has predominantly been used in tasks that localize a robot or a
self-driving car relative to its environment [8], using techniques such as visual odometry (VO) or simultaneous
localization and mapping (SLAM), as well as in 3-D modelling [1][2]. Estimating depth is an essential part of
understanding the relative geometric relations within an environment. In turn, these relations provide a richer
representation of objects and their surroundings, which often leads to improvements in existing tasks such as
recognition [6], physics and robotics [7][5]; for example, depth helps in determining the relative pose of the
camera and can be used in reasoning about occlusions.
Estimating depth from a monocular image is a known ill-posed problem, because a single RGB image may correspond
to an infinite number of real-world scenes. Several classic algorithms have previously been applied to this
problem, including Structure from Motion, which leverages camera motion to estimate camera poses across different
temporal intervals and, in turn, estimates depth via triangulation from pairs of successive views.
As drones become smaller and smaller, the monocular approach to depth estimation becomes increasingly interesting,
because the stereo case degenerates to the monocular case when the baseline of a stereo camera is small compared
to the distance of the objects or landmarks from the camera.
Recently, CNNs have been employed to learn depth from an RGB image. Motivated by recent developments in AI, we
implemented an end-to-end trainable, relatively shallow CNN architecture combined with transfer learning to learn
a mapping between colour-image pixel intensities and the corresponding depth map.
In the context of depth estimation, the absolute scale of the surrounding environment corresponds to the actual
size of the objects in the physical world. A depth map that correctly predicts the absolute scale of the scene
contains depth values close to the actual depth values. Conversely, a depth map that reflects the relative depth
of the scene preserves the same relations between pixels (higher/lower depth) as the true depth map. In this work,
we predict the relative scale of the scene, since it is more intuitive to predict from a single image and
therefore easier to train.
2. Literature Survey
2.1 Eigen et al. [3] were the first to propose a CNN-based solution for depth estimation. They used a two-scale
CNN consisting of a coarse-scale network and a fine-scale network. The coarse-scale network is a CNN that
identifies the geometry of the scene in a global context; its output is a low-resolution depth image. This output,
together with the original input image, is then fed to the fine-scale network, a CNN of three convolutional layers
used to refine the coarse prediction. Additionally, Eigen et al. [3] considered the issue of scale invariance:
they used a scale-invariant error function for performance evaluation and a scale-invariant loss function
$L(y, y^*)$ for training.
$$L(y, y^*) = \frac{1}{n} \sum_{i} \left( \log y_i - \log y_i^* + \frac{1}{n} \sum_{j} \left( \log y_j^* - \log y_j \right) \right)^2 \qquad (2.1)$$
where $y_i$ denotes the predicted depth value for pixel $i$ and $y_i^*$ is the true depth value for that pixel.
Scale invariance is achieved through the inner sum, which is the mean log difference between the predicted and
ground-truth depth values.
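As an illustration, a minimal NumPy sketch of equation (2.1) could look as follows; it assumes flattened arrays of strictly positive depth values and is not part of the original work:

import numpy as np

def scale_invariant_loss(y_pred, y_true):
    """Scale-invariant loss of Eigen et al., equation (2.1).

    y_pred, y_true: flattened arrays of positive depth values (linear scale).
    """
    d = np.log(y_pred) - np.log(y_true)       # per-pixel log difference
    n = d.size
    # Subtracting the mean log difference makes the loss invariant to a
    # global scaling of the prediction.
    return np.sum((d - np.mean(d)) ** 2) / n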
2.2 Xiaobai Ma, Zhenglin Geng and Zhi Bie [4] compared three different CNN architectures using three evaluation
metrics and three different loss functions. The first architecture is prone to overfitting, since the number of
parameters in the fully connected (FC) layer is 73,293,824, which is enormous compared to the convolutional
layers, which have only 27,232. The second architecture uses only convolutional layers to overcome the overfitting
problem, but it is not deep enough to generalize depth and can give results equal to, or only slightly better
than, the first architecture if trained for sufficient time. The third architecture employs transfer learning and
gives the best results of the three.
3. Network Architecture
We experimented with two CNN architectures: model 1, a pure/basic CNN, and model 2, a CNN with transfer learning.
The better of the two is chosen after qualitative and quantitative analysis. Each architecture is illustrated in
the following subsections, and a comparison of their performance is given in the experiments section.
Model 1 is a pure CNN model without transfer learning and is shown in figure 1. Model 2 is a deeper CNN model with
transfer learning, shown in figure 2. For the transfer-learning architecture, i.e., Model 2, we chose two
pre-trained architectures, ResNet50 and VGG19. We use each pre-trained network only up to the point where the
spatial dimension of the feature map is greater than or equal to that of the output depth map; thus ten layers of
VGG19 and 12 layers of ResNet50 are used in the proposed transfer-learning model (Model 2). After making the
output dimensions of both pre-trained models the same, we concatenate them and then add further convolution and
max-pooling layers to learn high-level features. The output of both models is flattened to a 4,070-element 1-D
array for training purposes. Note that information is lost if the image is first reduced in size and then enlarged
again, i.e., if a picture is decompressed after being compressed. Therefore, we truncate the pre-trained models at
the layers where the resolution of the feature map is still greater than or equal to 55x74, to avoid losing
information by first compressing and then decompressing the image.
Furthermore, large models such as the full ResNet50 and VGG19 require a large amount of memory to fit on a
computer. The two reasons for not using the full ResNet50 and VGG19 architectures are therefore to avoid loss of
information and to manage memory; training time is also a crucial factor.
3.1 Pure/Basic CNN
As fully connected networks suffer from overfitting, as shown by Xiaobai Ma, Zhenglin Geng and Zhi Bie [4] for
their first architecture, we started from a pure CNN. The pure CNN architecture they used does not consider the
global and local relations of the scene. Our proposed pure/basic CNN architecture (model 1) is shown in figure 1
and considers both global and local scene information. The network consists of 8 convolution layers before the
output layer, three of which consider the global relations of the scene, and we expect this network to behave
better than the second architecture described in section 2.2. Each convolution layer is followed by batch
normalization, to facilitate the training process, and then by a ReLU activation layer.
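As an illustration of this repeated pattern, a minimal Keras sketch of one convolution, batch-normalization and ReLU block might look as follows; the filter count and kernel size are left as parameters, since the text specifies only the overall layout (see figure 1):

from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size):
    """One basic building block of Model 1: convolution, then batch norm, then ReLU."""
    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)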
3.2 CNN with transfer learning
Training a neural network architecture from scratch is a tiresome process that requires a lot of data and
processing power, and it becomes even more cumbersome when training on images, where the number of features is
large compared to simpler tasks. Therefore, we employed transfer learning in our second method, i.e., model 2.
The architecture of the CNN with transfer learning (model 2) is shown in figure 2. The network initially consists
of two parallel pre-trained architectures, ResNet50 and VGG19, which are concatenated after their dimensions are
made equal. After this point, the network has six more convolutional layers before the output layer, which is a
flattened 1-D array.
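A hedged Keras sketch of this layout is given below: two ImageNet-pretrained backbones truncated early so that the spatial resolution stays at or above 55x74, brought to a common size, concatenated and followed by further convolutions. The truncation layer names, the resizing step and the filter counts are illustrative assumptions, not the exact configuration of figure 2:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50, VGG19

inputs = layers.Input(shape=(228, 304, 3))

resnet = ResNet50(include_top=False, weights="imagenet", input_tensor=inputs)
vgg = VGG19(include_top=False, weights="imagenet", input_tensor=inputs)

# Hypothetical cut points: take an early intermediate activation from each backbone
# while its feature map is still larger than the 55x74 output depth map.
res_feat = resnet.get_layer("conv2_block1_out").output   # assumed cut point
vgg_feat = vgg.get_layer("block3_conv4").output          # assumed cut point

# Bring both feature maps to the same spatial size before concatenation.
res_feat = layers.Resizing(55, 74)(res_feat)
vgg_feat = layers.Resizing(55, 74)(vgg_feat)
x = layers.Concatenate()([res_feat, vgg_feat])

for filters in (128, 64, 32):                            # assumed filter counts
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(1, 3, padding="same")(x)               # one-channel depth map
outputs = layers.Flatten()(x)                            # 55 * 74 = 4070 values
model_2 = models.Model(inputs, outputs)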
Figure 1. Block diagram of Model 1
Figure 2. Block diagram of Model 2
4. Dataset Used
For this research, the raw Kinect depth data and the raw RGB camera output from the NYU Depth v2 dataset [8] are
used. The data contains images at 640x480 resolution, but for training, data augmentation was performed and the
RGB images were resized to 304x228 and the raw Kinect depth data to 74x55 to form our training set, since it is
easier to train the model with fewer features. We used Python for the data augmentation. We trained our model on
32 images, because of the training time required as the training data grows and because memory restrictions limit
how much data can be fitted at a time.
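As a small illustration of this resizing step, the following sketch assumes the RGB frames and Kinect depth maps are already loaded as NumPy arrays; OpenCV is used here purely for illustration and is not mandated by the paper:

import numpy as np
import cv2  # OpenCV, used here only for resizing

def preprocess_pair(rgb, depth):
    """rgb: 480x640x3 uint8 image; depth: 480x640 float array of depths."""
    rgb_small = cv2.resize(rgb, (304, 228), interpolation=cv2.INTER_AREA)
    depth_small = cv2.resize(depth, (74, 55), interpolation=cv2.INTER_NEAREST)
    rgb_small = rgb_small.astype(np.float32) / 255.0   # scale pixels to [0, 1]
    return rgb_small, depth_small.reshape(-1)          # flatten depth to 4070 values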
KITTI is another popular dataset used by researchers in similar projects. Since our computational resources are
limited, the feasibility of the study is assessed only on the training results on the NYU dataset.
5. Loss functions for training
We used two loss functions and compared the results: the root mean squared error (RMSE) and the root mean squared
logarithmic error (RMSLE), both for comparison and for training the proposed models.
A depth value $y$ in logarithmic space (log space) is equal to $\log(y_{lin})$, where $y_{lin}$ is the true depth
in linear space. For depth values in log space, the Euclidean loss does not minimize the difference between the
estimated depth and the ground truth in linear space; instead it minimizes the difference between
$y = \log y_{lin}$ and $\log y_{lin}^*$, which can be rewritten as $\log(y_{lin} / y_{lin}^*)$. This means that a
loss in log space optimizes the ratio between $y$ and $y^*$ and achieves its minimum when the ratio is 1. Later
experiments show that the logarithmic approach achieves a better prediction of the relative depth structure.
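For concreteness, the two losses can be written as simple TensorFlow functions. This is a minimal sketch assuming strictly positive depth values; the small epsilon guard is an added safety detail and not part of the paper:

import tensorflow as tf

def rmse(y_true, y_pred):
    """Root mean squared error in linear depth space."""
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

def rmsle(y_true, y_pred):
    """Root mean squared error of log depths, i.e. a loss on the depth ratio."""
    eps = 1e-6  # guard against log(0)
    log_diff = tf.math.log(y_pred + eps) - tf.math.log(y_true + eps)
    return tf.sqrt(tf.reduce_mean(tf.square(log_diff)))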
6. Training and testing
We compared both architectures, as mentioned earlier, fixing the same number of input parameters and the same cost
function (RMSE), and use the one that gives the best results and also generalizes faster and better than the other.
After choosing the architecture, we trained the model with two different loss functions and compared the results,
which can be seen in the subsequent sections; we then chose the better of the two. The two loss functions are
stated in section 5.
We trained our network on an Amazon cloud GPU (Tesla K80), since the training time is high: the proposed model is
a deep neural network architecture, the number of input features is in the hundreds of thousands rather than
hundreds or thousands, and the output is not a simple classification task or a single-output regression task, but
has a dimensionality in the thousands, unlike the usual case in other deep-learning regression problems.
7. Results and discussion
In this section, we provide qualitative and quantitative analysis of our evaluations. We also
show the performance of different loss functions for the task. All the experiments are
implemented using the TensorFlow framework.
7.1 Results for choosing the best CNN architecture out of two
The better of the two architectures proposed in section 3 is chosen by training on 32 training samples with RMSE
as the loss function for 250 iterations. The RGB image is shown in figure 3, the true depth map in figure 4, and
the predicted results in figures 5 and 6.
Model 1 corresponds to the CNN model without transfer learning. Model 2 corresponds to
the CNN model with transfer learning.
Figure 3. True RGB image
Figure 4. True depth map
Figure 5. Predicted depth map of model 1
Figure 6. Predicted depth map of model 2
From figures 5 and 6, it is clear that the CNN model with transfer learning, i.e., Model 2, gives better results
than the one without transfer learning, i.e., Model 1, after training both of them for 250 iterations. This is
expected, because we are transferring the knowledge of two different pre-trained models to our proposed network.
Figure 7. Model 1 and Model 2 training error comparison.
It is evident from figure 7 that Model 2 converges, or generalizes, faster than Model 1, even though Model 2 is
deeper and has more parameters to train and is thus harder to train: Model 2 has 3,725,717 parameters whereas
Model 1 has 711,682. It can be concluded that Model 2 is superior to Model 1, and thus using transfer learning
helps considerably and was the right decision.
7.2 Results for choosing the best loss function out of two
Figure 8 shows the true depth map for reference. From figures 9, 10 and 17, it can be concluded that the RMSLE
loss function converges faster than the RMSE loss function for the same model. This is expected, because RMSLE
takes the ratio of the predicted and true depth to calculate the loss, and it is a property of the log function
that its value changes quickly when the input is around 1 and is 0 when the input is exactly 1, whereas RMSE is
merely the linear difference between the predicted and true depth values. Figures 11 and 12 show, for reference,
the predicted output of the model after 750 iterations for both loss functions.
Therefore, in the case of RMSLE, the rate of change of the error, and the error itself, is very high at the start;
as the number of iterations increases, the rate of change of the error decreases because the ratio approaches 1
and becomes very small, after which the error becomes stagnant, as can be seen in figure 16. In the case of RMSE,
by contrast, the rate of change of the error is slow at the start compared to RMSLE and becomes stagnant after
enough iterations, as can be seen in figure 15.
As the number of iterations increases to around 1000 or more, both loss functions become almost equal and it is
hard to differentiate between them; this can be seen in figures 13, 14 and 17. The loss reaches a certain
threshold at around 1300 iterations and becomes stagnant thereafter. After the loss becomes almost constant, the
network starts fine-tuning the weights.
Figure 8. True depth image
Figure 9. Predicted depth image with RMSE loss
Figure 10. Predicted depth image with RMSLE loss
Figure 11. Predicted depth image with RMSE loss
Figure 12. Predicted depth image with RMSLE loss
(1000 iterations)
Figure 13. Predicted depth image with RMSE loss
Figure 14. Predicted depth image with RMSLE loss
Figure 15. RMSE loss graph
Figure 16. RMSLE loss graph
Figure 17. RMSE and RMSLE loss functions plotted together.
For the chosen model, i.e., Model 2, we used different learning rates and different batch sizes for 250, 500 and
1000 epochs respectively, as shown in table 1. At the start, there is scope for a higher learning rate, since the
error is large at the beginning, as can be concluded from the previous discussion. As the number of epochs
increases, the rate of change of the error becomes small, and therefore the learning rate should be smaller; in
other words, it should decay with the number of epochs. We started with a small batch and increased the batch size
in steps after a defined number of epochs, as shown in table 1, because the error is higher at the start, so the
weights should be updated more frequently to decrease the error faster.
No. of Epochs Learning rate (LR) Batch Size
0-250 0.001 4
250-750 0.001 16
750-1750 0.0001 32
Table 1. Final CNN Model Training Parameters (Model 2)
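A hedged sketch of how this staged schedule could be driven in Keras follows; it assumes the model_2 and rmsle definitions sketched earlier, training arrays x_train and y_train, and an Adam optimizer, which is an assumption, since the paper specifies only the learning rates, batch sizes and epoch ranges of table 1:

import tensorflow as tf

# (epochs in stage, learning rate, batch size), following Table 1.
stages = [
    (250, 1e-3, 4),    # epochs 0-250
    (500, 1e-3, 16),   # epochs 250-750
    (1000, 1e-4, 32),  # epochs 750-1750
]

for n_epochs, lr, batch_size in stages:
    # Re-compiling per stage is a simple way to change the learning rate;
    # note that it also resets the optimizer state.
    model_2.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss=rmsle)
    model_2.fit(x_train, y_train, epochs=n_epochs, batch_size=batch_size)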
We can compare the performance of the different loss functions using Model 2 from table 2. Model 2 trained with
RMSE yields the smallest Abs. Rel. Difference, while Model 2 trained with RMSLE yields the smallest RMSE.
Loss function    Abs. Rel. Difference    RMSE
RMSE             0.1594                  0.1158
RMSLE            0.1598                  0.1156
Table 2. Metric evaluation of the different loss functions (Model 2)
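For reference, the two evaluation metrics of table 2 can be computed with a few lines of NumPy; this is an illustrative sketch assuming flattened arrays of predicted and ground-truth depths:

import numpy as np

def abs_rel_difference(y_pred, y_true):
    """Mean absolute relative difference between predicted and true depths."""
    return np.mean(np.abs(y_pred - y_true) / y_true)

def rmse_metric(y_pred, y_true):
    """Root mean squared error between predicted and true depths."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))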
8. Conclusion
This research uses convolutional neural networks to address the problem of depth estimation from a single image.
Depth estimation from a single camera is ambiguous and hard to train for, even with advanced CNNs. The
transfer-learning approach generalized the results faster. The proposed model explicitly utilizes transfer
learning to predict the depth map. The results of the experiments show that transfer learning increases the
performance of the network; this is evident in the qualitative and quantitative assessment of the output depth
maps shown in section 7. Furthermore, the results in section 7 show that using the RMSLE loss function improves
the learning capability of the proposed model and is more suitable in our case than RMSE. We therefore propose the
transfer-learning model, i.e., model 2, with the RMSLE loss function for the task of depth estimation from a
single image.
Conflict of Interest: The authors declare no conflict of interest and have no financial or non-financial interests
to disclose.
Data availability: The dataset can be accessed at the URL below (data size is around 10 GB).
https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html.
References
1. A. Saxena, M. Sun, and A. Y. Ng. Make3D: Learning 3-D Scene Structure from a Single Still Image. TPAMI, 2008.
2. D. Hoiem, A. A. Efros, and M. Hebert. Automatic Photo Pop-Up. In ACM SIGGRAPH, pages 577–584, 2005.
3. D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network.
CoRR, vol. abs/1406.2283, 2014.
4. X. Ma, Z. Geng, and Z. Bie. CS231n course project report, Stanford University, 2017.
http://cs231n.stanford.edu/reports/2017/pdfs/203.pdf
5. J. Michels, A. Saxena, and A. Y. Ng. High-Speed Obstacle Avoidance Using Monocular Vision and Reinforcement
Learning. In ICML, pages 593–600, 2005.
6. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images.
In ECCV, 2012.
7. R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun. Learning
Long-Range Vision for Autonomous Off-Road Driving. Journal of Field Robotics, 26(2):120–144, 2009.
8. J. Shotton, R. Girshick, A. Fitzgibbon, et al. Decision Forests for Computer Vision and Medical Image Analysis,
chapter Efficient Human Pose Estimation from Single Depth Images, pages 175–192. Springer London, 2013.
ISBN 978-1-4471-4929-3. doi:10.1007/978-1-4471-4929-3_13.