Module 3: Deep Convolutional Models - Transfer Learning
Deep Classification Modeling: Implementation of VGG, ResNet, AlexNet, and InceptionNet v3
Deep Classification Modeling
VGG-16:
VGG-16 is a popular convolutional neural network (CNN) architecture that was developed by
the Visual Geometry Group (VGG) at the University of Oxford. It is widely used in image
classification and other computer vision tasks. Here's a concise overview for students:
Architecture Overview
Layers: VGG-16 is a deep network with 16 layers, consisting of 13 convolutional
layers and 3 fully connected layers.
Conv Layers: The convolutional layers use small 3x3 filters with a stride of 1, which
allows the network to capture fine details and complex patterns in images.
Pooling: Max-pooling layers are used after every 2 or 3 convolutional layers to
reduce the spatial dimensions of the feature maps, which helps in reducing
computation and controlling overfitting.
Fully Connected Layers: After the convolutional and pooling layers, VGG-16 has
three fully connected layers. The last of these layers is the output layer, which
typically has a softmax activation function for classification tasks.
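To see this layer stack concretely, the following minimal sketch (assuming TensorFlow/Keras, which the later code examples in this module also use) loads the pre-trained VGG-16 and prints its layer-by-layer summary:

from tensorflow.keras.applications import VGG16

# Load VGG-16 with ImageNet weights; 224x224x3 is the fixed input size when include_top=True
model = VGG16(weights='imagenet', include_top=True, input_shape=(224, 224, 3))
model.summary()  # lists the 13 conv layers (3x3 filters), the max-pooling layers, and 3 dense layers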
Key Characteristics
Depth: With 16 weight layers, VGG-16 is significantly deeper than earlier CNN
models like LeNet, allowing it to capture more complex features.
Uniform Design: The use of the same 3x3 convolution filters throughout the network
provides a uniform architecture, which makes it easier to understand and implement.
Parameter Count: VGG-16 has about 138 million parameters, making it a large
model that requires significant computational resources for training and inference.
Transfer Learning: Due to its pre-training on the large ImageNet dataset, VGG-16 is
commonly used in transfer learning, where the pre-trained model is fine-tuned for
specific tasks like medical image classification, object detection, etc.
Advantages
High Accuracy: VGG-16 was one of the top-performing models in the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) 2014.
Simplicity: Despite its depth, the architecture is straightforward and easy to
implement, making it a good starting point for learning about deep CNNs.
Disadvantages
Large Size: The model is computationally expensive and requires a lot of memory
due to its high number of parameters.
Training Time: Training VGG-16 from scratch can be time-consuming, especially
without powerful GPUs.
Applications
Image Classification: VGG-16 is widely used for classifying images into various
categories.
Feature Extraction: The convolutional layers of VGG-16 are often used as feature
extractors in other computer vision tasks.
Transfer Learning: Fine-tuning VGG-16 for specific tasks is common practice in
domains like medical imaging, where large labeled datasets are scarce.
Practical Considerations
Pre-trained Models: Use pre-trained weights for faster training and better
performance when applying VGG-16 to new tasks.
Model Optimization: Techniques like pruning, quantization, and model distillation
can be used to reduce the model's size and inference time without significantly
sacrificing accuracy.
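As one concrete illustration of these optimization techniques, the sketch below applies TensorFlow Lite post-training (dynamic-range) quantization to a Keras model; the choice of VGG-16 and the output file name are assumptions for illustration.

import tensorflow as tf

# Load a Keras model to optimize (VGG-16 with ImageNet weights is just an example)
model = tf.keras.applications.VGG16(weights='imagenet')

# Post-training dynamic-range quantization via the TFLite converter
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Save the smaller, quantized model for deployment
with open('vgg16_quantized.tflite', 'wb') as f:
    f.write(tflite_model)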
VGG-16 is a foundational model in deep learning, offering a balance of simplicity and
performance. It is an excellent model for students to study and experiment with, especially in
the context of transfer learning and feature extraction.
Using the VGG16 model as a feature extractor (up to its last convolutional layer).
Adding two dense layers for classification on top of the extracted features.
The model is likely intended for a classification task with 10 classes, as indicated by
the 10 units in the final layer and the use of softmax activation.
In Python list slicing, base_model.layers[:9] selects the first 9 layers from the bottom (near the input), while base_model.layers[-9:] selects the last 9 layers from the top (near the output).
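A minimal sketch of the setup described in the points above, assuming TensorFlow/Keras and a hypothetical 10-class dataset (the 256-unit size of the first dense layer is an assumption for illustration):

import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Use VGG16 up to its last convolutional block as a frozen feature extractor
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the pre-trained convolutional layers

model = models.Sequential([
    base_model,
    layers.Flatten(),                       # flatten the extracted feature maps
    layers.Dense(256, activation='relu'),   # first dense layer (size is an assumption)
    layers.Dense(10, activation='softmax')  # 10 output classes with softmax, as described above
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()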
Bottom Layers (Early Convolutional Layers)
Lower-level features: Extract more basic and general features, such as edges, corners, and
textures.
Generic features: Can be useful for a wider range of tasks, including style transfer, image
generation, and feature visualization.
Feature extraction: Can be used for feature extraction, where the extracted features are used
as input for other models or tasks.
Top Layers (Deeper Convolutional Layers)
Higher-level features: Extract more abstract and complex features, such as object parts or
shapes.
Task-specific: Often more suitable for tasks that require fine-grained object recognition or
classification.
Transfer learning: Can be used effectively for transfer learning, where the pre-trained
weights are fine-tuned for a new task.
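A minimal sketch of this fine-tuning idea, again assuming a Keras VGG16 base model: the bottom (generic) layers are frozen and only the top few convolutional layers remain trainable. Fine-tuning the last 4 layers is an assumption for illustration, not a fixed rule.

from tensorflow.keras.applications import VGG16

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the bottom layers, which hold generic edge/texture detectors
for layer in base_model.layers[:-4]:
    layer.trainable = False

# Leave the top 4 layers trainable so their task-specific features can be fine-tuned
for layer in base_model.layers[-4:]:
    layer.trainable = True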
AlexNet:
The AlexNet model was proposed in 2012 in the research paper "ImageNet Classification with Deep Convolutional Neural Networks" by Alex Krizhevsky and his colleagues. AlexNet was a groundbreaking model that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a significant margin.
Architecture Overview:
Layers: AlexNet consists of 8 learned layers: 5 convolutional layers and 3 fully
connected layers.
Conv Layers: The first two convolutional layers are followed by max-pooling layers,
and the third, fourth, and fifth convolutional layers are connected directly without
pooling.
Fully Connected Layers: The final three layers are fully connected, with the last one
being a 1000-way softmax layer.
Activation Function: Uses ReLU (Rectified Linear Unit) after each convolutional
and fully connected layer, speeding up training and mitigating the vanishing gradient
problem.
Key Features:
Large Filters and Strides: The first convolutional layer uses a large filter size of
11x11 with a stride of 4, which was unusual compared to modern architectures that
prefer smaller filters. An 11x11 filter has a large receptive field, meaning it covers a
larger portion of the input image. This allows the first convolutional layer to
capture more spatial information and extract larger patterns (like edges,
textures, and simple shapes) right from the beginning.
With a stride of 4, the large filter also significantly reduces the spatial dimensions of
the image early on in the network. For example, a 227x227 input passed through an 11x11
filter with stride 4 and no padding produces a feature map of size (227 - 11)/4 + 1 = 55,
i.e. 55x55 per filter. This downsampling reduces the computational burden in later layers
by shrinking the size of the feature maps.
Dropout: AlexNet introduced dropout in the fully connected layers to combat
overfitting, which was a novel approach at the time.
Data Augmentation: Used techniques like random cropping, horizontal flipping, and
RGB color shifting to artificially increase the size of the training dataset (a short
Keras sketch of similar augmentations follows this list).
Local Response Normalization (LRN): Applied after ReLU activations in the earlier
layers, LRN was used to enhance generalization, although it has become less common
in newer architectures. (Note: BN is like keeping everything balanced and making the
whole network learn better. LRN is like boosting the most important neurons and
reducing the noise from nearby neurons.)
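A minimal sketch of the data augmentation ideas listed above, using Keras preprocessing layers; the crop size, flip mode, and contrast factor are illustrative assumptions rather than AlexNet's exact pipeline (which also shifted RGB values along principal components):

import tensorflow as tf
from tensorflow.keras import layers

# Augmentation pipeline echoing AlexNet's tricks: random crop, horizontal flip, colour jitter
data_augmentation = tf.keras.Sequential([
    layers.RandomCrop(227, 227),      # random cropping from a larger input image
    layers.RandomFlip('horizontal'),  # random horizontal flipping
    layers.RandomContrast(0.2),       # rough stand-in for RGB colour shifting
])

# Apply to a dummy batch of images (values are placeholders)
images = tf.random.uniform((8, 256, 256, 3))
augmented = data_augmentation(images, training=True)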
4. Training:
Optimization: AlexNet was trained using stochastic gradient descent (SGD) with
momentum. SGD updates the weights of the network in small steps by calculating the
gradient of the loss function with respect to the network's weights. It also used a
learning rate schedule to gradually decrease the learning rate during training (a
minimal Keras sketch of this setup follows this section).
Hardware: Training was split across two GPUs due to the limited memory available
on GPUs at the time. This parallelism was crucial for handling the large
computational load.
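A minimal Keras sketch of this optimization setup; the initial learning rate, momentum value, and decay schedule below are typical choices for illustration, not the paper's exact settings:

import tensorflow as tf

# Learning rate schedule that gradually decays the learning rate during training
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10000,
    decay_rate=0.9)

# Stochastic gradient descent with momentum
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# The optimizer would then be passed to model.compile(), e.g.:
# model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])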
5. Applications:
Image Classification: The model’s strong performance on ImageNet popularized the
use of CNNs (Convolutional Neural Networks) for image classification tasks.
Feature Extraction: AlexNet has been used in various computer vision tasks beyond
classification, including object detection and image segmentation.
Transfer Learning: Though surpassed by more recent models, AlexNet was once a
go-to model for transfer learning tasks.
6. Advantages:
Revolutionary Performance: AlexNet's performance on ImageNet demonstrated the
power of deep learning and CNNs, leading to widespread adoption in the computer
vision community.
Introduction of ReLU and Dropout: These innovations helped improve training
efficiency and model generalization, influencing many subsequent neural network
architectures.
7. Disadvantages:
Computationally Intensive: At the time of its release, AlexNet required significant
computational resources, including multiple GPUs for training.
Large Filters: The use of large filters in the first layer is less efficient compared to
more modern architectures, which prefer smaller filters with more layers.
import tensorflow as tf
from tensorflow.keras import layers, models

# Define the AlexNet architecture
def alexnet(input_shape):
    model = models.Sequential([
        # 1st Convolutional Layer
        layers.Conv2D(filters=96, kernel_size=(11, 11), strides=(4, 4), activation='relu',
                      input_shape=input_shape),
        layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),

        # 2nd Convolutional Layer
        layers.Conv2D(filters=256, kernel_size=(5, 5), strides=(1, 1), padding="same",
                      activation='relu'),
        layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),

        # 3rd Convolutional Layer
        layers.Conv2D(filters=384, kernel_size=(3, 3), strides=(1, 1), padding="same",
                      activation='relu'),

        # 4th Convolutional Layer
        layers.Conv2D(filters=384, kernel_size=(3, 3), strides=(1, 1), padding="same",
                      activation='relu'),

        # 5th Convolutional Layer
        layers.Conv2D(filters=256, kernel_size=(3, 3), strides=(1, 1), padding="same",
                      activation='relu'),
        layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2)),

        # Flatten the output
        layers.Flatten(),

        # 1st Fully Connected Layer
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),  # Dropout for regularization

        # 2nd Fully Connected Layer
        layers.Dense(4096, activation='relu'),
        layers.Dropout(0.5),  # Dropout for regularization

        # Output Layer for 10 classes
        layers.Dense(10, activation='softmax')  # Change 10 to your number of classes
    ])
    return model

# Define input shape (for example 227x227x3, typical for AlexNet)
input_shape = (227, 227, 3)  # Change this to match your input data

# Build the model
model = alexnet(input_shape)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Display model architecture
model.summary()
ResNet:
ResNet won the ImageNet competition in 2015 and revolutionized the training of very deep
neural networks by addressing the vanishing gradient problem, which had previously made it
difficult to train deep models.
In traditional deep networks, as the number of layers increases, gradients (error signals) during
backpropagation can shrink exponentially. This causes earlier layers to receive gradients that
are too small to update weights effectively, slowing down or even halting learning (the
vanishing gradient problem).
ResNet overcomes this issue by introducing skip connections (or residual connections) that
allow the network to bypass certain layers, effectively creating shortcut paths for the
gradients to flow through during backpropagation.
(Figure: a residual block, with the skip connection depicted by the red line bypassing the stacked layers.)
1. Residual Learning
ResNet introduces the concept of residual learning.
Instead of learning the desired underlying mapping directly, the network learns the residual (or
the difference) between the input and the output.
Let's say we want to learn a function that outputs y = x + 5. Instead of learning y directly,
the network learns the residual:
Residual (difference) = y - x = 5
Now the network only has to figure out that the residual is 5, which is much simpler than
learning the entire mapping from x to y.
Learning the difference (residual) is usually simpler than learning the full transformation,
especially in deep layers where complex patterns can be harder to detect. By learning the
residual, the network can train faster and more effectively, which is why ResNet can have so
many layers and still perform well.
2. Residual Blocks
The core building block of ResNet is the residual block. A residual block typically consists of
a few convolutional layers with batch normalization and ReLU activation functions. The key
is the shortcut connection (or skip connection) that bypasses one or more layers and adds the
input of the block directly to the output.
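A minimal sketch of a residual block in Keras (an identity shortcut; the filter count and kernel sizes are illustrative assumptions rather than a specific ResNet variant):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Identity shortcut (assumes x already has `filters` channels)
    shortcut = x

    # Two convolutional layers with batch normalization and ReLU
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)

    # Skip connection: add the block's input back, then apply ReLU
    y = layers.Add()([y, shortcut])
    return layers.ReLU()(y)

# Example usage on a dummy feature map
inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, filters=64)
block = tf.keras.Model(inputs, outputs)
block.summary()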
3. Advantages of Residual Blocks
Easier Training: Residual blocks make it easier to train very deep networks by mitigating the
vanishing gradient problem.
Performance: ResNet architectures, such as ResNet-50, ResNet-101, and ResNet-152, achieve
very high performance on a variety of tasks, often outperforming shallower networks.
4. Architecture Variants
ResNet comes in various depths, commonly referred to as ResNet-18, ResNet-34, ResNet-50,
ResNet-101, and ResNet-152. The number indicates the number of layers.
ResNet-50: One of the most widely used variants, with 50 layers, making it a good balance
between depth and computational efficiency.
5. Skip Connections and Identity Mapping
Skip connections in ResNet perform identity mapping: the input of the block is added,
unchanged, to the output of the block's stacked layers, so the block computes F(x) + x.
This helps preserve the information from the earlier layers and ensures that the network does
not have to learn the identity function from scratch.
6. Impact and Applications
ResNet had a significant impact on the field of computer vision, leading to state-of-the-art
performance in image classification, object detection, and segmentation.
It is also a foundational architecture for other complex models, such as ResNeXt, DenseNet,
and many others.
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
# Load the pre-trained ResNet50 model without its top classification layers
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Add new layers on top
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x) # Assuming 10 classes
# Define the new model
model = Model(inputs=base_model.input, outputs=predictions)
# Print the model summary
model.summary()
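One practical note on the snippet above: for transfer learning it is common to freeze the pre-trained backbone before compiling, so that only the newly added layers are trained at first. A minimal sketch (the optimizer and loss are typical choices, not requirements):

# Freeze the pre-trained ResNet50 layers so only the new head is trained initially
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])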
InceptionNet:
InceptionNet, also known as GoogLeNet, is a deep convolutional neural network (CNN)
architecture developed by Google. It introduced Inception modules to improve feature
extraction while optimizing computational efficiency. This model is widely used for image
classification and other computer vision tasks.
Architecture Overview
Inception Modules: Instead of using a single convolution filter size, the network
applies 1x1, 3x3, and 5x5 convolutions simultaneously within the same layer,
capturing different levels of spatial information.
Hint:
The key innovation behind the Inception module is multi-scale feature extraction: the
ability of a neural network to capture features of different sizes and resolutions within
an image simultaneously.
Inception Module (Used in InceptionNet)
Uses multiple convolution filters (e.g., 1×1, 3×3, 5×5) in parallel.
This allows the network to capture small, medium, and large features at the same
time.
Example:
1×1 convolution captures fine-grained details.
3×3 convolution extracts mid-level textures.
5×5 convolution captures larger structures.
Dimensionality Reduction: 1x1 convolutions are used before applying larger filters
to reduce computational cost and improve efficiency (a minimal sketch of such a
module follows this overview).
Auxiliary Classifiers: Intermediate softmax classifiers are added at earlier stages to
assist learning and prevent vanishing gradients.
Global Average Pooling (GAP): Instead of fully connected layers, the network uses
average pooling before the final classification layer, reducing overfitting and
improving generalization.
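A minimal Keras sketch of an Inception-style module with parallel 1x1, 3x3, and 5x5 branches and 1x1 dimensionality reduction; the filter counts are illustrative assumptions, not GoogLeNet's exact configuration:

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fpool=32):
    # 1x1 branch: fine-grained details
    b1 = layers.Conv2D(f1, (1, 1), padding='same', activation='relu')(x)

    # 1x1 reduction followed by 3x3: mid-level textures
    b3 = layers.Conv2D(96, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f3, (3, 3), padding='same', activation='relu')(b3)

    # 1x1 reduction followed by 5x5: larger structures
    b5 = layers.Conv2D(16, (1, 1), padding='same', activation='relu')(x)
    b5 = layers.Conv2D(f5, (5, 5), padding='same', activation='relu')(b5)

    # 3x3 max-pooling followed by a 1x1 projection
    bp = layers.MaxPooling2D((3, 3), strides=(1, 1), padding='same')(x)
    bp = layers.Conv2D(fpool, (1, 1), padding='same', activation='relu')(bp)

    # Concatenate all branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])

# Example usage on a dummy feature map
inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)
module = tf.keras.Model(inputs, outputs)
module.summary()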
Key Characteristics
Deep Architecture: The model consists of 22 layers, making it capable of learning
complex hierarchical features.
Efficient Computation: The inception module optimizes feature extraction while
reducing the number of parameters.
Multi-Scale Feature Learning: By processing different kernel sizes in parallel, the
network effectively captures both fine and coarse details in an image.
Transfer Learning: The pre-trained model is widely used in medical imaging,
object detection, and scene recognition tasks.
Advantages
High Accuracy: The use of inception modules improves classification performance
by extracting diverse features.
Lower Computational Cost: Efficient design reduces memory usage and
computational complexity.
Robust Feature Extraction: Multi-scale convolutional filters enhance the model’s
ability to detect features at various levels.
Disadvantages
Complex Architecture: The inception module structure is more intricate, requiring
careful design and implementation.
Hyperparameter Tuning: Optimizing kernel sizes and filter configurations can be
challenging.
InceptionNet remains a powerful deep learning model, offering a balance between
accuracy, efficiency, and feature extraction capabilities for various classification tasks.
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
# Load the pre-trained InceptionV3 model without the top layer
# (299x299 is InceptionV3's native input size; other sizes are accepted when include_top=False)
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Add new layers on top
x = base_model.output
x = GlobalAveragePooling2D()(x) # Global Average Pooling to reduce dimensions
x = Dense(1024, activation='relu')(x) # Fully connected layer
predictions = Dense(10, activation='softmax')(x) # Assuming 10 classes
# Define the new model
model = Model(inputs=base_model.input, outputs=predictions)
# Print the model summary
model.summary()