Computer Vision Deep Dive - January 2025
COMPUTER VISION & CNNS
Teaching Machines to See and Understand the Visual World
The Vision Revolution: From simple edge detection to understanding
complex scenes - Computer Vision has transformed how machines
perceive our visual world. CNNs are the backbone of this revolution!
1. Computer Vision Overview
Computer Vision (CV) is about extracting meaningful information from
images and videos. It's how we give machines the gift of sight!
Core Principle: Images are just numbers! A grayscale image is a 2D
matrix, RGB is a 3D tensor. Our job is to find patterns in these
numbers.
Evolution of Computer Vision:
1960s-1980s: Hand-crafted features (edges, corners)
1990s-2000s: Statistical methods (SVM, Random Forests)
2012-Present: Deep Learning dominance (AlexNet breakthrough)
2. Image Fundamentals
Digital Image Representation:
Grayscale: Image[height, width] → values 0-255
RGB: Image[height, width, 3] → R, G, B channels
Example: 224×224×3 = 150,528 input values!
Common Preprocessing:
Normalization: pixel_value / 255.0
Standardization: (pixel - mean) / std
Resizing: Match model input size
Data Augmentation: Rotation, flip, zoom, crop
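A minimal sketch of these steps with torchvision.transforms (the mean/std values below are the usual ImageNet constants, an assumption here, not something every model requires):
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((224, 224)),                    # match model input size
    T.ToTensor(),                            # PIL image -> tensor, scales pixels to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],  # standardize: (pixel - mean) / std
                std=[0.229, 0.224, 0.225]),
])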
3. Traditional CV Techniques
Edge Detection
Sobel Filter: Gradient-based edge detection
Canny Edge: Multi-stage algorithm for optimal edges
Laplacian: Second derivative for edge detection
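A quick sketch of all three detectors using OpenCV (assumes cv2 is installed; 'photo.png' is a placeholder path):
import cv2

img = cv2.imread('photo.png', cv2.IMREAD_GRAYSCALE)  # load as a 2D grayscale matrix

# Sobel: first-derivative gradients along x and y
grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

# Canny: multi-stage detector with low/high hysteresis thresholds
edges = cv2.Canny(img, 100, 200)

# Laplacian: second-derivative operator
lap = cv2.Laplacian(img, cv2.CV_64F)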
Feature Descriptors
SIFT: Scale-Invariant Feature Transform
SURF: Speeded-Up Robust Features
HOG: Histogram of Oriented Gradients
ORB: Oriented FAST and Rotated BRIEF
4. Enter Convolutional Neural Networks
Why CNNs? They automatically learn hierarchical features! Early layers
detect edges, middle layers detect shapes, deep layers detect objects.
CNN Building Blocks:
Input Image
↓
[Convolution Layer] → Feature Maps
↓
[ReLU Activation] → Non-linearity
↓
[Pooling Layer] → Downsampling
↓
[Convolution Layer] → More Features
↓
[ReLU Activation]
↓
[Pooling Layer]
↓
[Flatten]
↓
[Fully Connected] → Classification
↓
Output (Classes)
5. Convolution Operation - The Heart of CNNs
Input (5×5)      Filter (3×3)
1 2 3 4 5        1  0 -1
2 3 4 5 6        2  0 -2      =  Feature Map (3×3)
3 4 5 6 7        1  0 -1
4 5 6 7 8        (sliding window operation)
5 6 7 8 9
Output Size = (Input - Filter + 2×Padding) / Stride + 1
Example: (32 - 3 + 2×1) / 1 + 1 = 32 (same padding; verified in the snippet after the list below)
Key Concepts:
Filters/Kernels: Learnable feature detectors
Stride: How much the filter moves
Padding: Adding borders to maintain size
Receptive Field: Input region affecting output
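A minimal check of the output-size formula with PyTorch's nn.Conv2d:
import torch
import torch.nn as nn

# 3×3 filter, stride 1, padding 1 on a 32×32 input → (32 - 3 + 2×1)/1 + 1 = 32
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)        # (batch, channels, height, width)
print(conv(x).shape)                 # torch.Size([1, 16, 32, 32]) — "same" padding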
6. Pooling Layers
Max Pooling (2×2, stride=2):
Input:        Output:
1 3 2 4
2 2 1 3   →   3 4
4 1 3 2       4 3
2 1 2 1
Takes the maximum value in each 2×2 window (verified in the snippet below)
Why Pooling?
Reduces spatial dimensions (computation)
Provides translation invariance
Helps prevent overfitting
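The 4×4 example above, checked with nn.MaxPool2d:
import torch
import torch.nn as nn

x = torch.tensor([[1., 3., 2., 4.],
                  [2., 2., 1., 3.],
                  [4., 1., 3., 2.],
                  [2., 1., 2., 1.]]).reshape(1, 1, 4, 4)  # (batch, channel, H, W)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).reshape(2, 2))  # tensor([[3., 4.], [4., 3.]])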
7. Famous CNN Architectures
Architecture   Year  Key Innovation           Depth
LeNet-5        1998  First successful CNN     7 layers
AlexNet        2012  ReLU, Dropout, Data Aug  8 layers
VGGNet         2014  Small filters (3×3)      16-19 layers
GoogLeNet      2014  Inception modules        22 layers
ResNet         2015  Skip connections         50-152 layers
DenseNet       2017  Dense connections        100+ layers
EfficientNet   2019  Compound scaling         Varies
8. ResNet - The Game Changer
Problem: Deeper networks suffered from vanishing gradients
Solution: Skip connections! Allow gradients to flow directly
Output = F(x) + x
where F(x) is the residual function to be learned
and x is the identity mapping (the skip connection)
Residual Block:
Input (x)
|
├──────────────────┐ (skip connection)
| |
[Conv → BN → ReLU] |
| |
[Conv → BN] |
| |
+──────────────────┘
|
[ReLU]
|
Output
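A minimal PyTorch sketch of the block above, assuming the identity-shortcut case (same shape in and out; real ResNets add a 1×1 projection on the skip path when channels or stride change):
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                               # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))   # Conv → BN → ReLU
        out = self.bn2(self.conv2(out))            # Conv → BN
        return self.relu(out + identity)           # F(x) + x, then ReLU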
9. Transfer Learning - The Practical Approach
Golden Rule: Don't train from scratch unless you have millions of
images! Use pre-trained models and fine-tune.
Transfer Learning Strategies:
1. Feature Extraction: Freeze conv layers, train only classifier
2. Fine-tuning: Unfreeze top layers, train with low learning rate
3. Full Training: Unfreeze all, but initialize with pre-trained weights
PyTorch Transfer Learning:
import torch.nn as nn
import torchvision.models as models

# Load pre-trained ResNet (torchvision ≥ 0.13 uses `weights=` instead of `pretrained=True`)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for your task
num_classes = 10  # e.g. 10 target classes
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)
# Now only the new final layer will train!
10. Data Augmentation Techniques
Spatial Transformations:
Random Crop
Horizontal/Vertical Flip
Rotation (±15-30°)
Translation
Zoom/Scale
Pixel-level Transformations:
Brightness adjustment
Contrast changes
Saturation/Hue shifts
Gaussian noise
Cutout/Random erasing
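One way to combine several of the spatial and pixel-level transforms above with torchvision (the exact values are illustrative choices, not canonical ones):
import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomResizedCrop(224),          # random crop + zoom/scale
    T.RandomHorizontalFlip(),          # flip with probability 0.5
    T.RandomRotation(15),              # rotation within ±15°
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.5),            # random erasing operates on tensors, so it comes last
])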
11. Object Detection
Two-Stage Detectors:
R-CNN: Region proposals → CNN features → Classification
Fast R-CNN: Shared computation for proposals
Faster R-CNN: Region Proposal Network (RPN)
Single-Stage Detectors:
YOLO: You Only Look Once - Real-time detection
SSD: Single Shot MultiBox Detector
RetinaNet: Focal loss for class imbalance
mAP (mean Average Precision) = the primary metric
IoU (Intersection over Union) = Area_overlap / Area_union
An IoU threshold of 0.5 is typical for counting a detection as correct
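IoU for axis-aligned boxes is a few lines of plain Python; a minimal sketch with boxes given as (x1, y1, x2, y2):
def iou(a, b):
    # Intersection rectangle corners
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    # Union = area_a + area_b - intersection
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143 → below the 0.5 threshold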
12. Semantic Segmentation
Classify every pixel! Applications: Medical imaging, autonomous
driving, satellite imagery
Popular Architectures:
FCN: Fully Convolutional Networks
U-Net: Encoder-decoder with skip connections
DeepLab: Atrous convolutions for multi-scale
Mask R-CNN: Instance segmentation
13. Vision Transformers (ViT)
2020 Breakthrough: Transformers aren't just for NLP! ViT treats
images as sequences of patches.
Vision Transformer Pipeline:
Image (224×224)
↓
[Split into 16×16 patches]
↓
[Linear Projection of patches]
↓
[Add position embeddings]
↓
[Transformer Encoder blocks]
↓
[Classification head]
↓
Output
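A minimal sketch of the front end of this pipeline, assuming 224×224 RGB input, 16×16 patches, and a 768-dim embedding (a strided conv is the standard trick for patchify + linear projection in one step):
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # one token per 16×16 patch
tokens = patch_embed(img).flatten(2).transpose(1, 2)        # (1, 196, 768): 14×14 patches
cls = torch.zeros(1, 1, 768)                                # [CLS] token (learnable in a real ViT)
tokens = torch.cat([cls, tokens], dim=1)                    # (1, 197, 768)
tokens = tokens + torch.zeros(1, 197, 768)                  # position embeddings (learnable in a real ViT)
# ...tokens now feed the Transformer encoder blocks
print(tokens.shape)  # torch.Size([1, 197, 768])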
14. Practical Tips & Tricks
Training Best Practices:
Start with a small learning rate (e.g. 1e-4 for fine-tuning)
Use learning rate scheduling (see the sketch after this list)
Monitor validation loss for early stopping
Batch Normalization helps convergence
Mixed precision training for speed
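A runnable skeleton of LR scheduling plus early stopping (the model and loss values below are stand-ins, not a real training loop):
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for a real CNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small LR, as for fine-tuning
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=2)

val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]  # dummy validation losses
best, bad_epochs, patience = float('inf'), 0, 3
for epoch, val_loss in enumerate(val_losses):
    scheduler.step(val_loss)              # drop LR when the loss plateaus
    if val_loss < best:
        best, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:        # early stopping
            print(f"early stop at epoch {epoch}")
            break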
Common Pitfalls:
Forgetting to normalize inputs
Wrong channel order (RGB vs BGR)
Training on imbalanced datasets
Not using data augmentation
Overfitting on small datasets
15. Evaluation Metrics
Task              Metrics
Classification    Accuracy, Precision, Recall, F1, Confusion Matrix
Object Detection  mAP, IoU, FPS (for real-time)
Segmentation      Pixel Accuracy, IoU, Dice Coefficient
Face Recognition  ROC curve, FAR/FRR
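The Dice coefficient from the table, for binary segmentation masks; a minimal sketch:
import torch

def dice(pred, target, eps=1e-6):
    # Dice = 2|A ∩ B| / (|A| + |B|)
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

pred   = torch.tensor([[1., 1., 0.], [0., 1., 0.], [0., 0., 0.]])
target = torch.tensor([[1., 0., 0.], [0., 1., 1.], [0., 0., 0.]])
print(dice(pred, target))  # 2×2 / (3 + 3) ≈ 0.667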
16. Real-World Applications
Where CNNs Shine:
Medical: Tumor detection, X-ray analysis, retinal scans
Automotive: Self-driving cars, parking assistance
Security: Face recognition, anomaly detection
Retail: Visual search, inventory management
Agriculture: Crop disease detection, yield prediction
Manufacturing: Quality control, defect detection
17. Code Example - Building a CNN
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleCNN, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        # Pooling layer (halves height and width each time)
        self.pool = nn.MaxPool2d(2, 2)
        # Fully connected layers (assumes 224×224 input: 224 → 112 → 56 → 28)
        self.fc1 = nn.Linear(128 * 28 * 28, 512)
        self.fc2 = nn.Linear(512, num_classes)
        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Conv Block 1
        x = self.relu(self.conv1(x))
        x = self.pool(x)
        # Conv Block 2
        x = self.relu(self.conv2(x))
        x = self.pool(x)
        # Conv Block 3 (pooled as well, so the flattened size matches fc1)
        x = self.relu(self.conv3(x))
        x = self.pool(x)
        # Flatten and classify
        x = x.view(x.size(0), -1)
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.fc2(x)
        return x
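A quick sanity check of the model above on a dummy batch:
model = SimpleCNN(num_classes=10)
x = torch.randn(4, 3, 224, 224)   # batch of 4 RGB images, 224×224
print(model(x).shape)             # torch.Size([4, 10])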
18. Latest Trends & Future
What's Hot in 2025:
Self-supervised learning (DINO, MAE)
Neural Architecture Search (NAS)
3D Computer Vision
Video understanding
Multimodal models (CLIP, DALL-E)
Efficient models for edge devices
19. Resources & Next Steps
My Study Plan:
1. Master PyTorch/TensorFlow for CV
2. Implement classic papers from scratch
3. Kaggle competitions for practice
4. Build an end-to-end CV application
5. Explore 3D vision and video
6. Dive into vision transformers
"The eye sees only what the mind is prepared to comprehend" - Now we're teaching
machines to comprehend!