Computer Vision & CNNs - Study Notes

Uploaded by

xedac78301

Computer Vision Deep Dive - January 2025

COMPUTER VISION & CNNS

Teaching Machines to See and Understand the Visual World

The Vision Revolution: From simple edge detection to understanding
complex scenes - Computer Vision has transformed how machines
perceive our visual world. CNNs are the backbone of this revolution!

1. Computer Vision Overview

Computer Vision (CV) is about extracting meaningful information from
images and videos. It's how we give machines the gift of sight!

Core Principle: Images are just numbers! A grayscale image is a 2D
matrix; an RGB image is a 3D tensor. Our job is to find patterns in
these numbers.

Evolution of Computer Vision:

1960s-1980s: Hand-crafted features (edges, corners)

1990s-2000s: Statistical methods (SVM, Random Forests)

2012-Present: Deep Learning dominance (AlexNet breakthrough)

2. Image Fundamentals

Digital Image Representation:

Grayscale: Image[height, width] → values: 0-255

RGB: Image[height, width, 3] → R, G, B channels

Example: 224×224×3 = 150,528 input values!
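To make these shapes concrete, here is a minimal NumPy sketch. The arrays are synthetic (all zeros), not real images, and the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for real images (all zeros), stored as 8-bit integers.
gray = np.zeros((224, 224), dtype=np.uint8)      # Image[height, width]
rgb = np.zeros((224, 224, 3), dtype=np.uint8)    # Image[height, width, 3]

print(gray.ndim, gray.shape)   # 2 (224, 224)
print(rgb.ndim, rgb.shape)     # 3 (224, 224, 3)
print(rgb.size)                # 150528 values = 224 * 224 * 3

# Channels are indexed along the last axis: 0 = R, 1 = G, 2 = B.
red_channel = rgb[:, :, 0]
print(red_channel.shape)       # (224, 224)
```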

Common Preprocessing:

Normalization: pixel_value / 255.0

Standardization: (pixel - mean) / std

Resizing: Match model input size

Data Augmentation: Rotation, flip, zoom, crop
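The first two preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration on a random synthetic image. It assumes per-channel statistics computed from that single image, whereas in practice the mean and std usually come from the whole training set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic RGB image with values in 0-255 (stand-in for a real photo).
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Normalization: scale pixel values into [0, 1].
normalized = image.astype(np.float32) / 255.0

# Standardization: subtract the per-channel mean, divide by the per-channel std.
# Assumption: statistics are taken from this one image for simplicity.
mean = normalized.mean(axis=(0, 1))
std = normalized.std(axis=(0, 1))
standardized = (normalized - mean) / std

print(normalized.min(), normalized.max())   # both inside [0, 1]
print(standardized.mean(axis=(0, 1)))       # close to 0 per channel
print(standardized.std(axis=(0, 1)))        # close to 1 per channel
```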

3. Traditional CV Techniques

Edge Detection

Sobel Filter: Gradient-based edge detection

Canny Edge: Multi-stage algorithm for optimal edges

Laplacian: Second derivative for edge detection
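As a rough illustration of gradient-based edge detection, the sketch below applies a horizontal Sobel kernel to a tiny synthetic image containing one vertical edge. The image, its size, and the loop-based implementation are assumptions chosen for readability; a library routine would normally be used instead.

```python
import numpy as np

# Synthetic 5x6 image: dark left half (0), bright right half (10).
image = np.zeros((5, 6), dtype=np.float32)
image[:, 3:] = 10.0

# Sobel kernel that responds to horizontal intensity changes (vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# Slide the kernel over every 3x3 window (no padding).
h, w = image.shape
response = np.zeros((h - 2, w - 2), dtype=np.float32)
for i in range(h - 2):
    for j in range(w - 2):
        response[i, j] = np.sum(image[i:i + 3, j:j + 3] * sobel_x)

print(response)   # each row is [0, 40, 40, 0]
```

The response is zero in the flat regions and large only where the window straddles the edge.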

Feature Descriptors

SIFT: Scale-Invariant Feature Transform

SURF: Speeded-Up Robust Features

HOG: Histogram of Oriented Gradients

ORB: Oriented FAST and Rotated BRIEF

4. Enter Convolutional Neural Networks


Why CNNs? They automatically learn hierarchical features! Early layers
detect edges, middle layers detect shapes, deep layers detect objects.

CNN Building Blocks:

Input Image

[Convolution Layer] → Feature Maps

[ReLU Activation] → Non-linearity

[Pooling Layer] → Downsampling

[Convolution Layer] → More Features

[ReLU Activation]

[Pooling Layer]

[Flatten]

[Fully Connected] → Classification

Output (Classes)

5. Convolution Operation - The Heart of CNNs

Input (5×5)        Filter (3×3)       Output (3×3)

1 2 3 4 5
2 3 4 5 6           1  0 -1
3 4 5 6 7     *     2  0 -2     =     Feature Map
4 5 6 7 8           1  0 -1
5 6 7 8 9

(Sliding window operation)
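The sliding window can be written out directly. The sketch below uses the 5×5 input and 3×3 filter values from the example, with stride 1 and no padding. The function name `conv2d_valid` is illustrative, not a library call.

```python
import numpy as np

# Input and filter values taken from the example.
image = np.array([[1, 2, 3, 4, 5],
                  [2, 3, 4, 5, 6],
                  [3, 4, 5, 6, 7],
                  [4, 5, 6, 7, 8],
                  [5, 6, 7, 8, 9]])
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding.

    Like CNN layers in most frameworks, this does not flip the kernel
    (strictly speaking it is cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the window and the kernel, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

feature_map = conv2d_valid(image, kernel)
print(feature_map)
# [[-8 -8 -8]
#  [-8 -8 -8]
#  [-8 -8 -8]]
```

Every output is -8 because the input brightens by the same amount from left to right in every window, so the filter sees the same horizontal gradient everywhere.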

Output Size = (Input - Filter + 2×Padding) / Stride + 1

Example: (32 - 3 + 2×1) / 1 + 1 = 32 (same padding)

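The output-size formula is easy to check in code. This small helper is a sketch with an illustrative name. It rounds down when the stride does not divide evenly, which is how common frameworks such as PyTorch behave.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output Size = (Input - Filter + 2*Padding) / Stride + 1, rounded down."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 3, padding=1, stride=1))  # 32 (same padding)
print(conv_output_size(5, 3))                        # 3  (5x5 input, 3x3 filter)
print(conv_output_size(224, 2, stride=2))            # 112 (2x2 window, stride 2)
```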
Key Concepts:

Filters/Kernels: Learnable feature detectors

Stride: How much the filter moves

Padding: Adding borders to maintain size

Receptive Field: Input region affecting output

6. Pooling Layers

Max Pooling (2×2, stride=2):

Input:          Output:

1 3 2 4
2 2 1 3    →    3 4
4 1 3 2         4 3
2 1 2 1

Takes maximum value in each window
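The same 2×2, stride-2 max pooling can be reproduced with a NumPy reshape. This is a minimal sketch that assumes the height and width are even; the function name is illustrative.

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [2, 2, 1, 3],
              [4, 1, 3, 2],
              [2, 1, 2, 1]])

def max_pool_2x2(x):
    """2x2 max pooling with stride 2; height and width must be even."""
    h, w = x.shape
    # Split into (h/2, 2, w/2, 2) blocks, then take the max inside each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

pooled = max_pool_2x2(x)
print(pooled)
# [[3 4]
#  [4 3]]
```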

Why Pooling?

Reduces spatial dimensions (computation)

Provides translation invariance

Helps prevent overfitting

7. Famous CNN Architectures

Architecture    Year   Key Innovation                     Depth

LeNet-5         1998   First successful CNN               7 layers
AlexNet         2012   ReLU, Dropout, Data Augmentation   8 layers
VGGNet          2014   Small filters (3×3)                16-19 layers
GoogLeNet       2014   Inception modules                  22 layers
ResNet          2015   Skip connections                   50-152 layers
DenseNet        2017   Dense connections                  100+ layers
EfficientNet    2019   Compound scaling                   Varies

8. ResNet - The Game Changer

Problem: Very deep plain networks were hard to optimize (vanishing
gradients, and accuracy that degraded as more layers were added)

Solution: Skip connections! They let gradients flow directly to
earlier layers

Output = F(x) + x

Where:
F(x) is the residual function to be learned
x is the input, carried forward unchanged by the identity mapping
(skip connection)

Residual Block:

Input (x)
|
├──────────────────┐ (skip connection)
| |
[Conv → BN → ReLU] |
| |
[Conv → BN] |
| |
+──────────────────┘
|
[ReLU]
|
Output
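
The residual block can be sketched in a few lines of NumPy. This is a deliberate simplification: plain matrix multiplications stand in for the Conv and BN layers, and the weights are made up for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Simplified residual block: dense layers stand in for Conv + BN."""
    f = relu(x @ w1) @ w2      # F(x): the residual function
    return relu(f + x)         # Output = ReLU(F(x) + x)

x = np.array([[1.0, 2.0, 3.0]])

# With all weights at zero, F(x) = 0 and the block passes x straight through
# (x is non-negative here, so the final ReLU leaves it unchanged).
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))   # [[1. 2. 3.]]

# With non-zero weights, the block outputs x plus a learned correction.
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(3, 3))
w2 = rng.normal(scale=0.1, size=(3, 3))
print(residual_block(x, w1, w2))
```

The zero-weight case shows why extra residual blocks are easy to add: a block that has learned nothing simply passes its input through.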

9. Transfer Learning - The Practical Approach

Golden Rule: Don't train from scratch unless you have millions of
images! Use pre-trained models and fine-tune.

Transfer Learning Strategies:

1. Feature Extraction: Freeze conv layers, train only classifier

2. Fine-tuning: Unfreeze top layers, train with low learning rate

3. Full Training: Unfreeze all, but initialize with pre-trained weights

PyTorch Transfer Learning:

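import torch.nn as nn
import torchvision.models as models

num_classes = 10  # placeholder: set to the number of classes in your task

# Load a ResNet-50 pre-trained on ImageNet
# (torchvision >= 0.13; older releases used pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace final layer for your task (new layers are trainable by default)
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, num_classes)

# Now only final layer will train!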

10. Data Augmentation Techniques

Spatial Transformations:
Random Crop

Horizontal/Vertical Flip

Rotation (±15-30°)

Translation

Zoom/Scale

Pixel-level Transformations:

Brightness adjustment

Contrast changes

Saturation/Hue shifts

Gaussian noise

Cutout/Random erasing
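A few of these augmentations can be sketched in plain NumPy. The functions below (random crop, horizontal flip, brightness shift) are illustrative helpers written for this example, not a library API, and the input is a synthetic random image with values in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 256x256 RGB image in [0, 1] (stand-in for a real photo).
image = rng.random((256, 256, 3)).astype(np.float32)

def random_crop(img, size, rng):
    """Cut out a random size x size window."""
    top = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    return img[top:top + size, left:left + size]

def random_horizontal_flip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img

def random_brightness(img, rng, max_delta=0.2):
    """Add a random offset to all pixels and clip back into [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(img + delta, 0.0, 1.0)

augmented = random_crop(image, 224, rng)
augmented = random_horizontal_flip(augmented, rng)
augmented = random_brightness(augmented, rng)
print(augmented.shape)  # (224, 224, 3)
```

Because the transformations are random, each epoch sees a slightly different version of every training image.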

11. Object Detection

Two-Stage Detectors:

R-CNN: Region proposals → CNN features → Classification

Fast R-CNN: Shared computation for proposals

Faster R-CNN: Region Proposal Network (RPN)

Single-Stage Detectors:

YOLO: You Only Look Once - Real-time detection

SSD: Single Shot MultiBox Detector


RetinaNet: Focal loss for class imbalance

mAP (mean Average Precision) = Primary metric

IoU (Intersection over Union) = Area_overlap / Area_union

Threshold typically 0.5 for detection
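IoU is simple to compute for axis-aligned boxes. The sketch below assumes boxes in (x_min, y_min, x_max, y_max) format; the two example boxes are made up for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    overlap = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0

prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
score = iou(prediction, ground_truth)
print(round(score, 3))   # 0.333
print(score >= 0.5)      # False: below the usual 0.5 threshold
```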

12. Semantic Segmentation

Classify every pixel! Applications: Medical imaging, autonomous
driving, satellite imagery

Popular Architectures:

FCN: Fully Convolutional Networks

U-Net: Encoder-decoder with skip connections

DeepLab: Atrous convolutions for multi-scale

Mask R-CNN: Instance segmentation (a separate mask per object, rather than one label per pixel)

13. Vision Transformers (ViT)

2020 Breakthrough: Transformers aren't just for NLP! ViT treats
images as sequences of patches.

Vision Transformer Pipeline:

Image (224×224)

[Split into 16×16 patches]

[Linear Projection of patches]

[Add position embeddings]

[Transformer Encoder blocks]

[Classification head]

Output
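The first three steps of this pipeline are mostly reshaping, which the NumPy sketch below illustrates. Random matrices stand in for the learned projection and position embeddings, `embed_dim = 64` is an arbitrary assumed size, and the class token used by the actual ViT is omitted for brevity.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)
    return blocks.reshape(-1, patch * patch * c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # synthetic image
patches = patchify(image, 16)
print(patches.shape)   # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values

# Linear projection to the embedding size, then add position embeddings.
rng = np.random.default_rng(0)
embed_dim = 64  # assumed value for illustration
projection = rng.normal(size=(patches.shape[1], embed_dim)).astype(np.float32)
position = rng.normal(size=(patches.shape[0], embed_dim)).astype(np.float32)
tokens = patches @ projection + position
print(tokens.shape)    # (196, 64): the sequence fed to the Transformer encoder
```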

14. Practical Tips & Tricks

Training Best Practices:

Start with a small learning rate (e.g., 1e-4 for fine-tuning)

Use learning rate scheduling

Monitor validation loss for early stopping

Batch Normalization helps convergence

Mixed-precision training for speed
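
Two of these practices, learning rate scheduling and early stopping, can be sketched without any framework. The step-decay schedule, the patience value, and the validation losses below are assumptions made up for illustration; real training loops would typically use the framework's own scheduler.

```python
def step_decay(base_lr, epoch, drop=0.1, every=10):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss          # new best validation loss: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.64]
print(early_stopping_epoch(val_losses))   # 5: three epochs without beating 0.60

for epoch in (0, 10, 20):
    print(epoch, step_decay(1e-4, epoch))
```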

Common Pitfalls:

Forgetting to normalize inputs

Wrong channel order (RGB vs BGR)

Training on imbalanced datasets

Not using data augmentation

Overfitting on small datasets

15. Evaluation Metrics


Task               Metrics

Classification     Accuracy, Precision, Recall, F1, Confusion Matrix
Object Detection   mAP, IoU, FPS (for real-time)
Segmentation       Pixel Accuracy, IoU, Dice Coefficient
Face Recognition   ROC curve, FAR/FRR
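
Several of these metrics reduce to counting matches. The sketch below computes binary precision, recall, and F1, plus the Dice coefficient for flat binary masks. The labels and masks are made up for illustration and the helper names are not from any library.

```python
def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def dice(mask_a, mask_b):
    """Dice coefficient of two flat binary masks: 2 * overlap / (size_a + size_b)."""
    overlap = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 2 * overlap / total if total else 1.0

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))   # (0.75, 0.75, 0.75)
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))      # about 0.667
```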

16. Real-World Applications

Where CNNs Shine:

Medical: Tumor detection, X-ray analysis, retinal scans

Automotive: Self-driving cars, parking assistance

Security: Face recognition, anomaly detection

Retail: Visual search, inventory management

Agriculture: Crop disease detection, yield prediction

Manufacturing: Quality control, defect detection

17. Code Example - Building a CNN

import torch.nn as nn

class SimpleCNN(nn.Module):
    """Small CNN; fc1 below assumes 3-channel 224×224 inputs."""

    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layers (3×3 kernels; padding=1 keeps height and width)
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)

        # Pooling layer (halves height and width)
        self.pool = nn.MaxPool2d(2, 2)

        # Fully connected layers
        # Three poolings take 224 → 112 → 56 → 28, hence 128 * 28 * 28
        self.fc1 = nn.Linear(128 * 28 * 28, 512)
        self.fc2 = nn.Linear(512, num_classes)

        # Activation and regularization
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Conv Block 1
        x = self.relu(self.conv1(x))
        x = self.pool(x)

        # Conv Block 2
        x = self.relu(self.conv2(x))
        x = self.pool(x)

        # Conv Block 3 (pooling brings 56×56 down to the 28×28 fc1 expects)
        x = self.relu(self.conv3(x))
        x = self.pool(x)

        # Flatten and classify
        x = x.view(x.size(0), -1)
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.fc2(x)

        return x
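
A quick way to sanity-check the first fully connected layer is to trace the spatial size through the network. The sketch below assumes a 224×224 input (the example size from section 2) and counts how many conv + pool blocks are needed to reach the 28×28 feature map that fc1's 128 * 28 * 28 input size expects. The helper names are illustrative.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after a convolution (3x3, padding=1 by default)."""
    return (size - kernel + 2 * padding) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling (2x2, stride 2 by default)."""
    return (size - kernel) // stride + 1

size = 224  # assumed input resolution (224x224 RGB)
pools = 0
while size > 28:
    size = conv_out(size)   # 3x3 conv with padding=1: size unchanged
    size = pool_out(size)   # 2x2 max pool with stride 2: size halved
    pools += 1
    print(f"after block {pools}: {size}x{size}")

print(pools)               # 3
print(128 * size * size)   # 100352 = 128 * 28 * 28
```

If the input resolution changes, the first argument of fc1 has to change with it.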

18. Latest Trends & Future

What's Hot in 2025:

Self-supervised learning (DINO, MAE)


Neural Architecture Search (NAS)

3D Computer Vision

Video understanding

Multimodal models (CLIP, DALL-E)

Efficient models for edge devices

19. Resources & Next Steps

My Study Plan:

1. Master PyTorch/TensorFlow for CV

2. Implement classic papers from scratch

3. Kaggle competitions for practice

4. Build an end-to-end CV application

5. Explore 3D vision and video

6. Dive into vision transformers

"The eye sees only what the mind is prepared to comprehend" - Now we're teaching
machines to comprehend!
