Unit – 5 (COMPUTER VISION)
REVISION NOTES
Introduction to Computer Vision
• Definition: Computer Vision (CV) is a field within Artificial Intelligence (AI) that enables
computers and systems to derive meaningful information from digital images, videos, and
other visual inputs. It allows machines to process and analyze visual data to simulate human
sight. By using algorithms and machine learning models, CV applications can detect objects,
recognize patterns, and make decisions based on the visual input provided.
• Example – Emoji Scavenger Hunt: Imagine playing a game where a machine shows you an
emoji and asks you to find a real-life object that matches it. In the “Emoji Scavenger Hunt”
game, the computer uses its “vision” to detect the objects you show in front of your camera
and check if they match the emoji. This simulates how CV enables machines to identify
objects from real-world environments using camera input.
• How It Works: Computer Vision uses advanced algorithms to interpret visual data. It breaks
down images into pixels, processes them using machine learning techniques, and identifies
patterns, shapes, or objects by comparing them with its dataset.
Applications of Computer Vision
Over the years, CV has evolved to become a crucial part of various industries, with applications that
have transformed sectors ranging from retail to healthcare. Here are some real-world applications of
CV:
a. Facial Recognition
• Definition: Facial recognition systems identify or verify a
person’s identity using their facial features.
• Applications:
• Smart Homes & Cities: CV plays a critical role in
enhancing security. In smart homes, facial recognition
technology can be used to control access, allowing only
registered individuals inside. Similarly, smart city
cameras can recognize and track people in public spaces
for security purposes.
• Attendance Systems: Schools and workplaces use facial
recognition for automated attendance marking.
• Example: Schools can track student attendance automatically
by scanning students’ faces upon entry.
b. Face Filters in Social Media
• Definition: Face filters are used to apply augmented reality
(AR) effects to users’ faces in apps like Instagram and
Snapchat.
• How It Works: Computer vision algorithms detect and map
facial features in real-time. Using this data, the system overlays
digital filters that enhance or alter the appearance of the face.
• Example: When you apply a dog filter on Snapchat, CV
algorithms track the eyes, mouth, and nose, allowing the filter to adjust dynamically as you
move.
c. Google’s Search by Image
• Definition: Google’s “Search by Image”
feature uses computer vision to allow users to
upload an image instead of typing keywords,
and Google returns relevant search results
based on that image.
• How It Works: The CV system analyzes
features like colors, shapes, and patterns of the uploaded image, compares them to images in
its database, and displays matching results.
• Example: If you upload a picture of a landmark, Google will identify it and provide detailed
information about the place, including its history and location.
d. Computer Vision in Retail
• Customer Behavior Tracking: Retailers use CV to
track customers’ movements within stores. Cameras
and CV algorithms analyze how people navigate
through aisles, which helps in optimizing store layouts.
• Inventory Management: Cameras monitor stock
levels on shelves, and CV algorithms provide real-time
analysis of which products need restocking.
• Example: Amazon Go stores use computer vision to
create a cashier-less shopping experience. Shoppers can
pick items off the shelf, and CV systems automatically
detect what they’ve selected, charge their account, and
let them walk out without checking out manually.
e. Self-Driving Cars
• Definition: Autonomous vehicles rely heavily on
computer vision to interpret the surrounding
environment, helping the car navigate safely
without human intervention.
• Key Tasks: CV enables self-driving cars to detect
objects like other cars, pedestrians, road signs, and
obstacles. It also assists in lane detection and route
navigation.
• Example: Tesla’s Autopilot uses computer vision to detect nearby vehicles, keep the car
within its lane, adjust speed, and respond to traffic conditions.
f. Medical Imaging
• Definition: CV is revolutionizing
healthcare by aiding in the analysis of
medical images such as X-rays, MRIs,
and CT scans.
• How It Works: The technology helps to
identify abnormalities and diseases by converting 2D scans into detailed 3D models, offering
better insights for diagnosis.
• Example: AI-powered systems can detect tumors or fractures from medical images faster and
sometimes more accurately than human radiologists, providing early diagnosis and better
treatment outcomes.
g. Google Translate App (Augmented Reality)
• Definition: By using CV combined with augmented reality
(AR), Google Translate allows users to point their phone
cameras at foreign text and receive a real-time translation
overlay.
• How It Works: Optical character recognition (OCR) detects the foreign words, a
translation engine converts them, and AR overlays the translated text in the user’s
preferred language.
• Example: If you’re traveling abroad and come across a sign
in a language you don’t understand, pointing your camera at
it will display the translated text almost instantly on your screen.
Core Tasks in Computer Vision
• Classification: Assigning a label to an image (e.g., categorizing images as “cat” or “dog”).
• Classification + Localization: Identifying the object and its position within the image.
• Object Detection: Detecting multiple objects in a single image along with their positions.
• Example: In self-driving cars, object detection is used to identify pedestrians, traffic
signals, or other vehicles.
• Instance Segmentation: Segmenting an image to identify multiple instances of objects and
assigning labels to individual pixels.
• Example: In medical imaging, instance segmentation can identify and label different
organs in a scan.
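To make the first task concrete, here is a minimal image-classification sketch. It assumes the torchvision library, a pretrained ResNet-18 model, and a local file cat.jpg; none of these are prescribed by the notes, and any classification model would illustrate the same idea.

```python
# A minimal image-classification sketch with a pretrained ResNet-18 from
# torchvision. The model choice and the file name "cat.jpg" are illustrative
# assumptions, not part of the notes above.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()  # inference mode: no training-time behaviour

preprocess = weights.transforms()            # the preprocessing this model expects
img = Image.open("cat.jpg").convert("RGB")   # placeholder input image
batch = preprocess(img).unsqueeze(0)         # add a batch dimension

with torch.no_grad():
    logits = model(batch)
top = logits.argmax(dim=1).item()            # index of the highest-scoring class
print("Predicted label:", weights.meta["categories"][top])
```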
Basics of Images in Computer Vision
• Pixels: The smallest unit of an image. Each image is made up of thousands or millions of
these pixels.
• Resolution: The number of pixels in an image determines its resolution, which affects clarity.
For example, a 5-megapixel camera produces images with 5 million pixels.
• Pixel Value: Each pixel has a brightness or color value ranging from 0 to 255. In grayscale
images, 0 represents black, and 255 represents white.
• RGB Images: Color images are made by combining three color channels (Red, Green, and
Blue). Every color image pixel has a set of three values, each corresponding to the intensity of
these colors.
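The pixel concepts above can be checked directly in code. A small sketch using Pillow and NumPy (the libraries and the file name photo.jpg are assumptions for illustration):

```python
# Inspecting pixel values with Pillow and NumPy. "photo.jpg" is a placeholder
# for any colour image on disk.
from PIL import Image
import numpy as np

img = Image.open("photo.jpg")
rgb = np.array(img)                 # shape: (height, width, 3) for an RGB image
print("Resolution:", rgb.shape[1], "x", rgb.shape[0])
print("Total pixels:", rgb.shape[0] * rgb.shape[1])

# Each colour pixel holds three values (R, G, B), each in the range 0-255.
r, g, b = rgb[0, 0]
print("Top-left pixel -> R:", r, "G:", g, "B:", b)

# Converting to grayscale leaves one value per pixel: 0 = black, 255 = white.
gray = np.array(img.convert("L"))
print("Top-left grayscale value:", gray[0, 0])
```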
Image Features
• Definition: Features are essential visual elements in an image that help in recognizing or
categorizing objects.
• Key Features:
• Edges: Boundaries between different regions in an image.
• Corners: Points where two edges meet.
• Blobs: Regions that differ in properties such as color or intensity from surrounding
areas.
• Example: In facial recognition, detecting key features like eyes, nose, and mouth edges is
crucial for identification.
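Edges and corners like these can be extracted with standard detectors. A brief sketch using OpenCV's Canny (edges) and Harris (corners) functions; the library choice and the input file are assumptions:

```python
# Detecting edges and corners with OpenCV. "building.jpg" is a placeholder.
import cv2
import numpy as np

gray = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)

# Edges: boundaries between regions, found with the Canny detector.
edges = cv2.Canny(gray, 100, 200)

# Corners: points where two edges meet, scored with the Harris detector
# (arguments: image, neighbourhood size, Sobel aperture, Harris parameter k).
harris = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)
strong_corners = int((harris > 0.01 * harris.max()).sum())

print("Edge pixels:", int((edges > 0).sum()))
print("Strong corner responses:", strong_corners)
```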
Convolutional Neural Networks (CNN)
A Convolutional Neural Network (CNN) is a deep learning algorithm that takes an input image,
assigns importance (learnable weights and biases) to various aspects or objects in the image, and
learns to differentiate one from another.
CNNs are a specialized class of deep neural networks designed to process and analyze visual data.
They are highly effective in tasks such as image classification, object detection, and image
segmentation.
• Structure of a CNN:
1. Convolution Layer: The first layer in a CNN where filters (kernels) scan the input
image to extract features like edges, colors, and textures.
• Example: If you input a picture of a cat, the convolution layer extracts features
like the shape of the cat’s eyes, ears, and fur pattern.
2. ReLU Layer (Rectified Linear Unit): This activation function replaces negative
values in the feature maps with zero, introducing non-linearity.
3. Pooling Layer: Reduces the dimensionality of the feature maps by selecting the most
important information (e.g., through Max Pooling).
• Example: Max Pooling keeps the maximum value in each region of the feature
map, allowing the network to focus on the most prominent details.
4. Fully Connected Layer: Flattens the input and uses it for classification. The flattened
vector is used to assign labels to the input image.
• Example: After feature extraction, the fully connected layer identifies whether
the input image is a cat or a dog based on the probability distribution across
labels.
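The four layers above map one-to-one onto a minimal Keras model. The sketch below assumes 64x64 RGB inputs and two output classes (cat vs. dog); these sizes are illustrative, not part of the notes.

```python
# A minimal CNN mirroring the four layers described above (Keras/TensorFlow).
# Input size (64x64 RGB) and the two classes are illustrative assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    # 1-2. Convolution layer: 16 filters (kernels) scan the image for features,
    # with ReLU zeroing out negative values in the feature maps.
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # 3. Pooling layer: Max Pooling keeps the strongest response in each 2x2 region.
    layers.MaxPooling2D((2, 2)),
    # 4. Fully connected layer: flatten the feature maps and classify.
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),  # probabilities for "cat" vs "dog"
])

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # prints each layer and its output shape
```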
• Convolution Operation: Convolution is the core operation in CNNs. A small matrix (kernel)
slides across the image; at each position it multiplies the overlapping pixel values and sums
them into one output value. The convolution output is a feature map, highlighting specific
patterns in the image.
• Example: Applying an edge-detection filter on an image will highlight the boundaries of
objects, such as outlining a building’s edges in a photograph.
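The sliding-kernel idea can be written out in a few lines of NumPy. The sketch below applies a small Laplacian-style edge-detection kernel to a toy image; the loop is deliberately naive for clarity:

```python
# Manual 2D convolution with a simple edge-detection kernel (NumPy only).
import numpy as np

# A toy grayscale image with a sharp vertical edge between 10s and 80s.
image = np.array([
    [10, 10, 10, 80, 80, 80],
    [10, 10, 10, 80, 80, 80],
    [10, 10, 10, 80, 80, 80],
    [10, 10, 10, 80, 80, 80],
], dtype=float)

# A Laplacian-style kernel: responds strongly where intensity changes sharply.
kernel = np.array([
    [ 0, -1,  0],
    [-1,  4, -1],
    [ 0, -1,  0],
], dtype=float)

kh, kw = kernel.shape
out_h = image.shape[0] - kh + 1
out_w = image.shape[1] - kw + 1
feature_map = np.zeros((out_h, out_w))

# Slide the kernel over the image: multiply overlapping values and sum them.
for i in range(out_h):
    for j in range(out_w):
        feature_map[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()

print(feature_map)  # large magnitudes mark the vertical edge; flat regions give 0
```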