UNIT 1 - COMPUTER VISION BASICS
COMPUTER VISION
Computer Vision is a branch of Artificial Intelligence (AI) that enables computers to
acquire, process, analyze, and understand images or videos, and make decisions or take actions
based on that information.
In short, Computer Vision is the technology that allows machines to gain understanding from
images and videos.
Key Goals of Computer Vision:
Detect and recognize objects
Classify and label images
Track motion in videos
Understand scenes and environments
Objective:
To simulate human vision by enabling machines to:
See (capture images)
Understand (interpret objects, scenes, motion)
Make decisions (based on visual input)
Key Tasks in Computer Vision:
Image classification – Identifying what is in an image
Object detection – Locating and identifying multiple objects
Image segmentation – Dividing an image into meaningful parts
Facial recognition – Identifying individuals from their facial features
Motion tracking – Analyzing movement in video sequences
Applications of Computer Vision:
Autonomous Vehicles – Object and lane detection
Medical Imaging / Healthcare – Analyzing X-rays, MRIs, and scans
Surveillance and Security – Face and activity recognition
Augmented and Virtual Reality
Manufacturing – Quality inspection using cameras
Retail & Marketing – Customer behavior analysis
Related Fields/Disciplines
Artificial Intelligence (AI)
Machine Learning (ML)
Computer Graphics
Image Processing
Robotics
Computer Vision aims to enable machines to perceive, interpret, and understand visual
information from the world. Below are its key goals along with purposes and examples.
Goals of Computer Vision
1. Image Understanding
→ Understand content in an image
Example: Google Photos grouping pictures by person or location
2. Object Recognition & Classification
→ Identify and classify objects
Example: Amazon Go stores recognizing items for checkout
3. Object Detection & Localization
→ Detect objects and their positions
Object Detection is a computer vision technique that identifies and locates objects
within an image or video.
Localization refers to identifying the position of the detected object in the image,
usually by drawing a bounding box around it.
Example: Face detection in mobile phone cameras
4. Scene Reconstruction
→ Create 3D models from 2D images
Example: Augmented Reality (AR) in interior design apps
5. Motion Analysis & Tracking
→ Track moving objects in video
Example: CCTV tracking a person’s movement in real time
6. Image Restoration & Enhancement
→ Improve image quality
Example: AI tools restoring old or blurred photographs
7. Automation & Robotics
→ Help machines interact with surroundings
Example: Self-driving cars detecting roads and obstacles
8. Face & Text Recognition
→ Identify faces or read text in images
Example: Passport scanners at airports, Google Lens for text
Advantages of Computer Vision:
1. Automation and Speed: Processes visual data much faster than humans (e.g., inspection
in factories) and enables real-time decisions in applications like self-driving cars.
2. Accuracy and Consistency: Reduces human error in tasks like medical image analysis
or quality control.
3. Handles Large Volumes of Data: Can process and analyze vast amounts of image or
video data that would be overwhelming for humans.
Disadvantages of Computer Vision:
1. High Initial Cost and Complexity: Requires expensive hardware and large datasets for
training.
2. Limited in Unstructured Environments: Performance may drop in poor lighting,
cluttered scenes, or unfamiliar situations.
3. Privacy Concerns: Widespread surveillance and facial recognition can raise ethical and
legal issues.
IMAGE FORMATION
Image formation is the process of capturing a visual representation of a scene using a
camera or sensor and converting it into a digital image that a computer can process.
Key Steps in Image Formation:
1. Light Reflection from Objects
o Light from a source (like the sun or a bulb) reflects off objects in the scene.
2. Camera Lens Captures Light
o The reflected light passes through a camera lens, which focuses it to form an
image.
3. Projection onto Image Sensor
o The focused light hits a sensor (like CCD or CMOS) in the camera, converting it
into electrical signals.
4. Conversion to Digital Image
o The signals are digitized into pixels — small units that represent brightness and
color.
Example:
When you take a photo of a tree using a smartphone:
o The tree reflects light.
o The phone's lens captures and focuses that light.
o The sensor records the light and produces a digital image.
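A minimal Python sketch of the underlying pinhole camera model, in which a 3D scene point
(X, Y, Z) is projected to a 2D pixel (u, v); the focal length f and image-center values used here
are illustrative assumptions, not values from the text:

# Pinhole projection: perspective division maps 3D points to 2D pixels.
def project_point(X, Y, Z, f=800.0, cx=320.0, cy=240.0):
    # Points farther away (larger Z) project closer to the image center.
    u = f * X / Z + cx
    v = f * Y / Z + cy
    return u, v

# A point 1 m to the right of and 10 m in front of the camera:
print(project_point(1.0, 0.0, 10.0))  # (400.0, 240.0)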
IMAGE CAPTURE
Definition:
Image capture refers to the process of recording the formed image using a sensor and converting
it to a digital format.
Steps/Process:
Analog Signal Generation: The sensor detects light intensity.
Analog-to-Digital Conversion (ADC): Converts analog signals to digital pixel values.
Image Storage: The digital image is stored in memory (as JPG, PNG, etc.).
CCD (Charge-Coupled Device) and CMOS (Complementary Metal-Oxide-Semiconductor)
are image sensors used in cameras to capture light and convert it into digital images.
Example:
A CCTV camera records a video stream in a store, capturing many frames per second and
storing them in a digital video format.
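A minimal NumPy sketch of the analog-to-digital conversion step, assuming (for illustration)
that the sensor's analog output is modeled as intensities between 0.0 and 1.0:

import numpy as np

# Simulated analog sensor readings (light intensities from 0.0 to 1.0).
analog = np.array([[0.00, 0.25],
                   [0.50, 1.00]])

# Quantize to 8-bit digital pixel values (0-255), as an ADC would.
digital = np.round(analog * 255).astype(np.uint8)
print(digital)
# [[  0  64]
#  [128 255]]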
IMAGE REPRESENTATION
Definition:
Image representation is how a digital image is stored and processed in a computer system.
Types:
Grayscale Image: Each pixel has one intensity value (0–255).
Color Image (RGB): Each pixel has three components – Red, Green, Blue.
Binary Image: Pixels are either 0 (black) or 1 (white).
Image as Matrix:
An image is represented as a 2D (grayscale) or 3D (color) matrix of pixels.
Example:
Face recognition systems convert captured facial images into pixel matrices to compare and
identify people.
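A short NumPy sketch showing all three representations as matrices (the pixel values are made
up for illustration):

import numpy as np

# Grayscale image: 2D matrix, one intensity value (0-255) per pixel.
gray = np.array([[0, 128],
                 [255, 64]], dtype=np.uint8)

# Binary image: each pixel becomes 0 (black) or 1 (white) after thresholding.
binary = (gray > 127).astype(np.uint8)

# Color (RGB) image: 3D matrix, three values (R, G, B) per pixel.
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[0, 0] = [255, 0, 0]   # top-left pixel is pure red

print(gray.shape, rgb.shape)  # (2, 2) (2, 2, 3)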
Summary Table:
Concept              | Meaning                                               | Real-Time Example
Image Formation      | Converting a scene into an image using optics         | Taking a photo using a camera
Image Capture        | Digitizing and storing the image                      | CCTV recording a video
Image Representation | Storing the image as a pixel matrix in digital format | Face recognition software processing an image
LINEAR FILTERING, CORRELATION, AND CONVOLUTION
Linear filtering, correlation, and convolution are fundamental operations in image
processing and computer vision. They are used to manipulate or extract features from images.
What is a Kernel?
A kernel (or filter or mask) is a small matrix used in image processing. It is applied to each pixel
of an image to change its value based on its neighbors. Common sizes are 3×3, 5×5, etc.
It moves over each pixel in the image (this is called convolution or correlation).
At each position, it performs a calculation using neighboring pixel values to produce a
new value.
Example: In photo editing apps, when you apply blur or sharpen, the app is using different
kernels behind the scenes.
Example 3×3 Kernel: This kernel averages the pixel values in a 3×3 neighborhood, which is
useful for blurring or smoothing an image.
[1/9 1/9 1/9]
[1/9 1/9 1/9]
[1/9 1/9 1/9]
If you use a 5×5 kernel (25 pixels), each element will be 1/25.
If you use a 2×2 kernel, each element will be 1/4, and so on.
Kernel Size | Number of Pixels | Value of Each Cell
3×3         | 9                | 1/9
5×5         | 25               | 1/25
2×2         | 4                | 1/4
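The pattern in this table can be captured in one line of NumPy; mean_kernel is a hypothetical
helper name used here for illustration:

import numpy as np

# Build an n x n mean (averaging) kernel: every cell equals 1/(n*n).
def mean_kernel(n):
    return np.ones((n, n)) / (n * n)

print(mean_kernel(3))  # each of the 9 cells is 1/9
print(mean_kernel(5))  # each of the 25 cells is 1/25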
Center Pixel Concept
When a kernel is placed over a patch of an image, the pixel at the center of the patch is called the
center pixel. The kernel calculates a new value for this center pixel using its neighbors.
Example image patch:
[10 20 30]
[40 50 60]
[70 80 90]
Here, 50 is the center pixel.
Why only the middle? Because when the kernel slides across the image, we assign the calculated
result to the position of the center pixel in the output image. This ensures the output image size
remains the same.
LINEAR FILTERING
Definition: A process of applying a filter (or kernel) to an image to enhance certain features (like
edges) or reduce noise.
The image is processed by sliding a kernel (small matrix) across it.
Each pixel is updated based on the weighted sum of neighboring pixels.
Use cases:
Noise reduction (e.g., Gaussian blur)
Edge detection
Smoothing
Common Filters:
Mean Filter: Averages surrounding pixels – smoothing
Gaussian Filter: Weighted average – less blurring than mean
Laplacian Filter: Edge enhancement
Formula:
Output(x, y) = Σ Σ [ Image(x+i, y+j) × Kernel(i, j) ]
Example: Mean Blur Kernel (3×3):
[1/9 1/9 1/9]
[1/9 1/9 1/9]
[1/9 1/9 1/9]
Applied to:
[10 20 30]
[40 50 60]
[70 80 90]
Take all 9 pixel values, add them, and divide by 9 – this gives the average, which is why it
blurs or smooths the image:
10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90 = 450
450 / 9 = 50 → new value for the center pixel
So, the center pixel (50) stays the same in this case,
but in real images, this would smooth sharp edges and reduce noise.
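The same arithmetic can be checked with a few lines of NumPy (a minimal sketch of a single
kernel position, not a full filtering pass over an image):

import numpy as np

patch = np.array([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])
kernel = np.ones((3, 3)) / 9   # 3x3 mean blur kernel

# Weighted sum: multiply element-wise, then add everything up.
new_center = np.sum(patch * kernel)
print(new_center)  # 50.0 -> new value for the center pixel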
CORRELATION
Correlation measures similarity between the kernel and the image patch. We slide the kernel over
the image, multiply corresponding values, and sum them. The kernel is NOT flipped.
Formula:
Output(x, y) = Σ Σ [ Image(x+i, y+j) × Kernel(i, j) ]
Example Kernel:
[0 1 0]
[1 -4 1]
[0 1 0]
Image Patch:
[1 2 3]
[4 5 6]
[7 8 9]
Calculation: (1×0)+(2×1)+(3×0)+(4×1)+(5×-4)+(6×1)+(7×0)+(8×1)+(9×0) = 0
In a face detection system, correlation helps to:
Detect eyes or nose by comparing parts of the image with a known pattern.
If a part of the image matches the kernel (like the shape of an eye), the correlation result
is high.
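The hand calculation above can be reproduced with SciPy, assuming SciPy is available;
scipy.ndimage.correlate slides the kernel over the image without flipping it:

import numpy as np
from scipy.ndimage import correlate

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]])

# mode='constant' pads the border with zeros so the output size matches.
result = correlate(image, kernel, mode='constant')
print(result[1, 1])  # 0 -> matches the hand calculation above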
CONVOLUTION
Convolution is similar to correlation but the kernel is flipped horizontally and vertically before
applying. In many cases, if the kernel is symmetric, flipping has no effect.
What Does “Kernel Flipped” Mean?
When we say "flipping a kernel", we mean reversing the kernel in both directions:
1. Flip Horizontally (Left–Right)
Switch columns from left to right.
2. Flip Vertically (Top–Bottom)
Switch rows from top to bottom.
In mobile photo editing apps, convolution is behind effects like sharpen, emboss, or blur.
Formula:
Output(x, y) = Σ Σ [ Image(x+i, y+j) × Kernel(-i, -j) ]
Example:
Kernel before flip:
[0 1 0]
[1 -4 1]
[0 1 0]
After flipping (same in this symmetric case), applying to:
[1 2 3]
[4 5 6]
[7 8 9]
Result = 0.
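A short sketch (assuming SciPy is available) showing both the flip and the fact that it changes
nothing for this symmetric kernel:

import numpy as np
from scipy.signal import convolve2d

kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]])
image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])

# Flipping = reversing rows and columns; a symmetric kernel is unchanged.
print(np.array_equal(kernel, np.flip(kernel)))  # True

# convolve2d flips the kernel internally before the weighted sum.
result = convolve2d(image, kernel, mode='same')
print(result[1, 1])  # 0 -> same as correlation for a symmetric kernel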
General Formula for Correlation / Linear Filtering
The general formula for applying a kernel (size: (2m+1) × (2n+1)) to an image is:
g(x, y) = Σ (i = -m to m) Σ (j = -n to n) [ f(x + i, y + j) × h(i, j) ]
where:
- g(x, y) is the output image pixel value at (x, y)
- f(x + i, y + j) is the input image pixel value at the corresponding position
- h(i, j) is the kernel value at position (i, j)
- m, n define the kernel size (for 3×3 kernel, m = 1, n = 1)
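A direct, unoptimized Python translation of this formula (apply_kernel is a hypothetical name;
border pixels are simply left at zero for brevity):

import numpy as np

def apply_kernel(f, h):
    # For a (2m+1) x (2n+1) kernel, e.g. m = n = 1 for a 3x3 kernel.
    m, n = h.shape[0] // 2, h.shape[1] // 2
    g = np.zeros(f.shape, dtype=float)
    for x in range(m, f.shape[0] - m):
        for y in range(n, f.shape[1] - n):
            # Weighted sum of the neighborhood centered at (x, y).
            g[x, y] = np.sum(f[x - m:x + m + 1, y - n:y + n + 1] * h)
    return g

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]], dtype=float)
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)
print(apply_kernel(image, kernel)[1, 1])  # 0.0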
EDGE DETECTION
What is an Edge?
An edge in an image is a point where the brightness (intensity) of the image changes sharply. It
marks the boundary between two regions, such as between an object and the background.
Why Detect Edges?
Edge detection helps:
Identify object boundaries
Reduce the amount of data
Extract important features for further processing (like object recognition, segmentation,
etc.)
Types of Edges
1. Step Edge – sudden change in intensity.
2. Ramp Edge – gradual change in intensity.
3. Line Edge – bright line on a dark background.
4. Roof Edge – sharp peak (similar to ramp but thinner).
Common Edge Detection Operators
Operator | Description
Sobel    | Uses gradient magnitude in horizontal and vertical directions.
Prewitt  | Similar to Sobel but with simpler masks.
Canny    | Advanced method with noise reduction and edge thinning.
Example: Lane Detection in Self-Driving Cars
The car's camera captures road images.
Edge detection highlights white lane markings.
Helps the car stay in the correct lane.
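A minimal OpenCV sketch of this pipeline, assuming OpenCV (cv2) is installed; the file name
'road.jpg' and the Canny thresholds used here are illustrative assumptions:

import cv2

# Load a road image in grayscale ('road.jpg' is a placeholder path).
image = cv2.imread('road.jpg', cv2.IMREAD_GRAYSCALE)

# Reduce noise first, then detect edges (Canny thins edges internally).
blurred = cv2.GaussianBlur(image, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)

# Lane markings now appear as bright edge lines in the output.
cv2.imwrite('edges.jpg', edges)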