
Abstract

This study compares three YOLOv5 implementations—Standard, Multiprocessing, and
Multithreading—in a GPU-enabled Google Colab environment. Evaluated metrics include
FPS, inference time, CPU/GPU usage, and memory usage. Multithreading achieved the
highest FPS (55.28) and lowest inference time (18.09 ms), while Multiprocessing
underperformed (1.85 FPS, 540.01 ms). The results highlight trade-offs, recommending
Multithreading for speed and Standard for resource efficiency. Practical details and a
replicable framework are provided.

YOLOv5 Parallel Processing Implementation
Comparison

Zawad Ishmam Hriddo


BRAC University

May 2025

1 Introduction
YOLOv5, developed by Ultralytics (YOLOv5 Repository), is a state-of-the-art object
detection model renowned for its balance of speed and accuracy, yet its performance under
parallel processing remains underexplored. It is widely used in real-time applications
such as autonomous driving, surveillance, and robotics. As computational
demands for such tasks increase, optimizing inference through parallelization techniques is
critical. This research compares three implementations of YOLOv5: a standard sequential
approach, a multiprocessing approach leveraging multiple CPU cores, and a multithreading
approach for concurrent processing. The study evaluates these implementations based on per-
formance metrics including Frames Per Second (FPS), average inference time, CPU usage,
memory usage, GPU usage, and GPU memory usage. Conducted in a GPU-enabled Google
Colab environment, the findings provide insights into the trade-offs of each method, guiding
developers in selecting the most suitable approach for specific use cases.

The research also incorporates practical implementation details, including environment setup,
data preparation, and empirical results from a Jupyter Notebook executed in Google Colab.
These details enhance the study by offering a replicable framework for developers to test and
extend the findings.

2 Methodologies
2.1 Standard YOLOv5
The standard implementation serves as the baseline, processing images sequentially, often in
batches, to leverage PyTorch’s built-in optimizations and GPU acceleration. This approach is
straightforward, relying on efficient batching to maximize GPU throughput. However, it may
underutilize CPU resources during GPU-bound inference, as preprocessing and inference occur
sequentially.
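The baseline loop can be sketched as follows. Here `fake_inference` is a hypothetical stand-in for an actual YOLOv5 forward pass, so the sketch stays runnable without PyTorch; the batching structure is the point.

```python
import time

def fake_inference(batch):
    # Stand-in for model(batch) on the GPU; returns one result per image.
    return [f"detections:{name}" for name in batch]

def run_standard(images, batch_size=1):
    """Process images sequentially in batches, as the baseline does."""
    results = []
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        results.extend(fake_inference(images[i:i + batch_size]))
    elapsed = time.perf_counter() - start
    return results, len(images) / elapsed  # results and overall FPS
```

Because preprocessing and inference share one loop, the CPU idles whenever the GPU is busy, which is the underutilization noted above.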

2.2 YOLOv5 with Multiprocessing
The multiprocessing implementation uses Python's multiprocessing library (Multiprocessing
Documentation) to create multiple processes, each handling a subset of images or pipeline
stages. By bypassing Python's Global Interpreter Lock (GIL), it enables true parallel
computation across CPU cores. Each process loads its own model instance, increasing
memory usage. Challenges include:

• Process startup overhead.

• Memory overhead due to independent model instances.

• Inter-process communication complexity.

• Potential GPU contention when multiple processes access a single GPU.
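The pattern can be sketched with `multiprocessing.Pool`. The model load here is a placeholder so the sketch is self-contained; in the notebook the initializer would perform a real YOLOv5 load, which is exactly where each worker pays the startup and memory cost listed above.

```python
import multiprocessing as mp

_model = None  # one independent instance per worker process

def init_worker():
    # Real code would load YOLOv5 here (e.g. via attempt_load); this call
    # is the per-process startup and memory overhead discussed above.
    global _model
    _model = "yolov5-instance"

def detect(image_path):
    # Runs inside a worker; _model was set by init_worker in that process.
    return image_path, f"detections-from-{_model}"

def run_multiprocessing(images, workers=2):
    with mp.Pool(processes=workers, initializer=init_worker) as pool:
        return pool.map(detect, images)
```

Note that if all workers share one physical GPU, their inference calls still contend for the same device, which is consistent with the low average GPU utilization reported later.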

2.3 YOLOv5 with Multithreading


The multithreading implementation employs Python's threading library (Threading
Documentation) to create threads within a single process, sharing memory and reducing
overhead compared to multiprocessing. While the GIL limits true parallelism for CPU-bound
tasks, multithreading excels in I/O-bound operations, such as image loading and
preprocessing, and can overlap with GPU computations. Key considerations include:

• Lower memory overhead due to shared memory space.

• Faster thread creation compared to processes.

• Simpler data sharing.

• Challenges in managing thread safety and GPU serialization.
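A minimal sketch of this design, with a placeholder model so it runs without PyTorch; the lock models the GPU serialization mentioned in the last bullet, while the I/O-bound stage is where threads actually overlap.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

model = "shared-yolov5-instance"   # a single model shared by all threads
gpu_lock = threading.Lock()        # inference on one GPU is serialized

def load_and_detect(image_path):
    # I/O-bound stage (file read, preprocessing): the GIL is released
    # during I/O, so threads genuinely overlap here.
    tensor = f"preprocessed-{image_path}"
    # GPU-bound stage: guarded so threads submit to the device one at a time.
    with gpu_lock:
        return f"detections-for-{tensor}"

def run_multithreading(images, num_threads=4):
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(load_and_detect, images))
```

Because loading and preprocessing for the next image proceed while the lock holder is on the GPU, the device is fed more continuously than in the sequential baseline.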

3 Experimental Setup
The experiments were conducted in a Google Colab environment with the following specifica-
tions:

• Hardware: NVIDIA T4 GPU (CUDA-enabled), multi-core CPU (e.g., Intel Xeon).

• Software: Python 3.11, PyTorch, YOLOv5 repository (version 7.0), psutil for resource
monitoring, pynvml (pynvml Documentation) for GPU metrics.

• Dataset: 20 images (JPEG or PNG) from the COCO dataset, stored in a test_images
folder.

• Parameters: Batch size of 1 for standard and multithreading implementations; adjusted
for multiprocessing based on CPU core count. All implementations used CUDA for GPU
acceleration.

3.1 Environment Setup
The YOLOv5 environment was configured using the following commands in Google Colab:

!pip install torch torchvision opencv-python matplotlib psutil
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
!pip install -r requirements.txt
%cd ..

These commands installed necessary libraries and cloned the YOLOv5 repository, ensuring
compatibility with the experimental setup.

3.2 Data Preparation


Test images were sourced from the COCO dataset (COCO Dataset) using:

!wget https://github.com/ultralytics/yolov5/releases/download/v1.0/coco128.zip
!unzip coco128.zip

The images/train2017 folder was used, with 20 images selected to create a manageable
test set for rapid experimentation. This small dataset size facilitated quick evaluation but may
limit insights into scaling behavior with larger datasets.
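One way to assemble that 20-image subset is sketched below; the test_images folder name and the coco128 path are assumptions about the notebook's layout, matching the unzipped archive above.

```python
import glob
import os
import shutil

# Hypothetical selection step: copy the first 20 training images from the
# unzipped coco128 archive into the test folder used by the experiments.
os.makedirs("test_images", exist_ok=True)
selected = sorted(glob.glob("coco128/images/train2017/*.jpg"))[:20]
for path in selected:
    shutil.copy(path, "test_images")
```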

3.3 Practical Considerations


The implementation included robust error handling, such as checking for CUDA availability
and falling back to CPU if necessary. Additionally, the pynvml library was used to track
GPU metrics, with fallbacks in case of import failures, ensuring adaptability across different
environments.
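A sketch of that fallback logic, assuming the same libraries; the zero-returning fallback is illustrative rather than the notebook's exact code, but the shape (try to initialize, degrade gracefully) matches what is described above.

```python
# Pick a device with CPU fallback, and expose GPU metrics only when pynvml
# initializes successfully.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # torch absent; inference itself would also need it

try:
    import pynvml
    pynvml.nvmlInit()
    _handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    def gpu_stats():
        """Return (GPU utilization %, GPU memory used in MB)."""
        util = pynvml.nvmlDeviceGetUtilizationRates(_handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(_handle)
        return util.gpu, mem.used / 1024 ** 2
except Exception:  # pynvml missing or no NVIDIA driver available
    def gpu_stats():
        return 0.0, 0.0
```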

4 Performance Metrics
The following metrics were evaluated:

• Frames Per Second (FPS): Number of images processed per second (higher is better).

• Average Inference Time: Time to process one image, measured in milliseconds (lower
is better).

• CPU Usage: Average and maximum CPU utilization (%).

• Memory Usage: Average and maximum system memory used (MB).

• GPU Usage: Average and maximum GPU utilization (%), reflecting CUDA efficiency.

• GPU Memory Usage: Average and maximum GPU memory used (MB).

Metrics were collected using psutil for CPU and memory, and pynvml for GPU metrics,
ensuring accurate resource monitoring during inference.
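The two headline metrics, FPS and average inference time, can be derived from one timing loop; a minimal harness is sketched below with a dummy 2 ms workload standing in for real inference.

```python
import time

def benchmark(infer, images):
    """Time per-image inference; derive FPS and mean latency in ms."""
    start = time.perf_counter()
    for image in images:
        infer(image)
    total = time.perf_counter() - start
    n = len(images)
    return {"fps": n / total, "avg_inference_ms": 1000.0 * total / n}

# Dummy workload: each "inference" sleeps for 2 ms.
stats = benchmark(lambda image: time.sleep(0.002),
                  [f"img{i}.jpg" for i in range(10)])
```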

5 Results
The empirical results from running the implementations on 20 images are summarized in the
following table:

Table 1: Performance Metrics for YOLOv5 Implementations

Metric                          Standard   Multiprocessing   Multithreading
FPS                                28.68              1.85            55.28
Avg Inference Time (ms)            34.87            540.01            18.09
Avg CPU Usage (%)                   67.2              99.6             67.7
Max CPU Usage (%)                   99.9             100.0            100.0
Avg Memory Usage (MB)             1234.6            1324.8           1405.3
Max Memory Usage (MB)             1316.4            1325.5           1411.7
Avg GPU Usage (%)                   19.4               3.2             26.1
Max GPU Usage (%)                   28.0              56.0             94.0
Avg GPU Memory Usage (MB)          622.3             851.7            732.7
Max GPU Memory Usage (MB)          685.9            1545.9            735.9

As shown in Table 1 and Figure 1, these results were visualized using line graphs generated
by a plot_comparisons function, which plotted FPS, CPU usage, memory usage, GPU
usage, average inference time, and GPU memory usage. The line graphs, with distinct colors
and markers for each metric, provided a clear comparison of performance trends across
implementations.
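A hypothetical reconstruction of a plot_comparisons-style helper is shown below, using two of the metrics from Table 1; matplotlib is imported lazily so the sketch degrades gracefully where it is not installed, and the Agg backend suits headless Colab scripts.

```python
results = {
    "Standard":        {"FPS": 28.68, "Avg Inference Time (ms)": 34.87},
    "Multiprocessing": {"FPS": 1.85,  "Avg Inference Time (ms)": 540.01},
    "Multithreading":  {"FPS": 55.28, "Avg Inference Time (ms)": 18.09},
}

def plot_comparisons(results, out_path="comparison.png"):
    try:
        import matplotlib
        matplotlib.use("Agg")  # headless backend
        import matplotlib.pyplot as plt
    except ImportError:
        return None  # plotting unavailable in this environment
    metrics = list(next(iter(results.values())))
    fig, axes = plt.subplots(1, len(metrics), figsize=(5 * len(metrics), 4))
    for ax, metric in zip(axes, metrics):
        names = list(results)
        ax.plot(names, [results[n][metric] for n in names], marker="o")
        ax.set_title(metric)
    fig.tight_layout()
    fig.savefig(out_path)
    return out_path
```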

6 Discussion
6.1 Performance Analysis
• FPS and Inference Time: Multithreading significantly outperforms others, achieving
55.28 FPS and an 18.09 ms inference time. This suggests efficient overlapping of CPU
and GPU tasks, maximizing throughput. The Standard implementation (28.68 FPS, 34.87
ms) is a reliable baseline but is surpassed by Multithreading. Multiprocessing performs
poorly (1.85 FPS, 540.01 ms), likely due to process creation overhead and model loading
in each process.

• CPU Usage: Multiprocessing has the highest average CPU usage (99.6%), indicating
full CPU core utilization but at the cost of efficiency. Multithreading and Standard have
similar average usage (~67%) but reach 100% maximum usage, reflecting occasional CPU
spikes.

• Memory Usage: Multithreading uses the most memory (1405.3 MB avg), followed by
Multiprocessing (1324.8 MB) and Standard (1234.6 MB). Multithreading’s shared mem-
ory model is efficient but still incurs higher usage due to concurrent operations.

• GPU Usage: Multithreading maximizes GPU utilization (26.1% avg, 94.0% max), compared
to Standard (19.4% avg, 28.0% max) and Multiprocessing (3.2% avg, 56.0% max).
Multiprocessing's low average GPU usage indicates inefficient GPU sharing.

• GPU Memory Usage: Multiprocessing has the highest maximum GPU memory usage
(1545.9 MB), likely due to multiple model instances. Multithreading (732.7 MB avg)
and Standard (622.3 MB avg) are more memory-efficient.

Figure 1: Performance Comparison of YOLOv5 Implementations: Line graphs showing FPS,
CPU Usage, Memory Usage, GPU Usage, Average Inference Time, and GPU Memory Usage
for Standard, Multiprocessing, and Multithreading methods.

6.2 Key Insights


• Multithreading is ideal for GPU-bound tasks, leveraging CUDA effectively by overlapping
I/O and inference.

• Multiprocessing is less effective in this GPU setup due to overhead and GPU contention,
but it may perform better in CPU-only environments.

• The Standard implementation balances performance and resource usage, suitable for
resource-constrained systems.

6.3 Practical Implementation
The Jupyter Notebook provided complete Python code for each implementation, including
functions for:

• Model Loading: Using attempt_load from YOLOv5's models.experimental.

• Image Preprocessing: A custom letterbox function to resize and pad images to
640x640.

• Parallel Processing: run_standard_yolov5, run_multiprocessing_yolov5,
and run_multithreading_yolov5 functions, with specific handling for batch
processing, process pools, and thread queues.

The code included robust error handling, such as checking for CUDA availability and handling
missing pynvml, making it adaptable to various environments.
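The notebook's letterbox is OpenCV-based, but the scale-and-pad arithmetic it performs can be sketched without image I/O. This is a simplified square-pad version; the real YOLOv5 letterbox additionally rounds padding to stride multiples.

```python
def letterbox_params(h, w, new_shape=640):
    """Compute the scale and padding a letterbox resize would apply:
    scale to fit inside new_shape x new_shape preserving aspect ratio,
    then pad symmetrically to fill the square."""
    r = min(new_shape / h, new_shape / w)        # uniform scale factor
    new_h, new_w = round(h * r), round(w * r)    # resized dimensions
    pad_h, pad_w = new_shape - new_h, new_shape - new_w
    top, left = pad_h // 2, pad_w // 2
    # Returns scale, (width, height) after resize, and (top, bottom,
    # left, right) padding in pixels.
    return r, (new_w, new_h), (top, pad_h - top, left, pad_w - left)
```

For a 480x640 input, for example, the image is left at scale 1.0 and padded with 80 pixels above and below to reach 640x640.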

6.4 Limitations
• The results are specific to a GPU environment with a small dataset (20 images). Larger
datasets or CPU-only setups may yield different outcomes.

• The batch size and number of threads/processes were not extensively tuned, which could
further optimize performance.

• The single-GPU setup (NVIDIA T4) limits insights into multi-GPU scenarios.

7 Conclusion
This research highlights Multithreading as the most effective YOLOv5 implementation in a
GPU-enabled environment, offering superior FPS (55.28) and low inference time (18.09 ms)
due to efficient GPU utilization. However, it demands higher memory and CPU resources.
The Standard implementation provides a balanced alternative, while Multiprocessing
underperforms due to significant overhead. For real-time applications prioritizing speed,
Multithreading is recommended. In resource-constrained or CPU-only scenarios, the Standard
or Multiprocessing approaches may be more appropriate, depending on specific requirements.
Developers are encouraged to prioritize Multithreading for GPU-intensive tasks and explore
hybrid approaches in future studies.

Future work could explore larger datasets, multi-GPU setups, and parameter tuning to refine
these findings. The practical implementation details and empirical results from the Jupyter
Notebook provide a replicable framework for further experimentation.

8 References
• Ultralytics YOLOv5 Repository for Object Detection

• Python Multiprocessing Library Documentation

• Python Threading Library Documentation

• pynvml Library for NVIDIA GPU Monitoring

• COCO Dataset for Computer Vision Tasks
