Project Report
A Project Report
of
by
ACKNOWLEDGEMENT
Words cannot express the gratitude I have for my trainers, Mr. Aditya Prashant Ardak and
the Edunet Team. They were central to the help and guidance I received while carrying out
my project, Human Pose Estimation using Machine Learning. The knowledge Mr. Ardak shared
in machine learning and computer vision considerably deepened my own understanding of the
field. His guidance was especially helpful because he explained complex topics in a simple
way and gave direction on the research.
I also thank Mr. Aditya Prashant Ardak and the entire Edunet Team for their uninterrupted
support throughout the Human Pose Estimation using Machine Learning project. It would not
have been possible to complete the work without their guidance and help. Mr. Ardak's
guidance also kept me on track for the duration of the project.
This project has been an enriching experience, and without the support of Mr. Ardak and
the Edunet Team, I do not think I would have made it this far. Their commitment to
teaching and developing their students truly made a difference to my learning experience.
I look forward to applying the skills and techniques learned under their tutelage.
Finally, I would like to thank my trainers and the Edunet Team for the continuous
motivation they have provided. Their professionalism and devotion to their students'
success have made a lasting impression on me. Their time and attention made a major
contribution to the success of this project, for which I am deeply and sincerely
appreciative.
ABSTRACT
Human pose estimation, often shortened to pose estimation, is the process of locating and
tracking human body poses in images or videos. It has applications in fields such as
computer vision, robotics, and sports analysis. The main aim of this project was to train
an efficient machine learning model that can accurately estimate human pose in real time
from visual input. The complexity of human movement, the variety of body shapes, and
changing environmental conditions make pose estimation an interesting and difficult
problem. To address it, this project uses convolutional neural networks and deep learning
models, which have proven effective at extracting spatial features from images. The model
was trained on a large dataset of annotated human pose landmarks. A key element was using
pre-trained models such as OpenPose and PoseNet and fine-tuning them for specific
pose-estimation tasks. The model was evaluated for accuracy, robustness, and speed using
both qualitative and quantitative metrics. The results showed a significant improvement in
pose accuracy, including for key body joints such as the elbows, knees, and wrists, even
under difficult occlusions and varying poses. The model also achieved near real-time
performance, which makes it suitable for live applications. This project shows that
machine learning techniques are promising for solving the problem of human pose
estimation. Potential applications of this work include gesture recognition, augmented
reality, and human-computer interaction in real-time systems. Future work will continue to
research and optimize the model to broaden its range of applications.
TABLE OF CONTENTS
Abstract
CHAPTER 1
Introduction
1.1 Problem Statement:
Human Pose Estimation (HPE) is the process of identifying and classifying the positions
of limbs, joints, and other prominent body landmarks in an image or a video. The
technology is key to domains such as healthcare, sports analytics, entertainment,
human-computer interaction, and security. The ultimate challenge is to develop a robust
system that can determine the pose of any person in any real-world scenario, where
results are strongly affected by environmental conditions and by how people move.
The problem this project addresses is estimating diverse and dynamic human poses
reliably and efficiently. Human poses are complex and dynamic because of differences in
body shape, posture, and movement. These variations often create occlusions, where parts
of the body are hidden from the camera, or ambiguous poses, where two or more different
poses look very similar on visual inspection. Background clutter, lighting conditions,
and the requirement for real-time processing add to the complexity of the problem. For
example, a person can be partly or totally occluded by objects, or can be posed at an
unusual angle with respect to the camera. Such scenarios make it difficult for
traditional computer vision techniques to recover the pose precisely without more
advanced models.
Human pose estimation is important due to its wide range of applicability across
several industries. In healthcare, accurate pose estimation can be helpful in physical
therapy, monitoring the movements of patients, and helping them in rehabilitation.
In sports analytics, it can be used to track and analyze athletes' movements to
optimize performance and reduce risks of injury. Moreover, in entertainment and
gaming, pose estimation can contribute to creating more immersive and interactive
experiences by enabling gesture recognition and motion capture for virtual
characters. In security and surveillance, it could be applied to detecting unusual
behavior, identifying individuals, and monitoring activities in crowds.
The problem is even more significant in human-computer interaction (HCI), where human
gestures and body language play crucial roles in interacting intuitively with devices.
For more advanced interfaces, such as augmented reality (AR) and virtual reality (VR),
human pose estimation is fundamental to an effective user experience. Further
applications of precise human pose estimation include autonomous systems, such as robots
or self-driving vehicles, that need to perceive people and navigate human environments
safely.
This project has focused on developing an advanced human pose estimation model
using state-of-the-art machine learning techniques for optimizing accuracy and
efficiency under real-world conditions. The project aims to improve the methods in
handling variability in human pose, addressing occlusion, and dealing with
environmental conditions, providing a solution with the potential to transform
industries and their applications.
1.2 Motivation:
The motivation behind this project is the increasing importance of human pose
estimation in various fields, which is driven by the advancement of computer vision
and artificial intelligence. Human pose estimation allows machines to understand and
interpret human body movements, making it a very important technology for
applications in healthcare, sports, entertainment, and human-computer interaction.
and impactful. This project aims to contribute an efficient and precise pose estimation
solution that can help industries such as medicine and entertainment, among others,
deliver more engaging and accessible human experiences.
1.3 Objective:
This project aims to build a robust and accurate Human Pose Estimation (HPE) model that
applies machine learning algorithms to detect and analyze human body movements in real
time from visual data such as images or videos. The specific objectives of the project
are outlined below:
Dataset Preparation and Model Training: To train the model effectively, a large,
diverse dataset containing annotated human pose landmarks is used. The dataset helps the
model learn and generalize across different poses, occlusions, and environmental
conditions.
The model will be evaluated with various performance metrics such as accuracy,
robustness, and speed. Special focus will be placed on improving the model's
capability to handle challenges like occlusions, varying body shapes, and different
viewing angles.
Demonstration of application: Lastly, the project intends to show how the model can be
applied in practice in healthcare, sports analytics, and virtual reality, illustrating
its potential to transform these industries.
Scope: This project develops a Human Pose Estimation model using machine learning to
detect and track human body poses from visual data, including images and videos. Primary
scope includes the following:
Human Pose Estimation: Identify and track important body landmarks like joints (for
example, elbows, knees, wrists) and limbs in both static and dynamic environments.
Machine Learning Integration: The work uses deep learning techniques, such as CNNs, to
improve the accuracy of pose prediction. It further fine-tunes the pre-trained models,
OpenPose and PoseNet, to improve performance.
Real-Time Processing: One of the primary goals is for the model to perform pose estimation
in real time, which is useful for applications in healthcare, sports, gaming, and
human-computer interaction.
Evaluation and Optimization: The project will evaluate the performance of the model in
terms of accuracy, speed, and robustness, especially under changing conditions such as
occlusions or different body angles.
Dataset Limitations: The quality and diversity of the dataset used to train the model limit its
performance. Limited datasets with insufficient representations of various body types,
movements, and environmental conditions can impair the generalization of the model.
Occlusion and Viewpoint Variations: The model may fail in situations involving
occlusions, for example where parts of the body are not visible, and under extreme
variations of body posture and viewpoint, which reduce accuracy.
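The accuracy evaluation named in the scope can be made concrete with a keypoint metric. The report does not say which metric was used, so the sketch below assumes one common choice, the Percentage of Correct Keypoints (PCK): a predicted keypoint counts as correct when it lies within a fraction `alpha` of a reference length (for example a torso or head size) from the annotated position. The function name, the sample coordinates, and the threshold values are illustrative assumptions.

```python
import math


def pck(predicted, ground_truth, reference_length, alpha=0.5):
    """Fraction of keypoints predicted within alpha * reference_length of the truth."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        return 0.0
    threshold = alpha * reference_length
    correct = 0
    for (px, py), (gx, gy) in zip(predicted, ground_truth):
        # Euclidean distance between the predicted and annotated keypoint
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / len(predicted)


# Hypothetical pixel coordinates, for illustration only
truth = [(100.0, 50.0), (120.0, 90.0), (80.0, 90.0), (100.0, 200.0)]
preds = [(102.0, 51.0), (121.0, 95.0), (60.0, 70.0), (100.0, 204.0)]

score = pck(preds, truth, reference_length=20.0, alpha=0.5)
print(f"PCK@0.5 = {score:.2f}")
```

Here three of the four predictions fall within the 10-pixel threshold, so the score is 0.75. Reporting the score per joint, or separately for occluded and visible joints, would show where the model is weakest.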
CHAPTER 2
Literature Survey
Samkari, E., Arif, M., Alghamdi, M., & Al Ghamdi, M. A. (2023). Human pose
estimation using deep learning: a systematic literature review. Machine Learning and
Knowledge Extraction, 5(4), 1612-1659.
Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., ... & Shah, M. (2023). Deep
learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1), 1-37.
Gupta, A., Gupta, K., Gupta, K., & Gupta, K. (2021). Human Activity Recognition Using
Pose Estimation and Machine Learning Algorithm. In ISIC (Vol. 21, pp. 25-27).
person settings. OpenPose works in stages: the network first predicts part confidence
maps together with part affinity fields (PAFs), and later stages refine these predictions
to produce the final joint locations. OpenPose has been a breakthrough in real-time pose
estimation, enabling efficient and accurate detection of keypoints even in challenging
scenarios like occlusions or varying poses. It is also capable of detecting facial
landmarks and hand poses, making it a comprehensive framework.
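To illustrate how part confidence maps turn into joint locations, the following minimal sketch picks the highest-scoring cell of each map and discards peaks below a confidence threshold. It is a single-person simplification written for this report, not OpenPose's actual decoding, which also groups joints into people. The function names, the threshold, and the toy 3x3 maps are assumptions.

```python
def heatmap_peak(heatmap):
    """Return (x, y, confidence) of the highest-scoring cell in a 2D confidence map."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def keypoints_from_heatmaps(heatmaps, min_confidence=0.3):
    """Decode one keypoint per map; use None when the peak is too weak."""
    keypoints = []
    for heatmap in heatmaps:
        x, y, score = heatmap_peak(heatmap)
        keypoints.append((x, y) if score >= min_confidence else None)
    return keypoints


# Toy 3x3 confidence maps (rows are y, columns are x), for illustration only
nose_map = [
    [0.0, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.1, 0.0],
]
wrist_map = [
    [0.05, 0.1, 0.0],
    [0.0, 0.1, 0.2],
    [0.0, 0.0, 0.1],
]

result = keypoints_from_heatmaps([nose_map, wrist_map])
print(result)
```

The nose map has a clear peak at column 1, row 1, while the strongest wrist response (0.2) is below the threshold, so that joint is reported as undetected, which is how an occluded joint would typically appear.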
detection tasks. HRNet is particularly effective for fine-grained details and has been
recognized as one of the state-of-the-art methods.
2.3 Gaps or Limitations in Existing Solutions and How This Project
Addresses Them
Despite the substantial progress in Human Pose Estimation made by models like OpenPose,
PoseNet, and HRNet, many gaps and limitations still prevent these solutions from being
widely deployed in real applications and from reaching better performance. The key
limitations of current solutions, and how this project addresses them, are as follows:
One of the greatest challenges in human pose estimation is identifying accurate keypoints
for all parts of the body under partial occlusion, where body parts are covered, or when
multiple people appear in a frame. Current state-of-the-art models such as OpenPose and
AlphaPose handle such situations reasonably well but perform poorly with heavily occluded
people and overlapping poses in crowded environments.
Existing Limitation: While models like PoseNet offer real-time pose estimation, they
sacrifice some level of accuracy, especially in challenging conditions such as extreme body
angles, varying poses, or different lighting conditions. OpenPose, although more accurate,
requires significant computational resources and is not ideal for real-time applications on
resource-constrained devices.
Project Contribution: The objective of this project is to balance accuracy and real-time
performance. Model optimization, mainly through model pruning and transfer learning, will
be used to speed up processing with minimal loss of accuracy and to increase
computational efficiency, especially for real-time or mobile applications.
2.3.3 Generalization to Diverse Human Poses and Body Types
Existing Limitation: Many existing models struggle to generalize well across diverse
human body types, ages, and poses. For instance, models trained on limited datasets might
not perform well with unusual poses or on datasets that contain people of various body
shapes, ethnicities, or in non-ideal conditions (e.g., poor lighting, low-quality video).
Project Contribution: This project will focus on generalizing the model by using
diversified and extensive datasets for training. In addition, data augmentation techniques
will be applied to increase variability and robustness in model performance, making sure
that the system can handle different body types, movements, and challenging
environmental conditions.
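As one example of the data augmentation mentioned above, the sketch below mirrors a set of annotated keypoints horizontally. The report does not list which augmentations were applied, so horizontal flipping is an assumed, commonly used choice. The detail that matters is that left and right joint labels must be swapped after mirroring. The keypoint names and coordinates are hypothetical.

```python
# Hypothetical left/right keypoint name pairs, for illustration only
LEFT_RIGHT_PAIRS = [
    ("left_shoulder", "right_shoulder"),
    ("left_elbow", "right_elbow"),
    ("left_wrist", "right_wrist"),
]


def flip_keypoints(keypoints, image_width, pairs=LEFT_RIGHT_PAIRS):
    """Mirror keypoints horizontally and swap left/right labels."""
    swap = {}
    for left, right in pairs:
        swap[left] = right
        swap[right] = left
    flipped = {}
    for name, (x, y) in keypoints.items():
        # Pixel column x maps to (width - 1 - x) in the mirrored image
        flipped[swap.get(name, name)] = (image_width - 1 - x, y)
    return flipped


annotation = {"nose": (50, 10), "left_wrist": (20, 60), "right_wrist": (70, 62)}
mirrored = flip_keypoints(annotation, image_width=100)
print(mirrored)
```

The same transform has to be applied to the image itself so that pixels and labels stay aligned. Skipping the label swap would teach the model that a left wrist can appear on either side of the body.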
Project Contribution: This project improves the scalability of multi-person pose
estimation. By using part affinity fields (PAFs) and optimizing the network to detect the
poses of multiple individuals at once, the system should handle dense crowds more
effectively and provide better tracking of multiple people.
Project Contribution: This project will focus on reducing latency by optimizing the
architecture for faster inference without losing too much accuracy. Techniques such as
model quantization, knowledge distillation, and backend optimization are possible ways to
make pose estimation feasible in real-time applications for AR/VR and robotics.
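To show what model quantization does numerically, here is a minimal sketch of uniform affine quantization of a list of weights to 8-bit integers and back. It illustrates the arithmetic only, under assumed function names and sample weights. It is not the quantization pipeline of any particular framework, and the report does not state which tooling would be used.

```python
def quantize(weights, num_bits=8):
    """Map float weights to signed integers using a shared scale and zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Extend the range to include zero so that 0.0 stays representable
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = round(qmin - w_min / scale)
    quantized = [
        max(qmin, min(qmax, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from their integer representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical layer weights, for illustration only
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q_weights, scale, zero_point = quantize(weights)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, f"max error = {max_error:.4f}")
```

Each weight is stored in one byte instead of four, and the reconstruction error stays within one quantization step. This is the trade-off behind the phrase "without losing too much accuracy".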
Existing Limitation: Most of the existing pose estimation models require large
computational resources for training and inference. This makes them challenging to deploy
in low-resource environments such as mobile devices or edge computing platforms.
CHAPTER 3
Proposed Methodology
The image shows a man standing upright in a neutral pose, probably taken in a
controlled environment. In the image, the human figure is well defined, with
various body parts such as the head, shoulders, elbows, wrists, hips, knees, and
ankles forming the key points that are essential for pose estimation. The algorithm,
therefore, has managed to track the position of each of these key landmarks and,
hence, outline the human body's skeletal structure using visual markers such as dots
or lines at each joint position.
For pose estimation, this is an ideal case: the body is fully visible with no
occlusions, so the algorithm can predict the positions of all major joints and limbs.
The pose estimation system likely uses deep learning approaches, such as CNNs, to
identify the posture of the person and then translate it into a digital skeleton
representation. The accuracy of the system can be seen in the placement of joints and
limbs: each keypoint, such as the nose, elbows, knees, and wrists, is correctly placed
and connected to form the complete skeleton.
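The step of connecting keypoints into a skeleton can be sketched as follows. Given detected keypoints with confidence scores and a list of joint pairs, the function returns the line segments to draw and skips any joint that is missing or weakly detected. The joint names, the edge list, the threshold, and the coordinates are assumptions made for illustration, not the layout used by the project's model.

```python
# Hypothetical skeleton connections (pairs of joint names), for illustration only
SKELETON_EDGES = [
    ("left_shoulder", "right_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]


def skeleton_segments(keypoints, edges=SKELETON_EDGES, min_confidence=0.2):
    """Return ((x1, y1), (x2, y2)) segments for edges whose two joints are reliable."""
    segments = []
    for joint_a, joint_b in edges:
        a = keypoints.get(joint_a)
        b = keypoints.get(joint_b)
        if a is None or b is None:
            continue  # joint was not detected at all
        (ax, ay, a_conf), (bx, by, b_conf) = a, b
        if a_conf < min_confidence or b_conf < min_confidence:
            continue  # joint detected, but too uncertain to draw
        segments.append(((ax, ay), (bx, by)))
    return segments


# Hypothetical detections as (x, y, confidence); right_wrist is missing
detected = {
    "left_shoulder": (60, 40, 0.95),
    "right_shoulder": (30, 40, 0.93),
    "left_elbow": (70, 70, 0.88),
    "left_wrist": (72, 100, 0.10),
    "right_elbow": (20, 70, 0.90),
}

segments = skeleton_segments(detected)
print(segments)
```

Each returned segment can then be rendered on the image, for example with OpenCV's `cv2.line` for limbs and `cv2.circle` for joints.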
The detected pose reflects alignment and symmetry in the body. Because the subject is
standing, the pose can be considered neutral: the limbs are relaxed and the body weight is
evenly distributed. Such a pose is useful in a wide variety of applications, including
posture analysis in physical therapy, surveillance, and body alignment analysis in sports.
The model tracked the subject's pose with a high degree of accuracy, which demonstrates
the effectiveness of a machine learning system in identifying human body keypoints, with
accurate bounding boxes, for such a static pose.
1. CPU (Processor)
Recommended: Intel Core i5 or i7 (or equivalent AMD Ryzen)
o For image processing and running machine learning models (especially when
using frameworks like OpenCV), a multi-core processor helps to efficiently
handle the parallel processing of images.
o Pose estimation models often involve heavy computation, so a multi-
core processor will speed up data manipulation and model inference.
Minimum: Intel Core i3 or equivalent AMD processor
o This can work for lighter tasks or less complex models, but performance may
degrade with larger models or datasets.
2. GPU (Graphics Processing Unit)
Recommended: NVIDIA GPU with at least 6GB VRAM (e.g., NVIDIA GTX 1060,
1660, RTX 2060, or better)
o For deep learning tasks like pose estimation using models like OpenPose,
HRNet, or PoseNet, having a dedicated GPU is crucial for speeding up the
model's training and inference times.
o A CUDA-enabled GPU is necessary to utilize GPU acceleration,
significantly improving performance when using deep learning libraries like
TensorFlow, PyTorch, or OpenCV.
Minimum: NVIDIA GTX 1050 Ti, 4GB VRAM (or equivalent)
o If you're working with smaller models or a pre-trained model (without fine-
tuning), this GPU should still allow you to run pose estimation with decent
performance. However, for large datasets or real-time processing, this might
be slower.
3. RAM (Memory)
Recommended: 16 GB or more
o Pose estimation algorithms, especially when dealing with high-resolution
images or video data, require a good amount of RAM to load and process
data efficiently. For deep learning tasks (model inference or training), more
memory ensures smooth processing and faster performance.
Minimum: 8 GB
o While 8 GB RAM can work for basic image processing tasks, you might
experience slower performance or memory-related issues when working with
more complex models, large datasets, or real-time applications.
o If you plan to store large video datasets or process real-time streams, a larger
SSD would be beneficial.
Minimum: HDD with at least 1 TB (or SSD with 120 GB)
o A traditional HDD might suffice for small datasets or offline processing, but
it will be much slower in data access and may negatively impact overall
performance. If you're working on large datasets, an SSD is highly
recommended.
5. Operating System
Recommended: Linux (Ubuntu or other distributions) or Windows 10 (64-bit)
o Linux is often preferred for machine learning tasks due to better compatibility
with various libraries, packages, and faster overall performance. It also
provides better support for GPU acceleration through CUDA.
o Windows 10 also works well for pose estimation and is compatible with
frameworks like TensorFlow and OpenCV, but Linux can sometimes offer
better performance and ease of use for deep learning models.
Minimum: Windows 10 (64-bit) or macOS
o These operating systems are suitable for development and can support most
machine learning tools, though Linux is generally preferred for training
models and handling large-scale data.
Summary of Average Hardware Requirements:
Component   Recommended                                      Minimum
GPU         NVIDIA GTX 1060, 1660, or RTX 2060 (6GB VRAM)    NVIDIA GTX 1050 Ti (4GB VRAM)
RAM         16 GB or more                                    8 GB
OS          Linux (Ubuntu) or Windows 10 (64-bit)            Windows 10 (64-bit) or macOS
2. Python Libraries
The following Python libraries are essential for the project:
1. opencv_python_headless==4.5.1.48
o OpenCV is used for computer vision tasks such as image reading,
manipulation, and video processing. The headless version is ideal for
environments where no graphical interface is needed.
2. streamlit==0.76.0
o Streamlit enables you to create interactive web applications for data science
projects with minimal effort. It’s useful for visualizing results like the pose
estimation output.
3. numpy==1.18.5
o NumPy is used for numerical computations, particularly for array
manipulations. It’s crucial for handling image data and performing the
necessary mathematical operations for pose estimation.
4. matplotlib==3.3.2
o Matplotlib is a plotting library that allows you to visualize images, graphs,
and results of your pose estimation. It’s essential for displaying the pose
estimation results in a comprehensible way.
5. Pillow==8.1.2
o Pillow is a library for image processing, enabling you to read, edit, and save
images in various formats. It’s useful for image loading and preprocessing
before running pose estimation algorithms.
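Because the libraries above are pinned to exact versions, it can help to confirm that the active environment matches them before running the application. The following sketch uses only the standard library's `importlib.metadata`. The `PINNED` dictionary copies the versions listed above, using the distribution names as they would be passed to pip, while the helper name `check_versions` is an assumption.

```python
from importlib import metadata

# Versions as pinned in this section; names are the pip distribution names
PINNED = {
    "opencv-python-headless": "4.5.1.48",
    "streamlit": "0.76.0",
    "numpy": "1.18.5",
    "matplotlib": "3.3.2",
    "Pillow": "8.1.2",
}


def check_versions(pinned):
    """Compare installed package versions against the pinned ones."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if installed == wanted:
            report[name] = "ok"
        else:
            report[name] = f"mismatch (installed {installed}, pinned {wanted})"
    return report


report = check_versions(PINNED)
for name, status in report.items():
    print(f"{name}: {status}")
```

Running this inside the project's virtual environment prints one line per library. A "mismatch" or "missing" entry points to a package that should be reinstalled at the pinned version.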
o Keras is a high-level neural networks API, written in Python, running on top
of TensorFlow. It simplifies the process of building and training deep
learning models.
ONNX (Optional)
o Open Neural Network Exchange (ONNX) is a format that allows models to
be transferred across different frameworks (e.g., TensorFlow to PyTorch). If
your project involves working with different deep learning frameworks,
ONNX can be beneficial.
4. Package Management
pip (Python Package Installer)
o Use pip to install, upgrade, or remove Python libraries and packages from the
Python Package Index (PyPI).
o Command example: pip install opencv-python-headless numpy streamlit
matplotlib Pillow
virtualenv or conda (Optional)
o virtualenv or conda (Anaconda) helps you create isolated environments for
Python projects. This is useful when you need specific library versions or
avoid conflicts with system-wide packages.
For virtualenv (the commands below use Python's built-in venv module):
o Create a virtual environment: python -m venv your_project_name
o Activate it (Linux/macOS): source your_project_name/bin/activate
o Activate it (Windows): your_project_name\Scripts\activate
For conda:
o Create a new environment:
conda create -n your_project_name python=3.8
o Activate it: conda activate your_project_name
5. Additional Tools & Dependencies
CUDA & cuDNN (for NVIDIA GPUs)
o If you're using an NVIDIA GPU for acceleration, you will need to install
CUDA and cuDNN to take advantage of GPU computing. These are essential
for running models efficiently in frameworks like TensorFlow or PyTorch.
o CUDA: A parallel computing platform and programming model that enables
software to use GPU hardware for general-purpose computing.
o cuDNN: A GPU-accelerated library for deep neural networks, useful for
faster training and inference of models.
These can be installed by following the official guidelines provided by NVIDIA
for setting up CUDA and cuDNN with your chosen deep learning framework
(TensorFlow or PyTorch).
Jupyter Notebook (Optional)
o Jupyter Notebook is an interactive environment where you can run Python
code, visualize outputs, and create documents with code and results together.
This can be useful during the development phase for experimenting with code
and visualizing intermediate results.
6. Operating System
The operating system you use will play a role in determining how you install
and use the above software packages.
Recommended: Linux (Ubuntu, CentOS)
o Linux is the most widely used OS for deep learning tasks because of its
compatibility with deep learning libraries, good package management, and
support for GPU acceleration (CUDA).
Alternative: Windows 10 (64-bit)
o While Windows is perfectly capable of running most software, Linux is
typically preferred for machine learning tasks due to its better support for
certain libraries and GPU frameworks.
macOS
o macOS can also be used, but it may not be as well-suited for GPU-accelerated
deep learning tasks (unless you're using Apple's M1 chip, which has growing
support for machine learning).
CHAPTER 4
Implementation and Result
Fig 1
The image captures a woman running, with her movements tracked in real time using human
pose estimation. The system detects and marks important body joints, such as the head,
shoulders, elbows, hips, knees, and ankles, to map out her running posture. This gives
an accurate representation of her motion, enabling the system to track her dynamic pose
and analyze her gait. Pose estimation on such images can therefore be applied in, for
instance, fitness analysis, motion capture, or health monitoring. The tracked points
follow her fluid movement as she runs.
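One simple quantity used in gait analysis is the angle at a joint, for example the knee angle formed by the hip, knee, and ankle keypoints. The sketch below computes it from 2D coordinates using the dot product. It is an illustrative calculation under assumed coordinates and function names, not a measurement taken from the figure.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("adjacent keypoints must not coincide with the joint")
    cosine = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1]
    cosine = max(-1.0, min(1.0, cosine))
    return math.degrees(math.acos(cosine))


# Hypothetical (x, y) pixel positions of hip, knee, and ankle
bent_knee = joint_angle((0, 0), (0, 10), (10, 10))
straight_leg = joint_angle((0, 0), (0, 10), (0, 20))
print(f"bent knee: {bent_knee:.1f} degrees, straight leg: {straight_leg:.1f} degrees")
```

Tracking this angle frame by frame over a video gives a curve of knee flexion during each stride, which is the kind of signal a gait or running-form analysis would build on.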
Fig 2
The child in the image is dribbling a football, which gives the movements a lively and
energetic feel. The body posture suggests that the child is dribbling with concentration
and enthusiasm. Using pose estimation, body joints such as the feet, knees, hips, and
torso are tracked to analyze stance and motion. This can help show how the child
coordinates balance and movement to prevent falls while playing. This dynamic view of
active play is useful for sports training, injury prevention, or simply understanding
how children move during sports such as football.
Fig 3
The image shows a man running, with his body posture indicating speed and strength.
His legs are in full stride, and his arms are likely swinging to maintain balance. Pose
estimation technology can track the key points of his body, such as his head,
shoulders, elbows, knees, and ankles, to analyze his running form. The data from
these key points can provide insights into his running efficiency, posture, and
biomechanics, helping in areas like athletic performance improvement, injury
prevention, or even providing feedback for optimizing running techniques. The man's
dynamic motion is captured as he speeds ahead.
Fig 4
The image shows a woman standing with her weight evenly distributed between both legs.
Her posture can therefore be interpreted in terms of how calm or stable she appears.
When tracked with pose estimation technology, her key points, from the head to the
wrists, elbows, shoulders, hips, knees, and ankles, are mapped so that her posture and
alignment can be studied further. Such analysis may help in examining body posture or
assessing ergonomic health, and can help detect imbalances or unhealthy postures. This
is a relatively static pose compared with the more dynamic movements tracked in the
other examples.
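A basic alignment check for a standing pose is whether the shoulders and hips are level. The sketch below measures how far the line between a left and right keypoint deviates from horizontal and flags it when the tilt exceeds a tolerance. The 5-degree tolerance, the function names, and the coordinates are assumptions chosen for illustration and are not clinical thresholds.

```python
import math


def tilt_degrees(point_a, point_b):
    """Unsigned angle, in degrees, between the line a-b and the horizontal axis."""
    dx = abs(point_b[0] - point_a[0])
    dy = abs(point_b[1] - point_a[1])
    return math.degrees(math.atan2(dy, dx))


def is_level(point_a, point_b, tolerance_deg=5.0):
    """True when the two keypoints are level within the given tolerance."""
    return tilt_degrees(point_a, point_b) <= tolerance_deg


# Hypothetical (x, y) pixel positions of paired keypoints
shoulder_tilt = tilt_degrees((40, 100), (80, 100))
hip_tilt = tilt_degrees((45, 200), (75, 210))
print(f"shoulders: {shoulder_tilt:.1f} degrees, hips: {hip_tilt:.1f} degrees")
print("hips level:", is_level((45, 200), (75, 210)))
```

In this toy example the shoulders are level while the hips are tilted by about 18 degrees, the kind of imbalance a posture analysis would highlight. The measurement is only meaningful when the camera itself is held level.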
Real-time Performance: The model currently does not work optimally for
real-time applications, especially for video streams. Optimizing the model for
faster inference times or employing lightweight architectures such as
MobileNetV2 or EfficientNet can enable smoother real-time tracking for
mobile devices or edge devices.
Integration with Other Sensors: Combining pose estimation with other sensors, such
as depth cameras or inertial measurement units (IMUs), could improve the accuracy
of spatial information. This would be especially valuable for virtual reality,
fitness, or rehabilitation use cases.
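Before optimizing for the real-time performance discussed above, it helps to measure the current throughput. The sketch below times a per-frame processing function over a batch of frames and reports frames per second. `dummy_pose_model` is a placeholder standing in for the actual model's inference call, and the synthetic frames are likewise hypothetical.

```python
import time


def measure_fps(process_frame, frames, warmup=1):
    """Average frames per second achieved by process_frame over the given frames."""
    # Warm-up runs are excluded so one-time setup cost does not distort the result
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_pose_model(frame):
    """Placeholder for real inference; it only sums the pixel values."""
    return sum(frame)


# Synthetic "frames": 50 flat lists of 1000 pixel values each
frames = [[i % 255 for i in range(1000)] for _ in range(50)]
fps = measure_fps(dummy_pose_model, frames)
print(f"throughput: {fps:.0f} frames per second")
```

Replacing the placeholder with the real inference call, and measuring again after switching to a lighter architecture, gives a direct before-and-after comparison. Around 30 frames per second is a common target for live video.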
The heart of this work is its potential to provide accurate real-time pose estimation.
Deep learning models such as CNNs, together with frameworks like OpenPose or MediaPipe,
are applied to identify the main body landmarks and track movement precisely. This lets
users inspect posture, gestures, and body alignment. In fitness applications, it helps
ensure proper form during exercises, which is crucial for avoiding injuries and
maximizing effectiveness. For an individual undergoing physical therapy, the model can
also keep track of progress and suggest corrections to movements, which plays a
significant role in rehabilitation.
Support for both image and video pose estimation is one of the major contributions of the
project. This versatility makes the system adaptable to various cases, from still-image
analysis of sports actions and artistic poses to dynamic video tracking for surveillance,
sports performance analysis, or virtual training scenarios. The ability to process feeds
in real time enhances its use in live settings such as sports events, fitness classes, or
interactive games.
Its ability to follow the human pose in less controlled environments, such as different
lighting or complex backgrounds, helps explain why this model seems to be robust and
reliable. This aspect of the project broadens the application fields where it can work
efficiently. In addition, the flexibility of the app in accepting both image and video inputs
makes it an accessible tool for a wide range of users, from fitness enthusiasts and athletes
to healthcare providers and developers in need of pose data.
By showing the power of AI and deep learning in understanding and interpreting human
posture, this tool contributes meaningfully to practical applications in healthcare,
fitness, entertainment, and many other industries. By improving human-computer
interaction, it has the potential to advance personalized health, sports, and
interactive technologies.
REFERENCES
[1] Samkari, E., Arif, M., Alghamdi, M., & Al Ghamdi, M. A. (2023). Human pose
estimation using deep learning: a systematic literature review. Machine Learning and
Knowledge Extraction, 5(4), 1612-1659.
[2] Zheng, C., Wu, W., Chen, C., Yang, T., Zhu, S., Shen, J., ... & Shah, M. (2023). Deep
learning-based human pose estimation: A survey. ACM Computing Surveys, 56(1), 1-37.
[3] Gupta, A., Gupta, K., Gupta, K., & Gupta, K. (2021). Human Activity Recognition Using
Pose Estimation and Machine Learning Algorithm. In ISIC (Vol. 21, pp. 25-27).