Human Activity Recognition
ABSTRACT
Various benchmarking datasets and their properties are explored. The experimental
evaluations of the surveyed papers are compared using performance metrics such as
precision, recall, and accuracy. Human action recognition can be made more reliable
without manual annotation of the relevant portion of the action of interest. This paper
not only presents an update extending previous related surveys, but also focuses on a
joint learning framework that identifies the temporal and spatial extent of actions in
videos.
CONTENT
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
ABOUT PROJECT
DOMAIN INTRODUCTION
PROBLEM STATEMENT
OBJECTIVES
SCOPE OF THE PROJECT
2. LITERATURE SURVEY
TECHNOLOGY
OPENCV
2.2. CONVOLUTIONAL NEURAL NETWORK
3. REQUIREMENT ANALYSIS
3.1. FUNCTIONAL REQUIREMENTS
4. DESIGN
4.3. FLOWCHART
SEQUENCE DIAGRAM
STATE DIAGRAM
5. IMPLEMENTATION
DATA REPRESENTATION
PRE-PROCESSING
SEGMENTATION
FEATURE EXTRACTION
BUILDING MODEL
6. TESTING
TYPES OF TESTING
TEST CASES
7. SNAPSHOT
8. CONCLUSION
9. REFERENCES
LIST OF FIGURES
Fig. No Figure Description
2.1 Training Error
2.2 Architecture Layer
3.1 OpenCV
3.2 Anaconda UI
4.3 Flowchart
7.5 Output 1
7.6 Output 2
LIST OF TABLES
Table No Table Description
4.1 Layers of ResNet
4.2 Versions of ResNet
CHAPTER 1
INTRODUCTION
ABOUT PROJECT
Human Activity Recognition (HAR) uses deep learning methods such as convolutional neural
networks and recurrent neural networks, which help achieve state-of-the-art results.
Convolutional neural networks combined with long short-term memory networks are best suited
to learning features from raw data and help to predict the associated movement.
HAR typically targets movements that are common indoor activities, such as walking, talking,
standing, and sitting, but it is not restricted to those activities and deals with much more.
Historically, sensor data for activity recognition was challenging and expensive to collect,
requiring custom hardware. Today, smartphones and other personal tracking devices used for
fitness and health monitoring are cheap and ubiquitous. As a result, sensor data from these
devices is cheaper to collect and more common, and is therefore a more commonly studied form
of the general activity recognition problem. For the most part, this problem is framed as a
univariate or multivariate time-series classification task.
It is a difficult problem because there is no obvious or direct way to relate the recorded
sensor data to specific human activities, and each subject may perform an activity with
significant variation. The goal is to record sensor data and the corresponding activities for
specific subjects, fit a model to this data, and generalize the model to classify the
activities of new, unseen subjects from their sensor data.
The Human Activity Recognition project is designed using OpenCV and the ResNet architecture
for recognizing the activities performed in different videos. The project is done using deep
learning, an artificial intelligence function.
A model is built using the sequential method in the TensorFlow library. The model is trained
using the training data and tested using the testing data. The weights of the trained model
are stored in an H5 file for later use, so the model need not be trained every time we use
it. A classification matrix is plotted for a better understanding of the model. It is
difficult to detect the activities using OpenCV alone, so here the ResNet architecture is
used along with OpenCV for detecting the movements and labelling them. This algorithm has
many applications in the real world, and it is user friendly and efficient.
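As a sketch of the save/reload workflow described above (the tiny Sequential network and the filename here are illustrative only, not the project's actual ResNet model):

```python
import numpy as np
import tensorflow as tf

def build_model():
    # Illustrative stand-in for the project's network; the real model
    # is a ResNet, not this tiny Sequential example.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])

model = build_model()

# After training, store the weights in an H5 file so the model
# need not be retrained every time it is used.
model.save_weights("har_demo.weights.h5")

# Later: rebuild the same architecture and load the stored weights.
model2 = build_model()
model2.load_weights("har_demo.weights.h5")

# The reloaded weights match the originals exactly.
same = all(np.allclose(a, b)
           for a, b in zip(model.get_weights(), model2.get_weights()))
```

Because only the weights are stored, the architecture must be rebuilt identically before loading.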
Python is a general-purpose programming language which became very popular in a short time,
mainly because of its simplicity and code readability. It allows programmers to express their
ideas in very few lines of code without reducing readability. OpenCV is a library that
contains programming functions, especially real-time functions for computer vision. OpenCV
is open source and cross-platform, free to be used under the BSD license. OpenCV supports
many other technologies like Caffe, Torch/PyTorch, and the deep learning framework
TensorFlow. The existing state-of-the-art 2D architectures (such as ResNet) can be extended
to video classification via 3D kernels. The ImageNet dataset allowed models like these to be
trained to high accuracy. Hence, these architectures are able to process and perform video
classification by tweaking the input volume shape to include spatiotemporal knowledge.
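The input-volume tweak can be illustrated with plain NumPy; the frame size and window length below are illustrative values only:

```python
import numpy as np

H, W, C = 112, 112, 3   # illustrative frame size
T = 16                  # number of frames in the temporal window

# A 2D architecture sees one frame at a time: (height, width, channels).
frame = np.zeros((H, W, C), dtype=np.float32)

# A 3D architecture sees a spatiotemporal volume: (frames, height, width,
# channels), so its 3D kernels can convolve over time as well as space.
volume = np.stack([frame] * T)

print(frame.shape)   # (112, 112, 3)
print(volume.shape)  # (16, 112, 112, 3)
```

The extra leading dimension is what allows 3D kernels to capture motion across frames.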
DOMAIN INTRODUCTION
Deep learning is an artificial intelligence function. This domain has been constantly
evolving in today's era, and this has resulted in an explosion of data in all forms. This
humongous collection of data is easily accessible, and it can be shared through the means of
cloud computing.
With that being said, the data extracted is so huge that it could take years for humans to
interpret and understand the necessary information. With the deep learning domain, a variety
of computer vision applications have been introduced that are becoming a part of our daily
lives. These include home surveillance, human behavioural analysis, and much more.
PROBLEM STATEMENT
This project focuses on building a Human Activity Recognition System with OpenCV, TensorFlow and
Deep Learning. The aim is to train a Deep Learning model that will detect human activities being
performed on the video stream provided as an input. This model will be able to identify over 400
activities performed by humans such as jogging, cycling, eating, etc. with an accuracy of 70% to 90%. The
purpose of this project is to identify the actions and objectives of one or more objects.
OBJECTIVES
The primary objective of this project is to solve human centred problems from healthcare to security by
inferring several simple human activities. By following the steps, we will be able to understand the
process involved in training a model to recognize human activities.
● Classify the activity and caption it back on the individual frames
The major applications of this system can be seen in robotics, fall detection for humans, AI,
video surveillance, and many more. The field has earned a lot of fame through the years.
Because of this, there is an immense need to develop an effective way to store the videos.
The necessary applications needed for designing these interfaces are carefully researched and
given deep thought. The recognition of activities in video surveillance is not limited to
detecting unauthorized entry and abnormal crowd behaviour, but covers much more.
CHAPTER 2
LITERATURE SURVEY
A literature survey is the most basic step for research and development, by which we gain
knowledge. Before starting the project, it is necessary to know about the appropriate
language, software, and other development tools to be used, so that we get efficient results.
Before starting the actual coding, programmers need to know all this information and need a
lot of external support. This support can be obtained from senior programmers, from books,
from websites, or from journals. A literature survey is the best way of learning and gaining
knowledge about the concepts to be used. We have studied the human activity recognition
problem in detail.
TECHNOLOGY
Residual Networks (ResNet) – Deep Learning:
After the first CNN-based architecture (AlexNet) won the ImageNet 2012 competition, every
subsequent winning architecture used deeper neural networks to reduce the error rate. This
works for a few layers, but when we increase the number of layers, an issue known as the
vanishing/exploding gradient arises: the gradient becomes 0 or too large. So, as we increase
the layers, the training and test errors also increase.
Other Techniques used:
Neural networks
CNN architecture
Advantages:
Basic Architecture Model:
OpenCV
OpenCV is a real-time optimized computer vision library and set of tools used in the field of
computer vision. It helps in processing images and videos to detect objects, frames, faces,
and activities. It plays a major role in real-time operation, which is essential in today's
systems.
Features of OpenCV Library
∙ Analyse the video, i.e., estimate the motion in it, subtract the background, and track
objects in it.
Core Functionality
The core functionality module includes some of the basic data structures like Range, Scalar,
Point, etc., which are used in building applications using OpenCV. It also includes the Mat
multidimensional array, which is used for storing images. org.opencv.core is the name of the
package of this module in the Java library of OpenCV.
Image Processing
The image processing module contains various operations for processing images, such as
filtering, transformation of images, conversion of colour space, histograms, etc.
org.opencv.imgproc is the name of the package of this module in the Java library of OpenCV.
Video
The video module covers the analysis of video concepts like motion estimation, object
tracking, etc. org.opencv.video is the name of the package of this module in the Java library
of OpenCV.
Video I/O
Video capturing and video codecs are covered in this module. org.opencv.videoio is the name
of the package of this module in the Java library of OpenCV.
calib3d
The multiple-view geometry algorithms, object pose estimation, stereo and single camera
calibration, elements of 3D reconstruction, and stereo correspondence are all included in
this module. org.opencv.calib3d is the name of the package of this module in the Java
library of OpenCV.
features2d
It includes the feature description and detection concepts. org.opencv.features2d is the name
of the package of this module in the Java library of OpenCV.
Objdetect
Object detection and predefined instance classes like eyes, faces, people, etc., are included
in this module. org.opencv.objdetect is the name of the package of this module in the Java
library of OpenCV.
Highgui
This module contains simple UI capabilities. org.opencv.imgcodecs and org.opencv.videoio are
the names of the packages of this module in the Java library of OpenCV.
EXISTING SYSTEM
These are the existing systems related to this field. We studied these papers to gain more
knowledge about video recognition and frame processing, and about what kinds of methods
should be applied for our project.
PROPOSED SYSTEM
In this study, the proposed system is developed based on Python 3, Keras, OpenCV, ResNet, and
TensorFlow. The foremost purpose of this system is to process the input video stream for
human detection and further process the individual frames of the input video to predict which
activity is being performed. After the prediction is made, which is accurate up to 94%, the
frames are captioned and the end result is given as the output.
One of the most important parts of this study is the classification of the activity being
performed by the human, and this feature depends on the object detection framework.
METHODOLOGY
The main aim of this system is to detect human motion and tag it on the basis of the
activities performed, using human behaviour analysis. This is achieved by leveraging a human
activity recognition model pre-trained on the Kinetics dataset, which includes 400–700 human
activities (depending on which version of the dataset you're using) and over 300,000 video
clips. The DeepMind Kinetics human action video dataset is described. The dataset includes
four hundred human action classes, with at least four hundred videos for each action. Each
clip lasts around 10 s and is taken from a distinct YouTube video. The actions are
human-oriented and cover a broad range of classes.
The statistics of the dataset, how it was collected, and some baseline performance figures
for neural network architectures trained and tested for human action classification on this
dataset are presented. A preliminary analysis of whether imbalance in the dataset leads to
bias in the classifiers is also carried out.
CHAPTER 3
REQUIREMENT ANALYSIS
3. 1 FUNCTIONAL REQUIREMENTS
⮚ System should be able to take video stream as an input.
⮚ System should also be able to extract each frame from the video input
⮚ System should be able to pre-process these frames extracted from the input and resize them
⮚ System should be able to compare the frames with the trained weights.
⮚ After comparing, the system should be able to categorize the input sequence into various
activity classes.
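The requirements above can be sketched as a small frame pipeline. This is an illustrative outline in plain NumPy: the nearest-neighbour helper stands in for cv2.resize, and random arrays stand in for decoded video frames.

```python
import numpy as np

def resize_nn(frame, size):
    """Nearest-neighbour resize, standing in for cv2.resize."""
    h, w = frame.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return frame[rows][:, cols]

def preprocess(frames, size=112):
    """Resize each extracted frame and scale pixels to [0, 1]."""
    return [resize_nn(f, size).astype(np.float32) / 255.0 for f in frames]

# Stand-in for frames extracted from an input video stream.
video = [np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
         for _ in range(8)]
batch = preprocess(video)
```

In the actual system, the preprocessed batch would then be compared against the trained weights to produce a class prediction.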
NON-FUNCTIONAL REQUIREMENTS
Security:
No outside entity shall be allowed to modify content of code without proper authorization.
Usability:
Reliability:
⮚ The system should be able to recover in time and handle any exceptions properly.
TensorFlow is the premier open-source deep learning framework developed and maintained by
Google. Though implementing TensorFlow directly can be challenging, the modern tf.keras API
brings the simplicity and ease of use of Keras to the TensorFlow project.
Deep learning networks are distinguished from ordinary neural networks by having more hidden
layers, or so-called greater depth. These nets are capable of discovering hidden structures
within unlabelled and unstructured data (e.g., images, sound, and text), which makes up the
vast majority of data in the world.
OpenCV:
OpenCV is a cross-platform library with which we can develop real-time computer vision
applications. It mainly focuses on image processing, video capture, and analysis, including
features like face detection and object detection. Some of the main library modules of OpenCV
are Core Functionality, Image Processing, Video I/O, features2d, etc.
Computer Vision:
Computer vision is defined as a field that explains how to reconstruct, interpret, and
understand a 3D scene from its 2D images, in terms of the properties of the structures
present in the scene. It deals with modelling and replicating human vision using computer
software and hardware.
UI
Anaconda:
Anaconda is a distribution of the Python and R programming languages for scientific
computing that aims to simplify package management and deployment. The package versions in
Anaconda are handled by the package management system conda. This package manager was spun
out as a separate open-source package because it ended up being useful on its own, and for
other things than Python.
HARDWARE AND SOFTWARE REQUIREMENTS
⮚ HARDWARE REQUIREMENTS
o Windows 10 OS
⮚ SOFTWARE REQUIREMENTS
o OpenCV
o Anaconda Environment
o Pre-Trained weights
OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine
learning software library. OpenCV was built to provide a common infrastructure for computer
vision applications and to accelerate the use of machine perception in commercial products.
Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the
code.
Anaconda
The open-source Anaconda distribution is the easiest way to perform Python/R data science and
machine learning on Linux, Windows, and Mac OS X. With over eleven million users worldwide,
it is the industry standard for developing and testing on a single machine.
Pre-Trained Weights
Lower layers learn features that are not necessarily specific to the application or dataset:
corners, edges, simple shapes, etc. So, it does not matter if the data is strictly a subset
of the categories that the original network can predict. Depending on how much data is
available for training, and how similar the data is to that used in the pretrained network,
you can decide to freeze the lower layers and train only the higher layers.
TensorFlow's flexible architecture allows for the easy deployment of computation across a
variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile
and edge devices. TensorFlow computations are expressed as stateful dataflow graphs.
Keras includes several implementations of commonly used neural-network building blocks such
as layers, objectives, activation functions, and optimizers, and a host of tools to make
working with image and text data easier. The code is hosted on GitHub, and community support
forums include the GitHub issues page and a Slack channel. In addition to standard neural
networks, Keras has support for convolutional and recurrent neural networks. It supports
utility layers like dropout, batch normalization, and pooling.
Python Language
Python is a high-level programming language. It offers readable and maintainable code,
supports multiple programming paradigms, has many open-source frameworks and tools, is
compatible with major platforms and systems, has a robust standard library, and helps
simplify complex software development.
CHAPTER 4
DESIGN
DESIGN GOALS
INPUT
Take the pre-recorded video as input from the user through the command-line backend.
OUTPUT
The output should caption the frames of input video based on the prediction made by trained activity
recognition model.
EFFICIENCY
The system should be able to identify human from any other similar objects in the input video stream
and appropriately classify the activity with acceptable accuracy.
One of the issues ResNets solve is the well-known vanishing gradient problem. When the
network is too deep, the gradients from which the loss function is calculated easily shrink
to zero. As a result, the weights never update their values, and therefore no learning is
performed. With ResNets, the gradients can flow directly through the skip connections
backwards from later layers to the initial filters.
Architecture
Since ResNets can have variable sizes, depending on how big each layer of the model is and
how many layers it has, we follow the one described by the authors in the paper, ResNet-34,
to describe the architecture of these networks. Note that the reduction between layers is
achieved by an increase in the stride, from 1 to 2, at the first convolution of each layer.
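A minimal sketch of a residual block with the stride-2 reduction just described, using the Keras functional API (the filter counts are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """Basic ResNet block: two 3x3 convolutions plus a skip connection."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # When the stride is 2 (or the channel count changes), a 1x1 convolution
    # projects the shortcut so the two tensors can be added.
    if stride != 1 or x.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(x)
    return layers.ReLU()(layers.Add()([y, shortcut]))

inp = tf.keras.Input(shape=(56, 56, 64))
out = residual_block(inp, 128, stride=2)   # spatial reduction via stride 2
model = tf.keras.Model(inp, out)
```

The stride-2 first convolution halves the spatial dimensions, exactly the between-layer reduction the text describes.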
Summary
The ResNets built following the rules explained by the authors yield the structures shown below:
FLOWCHART
Figure 4.3. Flowchart
DATAFLOW DIAGRAM
USE-CASE DIAGRAM
SEQUENCE DIAGRAM
STATE DIAGRAM
CHAPTER 5
IMPLEMENTATION
Libraries required
2. cv2 - It is a Python library used for real-time computer vision. It recognizes faces,
classifies human actions, identifies objects, tracks camera movements, tracks moving
objects, and extracts 3D models of objects. In this application it recognizes the objects
appearing on the camera and tracks the objects and camera movements.
3. imutils - It is a Python library of convenience functions that, among other things,
provides access to the system webcam, allowing for a live stream.
In order to reduce the number of parameters while expanding the receptive field, TridentNet
is introduced with three branches in the backbone ResNet-50. TridentNet is introduced into
the 5th stage of ResNet-50 in this paper. Due to structural differences between modules in
the ResNet network itself, the improved trident module is also divided into a Conv-trident
block and an ID-trident block.
The original TridentNet was used as part of an object detection network as a three-way
structure. We made some modifications to give it one branch output, and added a shortcut to
make it more in line with the ResNet configuration. In addition to adding a multi-branch
structure, TridentNet also uses the concept of dilated convolution. By inserting zeros within
the convolution kernel, a large receptive field can be obtained with fewer parameters. When
the dilation rate is d and the original convolution kernel size is k, the effective receptive
field size n is given by n = k + (k - 1) * (d - 1). In the TridentNet structure, the kernel
size of the three branches is 3, and the dilation rates are 1, 2, and 3 respectively.
Therefore, the receptive field size n becomes 3, 5, and 7.
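The relation n = k + (k - 1)(d - 1) can be checked directly:

```python
def effective_kernel_size(k, d):
    """Effective size of a k-wide kernel dilated with rate d
    (zeros are inserted between the original taps)."""
    return k + (k - 1) * (d - 1)

# TridentNet's three branches: kernel size 3, dilation rates 1, 2, 3.
sizes = [effective_kernel_size(3, d) for d in (1, 2, 3)]
print(sizes)  # [3, 5, 7]
```

This confirms the receptive field sizes of 3, 5, and 7 stated above.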
DATA REPRESENTATION
Different data representations were used by the surveyed papers, depending on the HAR
applications and sensors. Inertial measurements recorded by sensors integrated in IMUs are
generally deployed for solving HAR. Usually, more than three devices are placed on the body,
e.g., on the hands, legs, head, and torso. Differently, some authors recorded acceleration
measurements from only one device, located at the waist. They proposed using the magnitude of
the acceleration vector from the three components x, y, and z. Other authors used the
logarithmic magnitude of a two-dimensional Discrete Fourier Transform of the IMU signals,
proposing this magnitude as an image input for a CNN.
PRE-PROCESSING
Low- and high-pass filtering have been used for separating the acceleration components due to
body movements and gravity. This also eliminates noise. The body acceleration is then
calculated by subtracting the gravity component from the acceleration measurements. Here, the
video inputs are also processed and extracted into individual frames so that these frames can
be forwarded to a trained model and then recognized.
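A sketch of the gravity/body separation on a synthetic accelerometer trace; a simple moving-average low-pass filter stands in for the Butterworth-style filters typically used:

```python
import numpy as np

def low_pass(signal, window=20):
    """Moving-average low-pass filter: keeps the slowly varying
    gravity component of an acceleration signal."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

# Synthetic trace at 100 Hz: constant gravity (9.81) plus 5 Hz body motion.
t = np.arange(0, 10, 0.01)
accel = 9.81 + 0.5 * np.sin(2 * np.pi * 5 * t)

gravity = low_pass(accel)   # low-frequency (gravity) component
body = accel - gravity      # body acceleration = total minus gravity
```

Away from the edges, the recovered gravity component sits at 9.81 and the body component averages to zero, matching the subtraction step described above.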
SEGMENTATION
FEATURE EXTRACTION
In pattern recognition methods, feature extraction is a crucial stage. It allows representing
the data in a compact manner. Features are divided into two major groups: statistical and
application-based. Time-domain features focus on waveform characteristics, and
frequency-domain features focus on the periodic structure of the signal. We performed a
complete review of current methods of human activity recognition. We discussed unimodal
approaches and provided an internal categorization of these methods, which were developed for
analysing gestures, atomic actions, and more complex activities.
Figure 5.4. Feature Extraction process
BUILDING MODEL
In this stage, the trained activity recognition model and the file containing the class
labels are loaded, and the input video is parsed from the command line. The main steps are
outlined below.
Ex: ap = argparse.ArgumentParser()
ap.add_argument("-i", "--input", type=str, default="", help="optional path to video file")
args = vars(ap.parse_args())
Step 3: load the contents of the class labels file, then define the sample duration (i.e.,
the number of frames for classification) and the sample size (i.e., the spatial dimensions of
the frame).
2. Define the sample duration, which is the number of frames used for classification.
Ex: SAMPLE_DURATION = 16
3. Define the sample size, which is the spatial dimension of the frame.
Ex: SAMPLE_SIZE = 112
Step 4: We now initialize the frames queue used to store a rolling sample duration of frames.
Ex: frames = deque(maxlen=SAMPLE_DURATION)
Ex: while True:
Step 8: resize the frame (to ensure faster processing) and add the frame to our queue.
Ex: frames.append(frame)
Step 9: if our queue is not yet filled to the sample duration, continue back to the top of
the loop and keep polling/processing frames.
Ex: continue
Step 10: now that our frames queue is filled, we can construct our blob.
Ex: blob = cv2.dnn.blobFromImages(frames, 1.0, (SAMPLE_SIZE, SAMPLE_SIZE),
swapRB=True, crop=True)
Step 11: pass the blob through the network to obtain our human activity recognition
predictions.
Ex: net.setInput(blob)
outputs = net.forward()
label = CLASSES[np.argmax(outputs)]
Step 12: Draw the predicted activity on the frame.
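The blob shaping in Steps 10-11 can be illustrated with plain NumPy, assuming the 3D network expects input in (batch, channels, frames, height, width) order:

```python
import numpy as np

SAMPLE_DURATION = 16   # frames per classification window
SAMPLE_SIZE = 112      # spatial dimension expected by the network

# Stand-in for the filled frames queue: 16 preprocessed colour frames.
frames = np.zeros((SAMPLE_DURATION, SAMPLE_SIZE, SAMPLE_SIZE, 3),
                  dtype=np.float32)

# The frame stack is channels-last; a 3D network instead wants
# (batch, channels, frames, height, width).
blob = frames.transpose(3, 0, 1, 2)       # -> (channels, frames, H, W)
blob = np.expand_dims(blob, axis=0)       # -> (1, channels, frames, H, W)

print(blob.shape)  # (1, 3, 16, 112, 112)
```

This rearrangement is what lets the 3D kernels convolve over the temporal dimension as well as the spatial ones.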
2. OpenCV – It is a Python library designed for solving computer vision problems.
3. TensorFlow – It is a deep learning Python library, which contains various sub-libraries or
modules for solving image-related problems.
CHAPTER 6
TESTING
Testing can be defined as the process of finding errors. The main purpose of testing is to
find errors and discover weaknesses or faults in a product. It is a way of checking the
functionality of components, assemblies, and the completely integrated product. It is the
method used for ensuring that the software designed meets the user's expectations and
requirements, and that it does not fail in an unacceptable manner. There are different types
of testing:
TYPES OF TESTING
1 UNIT TESTING
In unit testing, the test cases are designed to validate the internal program logic and to
check that the input produces valid output. All code flows and decision branches are tested
using unit testing. It is a structural test and is done at the component level.
2 INTEGRATION TESTING
Integration testing is done after integrating each component, to check whether the components
work properly as one program. It is done to find whether the component integration is
consistent and correct. The main aim of integration testing is to find issues that arise
while combining the components.
3 SYSTEM TESTING
System testing is conducted on the complete product, the integrated system, to check whether
it meets the user requirements. It does not require any inner knowledge of the code or
implementation. System testing is done on the complete system in the context of an SRS
(System Requirement Specification) or FRS (Functional Requirement Specification).
TEST CASES
1.
Table 6.1: Test case for detecting push-up
2.
Remark: Test successful
3.
Output: The activity was detected by the CNN algorithm, and the type of activity is also
displayed on the screen.
4.
Description: When the program is executed, on analysing the video fed into the system, the
activity performed in the video must be labelled as "throwing axe".
CHAPTER 7
SNAPSHOT
CODE SNIPPETS:
1.
2.
3.
OUTPUT SNAPSHOTS -
GUI:
OUTPUT 1:
OUTPUT 2:
CHAPTER 8
CONCLUSION
In this project we implemented a Human Activity Recognition System. It is an application that
recognizes the movements or activities performed by humans and labels them. Initially, we did
a literature survey on how to implement this application. Then we analysed the functional and
non-functional requirements. Then we designed a few UML diagrams for a better understanding
of the implementation. After the requirement and design analysis, we used the input dataset
to train the model with 80% of the data and test the model with the remaining 20%. Once
trained, the model detects and labels the activity being performed in the respective video
input. The model is trained using a convolutional neural network (CNN), namely the ResNet
architecture.
All in all, in this project you learned how to perform human activity recognition using
OpenCV and deep learning. To complete this task, we used a human activity recognition model
pretrained on the Kinetics dataset, which includes 400–700 human activities (depending on
which version of the dataset you're using) and over 300,000 video clips. The model makes use
of the ResNet architecture with 3D kernels instead of the standard 2D filters, allowing it to
include a temporal component for activity recognition.
CHAPTER 9
REFERENCES
I. T. Lan, Y. Wang, and G. Mori, "Discriminative figure-centric models for joint action
localization and recognition," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011,
pp. 2003–2010.
II. M. Raptis, I. Kokkinos, and S. Soatto, "Discovering discriminative action parts from
mid-level video representations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Jun. 2012, pp. 1242–1249.
III. M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. G. M. Snoek, "Action localization
with tubelets from motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014,
pp. 740–747.
IV. H. Zhang and O. Yoshie, "Improving human activity recognition using subspace
clustering," in Proc. Int. Conf. Machine Learning and Cybernetics (ICMLC), vol. 3,
Jul. 2012, pp. 1058–1063.
V. N. Robertson and I. Reid, "A general method for human activity recognition in video,"
Computer Vision and Image Understanding, vol. 104, no. 2, pp. 232–248, 2006.