Recognizing human activity

ABSTRACT

Understanding human activities from videos is a demanding task in

Computer Vision. Automatically identifying the actions performed by humans in a
video sequence and tagging those actions is the prime functionality of
intelligent video systems. The goal of activity recognition is to identify the actions and
objectives of one or more agents from a series of observations of their actions and
their environmental conditions. The major applications of Human Activity Recognition
range from content-based video analytics, robotics, human-computer interaction,
and human fall detection to ambient intelligence, visual surveillance, and video indexing.
This paper collectively summarizes and deciphers the various methodologies, challenges
and issues of Human Activity Recognition systems. Variants of Human Activity
Recognition systems, such as human-object interactions and human-human
interactions, are also explored.

Various benchmarking datasets and their properties are also explored. The
experimental evaluations of various papers are examined with performance
metrics such as precision, recall, and accuracy. Human action recognition can
be made more reliable without manual annotation of the relevant portion of the action of
interest. This paper presents not only an update extending previous related surveys, but
also focuses on a joint learning framework that identifies the temporal and spatial extent
of actions in videos.

CONTENT

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES

1. INTRODUCTION
   1.1. ABOUT PROJECT
   1.2. DOMAIN INTRODUCTION
   1.3. PROBLEM STATEMENT
   1.4. OBJECTIVES
   1.5. SCOPE OF THE PROJECT

2. LITERATURE SURVEY
   2.1. TECHNOLOGY
        Residual Networks (ResNet) – Deep Learning
        OpenCV
   2.2. CONVOLUTIONAL NEURAL NETWORK
   2.3. EXISTING SYSTEM
   2.4. PROPOSED SYSTEM
   2.5. METHODOLOGY

3. REQUIREMENT ANALYSIS
   3.1. FUNCTIONAL REQUIREMENTS
   3.2. NON-FUNCTIONAL REQUIREMENTS
   3.3. DOMAIN AND UI REQUIREMENTS
   3.4. HARDWARE AND SOFTWARE REQUIREMENTS

4. DESIGN
   4.1. DESIGN GOALS
   4.2. OVERALL SYSTEM ARCHITECTURE
   4.3. FLOWCHART
   4.4. DATA FLOW DIAGRAM
   4.5. USE-CASE DIAGRAM
   4.6. SEQUENCE DIAGRAM
   4.7. STATE DIAGRAM

5. IMPLEMENTATION
   5.1. ACTIVITY RECOGNITION USING RESNET
   5.2. DATA REPRESENTATION
   5.3. PRE-PROCESSING
   5.4. SEGMENTATION
   5.5. FEATURE EXTRACTION
   5.6. BUILDING MODEL
   5.7. CONVOLUTIONAL NEURAL NETWORK

6. TESTING
   6.1. TYPES OF TESTING
   6.2. TEST CASES

7. SNAPSHOT
8. CONCLUSION
9. REFERENCES

LIST OF FIGURES

Fig. No  Figure Description
2.1      Training Error
2.2      Architecture Layer
2.3      Layers of ConvNet
2.4      Activity frame
2.5      Flow of program
3.1      OpenCV
3.2      Anaconda UI
4.3      Flowchart
4.4      Dataflow Diagram
4.5      Use-case diagram
4.6      Sequence Diagram
4.7      State Diagram
5.1      Network structure of ResNet-50
5.2      Improved version of network structure
5.4      Feature Extraction process
7.1      Code Snippet
7.2      Code Snippet 2
7.3      Code Snippet 3
7.4      GUI of front-end
7.5      Output 1
7.6      Output 2
LIST OF TABLES

Table No  Table Description
4.1       Layers of ResNet
4.2       Versions of ResNet
5.3       Detailed structure of Surveys
6.1       Test case for detecting push-up
6.2       Test case for Human Detection
6.3       Test case for Predicting activity
6.4       Test case for activity “throwing axe”
CHAPTER 1

INTRODUCTION
ABOUT PROJECT

Recognizing human activity is essential in human-to-human interaction and relations. It helps
provide information about the basic identity of a person and their psychological state. Human
activity recognition is the process of classifying sequences of video to understand the kind of
activity being performed in them. It can be challenging when working with a large number of
observations and with a shortage of ways to relate accelerometer data to known movements.

Human Activity Recognition uses deep learning methods such as convolutional neural networks
and recurrent neural networks, which help achieve state-of-the-art results. Convolutional
neural networks combined with long short-term memory networks are well suited to learning
features from raw data and predicting the associated movement.
HAR typically targets activities performed indoors, such as walking, talking, standing and
sitting, but it isn’t restricted to those activities and deals with much more.
Historically, sensor data for activity recognition was challenging and costly to collect,
requiring custom hardware. Now, smartphones and other personal tracking devices
used for fitness and health monitoring are cheap and ubiquitous. As such,
sensor data from these devices is cheaper to collect and more common, and consequently
is a more commonly studied form of the general activity recognition problem. For
the most part, this problem is framed as a univariate or multivariate time-series classification
task.

It is a difficult problem because there are no obvious or direct ways to relate the recorded sensor
data to specific human activities, and each subject may perform an activity with significant variation.

The goal is to record sensor data and corresponding activities for specific subjects, fit a model from
this data, and generalize the model to classify the activity of new, unseen subjects from their
sensor data.
The Human Activity Recognition project is designed using OpenCV and the ResNet architecture for
recognizing the activities performed in different videos. The project is done using Deep Learning,
an Artificial Intelligence function.

A model is built using the sequential method in the TensorFlow library. The model is trained using the
training data and tested using the testing data. In this algorithm, the weights of the trained
model are stored in an H5 file for later use, so the model need not be trained every time we use it.
A classification matrix is plotted for a better understanding of the model. It is difficult to detect the
activities using OpenCV alone, so here the ResNet architecture is used along with OpenCV for detecting
the movements and labelling them. This algorithm has many applications in the real world, and it is
user friendly and efficient.

Python is a general-purpose programming language which became very popular in a short time,
mainly because of its simplicity and code readability. It allows programmers to express their
ideas in very few lines of code without reducing readability. OpenCV is a library that contains
programming functions, especially real-time functions, for computer vision. OpenCV is an
open-source, cross-platform library, free to use under the BSD license. OpenCV supports deep
learning frameworks such as Caffe, Torch/PyTorch and TensorFlow. The existing
state-of-the-art 2D architectures (such as ResNet) can be extended to video classification via 3D
kernels. The ImageNet dataset allowed models like these to be trained to high accuracy. Hence,
these architectures will be able to process and perform video classification by (1) tweaking the
input volume shape to include spatiotemporal knowledge, and (2) integrating the 3D kernels inside
the architecture.
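The reshaping in point (1) can be sketched in NumPy. The frame count and spatial size below (16 frames of 112×112) mirror the sample duration and sample size used later in this report, but are otherwise arbitrary illustrative choices:

```python
import numpy as np

# 16 RGB frames of 112x112, as produced by a frame-extraction step
frames = [np.zeros((112, 112, 3), dtype=np.float32) for _ in range(16)]

# 2D architectures classify one frame at a time: (H, W, C)
single_frame = frames[0]
print(single_frame.shape)  # (112, 112, 3)

# 3D architectures consume a spatiotemporal volume: (C, T, H, W),
# plus a leading batch dimension when fed to the network
volume = np.stack(frames)                    # (T, H, W, C)
volume = np.transpose(volume, (3, 0, 1, 2))  # (C, T, H, W)
volume = np.expand_dims(volume, axis=0)      # (1, C, T, H, W)
print(volume.shape)  # (1, 3, 16, 112, 112)
```

The extra temporal axis T is exactly what lets 3D kernels learn motion across frames rather than appearance within a single frame.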

DOMAIN INTRODUCTION

Deep Learning is an Artificial Intelligence function. This domain has been constantly evolving in today’s
era, and this has resulted in an explosion of data in all forms. This enormous collection of data is easily
accessible, and it can be shared through the means of cloud computing.

With that being said, the extracted data is so huge that it could take years for humans to interpret and
understand the necessary information. With Deep Learning, a variety of computer vision
applications have been introduced that are becoming a part of our daily lives.
These include home surveillance, human behavioural analysis and much more.

PROBLEM STATEMENT

This project focuses on building a Human Activity Recognition system with OpenCV, TensorFlow and
Deep Learning. The aim is to train a Deep Learning model that will detect human activities being
performed in the video stream provided as input. This model will be able to identify over 400
activities performed by humans, such as jogging, cycling, eating, etc., with an accuracy of 70% to 90%. The
purpose of this project is to identify the actions and objectives of one or more agents.

OBJECTIVES

The primary objective of this project is to solve human-centred problems from healthcare to security by
inferring several simple human activities. By following the steps, we will be able to understand the
process involved in training a model to recognize human activities.

Steps involved are as follows:

● Take video stream as input

● Process the video that is taken and extract individual frames

● Pass the frame through trained human activity recognition model

● Compare the frames and predict the type of activity

● Classify the activity and caption it back on the individual frames

● Stream the video to the output

SCOPE OF THE PROJECT

The major applications of this system can be seen in robotics, fall detection for humans, AI, video
surveillance and many more. The field has earned a lot of fame through the years, and because of this
there is an immense need to develop an effective way to store the videos. The necessary
applications that are needed for designing these interfaces are carefully researched and given deep
thought. The recognition of activities in video surveillance is not limited to detecting unauthorized
entry and abnormal crowd behaviour; it extends much further.

CHAPTER 2

LITERATURE SURVEY
A literature survey is the most basic step for research and development, through which we gain
knowledge. Before starting the project, it is necessary to know about the appropriate language,
software, and other development tools to be used for our project, so that we get efficient results.
Before starting the actual coding, programmers need to know all this information
and need a lot of external support. This support can be obtained from senior programmers, from
books, from websites, or from journals. A literature survey is the best way of learning and
gaining knowledge about the concepts to be used. We have studied human activity recognition in
detail.

TECHNOLOGY
Residual Networks (ResNet) – Deep Learning:
After the first CNN-based architecture (AlexNet) won the ImageNet 2012 competition, every
subsequent winning architecture used a deeper neural network to reduce the error rate. This works
for a few layers, but when we increase the number of layers further, an issue arises known as the
vanishing/exploding gradient, where the gradient becomes 0 or too large. So, as
we increase the layers, the training and test errors also increase.

Figure 2.1. Training Error

Other techniques used:

∙ Neural networks

∙ CNN architecture
Advantages:

1. This architecture carries a powerful representational ability.


2. It is a powerful backbone that is used in several computer vision tasks.
3. It works by adding the output from an earlier layer to a later layer, making the flow of
information through the network more efficient.

Basic Architecture Model:

Figure 2.2. Architecture Layer

OpenCV
OpenCV is a real-time optimized computer vision library and toolset used in the field of computer
vision. It helps in processing images and videos to detect objects, frames, faces and activities. It plays a
major role in real-time operation, which is essential in today’s systems.
Features of OpenCV Library

Using the OpenCV library, you can:

∙ Read and write images

∙ Capture and save videos

∙ Process images by filtering them

∙ Perform feature detection

∙ Detect specific objects and humans in a video

∙ Analyse a video, i.e., estimate the motion in it, subtract the background, and track
objects in it

OpenCV Library Modules


Some of the main library modules in OpenCV library are as follows:

Core Functionality

The core functionality module includes some of the basic data structures like Range, Scalar, Point,
etc., which are used in building applications using OpenCV. It also includes the Mat multidimensional
array, which is used for storing images. org.opencv.core is the name of the package of this module
in the Java library of OpenCV.

Image Processing

The image processing module contains various operations for processing images such as filtering,
transformation of images, conversion of colour space, histograms, etc.
org.opencv.imgproc is the name of the package of this module in the Java library of OpenCV.

Video
The video module covers the analysis of video concepts like estimation of motion, tracking of objects, etc.
org.opencv.video is the name of the package of this module in the Java library of OpenCV.
Video I/O
Video capturing and video codecs are covered in this module. org.opencv.videoio is
the name of the package of this module in the Java library of OpenCV.
calib3d
The multiple-view geometry algorithms, object pose estimation, stereo and single camera
calibration, elements of 3D reconstruction, and stereo correspondence are all included in this module.
org.opencv.calib3d is the name of the package of this module in the Java library of OpenCV.
features2d

It includes the feature description and detection concepts. org.opencv.features2d is the name of the
package of this module in the Java library of OpenCV.

Objdetect

Object detection and predefined instance classes like eyes, faces, people, etc., are included in
this module. org.opencv.objdetect is the name of the package of this module in the Java library of
OpenCV.
Highgui

This module contains simple UI capabilities. In the Java library of OpenCV, its functionality is
covered by the org.opencv.imgcodecs and org.opencv.videoio packages.

2.2. CONVOLUTIONAL NEURAL NETWORK


CNN is a Deep Learning algorithm that takes an image as input, extracts features from the
image, differentiates between the objects, and gives output as per the user requirements.
A Convolutional Neural Network does not need much pre-processing compared to other
algorithms. A CNN has many layers, which help it analyze images and extract features by itself,
without input from humans. Neural networks mimic the neuron system in humans. A Convolutional
Neural Network is a feed-forward neural network mainly used for image-related problems. A
CNN is also known as a ConvNet. ConvNets have the ability to learn filters and characteristics
by themselves, and they have become most popular for computer vision tasks.

Fig. 2.3. Layers of ConvNet

Some applications of ConvNet


∙ Image recognition and OCR

∙ Object detection for self-driving cars

∙ Face recognition on social media

∙ Image analysis in healthcare

EXISTING SYSTEM
These are the existing systems related to this field. We studied these papers to gain more
knowledge about video recognition and frame processing, and about what kinds of methods to
apply in our project.

PROPOSED SYSTEM
In this study, the proposed system is developed based on Python 3, Keras, OpenCV, ResNet, and

TensorFlow. The foremost purpose of this system is to process the input video stream for human detection
and to further process the individual frames of the input video to predict which activity is being
performed. After the prediction is made, which is accurate up to 94%, the frames are captioned and
the end product is given to the output.

One of the most important parts of this study is the classification of the activity being performed by
the human, and this feature depends on the object detection framework.

Figure 2.4. Activity frame

METHODOLOGY

The main aim of this system is to detect human motion and tag it on the basis of the activities
performed, using Human Behaviour Analysis. This is achieved by leveraging a human activity
recognition model pre-trained on the Kinetics dataset, which includes 400-700 human activity
classes (depending on which version of the dataset you’re using) and over 300,000 video clips.
The DeepMind Kinetics human action video dataset includes 400 human action classes, with at
least 400 videos for each class. Each clip lasts around 10 s and is taken from a distinct
YouTube video. The actions are human-oriented and cover a broad range of classes, including
human-object interactions such as playing instruments, as well as human-human interactions
such as shaking hands.

The statistics of the dataset, how it was collected, and some baseline performance figures for
neural network architectures trained and tested for human action classification on this dataset
are described. A preliminary analysis of whether imbalance in the dataset leads to bias in the
classifiers is also carried out.

Figure 2.5. Flow of program

CHAPTER 3

REQUIREMENT ANALYSIS
3.1. FUNCTIONAL REQUIREMENTS
⮚ System should be able to take video stream as an input.

⮚ System should also be able to extract each frame from the video input

⮚ System should be able to pre-process these frames extracted from the input and resize

them to the required threshold size.

⮚ System should be able to compare the frames with the trained weights.

⮚ After comparing, the system should be able to categorize the input sequence into various

classes with acceptable accuracy.

NON-FUNCTIONAL REQUIREMENTS

Security:

 No outside entity shall be allowed to modify content of code without proper authorization.

Usability:

⮚ Self-learning support must be available. The system must be intelligent enough to guide

the user through the proper steps as they continue using the system. The system should be able
to recognize all kinds of activities a human can perform.

Reliability:

⮚ The system should be able to recover in a timely manner and handle any
exceptions properly.

DOMAIN AND UI REQUIREMENTS

DOMAIN:


Deep Learning with Tensor Flow:

TensorFlow is the premier open-source deep learning framework developed and maintained by Google.
Though using TensorFlow directly can be challenging, the modern tf.keras API brings
the simplicity and ease of use of Keras to the TensorFlow project.

Deep learning networks are distinguished from ordinary neural networks by having more
hidden layers, or so-called greater depth. These nets are capable of discovering hidden structures within
unlabelled and unstructured data (images, sound, and text), which makes up the vast majority
of data in the world.

OpenCV:

OpenCV is a cross-platform library with which we can develop real-time computer
vision applications. It mainly focuses on image processing, and on video capture and
analysis, including features like face detection and object detection. Some of the main library modules of
OpenCV are Core Functionality, Image Processing, Video I/O, features2d, etc.

Figure 3.1. OpenCV

Computer Vision:
Computer Vision is defined as a field that explains how to reconstruct, interpret, and understand a
3D scene from its 2D images, in terms of the properties of the structures present in the scene. It
deals with modelling and replicating human vision using computer software
and hardware.

Computer Vision overlaps significantly with the following fields –

● Pattern Recognition − It explains various techniques to classify patterns.

● Photogrammetry − It is concerned with obtaining accurate measurements from images.

UI

Anaconda:

Anaconda is a distribution of the Python and R programming languages for scientific computing
that aims to simplify package management and deployment. The package versions in
Anaconda are handled by the package management system conda. This package manager was spun out as a
separate open-source package because it ended up being useful on its own, and for things other than
Python.

Figure 3.2. Anaconda UI

HARDWARE AND SOFTWARE REQUIREMENTS

⮚ HARDWARE REQUIREMENTS

o Windows 10 OS

o Processor: Intel® Core™ i5-4210U CPU @ 1.70GHz 2.40GHz

o RAM: 8.00 GB or above.

o System type: 32/64-bit Operating System.

⮚ SOFTWARE REQUIREMENTS

o OpenCV

o Anaconda Environment

o Pre-Trained weights

o Tensor-flow Backend (Lib.)

o Keras frontend (Lib.)

o Language used: Python

OpenCV (Python library)

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and
machine learning software library. OpenCV was built to provide a common infrastructure for
computer vision applications and to accelerate the use of machine perception in
commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to
utilize and modify the code.

Anaconda

The open-source Anaconda distribution is the easiest way to perform Python/R data science and
machine learning on Linux, Windows and Mac OS X. With over eleven million users
worldwide, it is the industry standard for developing and testing on a single machine, enabling
individual data scientists to:

⮚ Quickly download 1,500+ Python/R data science packages

⮚ Manage libraries, dependencies, and environments with conda.

Pre-Trained Weights

Lower layers learn features that are not necessarily specific to the application/dataset: corners, edges,
simple shapes, etc. So it does not matter if the data is strictly a subset of the categories that the original
network can predict. Depending on how much data is available for training, and how similar the data is
to the data used for the pretrained network, you can decide to freeze the lower layers and train
only the higher layers.
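A sketch of this idea using the Keras API. The layer sizes here are arbitrary, and in practice the frozen lower layers would come from the pretrained network rather than being freshly built:

```python
import numpy as np
import tensorflow as tf

# Stand-in for a pretrained backbone: the lower layers learn generic
# features (corners, edges), the final layer is the task-specific head
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),    # lower layer
    tf.keras.layers.Dense(64, activation="relu"),    # lower layer
    tf.keras.layers.Dense(10, activation="softmax")  # head to fine-tune
])

# Freeze everything except the head, so only higher-layer weights update
for layer in model.layers[:-1]:
    layer.trainable = False

trainable = int(sum(np.prod(w.shape) for w in model.trainable_weights))
print(trainable)  # 650: only the head's 64*10 weights + 10 biases remain
```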

Tensor-flow Backend Library

Its flexible architecture allows for the easy deployment of computation across a variety of platforms
(CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.
TensorFlow computations are expressed as stateful dataflow graphs.

Keras Front-end Library

Keras includes numerous implementations of commonly used neural-network building blocks such
as layers, objectives, activation functions and optimizers, along with a host of tools to make working
with image and text data easier. The code is hosted on GitHub, and community support
forums include the GitHub issues page and a Slack channel. In addition to standard neural
networks,
Keras has support for convolutional and recurrent neural networks. It supports utility layers like
dropout, batch normalization, and pooling.

Python Language

Python is a high-level programming language. You can use Python because its code is readable and
maintainable, it supports multiple programming paradigms and many open-source frameworks and
tools, it is compatible with major platforms and systems, it has a robust standard library, and it
helps simplify complex software development.

CHAPTER 4

DESIGN
DESIGN GOALS
INPUT
Take a pre-recorded video as input from the user through the command-line backend.

OUTPUT

The output should caption the frames of the input video based on the prediction made by the trained
activity recognition model.

EFFICIENCY

The system should be able to distinguish a human from other similar objects in the input video stream
and appropriately classify the activity with acceptable accuracy.

OVERALL SYSTEM ARCHITECTURE


What problems do ResNets solve?

One of the problems ResNets solve is the well-known vanishing gradient. When the network is
too deep, the gradients from which the loss function is calculated easily shrink to zero after
several applications of the chain rule. As a result, the weights never update their values and,
therefore, no learning is performed. With ResNets, the gradients can flow directly through the skip
connections backwards from later layers to the initial filters.
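The effect of the skip connections can be illustrated with a toy NumPy calculation. For a residual block y = F(x) + x, the local derivative is F'(x) + 1, so even when F'(x) is vanishingly small, the identity term keeps the product of gradients across many layers close to 1 instead of collapsing towards 0. The one-dimensional "layers" below are a deliberately simplified sketch, not the actual network:

```python
import numpy as np

W = 1e-3  # a weak branch whose local gradient F'(x) = W is tiny

def branch_grad(x, w=W):
    # Gradient of the residual branch F(x) = w * x
    return w

x = 2.0

# Plain stacking: gradient through 10 layers is W**10 -> vanishes
plain_grad = np.prod([branch_grad(x) for _ in range(10)])

# Residual stacking: each block contributes F'(x) + 1 via the skip
# connection, so the product of gradients stays close to 1
residual_grad = np.prod([branch_grad(x) + 1.0 for _ in range(10)])

print(plain_grad)     # ~1e-30, effectively zero
print(residual_grad)  # ~1.01, the gradient survives
```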

Architecture

Since ResNets can have variable sizes, depending on how big each layer of the model is and
how many layers it has, we follow the one described by the authors in the paper — ResNet-34 —
to outline the architecture of these networks. Note that the reduction between layers is
achieved by an increase in the stride, from 1 to 2, at the first convolution of each layer.

Table 4.1. Layers of ResNet

Summary

The ResNets built following the rules explained by the authors yield the structures shown below:

Table 4.2. Versions of ResNet

FLOWCHART

Figure 4.3. Flowchart

DATAFLOW DIAGRAM

Figure 4.4. Dataflow Diagram

USE-CASE DIAGRAM

Figure 4.5. Use-case diagram

SEQUENCE DIAGRAM

Figure 4.6. Sequence Diagram

STATE DIAGRAM

Figure 4.7. State Diagram

CHAPTER 5

IMPLEMENTATION

ACTIVITY RECOGNITION USING RESNET


Many researchers have made the classification more accurate by increasing the depth or width of the
network. Here, we chose to use a three-branch network rather than ResNet’s original single-branch
structure to expand its receptive field. The initial ResNet-50 contains 50 convolutions, which are divided
into 5 structurally similar stages to extract image features. Steps to implement the application:

1. Take video stream as input.


2. Process video and extract individual frames.
3. Pass the frame through trained human activity recognition model.
4. Compare the frames and predict the type of activity.
5. Classify the activity and caption it back on the individual frames.
6. Stream the video to the output.

Libraries required

1. NumPy - a Python library which supports large, multidimensional arrays and
matrices, along with a large collection of high-level mathematical functions to operate on
those arrays.

2. cv2 - the Python interface to OpenCV, used for real-time computer vision. It can
recognize faces, classify human actions, identify objects, track camera movements, track
moving objects, and extract 3D models of objects. In this application it recognizes the objects
appearing on the camera and tracks the objects and camera movements.

3. imutils - a Python library of convenience functions for OpenCV. It is also used here to
access the webcam of the system, which allows for a live stream.

In order to reduce the number of parameters while expanding the receptive field, we
introduced TridentNet with three branches into the backbone ResNet-50. TridentNet is introduced
into the 5th stage of ResNet-50 in this paper. Due to structural differences between modules in the
ResNet network itself, the improved trident module is also divided into a Conv-trident block and an
ID-trident block.

Figure 5.2. Improved version of the network structure of ResNet-50

The original TridentNet was used as part of an object detection network with a three-way
structure. We made some modifications to give it a single-branch output, and added a shortcut to
make it more in line with the ResNet configuration. In addition to adding a multi-branch
structure, TridentNet also uses the concept of dilated convolution. By adding zeros within the
convolution kernel, a large receptive field can be obtained with fewer parameters. When
the dilation rate is d and the original convolution kernel size is k, the dimension of the
receptive field n is given by n = k + (k - 1) * (d - 1). In the TridentNet structure, the kernel
size k of the three branches is 3, and the dilation rates d are 1, 2, and 3 respectively. Therefore,
the dimension of the receptive field n becomes 3, 5, and 7.
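The receptive-field formula can be checked directly with a small helper (a sketch for illustration, not part of the original implementation):

```python
def receptive_field(k, d):
    # Effective kernel size of a dilated convolution:
    # n = k + (k - 1) * (d - 1)
    return k + (k - 1) * (d - 1)

# The three TridentNet branches share k = 3 but use dilation rates 1, 2, 3
sizes = [receptive_field(3, d) for d in (1, 2, 3)]
print(sizes)  # [3, 5, 7]
```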
As noted earlier, the existing state-of-the-art 2D architectures (such as ResNet) can be extended to
video classification via 3D kernels.
The ImageNet dataset allowed models like these to be trained to high accuracy. Hence, these
architectures will be able to process and perform video classification by (1) tweaking the input
volume shape to include spatiotemporal knowledge, and (2) integrating the 3D kernels inside
the architecture.

DATA REPRESENTATION
Different data representations were used by the papers in the table below, depending on
the HAR applications and sensors. Inertial measurements recorded by sensors
integrated in IMUs are, generally, the ones deployed for solving HAR. Usually, more than 3 devices are
placed on the body, e.g., on the arms, legs, head, and torso. Differently, some authors recorded
acceleration measurements from just one device, located at the waist, and proposed
using the magnitude of the acceleration vector from the three components x, y, and z. Other
authors used the logarithmic magnitude of a two-dimensional Discrete Fourier Transform of the
IMU signals, and proposed using this magnitude as an image input for a CNN.

Table 5.3. Detailed structure of Surveys
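The last representation above — the log magnitude of a two-dimensional DFT of IMU signals used as an image-like CNN input — can be sketched in NumPy. The signal below is synthetic and the shapes are purely illustrative:

```python
import numpy as np

# Synthetic IMU recording: 64 time steps x 3 axes (x, y, z)
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 64)
signal = np.stack([np.sin(2 * np.pi * f * t) for f in (2, 5, 9)], axis=1)
signal += 0.1 * rng.standard_normal(signal.shape)

# 2D Discrete Fourier Transform, then log magnitude as an "image"
spectrum = np.fft.fft2(signal)
image = np.log1p(np.abs(spectrum))

print(image.shape)  # (64, 3): same grid as the input, now a CNN-ready map
```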

PRE-PROCESSING

Low- and high-pass filtering have been used for separating the acceleration components due to body
movements and gravity; this also eliminates noise. The body acceleration is then calculated
by subtracting the gravity component from the acceleration measurements. Also, here the video inputs
are processed and split into individual frames, so these frames can be passed on to a
trained model and then recognized.
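The gravity/body separation can be sketched with a simple first-order low-pass filter as the gravity estimator. The filter constant and the synthetic signal below are illustrative assumptions, not the filter actually used in the surveyed papers:

```python
import numpy as np

def lowpass(signal, alpha=0.9):
    # First-order IIR low-pass: tracks the slowly varying gravity component
    out = np.empty_like(signal)
    out[0] = signal[0]
    for i in range(1, len(signal)):
        out[i] = alpha * out[i - 1] + (1 - alpha) * signal[i]
    return out

# Synthetic vertical acceleration: constant gravity + fast body movement
t = np.linspace(0, 4, 400)
accel = 9.81 + 0.5 * np.sin(2 * np.pi * 5 * t)

gravity = lowpass(accel)
body = accel - gravity  # body acceleration = total - gravity estimate

# The gravity estimate stays near 9.81 while `body` keeps the oscillation
print(round(float(gravity.mean()), 2))  # ≈ 9.81
```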

SEGMENTATION

Segmentation is basically the extraction of a sequence of continuous measurements from the
preprocessed data that most likely reflects a human activity. In our project, the Human Activity
Recognition system, the sliding-window approach is the most general method used for creating
segments that will be processed by a classifier. In this approach, a window is moved over the
time-series data by a certain step to extract a segment. The step size is selected according to the
segmentation precision required, taking into account that short activities can be skipped if the
step is too large.
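The sliding-window approach described above can be sketched as follows; the window length and step size are arbitrary illustrative choices:

```python
import numpy as np

def sliding_windows(data, window, step):
    # Extract fixed-length segments from a (time, channels) series,
    # moving the window forward by `step` samples each time
    segments = []
    for start in range(0, len(data) - window + 1, step):
        segments.append(data[start:start + window])
    return np.stack(segments)

# 100 time steps of 3-axis sensor data
data = np.arange(300, dtype=np.float32).reshape(100, 3)

# 50% overlap: window of 20 samples, step of 10
segments = sliding_windows(data, window=20, step=10)
print(segments.shape)  # (9, 20, 3): 9 segments ready for the classifier
```

A smaller step gives more overlap and fewer missed short activities, at the cost of more segments to classify.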

FEATURE EXTRACTION

In pattern recognition methods, the extraction of features is a crucial stage. It permits representing
data in a compact manner. Features are divided into two major groups: statistical and application-
based. Time-domain features focus on the waveform characteristics, and frequency-domain features
focus on the periodic shape of the signal. We performed a complete survey of current methods
of human activity recognition. We discussed unimodal approaches and provided an internal
categorization of those methods, which have been developed for analysing gestures, atomic actions,
and more complicated activities.

Figure 5.4. Feature Extraction process

BUILDING MODEL

An explanation of the functions that perform the activity recognition follows.


Step 1: First of all, we need to import the OpenCV library, NumPy, imutils, argparse, and deque.

Ex: import cv2
import numpy as np
import imutils
import argparse
from collections import deque

Step 2: Construct the argument parser and parse the arguments.

We need to supply the path to the trained human activity recognition model, the path to
the class labels file, and an optional path to an input video file; if no input file is given,
the webcam is used instead.

Ex: ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
	help="path to trained human activity recognition model")
ap.add_argument("-c", "--classes", required=True,
	help="path to class labels file")
ap.add_argument("-i", "--input", type=str, default="",
	help="optional path to video file")
args = vars(ap.parse_args())

Step 3: Load the contents of the class labels file, then define the sample duration (i.e., the number of frames used for classification) and the sample size (i.e., the spatial dimensions of the frame).

1. Defining the class labels file


Ex: CLASSES = open(args["classes"]).read().strip().split("\n")

2. Defining the sample duration which is the duration of frames for classification.

Ex: SAMPLE_DURATION = 16

3. Defining the sample size which is the spatial dimensions of the frame.

Ex: SAMPLE_SIZE = 112

Step 4: We now initialize the frames queue used to store a rolling sample duration of frames.
Ex: frames = deque(maxlen=SAMPLE_DURATION)

Step 5: Load the human activity recognition model.

Ex: print("[INFO] loading human activity recognition model...")
net = cv2.dnn.readNet(args["model"])

Step 6: Grab a pointer to the input video stream.

Ex: print("[INFO] accessing video stream...")
vs = cv2.VideoCapture(args["input"] if args["input"] else 0)

Step 7: Loop over frames from the video stream.

Ex: while True:
    (grabbed, frame) = vs.read()
    if not grabbed:
        print("[INFO] no frame read from stream - exiting")
        break

Step 8: Resize the frame (to ensure faster processing) and add the frame to our queue.

Ex: frame = imutils.resize(frame, width=400)
frames.append(frame)

Step 9: If our queue is not filled to the sample duration, continue back to the top of the loop and keep polling/processing frames.

Ex: if len(frames) < SAMPLE_DURATION:
    continue

Step 10: Now that our frames queue is filled, we can construct our blob.

Ex: blob = cv2.dnn.blobFromImages(frames, 1.0,
    (SAMPLE_SIZE, SAMPLE_SIZE), (114.7748, 107.7354, 99.4750),
    swapRB=True, crop=True)
blob = np.transpose(blob, (1, 0, 2, 3))
blob = np.expand_dims(blob, axis=0)
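The transpose and expand_dims calls in this step rearrange the blob into the (batch, channels, frames, height, width) layout that the 3-D ResNet expects. A small NumPy sketch of the shape transformation, using dummy data in place of real frames:

```python
import numpy as np

SAMPLE_DURATION = 16   # number of frames per clip
SAMPLE_SIZE = 112      # spatial dimensions expected by the network

# cv2.dnn.blobFromImages returns one blob per frame: (frames, channels, H, W)
blob = np.zeros((SAMPLE_DURATION, 3, SAMPLE_SIZE, SAMPLE_SIZE), dtype=np.float32)

# Reorder to (channels, frames, H, W), then add a batch axis ->
# (batch, channels, frames, H, W), the layout consumed by the 3-D ResNet
blob = np.transpose(blob, (1, 0, 2, 3))
blob = np.expand_dims(blob, axis=0)
print(blob.shape)  # (1, 3, 16, 112, 112)
```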

Step 11: Pass the blob through the network to obtain our human activity recognition predictions.

Ex: net.setInput(blob)
outputs = net.forward()
label = CLASSES[np.argmax(outputs)]

Step 12: Draw the predicted activity on the frame.

Ex: cv2.rectangle(frame, (0, 0), (300, 40), (0, 0, 0), -1)
cv2.putText(frame, label, (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.8,
    (255, 255, 255), 2)

Step 13: Display the frame to our screen.

Ex: cv2.imshow("Activity Recognition", frame)
key = cv2.waitKey(1) & 0xFF

CONVOLUTIONAL NEURAL NETWORK


This algorithm is built using a convolutional neural network (the ResNet architecture) and the OpenCV library, which recognizes the activity being performed by the person in the video stream and labels it on screen. A convolutional neural network is a deep, feed-forward neural network used for analysing images. In this project, the ConvNet extracts features from the frames of the input video and finds patterns that are used to identify the activity being performed. Because the model uses 3D kernels, it captures the temporal relationship between consecutive frames in addition to the spatial content of each frame. OpenCV is used for capturing the video stream and displaying the labelled frames.
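The feature-extraction behaviour of a ConvNet can be illustrated with a minimal NumPy sketch of the 2-D convolution operation at its core (the real model applies learned 3-D kernels, but the principle is the same); the edge-detecting kernel below is an illustrative, hand-chosen filter, not a learned one:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation), the core ConvNet operation."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value is the kernel's response at one image location
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A horizontal-gradient kernel responds strongly at a vertical edge
image = np.zeros((5, 5))
image[:, 2:] = 1.0                # right half of the image is bright
kernel = np.array([[-1.0, 1.0]])  # simple horizontal difference filter
response = conv2d(image, kernel)  # peaks along the column where intensity jumps
```

In a trained network, many such kernels are learned from data and stacked in layers, so that early layers respond to edges and later layers to increasingly activity-specific patterns.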

LIBRARIES AND MODULES


The various libraries required by this algorithm for performing its tasks are as follows:

1. NumPy - used for performing the basic mathematical operations.

2. OpenCV - a Python library designed for solving computer vision problems.

3. TensorFlow - a deep learning Python library, which contains various sub-libraries and modules for solving image-related problems.

4. Keras - a high-level neural network API that runs on top of TensorFlow.

5. Sequential - the simplest Keras model type, consisting of a linear stack of layers.

CHAPTER 6

TESTING
Testing can be defined as the process of finding errors. The main purpose of testing is to find errors and discover weaknesses or faults in a product. It is a way of checking component functionality, assemblies, and the completely integrated product. It is the method used for ensuring that the software designed meets the user expectations and requirements, and that it does not fail in an unacceptable manner. There are different types of testing:

TYPES OF TESTING
1. UNIT TESTING
In unit testing, the test cases are designed to validate the internal program logic and to check that each input produces valid output. All the code flow and the decision branches are tested using unit testing. It is a structural testing technique and is done at the component level.
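As a concrete illustration of unit testing at the component level, the sketch below tests the rolling frame-queue behaviour our recognition loop relies on (a deque with maxlen discards the oldest frame once full); Python's built-in unittest framework is used here as one possible choice:

```python
import unittest
from collections import deque

SAMPLE_DURATION = 16  # rolling window size used by the recognition loop

class TestFrameQueue(unittest.TestCase):
    """Component-level tests for the rolling frame queue."""

    def test_queue_never_exceeds_sample_duration(self):
        frames = deque(maxlen=SAMPLE_DURATION)
        for i in range(SAMPLE_DURATION + 5):
            frames.append(i)
        self.assertEqual(len(frames), SAMPLE_DURATION)

    def test_oldest_frames_are_discarded_when_full(self):
        frames = deque(maxlen=SAMPLE_DURATION)
        for i in range(SAMPLE_DURATION + 1):
            frames.append(i)
        # Frame 0 was pushed out when frame 16 arrived
        self.assertEqual(frames[0], 1)
        self.assertEqual(frames[-1], SAMPLE_DURATION)
```

Saving this in a file and running `python -m unittest` executes both test cases.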

2. INTEGRATION TESTING

Integration testing is done after integrating the components, to check whether they function properly as one program. It is done to find out whether the component integration is consistent and correct. The main aim of integration testing is to find the issues that arise while combining the components.

3. SYSTEM TESTING

System testing is conducted on the complete product, the integrated system, to check whether it meets the user requirements. It does not require any inner knowledge of the code or implementation. System testing is done on the complete system in the context of an SRS (System Requirement Specification) or FRS (Functional Requirement Specification).

TEST CASES

1.

Serial Number of Test Case: Test Case 1
Test Case Name: Push-up detection
Description: When the program is executed, on analysing the video fed into the system, the activity performed in the video must be labelled as "push-up".
Output: The activity in the video input is recognized and labelled as "push-up".
Remark: Test successful

Table 6.1: Test case for detecting push-up

2.

Serial Number of Test Case: Test Case 2
Test Case Name: Human recognition
Description: When the program is executed and the video input is fed into the system, the system should recognize the human present in the video.
Output: The human in the video is recognized.
Remark: Test successful

Table 6.2: Test case for Human Detection

3.

Serial Number of Test Case: Test Case 3
Test Case Name: Predicting any activity performed
Description: When the program is executed, on feeding the system with the video input, the system should determine the type of activity being performed in the video.
Output: The activity was detected by the CNN algorithm and the type of activity is displayed on the screen.
Remark: Test successful

Table 6.3: Test case for Predicting activity performed

4.

Serial Number of Test Case: Test Case 4
Test Case Name: Detecting the activity of "throwing axe"
Description: When the program is executed, on analysing the video fed into the system, the activity performed in the video must be labelled as "throwing axe".
Output: The activity in the video input is recognized and labelled as "throwing axe".
Remark: Test successful

Table 6.4: Test case for detection of activity "throwing axe"

CHAPTER 7

SNAPSHOT
CODE SNIPPETS:

1.

Figure 7.1. Code Snippet 1

2.

Figure 7.2. Code Snippet 2

3.

Figure 7.3. Code Snippet 3

OUTPUT SNAPSHOTS -

GUI:

Figure 7.4. GUI of front-end

OUTPUT 1:

Figure 7.5. Output 1

OUTPUT 2:

Figure 7.6. Output 2

CHAPTER 8

CONCLUSION
In this project we implemented a Human Activity Recognition System. It is an application that recognizes the movements or activities performed by humans and labels them. Initially we did a literature survey on how to implement this application. Then we analysed the functional and non-functional requirements, and designed a few UML diagrams for a better understanding of the implementation. After the requirement and design analysis, we trained the model with 80% of the input data set and tested it with the remaining 20%. Once the model is trained, it detects and labels the activity being performed in the respective video input. The model is trained using a convolutional neural network (CNN), namely the ResNet architecture.

All in all, in this project we learned how to perform human activity recognition using OpenCV and deep learning. To complete this task, we used a human activity recognition model pretrained on the Kinetics dataset, which includes 400-700 human activity classes (depending on the version of the dataset) and over 300,000 video clips. The model makes use of the ResNet architecture with 3D kernels instead of the standard 2D filters, allowing it to include a temporal component for activity recognition.

CHAPTER 9

REFERENCES

I. T. Lan, Y. Wang, and G. Mori, "Discriminative figure-centric models for joint action localization and recognition," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2003-2010.

II. M. Raptis, I. Kokkinos, and S. Soatto, "Discovering discriminative action parts from mid-level video representations," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2012, pp. 1242-1249.

III. M. Jain, J. van Gemert, H. Jégou, P. Bouthemy, and C. G. M. Snoek, "Action localization with tubelets from motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 740-747.

IV. H. Zhang and O. Yoshie, "Improving human activity recognition using subspace clustering," in Proc. Int. Conf. Machine Learning and Cybernetics (ICMLC), vol. 3, Jul. 2012, pp. 1058-1063.

V. N. Robertson and I. Reid, "A general method for human activity recognition in video," Computer Vision and Image Understanding, vol. 104, no. 2, pp. 232-248, 2006.
