
Summer Training Report

On
“Data Science”

Submitted to Kurukshetra University in partial fulfilment


of the requirement for the award of the
Degree of
Bachelor of Technology
ELECTRONICS AND COMMUNICATION
ENGINEERING

Submitted by:
Buland
251701150, ECE-B, 7th Sem

Submitted to:
Mr. Puneet Bansal, Asst. Prof.

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING,


UNIVERSITY INSTITUTE OF ENGINEERING AND TECHNOLOGY,
KURUKSHETRA UNIVERSITY, KURUKSHETRA
DECLARATION

I hereby certify that the work presented in the report entitled
“Data Science”, in fulfilment of the requirement for completion of one-month
industrial training in the Department of Electronics and Communication
Engineering of “University Institute of Engineering and Technology,
Kurukshetra University”, is an authentic record of my own work carried out
during the industrial training.

Buland
251701150
ECE-B 7th sem.

ACKNOWLEDGEMENT

The work in this report is the outcome of continuous effort over a period of time
and drew intellectual support from Internshala and other sources. I would like to
express my profound gratitude and indebtedness to Internshala, which helped me
in the completion of the training. I am thankful to the Internshala Training Associates
for teaching and assisting me in making the training successful.

Buland
251701150
ECE-B 7th sem.
Introduction to Organization:
Internshala is an internship and online training platform based in Gurgaon,
India. Founded in 2010 by Sarvesh Agrawal, an IIT Madras alumnus, the
website helps students find internships with organisations in India. The
platform started in 2010 as a WordPress blog that aggregated internships across
India along with articles on education, technology and the skill gap. The website
was launched in 2013, and Internshala launched its online trainings in 2014. The
platform is used by 2 million+ students and 70,000+ companies. At the core
of the idea is the belief that internships, if managed well, can make a positive
difference to the student, the employer, and society at large. Hence, the
ad-hoc culture surrounding internships in India should and would change, and
Internshala aims to be the driver of this change.
About Training:
The Data Science Training by Internshala is a 6-week online training program
in which Internshala aims to provide a comprehensive introduction to data
science. In this training program, you learn the basics of Python, statistics,
predictive modeling, and machine learning. The program is built around video
tutorials and is packed with assignments, assessment tests, quizzes, and practice
exercises so that you get a hands-on learning experience. At the end of this
training program, you will have a solid understanding of data science and will be
able to build an end-to-end predictive model. For doubt clearing, you can post
your queries on the forum and get answers within 24 hours.
Table of Contents
Introduction to Organization

About Training

Module-1: Introduction to Data Science

1.1. Data Science Overview

Module-2: Python for Data Science

2.1. Introduction to Python


2.2. Understanding Operators
2.3. Variables and Data Types
2.4. Conditional Statements
2.5. Looping Constructs
2.6. Functions
2.7. Data Structure
2.8. Lists
2.9. Dictionaries
2.10. Understanding Standard Libraries in Python
2.11. Reading a CSV File in Python
2.12. Data Frames and basic operations with Data Frames
2.13. Indexing Data Frame
Module-3: Understanding the Statistics for Data Science

3.1. Introduction to Statistics


3.2. Measures of Central Tendency
3.3. Understanding the spread of data
3.4. Data Distribution
3.5. Introduction to Probability
3.6. Probabilities of Discrete and Continuous Variables
3.7. Central Limit Theorem and Normal Distribution
3.8. Introduction to Inferential Statistics
3.9. Understanding the Confidence Interval and margin of error
3.10. Hypothesis Testing
3.11. T tests
3.12. Chi Squared Tests
3.13. Understanding the concept of Correlation
Module-4: Predictive Modeling and Basics of Machine Learning

4.1. Introduction to Predictive Modeling


4.2. Understanding the types of Predictive Models
4.3. Stages of Predictive Models
4.4. Hypothesis Generation
4.5. Data Extraction
4.6. Data Exploration
4.7. Reading the data into Python
4.8. Variable Identification
4.9. Univariate Analysis for Continuous Variables
4.10. Univariate Analysis for Categorical Variables
4.11. Bivariate Analysis
4.12. Treating Missing Values
4.13. How to treat Outliers
4.14. Transforming the Variables
4.15. Basics of Model Building
4.16. Linear Regression
4.17. Logistic Regression
4.18. Decision Trees
4.19. K-means

Module-1: Introduction to Data Science


1.1. Data Science Overview
Data science is the study of data. Just as the biological sciences study living things
and the physical sciences study physical phenomena, data science studies data. Data
is real, data has real properties, and we need to study those properties if we are
going to work with data. Data science, as the name suggests, involves both data and
science.
It is a process, not an event. It is the process of using data to understand many
different things, to understand the world. Suppose you have a model or a proposed
explanation of a problem, and you try to validate that proposed explanation or model
with your data.
It is the skill of unfolding the insights and trends that are hiding (or abstract) behind
data. It is when you translate data into a story, and use that storytelling to generate
insight. With these insights, you can make strategic choices for a company or
an institution.
We can also define data science as a field concerned with the processes and systems
used to extract data of various forms from various resources, whether the data is
unstructured or structured.

Predictive Modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and
probability to forecast or estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to
purchase our new One AI software over the next 90 days.

Machine Learning:
Machine learning is a branch of artificial intelligence (AI) in which computers learn to
act and adapt to new data without being explicitly programmed to do so. The computer
is able to act independently of human interaction.

Forecasting:
Forecasting is the process of predicting or estimating future events based on past
and present data, most commonly by analysis of trends. "Guessing" doesn't cut
it. A forecast, unlike a prediction, must have logic to it; it must be defendable. This
logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a
broken watch is right twice a day.
Applications of Data Science:
Data science and big data are making an undeniable impact on businesses,
changing day-to-day operations, financial analytics, and especially interactions
with customers. It's clear that businesses can gain enormous value from the insights
data science can provide. But sometimes it's hard to see exactly how. So let's look
at some examples. In this era of big data, almost everyone generates masses of
data every day, often without being aware of it. This digital trace reveals the
patterns of our online lives. If you have ever searched for or bought a product on
a site like Amazon, you'll notice that it starts making recommendations related to
your search. This type of system known as a recommendation engine is a common
application of data science. Companies like Amazon, Netflix, and Spotify use
algorithms to make specific recommendations derived from customer preferences
and historical behavior. Personal assistants like Siri on Apple devices use data
science to devise answers to the infinite number of questions end users may ask.
Google watches your every move in the world: your online shopping habits and
your social media. Then it analyzes that data to create recommendations for
restaurants, bars, shops, and other attractions based on the data collected from
your device and your current location. Wearable devices like Fitbits, Apple Watches,
and Android watches add information about your activity levels, sleep patterns,
and heart rate to the data you generate. Now that we know how consumers
generate data, let's take a look at how data science is impacting business. In 2011,
McKinsey & Company said that data science was going to become the key basis of
competition. Supporting new waves of productivity, growth, and innovation. In
2013, UPS announced that it was using data from customers, drivers, and vehicles,
in a new route guidance system aimed to save time, money, and fuel. Initiatives like
this support the statement that data science will fundamentally change the way
businesses compete and operate. How does a firm gain a competitive advantage?
Let's take Netflix as an example. Netflix collects and analyzes massive amounts of
data from millions of users, including which shows people are watching and at what
time of day, when people pause, rewind, and fast-forward, and which shows, directors,
and actors they search for. Netflix can be confident that a show will be a hit before
filming even begins by analyzing users' preferences for certain directors and acting
talent, and discovering which combinations people enjoy. Add this to the success
of earlier versions of a show and you have a hit. For example, Netflix knew that many
of its users had streamed the work of David Fincher. They also knew that films
featuring Robin Wright had always done well, and that the British version of House
of Cards was very successful.
Netflix knew that significant numbers of people who liked Fincher also liked Wright.
All this information combined to suggest that buying the series would be a good
investment for the company.
Module-2: Python for Data Science

2.1. Introduction to Python


Python is a high-level, general-purpose, and very popular programming
language. Python (the latest version being Python 3) is used in web
development, machine learning applications, and other cutting-edge areas of
the software industry. Python is very well suited for beginners, and also for
experienced programmers coming from other programming languages such as
C++ and Java.

Below are some facts about Python Programming Language:

• Python is currently the most widely used multi-purpose, high-level


programming language.
• Python allows programming in Object-Oriented and Procedural paradigms.
• Python programs are generally smaller than programs written in other languages
like Java. Programmers have to type relatively less, and the indentation requirement
of the language keeps the code readable.
• Python is used by almost all tech giants, such as Google, Amazon, Facebook,
Instagram, Dropbox, and Uber.
• The biggest strength of Python is its huge collection of standard libraries, which can
be used for the following:
• Machine Learning
• GUI Applications (like Kivy, Tkinter, PyQt etc. )
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia
• Scientific computing
• Text processing and many more.
2.2. Understanding Operators
a. Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like
addition, subtraction, multiplication and division.

OPERATOR  DESCRIPTION                                                   SYNTAX
+         Addition: adds two operands                                   x + y
-         Subtraction: subtracts two operands                           x - y
*         Multiplication: multiplies two operands                       x * y
/         Division (float): divides the first operand by the second     x / y
//        Division (floor): divides the first operand by the second     x // y
%         Modulus: returns the remainder when the first operand is
          divided by the second                                         x % y
**        Power: returns the first operand raised to the power of the
          second                                                         x ** y
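As a minimal sketch (the values of x and y below are illustrative, not part of the training material), the arithmetic operators can be tried like this:

# Arithmetic operators in Python
x, y = 7, 3

print(x + y)   # 10           (addition)
print(x - y)   # 4            (subtraction)
print(x * y)   # 21           (multiplication)
print(x / y)   # 2.333...     (float division)
print(x // y)  # 2            (floor division)
print(x % y)   # 1            (modulus / remainder)
print(x ** y)  # 343          (power)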

b. Relational Operators:
Relational operators compare values. They return either True or False
according to the condition.

OPERATOR  DESCRIPTION                                                            SYNTAX
>         Greater than: True if the left operand is greater than the right       x > y
<         Less than: True if the left operand is less than the right             x < y
==        Equal to: True if both operands are equal                              x == y
!=        Not equal to: True if the operands are not equal                       x != y
>=        Greater than or equal to: True if the left operand is greater than
          or equal to the right                                                  x >= y
<=        Less than or equal to: True if the left operand is less than or
          equal to the right                                                     x <= y

c. Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT
operations.
OPERATOR DESCRIPTION SYNTAX

and Logical AND: True if both the operands are true x and y

or Logical OR: True if either of the operands is true x or y

not Logical NOT: True if operand is false not x
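A small illustrative sketch combining relational and logical operators (the variable names and values are assumptions for the example):

# Relational and logical operators
x, y = 5, 10

print(x > y)    # False
print(x <= y)   # True
print(x != y)   # True

is_positive = x > 0
is_even = x % 2 == 0
print(is_positive and is_even)  # False (x is positive but odd)
print(is_positive or is_even)   # True
print(not is_even)              # True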


d. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.
OPERATOR  DESCRIPTION          SYNTAX
&         Bitwise AND          x & y
|         Bitwise OR           x | y
~         Bitwise NOT          ~x
^         Bitwise XOR          x ^ y
>>        Bitwise right shift  x >> y
<<        Bitwise left shift   x << y

e. Assignment operators:
Assignment operators are used to assign values to the variables.

OPERATOR  DESCRIPTION                                                          SYNTAX
=         Assign the value of the right-side expression to the left-side
          operand                                                              x = y + z
+=        Add AND: add the right operand to the left operand and assign
          the result to the left operand                                       a += b   (a = a + b)
-=        Subtract AND: subtract the right operand from the left operand
          and assign the result to the left operand                            a -= b   (a = a - b)
*=        Multiply AND: multiply the right operand with the left operand
          and assign the result to the left operand                            a *= b   (a = a * b)
/=        Divide AND: divide the left operand by the right operand and
          assign the result to the left operand                                a /= b   (a = a / b)
%=        Modulus AND: take the modulus using the left and right operands
          and assign the result to the left operand                            a %= b   (a = a % b)
//=       Divide (floor) AND: floor-divide the left operand by the right
          operand and assign the result to the left operand                    a //= b  (a = a // b)
**=       Exponent AND: calculate the exponent (raise to power) using the
          operands and assign the value to the left operand                    a **= b  (a = a ** b)
&=        Perform bitwise AND on the operands and assign the value to the
          left operand                                                         a &= b   (a = a & b)
|=        Perform bitwise OR on the operands and assign the value to the
          left operand                                                         a |= b   (a = a | b)
^=        Perform bitwise XOR on the operands and assign the value to the
          left operand                                                         a ^= b   (a = a ^ b)
>>=       Perform a bitwise right shift on the operands and assign the
          value to the left operand                                            a >>= b  (a = a >> b)
<<=       Perform a bitwise left shift on the operands and assign the
          value to the left operand                                            a <<= b  (a = a << b)
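A brief sketch of the compound assignment operators in action (the starting value of a is chosen only for illustration):

# Assignment operators
a = 10
a += 5    # a = a + 5   -> 15
a -= 3    # a = a - 3   -> 12
a *= 2    # a = a * 2   -> 24
a //= 5   # a = a // 5  -> 4
a **= 2   # a = a ** 2  -> 16
print(a)  # 16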

f. Special operators: Python also has some special types of operators.

2.2.6.1 Identity operators: is and is not are the identity operators. Both are
used to check whether two values are located in the same part of memory. Two
variables that are equal do not necessarily imply that they are identical.
is – True if the operands are identical
is not – True if the operands are not identical

2.2.6.2 Membership operators:

in and not in are the membership operators; they are used to test whether a value
or variable is found in a sequence.
in – True if the value is found in the sequence
not in – True if the value is not found in the sequence

g. Precedence and Associativity of Operators:


Operator precedence and associativity determine the priorities of operators
in an expression.
2.2.7.1 Operator Precedence:
This is used in an expression with more than one operator of different
precedence, to determine which operation to perform first.
2.2.7.2 Operator Associativity:
If an expression contains two or more operators with the same precedence,
operator associativity is used to determine the order of evaluation. It can
be either left-to-right or right-to-left.
OPERATOR    DESCRIPTION                                                       ASSOCIATIVITY
()          Parentheses                                                       left-to-right
**          Exponent                                                          right-to-left
* / %       Multiplication / division / modulus                               left-to-right
+ -         Addition / subtraction                                            left-to-right
<< >>       Bitwise shift left, bitwise shift right                           left-to-right
< <= > >=   Relational less than / less than or equal to, greater than /
            greater than or equal to                                          left-to-right
== !=       Relational is equal to / is not equal to                          left-to-right

2.3. Variables and Data Types


Variables:
a. Python Variables Naming Rules:
There are certain rules for what you can name a variable (called an identifier).
• Python variables can only begin with a letter (A-Z/a-z) or an underscore (_).
• The rest of the identifier may contain letters (A-Z/a-z), underscores (_), and
numbers (0-9).
• Python is case-sensitive, and so are Python identifiers. Name and name are two
different identifiers.
b. Assigning and Reassigning Python Variables:
• To assign a value to a Python variable, you don't need to declare its type.
• You name it according to the rules stated in point (a) above, and type the value
after the equal sign (=).
• You can't put the identifier on the right-hand side of the equal sign.
• Nor can you assign a Python variable to a keyword.

c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement.
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
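A small sketch of these ideas (the names and values here are only illustrative):

# Assigning, reassigning, multiple assignment, and deleting variables
name = "Internshala"       # no type declaration needed
count = 10
count = count + 5          # reassignment

a, b, c = 1, 2.5, "three"  # multiple assignment in one statement
x = y = z = 0              # same value assigned to multiple variables

del name                   # delete a variable with the del keyword
print(a, b, c, x, count)   # 1 2.5 three 0 15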
Data Types:
A. Python Numbers:
There are four numeric Python data types.
a. int: int stands for integer. This Python data type holds signed integers. We can use
the type() function to find which class a value belongs to.
b. float: This Python data type holds floating-point real values. An int can only store
the number 3, but a float can store 3.25 if you want.
c. long: This Python data type holds a long integer of unlimited length. However, this
construct does not exist in Python 3.x.
d. complex: This Python data type holds a complex number. A complex number looks
like this: a+bj. Here, a is the real part, b is the imaginary coefficient, and j is the
imaginary unit.
B. Strings:
A string is a sequence of characters. Python does not have a char data type, unlike
C++ or Java. You can delimit a string using single quotes or double quotes.
a. Spanning a String Across Lines: to span a string across multiple lines, you can use
triple quotes.
b. Displaying Part of a String: you can display a character from a string using its index
in the string. Remember, indexing starts with 0.
c. String Formatters: string formatters allow us to print characters and values at once.
You can use the % operator.
d. String Concatenation: you can concatenate (join) strings using the + operator.
However, you cannot concatenate values of different types.
C. Python Lists:
A list is a collection of values. Remember, it may contain different types of values.
To define a list, you put values separated by commas in square brackets. You don't
need to declare a type for a list either.
a. Slicing a List: you can slice a list the way you'd slice a string, with the slicing
operator. Indexing for a list begins with 0, like for a string. Python doesn't have
built-in arrays.
b. Length of a List: Python supports an inbuilt function to calculate the length of a
list.
c. Reassigning Elements of a List: a list is mutable. This means that you can reassign
elements later on.
d. Iterating on the List: to iterate over a list we can use a for loop. By iterating, we
can access each element one by one, which is very helpful when we need to perform
some operation on each element of the list.
e. Multidimensional Lists: a list may have more than one dimension.
D. Python Tuples:
A tuple is like a list, but you declare it using parentheses instead.
a. Accessing and Slicing a Tuple: you access a tuple the same way you'd access a list.
The same goes for slicing it.
b. A Tuple is Immutable: a Python tuple is immutable. Once declared, you can't change
its size or elements.
E. Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated
by commas. Separate keys and values with a colon (:). The type() function works with
dictionaries too.
a. Accessing a Value: to access a value, you mention the key in square brackets.
b. Reassigning Elements: you can reassign a value to a key.
c. List of Keys: use the keys() function to get a list of keys in the dictionary.
F. Bool:
A Boolean value can be True or False.
G. Sets:
A set can hold a collection of values. Define it using curly braces. It keeps only one
instance of any value present more than once. However, a set is unordered, so it
doesn't support indexing. Also, it is mutable: you can change its elements or add
more, using the add() and remove() methods.
H. Type Conversion:
Since Python is dynamically typed, you may want to convert a value into another
type. Python supports a list of functions for this purpose:
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
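A quick illustrative tour of these data types (the sample values are assumptions made for the sketch):

# Core Python data types and type conversion
n = 3                       # int
pi = 3.25                   # float
z = 2 + 3j                  # complex
s = "data science"          # str
items = [1, "two", 3.0]     # list (mutable, mixed types)
point = (4, 5)              # tuple (immutable)
grades = {"math": 90, "python": 95}   # dict (key-value pairs)
unique = {1, 2, 2, 3}       # set -> {1, 2, 3}

print(type(n), type(s), type(items))
print(s[0], items[-1], grades["python"])   # indexing and key lookup
print(int("42") + float("1.5"))            # type conversion -> 43.5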

2.4. Conditional Statements


a. If statements
The if statement is one of the most commonly used conditional statements in most
programming languages. It decides whether certain statements need to be executed
or not. An if statement checks a given condition; if the condition is true, then the
set of code present inside the if block will be executed. The if condition evaluates
a Boolean expression and executes the block of code only when the Boolean expression
is TRUE.
Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if the condition is true

b. If-else statements
The statement itself tells that if a given condition is true then the statements
present inside the if block are executed, and if the condition is false then the else
block is executed. The else block executes only when the condition becomes false;
this is the block where you perform some action when the condition is not true.
The if-else statement evaluates the Boolean expression and executes the block of code
inside the if block if the condition is TRUE, and the block of code inside the else
block if the condition is FALSE.

Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if the condition is true
else:
    Block of code  # set of statements to execute if the condition is false

c. elif statements
In Python, we have one more conditional statement called the elif statement. The
elif statement is used to check multiple conditions, but only if the preceding if
condition is false. It is similar to an if-else statement; the only difference is that
in else we do not check a condition, whereas in elif we do.
Syntax:
if (condition):
    # set of statements to execute if the condition is true
elif (condition):
    # set of statements to execute when the if condition is false and the elif
    # condition is true
else:
    # set of statements to execute when both the if and elif conditions are false

d. Nested if-else statements

Nested if-else statements mean that an if statement or if-else statement is present
inside another if or if-else block. Python provides this feature as well, and it
helps us check multiple conditions in a given program. An if statement may be present
inside another if statement, which is itself present inside another if statement,
and so on.
Nested if syntax:
if (condition):
    # statements to execute if the condition is true
    if (condition):
        # statements to execute if the condition is true
    # end of nested if
# end of if

Nested if-else syntax:

if (condition):
    # statements to execute if the condition is true
    if (condition):
        # statements to execute if the condition is true
    else:
        # statements to execute if the condition is false
else:
    # statements to execute if the condition is false

e. elif Ladder
We have seen elif statements, but what is an elif ladder? As the name suggests, it is
a program that contains a ladder of elif statements, i.e. elif statements structured
in the form of a ladder. This statement is used to test multiple expressions.
Syntax:
if (condition):
    # set of statements to execute if the condition is true
elif (condition):
    # set of statements to execute when the if condition is false and this elif
    # condition is true
elif (condition):
    # set of statements to execute when the if and first elif conditions are false
    # and this elif condition is true
else:
    # set of statements to execute when all the if and elif conditions are false
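A small sketch of an elif ladder (the score value and grade boundaries are illustrative assumptions):

# if / elif / else in practice
score = 72

if score >= 90:
    grade = "A"
elif score >= 75:
    grade = "B"
elif score >= 60:
    grade = "C"
else:
    grade = "F"

print("Grade:", grade)   # Grade: C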

2.5. Looping Constructs


Loops:
a. while loop:
Repeats a statement or group of statements while a given condition is TRUE. It tests
the condition before executing the loop body.

Syntax:
while expression:
    statement(s)

b. for loop:
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)

c. nested loops:
You can use one or more loops inside any other while or for loop.
Syntax of a nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)

Syntax of a nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)

Loop Control Statements:

a. break statement:
Terminates the loop and transfers execution to the statement immediately following
the loop.
b. continue statement:
Causes the loop to skip the remainder of its body and immediately retest its
condition before the next iteration.
c. pass statement:
The pass statement in Python is used when a statement is required syntactically but
you do not want any command or code to execute.
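An illustrative sketch of loops together with break, continue, and pass (the list of numbers is an assumption for the example):

# for / while loops with break, continue, and pass
numbers = [3, 7, 10, 15, 22]

for n in numbers:
    if n % 2 == 0:
        continue          # skip even numbers
    if n > 14:
        break             # stop once a value exceeds 14
    print("odd:", n)      # prints 3, then 7

i = 0
while i < 3:
    pass                  # placeholder where real work would go
    i += 1
print("while loop ran", i, "times")   # while loop ran 3 times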

2.6. Functions

A. Built-in or pre-defined functions:
These are the functions that are already defined by Python, for example id(),
type(), print(), etc.

B. User-Defined Functions:
These are functions defined by the users themselves, for simplicity and to avoid
repetition of code. This is done using the def keyword.
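A short sketch contrasting a user-defined function with built-in ones (function and variable names are illustrative):

# A user-defined function alongside built-in functions
def square(x):
    """Return the square of x."""
    return x * x

values = [2, 5, 8]
squares = [square(v) for v in values]

print(squares)        # [4, 25, 64]   (user-defined function)
print(len(values))    # 3             (built-in function)
print(type(squares))  # <class 'list'>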

2.7. Data Structure


Python has implicit support for Data Structures which enable you to store and
access data. These structures are called List, Dictionary, Tuple and Set.
2.8. Lists
Lists in Python are the most versatile data structure. They are used to store
heterogeneous data items, from integers to strings or even another list! They are
also mutable, which means that their elements can be changed even after the list
is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets, with each item
separated by a comma. Since each element in a list has its own distinct position,
having duplicate values in a list is not a problem.
Accessing List elements
To access elements of a list, we use indexing. Each element in a list has an index
related to its position in the list. The first element of the list has the index 0,
the next element has index 1, and so on. The last element of the list has an index
of one less than the length of the list.
While positive indexes return elements from the start of the list, negative indexes
return values from the end of the list. This saves us from the trivial calculation
we would otherwise have to perform to return the nth element from the end of the
list. So instead of writing List_name[len(List_name)-1], we can simply write
List_name[-1].

Using negative indexes, we can return the nth element from the end of the list
easily. If we wanted to return the first element from the end, or the last index, the
associated index is -1. Similarly, the index for the second last element will be -2,
and so on. Remember, the 0th index will still refer to the very first element in the
list.
Appending values in Lists
We can add new elements to an existing list using the append() or insert() methods:
append() – adds an element to the end of the list
insert() – adds an element at a specific position in the list, which needs to be
specified along with the value
Removing elements from Lists
Removing elements from a list is as easy as adding them and can be done using
the remove() or pop() methods:
remove() – removes the first occurrence in the list that matches the given value
pop() – used when we want to remove an element at a specified index from the list;
if we don't provide an index value, the last element is removed from the list
Sorting Lists
A list can be sorted in ascending or descending order using the sort() method. When
comparing two strings, we compare the integer values of each character from the
beginning. If we encounter the same characters in both strings, we just compare the
next character until we find two differing characters.
Concatenating Lists
We can even concatenate two or more lists by simply using the + symbol. This will
return a new list containing elements from both the lists:
List comprehensions
A very interesting application of Lists is List comprehension which provides a neat
way of creating new lists. These new lists are created by applying an operation on
each element of an existing list. It will be easy to see their impact if we first check
out how it can be done using the good old for-loops.
Stacks & Queues using Lists
A list is an in-built data structure in Python. But we can use it to create user-defined
data structures. Two very popular user-defined data structures built using lists are
Stacks and Queues.
Stacks are a list of elements in which the addition or deletion of elements is done
from the end of the list. Think of it as a stack of books. Whenever you need to add
or remove a book from the stack, you do it from the top. It uses the simple concept
of Last-In-First-Out.
Queues, on the other hand, are a list of elements in which the addition of elements
takes place at the end of the list, but the deletion of elements takes place from the
front of the list. You can think of it as a queue in the real-world. The queue becomes
shorter when people from the front exit the queue. The queue becomes longer
when someone new joins the queue at the end. It uses the concept of
First-In-First-Out.
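A compact sketch of common list operations, including stack- and queue-like usage (the sample values are illustrative; the deque class is the standard-library helper often used for queues):

# Common list operations
fruits = ["apple", "banana", "mango"]

print(fruits[0], fruits[-1])      # apple mango  (positive and negative indexing)

fruits.append("grape")            # add to the end
fruits.insert(1, "orange")        # add at position 1
fruits.remove("banana")           # remove by value
last = fruits.pop()               # remove and return the last element ("grape")

squares = [n * n for n in range(5)]   # list comprehension -> [0, 1, 4, 9, 16]

stack = [1, 2, 3]
stack.append(4)                   # push
stack.pop()                       # pop -> Last-In-First-Out behaviour

from collections import deque
queue = deque([1, 2, 3])
queue.append(4)                   # enqueue at the end
queue.popleft()                   # dequeue from the front -> First-In-First-Out

print(fruits, squares)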
2.9. Dictionaries
A dictionary is another Python data structure that stores heterogeneous objects as
key-value pairs. The keys must be immutable, and items are looked up by key rather
than by position.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly } brackets,
with each key separated from its value by a colon, and each key-value pair separated
by a comma.

Using the key of an item, we can easily extract the associated value of the item.
Dictionaries are very useful for accessing items quickly because, unlike lists and
tuples, a dictionary does not have to iterate over all the items to find a value.
The dictionary uses the item's key to quickly find the item's value. This concept
is called hashing.

Accessing keys and values


You can access the keys from a dictionary using the keys() method and the values
using the values() method. These we can view using a for-loop or turn them into a
list using list():

We can even access these values simultaneously using the items() method which
returns the respective key and value pair for each element of the dictionary.
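A brief sketch of creating and using a dictionary (the subjects and marks are illustrative assumptions):

# Creating and using a dictionary
marks = {"python": 95, "statistics": 88, "ml": 91}

print(marks["python"])        # 95  (value lookup by key)
marks["statistics"] = 90      # reassign a value
marks["sql"] = 85             # add a new key-value pair

print(list(marks.keys()))     # ['python', 'statistics', 'ml', 'sql']
print(list(marks.values()))   # [95, 90, 91, 85]

for subject, score in marks.items():
    print(subject, "->", score)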

2.10. Understanding Standard Libraries in Python


Pandas
When it comes to data manipulation and analysis, nothing beats Pandas. It is the
most popular Python library, period. Pandas is written in the Python language
especially for manipulation and analysis tasks. Pandas provides features like:
• Dataset joining and merging
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• DataFrame objects to manipulate data, and much more!
NumPy
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in
functions to support large multi-dimensional arrays and matrices, along with
high-level mathematical functions to work with these arrays and matrices. NumPy
is an open-source library and has multiple contributors.
Matplotlib
Matplotlib is the most popular data visualization library in Python. It allows us to
generate and build plots of all kinds. This is my go-to library for exploring data
visually, along with Seaborn.

2.11. Reading a CSV File in Python


A CSV (Comma Separated Values) file is a form of plain text document which uses
a particular format to organize tabular information. CSV file format is a bounded
text document that uses a comma to distinguish the values. Every row in the
document is a data log. Each log is composed of one or more fields, divided by
commas. It is the most popular file format for importing and exporting
spreadsheets and databases.
• Using csv.reader(): First, the CSV file is opened using the open() method
in 'r' mode (read mode), which returns a file object. It is then read using
the reader() method of the csv module, which returns a reader object that
iterates over the lines of the specified CSV document.
Note: The 'with' keyword is used along with the open() method, as it simplifies
exception handling and automatically closes the CSV file.

import csv

# opening the CSV file
with open('Giants.csv', mode='r') as file:

    # reading the CSV file
    csvFile = csv.reader(file)

    # displaying the contents of the CSV file
    for lines in csvFile:
        print(lines)
• Using pandas.read_csv() method: It is very easy and simple to read a CSV file
using pandas library functions. Here the read_csv() method of the pandas library
is used to read data from CSV files.

import pandas

# reading the CSV file


csvFile = pandas.read_csv('Giants.csv')
# displaying the contents of the CSV file
print(csvFile)

2.12. Data Frames and basic operations with Data Frames


A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous
tabular data structure with labeled axes (rows and columns). A DataFrame is a
two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows
and columns. A Pandas DataFrame consists of three principal components: the data,
the rows, and the columns.
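A minimal sketch of creating a DataFrame and performing a few basic operations (the column names and values are illustrative assumptions):

# Creating a DataFrame and basic operations
import pandas as pd

data = {
    "name": ["Asha", "Ravi", "Meera"],
    "age": [21, 22, 20],
    "score": [88, 92, 79],
}
df = pd.DataFrame(data)

print(df.shape)                  # (3, 3)
print(df.head())                 # first rows
print(df["age"].mean())          # 21.0
print(df.sort_values("score"))   # sort rows by a column
df["passed"] = df["score"] > 80  # add a derived column
print(df)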

DataFrame Methods:

FUNCTION            DESCRIPTION

index()             Returns the index (row labels) of the DataFrame
insert()            Inserts a column into a DataFrame
add()               Returns addition of the DataFrame and other, element-wise (binary operator add)
sub()               Returns subtraction of the DataFrame and other, element-wise (binary operator sub)
mul()               Returns multiplication of the DataFrame and other, element-wise (binary operator mul)
div()               Returns floating division of the DataFrame and other, element-wise (binary operator truediv)
unique()            Extracts the unique values in the DataFrame
nunique()           Returns the count of unique values in the DataFrame
value_counts()      Counts the number of times each unique value occurs within the Series
columns()           Returns the column labels of the DataFrame
axes()              Returns a list representing the axes of the DataFrame
isnull()            Creates a Boolean Series for extracting rows with null values
notnull()           Creates a Boolean Series for extracting rows with non-null values
between()           Extracts rows where a column value falls within a predefined range
isin()              Extracts rows from a DataFrame where a column value exists in a predefined collection
dtypes()            Returns a Series with the data type of each column; the result's index is the original DataFrame's columns
astype()            Converts the data types in a Series
values()            Returns a NumPy representation of the DataFrame, i.e. only the values in the DataFrame are returned and the axes labels are removed
sort_values()       Sorts a DataFrame in ascending or descending order of a passed column
sort_index()        Sorts the values in a DataFrame based on their index positions or labels instead of their values; when a DataFrame is built from two or more DataFrames, the index can later be changed using this method
loc[]               Retrieves rows based on index label
iloc[]              Retrieves rows based on index position
ix[]                Retrieves DataFrame rows based on either index label or index position; combines the best features of the .loc[] and .iloc[] methods
rename()            Called on a DataFrame to change the names of the index labels or column names
columns()           An alternative attribute used to change the column names
drop()              Used to delete rows or columns from a DataFrame
pop()               Used to delete rows or columns from a DataFrame
sample()            Pulls out a random sample of rows or columns from a DataFrame
nsmallest()         Pulls out the rows with the smallest values in a column
nlargest()          Pulls out the rows with the largest values in a column
shape()             Returns a tuple representing the dimensionality of the DataFrame
ndim()              Returns an int representing the number of axes / array dimensions: 1 if Series, 2 if DataFrame
dropna()            Allows the user to analyze and drop rows/columns with null values in different ways
fillna()            Manages and lets the user replace NaN values with a value of their own
rank()              Values in a Series can be ranked in order with this method
query()             An alternate string-based syntax for extracting a subset from a DataFrame
copy()              Creates an independent copy of a pandas object
duplicated()        Creates a Boolean Series and uses it to extract rows that have duplicate values
drop_duplicates()   An alternative option for identifying duplicate rows and removing them through filtering

2.13. Indexing Data Frame


The df.iloc indexer allows us to retrieve rows and columns by position. In order to
do that, we need to specify the positions of the rows and the positions of the
columns that we want. The df.iloc indexer is very similar to df.loc, but it only
uses integer locations to make its selections.
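A short sketch contrasting label-based and position-based indexing (the DataFrame contents and index labels are illustrative assumptions):

# Label-based (loc) vs position-based (iloc) indexing
import pandas as pd

df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Chennai"], "population": [19.0, 20.4, 10.9]},
    index=["a", "b", "c"],
)

print(df.loc["b"])               # row selected by index label
print(df.iloc[1])                # the same row selected by integer position
print(df.iloc[0:2, 1])           # first two rows, second column (population)
print(df.loc["a":"b", "city"])   # label-based slice of the 'city' column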
Module-3: Understanding the Statistics for Data
Science

3.1 Introduction to Statistics


Statistics simply means numerical data, and it is the field of mathematics that deals
with the collection, tabulation, and interpretation of numerical data. It is a form of
mathematical analysis that uses quantitative models to produce a set of experimental
data or studies of real life. It is an area of applied mathematics concerned with data
collection, analysis, interpretation, and presentation. Statistics deals with how data
can be used to solve complex problems. Some people consider statistics to be a distinct
mathematical science rather than a branch of mathematics. Statistics makes work easy
and simple and provides a clear and clean picture of the work you do on a regular basis.
Basic terminology of Statistics:
• Population –
It is actually a collection of set of individuals or objects or events whose
properties are to be analyzed.
• Sample –
It is the subset of a population.
Types of Statistics: statistics is broadly divided into descriptive statistics and inferential statistics.
3.2 Measures of Central Tendency
(i) Mean:
It is the measure of the average of all values in a sample set.
(ii) Median:
It is the measure of the central value of a sample set. Here, the data set is ordered
from lowest to highest value and the exact middle value is then taken.
(iii) Mode:
It is the value that occurs most frequently in the sample set. The value repeated
most often in the set is the mode.
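As an illustrative sketch using Python's standard library (the sample values are assumptions):

# Mean, median, and mode
import statistics

sample = [2, 4, 4, 5, 7, 9]

print(statistics.mean(sample))    # 5.1666...
print(statistics.median(sample))  # 4.5
print(statistics.mode(sample))    # 4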

3.3 Understanding the spread of data


Measure of variability is also known as measure of dispersion and is used to
describe variability in a sample or population. In statistics, there are three
common measures of variability, as shown below:
(i) Range:
It is a measure of how spread apart the values in a data set are.
Range = Maximum value - Minimum value
(ii) Variance:
It describes how much a random variable differs from its expected value and is
computed as the average of the squared deviations:
S² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ - x̄)²
In this formula, n represents the total number of data points, x̄ the mean of the
data points, and xᵢ an individual data point.
(iii) Standard Deviation:
It is a measure of the dispersion of a set of data from its mean:
σ = √[ (1/n) ∑ᵢ₌₁ⁿ (xᵢ - μ)² ]
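A small sketch computing these measures with the standard library (the sample values are illustrative; pvariance and pstdev use the population formulas above):

# Range, variance, and standard deviation
import statistics

sample = [4, 8, 6, 5, 3, 7]

data_range = max(sample) - min(sample)
variance = statistics.pvariance(sample)   # (1/n) * sum((x - mean)^2)
std_dev = statistics.pstdev(sample)       # square root of the variance

print(data_range)   # 5
print(variance)     # 2.9166...
print(std_dev)      # 1.7078...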

3.4 Data Distribution


Terms related to Exploration of Data Distribution
-> Boxplot
-> Frequency Table
-> Histogram
-> Density Plot

Boxplot: It is based on the percentiles of the data, as shown in the figure below.
The top and bottom of the boxplot are the 75th and 25th percentiles of the data. The
extended lines are known as whiskers and cover the range of the rest of the data.
# BoxPlot - Population in Millions
# (assumes matplotlib, seaborn, and a DataFrame `data` with a PopulationInMillions column)
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)

ax1 = sns.boxplot(x = data.PopulationInMillions, orient = "v")
ax1.set_ylabel("Population by State in Millions", fontsize = 15)
ax1.set_title("Population - BoxPlot", fontsize = 20)

Frequency Table: It is a tool to distribute the data into equally spaced ranges
(segments) and tells us how many values fall in each segment.
Histogram: It is a way of visualizing the data distribution through a frequency
table, with bins on the x-axis and the data count on the y-axis.
Code – Histogram

# Histogram - Population in Millions (assumes the same `data` DataFrame as above)
fig, ax2 = plt.subplots()
fig.set_size_inches(9, 15)

ax2 = sns.distplot(data.PopulationInMillions, kde = False)
ax2.set_ylabel("Frequency", fontsize = 15)
ax2.set_xlabel("Population by State in Millions", fontsize = 15)
ax2.set_title("Population - Histogram", fontsize = 20)
Output :
Density Plot: It is related to the histogram, as it shows the data values distributed
as a continuous line. It is a smoothed version of the histogram. The output below is
the density plot superposed over the histogram.
Code – Density Plot for the data
# Density Plot - Population (assumes the same `data` DataFrame as above)
fig, ax3 = plt.subplots()
fig.set_size_inches(7, 9)

ax3 = sns.distplot(data.Population, kde = True)
ax3.set_ylabel("Density", fontsize = 15)
ax3.set_xlabel("Murder Rate per Million", fontsize = 15)
ax3.set_title("Density Plot - Population", fontsize = 20)
Output :

3.5 Introduction to Probability


Probability refers to the extent of occurrence of events. When an event occurs,
like throwing a ball or picking a card from a deck, then there must be some
probability associated with that event.
In terms of mathematics, probability refers to the ratio of wanted outcomes to the
total number of possible outcomes. There are three approaches to the theory of
probability, namely:
1. Empirical Approach
2. Classical Approach
3. Axiomatic Approach
Here, we study the Axiomatic Approach, in which we represent probability in terms
of the sample space (S) and other terms.
Basic Terminologies:
• Random Event – If an experiment is repeated several times under similar
conditions and it does not produce the same outcome every time, but the
outcome in a trial is one of several possible outcomes, then such an
experiment is called a random event or a probabilistic event.
• Elementary Event – The elementary event refers to the outcome of
each random event performed. Whenever the random event is
performed, each associated outcome is known as elementary event.
• Sample Space – Sample space refers to the set of all possible outcomes
of a random event. For example, when a coin is tossed, the possible
outcomes are head and tail.
• Event – An event refers to the subset of the sample space associated
with a random event.
• Occurrence of an Event – An event associated with a random event
is said to occur if any one of the elementary event belonging to it is
an outcome.
• Sure Event – An event associated with a random event is said to be
sure event if it always occurs whenever the random event is
performed.
• Impossible Event – An event associated with a random event is said
to be impossible event if it never occurs whenever the random event
is performed.
• Compound Event – An event associated with a random event is said
to be compound event if it is the disjoint union of two or more
elementary events.
• Mutually Exclusive Events – Two or more events associated with a random
event are said to be mutually exclusive if the occurrence of any one of
them prevents the occurrence of all the others. This means that no two
of these events can occur at the same time.
• Exhaustive Events – Two or more events associated with a random
event are said to be exhaustive events if their union is the sample
space.
Probability of an Event – If there are in total p possible outcomes associated with
a random experiment and q of them are favourable to the event A, then the
probability of event A is denoted by P(A) and is given by
P(A) = q / p

3.6 Probabilities of Discrete and Continuous Variables


A random variable is basically a function which maps from the sample space to the
set of real numbers. The purpose is to get an idea about the result of a particular
situation where we are given the probabilities of different outcomes.

Discrete Random Variable:


A random variable X is said to be discrete if it takes on a finite (or countable)
number of values. The probability function associated with it is called the PMF
(probability mass function).
P(xᵢ) = probability that X = xᵢ = PMF of X = pᵢ, where
1. 0 ≤ pᵢ ≤ 1
2. ∑ pᵢ = 1, where the sum is taken over all possible values of x.
Continuous Random Variable:
A random variable X is said to be continuous if it takes on an infinite number of
values. The probability function associated with it is called the PDF (probability
density function).
PDF: If X is a continuous random variable, then
P(x < X < x + dx) = f(x) dx, where
1. f(x) ≥ 0 for all x
2. ∫ f(x) dx = 1 over all values of x
Then f(x) is said to be the PDF of the distribution.
3.7 Central Limit Theorem and Normal Distribution
Whenever a random experiment is replicated, the random variable that equals
the average (or total) result over the replicates tends to have a normal
distribution as the number of replicates becomes large.
The normal distribution is one of the cornerstones of probability theory and
statistics, because of the role it plays in the Central Limit Theorem, and because
many real-world phenomena involve random quantities that are approximately normal
(e.g., errors in scientific measurement). It is also known by other names, such as
the Gaussian distribution or bell-shaped distribution.

It can be observed from the graph of the normal distribution that it is symmetric
about its center, which is also the mean (0 in the standard case). This makes
events at equal deviations from the mean equally probable. The density is highly
centered around the mean, which translates to lower probabilities for values away
from the mean.
Probability Density Function –
The probability density function of the general normal distribution is given as
f(x) = (1 / (σ√(2π))) e^( -(x - μ)² / (2σ²) )
In the above formula, σ is the standard deviation and μ is the mean. It is easy to
get overwhelmed by this formula at first glance, but we can break it down into
smaller pieces to get an intuition of what is going on.
The z-score is a measure of how many standard deviations away a data point is
from the mean. Mathematically,
z = (x - μ) / σ
The exponent in the formula above is -z²/2, i.e. the square of the z-score times
-1/2. This is in accordance with the observations made above: values away from the
mean have a higher z-score and consequently a lower probability, since the exponent
is negative, while the opposite is true for values closer to the mean. This gives
way to the 68-95-99.7 rule, which states that the percentages of values that lie
within bands around the mean with widths of two, four and six standard deviations
are 68%, 95% and 99.7% of all the values, respectively.
The effects of μ and σ on the distribution are as follows: μ is used to reposition
the center of the distribution and consequently move the graph left or right, and
σ is used to flatten or inflate the curve.
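A small simulation sketch of the Central Limit Theorem using only the standard library (the sample size, number of replicates, and uniform distribution are assumptions chosen for illustration):

# Central Limit Theorem: averages of non-normal samples look approximately normal
import random
import statistics

random.seed(0)

# Draw many samples from a (non-normal) uniform distribution on [0, 10]
sample_means = []
for _ in range(10_000):
    sample = [random.uniform(0, 10) for _ in range(30)]
    sample_means.append(statistics.mean(sample))

# The distribution of the sample means is approximately normal,
# centered near the population mean (5) with a small spread.
print(round(statistics.mean(sample_means), 3))   # close to 5.0
print(round(statistics.stdev(sample_means), 3))  # close to (10/sqrt(12))/sqrt(30) ≈ 0.527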
3.8 Introduction to Inferential Statistics
Inferential statistics makes inferences and predictions about a population based on
a sample of data taken from that population. It generalizes from a dataset and applies
probability to draw a conclusion. It is used to explain the meaning of descriptive
statistics, and to analyze and interpret results and draw conclusions. Inferential
statistics is mainly related to and associated with hypothesis testing, whose main
target is to decide whether to reject the null hypothesis.
Hypothesis testing is a type of inferential procedure that uses sample data to
evaluate and assess the credibility of a hypothesis about a population. Inferential
statistics is generally used to determine how strong a relationship is within the
sample, since it is often very difficult to obtain a full population list and draw a
truly random sample.
Types of inferential statistics –
Various types of inferential statistics are used widely nowadays and are very easy
to interpret. These are given below:
• One sample test of difference/One sample hypothesis test
• Confidence Interval
• Contingency Tables and Chi-Square Statistic
• T-test or Anova
3.9 Understanding the Confidence Interval and margin of error
In simple terms, a confidence interval is a range of values within which we are
fairly certain that the true value lies. The selection of a confidence level for the
interval determines the probability that the confidence interval will contain the
true parameter value. This range of values is generally used when dealing with
population-based data, to extract specific, valuable information with a certain
amount of confidence, hence the term 'Confidence Interval'.
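A minimal sketch of a 95% confidence interval for a mean, assuming a normal approximation with z = 1.96 (the sample values are illustrative; for very small samples a t-value would normally be used instead):

# 95% confidence interval for the mean
import math
import statistics

sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

z = 1.96                                        # z value for a 95% confidence level
margin_of_error = z * sem
lower, upper = mean - margin_of_error, mean + margin_of_error

print(f"mean = {mean:.3f}")
print(f"95% CI = ({lower:.3f}, {upper:.3f})")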

3.10 Hypothesis Testing


A hypothesis is a statement about the given problem. Hypothesis testing is a
statistical method used for making a statistical decision using experimental data.
Hypothesis testing is basically an assumption that we make about a population
parameter. It evaluates two mutually exclusive statements about a population to
determine which statement is best supported by the sample data.
Parameters of hypothesis testing
• Null hypothesis (H0): In statistics, the null hypothesis is a general statement
or default position that there is no relationship between two measured cases or
no difference among groups. In other words, it is a basic assumption made based
on knowledge of the problem.
Example: a company's production is 50 units per day.
• Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used
in hypothesis testing that is contrary to the null hypothesis.
Example: a company's production is not equal to 50 units per day.
• Level of significance:
It refers to the degree of significance at which we accept or reject the null
hypothesis. 100% accuracy is not possible when accepting a hypothesis, so we
select a level of significance, usually 5%. It is normally denoted by α (alpha)
and is generally 0.05 or 5%, which means your output should be 95% confident
of giving a similar kind of result in each sample.
• P-value:
The P-value, or calculated probability, is the probability of finding the
observed (or more extreme) results when the null hypothesis (H0) of the given
problem is true. If the P-value is less than the chosen significance level,
you reject the null hypothesis, i.e. you accept that your sample supports the
alternative hypothesis.
Errors in hypothesis testing
• Type I error: we reject the null hypothesis although it was true. The
probability of a Type I error is denoted by alpha (α).
• Type II error: we accept (fail to reject) the null hypothesis although it is
false. The probability of a Type II error is denoted by beta (β).

3.11 T tests
A t-test is a type of inferential statistic used to determine whether there is a
significant difference between the means of two groups, which may be related
in certain features.

If the t-value is large, the two samples are likely to come from different groups.
If the t-value is small, the two samples are likely to come from the same group.
There are three types of t-tests, and they are categorized as dependent and
independent t-tests:
1. Independent samples t-test: compares the means of two groups.
2. Paired sample t-test: compares means from the same group at different
times (say, one year apart).
3. One sample t-test: tests the mean of a single group against a known mean.
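An illustrative sketch of an independent samples t-test using SciPy (the two groups of values are made-up sample data, not from the training):

# Independent samples t-test
from scipy import stats

group_a = [23, 25, 28, 30, 27, 26]
group_b = [31, 33, 29, 35, 32, 34]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")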
3.12 Chi Squared Tests
The chi-square test is used for categorical features in a dataset. We calculate the
chi-square statistic between each feature and the target and select the desired number
of features with the best chi-square scores. It determines whether the association
between two categorical variables in the sample reflects their real association in
the population.
The chi-square score is given by
χ² = ∑ (Oᵢ - Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency.
3.13 Understanding the concept of Correlation


Correlation –
1. It shows whether and how strongly pairs of variables are related to each other.
2. Correlation takes values between -1 and +1, where values close to +1 represent a
strong positive correlation and values close to -1 represent a strong negative
correlation.
3. The variables may be directly (positively) or inversely (negatively) related to
each other.
4. It gives the direction and strength of the relationship between variables.
Formula (Pearson correlation coefficient) –
r = ∑ᵢ (xᵢ - x̄)(yᵢ - ȳ) / √[ ∑ᵢ (xᵢ - x̄)² ∑ᵢ (yᵢ - ȳ)² ]
Here,
x̄ and ȳ = means of the given samples
n = total number of samples
xᵢ and yᵢ = individual samples of the sets
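A brief sketch of computing correlation with pandas (the column names and values are illustrative assumptions):

# Correlation with pandas
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score":    [52, 55, 61, 68, 74, 80],
})

# Pearson correlation between the two columns
r = df["hours_studied"].corr(df["exam_score"])
print(round(r, 3))   # close to +1 -> strong positive correlation

print(df.corr())     # full correlation matrix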
Module-4: Predictive Modeling and Basics of Machine
Learning

4.1. Introduction to Predictive Modeling


Predictive analytics involves certain manipulations of data from existing data sets
with the goal of identifying new trends and patterns. These trends and patterns are
then used to predict future outcomes and trends. By performing predictive analysis,
we can predict future trends and performance. It is also referred to as prognostic
analysis; the word prognostic means prediction. Predictive analytics uses data,
statistical algorithms, and machine learning techniques to identify the probability
of future outcomes based on historical data.

4.2. Understanding the types of Predictive Models


Supervised learning
Supervised learning, as the name indicates, involves the presence of a supervisor
acting as a teacher. Basically, supervised learning is learning in which we teach
or train the machine using data that is well labeled, which means some data is
already tagged with the correct answer. After that, the machine is provided with a
new set of examples (data) so that the supervised learning algorithm analyses the
training data (set of training examples) and produces a correct outcome from the
labeled data.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled, allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences, without any prior training on the data.
4.3. Stages of Predictive Models
Steps To Perform Predictive Analysis:
Some basic steps should be performed in order to perform predictive analysis.
1. Define Problem Statement:
Define the project outcomes, the scope of the effort, objectives, identify the data
sets that are going to be used.
2. Data Collection:
Data collection involves gathering the necessary details required for the analysis.
It involves the historical or past data from an authorized source over which
predictive analysis is to be performed.
3. Data Cleaning:
Data Cleaning is the process in which we refine our data sets. In the process of
data cleaning, we remove un-necessary and erroneous data. It involves removing
the redundant data and duplicate data from our data sets.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it thoroughly
in order to identify some patterns or new outcomes from the data set. In this
stage, we discover useful information and conclude by identifying some patterns
or trends.
5. Build Predictive Model:
In this stage of predictive analysis, we use various algorithms to build predictive
models based on the patterns observed. It requires knowledge of python, R,
Statistics and MATLAB and so on. We also test our hypothesis using standard
statistic models.
6. Validation:
It is a very important step in predictive analysis. In this step, we check the
efficiency of our model by performing various tests. Here we provide sample
input sets to check the validity of our model. The model needs to be evaluated
for its accuracy in this stage.
7. Deployment:
In deployment, we make our model work in a real environment so that it helps in
everyday decision making and is available for use.
8. Model Monitoring:
Regularly monitor your models to check performance and ensure that the results
remain reliable. Monitoring means seeing how the model's predictions perform against
actual data sets.
4.4. Hypothesis Generation
A hypothesis is a function that best describes the target in supervised machine
learning. The hypothesis that an algorithm comes up with depends upon the data
and also upon the restrictions and bias that we have imposed on the data.
The hypothesis space is the set of all candidate functions the algorithm is allowed to
choose from; for example, if we restrict ourselves to straight lines, then every possible
line through the data is one hypothesis in that space.
4.5. Data Extraction


In general terms, “Mining” is the process of extraction of some valuable material
from the earth e.g. coal mining, diamond mining etc. In the context of computer
science, “Data Mining” refers to the extraction of useful information from a bulk
of data or data warehouses. One can see that the term itself is a little bit
confusing. In case of coal or diamond mining, the result of extraction process is
coal or diamond. But in case of Data Mining, the result of extraction process is
not data!! Instead, the result of data mining is the patterns and knowledge that
we gain at the end of the extraction process. In that sense, Data Mining is also
known as Knowledge Discovery or Knowledge Extraction.
Data Mining as a whole process
The whole process of Data Mining comprises three main phases:
1. Data Pre-processing – Data cleaning, integration, selection and transformation
takes place
2. Data Extraction – Occurrence of exact data mining
3. Data Evaluation and Presentation – Analyzing and presenting results
4.6. Data Exploration
Steps of Data Exploration and Preparation
Remember, the quality of your inputs decides the quality of your output. So, once
you have got your business hypothesis ready, it makes sense to spend a lot of time
and effort here. By my personal estimate, data exploration, cleaning and
preparation can take up to 70% of your total project time.
Below are the steps involved to understand, clean and prepare your data for
building your predictive model:
• Variable Identification
• Univariate Analysis
• Bi-variate Analysis
• Missing values treatment
• Outlier treatment
• Variable transformation
• Variable creation
Finally, we will need to iterate over steps 4–7 (missing values treatment through
variable creation) multiple times before we come up with our refined model.

4.7. Reading the data into Python


Python provides inbuilt functions for creating, writing and reading files. There are
two types of files that can be handled in python, normal text files and binary files
(written in binary language, 0s and 1s).
• Text files: In this type of file, each line of text is terminated with a special character
called EOL (End of Line), which is the newline character (‘\n’) in Python by default.
• Binary files: In this type of file, there is no terminator for a line and the data is
stored after converting it into machine-understandable binary language.

Access modes govern the type of operations possible in the opened file. It refers
to how the file will be used once it’s opened. These modes also define the location
of the File Handle in the file. File handle is like a cursor, which defines from where
the data has to be read or written in the file. Different access modes for reading a
file are –
1. Read Only (‘r’): Opens a text file for reading. The handle is positioned at the
beginning of the file. If the file does not exist, an I/O error is raised. This is also the
default mode in which a file is opened.
2. Read and Write (‘r+’): Opens the file for reading and writing. The handle is
positioned at the beginning of the file. Raises an I/O error if the file does not exist.
3. Append and Read (‘a+’): Opens the file for reading and writing. The file is created
if it does not exist. The handle is positioned at the end of the file, so the data being
written will be inserted at the end, after the existing data.
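A minimal sketch of reading data into Python, assuming a hypothetical file named data.csv in the working directory: the built-in open() function reads the raw text, while pandas (commonly used for data science work) parses it straight into a DataFrame.

# Hedged sketch: reading a (hypothetical) data.csv into Python.
import pandas as pd

# 1) Plain text-file handling with the built-in open() function.
with open("data.csv", "r") as f:       # 'r' = read only, handle at start of file
    first_line = f.readline()          # reads a single line, ending in '\n'
    print(first_line.strip())

# 2) Reading the same file into a pandas DataFrame for analysis.
df = pd.read_csv("data.csv")           # parses the CSV into rows and columns
print(df.head())                       # first five rows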

4.8. Variable Identification


First, identify Predictor (Input) and Target (output) variables. Next, identify the
data type and category of the variables.

Example:- Suppose we want to predict whether students will play cricket or
not, given a data set of student attributes. Here you need to identify the predictor
variables, the target variable, the data type of the variables and the category of the variables.

Below, the variables can be grouped under different categories (predictor vs. target,
categorical vs. continuous):


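As an illustrative sketch (the columns below are invented to mirror the cricket example, not the report's actual data set), pandas makes it easy to separate predictors from the target and to inspect each variable's data type:

# Hedged sketch: identifying predictor/target variables and their data types.
import pandas as pd

# A made-up student data set for illustration only.
df = pd.DataFrame({
    "gender":        ["M", "F", "M", "F"],
    "age":           [14, 15, 14, 16],
    "marks":         [55.0, 72.5, 61.0, 80.0],
    "plays_cricket": ["Y", "N", "Y", "N"],
})

target = "plays_cricket"
predictors = [c for c in df.columns if c != target]

print(predictors)     # ['gender', 'age', 'marks']
print(df.dtypes)      # object (categorical) vs. int64 / float64 (continuous)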
4.9. Univariate Analysis for Continuous Variables
Continuous Variables:- In the case of continuous variables, we need to understand
the central tendency and spread of the variable. These are measured using various
statistical metrics (such as mean, median, mode, range, quartiles and standard
deviation) and visualization methods (such as histograms and box plots).
Note: Univariate analysis is also used to highlight missing and outlier values. In the
upcoming sections, we will look at methods to handle missing and outlier values.
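A minimal sketch of univariate analysis for a continuous variable (the 'marks' values below are assumptions):

# Hedged sketch: univariate analysis of a continuous variable.
import pandas as pd
import matplotlib.pyplot as plt

marks = pd.Series([55.0, 72.5, 61.0, 80.0, 68.0, 95.0])   # made-up values

print(marks.describe())        # count, mean, std, min, quartiles, max
print(marks.skew())            # skewness of the distribution

marks.plot(kind="hist")        # histogram to see the spread
plt.show()
marks.plot(kind="box")         # box plot highlights potential outliers
plt.show()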

4.10. Univariate Analysis for Categorical Variables


For categorical variables, we’ll use a frequency table to understand the distribution of
each category. We can also read it as the percentage of values under each category. It
can be measured using two metrics, Count and Count%, against each category. A
bar chart can be used as the visualization.
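A short sketch for a categorical variable (the 'gender' values are assumptions), showing Count, Count% and a bar chart:

# Hedged sketch: frequency table and bar chart for a categorical variable.
import pandas as pd
import matplotlib.pyplot as plt

gender = pd.Series(["M", "F", "M", "F", "M", "M"])        # made-up values

counts = gender.value_counts()                            # Count per category
percent = gender.value_counts(normalize=True) * 100       # Count%
print(pd.DataFrame({"Count": counts, "Count%": percent.round(1)}))

counts.plot(kind="bar")
plt.show()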
4.11. Bivariate Analysis
Bi-variate Analysis finds out the relationship between two variables. Here, we look
for association and disassociation between variables at a pre-defined significance
level. We can perform bi-variate analysis for any combination of categorical and
continuous variables. The combination can be: Categorical & Categorical,
Categorical & Continuous and Continuous & Continuous. Different methods are
used to tackle these combinations during analysis process.
Continuous & Continuous: While doing bi-variate analysis between two
continuous variables, we should look at scatter plot. It is a nifty way to find out the
relationship between two variables. The pattern of scatter plot indicates the
relationship between variables. The relationship can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate
the strength of the relationship between them. To find the strength of the relationship,
we use Correlation. Correlation varies between -1 and +1.

• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: no correlation

Correlation can be derived using following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))

Various tools have functions or functionality to identify the correlation between
variables. In Excel, the function CORREL() is used to return the correlation between two
variables, and SAS uses the procedure PROC CORR to identify the correlation. These
functions return the Pearson correlation value to identify the relationship between two
variables. For example, a value of 0.65 would indicate a good positive relationship between
X and Y.
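In Python, a hedged equivalent of these functions (the values below are made up for illustration) is pandas' corr() method, which also returns the Pearson correlation:

# Hedged sketch: Pearson correlation between two continuous variables in pandas.
import pandas as pd

df = pd.DataFrame({
    "X": [10, 20, 30, 40, 50],
    "Y": [12, 24, 33, 30, 48],
})

print(df["X"].corr(df["Y"]))   # single Pearson correlation coefficient
print(df.corr())               # full correlation matrix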

Categorical & Categorical: To find the relationship between two categorical
variables, we can use the following methods:

• Two-way table: We can start analyzing the relationship by creating a two-way
table of count and count%. The rows represent the categories of one variable and
the columns represent the categories of the other variable. We show the count or
count% of observations available in each combination of row and column
categories.
• Stacked Column Chart: This method is more of a visual form of Two-way table.

• Chi-Square Test: This test is used to derive the statistical significance of the
relationship between the variables. It also tests whether the evidence in the sample
is strong enough to generalize the relationship to a larger population.
Chi-square is based on the difference between the expected and observed
frequencies in one or more categories of the two-way table. It returns the probability
for the computed chi-square distribution with the given degrees of freedom.
Probability of 0: it indicates that both categorical variables are dependent.
Probability of 1: it shows that both variables are independent.
Probability less than 0.05: it indicates that the relationship between the variables is
significant at 95% confidence. The chi-square test statistic for a test of
independence of two categorical variables is found by:
Chi-square = Σ [ (O − E)² / E ]
where O represents the observed frequency and E is the expected
frequency under the null hypothesis, computed by:
E = (row total * column total) / sample size

From the previous two-way table, the expected count for product category 1 of
small size is 0.22. It is derived by taking the row total for Size (9) times the column
total for Product category (2) and then dividing by the sample size (81). This
procedure is conducted for each cell. Statistical measures used to analyze the
power of the relationship are:

• Cramer’s V for nominal categorical variables
• Mantel-Haenszel Chi-Square for ordinal categorical variables.

Different data science languages and tools have specific methods to perform the
chi-square test. In SAS, we can use Chisq as an option with Proc freq to perform
this test.
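In Python, a hedged equivalent (the two-way counts below are invented purely for illustration) is scipy.stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom and the expected frequencies:

# Hedged sketch: chi-square test of independence on a made-up two-way table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: gender (M, F); columns: plays cricket (Yes, No) -- illustrative counts.
observed = np.array([[30, 10],
                     [15, 25]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)       # expected counts under the independence hypothesis
# p_value < 0.05 would suggest the two variables are related at 95% confidence.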

Categorical & Continuous: While exploring the relation between categorical and
continuous variables, we can draw box plots for each level of the categorical variable.
If the number of levels is small, the plots alone will not show statistical significance. To check
statistical significance we can perform a Z-test, T-test or ANOVA.
• Z-Test / T-Test:- Either test assesses whether the means of two groups are statistically
different from each other or not. If the probability of Z is small, then the difference
between the two averages is more significant. The T-test is very similar to the Z-test,
but it is used when the number of observations for both categories is less than 30.
• ANOVA:- It assesses whether the averages of more than two groups are statistically
different.
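A hedged sketch of a two-sample t-test (and the equivalent call for ANOVA) with SciPy; the group values are assumptions for illustration:

# Hedged sketch: comparing the means of two groups with a t-test.
from scipy import stats

group_a = [23, 25, 28, 30, 27, 26]   # metric for category A (made up)
group_b = [31, 29, 35, 33, 30, 34]   # metric for category B (made up)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
# A small p-value (e.g. < 0.05) suggests the group means differ significantly.

# For more than two groups, a one-way ANOVA plays the same role, e.g.:
# f_stat, p = stats.f_oneway(group_a, group_b, group_c)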

4.12. Treating Missing Values


Why is missing values treatment required?
Missing data in the training data set can reduce the power / fit of a model or can
lead to a biased model because we have not analysed the behaviour and relationship
with other variables correctly. It can lead to wrong predictions or classifications.

Consider, for example, a small data set of gender versus playing cricket that contains
missing gender values. If we do not treat the missing values, the inference from the
data set may be that the chances of playing cricket are higher for males than for females.
On the other hand, after treating the missing values (based on gender), the same data
may show that females have a higher chance of playing cricket compared to males.

Why does my data have missing values?

We looked at the importance of treatment of missing values in a dataset. Now, let’s


identify the reasons for occurrence of these missing values. They may occur at two
stages:

1. Data Extraction: It is possible that there are problems with the extraction process. In
such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure data extraction is correct. Errors
at the data extraction stage are typically easy to find and can be corrected easily as
well.
2. Data collection: These errors occur at the time of data collection and are harder to
correct. They can be categorized into four types:
o Missing completely at random: This is a case when the probability of a missing
value is the same for all observations. For example: respondents of a data collection
process decide whether they will declare their earnings after tossing a fair coin. If a
head occurs, the respondent declares his / her earnings and vice versa. Here each
observation has an equal chance of having a missing value.
o Missing at random: This is a case when a variable is missing at random and the
missing ratio varies for different values / levels of other input variables. For example: we
are collecting data for age, and females have a higher missing rate compared to males.
o Missing that depends on unobserved predictors: This is a case when the missing
values are not random and are related to the unobserved input variable. For
example: In a medical study, if a particular diagnostic test causes discomfort, then there
is a higher chance of dropping out of the study. This missing value is not at random
unless we have included “discomfort” as an input variable for all patients.
o Missing that depends on the missing value itself: This is a case when the
probability of missing value is directly correlated with missing value itself. For
example: People with higher or lower income are likely to provide non-response to
their earning.

Which are the methods to treat missing values?
1. Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.
o In list wise deletion, we delete observations where any of the variables is missing.
Simplicity is one of the major advantages of this method, but it reduces the power of
the model because it reduces the sample size.
o In pair wise deletion, we perform analysis with all cases in which the variables of
interest are present. The advantage of this method is that it keeps as many cases as
possible available for analysis. One of its disadvantages is that it uses different sample
sizes for different variables.
o Deletion methods are used when the nature of missing data is “Missing
completely at random”; otherwise, non-random missing values can bias the model
output.
2. Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean / Mode / Median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by the
mean or median (quantitative attribute) or mode (qualitative attribute) of all known
values of that variable. It can be of two types:-
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing
values of that variable and then replace the missing values with that mean or median.
For example, if the variable “Manpower” has missing values, we take the average of all its
non-missing values (say 28.33) and replace each missing value with it.
o Similar case Imputation: In this case, we calculate the average for gender “Male”
(say 29.75) and “Female” (say 25) individually over the non-missing values and then replace
the missing values based on gender. For “Male”, we replace missing values of
manpower with 29.75 and for “Female” with 25. (A short Python sketch of these
imputation methods follows after this list.)
3. Prediction Model: A prediction model is one of the more sophisticated methods for
handling missing data. Here, we create a predictive model to estimate values that
will substitute the missing data. In this case, we divide our data set into two sets:
one set with no missing values for the variable and another one with missing
values. The first data set becomes the training data set of the model, while the second data
set with missing values is the test data set, and the variable with missing values is treated as
the target variable. Next, we create a model to predict the target variable based on other
attributes of the training data set and populate the missing values of the test data set. We
can use regression, ANOVA, logistic regression and various other modeling techniques to
do this. There are 2 drawbacks to this approach:
o The model estimated values are usually more well-behaved than the true
values
o If there are no relationships with attributes in the data set and the attribute
with missing values, then the model will not be precise for estimating
missing values.
4. KNN Imputation: In this method of imputation, the missing values of an attribute
are imputed using the given number of observations that are most similar to the
observation whose values are missing. The similarity of two observations is determined
using a distance function. The method has certain advantages and
disadvantages.
o Advantages:
▪ k-nearest neighbour can predict both qualitative & quantitative attributes
▪ Creation of a predictive model for each attribute with missing data is not required
▪ Attributes with multiple missing values can be easily treated
▪ Correlation structure of the data is taken into consideration
o Disadvantages:
▪ The KNN algorithm is very time-consuming when analyzing large databases. It searches
through the whole dataset looking for the most similar instances.
▪ The choice of the k-value is very critical. A higher value of k would include attributes which
are significantly different from what we need, whereas a lower value of k implies
missing out on significant attributes.
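As a hedged illustration of the imputation methods above (the gender and 'Manpower' values are made up to mirror the example), pandas covers generalized and similar-case imputation, while scikit-learn's KNNImputer covers the KNN approach:

# Hedged sketch: mean, group-wise and KNN imputation of missing values.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Gender":   ["M", "M", "F", "F", "M", "F"],
    "Manpower": [30, 28, 25, np.nan, np.nan, 26],
})

# Generalized imputation: fill with the overall mean of the column.
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar-case imputation: fill with the mean of the same gender group.
df["Manpower_sim"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean"))

# KNN imputation: each missing entry is estimated from the most similar rows.
numeric = df[["Manpower"]].assign(is_male=(df["Gender"] == "M").astype(int))
print(KNNImputer(n_neighbors=2).fit_transform(numeric))
print(df)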

4.13. How to treat Outliers


Outlier is a term commonly used by analysts and data scientists, as outliers need
close attention, else they can result in wildly wrong estimations. Simply speaking, an
outlier is an observation that appears far away and diverges from the overall pattern
in a sample.
The most commonly used method to detect outliers is visualization. We use various
visualization methods, like box plots, histograms and scatter plots. Some analysts also
use various thumb rules to detect outliers; some of them are listed below (a short
Python sketch of the IQR rule follows this list):

• Any value which lies below Q1 − 1.5 x IQR or above Q3 + 1.5 x IQR
• Use capping methods. Any value which is out of the range of the 5th and 95th percentiles
can be considered an outlier
• Data points three or more standard deviations away from the mean are considered
outliers
• Outlier detection is merely a special case of the examination of data for influential
data points and it also depends on the business understanding
• Bivariate and multivariate outliers are typically measured using either an index of
influence or leverage, or distance. Popular indices such as Mahalanobis’ distance
and Cook’s D are frequently used to detect outliers.
• In SAS, we can use PROC Univariate, PROC SGPLOT. To identify outliers and
influential observation, we also look at statistical measure like STUDENT, COOKD,
RSTUDENT and others.
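A hedged sketch of the IQR thumb rule in pandas (the sample values are invented; the 1.5 multiplier follows the rule above):

# Hedged sketch: flagging outliers with the IQR rule.
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 50])   # 50 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers)        # values falling outside [lower, upper]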

How to remove Outliers?

Most of the ways to deal with outliers are similar to the methods for missing values,
like deleting observations, transforming them, binning them, treating them as a
separate group, imputing values and other statistical methods. Here, we will discuss
the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to a data entry error or a
data processing error, or if the outlier observations are very few in number. We can also use
trimming at both ends to remove outliers.

Transforming and binning values: Transforming variables can also eliminate
outliers. The natural log of a value reduces the variation caused by extreme values.
Binning is also a form of variable transformation. The Decision Tree algorithm deals
with outliers well because of the binning of variables. We can also use the process of
assigning weights to different observations.

Imputing: Like imputation of missing values, we can also impute outliers. We can
use mean, median or mode imputation methods. Before imputing values, we should
analyse whether it is a natural or an artificial outlier. If it is artificial, we can go ahead with
imputing values. We can also use a statistical model to predict the values of outlier
observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them
separately in the statistical model. One approach is to treat them as
two different groups, build an individual model for each group and then combine
the output.

4.14. Transforming the Variables


• When we want to change the scale of a variable or standardize its values for
better understanding. This transformation is a must if you have data on different
scales, but it does not change the shape of the variable's distribution.

• When we want to transform complex non-linear relationships into linear
relationships. The existence of a linear relationship between variables is easier to
comprehend compared to a non-linear or curved relation, and transformation helps us to
convert a non-linear relation into a linear relation. A scatter plot can be used to find the
relationship between two continuous variables. These transformations also improve
the prediction. Log transformation is one of the transformation techniques commonly
used in these situations.

• A symmetric distribution is preferred over a skewed distribution as it is easier to
interpret and generate inferences from. Some modeling techniques require a normal
distribution of variables. So, whenever we have a skewed distribution, we can use
transformations which reduce skewness. For a right-skewed distribution, we take the
square / cube root or logarithm of the variable, and for a left-skewed one, we take the
square / cube or exponential of the variable.
• Variable transformation is also done from an implementation point of view (human
involvement). Let's understand it more clearly. In one of my projects on employee
performance, I found that age has a direct correlation with the performance of the
employee, i.e. the higher the age, the better the performance. From an implementation
standpoint, launching an age-based programme might present an implementation challenge.
However, categorizing the sales agents into three age-group buckets of <30 years, 30-
45 years and >45 years, and then formulating three different strategies for each group, is a
judicious approach. This categorization technique is known as Binning of Variables. (A short
sketch of log transformation and binning follows this list.)
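A hedged sketch of two of the transformations described above (the age and income values are assumptions): a log transform to tame a right-skewed variable and pd.cut for binning ages into the three buckets mentioned earlier:

# Hedged sketch: log transformation and binning of variables.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [24, 33, 41, 52, 29, 47],
    "income": [20000, 35000, 52000, 250000, 30000, 90000],   # right-skewed
})

# Log transform reduces the influence of extreme income values.
df["log_income"] = np.log(df["income"])

# Binning: three age buckets, as in the example above.
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 120],
                         labels=["<30", "30-45", ">45"])
print(df)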

4.15. Basics of Model Building


Lifecycle of Model Building –
• Define success
• Explore data
• Condition data
• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain
Data exploration is used to figure out the gist of the data and to develop a first-pass
assessment of its quality, quantity, and characteristics. Visualization techniques can
also be applied. However, this can be a difficult task in high-dimensional spaces with
many input variables. In the conditioning of data, we group the functional data to which
the modeling techniques are applied, and then rescaling is done; in some cases
rescaling is an issue if variables are coupled. Variable selection is very important to
develop a quality model.
This process is implicitly model-dependent, since it is used to configure which
combination of variables should be used in ongoing model development. Data
balancing is to partition data into appropriate subsets for training, test, and
validation. Model building focuses on the desired algorithms. One well-known
technique is symbolic regression; other techniques can also be preferred.

4.16. Linear Regression


Linear Regression is a machine learning algorithm based on supervised
learning that performs a regression task. Regression models a target prediction value based on
independent variables. It is mostly used for finding out the relationship between
variables and for forecasting. Different regression models differ based on the kind of
relationship between the dependent and independent variables they are
considering and the number of independent variables being used.
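A hedged sketch with scikit-learn (the tiny data set is invented so the fitted line is easy to check):

# Hedged sketch: simple linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # single independent variable
y = np.array([3, 5, 7, 9, 11])            # target follows y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # approximately [2.] and 1.0
print(model.predict([[6]]))               # approximately [13.]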
4.17. Logistic Regression
Logistic regression is basically a supervised classification algorithm. In a
classification problem, the target variable (or output), y, can take only discrete
values for a given set of features (or inputs), X.
A change in a coefficient changes both the direction and the
steepness of the logistic function: a positive slope results in an S-shaped
curve and a negative slope results in a Z-shaped (reversed S) curve.
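A hedged sketch (the data set, scaling step and train/test split are assumptions, not part of the original material):

# Hedged sketch: logistic regression for a binary classification task.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print(clf.score(X_test, y_test))          # classification accuracy
print(clf.predict(X_test[:5]))            # discrete class labels (0 or 1)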
4.18. Decision Trees
Decision Tree : The decision tree is one of the most powerful and popular tools for
classification and prediction. A decision tree is a flowchart-like tree structure, where
each internal node denotes a test on an attribute, each branch represents an
outcome of the test, and each leaf node (terminal node) holds a class label.
Decision Tree Representation :
Decision trees classify instances by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. An instance is
classified by starting at the root node of the tree, testing the attribute specified by
this node, and then moving down the tree branch corresponding to the value of the
attribute. This process is then repeated for the
subtree rooted at the new node.
Strengths and Weakness of Decision Tree approach
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods :
• Decision trees are less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a
decision tree is computationally expensive: at each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive, since many candidate sub-trees
must be formed and compared.
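A hedged sketch with scikit-learn (the data set and max_depth are assumptions); export_text prints the attribute test applied at each internal node, mirroring the representation described above:

# Hedged sketch: training and inspecting a small decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X[:3]))      # class label of the leaf each instance reaches
print(export_text(tree))        # the attribute test at each internal node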

4.19. K-means
k-means clustering tries to group similar kinds of items in the form of clusters. It finds
the similarity between the items and groups them into clusters. The k-means
clustering algorithm works in three steps. Let's see what these three steps are.

1. Select the k values.


2. Initialize the centroids.
3. Select the group and find the average.

Let us understand the above steps with the help of a small example, following the
iterations one by one.

• First, consider data belonging to two different items, the first shown in blue and the
second shown in red, and choose the value of K as 2 (there are different methods by
which we can choose the right k value).
• Next, join the two selected centre points and draw a perpendicular bisector to that line;
each point moves to the side of its nearest centroid, so some of the red points now
belong to the group of blue items.
• The same process continues: we join the two centroids again, draw a perpendicular
line, recompute the centroids, and again some red points get converted to blue points.
• This process is repeated until we get two completely separate, stable clusters.

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is to choose the right
value of k. What should the right k-value be? How do we choose it? Let us
find the answer to these questions. If you choose the k value randomly, it
might be right or it might be wrong, and if you choose the wrong value, it will
directly affect your model performance. So there are two methods by which you
can select the right value of k.

1. Elbow Method.
2. Silhouette Method.

Now, let's understand both concepts one by one in detail.


Elbow Method
The elbow method is one of the most famous methods by which you can select the right value
of k and boost your model performance. We also perform hyperparameter
tuning to choose the best value of k. Let us see how this elbow method works.
It is an empirical method to find the best value of k: it picks a range of
candidate values and takes the best among them. For each candidate k, it calculates the
within-cluster sum of squares, i.e. the sum of squared distances of the points from their
cluster centroids.

When the value of k is 1, the within-cluster sum of squares will be high. As the
value of k increases, the within-cluster sum of squares will decrease.

Finally, we plot a graph between the k values and the within-cluster sum of
squares and examine it carefully. At some point the curve stops decreasing abruptly and
flattens out; that bend (the "elbow") is taken as the value of k.
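A hedged sketch of the elbow method with scikit-learn (the blob data and the range of k values are assumptions):

# Hedged sketch: elbow method using KMeans inertia_ (within-cluster sum of squares).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)          # within-cluster sum of squares for this k

plt.plot(list(ks), wcss, marker="o")  # look for the bend (the 'elbow') in this curve
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares")
plt.show()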
Silhouette Method

The silhouette method is somewhat different. Like the elbow method, it also picks a
range of k values and draws the silhouette graph. It calculates the silhouette
coefficient of every point, using the average distance of the point to the other points
within its own cluster, a(i), and the average distance of the point to the points of its next
closest cluster, b(i).

Note: for a well-clustered point, a(i) should be smaller than b(i).
With the values of a(i) and b(i), the silhouette coefficient of a point is calculated as
s(i) = ( b(i) − a(i) ) / max( a(i), b(i) )
Now, we can calculate the silhouette coefficient of all the points in the clusters and
plot the silhouette graph. This plot is also helpful in detecting outliers. The
silhouette values lie between -1 and 1.

Also, prefer the plot which has fewer negative (outlier-like) values, and
then choose that value of k for your model.
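A hedged companion sketch using scikit-learn's silhouette_score over the same kind of assumed blob data:

# Hedged sketch: choosing k by the average silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):                    # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # higher average is better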

Advantages of K-means

1. It is very simple to implement.
2. It is scalable to huge data sets and is also fast on large data sets.
3. It adapts to new examples very easily.
4. It generalizes to clusters of different shapes and sizes.

Disadvantages of K-means
1. It is sensitive to outliers.
2. Choosing the k value manually is a tough job.
3. As the number of dimensions increases, its scalability decreases.
