Internshala Summer Training Report
On
“Data Science”
Submitted by:
Buland
251701150, ECE-B, 7th Sem
Submitted to:
Mr. Puneet Bansal, Asst. Prof.
I hereby certify that the work presented in this report, entitled "Data Science",
in fulfilment of the requirement for completion of one-month industrial training
in the Department of Electronics and Communication Engineering of University
Institute of Engineering and Technology, Kurukshetra University, is an authentic
record of my own work carried out during the industrial training.
Buland
251701150
ECE-B 7th sem.
ACKNOWLEDGEMENT
The work in this report is the outcome of continuous effort over a period of time and
drew intellectual support from Internshala and other sources. I would like to
express my profound gratitude and indebtedness to Internshala, which helped me
in the completion of the training. I am thankful to the Internshala Training Associates
for teaching and assisting me in making the training successful.
Buland
251701150
ECE-B 7th sem.
Introduction to Organization:
Internshala is an internship and online training platform based in Gurgaon,
India. Founded in 2010 by Sarvesh Agrawal, an IIT Madras alumnus, the
website helps students find internships with organisations in India. The
platform started in 2010 as a WordPress blog that aggregated internships across
India along with articles on education, technology, and the skill gap. The website
was launched in 2013, and Internshala launched its online trainings in 2014. The
platform is used by 2.0 Mn+ students and 70,000+ companies. At the core
of the idea is the belief that internships, if managed well, can make a positive
difference to the student, to the employer, and to society at large. Hence,
the ad-hoc culture surrounding internships in India should and would
change, and Internshala aims to be the driver of this change.
About Training:
The Data Science Training by Internshala is a 6-week online training program
in which Internshala aims to provide a comprehensive introduction
to data science. In this training program, you will learn the basics of Python,
statistics, predictive modeling, and machine learning. The program
has video tutorials and is packed with assignments, assessment tests,
quizzes, and practice exercises to give you a hands-on learning experience.
At the end of this training program, you will have a solid understanding of
data science and will be able to build an end-to-end predictive model. For
doubt clearing, you can post your queries on the forum and get answers
within 24 hours.
Table of Contents
Introduction to Organization
About Training
Predictive Modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and
probability to forecast or estimate more granular, specific outcomes. For example,
predictive modeling could help identify customers who are likely to purchase our
new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) where computers learn to
act and adapt to new data without being explicitly programmed to do so. The computer
is able to act independently of human interaction.
Forecasting:
Forecasting is a process of predicting or estimating future events based on past
and present data, most commonly by analysis of trends. "Guessing" doesn't cut
it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This
logic is what differentiates it from the magic 8-ball's lucky guess. After all, even a
broken watch is right twice a day.
Applications of Data Science:
Data science and big data are making an undeniable impact on businesses,
changing day-to-day operations, financial analytics, and especially interactions
with customers. It's clear that businesses can gain enormous value from the insights
data science can provide. But sometimes it's hard to see exactly how. So let's look
at some examples. In this era of big data, almost everyone generates masses of
data every day, often without being aware of it. This digital trace reveals the
patterns of our online lives. If you have ever searched for or bought a product on
a site like Amazon, you'll notice that it starts making recommendations related to
your search. This type of system known as a recommendation engine is a common
application of data science. Companies like Amazon, Netflix, and Spotify use
algorithms to make specific recommendations derived from customer preferences
and historical behavior. Personal assistants like Siri on Apple devices use data
science to devise answers to the infinite number of questions end users may ask.
Google watches your every move in the world, your online shopping habits, and
your social media. Then it analyzes that data to create recommendations for
restaurants, bars, shops, and other attractions based on the data collected from
your device and your current location. Wearable devices like Fitbits, Apple watches,
and Android watches add information about your activity levels, sleep patterns,
and heart rate to the data you generate. Now that we know how consumers
generate data, let's take a look at how data science is impacting business. In 2011,
McKinsey & Company said that data science was going to become the key basis of
competition. Supporting new waves of productivity, growth, and innovation. In
2013, UPS announced that it was using data from customers, drivers, and vehicles,
in a new route guidance system aimed to save time, money, and fuel. Initiatives like
this support the statement that data science will fundamentally change the way
businesses compete and operate. How does a firm gain a competitive advantage?
Let's take Netflix as an example. Netflix collects and analyzes massive amounts of
data from millions of users, including which shows people are watching and at what
time of day, when people pause, rewind, and fast-forward, and which shows, directors,
and actors they search for. Netflix can be confident that a show will be a hit before
filming even begins by analyzing users' preferences for certain directors and acting
talent, and discovering which combinations people enjoy. Add this to the success
of earlier versions of a show and you have a hit. For example, Netflix knew many of
its users had streamed the work of David Fincher. They also knew that films
featuring Robin Wright had always done well, and that the British version of House
of Cards was very successful.
Netflix knew that significant numbers of people who liked Fincher also liked Wright.
All this information combined to suggest that buying the series would be a good
investment for the company.
Module-2: Python for Data Science
a. Arithmetic Operators:
Arithmetic operators are used to perform mathematical operations such as addition, subtraction, multiplication, division and modulus.
OPERATOR DESCRIPTION SYNTAX
% Modulus: remainder of the division of the left operand by the right x % y
b. Relational Operators:
Relational operators compare values and return either True or False according to the condition.
OPERATOR DESCRIPTION SYNTAX
> Greater than: True if the left operand is greater than the right x > y
< Less than: True if the left operand is less than the right x < y
== Equal to: True if both operands are equal x == y
>= Greater than or equal to: True if the left operand is greater than or equal to the right x >= y
<= Less than or equal to: True if the left operand is less than or equal to the right x <= y
c. Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
OPERATOR DESCRIPTION SYNTAX
and Logical AND: True if both the operands are true x and y
or Logical OR: True if either of the operands is true x or y
not Logical NOT: True if the operand is false not x
d. Bitwise operators:
Bitwise operators act on the operands bit by bit.
OPERATOR DESCRIPTION SYNTAX
| Bitwise OR x | y
~ Bitwise NOT ~x
e. Assignment operators:
Assignment operators are used to assign values to variables.
OPERATOR DESCRIPTION SYNTAX
= Assign the value of the right side expression to the left side operand x = y + z
+= Add AND: add the right side operand to the left side operand and assign the result to the left operand a += b (a = a + b)
-= Subtract AND: subtract the right side operand from the left side operand and assign the result to the left operand a -= b (a = a - b)
*= Multiply AND: multiply the operands and assign the result to the left operand a *= b (a = a * b)
/= Divide AND: divide the left operand by the right operand and assign the result to the left operand a /= b (a = a / b)
|= Bitwise OR AND: perform bitwise OR on the operands and assign the result to the left operand a |= b (a = a | b)
^= Bitwise XOR AND: perform bitwise XOR on the operands and assign the result to the left operand a ^= b (a = a ^ b)
>>= Perform bitwise right shift on the operands and assign the result to the left operand a >>= b (a = a >> b)
<<= Perform bitwise left shift on the operands and assign the result to the left operand a <<= b (a = a << b)
Operator Precedence (highest to lowest):
OPERATOR DESCRIPTION ASSOCIATIVITY
() Parentheses left-to-right
** Exponent right-to-left
* / % Multiplication / division / modulus left-to-right
+ - Addition / subtraction left-to-right
< <= > >= Comparisons left-to-right
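A short, minimal sketch of these operators in action (the variable names and values are illustrative):
x, y = 10, 3
print(x % y)             # 1 -> modulus (arithmetic operator)
print(x > y, x == y)     # True False -> relational operators
print(x > 5 and y < 5)   # True -> logical operator
a = 7
a += 2                   # assignment operator, same as a = a + 2
print(a)                 # 9
print(2 + 3 * 2 ** 2)    # 14 -> precedence: ** before *, * before +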
c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement.
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
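For instance, a minimal sketch of multiple assignment and the del keyword (the variable names are illustrative):
a, b, c = 1, 2.5, "hello"   # assign values to multiple variables in one statement
x = y = z = 0               # assign the same value to multiple variables
del a                       # delete the variable a; using it afterwards raises NameError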
Data Types:
A. Python Numbers:
There are four numeric Python data types. a. int int stands for integer. This Python
Data Type holds signed integers. We can use the type() function to find which class
it belongs to. b. float
This Python Data Type holds floating-point real values. An int can only store the
number 3, but float can store 3.25 if you want. c. long
This Python Data type holds a long integer of unlimited length. But this construct
does not exist in Python 3.x. d. complex
This Python Data type holds a complex number. A complex number looks like this:
a+bj Here, a and b are the real parts of the number, and j is imaginary. B. Strings:
A string is a sequence of characters. Python does not have a char data type, unlike C++ or Java. You can delimit a string using single quotes or double quotes.
a. Spanning a String Across Lines:
To span a string across multiple lines, you can use triple quotes.
b. Displaying Part of a String:
You can display a character from a string using its index in the string. Remember, indexing starts with 0.
c. String Formatters:
String formatters allow us to print characters and values at once. You can use the % operator.
d. String Concatenation:
You can concatenate (join) strings using the + operator. However, you cannot concatenate values of different types.
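A minimal sketch of these string operations (the values are illustrative):
s = """This string
spans multiple lines"""          # triple quotes span lines
name = "Data Science"
print(name[0])                   # 'D' -> indexing starts at 0
print("Score: %d%%" % 95)        # string formatter using the % operator
print("Data" + " " + "Science")  # concatenation with +
# print("Week " + 6)             # TypeError: cannot concatenate str and int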
C. Python Lists:
A list is a collection of values. Remember, it may contain values of different types. To define a list, you put values separated by commas inside square brackets. You don't need to declare a type for a list either.
a. Slicing a List
You can slice a list the way you'd slice a string, with the slicing operator. Indexing for a list begins with 0, just like for a string. Python doesn't have a built-in array type; lists are used instead.
b. Length of a List
Python supports an inbuilt function, len(), to calculate the length of a list.
c. Reassigning Elements of a List
A list is mutable. This means that you can reassign elements later on.
d. Iterating on the List
To iterate over the list we can use a for loop. By iterating, we can access each element one by one, which is very helpful when we need to perform some operation on each element of the list.
e. Multidimensional Lists
A list may have more than one dimension; that is, a list can contain other lists as its elements.
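A minimal sketch of these list operations (the values are illustrative):
nums = [10, "twenty", 30.0, 40]    # a list can mix types
print(nums[1:3])                   # slicing -> ['twenty', 30.0]
print(len(nums))                   # length -> 4
nums[0] = 15                       # lists are mutable
for item in nums:                  # iterating over the list
    print(item)
matrix = [[1, 2], [3, 4]]          # a multidimensional (nested) list
print(matrix[1][0])                # 3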
D. Python Tuples:
A tuple is like a list, but you declare it using parentheses instead.
a. Accessing and Slicing a Tuple
You access a tuple the same way as you'd access a list. The same goes for slicing it.
b. A Tuple is Immutable
A Python tuple is immutable. Once declared, you can't change its size or its elements.
E. Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated by commas. Separate keys and values with a colon (:). The type() function works with dictionaries too.
a. Accessing a Value
To access a value, you mention the key in square brackets.
b. Reassigning Elements
You can reassign a value to a key.
c. List of Keys
Use the keys() function to get a list of keys in the dictionary.
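A minimal sketch of tuples and dictionaries (the values are illustrative):
point = (3, 4, 5)                        # tuple declared with parentheses
print(point[0], point[1:])               # access and slice like a list
# point[0] = 9                           # TypeError: tuples are immutable
student = {"name": "Buland", "sem": 7}   # dictionary of key-value pairs
print(student["name"])                   # access a value via its key
student["sem"] = 8                       # reassign a value for a key
print(list(student.keys()))              # ['name', 'sem']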
F. Bool:
A Boolean value can be True or False.
G. Sets:
A set can hold a collection of values. Define it using curly braces. It keeps only one instance of any value present more than once. However, a set is unordered, so it doesn't support indexing. It is also mutable: you can change its elements or add more, using the add() and remove() methods.
H. Type Conversion:
Since Python is dynamically typed, you may want to convert a value into another type. Python supports a set of built-in functions for this:
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
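A minimal sketch of these conversion functions (the values are illustrative):
print(int("42"))          # 42   -> string to integer
print(float(3))           # 3.0  -> integer to float
print(bool(0), bool(7))   # False True
print(set([1, 2, 2, 3]))  # {1, 2, 3} -> duplicates removed
print(list("abc"))        # ['a', 'b', 'c']
print(tuple([1, 2]))      # (1, 2)
print(str(3.14))          # '3.14'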
b. If-else statements
The statement itself tells us that if a given condition is true, the statements inside the if block are executed, and if the condition is false, the else block is executed. The else block runs only when the condition becomes false; this is where you perform some action when the condition is not true. In short, an if-else statement evaluates the Boolean expression, executes the block of code inside the if block if the condition is TRUE, and executes the block of code inside the else block if the condition is FALSE.
Syntax:
if(Boolean expression):
Block of code #Set of statements to execute if condition is true
else:
Block of code #Set of statements to execute if condition is false
c. elif statements
In Python we have one more conditional statement, called the elif statement. The elif statement is used to check multiple conditions, and it is evaluated only if the preceding if condition is false. It is similar to an if-else statement; the only difference is that in else we do not check a condition, whereas in elif we do. Elif statements are therefore like if-else statements that can evaluate multiple conditions.
Syntax:
if (condition):
    #Set of statements to execute if condition is true
elif (condition):
    #Set of statements to be executed when the if condition is false and the elif condition is true
else:
    #Set of statements to be executed when both the if and elif conditions are false
d. Nested if-else statements
Syntax:
if (condition):
    #Statements to execute if condition is true
    if (condition):
        #Statements to execute if condition is true
    else:
        #Statements to execute if condition is false
else:
    #Statements to execute if condition is false
e. elif Ladder
We have seen elif statements, but what is an elif ladder? As the name itself suggests, it is a program that contains a ladder of elif statements, i.e. elif statements structured in the form of a ladder. This statement is used to test multiple expressions.
Syntax:
if (condition):
    #Set of statements to execute if condition is true
elif (condition):
    #Set of statements to be executed when the if condition is false and the first elif condition is true
elif (condition):
    #Set of statements to be executed when the if and first elif conditions are false and the second elif condition is true
elif (condition):
    #Set of statements to be executed when the if, first elif and second elif conditions are false and the third elif condition is true
else:
    #Set of statements to be executed when all the if and elif conditions are false
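As an illustration, a minimal sketch of an elif ladder (the marks variable and thresholds are illustrative):
marks = 72
if marks >= 90:
    print("Grade A")
elif marks >= 75:
    print("Grade B")
elif marks >= 60:
    print("Grade C")      # this branch runs for marks = 72
else:
    print("Fail")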
a. while loop:
Syntax:
while expression:
    statement(s)
b. for loop:
Executes a sequence of statements multiple times and abbreviates the code that manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)
c. nested loops:
You can use one or more loops inside any other while or for loop.
Syntax of nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)
Syntax of nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)
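A minimal sketch of these loop constructs (the values are illustrative):
count = 0
while count < 3:            # while loop
    print("count =", count)
    count += 1
for letter in "ECE":        # for loop over a sequence
    print(letter)
for i in range(2):          # nested for loops
    for j in range(2):
        print(i, j)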
B. User-Defined Functions:
These are functions defined by the user, for simplicity and to avoid repetition of code. They are created using the def keyword.
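For example, a minimal user-defined function (the name and logic are illustrative):
def square(n):
    """Return the square of n."""
    return n * n

print(square(4))   # 16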
Using negative indexes, we can return the nth element from the end of the list
easily. If we wanted to return the first element from the end, or the last index, the
associated index is -1. Similarly, the index for the second last element will be -2,
and so on. Remember, the 0th index will still refer to the very first element in the
list.
Appending values in Lists
We can add new elements to an existing list using the append() or insert() methods:
append() – Adds an element to the end of the list.
insert() – Adds an element at a specific position in the list, which needs to be specified along with the value.
Removing elements from Lists
Removing elements from a list is as easy as adding them and can be done using the remove() or pop() methods:
remove() – Removes the first occurrence from the list that matches the given value.
pop() – Removes an element at a specified index from the list. If we don't provide an index value, the last element will be removed from the list.
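A minimal sketch of negative indexing and these methods (the values are illustrative):
langs = ["python", "sql", "r"]
print(langs[-1])          # 'r' -> negative index: last element
langs.append("scala")     # add to the end
langs.insert(1, "java")   # add at index 1
langs.remove("sql")       # remove the first matching value
last = langs.pop()        # remove and return the last element
print(langs, last)        # ['python', 'java', 'r'] 'scala'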
Sorting Lists
We can sort a list using the sort() method (or the sorted() function). When comparing two strings, Python compares the integer values of their characters from the beginning; if it encounters the same characters in both strings, it compares the next characters until it finds two that differ.
Concatenating Lists
We can even concatenate two or more lists by simply using the + symbol. This will
return a new list containing elements from both the lists:
List comprehensions
A very interesting application of Lists is List comprehension which provides a neat
way of creating new lists. These new lists are created by applying an operation on
each element of an existing list. It will be easy to see their impact if we first check
out how it can be done using the good old for-loops.
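A minimal sketch of sorting, concatenation, and a list comprehension compared with the equivalent for-loop (the values are illustrative):
words = ["banana", "apple", "cherry"]
words.sort()                           # in-place sort
print(words)                           # ['apple', 'banana', 'cherry']
combined = [1, 2] + [3, 4]             # concatenating lists with +
print(combined)                        # [1, 2, 3, 4]
squares = []
for n in combined:                     # the "good old" for-loop way
    squares.append(n ** 2)
squares = [n ** 2 for n in combined]   # the same result as a list comprehension
print(squares)                         # [1, 4, 9, 16]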
Stacks & Queues using Lists
A list is an in-built data structure in Python. But we can use it to create user-defined
data structures. Two very popular user-defined data structures built using lists are
Stacks and Queues.
Stacks are a list of elements in which the addition or deletion of elements is done
from the end of the list. Think of it as a stack of books. Whenever you need to add
or remove a book from the stack, you do it from the top. It uses the simple concept
of Last-In-First-Out.
Queues, on the other hand, are a list of elements in which the addition of elements
takes place at the end of the list, but the deletion of elements takes place from the
front of the list. You can think of it as a queue in the real world: the queue becomes
shorter when people at the front exit it, and longer when someone new joins at the
end. It uses the concept of First-In-First-Out.
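A minimal sketch of a stack and a queue built this way (collections.deque is used for the queue for efficiency; the values are illustrative):
from collections import deque

stack = []                # stack: Last-In-First-Out
stack.append("book1")     # push onto the top
stack.append("book2")
print(stack.pop())        # 'book2' -> removed from the top

queue = deque()           # queue: First-In-First-Out
queue.append("person1")   # join at the end
queue.append("person2")
print(queue.popleft())    # 'person1' -> leaves from the front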
2.9. Dictionaries
A dictionary is another Python data structure that stores heterogeneous objects as key-value pairs. The keys must be immutable (hashable), and the collection is unordered.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly } brackets, with each key separated from its value by a colon, and each key-value pair separated by a comma:
Using the key of the item, we can easily extract the associated value of the item:
Dictionaries are very useful to access items quickly because, unlike lists and tuples,
a dictionary does not have to iterate over all the items finding a value. Dictionary
uses the item key to quickly find the item value. This concept is called hashing.
We can even access these values simultaneously using the items() method which
returns the respective key and value pair for each element of the dictionary.
import csv
import pandas
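Before working with the DataFrame methods listed below, the data is typically loaded into a DataFrame. A minimal sketch, assuming a file named data.csv exists in the working directory (the file name is illustrative):
import pandas as pd

df = pd.read_csv("data.csv")   # load a CSV file into a DataFrame
print(df.head())               # first five rows
print(df.shape)                # (number of rows, number of columns)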
DataFrame Methods:
FUNCTION DESCRIPTION
value_counts() Method counts the number of times each unique value occurs within
the Series
columns() Method returns the column labels of the DataFrame
isnull() Method creates a Boolean Series for extracting rows with null values
notnull() Method creates a Boolean Series for extracting rows with non-null
values
isin() Method extracts rows from a DataFrame where a column value exists
in a predefined collection
dtypes() Method returns a Series with the data type of each column. The
result’s index is the original DataFrame’s columns
sort_values() Method sorts a DataFrame in ascending or descending order of the
passed column
sort_index() Method sorts the values in a DataFrame based on their index positions
or labels instead of their values; this is useful when a DataFrame is
built from two or more DataFrames and the index needs to be rebuilt
ix[] Method retrieves DataFrame rows based on either index label or index
position. This method combines the best features of the .loc[] and
.iloc[] methods
rename() Method is called on a DataFrame to change the names of the index labels
or column names
columns Attribute used as an alternative way to access or change the column names of the
DataFrame
nsmallest() Method pulls out the rows with the smallest values in a column
nlargest() Method pulls out the rows with the largest values in a column
ndim Attribute that returns the number of axes / array dimensions: 1 for a Series and 2 for a
DataFrame
dropna() Method allows the user to analyze and drop Rows/Columns with Null
values in different ways
fillna() Method manages and lets the user replace NaN values with a value
of their own
duplicated() Method creates a Boolean Series and uses it to extract rows that have
duplicate values
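A minimal sketch of a few of these methods on a toy DataFrame (the column names and values are illustrative):
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi"],
                   "sales": [250, None, 300]})
print(df["city"].value_counts())      # counts of each unique value
print(df.isnull())                    # Boolean mask of missing values
df["sales"] = df["sales"].fillna(0)   # replace NaN with a chosen value
print(df.sort_values("sales"))        # sort rows by the 'sales' column
print(df.nlargest(2, "sales"))        # two rows with the largest sales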
(ii) Median:
It is a measure of the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken.
For example, the median of the ordered set 1, 3, 4, 6, 9 is 4.
(iii) Mode:
It is the value that occurs most frequently in the sample set; the value repeated the most times in the set is the mode.
For example, the mode of 1, 2, 2, 3, 2, 5 is 2.
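A minimal sketch using Python's built-in statistics module to compute these measures (the data values are illustrative):
import statistics

data = [1, 2, 2, 3, 2, 5, 9]
print(statistics.mean(data))     # arithmetic mean
print(statistics.median(data))   # middle value of the ordered data
print(statistics.mode(data))     # most frequent value -> 2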
Boxplot: It is based on the percentiles of the data, as shown in the figure below. The top and bottom of the box are the 75th and 25th percentiles of the data, and the extended lines, known as whiskers, indicate the range of the rest of the data.
# BoxPlot: Population in Millions (assumes a pandas Series population_in_millions)
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots()
fig.set_size_inches(9, 15)
ax1.boxplot(population_in_millions)
plt.show()
Frequency Table: It is a tool to distribute the data into equally spaced ranges (segments) and tells us how many values fall in each segment.
Histogram: It is a way of visualizing the data distribution through a frequency table, with bins on the x-axis and the data count on the y-axis.
Code – Histogram
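A minimal sketch of the histogram code referred to above, assuming normally distributed synthetic data (the sample size and bin count are illustrative):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.normal(loc=0, scale=1, size=1000)  # synthetic data centred at 0
plt.hist(data, bins=30)        # bins on the x-axis, counts on the y-axis
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()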
It can be observed from the above graph that the distribution is symmetric
about its center, which is also the mean (0 in this case). This makes the
probability of events at equal deviations from the mean, equally probable. The
density is highly centered around the mean, which translates to lower
probabilities for values away from the mean.
Probability Density Function –
The probability density function of the general normal distribution is given as:
f(x) = (1 / (σ √(2π))) · exp( -(x - μ)² / (2σ²) )
In the above formula, σ is the standard deviation and μ is the mean. It is easy to get
overwhelmed by the formula while trying to understand everything in one glance, but
we can try to break it down into smaller pieces so as to get an intuition of what is
going on.
The z-score, z = (x - μ) / σ, is a measure of how many standard deviations away a
data point is from the mean. The exponent in the above formula is -z²/2, i.e. the
square of the z-score times -1/2. This is in accordance with the observations made
above: values away from the mean have a higher z-score and consequently a lower
probability, since the exponent is negative, while the opposite is true for values
closer to the mean. This gives way to the 68-95-99.7 rule, which states that the
percentages of values that lie within bands around the mean with widths of two,
four, and six standard deviations comprise 68%, 95%, and 99.7% of all the values,
respectively. The figure given below shows this rule.
The effects of μ and σ on the distribution are shown below: μ is used to reposition
the centre of the distribution and consequently move the graph left or right, and σ
is used to flatten or inflate the curve.
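A minimal sketch verifying the 68-95-99.7 rule numerically with SciPy (purely illustrative):
from scipy.stats import norm

for k in (1, 2, 3):
    # probability mass within k standard deviations of the mean
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {p:.3%}")   # ~68.3%, ~95.4%, ~99.7%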
3.8 Introduction to Inferential Statistics
Inferential statistics makes inferences and predictions about a population based on a
sample of data taken from that population. It generalizes from a large dataset and applies
probabilities to draw a conclusion. It is used for explaining the meaning of descriptive
statistics, and to analyze and interpret results and draw conclusions.
Inferential statistics is mainly related to and associated with hypothesis testing,
whose main purpose is to test whether the null hypothesis can be rejected.
Hypothesis testing is a type of inferential procedure that uses sample data
to evaluate and assess the credibility of a hypothesis about a population. Inferential
statistics are generally used to determine how strong a relationship is within the sample,
because it is usually very difficult to obtain a full population list and draw a truly random sample.
Types of inferential statistics –
Various types of inferential statistics are used widely nowadays and are very easy
to interpret. These are given below:
• One sample test of difference/One sample hypothesis test
• Confidence Interval
• Contingency Tables and Chi-Square Statistic
• T-test or Anova
3.9 Understanding the Confidence Interval and margin of error
In simple terms, Confidence Interval is a range where we are certain that true
value exists. The selection of a confidence level for an interval determines the
probability that the confidence interval will contain the true parameter value. This
range of values is generally used to deal with population-based data, extracting
specific, valuable information with a certain amount of confidence, hence the
term ‘Confidence Interval’.
Figure: how a confidence interval generally looks.
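A minimal sketch of computing a 95% confidence interval for a sample mean with SciPy (the data values are illustrative):
import numpy as np
from scipy import stats

sample = np.array([12, 15, 14, 10, 13, 17, 15, 14, 12, 16])
mean = sample.mean()
sem = stats.sem(sample)                      # standard error of the mean
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(mean, ci)   # sample mean and the 95% confidence interval around it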
3.11 T tests
A t-test is a type of inferential statistic used to determine if there is a
significant difference between the means of two groups, which may be related
in certain features.
Here,
x̄ and ȳ = means of the two given sample sets
n = number of samples in each set
xi and yi = individual samples of each set
A standard form of the two-sample t statistic (for equal sample sizes n in each group) is:
t = (x̄ - ȳ) / √( ( Σ(xi - x̄)² + Σ(yi - ȳ)² ) / ( n(n - 1) ) )
Example –
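As an example, a minimal sketch of a two-sample t-test with SciPy (the two groups of values are illustrative):
from scipy import stats

group_x = [23, 21, 19, 24, 25, 22, 20, 23]
group_y = [27, 29, 26, 30, 28, 27, 31, 29]
t_stat, p_value = stats.ttest_ind(group_x, group_y)  # independent two-sample t-test
print(t_stat, p_value)   # a p-value below 0.05 suggests a significant difference in means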
Module-4: Predictive Modeling and Basics of Machine
Learning
Access modes govern the type of operations possible in the opened file. It refers
to how the file will be used once it’s opened. These modes also define the location
of the File Handle in the file. File handle is like a cursor, which defines from where
the data has to be read or written in the file. Different access modes for reading a
file are –
1. Read Only (‘r’): Open a text file for reading. The handle is positioned at the
beginning of the file. If the file does not exist, an I/O error is raised. This is also the
default mode in which a file is opened.
2. Read and Write (‘r+’): Open the file for reading and writing. The handle is
positioned at the beginning of the file. Raises an I/O error if the file does not exist.
3. Append and Read (‘a+’): Open the file for reading and writing. The file is created
if it does not exist. The handle is positioned at the end of the file, so the data being
written will be inserted at the end, after the existing data.
A short usage sketch of these modes is given below.
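A minimal sketch (the file name is illustrative):
# 'a+' creates the file if it does not exist and appends at the end
with open("notes.txt", "a+") as f:
    f.write("first line\n")

# 'r' is the default read-only mode; it raises an error if the file is missing
with open("notes.txt", "r") as f:
    print(f.read())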
Example:- Suppose, we want to predict, whether the students will play cricket or
not (refer below data set). Here you need to identify predictor variables, target
variable, data type of variables and category of variables.
Note: Univariate analysis is also used to highlight missing and outlier values. In the
upcoming sections, we will look at methods to handle missing and outlier values.
A scatter plot shows the relationship between two variables but does not indicate
the strength of the relationship between them. To find the strength of the relationship,
we use correlation. Correlation varies between -1 and +1:
• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: No correlation
Probability less than 0.05: It indicates that the relationship between the variables is
significant at 95% confidence. The chi-square test statistic for a test of
independence of two categorical variables is found by:
χ² = Σ (O - E)² / E
where O represents the observed frequency and E is the expected frequency under
the null hypothesis, computed as:
E = (row total × column total) / sample size
From the previous two-way table, the expected count for product category 1 to be of
small size is 0.22. It is derived by taking the row total for Size (9) times the column
total for Product category (2), then dividing by the sample size (81). This procedure
is conducted for each cell. Statistical measures are then used to analyze the strength
of the relationship.
Different data science languages and tools have specific methods to perform the
chi-square test. In SAS, we can use Chisq as an option with Proc freq to perform
this test.
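In Python, a minimal sketch of the same test can be written with SciPy (the 2×2 observed counts are illustrative):
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 15],    # a two-way table of observed frequencies
                     [30, 16]])
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)        # p < 0.05 suggests the variables are related
print(expected)       # expected counts: row total * column total / sample size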
• Z-Test/T-Test: Either test assesses whether the means of two groups are statistically different from each other.
Notice the missing values in the image shown above. In the left scenario, we have
not treated the missing values; the inference from this data set is that the chances of
playing cricket are higher for males than for females. On the other hand, if you look at
the second table, which shows the data after treatment of missing values (based on
gender), we can see that females have a higher chance of playing cricket compared
to males.
1. Data Extraction: It is possible that there are problems with the extraction process. In
such cases, we should double-check for correct data with the data guardians. Some
hashing procedures can also be used to make sure the data extraction is correct. Errors
at the data extraction stage are typically easy to find and can be corrected easily as
well.
2. Data collection: These errors occur at the time of data collection and are harder to
correct. They can be categorized into four types:
o Missing completely at random: This is the case when the probability of a missing
value is the same for all observations. For example: respondents of a data collection
process decide that they will declare their earnings after tossing a fair coin. If a
head occurs, the respondent declares his/her earnings and vice versa. Here each
observation has an equal chance of a missing value.
o Missing at random: This is the case when a variable is missing at random and the
missing ratio varies for different values/levels of other input variables. For example:
we are collecting data for age, and females have a higher missing-value rate compared to males.
o Missing that depends on unobserved predictors: This is the case when the missing
values are not random and are related to an unobserved input variable. For
example: in a medical study, if a particular diagnostic test causes discomfort, then there
is a higher chance of dropping out of the study. This missingness is not at random
unless we have included "discomfort" as an input variable for all patients.
o Missing that depends on the missing value itself: This is the case when the
probability of a missing value is directly correlated with the missing value itself. For
example: people with higher or lower incomes are likely to give a non-response about
their earnings.
Which are the methods to treat missing values?
1. Deletion: It is of two types: List Wise Deletion and Pair Wise Deletion.
o In list wise deletion, we delete observations where any of the variables is missing.
Simplicity is one of the major advantages of this method, but it reduces the power of
the model because it reduces the sample size.
o In pair wise deletion, we perform the analysis with all cases in which the variables of
interest are present. The advantage of this method is that it keeps as many cases as
possible available for analysis; one of its disadvantages is that it uses different sample
sizes for different variables.
o Deletion methods are used when the nature of the missing data is "missing
completely at random"; otherwise, non-random missing values can bias the model
output.
2. Mean/Mode/Median Imputation: Imputation is a method to fill in the missing
values with estimated ones. The objective is to employ known relationships that
can be identified in the valid values of the data set to assist in estimating the
missing values. Mean/mode/median imputation is one of the most frequently
used methods. It consists of replacing the missing data for a given attribute by the
mean or median (quantitative attribute) or mode (qualitative attribute) of all known
values of that variable. It can be of two types (a short pandas sketch of both follows):
o Generalized Imputation: In this case, we calculate the mean or median of all non-missing
values of the variable and then replace the missing values with it. As in the table above,
the variable "Manpower" is missing, so we take the average of all non-missing values of
"Manpower" (28.33) and then replace the missing value with it.
o Similar case Imputation: In this case, we calculate the average of the non-missing values
for gender "Male" (29.75) and "Female" (25) individually, and then replace the missing
values based on gender. For "Male", we replace missing values of Manpower with 29.75
and for "Female" with 25.
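A minimal sketch of both kinds of imputation with pandas (the column names and values are illustrative):
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"],
                   "Manpower": [30, 25, None, None]})

# Generalized imputation: fill with the overall mean of the column
df["Manpower_gen"] = df["Manpower"].fillna(df["Manpower"].mean())

# Similar case imputation: fill with the mean of the same gender group
df["Manpower_sim"] = df["Manpower"].fillna(
    df.groupby("Gender")["Manpower"].transform("mean"))
print(df)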
3. Prediction Model: The prediction model is one of the more sophisticated methods for
handling missing data. Here, we create a predictive model to estimate values that
will substitute for the missing data. In this case, we divide our data set into two sets:
one set with no missing values for the variable and another with missing values.
The first data set becomes the training data set of the model, while the second data set
with missing values is the test data set, and the variable with missing values is treated as
the target variable. Next, we create a model to predict the target variable based on the other
attributes of the training data set and populate the missing values of the test data set. We
can use regression, ANOVA, logistic regression, and various other modeling techniques to
perform this. There are two drawbacks to this approach:
o The model estimated values are usually more well-behaved than the true
values
o If there are no relationships between the attributes in the data set and the attribute
with missing values, then the model will not be precise in estimating the
missing values.
4. KNN Imputation: In this method of imputation, the missing values of an attribute
are imputed using the given number of neighbours (observations) that are most similar
to the observation whose values are missing. The similarity of two observations is
determined using a distance function (a short sketch with scikit-learn follows this list).
The method has certain advantages and disadvantages.
o Advantages:
▪ k-nearest neighbours can predict both qualitative and quantitative attributes
▪ Creation of a predictive model for each attribute with missing data is not required
▪ Attributes with multiple missing values can be easily treated
▪ The correlation structure of the data is taken into consideration
o Disadvantages:
▪ The KNN algorithm is very time-consuming when analyzing a large database; it searches
through the whole dataset looking for the most similar instances.
▪ The choice of the k-value is very critical. A higher value of k would include neighbours that
are significantly different from what we need, whereas a lower value of k implies missing
out on significant neighbours.
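A minimal sketch of KNN imputation with scikit-learn's KNNImputer (the toy matrix and k value are illustrative):
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)   # impute using the 2 most similar rows
print(imputer.fit_transform(X))       # the NaN is replaced by a neighbour-based estimate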
• Any value that lies outside the range of Q1 - 1.5 × IQR to Q3 + 1.5 × IQR (where IQR is
the interquartile range) can be considered an outlier (a short code sketch of this rule
follows this list).
• Use capping methods: any value outside the range of the 5th and 95th percentiles can
be considered an outlier.
• Data points three or more standard deviations away from the mean are considered
outliers.
• Outlier detection is merely a special case of the examination of data for influential
data points and it also depends on the business understanding
• Bivariate and multivariate outliers are typically measured using either an index of
influence or leverage, or distance. Popular indices such as Mahalanobis’ distance
and Cook’s D are frequently used to detect outliers.
• In SAS, we can use PROC UNIVARIATE and PROC SGPLOT. To identify outliers and
influential observations, we also look at statistical measures like STUDENT, COOKD,
RSTUDENT and others.
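A minimal sketch of the IQR rule in pandas (the values are illustrative):
import pandas as pd

s = pd.Series([12, 14, 15, 13, 16, 14, 90])      # 90 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                                  # -> the value 90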
Most of the ways to deal with outliers are similar to the methods of missing values
like deleting observations, transforming them, binning them, treat them as a
separate group, imputing values and other statistical methods. Here, we will discuss
the common techniques used to deal with outliers:
Deleting observations: We delete outlier values if they are due to a data entry error or a
data processing error, or if the outlier observations are very small in number. We can also
use trimming at both ends to remove outliers.
Imputing: Like the imputation of missing values, we can also impute outliers, using
mean, median, or mode imputation methods. Before imputing values, we should
analyse whether an outlier is natural or artificial. If it is artificial, we can go ahead with
imputing values. We can also use a statistical model to predict the values of outlier
observations and then impute them with the predicted values.
Treat separately: If there is a significant number of outliers, we should treat them
separately in the statistical model. One approach is to treat the two groups as
different groups, build an individual model for each, and then combine the output.
• A symmetric distribution is preferred over a skewed distribution as it is easier to interpret
and generate inferences from. Some modeling techniques require a normal distribution of
variables. So, whenever we have a skewed distribution, we can use transformations that
reduce the skewness. For a right-skewed distribution, we take the square/cube root or the
logarithm of the variable, and for a left-skewed distribution, we take the square/cube or
the exponential of the variable (a short code sketch of such transformations follows this list).
• Variable transformation is also done from an implementation point of view (human
involvement). Let's understand it more clearly. In one of my projects on employee
performance, I found that age has a direct correlation with the performance of the
employee, i.e. the higher the age, the better the performance. From an implementation
standpoint, launching an age-based programme might present an implementation challenge.
However, categorizing the sales agents into three age-group buckets of <30 years, 30-45
years, and >45 years, and then formulating three different strategies for each group, is a
more judicious approach. This categorization technique is known as Binning of Variables.
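A minimal sketch of a skewness-reducing transformation and of binning with NumPy and pandas (the data and bucket edges are illustrative):
import numpy as np
import pandas as pd

income = pd.Series([20, 22, 25, 30, 35, 200])        # right-skewed variable
income_log = np.log(income)                          # log transform reduces right skew

age = pd.Series([24, 31, 46, 29, 52, 38])
age_bucket = pd.cut(age, bins=[0, 30, 45, 120],
                    labels=["<30", "30-45", ">45"])  # binning of variables
print(age_bucket)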
4.19. K-means
k-means clustering tries to group similar kinds of items in the form of clusters. It finds
the similarity between the items and groups them into clusters. The K-means
clustering algorithm works in three steps. Let's see what these three steps are.
Let us understand the steps with the help of the figures, because a good
picture is better than a thousand words. We will understand each figure one by one.
• Figure 1 shows the representation of data for two different items; the first item is
shown in blue and the second item in red. Here I am choosing the value of K as 2.
There are different methods by which we can choose the right k value.
• In figure 2, we join the two selected points and, to find the centroid, draw a
perpendicular line to that line. The points then move to their centroid. If you look
closely, you will see that some of the red points have now moved to the blue points
and belong to the group of blue items.
• The same process continues in figure 3: we join the two points, draw a perpendicular
line, and find the centroid. The two points move to their centroid, and again some of
the red points get converted to blue points.
• The same process happens in figure 4. This process continues until we get two
completely separated clusters of these groups.
One of the most challenging tasks in this clustering algorithm is to choose the right
value of k. What should the right k-value be, and how do we choose it? Let us find
the answers to these questions. If you choose the k value randomly, it might be
correct or it might be wrong, and choosing a wrong value will directly affect your
model performance. So there are two methods by which you can select the right
value of k:
1. Elbow Method.
2. Silhouette Method.
Elbow Method
When the value of k is 1, the within-cluster sum of squares will be high. As the
value of k increases, the within-cluster sum of squares will decrease.
Finally, we plot a graph between the k-values and the within-cluster sum of squares
to get the right k value. We examine the graph carefully: at some point, the graph
will drop abruptly, and that point is taken as the value of k.
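A minimal sketch of the elbow method with scikit-learn, where inertia_ is the within-cluster sum of squares (the synthetic data is illustrative):
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

X = np.random.rand(200, 2)                      # illustrative 2-D data
wcss = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                    # within-cluster sum of squares
plt.plot(range(1, 8), wcss, marker="o")         # look for the "elbow" in this curve
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares")
plt.show()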
Silhouette Method
The silhouette method is somewhat different. Like the elbow method, it also picks a
range of k values and draws the silhouette graph, calculating the silhouette coefficient
of every point. For each point it calculates a(i), the average distance of the point to the
other points within its own cluster, and b(i), the average distance of the point to the
points of its next closest cluster.
Note: for a well-clustered point, the a(i) value should be less than the b(i) value, that is a(i) << b(i).
Now that we have the values of a(i) and b(i), we calculate the silhouette coefficient using the formula:
s(i) = (b(i) - a(i)) / max(a(i), b(i))
Now we can calculate the silhouette coefficient of all the points in the clusters and plot
the silhouette graph. This plot is also helpful in detecting outliers. The silhouette plot
ranges between -1 and 1. Also, check for the plot which has fewer outliers, i.e. fewer
negative values. Then choose that value of k for your model to tune.
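A minimal sketch of the silhouette method with scikit-learn (the synthetic data is illustrative):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                      # illustrative 2-D data
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))       # higher average coefficient -> better k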
Advantages of K-means
Disadvantages of K-means
1. It is sensitive to the outliers.
2. Choosing the k values manually is a tough job.
3. As the number of dimensions increases its scalability decreases.