0% found this document useful (0 votes)

168 views167 pages

DSBDA Lab Manual

Uploaded by

B34-Samruddhi Latore

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

168 views167 pages

DSBDA Lab Manual

Uploaded by

B34-Samruddhi Latore

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 167

Savitribai Phule Pune University

Third Year of Computer Engineering (2019 Course) Home

310256:Data Science and Big Data Analytics Laboratory

Teaching Scheme Credit:02 Examination Scheme and
Practical: 04 Marks Term work: 50 Marks
Hours/Week Practical: 25 Marks
Companion Course: Data Science and Big Data Analytics (310251)
Course Objectives:
● To understand principles of Data Science for the analysis of real time problems
● To develop in depth understanding and implementation of the key technologies in Data
Scienceand Big Data Analytics
● To analyze and demonstrate knowledge of statistical data analysis techniques for
decision-making
● To gain practical, hands-on experience with statistics programming languages and Big
Data tools
Course Outcomes:
On completion of the course, learners will be able to
CO1: Apply principles of Data Science for the analysis of real time problems
CO2: Implement data representation using statistical
methods
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
Guidelines for Instructor's Manual
The instructor‘s manual is to be developed as a reference and hands-on resource. It should
includeprologue (about University/program/ institute/ department/foreword/ preface), curriculum of
the course, conduction and Assessment guidelines, topics under consideration, concept, objectives,
outcomes, set oftypical applications/assignments/ guidelines, and references.
Guidelines for Student's Laboratory Journal
The laboratory assignments are to be submitted by student in the form of journal. Journal consists of
Certificate, table of contents, and handwritten write-up of each assignment (Title, Date of
Completion, Objectives, Problem Statement, Software and Hardware requirements, Assessment
grade/marks and assessor's sign, Theory- Concept in brief, algorithm, flowchart, test cases, Test Data
Set(if applicable), mathematical model (if applicable), conclusion/analysis. Program codes with
sample output of all performed assignments are to be submitted as softcopy. As a conscious effort and
little contribution towards Green IT and environment awareness, attaching printed papers as part of
write-ups and programlisting to journal must be avoided. Use of DVD containing students programs
maintained by Laboratory In-charge is highly encouraged. For reference one or two journals may be
maintained with program prints in the Laboratory.
Guidelines for Laboratory /Term Work Assessment
Continuous assessment of laboratory work should be based on overall performance of Laboratory
assignments by a student. Each Laboratory assignment assessment will assign grade/marks based on
parameters, such as timely completion, performance, innovation, efficient codes, punctuality and
Guidelines for Practical Examination
Problem statements must be decided jointly by the internal examiner and external examiner. During
practical assessment, maximum weightage should be given to satisfactory implementation of the
problem statement. Relevant questions may be asked at the time of evaluation to test the student‘s
understanding of the fundamentals, effective and efficient implementation. This will encourage,
transparent evaluation and fair approach, and hence will not create any uncertainty or doubt in the
minds of the students. So, adhering to these principles will consummate our team efforts to the
promising start of student's academics.
Guidelines for Laboratory Conduction
The instructor is expected to frame the assignments by understanding the prerequisites, technological
aspects, utility and recent trends related to the topic. The assignment framing policy need to address
the average students and inclusive of an element to attract and promote the intelligent students. Use
of open source software is encouraged. Based on the concepts learned. Instructor may also set one
assignment ormini-project that is suitable to respective branch beyond the scope of syllabus.
Set of suggested assignment list is provided in groups- A and B. Each student must perform
13assignments (10 from group A, 3 from group B), 2 mini project from Group C
Operating System recommended :- 64-bit Open source Linux or its
derivativeProgramming tools recommended: - JAVA/Python/R/Scala
Virtual Laboratory:
● "Welcome to Virtual Labs - A MHRD Govt of india Initiative"
● http://cse20-iiith.vlabs.ac.in/List%20of%20Experiments.html?domain=Computer%20Science

Suggested List of Laboratory

Experiments/Assignments
Assignments from all Groups (A,B,C) are compulsory.
Sr.
Group A : Data Science
No.
1. Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (e.g. https://www.kaggle.com). Provide a
cleardescription of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas insult(),
describe() function to get some initial statistics. Provide variable descriptions. Types
of variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data type, apply proper
type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in the above steps
and explain everything that you do to import/read/scrape the data set.
2. Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations
using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.
3. Descriptive Statistics - Measures of Central Tendency and variability
Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard
deviation) for a dataset (age, income etc.) with numeric variables grouped by one of
the qualitative (categorical) variable. For example, if your categorical variable is age
groups and quantitative variable is income, then provide summary statistics of
income grouped bythe age groups. Create a list that contains a numeric value for
each response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile,
mean, standard deviation etc. of the species of ‘Iris-setosa’, ‘Iris-versicolor’ and
‘Iris- versicolor’ of iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
4. Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There
are 506 samplesand 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
5. Data Analytics II
1. Implement logistic regression using Python/R to perform classification
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision,Recall on the given dataset.
6. Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision,Recall on the given dataset.
7. Text Analytics
1. Extract Sample document and apply following document preprocessing
methods:Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse
Document Frequency.
8. Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains
information about the passengers who boarded the unfortunate Titanic ship. Use the
Seaborn libraryto see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.
9. Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about
whether they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
10. Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame.
(e.g., https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
Group B- Big Data Analytics – JAVA/SCALA (Any three)
1. Write a code in JAVA for a simple Word Count application that counts the number of
occurrences of each word in a given input set using the Hadoop Map-Reduce framework
on local-standalone set-up.
2. Design a distributed application using Map-Reduce which processes a log file of a system.
3. Locate dataset (e.g., sample_weather.txt) for working on weather data which reads the text
input files and finds average for temperature, dew point and wind speed.
4. Write a simple program in SCALA using Apache Spark framework
Group C- Mini Projects/ Case Study – PYTHON/R (Any TWO Mini Project)
1. Write a case study on Global Innovation Network and Analysis (GINA). Components of
analytic plan are 1. Discovery business problem framed, 2. Data, 3. Model planning analytic
technique and 4. Results and Key findings.
2. Use the following dataset and classify tweets into positive and negative
tweets. https://www.kaggle.com/ruchi798/data-science-tweets
3. Develop a movie recommendation model using the scikit-learn library in python.
Refer dataset
https://github.com/rashida048/Some-NLP-Projects/blob/master/movie_dataset.csv
4. Use the following covid_vaccine_statewise.csv dataset and perform following analytics on the
given dataset
https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons state wise vaccinated for first dose in India
c. Number of persons state wise vaccinated for second dose in India
d. Number of Males vaccinated
d. Number of females vaccinated
5. Write a case study to process data driven for Digital Marketing OR Health care systems with
Hadoop Ecosystem components as shown. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLLib: (Provides analytical tools) Machine Learning algorithm
libraries
● Solar, Lucene: Searching and Indexing
Learning Resources
Reference Books :
1. Chirag Shah, “A Hands-On Introduction To Data Science”, Cambridge
University Press,(2020),ISBN : ISBN 978-1-108-47244-9.
2. Wes McKinney, “Python for Data Analysis”, O' Reilly media, ISBN : 978-1-449-31979-3.
3. “Scikit-learn Cookbook”, Trent hauk,Packt Publishing, ISBN: 9781787286382
4. R Kent Dybvig, “The Scheme Programming Language”, MIT Press, ISBN 978-0-262-51298-5.
5. Jenny Kim, Benjamin Bengfort, “Data Analytics with Hadoop”, OReilly Media, Inc.
6. Jake VanderPlas,“Python Data Science Handbook”
https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.
pd f
7. Gareth James, “An Introduction to Statistical Learning”
https://www.ime.unicamp.br/~dias/Intoduction%20to%20Statistical%20Learning.p
df
8. Cay S Horstmann, “Scala for the Impatient”, Pearson, ISBN: 978-81-317-9605-4,
9. Alvin Alexander, “Scala Cookbook”, O‘Reilly, SPD,ISBN: 978-93-5110-263-2
Web Links:
● https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article
● https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduc
e- client- core/MapReduceTutorial.html
● https://www.edureka.co/blog/hadoop-ecosystem
● https://www.edureka.co/blog/mapreduce-tutorial/#mapreduce_word_count_example
● https://github.com/vasanth-mahendran/weather-data-hadoop
● https://spark.apache.org/docs/latest/quick-start.html#more-on-dataset-operations
● https://www.scala-lang.org/
MOOCs Courses link:
● https://nptel.ac.in/courses/106/106/106106212/
● https://onlinecourses.nptel.ac.in/noc21_cs33/preview
● https://nptel.ac.in/courses/106/104/106104189/
● https://onlinecourses.nptel.ac.in/noc20_cs92/preview
@The CO-PO Mapping Matrix
PO/CO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 2 2 3 1 1 - - - 1 - 1 3
CO2 2 2 3 1 2 - - - 1 - - 3
CO3 2 2 3 2 2 2 - - 2 - 1 3
CO4 2 2 2 2 2 - - - - - - 3
CO5 2 2 3 3 3 1 - - 2 - 2 3
CO6 2 2 1 1 3 2 1 - 2 - 2 1
INDEX
Group-1
Sr.No. Title of Assignments Page
No.
Group A : Data Science
Data Wrangling, I 1
Perform the following operations using Python on any open source dataset
(e.g., data.csv)
1. Import all the required Python Libraries.
2.Locate an open source data from the web (e.g.
https://www.kaggle.com). Provide a cleardescription of the data and its
source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Preprocessing: check for missing values in the data using
pandas insult(), describe() function to get some initial statistics.
1
Provide variable descriptions. Types of variables etc. Check the
dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of
variables by checking the data types (i.e., character, numeric, integer,
factor, and logical) of the variables in the data set. If variables are not
in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in
the above steps and explain everything that you do to import/read/scrape
the data set.
Data Wrangling II 21
Create an “Academic performance” dataset of students and perform the
following operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are
missing values and/or inconsistencies, use any of the suitable
techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any
2
of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The
purpose of this transformation should be one of the following
reasons: to change the scale for better understanding of the variable,
to convert a non-linear relation into a linear one, or to decrease the
skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.
Descriptive Statistics - Measures of Central Tendency and variability 50
Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum,
3
standard deviation) for a dataset (age, income etc.) with numeric
variables grouped by one of the qualitative (categorical) variable.
2. For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of
income grouped by the age groups. Create a list that contains a
numeric value for each response to the categorical variable.
3. Write a Python program to display some basic statistical details like
percentile, mean, standard deviation etc. of the species of ‘Iris-
setosa’, ‘Iris-versicolor’ and ‘Iris- versicolor’ of iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
Data Analytics I 66
Create a Linear Regression Model using Python/R to predict home prices
using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
4 The Boston Housing dataset contains information about various houses in
Boston through different parameters. There are 506 samples and 14 feature
variables in this dataset.
The objective is to predict the value of prices of the house using the given
features.
Data Analytics II 81
1. Implement logistic regression using Python/R to perform
5 classificationonSocial_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy,
Error rate, Precision,Recall on the given dataset.
Data Analytics III 89
1. Implement Simple Naïve Bayes classification algorithm using
Python/R on iris.csvdataset.
6
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy,
Error rate, Precision,
Recall on the given dataset.
Text Analytics 95
1. Extract Sample document and apply following
7 document preprocessing methods:Tokenization, POS
Tagging, stop words removal, Stemming and
Lemmatization.
2. Create representation of document by calculating Term Frequency and
Inverse Document Frequency.
Data Visualization I 104
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and
contains information about the passengers who boarded the
8 unfortunate Titanic ship. Use the Seaborn library to see if we can find
any patterns in the data.
2. Write a code to check how the price of the ticket (column name:
'fare') for each passenger is distributed by plotting a
histogram.
Data Visualization II 121
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a
9 box plot for distribution of age with respect to each gender along
with the information about whether they survived or not. (Column
names : 'sex' and 'age')
Write observations on the inference from the above statistics.
10 Data Visualization III 133
Download the Iris flower dataset or any other dataset into
a DataFrame. (e.g.,https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the
dataset and give the inference as:
1. List down the features and their types (e.g., numeric,
nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate
the feature distributions.
3. Create a box plot for each feature in the dataset.
Compare distributions and identify outliers.
Group B- Big Data Analytics – JAVA/SCALA (Any three)
Write a code in JAVA for a simple Word Count application that counts the 137
1 number of occurrences of each word in a given input set using the Hadoop
Map-Reduce framework on local-standalone set-up.
Design a distributed application using Map-Reduce which processes a log file 143
2
of a system.
Locate dataset (e.g., sample_weather.txt) for working on weather data which
3 reads the text input files and finds average for temperature, dew point and
wind speed.
4 Write a simple program in SCALA using Apache Spark framework 147
Group C- Mini Projects/ Case Study – PYTHON/R (Any TWO Mini Project)
1. Write a case study on Global Innovation Network and Analysis (GINA).
Components of analytic plan are 1. Discovery business problem framed, 2.
Data, 3. Model planning analytic technique and 4. Results and Key findings.
2. Use the following dataset and classify tweets into positive and negative tweets.
https://www.kaggle.com/ruchi798/data-science-tweets
3. Develop a movie recommendation model using the scikit-learn library in
python.
Refer dataset
https://github.com/rashida048/Some-NLP-
Projects/blob/master/movie_dataset.csv
4. Use the following covid_vaccine_statewise.csv dataset and perform
following analytics on the given dataset
https://www.kaggle.com/sudalairajkumar/covid19-in-
india?select=covid_vaccine_statewise.csv
a. Describe the dataset
b. Number of persons state wise vaccinated for first dose in India
c. Number of persons state wise vaccinated for second dose in India
d. Number of Males vaccinated
d. Number of females vaccinated
5. Write a case study to process data driven for Digital Marketing OR Health
care systems with Hadoop Ecosystem components as shown. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLLib: (Provides analytical tools) Machine
Learning algorithm libraries
● Solar, Lucene: Searching and Indexing
Assignment 1

Title Data Wrangling I

Problem Perform the following operations using Python on any open source dataset (e.g., data.csv)
statement Import all the required Python Libraries.
1. Locate open source data from the web (e.g. https://www.kaggle.com).
2. Provide a clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into the pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas insult(),
describe() function to get some initial statistics. Provide variable descriptions.
Types of variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables
by checking the data types (i.e., character, numeric, integer, factor, and
logical) of the variables in the data set. If variables are not in the correct data
type, apply proper type conversions.
Turn categorical variables into quantitative variables in Python.
Prerequisite 1. Basic of Python Programming
2. Concept of Data Preprocessing, Data Formatting, Data Normalization and
Data Cleaning.
CO mapped 1,2
Programmi Python
ng Tools

Theory & Procedure:

Title: Perform the following operations using Python on any open source dataset (e.g.,
data.csv) Import all the required Python Libraries.
1. Locate open source data from the web (e.g. https://www.kaggle.com).
2. Provide a clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into the pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas insult(), describe() function
to get some initial statistics. Provide variable descriptions. Types of variables etc. Check the
dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking the data
types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables
are not in the correct data type, apply proper type conversions.
Turn categorical variables into quantitative variables in Python.

Objectives: Students should be able to perform the data wranglingoperation using Python on
any open source dataset.
Outcome: Students are able to perform the Data Formatting and Data Normalization.

Prerequisites: 1. Basic of Python Programming

2. Concept of Data Preprocessing, Data Formatting, Data Normalization and Data
Cleaning.
Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but
also nested data structures such as lists, maps, and other records.

Instance: A single row of data is called an instance. It is an observation from the

domain. Feature: A single column of data is called a feature. It is a component of an
observation and is also called an attribute of a data instance. Some features may be
inputs to a model (the predictors) and others may be outputs or the features to be
predicted.

Data Type: Features have a data type. They may be real or integer-valued or may
have a categorical or ordinal value. You can have strings, dates, times, and more
complex types, but typically they are reduced to real or categorical values when
working with traditional machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine
learning methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is
not used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns.
This type of data structure makes it easy to understand the data and pinpoint any
problems. An example of some raw data stored as a CSV (comma separated values).

The representation of the same data in a table is as follows:

and as Data Types

A data type is essentially an internal construct that a programming language uses to
understand how to store and manipulate data.
A possible confusing point about pandas data types is that there is some overlap
between pandas, python and numpy. This table summarizes the key points:

Pandas Pytho
Nu Usag
dtype ntype
m e
Py
typ
e
object str or string_, unicode_, mixed types Text or mixed numeric
mixe and non-numeric values
d
int64 int int_, int8, int16, int32, int64, Integer numbers
uint8, uint16,uint32, uint64
float64 float float_, float16, float32, float64 Floating point numbers
bool bool bool_ True/False values
datetime64 NA datetime64[ns] Date and time values
timedelta[ns] NA NA Differences between two
datetimes
category NA NA Finite list of text values
2. Python Libraries for Data Science
a. Pandas
Pandas is an open-source Python package that provides high-performance,
easy-to-use data structures and data analysis tools for the labeled data in Python
programming language.
What can you do with Pandas?

1. Indexing, manipulating, renaming, sorting, merging data frame

2. Update, Add, Delete columns from a data frame
3. Impute missing files, handle missing data or NANs
4. Plot data with histogram or box plot
b. NumPy
One of the most fundamental packages in Python, NumPy is a general-purpose array-
processing package. It provides high-performance multidimensional array objects and
tools to work with the arrays. NumPy is an efficient container of generic multi-
dimensional data.
NumPy’s main object is the homogeneous multidimensional array. It is a table of
elements or numbers of the same datatype, indexed by a tuple of positive integers. In
NumPy, dimensions are called axes and the number of axes is called rank. NumPy’s
array class is called ndarray aka array.

What can you do with NumPy?

1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.
What can you do with Matplotlib?
Histogram, bar plots, scatter plots, area plot to pie plot, Matplotlib can depict a wide
range of visualizations. With a bit of effort and tint of visualization capabilities, with
Matplotlib, you can create just any visualizations:Line plots
● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots
● Quiver plots
● Spectrograms
Matplotlib also facilitates labels, grids, legends, and some more formatting entities
with Matplotlib.
d. Seaborn
So when you read the official documentation on Seaborn, it is defined as the data
visualization library based on Matplotlib that provides a high-level interface for
drawing attractive and informative statistical graphics. Putting it simply, seaborn is
an extension of Matplotlib with advanced features.
What can you do with Seaborn?

1. Determine relationships between multiple variables (correlation)

2. Observe categorical variables for aggregate statistics
3. Analyze univariate or bi-variate distributions and compare them between
different data subsets
4. Plot linear regression models for dependent variables
5. Provide high-level abstractions, multi-plot grids
6. Seaborn is a great second-hand for R visualization libraries like corrplot and ggplot.
e. 5. Scikit Learn
Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust
machine learning library for Python. It features ML algorithms like SVMs, random
forests, k-means clustering, spectral clustering, mean shift, cross-validation and
more... Even NumPy, SciPy and related scientific operations are supported by Scikit
Learn with Scikit Learn being a part of the SciPy Stack.
What can you do with Scikit Learn?
1. Classification: Spam detection, image recognition
2. Clustering: Drug response, Stock price
3. Regression: Customer segmentation, Grouping experiment outcomes
4. Dimensionality reduction: Visualization, Increased efficiency
5. Model selection: Improved accuracy via parameter tuning
6. Pre-processing: Preparing input data as a text for processing with
machine learning algorithms.
3. Description of Dataset:
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine
Learning Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are
not linearly separable from each other.
Total Sample- 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
3 Different Types of Species each contain 50 Sample-
Description of Dataset-

4. Panda Dataframe functions for Load Dataset

# The columns of the resulting DataFrame have different

dtypes. iris.dtypes
1. The dataset is downloads from UCI repository.
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
2. Now Read CSV File as a Dataframe in Python from from path where you saved the
same The Iris data set is stored in .csv format. ‘.csv’ stands for comma separated
values. It is easier to load .csv files in Pandas data frame and perform various
analytical operations on it.
Load Iris.csv into a Pandas data frame —
Syntax-
iris = pd.read_csv(csv_url, header = None)

3. The csv file at the UCI repository does not contain the variable/column names. They
are located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']

4. read in the dataset from the UCI Machine Learning Repository link and specify
column names to use
iris = pd.read_csv(csv_url, names = col_names)
5. Panda Dataframe functions for Data
Preprocessing : Dataframe Operations:

Sr. Data Frame Function Description

1 dataset.head(n=5) Return the first n rows.

2 dataset.tail(n=5)
Return the last n rows.

3 dataset.index The index (row labels) of the Dataset.

4 dataset.columns The column labels of the Dataset.

5 dataset.shape Return a tuple representing the dimensionality of

the Dataset.

6 dataset.dtypes Return the dtypes in the Dataset.

This returns a Series with the data type of each column.
The result’s index is the original Dataset’s columns.
Columns with mixed types are stored with the object
dtype.

7 dataset.columns.values Return the columns values in the Dataset in arrayformat

8 dataset.describe(include='all') Generate descriptive statistics.to view some basic

statistical details like percentile, mean, std etc. of a data
frame or a series of numericvalues.Analyzes both numeric
and object series, as well as Dataset column sets of mixed
data types.
9 dataset['Column name] Read the Data Column wise.

10 dataset.sort_index(axis=1, Sort object by labels (along an axis).

ascending=False)
dataset.sort_values(by="Col Sort values by column name.
11
u mn name")

12 dataset.iloc[5] Purely integer-location based indexing for selection by

position.

13 dataset[0:3] Selecting via [], which slices the rows.

14 dataset.loc[:, ["Col_name1", Selection by label
"col_name2"]]

15 dataset.iloc[:n, :] a subset of the first n rows of the original data

16 dataset.iloc[:, :n] a subset of the first n columns of the original data
17 dataset.iloc[:m, :n] a subset of the first m rows and the first n columns

Few Examples of iLoc to slice data for iris Dataset

Sr. Data Frame Description Output

No Function

1 dataset.iloc[3:5, 0:2] Slice the data

2 dataset.iloc[[1, 2, By lists of integer

4], [0, 2]] position locations,
similar to the
NumPy/Python
style:
3 dataset.iloc[1:3, :] For slicing
rows explicitly:

4 dataset.iloc[:, 1:3] For slicing Column

explicitly:

4 dataset.iloc[1, 1] For getting a value

explicitly:

5 dataset['SepalLeng Accessing Column and

thCm'].iloc[5] Rows by position
6 cols_2_4=dataset.c Get Column Name then
olumns[2:4] get data from column
dataset[cols_2_4]

7 dataset[dataset.col in one Expression answer

umns[2:4]].iloc[5: for the above two
1 0] commands

Checking of Missing Values in Dataset:

● isnull() is the function that is used to check missing values or null values in pandas python.
● isna() function is also used to get the count of missing values of column and row
wise count of missing values
● The dataset considered for explanation is:

a. is there any missing values in dataframe as a

whole Function: DataFrame.isnull()
Output:

b. is there any missing values across each

column Function:

DataFrame.isnull().any()
Output:

c. count of missing values across each column using isna() and isnull()
In order to get the count of missing values of the entire dataframe isnull() function
is used. sum() which does the column wise sum first and doing another sum() will
get the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. count row wise missing value using
isnull() Function:
dataframe.isnull().sum(axis = 1)

Output:

e. count Column wise missing value using

isnull() Method 1:
Function: dataframe.isnull().sum()

Output:

Method 2:
unction: dataframe.isna().sum()
f. count of missing values of a specific
column.
Function:dataframe.col_name.isnull().sum
()

df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
In order to get the count of missing values of the particular column by group
in pandas we will be using isnull() and sum() function with apply() and
groupby() which performs the group wise count of missing values as shown
below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda
x: x.isnull().sum())
Output:

6. Panda functions for Data Formatting and Normalization

The Transforming data stage is about converting the data set into a format that can be

analyzed or modelled effectively, and there are several techniques for this process.

a. Data Formatting: Ensuring all data formats are correct (e.g. object, text,
floating number, integer, etc.) is another part of this initial ‘cleaning’ process. If
you are working with dates in Pandas, they also need to be stored in the exact
format to use special date-time functions.

Functions used for data formatting

Sr. Data Frame Description Output

No Function

1. df.dtypes To check the

data type
2. df['petal length To change the data
(cm)']= df['petal type (data type of
‘petal length
length (cm)'changed to
(cm)'].astype("int int)
")

b. Data normalization: Mapping all the nominal data values onto a uniform
scale (e.g. from 0 to 1) is involved in data normalization. Making the ranges
consistent

across variables helps with statistical analysis and ensures better comparisons
later on.It is also known as Min-Max scaling.

Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Print iris dataset.
df.head()
Step 5: Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()

Step 6: Separate the feature from the class label

x=df.iloc[:,:4]
Step 6: Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)
Step 7:Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
Step 8: View the dataframe
df_normalized
Output: After Step 3:
Output after step 8:

7. Panda Functions for handling categorical variables

● Categorical variables have values that describe a ‘quality’ or ‘characteristic’
of a data unit, like ‘what type’ or ‘which category’.
● Categorical variables fall into mutually exclusive (in one category or in
another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a
non-numeric value.
● Categorical features refer to string type data and can be easily understood by
human beings. But in case of a machine, it cannot interpret the categorical
data directly. Therefore, the categorical data must be translated into
numerical data that can be understood by machine.
There are many ways to convert categorical data into numerical data. Here the three most
used methods are discussed.

a. Label Encoding: Label Encoding refers to converting the labels into a numeric form
so as to convert them into the machine-readable form. It is an important
preprocessing step for the structured dataset in supervised learning.
Example : Suppose we have a column Height in some dataset. After applying label
encoding,the Height column is converted into:
where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder : It Encode labels with value between
0 and n_classes-1.
● fit_transform(y):
Parameters: yarray-like of shape
(n_samples,) Target values.
Returns: yarray-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa',
'Iris-versicolor', 'Iris-virginica'], dtype=object)

Step 4: define label_encoder object knows how to understand word labels.

label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)

● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.

Limitation: Label encoding converts the data in machine-readable form, but it

assigns a unique number(starting from 0) to each class of data. This may lead to
the generation of priority issues in the data sets. A label with a high value may be
considered to have high priority than a label having a lower value.
b. One-Hot Encoding:

In one-hot encoding, we create a new set of dummy (binary) variables that is equal to
the number of categories (k) in the variable. For example, let’s say we have a
categorical variable Color with three categories called “Red”, “Green” and “Blue”, we
need to use three dummy variables to encode this variable using one-hot encoding. A
dummy (binary) variable just takes the value 0 or 1 to indicate the exclusion or
inclusion of a category.

In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.
One-hot encoding on iris dataset: For iris dataset the target column which is Species.
It contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encode
categorical integer features using a one-hot aka one-of-K scheme
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa',
'Iris-versicolor', 'Iris-virginica'], dtype=object)

Step 6: Apply one_hot encoder for Species column.

enc = preprocessing.OneHotEncoder()
enc_df=pd.DataFrame(enc.fit_transform(df[['Species']])).toarray()

Step 7: Join the encoded values with Features variable

df_encode = features_df.join(enc_df)

Step 8: Observe the merge dataframe

df_encode
Step 9: Rename the newly encoded columns.
df_encode.rename(columns = {0:'Iris-Setosa',
1:'Iris-Versicolor',2:'Iris-virginica'}, inplace = True)
Step 10: Observe the merge dataframe
df_encode

Output after Step 8:

Output after Step 10:

c. Dummy Variable Encoding

Dummy encoding also uses dummy (binary) variables. Instead of creating a number
of dummy variables that is equal to the number of categories (k) in the variable,
dummy encoding uses k-1 dummy variables. To encode the same Color variable with
three categories using the dummy encoding, we need to use only two dummy
variables.

In dummy encoding,
“Red” color is encoded as [1 0] vector of size
2. “Green” color is encoded as [0 1] vector of
size 2. “Blue” color is encoded as [0 0] vector
of size 2.
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None,
prefix_sep='_', dummy_na=False, columns=None,
sparse=False, drop_first=False, dtype=None): Convert
categorical variable into dummy/indicator variables.
● Parameters:
data:array-like, Series, or DataFrame
Data of which to get dummy
indicators.
prefixstr: list of str, or dict of str, default
None String to append DataFrame column
names.
prefix_sep: str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as
with prefix.
dummy_nabool: default False
Add a column to indicate NaNs, if False NaNs are ignored.

columns: list:like, default None

Column names in the DataFrame to be encoded. If columns is None then all
the columns with object or category dtype will be converted.

sparse: bool: default False

Whether the dummy-encoded columns should be backed by a SparseArray (
True) or a regular NumPy array (False).

drop_first:bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype: dtype, default np.uint8

Data type for new columns. Only a single dtype is allowed.
● Return : DataFrame with Dummy-coded data.

Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa',
'Iris-versicolor', 'Iris-virginica'], dtype=object)

Step 4: Apply label_encoder object for label encoding the Observe the unique
values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 6: Apply one_hot encoder with dummy variables for Species column.
one_hot_df = pd.get_dummies(df,
prefix="Species", columns=['Species'],
drop_first=True)
Step 7: Observe the merge dataframe
one_hot_df

Conclusion- In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values on Iris
Dataset. Assignment Question
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?
Assignment No: 2
Title Data Wrangling, II
Problem
Create an “Academic performance” dataset of students and perform the
Statement
following operations using Python.
1. Scan all variables for missing values and inconsistencies. If
there are missing values and/or inconsistencies, use any of the
suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use
any of the suitabletechniques to deal with them.
3. Apply data transformations on at least one of the variables. The
purpose of this transformation should be one of the following
reasons: to change the scale forbetter understanding of the
variable, to convert a non-linear relation into a linearone, or to
decrease the skewness and convert the distribution into a
normaldistribution.
Reason and document your approach properly.
Prerequisite
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
3. Identification and Handling of Outliers
4. Data Transformation for the purpose of :
a. To change the scale for better understanding
b. To decrease the skewness and convert distribution into normal distribution
CO mapped 2
Programming Python, Jupyter
Tools
Theory & Procedure:
Title: Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with
them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable

21
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for
better understanding of the variable, to convert a non-linear relation into a
linear one, or to decrease the skewness and convert the distribution into a
normal distribution.
Reason and document your approach properly.
Objectives: Students should be able to perform the data wrangling operation using
Python on any open source dataset
Outcomes: Students are able to perform data transformations.
Prerequisites:
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
3. Identification and Handling of Outliers
4. Data Transformation for the purpose of :
a. To change the scale for better understanding
b. To decrease the skewness and convert distribution into normal

distribution Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB

SATA HD. Software Specification: Ubuntu 14.04, Python, Jupyter.

1. Creation of Dataset using Microsoft Excel.

The dataset is created in “CSV” format.
● The name of dataset is StudentsPerformance
● The features of the dataset are: Math_Score, Reading_Score,
Writing_Score, Placement_Score, Club_Join_Date .
● Number of Instances: 30
● The response variable is: Placement_Offer_Count .
● Range of Values:
Math_Score [60-80], Reading_Score[75-,95], ,Writing_Score [60,80],
Placement_Score[75-100], Club_Join_Date [2018-2021].
● The response variable is the number of placement offers facilitated to
particular students, which is largely depend on Placement_Score
To fill the values in the dataset the RANDBETWEEN is used. Returns a random
integer number between the numbers you specify
Syntax : RANDBETWEEN(bottom, top) Bottom The smallest integer and
Top The largest integer RANDBETWEEN will return.
For better understanding and visualization, 20% impurities are added into each
variable to the dataset.
The step to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other .Formats

Step 2: Enter the name of the dataset and Save the dataset astye CSV(MS-DOS).
Step 3: Enter the name of features as column header.

Step 3: Fill the dara by using RANDOMBETWEEN function. For every feature ,
fill the data by considering above spectified range.
one example is given:

Scroll down the cursor for 30 rows to create 30 instances.

Repeat this for the features, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date.
The placement count largely depends on the placement score. It is considered that if
placement score <75, 1 offer is facilitated; for placement score >75 , 2 offer is
facilitated and for else (>85) 3 offer is facilitated. Nested If formula is used for ease
of data filling.
Step 4: In 20% data, fill the impurities. The range of math score is [60,80], updating a
few instances values below 60 or above 80. Repeat this for Writing_Score [60,80],
Placement_Score[75-100], Club_Join_Date [2018-2021].

Step 5: To violate the ruleof response variable, update few valus . If placement
scoreis greater then 85, facilated only 1 offer.

The dataset is created with the given description.

2. Identification and Handling of Null Values

Missing Data can occur when no information is provided for one or more items or for
a whole unit. Missing Data is a very big problem in real-life scenarios. Missing Data
can also refer to as NA(Not Available) values in pandas. In DataFrame sometimes
many datasets simply arrive with missing data, either because it exists and was not
collected or it never existed. For Example, Suppose different users being surveyed
may choose not to share their income, some users may choose not to share the address
in this way many datasets went missing.
In Pandas missing data is represented by two value:

1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.
Pandas treat None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in Pandas DataFrame :

● isnull()

● notnull()
● dropna()
● fillna()
● replace()

1. Checking for missing values using isnull() and notnull()

● Checking for missing values using isnull()

In order to check null values in Pandas DataFrame, isnull() function is used.
This function return dataframe of Boolean values which are True for NaN values.

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: Use isnull() function to check null values in the dataset.
df.isnull()

Step 5: To create a series true for NaN values for specific columns. for example
math score in dataset and display data with only math score as NaN
series = pd.isnull(df["math
score"]) df[series]

● Checking for missing values using notnull()

In order to check null values in Pandas Dataframe, notnull() function is used. This
function return dataframe of Boolean values which are False for NaN values.

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: Use notnull() function to check null values in the dataset.
df.notnull()

Step 5: To create a series true for NaN values for specific columns. for example
math score in dataset and display data with only math score as NaN
series1 = pd.notnull(df["math
score"]) df[series1]
See that there are also categorical values in the dataset, for this, you need to use
Label Encoding or One Hot Encoding.
from sklearn.preprocessing import
LabelEncoder le = LabelEncoder()
df['gender'] =
le.fit_transform(df['gender']) newdf=df
df

2. Filling missing values using dropna(), fillna(), replace()

In order to fill null values in a datasets, fillna(), replace() functions are used.
These functions replace NaN values with some value of their own. All these
functions help in filling null values in datasets of a DataFrame.

● For replacing null values with NaN

missing_values = ["Na", "na"]
df = pd.read_csv("StudentsPerformanceTest1.csv", na_values
= missing_values)
df

● Filling null values with a single value

Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: filling missing value using fillna()
ndf=df
ndf.fillna

(0)
Step 5: filling missing values using mean, median and standard deviation of that
column.

data['math score'] = data['math score'].fillna(data['math score'].mean())

data[''math score''] = data[''math

score''].fillna(data[''math score''].median())

data['math score''] = data[''math score''].fillna(data[''math score''].std())

replacing missing values in forenoon column with minimum/maximum

number of that column

data[''math score''] = data[''math score''].fillna(data[''math score''].min())

data[''math score''] = data[''math score''].fillna(data[''math score''].max())

● Filling null values in dataset

To fill null values in dataset use inplace=true
m_v=df['math score'].mean()
df['math score'].fillna(value=m_v,
inplace=True) df

● Filling a null values using replace() method

Following line will replace Nan value in dataframe with value -99

ndf.replace(to_replace = np.nan, value = -99)

● Deleting null values using dropna() method
In order to drop null values from a dataframe, dropna() function is used. This
function drops Rows/Columns of datasets with Null values in different ways.
1. Dropping rows with at least 1 null value

2. Dropping rows if all values in that row are missing

3. Dropping columns with at least 1 null value.
4. Dropping Rows with at least 1 null value in CSV file

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as
pd import numpy
as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4:To drop rows with at least 1 null value
ndf.dropna()
Step 5: To Drop rows if all values in that row are missing
ndf.dropna(how = 'all')

Step 6: To Drop columns with at least 1 null value.

ndf.dropna(axis = 1)

Step 7 : To drop rows with at least 1 null value in CSV

file. making new data frame with dropped NA values
new_data = ndf.dropna(axis = 0, how
='any')
new_data

3. Identification and Handling of Outliers

3.1 Identification of Outliers
One of the most important steps as part of data preprocessing is detecting and treating
the outliers as they can negatively affect the statistical analysis and the training
process of a machine learning algorithm resulting in lower accuracy.
1. What are Outliers?
We all have heard of the idiom ‘odd one out' which means something unusual
in comparison to the others in a group.

Similarly, an Outlier is an observation in a given dataset that lies far from the
rest of the observations. That means an outlier is vastly larger or smaller than the
remaining values in the set.

2. Why do they occur?

An outlier may occur due to the variability in the data, or due to experimental
error/human error.

They may indicate an experimental error or heavy skewness in the data(heavy-

tailed distribution).

3. What do they affect?

In statistics, we have three measures of central tendency namely Mean,
Median, and Mode. They help us describe the data.

Mean is the accurate measure to describe the data when we do not have any
outliers present. Median is used if there is an outlier in the dataset. Mode is used if
there is an outlier AND about ½ or more of the data is the same.

‘Mean’ is the only measure of central tendency that is affected by the outliers
which in turn impacts Standard deviation.
Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9].
By looking at it, one can quickly say ‘101’ is an outlier that is much larger than the
other values.

fig. Computation with and without outlier

From the above calculations, we can clearly say the Mean is more affected than the
Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset.
But what if we have a huge dataset, how do we identify the outliers then? We need to
use visualization and mathematical techniques.

Below are some of the techniques of detecting outliers

● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)

4.1 Detecting outliers using Boxplot:

It captures the summary of the data effectively and efficiently with only a
simple box and whiskers. Boxplot summarizes sample data using 25th, 50th, and 75th
percentiles. One can just get insights(quartiles, median, and outliers) into the dataset
by just looking at its boxplot.
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df

Step 4:Select the columns for boxplot and draw the boxplot.

col = ['math score', 'reading score' ,

'writing score','placement score']
df.boxplot(col)

Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))
print(np.where(df['reading
score']<25))
print(np.where(df['writing
score']<30))

4.2 Detecting outliers using Scatterplot:

It is used when you have paired numerical data, or when your dependent
variable has multiple values for each reading independent variable, or when trying to
determine the relationship between the two variables. In the process of utilizing the
scatter plot, one can also use it for outlier detection.
To plot the scatter plot one requires two variables that are somehow related to
each other. So here Placement score and Placement count features are used.
Algorithm:
Step 1 : Import pandas , numpy and matplotlib libraries
import pandas as
pd import numpy
as np
import matplotlib.pyplot as plt
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot with placement score and placement offer count
fig, ax = plt.subplots(figsize = (18,10))
ax.scatter(df['placement score'], df['placement offer
count'])
plt.show()
Labels to the axis can be assigned (Optional)
ax.set_xlabel('(Proportion non-retail
business acres)/(town)')
ax.set_ylabel('(Full-value property-tax rate)/(
$10,000)')
Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score']<50) &
(df['placement offer count']>1)))
print(np.where((df['placement score']>85) &
(df['placement offer count']<3)))

4.3 Detecting outliers using Z-Score:

Z-Score is also called a standard score. This value/score helps to
understand how far is the data point from the mean. And after setting up a
threshold value one can utilize z score values of data points to define the
outliers.
Zscore = (data_point -mean) / std. deviation

Algorithm:
Step 1 : Import numpy and stats from scipy libraries
import numpy as np

from scipy import

stats

Step 2: Calculate Z-Score for mathscore column

z = np.abs(stats.zscore(df['math score']))
Step 3: Print Z-Score Value. It prints the z-score values of each data
item of the column
print(z)
Step 4: Now to define an outlier threshold value is chosen.
threshold = 0.18
Step 5: Display the sample outliers
sample_outliers = np.where(z
<threshold) sample_outliers

4.4 Detecting outliers using Inter Quantile Range(IQR):

IQR (Inter Quartile Range) Inter Quartile Range approach to finding
the outliers is the most commonly used and most trusted approach used in the
research field.
IQR = Quartile3 – Quartile1
To define the outlier base value is defined above and below datasets
normal range namely Upper and Lower bounds, define the upper and the
lower bound (1.5*IQR value is considered) :

upper = Q3
+1.5*IQR lower =
Q1 – 1.5*IQR
In the above formula as according to statistics, the 0.5 scale-up of IQR
(new_IQR = IQR + 0.5*IQR) is taken.

Algorithm:
Step 1 : Import numpy library
import numpy as np

Step 2: Sort Reading Score feature and store it into sorted_rscore.

sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3
q1 = np.percentile(sorted_rscore,
25) q3 =
np.percentile(sorted_rscore, 75)
print(q1,q3)

Step 5: Calculate value of IQR (Inter Quartile Range)

IQR = q3-q1
Step 6: Calculate and print Upper and Lower Bound to define
the outlier base value.
lwr_bound = q1-(1.5*IQR)
upr_bound = q3+(1.5*IQR)
print(lwr_bound,
upr_bound)

Step 7: Print Outliers

r_outliers = []
for i in sorted_rscore:
if (i<lwr_bound or
i>upr_bound):
r_outliers.append(i)
print(r_outliers)

3.2 Handling of Outliers:

For removing the outlier, one must follow the same process of removing an
entry from the dataset using its exact position in the dataset because in all the above
methods of detecting the outliers end result is the list of all those data items that
satisfy the outlier definition according to the method used.

Below are some of the methods of treating the outliers

● Trimming/removing the outlier

● Quantile based flooring and capping
● Mean/Median imputation
● Trimming/removing the outlier:
In this technique, we remove the outliers from the dataset. Although it is not a
good practice to follow.
new_df=df
for i in sample_outliers:
new_df.drop(i,inplace=True)
new_df

Here Sample_outliers are So instances with

index 0, 12 ,16 and 17 are deleted.

● Quantile based flooring and capping:

In this technique, the outlier is capped at a certain value above the 90th percentile
value or floored at a factor below the 10th percentile value
df=pd.read_csv("/demo.csv")
df_stud=df
ninetieth_percentile = np.percentile(df_stud['math score'], 90)
b = np.where(df_stud['math
score']>ninetieth_percentile, ninetieth_percentile,
df_stud['math score'])
print("New array:",b)

df_stud.insert(1,"m
score",b,True) df_stud

● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace
the outliers with the median value.
1. Plot the box plot for reading
score col = ['reading
score'] df.boxplot(col)

2. Outliers are seen in box plot.

3. Calculate the median of reading score by using sorted_rscore
median=np.median(sorted_rscore
) median
4. Replace the upper bound outliers using median value
refined_df=df
refined_df['reading score'] =
np.where(refined_df['reading score'] >upr_bound,
median,refined_df['reading score'])
5. Display redefined_df

6. Replace the lower bound outliers using median value

refined_df['reading score'] =
np.where(refined_df['reading score'] <lwr_bound,
median,refined_df['reading score'])
7. Display redefined_df
8. Draw the box plot for
redefined_df col =
['reading score']
refined_df.boxplot(col)

4. Data Transformation for the purpose of :

Data transformation is the process of converting raw data into a format or structure
that would be more suitable for model building and also data discovery in general.The
process of data transformation can also be referred to as extract/transform/load (ETL).
The extraction phase involves identifying and pulling data from the various source
systems that create data and then moving the data to a single repository. Next, the raw
data is cleansed, if needed. It's then transformed into a target format that can be fed
into operational systems or into a data warehouse, a date lake or another repository for
use in business intelligence and analytics applications. The transformation The data
are transformed in ways that are ideal for mining the data. The data transformation
involves steps that are.

● Smoothing: It is a process that is used to remove noise from the dataset using
some algorithms It allows for highlighting important features present in the
dataset. It helps in predicting the patterns
● Aggregation: Data collection or aggregation is the method of storing and
presenting data in a summary format. The data may be obtained from multiple
data sources to integrate these data sources into a data analysis description. This
is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization:It converts low-level data attributes to high-level data attributes
using concept hierarchy. For Example Age initially in Numerical form (22, 25) is
converted into categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing
normalization are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by

changing the position of their decimal points

● Attribute or feature construction.
○ New attributes constructed from the given ones: Where new attributes are
created & applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
In this assignment , The purpose of this transformation should be one of the
following reasons:

a. To change the scale for better understanding (Attribute or feature

construction)
Here the Club_Join_Date is transferred to Duration.
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df

Step 3: Change the scale of Joining year to duration.

b. To decrease the skewness and convert distribution into normal

distribution (Normalization by decimal scaling)
Data Skewness: It is asymmetry in a statistical distribution, in which the
curve appears distorted or skewed either to the left or to the right. Skewness
can be quantified to define the extent to which a distribution differs from a
normal distribution.
Normal Distribution: In a normal distribution, the graph appears as a
classical, symmetrical “bell-shaped curve.” The mean, or average, and the
mode, or maximum point on the curve, are equal.
Positively Skewed Distribution
A positively skewed distribution means that the extreme data results are
larger. This skews the data in that it brings the mean (average) up. The mean
will be larger than the median in a Positively skewed distribution.
A negatively skewed distribution means the opposite: that the extreme data
results are smaller. This means that the mean is brought down, and the median
is larger than the mean in a negatively skewed distribution.

Reducing skewness A data transformation may be used to reduce skewness.

A distribution that is symmetric or nearly so is often easier to handle and
interpret than a skewed distribution. The logarithm, x to log base 10 of x, or x
to log base e of x (ln x), or x to log base 2 of x, is a strong transformation with
a major effect on distribution shape. It is commonly used for reducing right
skewness and is often appropriate for measured variables. It can not be applied
to zero or negative values.
Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and
remove the outliers.
Step 2: Observe the histogram for math_score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind =
'hist')
Step 3: Convert the variables to logarithm at the scale 10.
df['log_math'] = np.log10(df['math score'])

Step 4: Observe the histogram for math_score variable.

df['log_math'].plot(kind = 'hist')

It is observed that skewness is reduced at some level.

Conclusion: In this way we have explored the functions of the python library for Data
Identifying and handling the outliers. Data Transformations Techniques are explored with the
purpose of creating the new variable and reducing the skewness from datasets.
Assignment Question:
1. Explain the methods to detect the outlier.
2. Explain data transformation methods
3. Write the algorithm to display the statistics of Null values present in the dataset.
4. Write an algorithm to replace the outlier value with the mean of the variable.
Assignment No: 3

Title Descriptive Statistics - Measures of Central Tendency and variability

Problem statement Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum,
maximum, standard deviation) for a dataset (age, income etc.) with
numeric variables grouped by one of the qualitative (categorical)
variables. For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of
income grouped by the age groups. Create a list that contains a
numeric value for each response to the categorical variable.
2. Write a Python program to display some basic statistical
details like percentile, mean, standard deviation etc. of the species of
‘Iris- setosa’, ‘Iris-versicolor’ and ‘Iris- versicolor’ of iris.csv
dataset.
Provide the codes with outputs and explain everything that you do in
this step.
Prerequisite 1. Basic of Python Programming
2. Concept of statistics such as mean, median, minimum, maximum,
standard deviation etc.
CO mapped 2
Programming Tools Python

Theory & Procedure:

Title of the Assignment: Descriptive Statistics - Measures of Central Tendency and
variability
Perform the following operations on any open source dataset (e.g., data.csv)
3. Provide summary statistics (mean, median, minimum, maximum, standard
deviation) for a dataset (age, income etc.) with numeric variables grouped by one of the
qualitative (categorical) variables. For example, if your categorical variable is age groups
and quantitative variable is income, then provide summary statistics of income grouped by
the age groups. Create a list that contains a numeric value for each response to the
categorical variable.
4. Write a Python program to display some basic statistical details like percentile,
mean, standard deviation etc. of the species of ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-
versicolor’ of
iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
Objective: Students should be able to perform the Statistical operations using Python on
any open source dataset.

Outcome: Students are able to perform Descriptive Statistics.

Prerequisite:

1. Basic of Python Programming

2. Concept of statistics such as mean, median, minimum, maximum, standard deviationetc.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python,Jupyter

Theory:
Summary statistics: What is Statistics?

Statistics is the science of collecting data and analysing them to infer proportions
(sample) that are representative of the population. In other words, statistics is
interpreting data in order to make predictions for the population.
Branches of Statistics:
There are two branches of Statistics.
DESCRIPTIVE STATISTICS : Descriptive Statistics is a statistics or a measure that
describes the data.
INFERENTIAL STATISTICS : Using a random sample of data taken from a
population to describe and make inferences about the population is called Inferential
Statistics.
Descriptive Statistics
Descriptive Statistics is summarising the data at hand through certain numbers like
mean, median etc. so as to make the understanding of the data easier. It does not
involve any generalisation or inference beyond what is available. This means that the
descriptive
statistics are just the representation of the data (sample) available and not based on any
theory of probability.
Commonly Used Measures
1. Measures of Central Tendency
2. Measures of Dispersion (or Variability)

● Measures of Central Tendency

A Measure of Central Tendency is a one number summary of the data that
typically describes the centre of the data. This one number summary is of three
types.
a. Mean : Mean is defined as the ratio of the sum of all the observations in the
data to the total number of observations. This is also known as Average. Thus
mean is a number around which the entire data set is spread.
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23

b. Median : Median is the point which divides the entire data into two equal
halves. One-half of the data is less than the median, and the other half is
greater than the same. Median is calculated by first arranging the data in either
ascending or descending order.
○ If the number of observations is odd, the median is given by the middle
observation in the sorted form.
○ If the number of observations are even, median is given by the mean
of the two middle observations in the sorted form.
An important point to note is that the order of the data (ascending or
descending) does not affect the median.

To calculate Median, let's arrange the data in ascending order.

11, 15, 16, 17, 17, 18, 19, 21, 21, 23

Since the number of observations is even (10), median is given by

the average of the two middle observations (5th and 6th here).
c. Mode : Mode is the number which has the maximum frequency in the entire
data set, or in other words,mode is the number that appears the maximum
number of times. A data can have one or more than one mode.
● If there is only one number that appears the maximum number of
times, the data has one mode, and is called Uni-modal.
● If there are two numbers that appear the maximum number of times,
the data has two modes, and is called Bi-modal.
● If there are more than two numbers that appear the maximum number
of times, the data has more than two modes, and is called Multi-modal.
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23
Mode is given by the number that occurs the maximum number of
times. Here, 17 and 21 both occur twice. Hence, this is a Bimodal data and the
modes are 17 and 21.
● Measures of Dispersion (or Variability)

Measures of Dispersion describes the spread of the data around the central value (or
the Measures of Central Tendency)

1. Absolute Deviation from Mean — The Absolute Deviation from Mean, also
called Mean Absolute Deviation (MAD), describes the variation in the data
set, in the sense that it tells the average absolute distance of each data point in
the set. It is calculated as

2. Variance — Variance measures how far are data points spread out from the
mean. A high variance indicates that data points are spread widely and a small
variance indicates that the data points are closer to the mean of the data set. It
is calculated as
3. Standard Deviation — The square root of Variance is called the Standard
Deviation. It is calculated as

4. Range — Range is the difference between the Maximum value and the
Minimum value in the data set. It is given as

5. Quartiles — Quartiles are the points in the data set that divides the data set
into four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of
the data set.

● 25% of the data points lie below Q1 and 75% lie above it.

● 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing
but Median.

● 75% of the data points lie below Q3 and 25% lie above it.

6. Skewness — The measure of asymmetry in a probability distribution is

defined by Skewness. It can either be positive, negative or undefined.
Positive Skew — This is the case when the tail on the right side of the curve is
bigger than that on the left side. For these distributions, mean is greater than
the mode.
Negative Skew — This is the case when the tail on the left side of the curve is
bigger than that on the right side. For these distributions, mean is smaller than
the mode.
The most commonly used method of calculating Skewness is
If the skewness is zero, the distribution is symmetrical. If it is negative, the
distribution is Negatively Skewed and if it is positive, it is Positively Skewed.

Python Code:
1. Mean
To find mean of all columns
Syntax:
df.mean()
Output:

To find mean of specific column

Syntax:
df.loc[:,'Age'].mean()
Output:
38.85
To find mean row wise
Syntax:
df.mean(axis=1)[0:4]
Output:

2. Median
To find median of all columns
Syntax:
df.median()
Output:

To find median of specific column

Syntax:
df.loc[:,'Age'].median()
Output:
36.0
To find median row wise
Syntax:
df.median(axis=1)[0:4]
Output:

3. Mode
To find mode of all columns
df.mode()
Syntax:
Output:

In the Genre Column mode is Female, for column Age mode is 32 etc.
If a particular column does not have mode all the values will be
displayed in the column.
To find the mode of a specific
column. Syntax:
df.loc[:,'Age'].mode()
Output:
32

1. Minimum
To find median of all columns
Syntax:
df.min()

Output:

To find median of Specific column

Syntax:
df.loc[:,'Age'].min(skipna = False)

Output:
18

2. Maximum
To find median of all columns
Syntax:
df.max()

Output:

To find median of Specific column

Syntax:
df.loc[:,'Age'].min(skipna = False)

Output:
18

3. Standard Deviation

To find Standard Deviation of all columns

Syntax:
df.std()

Output:

To find Standard Deviation of specific column

Syntax:
df.loc[:,'Age'].std()

Output:
13.969007331558883

To find Standard Deviation row wise

Syntax:
df.std(axis=1)[0:4]

Output:

2. Types of Variables:

A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.

Variables may be classified into two main categories:

● Categorical and
● Numeric.

Each category is then classified in two subcategories: nominal or ordinal for categorical
variables, discrete or continuous for numeric variables.

● Categorical variables

A categorical variable (also called qualitative variable) refers to a

characteristic that can’t be quantifiable.
Categorical variables can be either nominal or ordinal.
○ Nominal Variable
A nominal variable is one that describes a name, label or category without
natural order. In the given table, the variable “mode of transportation for travel
to work” is also nominal.
○ Ordinal Variable
An ordinal variable is a variable whose values are defined by an order relation
between the different categories. In following table, the variable “behaviour”
is ordinal because the category “Excellent” is better than the category “Very
good,” which is better than the category “Good,” etc. There is some natural
ordering, but it is limited since we do not know by how much “Excellent”
behaviour is better than “Very good” behaviour.

● Numerical Variables
A numeric variable (also called quantitative variable) is a quantifiable characteristic
whose values are numbers (except numbers which are codes standing up for

categories). Numeric variables may be either continuous or discrete.

○ Continuous variables
A variable is said to be continuous if it can assume an infinite number of real
values within a given interval.
For instance, consider the height of a student. The height can’t take any
values. It can’t be negative and it can’t be higher than three metres. But
between 0 and 3, the number of possible values is theoretically infinite. A
student may be 1.6321748755 … metres tall.

○ Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a
finite number of real values within a given interval.
An example of a discrete variable would be the score given by a judge
to a gymnast in competition: the range is 0 to 10 and the score is always given
to one decimal (e.g. a score of 8.5)

3. Summary statistics of income grouped by the age groups

Problem Statement: For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of income grouped by the
age groups. Create a list that contains a numeric value for each response to the categorical
variable.

Categorical Variable:
Genre Quantitative
Variable : Age Syntax:
df.groupby(['Genre'])['Age'].mean()
Output:

Categorical Variable:
Genre Quantitative
Variable : Income
Syntax:
df_u=df.rename(columns= {'Annual
Income k$)':'Income'},inplace=False)

(df_u.groupby(['Genre']).Income.mean
()) Output:

To create a list that contains a numeric value for each response to the categorical variable.
from sklearn import
preprocessing enc =
preprocessing.OneHotEncoder()
enc_df =
pd.DataFrame(enc.fit_transform(df[['Genre']]).toarray())
enc_df

To concat numerical list to

dataframe df_encode
=df_u.join(enc_df)
df_encode
4. Display basic statistical details on the iris
dataset. Algorithm:
1. Import Pandas Library
2. The dataset is downloaded from UCI repository.
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
3. Assign Column names
col_names =
['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']

4. Load Iris.csv into a Pandas data frame

iris = pd.read_csv(csv_url, names = col_names)
5. Load all rows with Iris-setosa species in variable irisSet
irisSet = (iris['Species']== 'Iris-setosa')

6. To display basic statistical details like percentile, mean,standard deviation etc.

forIris- setosa use describe

print('Iris-setosa')

print(iris[irisSet].describe())

7. Load all rows with Iris-versicolor species in variable irisVer

irisVer = (iris['Species']== 'Iris-versicolor')

8. To display basic statistical details like percentile, mean,standard deviation etc.

forIris- versicolor use describe

print('Iris-versicolor')

print(iris[irisVer].describe())

9. Load all rows with Iris-virginica species in variable irisVir

irisVir = (iris['Species']== 'Iris-virginica')

10. To display basic statistical details like percentile, mean,standard deviation etc.
forIris- virginica use describe

print('Iris-virginica')

print(iris[irisVir].describe())
Conclusion:

Descriptive statistics summarises or describes the characteristics of a data set.

Descriptive statistics consists of two basic categories of measures:

● measures of central tendency and

● measures of variability (or spread).

Measures of central tendency describe the centre of a data set. It includes the
mean, median, and mode.

Measures of variability or spread describe the dispersion of data within the set
and it includes standard deviation, variance, minimum and maximum variables.

Assignment Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables. Explain with examples.
3. Which method is used to statistic the dataframe? write the code.

.
Assignment No: 4
Title Data Analytics I

Problem Create a Linear Regression Model using Python/R to predict home prices using
statement Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
The Boston Housing dataset contains information about various houses in
Boston through different parameters. There are 506 samples and 14 feature
variables in this dataset.
The objective is to predict the value of prices of the house using the given
features.

Prerequisite 1. Basic of Python Programming

2. Concept of Data Regression
CO mapped 4
Programmi Python,, Jupyter
ng Tools

Theory & Procedure:

Title : Create a Linear Regression Model using Python/R to predict home prices
using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston
Housing dataset contains information about various houses in Boston through different
parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.

Objective : Students should be able to data analysis using liner regressionusing Python
for any open source dataset

Outcome: Students are able to perform data analysis.

Prerequisite: 1. Basic of Python Programming

2.Concept of Regresion.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

1. Linear Regression: It is a machine learning algorithm based on supervised learning.
It targets prediction values on the basis of independent variables.
● It is preferred to find out the relationship between forecasting and variables.
● A linear relationship between a dependent variable (X) is continuous; while
independent variable(Y) relationship may be continuous or discrete. A linear
relationship should be available in between predictor and target variable so
known as Linear Regression.
● Linear regression is popular because the cost function is Mean Squared Error
(MSE) which is equal to the average squared difference between an
observation’s actual and predicted values.
● It is shown as an equation of
line like : Y = m*X + b + e
Where : b is intercepted, m is slope of the line and e is error term.
This equation can be used to predict the value of target variable Y based on given
predictor variable(s) X, as shown in Fig. 1.

Fig. 1: geometry of linear regression

● Fig. 2 shown below is about the relation between weight (in Kg) and height
(in cm), a linear relation. It is an approach of studying in a statistical manner
to summarise and learn the relationships among continuous (quantitative)
variables.
● Here a variable, denoted by ‘x’ is considered as the predictor, explanatory, or
independent variable.

● Another variable, denoted ‘y’, is considered as the response, outcome, or

dependent variable. While "predictor" and "response" used to refer to these
variables.
● Simple linear regression technique concerned with the study of only one
predictor variable.

Fig.2 : Relation between weight (in Kg) and height (in cm)

MultiVariate Regression :It concerns the study of two or more predictor variables.
Usually a transformation of the original features into polynomial features from a
given degree is preferred and further Linear Regression is applied on it.
● A simple linear model Y = a + bX is in original feature will be transformed
into polynomial feature is transformed and further a linear regression applied
to it and it will be something like
Y=a + bX + cX2
● If a high degree value is used in transformation the curve becomes over-fitted
as it captures the noise from data as well.

2. Least Square Method for Linear Regression

● Linear Regression involves establishing linear relationships between dependent and
independent variables. Such a relationship is portrayed in the form of an equation also
known as the linear model.

● A simple linear model is the one which involves only one dependent and one
independent variable. Regression Models are usually denoted in Matrix Notations.
● However, for a simple univariate linear model, it can be denoted by the regression
equation
𝑦=β +β 𝑥 (1)
0 1

where 𝑦 is the dependent or the response variable

𝑥 is the independent or the input variable
β is the value of y when x=0 or the y intercept
0

β is the value of slope of the line ε is the error or the noise

● This linear equation represents a line also known as the ‘regression line’. The least
square estimation technique is one of the basic techniques used to guess the values of
the parameters and based on a sample set.
● This technique estimates parameters β and by trying to minimise the square
1

and β
0

of errors at all the points in the sample set. The error is the deviation of the actual sample
● data point from the regression line. The technique can be represented by the equation.
𝑛
2
𝑚𝑖𝑛 ∑ (𝑦 − 𝑦) (2)
𝑖=0
and β such
1

Using differential calculus on equation 1 we can find the values of β

that the sum of squares (that is equation 2) is minimum.

2
𝑛 𝑛
β = ∑ (𝑥 − 𝑥 ) (𝑦 − 𝑦 )/ ∑ 𝑥( ) (3)
𝑖 𝑖−
𝑖=1

β =𝑦−β 𝑥 (
0 1
4
)
Once the Linear Model is estimated using equations (3) and (4), we can estimate the
value of the dependent variable in the given range only. Going outside the range is
called extrapolation which is inaccurate if simple regression techniques are used.
3. Measuring Performance of Linear
Regression Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive
model created based on the given set of observations in the sample. Two or more
regression models created using a given sample data can be compared based on their
MSE. The lesser the MSE, the better the regression model is. When the linear
regression model is trained using a given set of observations, the model with the least
mean sum of squares error (MSE) is selected as the best model. The Python or R
packages select the best-fit model as the model with the lowest MSE or lowest RMSE
when training the linear regression models.
Mathematically, the MSE can be calculated as the average sum of the squared
difference between the actual value and the predicted or estimated value represented
by the regression model (line or plane).

An MSE of zero (0) represents the fact that the predictor is a perfect
predictor. RMSE:
Root Mean Squared Error method that basically calculates the least-squares error and takes
a root of the summed values.
Mathematically speaking, Root Mean Squared Error is the square root of the sum of all
errors divided by the total number of values. This is the formula to calculate RMSE
RMSE - Least Squares Regression Method -
Edureka R-Squared :

R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total
(SST).
SST : total sum of squares (SST), regression sum of squares (SSR), Sum of square of errors
(SSE) are all showing the variation with different measures.
A value of R-squared closer to 1 would mean that the regression model covers most
part of the variance of the values of the response variable and can be termed as a
good model.
One can alternatively use MSE or R-Squared based on what is appropriate and the need of
the hour. However, the disadvantage of using MSE rather than R-squared is that it will be
difficult to gauge the performance of the model using MSE as the value of MSE can vary
from 0 to any larger number. However, in the case of R-squared, the value is bounded
between 0 and .
4. Example of Linear Regression

Consider following data for 5 students.

Each Xi (i = 1 to 5) represents the score of ith student in standard X and
corresponding Yi (i = 1 to 5) represents the score of ith student in standard XII.
(i) Linear regression equation best predicts standard XIIth score
(ii) Interpretation for the equation of Linear Regression
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?

Student Score in X standard (Xi) Score in XII standard (Yi)

1 95 85
2 85 95
3 80 70
4 70 65
5 60 70
x y 𝑥 −𝑥 𝑦 −𝑦 (𝑥 −𝑥 )2 (𝑥 −𝑥 )(𝑦 − 𝑦 )
95 85 17 8 289 136
85 95 7 18 49 126
80 70 2 -7 4 -14
70 65 -8 -12 64 96
60 70 -18 -7 324 126
𝑥 = 78 𝑦= 77 ε (𝑥 −𝑥 ) = 2
ε (𝑥 −𝑥 )(𝑦 − 𝑦 ) = 470
730

(i) linear regression equation that best predicts standard XIIth score

𝑦=β +β 𝑥
0 1

2
𝑛 𝑛

𝑖 −𝑥 ) − 𝑦 )/ ∑ 𝑥( 𝑥)
𝑖 𝑖−
𝑖=1
β = 470/730 = 0. 644
1

β =𝑦−β 𝑥
0 1

β = 77 − (0. 644 * 78) = 26. 768

𝑦 = 26. 76 + 0. 644 𝑥

(ii) Interpretation of the regression line.

Interpretation 1

For an increase in value of x by 0.644 units there is an increase in value of y in one unit.

Interpretation 2

Even if x = 0 value of independent variable, it is expected that value of y is 26.768

Score in XII standard (Yi) is 0.644 units depending on Score in X standard (Xi) but other
factors will also contribute to the result of XII standard by 26.768 .

(iii) If a student's score is 65 in std X, then his expected score in XII standard is 78.288

For x = 80 the y value will be

𝑦 = 26. 76 + 0. 644 * 65 = 68. 38

5. Training data set and Testing data set

● Machine Learning algorithm has two phases

1. Training and 2. Testing.
● The input of the training phase is training data, which is passed to any machine
learning algorithm and machine learning model is generated as output of the training
phase.
● The input of the testing phase is test data, which is passed to the machine learning
model and prediction is done to observe the correctness of mode.
Fig. 1.3.1 : Training and Testing Phase in Machine Learning
(a) Training Phase
● Training dataset is provided as input to this phase.
● Training dataset is a dataset having attributes and class labels and used for training
Machine Learning algorithms to prepare models.
● Machines can learn when they observe enough relevant data. Using this one can
model algorithms to find relationships, detect patterns, understand complex problems
and make decisions.
● Training error is the error that occurs by applying the model to the same data from
which the model is trained.
● In a simple way the actual output of training data and predicted output of the model
does not match the training error Ein is said to have occurred.
● Training error is much easier to compute.
(b) Testing Phase
● Testing dataset is provided as input to this phase.
● Test dataset is a dataset for which class label is unknown. It is tested using model
● A test dataset used for assessment of the finally chosen model.
● Training and Testing dataset are completely different.
● Testing error is the error that occurs by assessing the model by providing the
unknown data to the model.
● In a simple way the actual output of testing data and predicted output of the model
does not match the testing error Eout is said to have occurred.
● E out is generally observed larger than Ein.
●
(c) Generalization
● Generalization is the prediction of the future based on the past system.
● It needs to generalize beyond the training data to some future data that it might not
have seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never
seen.
● In general, the dataset is divided into two partition training and test sets.
● The fit method is called on the training set to build the model.
● This fit method is applied to the model on the test set to estimate the target value and
evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to
estimate how well the model trained on the training data and how well it would
perform on the unseen data.
Algorithm (Synthesis Dataset):
Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import pandas as
pd import numpy
as np
import matplotlib.pyplot as plt
Step 2: Create a Dataframe with Dependent Variable(x) and independent variable y.
x=np.array([95,85,80,70,60])
y=np.array([85,95,70,65,70])
Step 3 : Create Linear Regression Model using Polyfit Function:
model= np.polyfit(x, y, 1)
Step 4: Observe the coefficients of the model.
model
Output:
array([ 0.64383562, 26.78082192])
Step 5: Predict the Y value for X and observe the output.
predict =
np.poly1d(model)
predict(65)
Output:
68.63
Step 6: Predict the y_pred for all values of x.
y_pred=
predict(x)
y_pred
Output:
array([81.50684932, 87.94520548, 71.84931507, 68.63013699, 71.84931507])

Step 7: Evaluate the performance of Model (R-Suare)

R squared calculation is not implemented in numpy… so that one should be borrowed
from sklearn.
from sklearn.metrics import
r2_score r2_score(y, y_pred)
Output:
0.4803218090889323
Step 8: Plotting the linear regression
model y_line = model[1] +
model[0]* x plt.plot(x,
y_line, c = 'r')
plt.scatter(x, y_pred)
plt.scatter(x,y,c='r')

Output:

Algorithm (Boston Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import numpy as
np import pandas
as pd
import matplotlib.pyplot as plt
Step 2: Import the Boston Housing dataset
from sklearn.datasets import
load_boston boston = load_boston()
Step 3: Initialize the data frame
data = pd.DataFrame(boston.data)
Step 4: Add the feature names to the
dataframe data.columns =
boston.feature_names data.head()
Step 5: Adding target variable to dataframe
data['PRICE'] = boston.target
Step 6: Perform Data Preprocessing( Check for missing values)
data.isnull().sum()
Step 7: Split dependent variable and independent variables
x = data.drop(['PRICE'], axis =
1) y = data['PRICE']
Step 8: splitting data to training and testing dataset.
from sklearn.model_selection import
train_test_split xtrain, xtest, ytrain, ytest =
train_test_split(x, y, test_size =0.2,random_state =
0) Step 9: Use linear regression( Train the Machine ) to Create
Model import sklearn
from sklearn.linear_model import
LinearRegression lm = LinearRegression()
model=lm.fit(xtrain, ytrain)
Step 10: Predict the y_pred for all values of train_x and test_x
ytrain_pred =
lm.predict(xtrain) ytest_pred
= lm.predict(xtest)
Step 11:Evaluate the performance of Model for train_y and test_y
df=pd.DataFrame(ytrain_pred,ytrain)
df=pd.DataFrame(ytest_pred,ytest)
Step 12: Calculate Mean Square Paper for train_y and test_y
from sklearn.metrics import mean_squared_error,
r2_score mse = mean_squared_error(ytest, ytest_pred)
print(mse)
mse =
mean_squared_error(ytrain_pred,ytrain)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytest,
ytest_pred) print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
lt.scatter(ytrain ,ytrain_pred,c='blue',marker='o',label='Training
data') plt.scatter(ytest,ytest_pred
,c='lightgreen',marker='s',label='Test data') plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc= 'upper left')
#plt.hlines(y=0,xmin=0,xmax=50)
plt.plot()
plt.show()

Conclusion:
In this way we have done data analysis using linear regression for Boston Dataset
and predict the price of houses using the features of the Boston Dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .

Student Score in X standard (Xi) Score in XII standard (Yi)

1 95 85
2 85 95
3 80 70
4 70 65
5 60 70

2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston
Dataset. (Consider the linear regression model created in
practical session)
.
Assignment No: 5
Title DataAnalytic , II

Problem
statement 1. Implement logistic regression using Python/R to perform
classification onSocial_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN,
Accuracy, Error rate, Precision,Recall on the given dataset..

Prerequisite 1. Basic of Python Programming

2. Concept of Regression
CO mapped 4,5
Programmi Python, Jupyter
ng Tools

Theory & Procedure:

Title : 1.Implement logistic regression using Python/R to perform classification

on Social_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision, Recall on the given dataset..

Objective: Students should be able to data analysis using logisticregression using Python for
any open source dataset

Outcomes: Students are able to perform data analysis.

Prerequisite:
1. Basic of Python Programming
2.Concept of Regression.
Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

1. Logistic Regression: Classification techniques are an essential part of machine learning
and data mining applications. Approximately 70% of problems in Data Science are
classification problems. There are lots of classification problems that are available, but
logistic regression is common and is a useful regression method for solving the binary
classification problem. Another category of classification is Multinomial classification,
which handles the issues where multiple classes are present in the target variable. For
example, the IRIS dataset is a very famous example of multi-class classification. Other
examples are classifying article/blog/document categories.

Logistic Regression can be used for various classification problems such as spam
detection. Diabetes prediction, if a given customer will purchase a particular product or
will they churn another competitor, whether the user will click on a given advertisement
link or not, and many more examples are in the bucket.

Logistic Regression is one of the most simple and commonly used Machine Learning
algorithms for two-class classification. It is easy to implement and can be used as the
baseline for any binary classification problem. Its basic fundamental concepts are also
constructive in deep learning. Logistic regression describes and estimates the relationship
between one dependent binary variable and independent variables.

Logistic regression is a statistical method for predicting binary classes. The outcome or

target variable is dichotomous in nature. Dichotomous means there are only two possible

classes. For example, it can be used for cancer detection problems. It computes the

probability of an event occurring.

It is a special case of linear regression where the target variable is categorical in nature. It

uses a log of odds as the dependent variable. Logistic Regression predicts the probability

of occurrence of a binary event utilising a logit function.

Linear Regression Equation:

Where, y is a dependent variable and x1, x2 ... and Xn are explanatory variables.
Sigmoid Function:

Apply Sigmoid function on linear regression:

2. Differentiate between Linear and Logistic Regression

Linear regression gives you a continuous output, but logistic regression provides a constant
output. An example of the continuous output is house price and stock price. Example's of the
discrete output is predicting whether a patient has cancer or not, predicting whether the customer
will churn. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic
regression is estimated using Maximum Likelihood Estimation (MLE) approach.
3. Sigmoid Function
The sigmoid function, also called logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0. If
the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and
if it is less than 0.5, we can classify it as 0 or NO. The outputcannotFor example: If the output is
0.75, we can say in terms of probability as: There is a 75 percent chance that a patient will suffer
from cancer.

4. Types of LogisticRegression
Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories
such as predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories such
as restaurant or product rating from 1 to 5.
5. Confusion Matrix Evaluation Metrics
Contingency table or Confusion matrix is often used to measure the performance of classifiers. A
confusion matrix contains information about actual and predicted classifications done by a
classification system. Performance of such systems is commonly evaluated using the data in the
matrix.

The following table shows the confusion matrix for a two class classifier.

Here each row indicates the actual classes recorded in the test data set and the each column indicates the
classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal
concerns prediction errors.

Some Important measures derived from confusion matrix are:

● Number of positive (Pos) : Total number instances which are labelled as positive in a given
dataset.
● Number of negative (Neg) : Total number instances which are labelled as negative in a
given dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as
positive and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as
negative and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as
negative and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as
positive and the class predicted by the classifier is negative.
● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by
total number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true
positive and true negative (TP + TN) divided by the total number of instances.

𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁
𝑎𝑐𝑐 = = 𝑃𝑜𝑠+𝑁𝑒𝑔
𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁

● Error Rate: Error Rate is calculated as the number of incorrectly classified instances
divided by total number of instances.
The ideal value of accuracy is 0, and the worst is 1. It is also calculated as the sum of false
positive and false negative (FP + FN) divided by the total number of instances.
𝐹𝑃+𝐹𝑁 𝐹𝑃+𝐹𝑁
𝑒𝑟𝑟 = = Or
𝑃𝑜𝑠+𝑁𝑒𝑔
𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁

𝑒𝑟𝑟 = 1 − 𝑎𝑐𝑐
● Precision: It is calculated as the number of correctly classified positive instances divided by
the total number of instances which are predicted positive. It is also called confidence value.
The ideal value is 1, whereas the worst is 0.

𝑇𝑃
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 =
𝑃+𝐹

● Recall: .It is calculated as the number of correctly classified positive instances divided by the
total number of positive instances. It is also called recall or sensitivity. The ideal value of
sensitivity is 1, whereas the worst is 0.

It is calculated as the number of correctly classified positive instances divided by the total
number of positive instances.

𝑇𝑃
𝑟𝑒𝑐𝑎𝑙𝑙 =
𝑃+𝐹
Algorithm (Boston Dataset):
Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
Step 2: Import the Social_Media_Adv Dataset
Step 3: Initialize the data frame
Step 4: Perform Data
Preprocessing
● Convert Categorical to Numerical Values if applicable
● Check for Null Value
● Covariance Matrix to select the most promising features
● Divide the dataset into Independent(X)
and Dependent(Y)variables.
● Split the dataset into training and testing datasets
● Scale the Features if necessary.

Step 5: Use Logistic regression( Train the Machine ) to Create Model

# import the class
from sklearn.linear_model import
LogisticRegression # instantiate the model (using
the default parameters)
logreg =
LogisticRegression() # fit
the model with data
logreg.fit(xtrain,ytrain)
#
y_pred=logreg.predict(xtest
)
Step 6: Predict the y_pred for all values of train_x and test_x

Step 7:Evaluate the performance of Model for train_y and test_y

Step 8: Calculate the required evaluation parameters

from sklearn.metrics import

precision_score,confusion_matrix,accuracy_score,recall_score
cm= confusion_matrix(ytest, y_pred)

Conclusion:
In this way we have done data analysis using logistic regression for Social Media Adv. and
evaluate the performance of model.
Value Addition: Visualizing Confusion Matrix using Heatmap
Assignment Question:
1) Consider the binary classification task with two classes positive and
negative. Find out TP,TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall

2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4. and Explain everystep
in detail.

.
Assignment No: 6
Title Data Analytics III

Problem
statement 1. Implement Simple Naïve Bayes classification algorithm
using Python/R on iris.csvdataset.
2. Compute Confusion matrix to find TP, FP, TN, FN,
Accuracy, Error rate, Precision,Recall on the given dataset.

Prerequisite 1. Basic of Python Programming

2. Concept of Data Regression
CO mapped 4
Programmi Python, Jupyter
ng Tools

Theory & procedure:

Title : 1.Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2.Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate,
Precision, Recall on the given dataset.

Objective : Students should be able to data analysis using Naïve Bayesclassification

algorithm using Python for any open source dataset

Outcomes: Students are able to perform data analysis.

Prerequisite:
1. Basic of Python Programming
2.Concept of Join and Marginal Probability.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

1. Concepts used in Naïve Bayes classifier

● Naïve Bayes Classifier can be used for Classification of categorical data.

○ Let there be a ‘j’ number of classes. C={1,2,….j}

○ Let, input observation is specified by ‘P’ features. Therefore input observation
x is given , x = {F1,F2,…..Fp}
○ The Naïve Bayes classifier depends on Bayes' rule from probability theory.
● Prior probabilities: Probabilities which are calculated for some event based on no
other information are called Prior probabilities.

For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A),
occurrences of event B or C are not concerned i.e. no information about occurrence of
any other event is used.

Conditional Probabilities:

From equation (1) and (2) ,

Is called the Bayes Rule.

2. Example of Naive Bayes

We have a dataset with some features Outlook, Temp, Humidity, and Windy, and the
target here is to predict whether a person or team will play tennis or not.
Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If I
try to write the same formula in terms of classes and features, we will get the following

equation

Now we have two classes and four features, so if we write this formula for class C1, it will be

something like

this.

Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. You might

have a question, It’s because we are taking the situation when all these features are present at

the same time.

The Naive Bayes algorithm assumes that all the features are independent of each other or in

other words all the features are unrelated. With that assumption, we can further simplify the

above
formula and write it in this form

This is the final equation of the Naive Bayes and we have to calculate the probability of both

C1 and C2.For this particular example.

P (N0 | Today) > P (Yes | Today) So, the prediction that golf would be played is ‘No’.

Algorithm (Iris Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and
Matplotlib Step 2: Import the Iris dataset by calling URL.
Step 3: Initialize the data frame
Step 4: Perform Data
Preprocessing
● Convert Categorical to Numerical Values if applicable
● Check for Null Value
● Divide the dataset into Independent(X)
and Dependent(Y)variables.
● Split the dataset into training and testing datasets
● Scale the Features if necessary.

Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
# import the class
from sklearn.naive_bayes import
GaussianNB gaussian = GaussianNB()
gaussian.fit(X_train, y_train)

Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)

Step 7:Evaluate the performance of Model for train_y and test_y

accuracy = accuracy_score(y_test,Y_pred)
precision =precision_score(y_test,
Y_pred,average='micro') recall = recall_score(y_test,
Y_pred,average='micro')

Step 8: Calculate the required evaluation parameters

from sklearn.metrics import
precision_score,confusion_matrix,accuracy_score,recall_sco
re cm = confusion_matrix(y_test, Y_pred)

Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.
Value Addition: Visualising Confusion Matrix using Heatmap

Assignment Question:
1) Consider the observation for the car theft scenario having 3 attributes colour, Type
and origin.
Find the probability of car theft having scenarios Red SUV and Domestic.
2) Write python code for the preprocessing mentioned in step 4. and Explain every step
in detail.

.
Assignment No: 7

Title Text Analytics

Problem statement
1. Extract Sample document and apply following document
preprocessing methods: Tokenization, POS Tagging, stop words
removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term
Frequency and Inverse DocumentFrequency.
Prerequisite 1. Basic of Python Programming
2. Basic of English language.
CO mapped 3,4
Programming Tools Python

Title of the Assignment: Text Analytics

1. Extract Sample document and apply following document preprocessing
methods: Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse
Document Frequency.

Objective of the Assignment: Students should be able to perform Text Analysis using
TF IDF Algorithm

Outcome: Students are able to perform preprocessing the documents

Prerequisite:

1. Basic of Python Programming

2. Basic of English language.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter

Theory:
1. Basic concepts of Text Analytics
One of the most frequent types of day-to-day conversion is text communication. In
our everyday routine, we chat, message, tweet, share status, email, create blogs, and
offer
opinions and criticism. All of these actions lead to a substantial amount of
unstructured text being produced. It is critical to examine huge amounts of data in this
sector of the online world and social media to determine people's opinions.

Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text Mining processes the text itself, while
NLP processes with the underlying metadata. Finding frequency counts of words,
length of the sentence, presence/absence of specific words is known as text mining.
Natural language processing is one of the components of text mining. NLP helps
identify sentiment, finding entities in the sentence, and category of blog/article. Text
mining is preprocessed data for text analytics. In Text Analytics, statistical and
machine learning algorithms are used to classify information.
2. Text Analysis Operations using natural language toolkit

NLTK(natural language toolkit) is a leading platform for building Python programs to

work with human language data. It provides easy-to-use interfaces and lexical
resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning and
many more.
Analysing movie reviews is one of the classic examples to demonstrate a simple NLP
Bag-of-words model, on movie reviews.
2.1. Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a
text paragraph into smaller chunks such as words or sentences is called
Tokenization. Token is a single entity that is the building blocks for a sentence
or paragraph.
● Sentence tokenization : split a paragraph into list of sentences using

sent_tokenize() method
● Word tokenization : split a sentence into list of words using word_tokenize()

method
2.2. Stop words removal
Stopwords considered as noise in the text. Text may contain stop words such
as is, am, are, this, a, an, the, etc. In NLTK for removing stopwords, you need
to create a list of stopwords and filter out your list of tokens from these words.
2.3. Stemming and Lemmatization
Stemming is a normalization technique where lists of tokenized words are
converted into shortened root words to remove redundancy. Stemming is the
process of reducing inflected (or sometimes derived) words to their word
stem, base or root form.
A computer program that stems word may be called a
stemmer. E.g.
A stemmer reduces the words like fishing, fished, and fisher to the stem fish.
The stem need not be a word, for example the Porter algorithm reduces, argue,
argued, argues, arguing, and argus to the stem argu .
Lemmatization in NLTK is the algorithmic process of finding the lemma of a
word depending on its meaning and context. Lemmatization usually refers to
the morphological analysis of words, which aims to remove inflectional
endings. It helps in returning the base or dictionary form of a word known as
the lemma.
Eg. Lemma for studies is study

Lemmatization Vs Stemming

Stemming algorithm works by cutting the suffix from the word. In a broader
sense cuts either the beginning or end of the word.
On the contrary, Lemmatization is a more powerful operation, and it takes into
consideration morphological analysis of the words. It returns the lemma which
is the base form of all its inflectional forms. In-depth linguistic knowledge is

required to create dictionaries and look for the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent
operation where the proper form will be looked in the dictionary. Hence,
lemmatization helps in forming better machine learning features.
2.4. POS Tagging
POS (Parts of Speech) tell us about grammatical information of words of the
sentence by assigning specific token (Determiner, noun, adjective , adverb ,
verb,Personal Pronoun etc.) as tag (DT,NN ,JJ,RB,VB,PRP etc) to each words.
Word can have more than one POS depending upon the context where it is
used. We can use POS tags as statistical NLP tasks. It distinguishes a sense of
word which is very helpful in text realization and infer semantic information
from text for sentiment analysis.
3. Text Analysis Model using TF-IDF.
Term frequency–inverse document frequency(TFIDF) , is a numerical statistic
that is intended to reflect how important a word is to a document in a
collection or corpus.
● Term Frequency (TF)
It is a measure of the frequency of a word (w) in a document (d). TF is defined
as the ratio of a word’s occurrence in a document to the total number of words
in a document. The denominator term in the formula is to normalize since all
the corpus documents are of different lengths.

Example:

The initial step is to make a vocabulary of unique words and calculate TF for
each document. TF will be more for words that frequently appear in a
document and less for rare words in a document.

● Inverse Document Frequency (IDF)

It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as’ of’, ‘and’, etc. can be
most frequently present but are of little significance. IDF provides weightage
to each word based on its frequency in the corpus D.
In our example, since we have two documents in the corpus, N=2.

● Term Frequency — Inverse Document Frequency (TFIDF)

It is the product of TF and IDF.
TFIDF gives more weightage to the word that is rare in the corpus (all the
documents). TFIDF provides more importance to the word that is more frequent in
the document.
After applying TFIDF, text in A and B documents can be represented as a TFIDF
vector of dimension equal to the vocabulary words. The value corresponding to each
word represents the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln
can result in high IDF for some words, thereby dominating the TFIDF. We don’t want that,
and therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text
must be converted into vectors of numbers. In natural language processing, a
common technique for extracting features from text is to place all of the words that
occur in the text in a

bucket. This approach is called a bag of words model or BoW for short. It’s referred
to as a “bag” of words because any information about the structure of the sentence is
lost.

Algorithm for Tokenization, POS Tagging, stop words removal, Stemming

and Lemmatization:
Step 1: Download the required packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagge
r')
Step 2: Initialize the text
text= "Tokenization is the first step in text analytics. The
process of breaking down a text paragraph into smaller chunks
such as words or sentences is called Tokenization."

Step 3: Perform Tokenization

#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text=
sent_tokenize(text)
print(tokenized_text)
#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

Step 4: Removing Punctuations and Stop Word

# print stop words of English
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english
")) print(stop_words)

text= "How to remove stop words with NLTK library in Python?"

text= re.sub('[^a-zA-Z]', ' ',text)
tokens =
word_tokenize(text.lower())
filtered_text=[]
for w in tokens:
if w not in stop_words:
filtered_text.append(w)
print("Tokenized Sentence:",tokens)
print("Filterd Sentence:",filtered_text)
Step 5 : Perform Stemming
from nltk.stem import PorterStemmer
e_words= ["wait", "waiting", "waited",
"waits"] ps =PorterStemmer()
for w in e_words:
rootWord=ps.stem(w)
print(rootWord)

Step 6: Perform Lemmatization

from nltk.stem import
WordNetLemmatizer wordnet_lemmatizer =
WordNetLemmatizer() text = "studies
studying cries cry" tokenization =
nltk.word_tokenize(text) for w in
tokenization:
print("Lemma for {} is {}".format(w,
wordnet_lemmatizer.lemmatize(w)))

Step 7: Apply POS Tagging to text

import nltk
from nltk.tokenize import word_tokenize
data="The pink sweater fit her perfectly"
words=word_tokenize(data)
for word in words:
print(nltk.pos_tag([word]))
Algorithm for Create representation of document by calculating
TFIDF Step 1: Import the necessary libraries.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Step 2: Initialize the Documents.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the
Sun'
Step 3: Create BagofWords (BoW) for Document A and B.
bagOfWordsA = documentA.split('
') bagOfWordsB =
documentB.split(' ')
Step 4: Create Collection of Unique words from Document A and B.
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
Step 5: Create a dictionary of words and their occurrence for each document in the
corpus
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
numOfWordsA[word]
+= 1
numOfWordsB = dict.fromkeys(uniqueWords,
0) for word in bagOfWordsB:
numOfWordsB[word] += 1
Step 6: Compute the term frequency for each of our documents.
def computeTF(wordDict,
bagOfWords): tfDict = {}
bagOfWordsCount = len(bagOfWords)
for word, count in
wordDict.items():
tfDict[word] = count / float(bagOfWordsCount)
return tfDict
tfA = computeTF(numOfWordsA,
bagOfWordsA) tfB =
computeTF(numOfWordsB, bagOfWordsB)

Step 7: Compute the term Inverse Document Frequency.

def
computeIDF(documents):
import math
N = len(documents)
idfDict = dict.fromkeys(documents[0].keys(),
0) for document in documents:
for word, val in
document.items(): if val >
0:
idfDict[word] += 1

for word, val in idfDict.items():

idfDict[word] = math.log(N /
float(val))
return idfDict
idfs = computeIDF([numOfWordsA,
numOfWordsB]) idfs

Step 8: Compute the term TF/IDF for all words.

def computeTFIDF(tfBagOfWords, idfs):
tfidf = {}
for word, val in
tfBagOfWords.items():
tfidf[word] = val * idfs[word]
return tfidf
tfidfA = computeTFIDF(tfA,
idfs) tfidfB =
computeTFIDF(tfB, idfs)
df = pd.DataFrame([tfidfA,
tfidfB]) df

Conclusion:
In this way we have done text data analysis using TF IDF algorithm.

Assignment Question:
1) Perform Stemming for text = "studies studying cries cry". Compare
the results generated with Lemmatization. Comment on your answer how Stemming
and Lemmatization differ from each other.

2) Write Python code for removing stop words from the below documents, conver the
documents into lowercase and calculate the TF, IDF and TFIDF score for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the
Sun'
Assignment No: 8

Title Data Visualization I

Problem 1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains
statement
information about the passengers who boarded the unfortunate Titanic ship.
Use the Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for
each passenger is distributed by plotting a histogram.
Prerequisite
1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
CO mapped 5
Programmi Python, Jupyter
ng Tools

Theory & Procedure:

Title of the Assignment: Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library to see
if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.

Objective of the Assignment: Students should be able to perform the data visualization using
Python on any open source dataset
Outcome: Students are able to perform the Data Visualization.

Prerequisite: 1. Basic of Python Programming

2.Seaborn Library, Concept of Data Visualization.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

Theory:
Seaborn which is another extremely useful library for data visualization in Python. The Seaborn
library is built on top of Matplotlib and offers many advanced data visualization capabilities.

Though, the Seaborn library can be used to draw a variety of charts such as matrix plots, grid plots,
regression plots etc., Seaborn library can be used to draw distributional and categorial plots. To
draw regression plots, matrix plots, and grid plots, Seaborn library need to download.

Downloading the Seaborn Library

Seaborn library can be installed by using pip installer by executing following

command: pip install seaborn

Alternatively, Anaconda distribution of Python can be used, following command is executed to

download the seaborn library:

conda install seaborn

The Dataset
Titanic dataset is used, which is downloaded by default with the Seaborn library. The
load_dataset function is used to load the dataset and pass the name of the dataset.

Execute the following script:

import pandas

as pd import

numpy as np

import matplotlib.pyplot

as plt import seaborn as

sns

dataset =

sns.load_dataset('titanic')

dataset.head()

The script above loads the Titanic dataset and displays the first five rows of the dataset using the
head function. The output looks like this:

The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. Seaborn library is used to find any patterns in the data.
Distributional Plots

Distributional plots, as the name suggests are type of plots that show the statistical distribution
of data.

The Dist Plot

The distplot() shows the histogram distribution of data for a single column. The column name is
passed as a parameter to the distplot() function. To check how the price of the ticket for each
passenger is distributed, execute the following script:

sns.distplot(dataset['fare'])

Output:

There is no line for the kernel density estimation on the plot. To pass the value for the bins
parameter in order to find more or less details in the graph, execute the following script:

sns.distplot(dataset['fare'], kde=False, bins=10)

By setting the number of bins to 10, data distributed in 10 bins as shown in the following output:

Output:
In the output, there are more than 700 passengers, the ticket price is between 0 and 50.

The Joint Plot

The jointplot()is used to display the mutual distribution of each column. There is need to pass
three parameters to jointplot. The first parameter is the column name which display the
distribution of data on x-axis. The second parameter is the column name which display the
distribution of data on y-axis. Finally, the third parameter is the name of the data frame. Plot a
joint plot of age and fare columns to see if there is any relationship between the two.

sns.jointplot(x='age', y='fare', data=dataset)

Output:

From the output, a joint plot has three parts. A distribution plot at the top for the column on the
x- axis, a distribution plot on the right for the column on the y-axis and a scatter plot in between
that shows the mutual distribution of data for both the columns. There is no correlation observed
between prices and the fares.

To change the type of the joint plot by passing a value for the kind parameter. The distribution of
data can be displayed in the form of a hexagonal plot, by passing the value hex for the kind
parameter.

sns.jointplot(x='age', y='fare', data=dataset, kind='hex')

Output:

In the hexagonal plot, the hexagon with most number of points gets darker color. From the
hexagonal plot, most of the passengers are between age 20 and 30 and most of them paid
between 10-50 for the tickets.

The Pair Plot

The pairplot() is a type of distribution plot that basically plots a joint plot for all the possible
combination of numeric and Boolean columns in dataset. The name of your dataset need to pass
as the parameter to the pairplot() function as shown below:

sns.pairplot(dataset)

A snapshot of the portion of the output is shown below:

Note: Before executing the script above, remove all null values from the dataset using the
following command:

dataset = dataset.dropna()

From the output of the pair plot , It is clear that joint plots for all the numeric and Boolean
columns in the Titanic dataset.

To add information from the categorical column to the pair plot, The name of the categorical
column have to pass to the hue parameter. For instance to plot the gender information on the pair
plot, execute the following script:

sns.pairplot(dataset, hue='sex')

Output:
In the output, the information about the males in orange and the information about the female in
blue (as shown in the legend). From the joint plot on the top left, it is clear that among the
surviving passengers, the majority were female.

The Rug Plot

The rugplot() is used to draw small bars along x-axis for each point in the dataset. To plot a rug
plot, pass the name of the column. Plot a rug plot for fare.

sns.rugplot(dataset['fare'])

Output:
From the output, it is clear that as was the case with the distplot(), most of the instances for the
fares have values between 0 and 100.

These are some of the most commonly used distribution plots offered by the Python's Seaborn
Library. Some of categorical plots in the Seaborn library as follows

Categorical Plots

Categorical plots, as the name suggests are normally used to plot categorical data. The
categorical plots plot the values in the categorical column against another categorical column or
a numeric column. Most commonly used categorical data as follows:

The Bar Plot

The barplot() is used to display the mean value for each value in a categorical column, against a
numeric column. The first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset. To find the mean value of the age of the
male and female passengers, use the bar plot as follows.

sns.barplot(x='sex', y='age', data=dataset)

Output:

From the output, the average age of male passengers is just less than 40 while the average age of
female passengers is around 33.

To find the average, the bar plot can also be used to calculate other aggregate values for each
category. Pass the aggregate function to the estimator. To calculate the standard deviation for the
age of each gender as follows:

import numpy as np

import matplotlib.pyplot

as plt import seaborn as

sns

sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)

Notice, in the above script, the std aggregate function used from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output:

The Count Plot

The count plot is similar to the bar plot, It displays the count of the categories in a specific
column. To count the number of males and women passenger,use count plot as follows:

sns.countplot(x='sex',

data=dataset) The output shows the

count as follows:

Output:

The Box Plot

The box plot is used to display the distribution of the categorical data in the form of quartiles.
The center of the box shows the median value. The value from the lower whisker to the bottom
of the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and
finally from the top of the box to the top whisker lies the last quartile.

To plot a box plot that displays the distribution for the age with respect to each gender,to pass the
categorical column as the first parameter (which is sex) and the numeric column (age) as the
second parameter. The dataset is passed as the third parameter.

sns.boxplot(x='sex', y='age', data=dataset)

Output:

From the above plot,the first quartile starts at around 5 and ends at 22 which means that 25% of
the passengers are aged between 5 and 25. The second quartile starts at around 23 and ends at
around 32 which means that 25% of the passengers are aged between 23 and 32. Similarly, the
third quartile starts and ends between 34 and 42, hence 25% passengers are aged within this
range and finally the fourth or last quartile starts at 43 and ends around 65.

If there are any outliers or the passengers that do not belong to any of the quartiles, they are
called outliers and are represented by dots on the box plot.

To see the box plots of forage of passengers of both genders, along with the information about
whether or not they survived, pass the survived as value to the hue parameter as shown below:

sns.boxplot(x='sex', y='age', data=dataset, hue="survived")

Output:
In addition to the information about the age of each gender, distribution of the passengers who
survived is also displayed. For instance, it is seen that among the male passengers, on average
more younger people survived as compared to the older ones. Similarly, it is observed that the
variation among the age of female passengers who did not survive is much greater than the age
of the surviving female passengers.

The Violin Plot

The violin plot is similar to the box plot, however, the violin plot allows us to display all the
components that actually correspond to the data point. The violinplot() function is used to plot
the violin plot. Like the box plot, the first parameter is the categorical column, the second
parameter is the numeric column while the third parameter is the dataset.

Let's plot a violin plot that displays the distribution for the age with respect to each

gender. sns.violinplot(x='sex', y='age', data=dataset)

Output:

To see from the figure above that violin plots provide much more information about the data as
compared to the box plot. Instead of plotting the quartile, the violin plot allows us to see all the
components that actually correspond to the data. The area where the violin plot is thicker has a
higher number of instances for the age. For instance, from the violin plot for males, it is clearly
evident that the number of passengers with age between 20 and 40 is higher than all the rest of
the age brackets.
Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter as shown below:

sns.violinplot(x='sex', y='age', data=dataset, hue='survived')

Now to find a lot of information on the violin plot. For instance, if you look at the bottom of the
violin plot for the males who survived (left-orange), you can see that it is thicker than the bottom
of the violin plot for the males who didn't survive (left-blue). This means that the number of
young male passengers who survived is greater than the number of young male passengers who
did not survive. The violin plots convey a lot of information, however, on the downside; it takes
a bit of time and effort to understand the violin plots.

Instead of plotting two different graphs for the passengers who survived and those who did not,
you can have one violin plot divided into two halves, where one half represents surviving while
the other half represents the non-surviving passengers. To do so, you need to pass True as value
for the split parameter of the violinplot() function. Let's see how we can do this:

sns.violinplot(x='sex', y='age', data=dataset, hue='survived',

split=True) The output looks like this:

Now it can clearly observed the comparison between the age of the passengers who survived and
who did not for both males and females.

Both violin and box plots can be extremely useful. However, as a rule of thumb if you are
presenting your data to a non-technical audience, box plots should be preferred since they are
easy
to comprehend. On the other hand, if you are presenting your results to the research community
it is more convenient to use violin plot to save space and to convey more information in less
time.

The Strip Plot

The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter
plots in the joint plot and the pair plot sections where we had two numeric variables. The strip
plot is different in a way that one of the variables is categorical in this case, and for each
category in the categorical variable, you will see scatter plot with respect to the numeric column.

The stripplot() function is used to plot the violin plot. Like the box plot, the first parameter is the
categorical column, the second parameter is the numeric column while the third parameter is the
dataset. Look at the following script:

sns.stripplot(x='sex', y='age', data=dataset)

Output:

You can see the scattered plots of age for both males and females. The data points look like
strips. It is difficult to comprehend the distribution of data in this form. To better comprehend the
data, pass True for the jitter parameter which adds some random noise to the data. Look at the
following script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True)

Output:
Now better view for the distribution of age across the genders can be observed.

Like violin and box plots, you can add an additional categorical column to strip plot using hue
parameter as shown below:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived')

Again you can see there are more points for the males who survived near the bottom of the plot
compared to those who did not survive.

Like violin plots, we can also split the strip plots. Execute the following script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived', split=True)

Output:

Now you can clearly see the difference in the distribution for the age of both male and female passengers
who survived and those who did not survive.

The Swarm Plot

The swarm plot is a combination of the strip and the violin plots. In the swarm plots, the points
are adjusted in such a way that they don't overlap. Let's plot a swarm plot for the distribution of
age
against gender. The swarmplot() function is used to plot the violin plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the numeric column while the
third parameter is the dataset. Look at the following script:

sns.swarmplot(x='sex', y='age', data=dataset)

You can clearly see that the above plot contains scattered data points like the strip plot and the
data points are not overlapping. Rather they are arranged to give a view similar to that of a violin
plot.

Let's add another categorical column to the swarm plot using the hue

parameter. sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')

Output:

From the output, it is evident that the ratio of surviving males is less than the ratio of surviving
females. Since for the male plot, there are more blue points and less orange points. On the other
hand, for females, there are more orange points (surviving) than the blue points (not surviving).
Another observation is that amongst males of age less than 10, more passengers survived as
compared to those who didn't.

We can also split swarm plots as we did in the case of strip and box plots. Execute the following
script to do so:

sns.swarmplot(x='sex', y='age', data=dataset, hue='survived', split=True)

Output:

Now you can clearly see that more women survived, as compared to men.

Combining Swarm and Violin Plots

Swarm plots are not recommended if you have a huge dataset since they do not scale well
because they have to plot each data point. If you really like swarm plots, a better way is to
combine two plots. For instance, to combine a violin plot with swarm plot, you need to execute
the following script:

sns.violinplot(x='sex', y='age', data=dataset)

sns.swarmplot(x='sex', y='age', data=dataset,

color='black') Output:

Conclusion:

Seaborn is an advanced data visualization library built on top of Matplotlib library. In this
assignment, we have explored distributional and categorical plots using Seaborn library.
Assignment Question:

1. List out different types of plot to find patterns of data

2. Explain when you will use distribution plots and when you will use categorical plots.

3. Write the conclusion from the following swarm plot (consider titanic dataset).

4. Which parameter is used to add another categorical variable to the violin plot, Explain
with syntax and example.
Assignment No: 9

Title Data Visualization II

Problem 1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box
statement plot for distribution of age with respect to each gender along with the information
about whether they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
Prerequisite 1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
CO mapped 4,5
Programmi Python, Jupyter
ng Tools

Theory & Procedure:

Title of the Assignment: Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about
whether they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.

Objective: Students should be able to perform the data visualization using Python
on any open source dataset.

Outcome: Students are able to perform the Data Formatting and Data Normalization.
Prerequisites: 1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Python, Jupyter.

Exploratory Data Analysis

There are various techniques to understand the data, And the basic need is the knowledge
of Numpy for mathematical operations and Pandas for data manipulation. Titanic dataset is
used. For demonstrating some of the techniques, use an inbuilt dataset of seaborn as tips
data which explains the tips each waiter gets from different customers.

Import libraries and loading

Data import numpy as np

import pandas pd

import matplotlib.pyplot

as plt import seaborn as

sns

from seaborn import

load_dataset #titanic dataset

data =

pd.read_csv("titanic_train.csv")

#tips dataset

tips = load_dataset("tips")

Univariate Analysis

Univariate analysis is the simplest form of analysis where we explore a single variable.
Univariate analysis is performed to describe the data in a better way. we perform Univariate
analysis of Numerical and categorical variables differently because plotting uses different
plots.

Categorical Data:

A variable that has text-based information is referred to as categorical variables. Now

following are various plots which we can use for visualizing Categorical data.

1) CountPlot:

Countplot is basically a count of frequency plot in form of a bar graph. It plots the count of
each category in a separate bar. When we use the pandas’ value counts function on any
column. It is the same visual form of the value counts function. In our data-target variable
is survived and it is categorical so plot a countplot of this.

sns.countplot(data['Survived'])

plt.show()

OUTPUT:
2) Pie Chart:

The pie chart is also the same as the countplot, only gives us additional information about
the percentage presence of each category in data means which category is getting how
much weightage in data. Now we check about the Sex column, what is a percentage of
Male and Female members traveling.

data['Sex'].value_counts().plot(kind="pie", autopct="%.2f")

plt.show()

OUTPUT:

Numerical Data:

Analyzing Numerical data is important because understanding the distribution of variables

helps to further process the data. Most of the time, we will find much inconsistency with
numerical data so we have to explore numerical variables.

1) Histogram:

A histogram is a value distribution plot of numerical columns. It basically creates bins in

various ranges in values and plots it where we can visualize how values are distributed. We
can have a look where more values lie like in positive, negative, or at the center(mean).
Let’s have
a look at the Age

column.

plt.hist(data['Age'],

bins=5) plt.show()

OUTPUT:

2) Distplot:

Distplot is also known as the second Histogram because it is a slight improvement version
of the Histogram. Distplot gives us a KDE(Kernel Density Estimation) over histogram
which explains PDF(Probability Density Function) which means what is the probability of
each value occurring in this column.

sns.distplot(data['Age'])

plt.show()

OUTPUT:
3) Boxplot:

Boxplot is a very interesting plot that basically plots a 5 number summary. to get 5 number
summary some terms we need to describe.

• Median – Middle value in series after sorting

• Percentile – Gives any number which is number of values present before this
percentile like for example 50 under 25th percentile so it explains total of 50 values are
there below 25th percentile

• Minimum and Maximum – These are not minimum and maximum values, rather
they describe the lower and upper boundary of standard deviation which is calculated
using Interquartile range(IQR).

IQR = Q3 - Q1

Lower_boundary = Q1 - 1.5 *

IQR Upper_bounday = Q3 +

1.5 * IQR

Here Q1 and Q3 is 1st quantile (25th percentile) and 3rd Quantile(75th percentile).

Bivariate/ Multivariate Analysis:

We have study about various plots to explore single categorical and numerical data.
Bivariate Analysis is used when we have to explore the relationship between 2 different
variables and we have to do this because, in the end, our main task is to explore the
relationship between variables to build a powerful model. And when we analyze more than
2 variables together then it is known as Multivariate Analysis. we will work on different
plots for Bivariate as well on
Multivariate Analysis.

Explore the plots when both the variable is numerical.

1) Scatter Plot:

To plot the relationship between two numerical variables scatter plot is a simple plot to do.
Let us see the relationship between the total bill and tip provided using a scatter plot.

sns.scatterplot(tips["total_bill"], tips["tip"])

Multivariate analysis with scatter plot:

We can also plot 3 variable or 4 variable relationships with scatter plot. suppose we
want to find the separate ratio of male and female with total bill and tip provided.

sns.scatterplot(tips["total_bill"], tips["tip"],

hue=tips["sex"]) plt.show()

OUTPUT:
We can also see 4 variable multivariate analyses with scatter plots using style argument.
Suppose along with gender we also want to know whether the customer was a smoker or not
so we can do this.

sns.scatterplot(tips["total_bill"], tips["tip"], hue=tips["sex"],

style=tips['smoker']) plt.show()

OUTPUT:

Numerical and Categorical:

If one variable is numerical and one is categorical then there are various plots that we can
use for Bivariate and Multivariate analysis.

1) Bar Plot:

Bar plot is a simple plot which we can use to plot categorical variable on the x-axis and
numerical variable on y-axis and explore the relationship between both variables. The
blacktip on top of each bar shows the confidence Interval. let us explore P-Class with age.

sns.barplot(data['Pclass'],

data['Age']) plt.show()

OUTPUT:
Multivariate analysis using Bar plot:

Hue’s argument is very useful which helps to analyze more than 2 variables. Now along
with the above relationship we want to see with gender.

sns.barplot(data['Pclass'], data['Fare'], hue =

data["Sex"]) plt.show()

OUTPUT:

2) Boxplot:

We have already study about boxplots in the Univariate analysis above. we can draw a
separate boxplot for both the variable. let us explore gender with age using a boxplot.

sns.boxplot(data['Sex'],

data["Age"]) OUTPUT:
Multivariate analysis with boxplot:

Along with age and gender let’s see who has survived and who has not.

sns.boxplot(data['Sex'], data["Age"], data["Survived"])

plt.sho

w()

OUTP

UT:

3) Distplot:

Distplot explains the PDF function using kernel density estimation. Distplot does not have
a hue parameter but we can create it. Suppose we want to see the probability of people
with an age range that of survival probability and find out whose survival probability is
high to the age range of death ratio.

sns.distplot(data[data['Survived'] == 0]['Age'], hist=False, color="blue")

sns.distplot(data[data['Survived'] == 1]['Age'], hist=False,

color="orange") plt.show()

OUTPUT:
In above graph, the blue one shows the probability of dying and the orange plot shows the
survival probability. If we observe it we can see that children’s survival probability is
higher than death and which is the opposite in the case of aged peoples. This small analysis
tells sometimes some big things about data and it helps while preparing data stories.

Categorical and Categorical:

Now, we will work on categorical and categorical columns.

1) Heatmap:

If you have ever used a crosstab function of pandas then Heatmap is a similar visual
representation of that only. It basically shows that how much presence of one category
concerning another category is present in the dataset. let me show first with crosstab and
then with heatmap.

pd.crosstab(data['Pclass'], data['Survived'])

Now with heatmap, we have to find how many people survived and died.

sns.heatmap(pd.crosstab(data['Pclass'], data['Survived']))
2) Cluster map:

We can also use a cluster map to understand the relationship between two categorical
variables. A cluster map basically plots a dendrogram that shows the categories of similar
behavior together.

sns.clustermap(pd.crosstab(data['Parch'], data['Survived']))

plt.show()

OUTPUT:
Conclusion-

In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values
on Iris Dataset.
Assignment No: 10

Title Data Visualization III

Problem Download the Iris flower dataset or any other dataset into
statement a DataFrame.(e.g.,https://archive.ics.uci.edu/ml/datasets/Iris ).
Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available
in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the
feature distributions.
3. Create a box plot for each feature in the dataset.
Compare distributions and identify outliers.
Prerequisite Fundamentals of R -Programming Languages
CO mapped 4,5
Programmi R programming Language, Jupyter
ng Tools

Theory & Procedure:

Title: Download the Iris flower dataset or any other dataset into a DataFrame.
(eg https://archive.ics.uci.edu/ml/datasets/Iris ) Use Python/R and Perform
following –
● How many features are there and what are their types (e.g., numeric, nominal)?
● Compute and display summary statistics for each feature available in the
dataset. (eg. minimum value, maximum value, mean, range, standard deviation,
variance and percentiles
● Data Visualization-Create a histogram for each feature in the dataset to
illustrate the feature distributions. Plot each histogram.
● Create a boxplot for each feature in the dataset. All of the boxplots should be
combined into a single plot. Compare distributions and identify outliers.
Objectives: To learn the concept of how to display summary statistics for each feature
availablein the dataset.
Outcome: Students are able to display statistics of features.
Prerequisites: Fundamentals of R -Programming Languages.
Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.
Software Specification: Ubuntu 14.04, R, Jupyter.

Theory:

How to Find the Mean, Median, Mode, Range, and Standard Deviation
Simplify comparisons of sets of number, especially large sets of number, by calculating the
center values using mean, mode and median. Use the ranges and standard deviations of the
sets to examine the variability of data.

Calculating Mean

The mean identifies the average value of the set of numbers. For example, consider the
data set containing the values 20, 24, 25, 36, 25, 22, 23.

Formula

To find the mean, use the formula: Mean equals the sum of the numbers in the data
set divided by the number of values in the data set. In mathematical terms:
Mean=(sum of all terms)÷(how many terms or values in the set).

Adding Data Set

Add the numbers in the example data set: 20+24+25+36+25+22+23=175.

Finding Divisor

Divide by the number of data points in the set. This set has seven values so divide by 7.

Finding Mean

Insert the values into the formula to calculate the mean. The mean equals the sum of the values
(175) divided by the number of data points (7). Since 175÷7=25, the mean of this data set equals
25. Not all mean values will equal a whole number.

Calculating Range

Range shows the mathematical distance between the lowest and highest values in the data
set. Range measures the variability of the data set. A wide range indicates greater
variability in the data, or perhaps a single outlier far from the rest of the data. Outliers may
skew, or shift, the mean value enough to impact data analysis.

Identifying Low and High Values

In the sample group, the lowest value is 20 and the highest value is 36.

Calculating Range

To calculate range, subtract the lowest value from the highest value. Since 36-20=16,
the range equals 16.

Calculating Standard Deviation

Standard deviation measures the variability of the data set. Like

range, a smaller standard deviation indicates less variability.
Formula

Finding standard deviation requires summing the squared difference between each data point
and the mean [∑(x-µ)2], adding all the squares, dividing that sum by one less than the number
of values (N-1), and finally calculating the square root of the dividend.

Mathematically, start with calculating the mean.

Calculating the Mean

Calculate the mean by adding all the data point values, then dividing by the number of
data points. In the sample data set, 20+24+25+36+25+22+23=175. Divide the sum,
175, by the number of data points, 7, or 175÷7=25. The mean equals 25.

Squaring the Difference

Next, subtract the mean from each data point, then square each difference. The formula
looks like this: ∑(x-µ)2, where ∑ means sum, x represents each data set value and µ
represents the mean value. Continuing with the example set, the values become: 20-25=-5
and -52=25; 24- 25=-1 and -12=1; 25-25=0 and 02=0; 36-25=11 and 112=121; 25-25=0 and
02=0; 22-25=-3 and -
32=9; and 23-25=-2 and -22=4.

Adding the Squared Differences

Adding the squared differences yields: 25+1+0+121+0+9+4=160. The example data set
has 7 values, so N-1 equals 7-1=6. The sum of the squared differences, 160, divided by 6
equals approximately 26.6667.

Standard Deviation

Calculate the standard deviation by finding the square root of the division by N-1. In
the example, the square root of 26.6667 equals approximately 5.164. Therefore, the
standard deviation equals approximately 5.164.

Evaluating Standard Deviation

Standard deviation helps evaluate data. Numbers in the data set that fall within one standard
deviation of the mean are part of the data set. Numbers that fall outside of two standard
deviations are extreme values or outliers. In the example set, the value 36 lies more than
two standard deviations from the mean, so 36 is an outlier. Outliers may represent
erroneous data or may suggest unforeseen circumstances and should be carefully
considered when interpreting data.

Facilities: Windows/Linux Operating Systems, RStudio, jdk.

Application:

1. The histogram is suitable for visualizing distribution of numerical data over a

continuous interval, or a certain time period. The histogram organizes large amounts of
data, and produces visualization quickly, using a single dimension.
2. The box plot allows quick graphical examination of one or more data sets. Box plots
may seem more primitive than a histogram but they do have some advantages. They take up
less space and are therefore particularly useful for comparing distributions between several
groups or sets of data. Choice of number and width of bins techniques can heavily influence
the appearance of a histogram, and choice of bandwidth can heavily influence the appearance
of a kernel density estimate.

3. Data Visualization Application lets you quickly create insightful data

visualizations, in minutes.

Data visualization tools allow anyone to organize and present information intuitively. They
enables users to share data visualizations with others.

Input:

Structured Dataset:

Iris Dataset File:

iris.csv

Output:
1. Display Dataset Details.
2. Calculate Min, Max, Mean, Variance value and Percentiles of probabilities also
Display Specific use quantile.
3. Display the Histogram using Hist Function.
4. Display the Boxplot using Boxplot Function.

Conclusion:

Hence, we have studied using dataset into a dataframe and compare distribution and
identify outliers.

Questions:

1. What is Data visualization?

2. How to calculate min, max, range and standard deviation?
3. How to create boxplot for each feature in the daset?
4. How to create histogram?
5. What is dataset?
Assignment 11

Title Word Count application

Problem Write a code in JAVA for a simple Word Count application that counts the number of
statement occurrences of each word in a given input set using the Hadoop Map-Reduce
framework onlocal-standalone set-up.
Prerequisite 1. Linux commands
2. Java Programming
CO mapped 6
Programmi Eclipse
ng Tools

Theory & Procedure:

Title: Write a code in JAVA for a simple Word Count application that counts the number of
occurrences of each word in a given input set using the Hadoop Map-Reduce framework onlocal-
standalone set-up.

Objectives: Students should be able to set up Hadoop single node cluster and implement application on
cluster.

Outcome: Students are able to set Hadoop single node cluster and run the application successfully.
Prerequisites: 1. Linux commands
2. Java Programming

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Eclipse

Setting up a Single Node Hadoop Cluster

A single node cluster means only one DataNode running and setting up all the NameNode, DataNode,
ResourceManager, and NodeManager on a single machine. This is used for studying and testing
purposes.
While in a Multi-node cluster, there are more than one DataNode running and each DataNode is
running on different machines. The multi-node cluster is practically used in organizations for analyzing
Big Data. Considering the above example, in real-time when we deal with petabytes of data, it needs to
be distributed across hundreds of machines to be processed. Thus, here we use a multi-node cluster.
Hadoop Installation:

Step 1) Install java

(sudo apt-get install openjdk-8-jdk)

Step 2) Download the Hadoop 2.7.3 Package.

wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Step 3) Extract the Hadoop tar

File. tar -xvf hadoop-2.7.3.tar.gz

Step 4) Add the Hadoop and Java paths in the bash file (.bashrc).

gedit .bashrc

export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HOME/hadoop-2.7.3/etc/hadoop
export HADOOP_MAPRED_HOME=$HOME/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HOME/hadoop-2.7.3
export HADOOP_HDFS_HOME=$HOME/hadoop-2.7.3
export YARN_HOME=$HOME/hadoop-2.7.3
export
PATH=$PATH:$HOME/hadoop-2.7.3/bin

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin:$PATH
Step 5) For applying all these changes to the current Terminal, execute the source command.

source .bashrc

Step 6)hadoop version

Step 7) Edit the Hadoop Configuration

files. cd hadoop-2.7.3/etc/hadoop/

Step 8) Open core-site.xml and edit the property mentioned below inside configuration tag:

gedit core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

Step 9) Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:

gedit hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
</configuration>
Step 10) Edit the mapred-site.xml file and edit the property mentioned below inside configuration tag:
(In some cases, mapred-site.xml file is not available. So, we have to create the mapred-site.xml
file using mapred-site.xml template.
cp mapred-site.xml.template mapred-site.xml)

gedit mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Step 11) Edit yarn-site.xml and edit the property mentioned below inside configuration tag:

gedit yarn-site.xml

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

Step 12) Edit hadoop-env.sh and add the Java Path as mentioned below:

gedit hadoop–env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Step 13) Go to Hadoop home directory and format the

NameNode. cd hadoop-2.7.3

bin/hadoopnamenode -format

Step 14) Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons/nodes.

cd hadoop-2.7.3/sbin

1) Start NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored
in the HDFS and tracks all the file stored across the cluster.

./hadoop-daemon.sh start namenode

2) Start DataNode:

On startup, a DataNode connects to the Namenode and it responds to the requests from the Namenode
for different operations.

./hadoop-daemon.sh start datanode

3) Start ResourceManager:

ResourceManager is the master that arbitrates all the available cluster resources and thus helps in
managing the distributed applications running on the YARN system. Its work is to manage each
NodeManagers and the each application’s ApplicationMaster.
./yarn-daemon.sh start resourcemanager

4) Start NodeManager:

The NodeManager in each machine framework is the agent which is responsible for managing
containers, monitoring their resource usage and reporting the same to the ResourceManager.

./yarn-daemon.sh start nodemanager

5) Start JobHistoryServer:

JobHistoryServer is responsible for servicing all job history related requests from client.

./mr-jobhistory-daemon.sh start historyserver

Step 15) To check that all the Hadoop services are up and running, run the below command.

jps

Step 16) Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the
NameNode interface.

Steps to run the application on Hadoop cluster:

Step 1) mkdir words

Step 2) Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following
link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1

Step 3) Put that downloaded jar file into words folder.

Step 4) Implement WordCount.java program.

Step 5) Create input1.txt on home directory with some random text

Step 6) go on words path then compile

javac -classpath /home/vijay/words/hadoop-core-1.2.1.jar /home/vijay/words/WordCount.java

Step 7) jar -cvf words.jar -c words/ .

Step 8) cd .. then use following

commands hadoop fs -mkdir /input

hadoop fs -put input1.txt

/input hadoop fs -ls /input

hadoop jar /home/vijay/words/words12.jar WordCount /input/input1.txt

/out321 hadoop fs -ls /out321

hadoop fs -cat /out321/part-r-00000

(Otherwise check in Browsing HDFS -> Utilities -> Browse the file System -> /)

Conclusion- In this way, we have successfully installed the hadoop cluster on single node
and also implemented the java application on it.
Assignment Question

1. What is the map reduce explain with a small example?

2. Write down steps to install hadoop.

Assignment 12

Title Design a distributed application

Problem Design a distributed application using Map-Reduce which processes a log file of a
statement system.
Prerequisite 1. Linux commands
2. Java Programming
CO mapped 6
Programmi Eclipse
ng Tools

Theory & Procedure:

Title: Design a distributed application using Map-Reduce which processes a log file of a system.

Objectives: Students should be able to implement a distributed application using Map-Reduce.

Outcome: Students are able to run the distributed application using Map-Reduce successfully.
Prerequisites: 1. Linux commands
2. Java Programming

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Eclipse

Steps to run the distributed application using Map-Reduce:

Note: Initially create one folder logfiles1 on desktop. In that folder store input file
(access_log_short.csv), SalesMapper.java, SalesCountryReducer.java, SalesCountryDriver.java
files)

Step 1) Go to Hadoop home directory and format the

NameNode. cd hadoop-2.7.3

bin/hadoop namenode -format

Step 2) Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons/nodes.

cd hadoop-2.7.3/sbin
1) Start NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
stored in the HDFS and tracks all the file stored across the cluster.

./hadoop-daemon.sh start namenode

2) Start DataNode:

On startup, a DataNode connects to the Namenode and it responds to the requests from the
Namenode for different operations.

./hadoop-daemon.sh start datanode

3) Start ResourceManager:

./yarn-daemon.sh start resourcemanager

4) Start NodeManager:

The NodeManager in each machine framework is the agent which is responsible for managing
containers, monitoring their resource usage and reporting the same to the ResourceManager.

./yarn-daemon.sh start nodemanager

5) Start JobHistoryServer:

JobHistoryServer is responsible for servicing all job history related requests from client.

./mr-jobhistory-daemon.sh start historyserver

Step 3) To check that all the Hadoop services are up and running, run the below

command. jps

Step 4) cd

Step 5) sudo mkdir mapreduce_vijay

Step 6) sudo chmod 777 -R mapreduce_vijay/

Step 7) sudo chown -R vijay mapreduce_vijay/

Step 8) sudo cp /home/vijay/Desktop/logfiles1/* ~/mapreduce_vijay/

Step 9) cd mapreduce_vijay/

Step 10) ls

Step 11) sudo chmod +r .

Step 12) export

CLASSPATH="/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-
mapreduce-client-core-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoo
p-
mapreduce-client-common-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/common/had
oop- common-2.7.3.jar:~/mapreduce_vijay/SalesCountry/*:$HADOOP_HOME/lib/*"

Step 13) javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

Step 14) ls

Step 15) cd SalesCountry/

Step 16) ls (check is class files are created)

Step 17) cd ..

Step 18) gedit

Manifest.txt (add
following lines to it:
Main-Class: SalesCountry.SalesCountryDriver)

Step 19) jar -cfm mapreduce_vijay.jar Manifest.txt SalesCountry/*.class

Step 20) ls

Step 21) cd

Step 22) cd mapreduce_vijay/

Step 23) sudo mkdir /input200

Step 24) sudo cp access_log_short.csv /input200

Step 25) $HADOOP_HOME/bin/hdfs dfs -put /input200 /

Step 26) $HADOOP_HOME/bin/hadoop jar mapreduce_vijay.jar /input200 /output200

Step 27) hadoop fs -ls /output200

Step 28) hadoop fs -cat /out321/part-00000

Step 29) Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check
the NameNode interface.

Conclusion- In this way, we have successfully implemented distributed application using

Map-Reduce which processes a log file of a system.

Assignment Question:
1. What is the map reduce explain with a small example?

2. Write down steps to install hadoop.

Assignment 13

Title Simple program in SCALA

Problem Write a simple program in SCALA using Apache Spark framework

statement
Prerequisite 1. Linux commands
2. Java Programming
CO mapped 6
Programmi Eclipse
ng Tools

Theory & Procedure:

Title: Write a simple program in SCALA using Apache Spark framework

Objectives: Students should be able to implement a program in SCALA using Apache Spark framework.

Outcome: Students are able to run the program in SCALA using Apache Spark framework successfully.
Prerequisites: 1. Linux commands
2. Java Programming

Hardware Specification: x86_64 bit, 2 – 2/4 GB DDR RAM, 80 - 500 GB SATA HD.

Software Specification: Ubuntu 14.04, Eclipse

Install Scala

Step 1) java -version

Step 2) Install Scala from the apt repository by running the following commands to search for scala and
install it.

sudo apt search scala ⇒ Search for the

package sudo apt install scala⇒ Install the
package

Step 3) To verify the installation of Scala, run the following command.

scala -version
Apache Spark Framework Installation

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes
in-memory caching, and optimized query execution for fast analytic queries against data of any size.

Step 1) Now go to the official Apache Spark download page and grab the latest version (i.e. 3.2.1) at the
time of writing this article. Alternatively, you can use the wget command to download the file directly in
the terminal.

wget https://apachemirror.wuchna.com/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz

Step 2) Extract the Apache Spark tar

file. tar -xvzf
spark-3.1.1-bin-hadoop2.7.tgz

Step 3) Move the extracted Spark directory to /opt

directory. sudo mv spark-3.1.1-bin-hadoop2.7 /opt/spark

Configure Environmental Variables for Spark

Step 4) Now you have to set a few environmental variables in .profile file before starting up the spark.

echo "export SPARK_HOME=/opt/spark" >> ~/.profile

echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

Step 5) To make sure that these new environment variables are reachable within the shell and available to
Apache Spark, it is also mandatory to run the following command to take recent changes into effect.

source ~/.profile

Step 6) ls -l /opt/spark

Start Apache Spark in Ubuntu

Step 7) Run the following command to start the Spark master service and slave service.

start-master.sh
start-workers.sh spark://localhost:7077
(if workers not starting then remove and install openssh:
sudo apt-get remove openssh-client openssh-server
sudo apt-get install openssh-client openssh-server)

Step 8) Once the service is started go to the browser and type the following URL access spark page. From
the page, you can see my master and slave service is started.

http://localhost:8080/

Step 9) You can also check if spark-shell works fine by launching the spark-shell command.
Spark-shell
sudo apt install snapd

snap find “intellij”

sudo snap install intellij-idea-community - - classic

Start Intellij IDE community Edition

Conclusion- In this way, we have successfully run a program in SCALA using Apache Spark framework

Assignment Question:
1. Write down steps to install scala.

Should College Students Be Tested For AIDS
No ratings yet
Should College Students Be Tested For AIDS
3 pages
DSBDAlab Manual
No ratings yet
DSBDAlab Manual
116 pages
Data Science With Python Class Room Notes Qulaity Thought
100% (2)
Data Science With Python Class Room Notes Qulaity Thought
489 pages
AI Project Logbook Done(2023-24)
No ratings yet
AI Project Logbook Done(2023-24)
34 pages
Assignment 2
0% (6)
Assignment 2
7 pages
Data Science Lab Manual
No ratings yet
Data Science Lab Manual
85 pages
Scikit-Learn-Exercises - Jupyter Notebook
100% (2)
Scikit-Learn-Exercises - Jupyter Notebook
28 pages
Compiler Design Lab Manual
No ratings yet
Compiler Design Lab Manual
48 pages
Juspay Registered Eligible Students List 2025 Batch
100% (1)
Juspay Registered Eligible Students List 2025 Batch
111 pages
WDM Important Questions 1
100% (1)
WDM Important Questions 1
3 pages
MS 54 Assignment Jan2021 - June 2021
No ratings yet
MS 54 Assignment Jan2021 - June 2021
2 pages
Python Assignment
No ratings yet
Python Assignment
1 page
Dsbda Lab Manual Merged
No ratings yet
Dsbda Lab Manual Merged
117 pages
Report Final
100% (1)
Report Final
25 pages
t Respaifb i 4 l3 en File 19.En
No ratings yet
t Respaifb i 4 l3 en File 19.En
145 pages
Bahria University Midterm ITB Paper Fall 2021 07052021 091547pm
100% (1)
Bahria University Midterm ITB Paper Fall 2021 07052021 091547pm
1 page
Study Notion
No ratings yet
Study Notion
51 pages
(Icdsaiml 2024) Mon
No ratings yet
(Icdsaiml 2024) Mon
2 pages
Project Book Class XII AI
No ratings yet
Project Book Class XII AI
7 pages
Individual Long Term Internship
No ratings yet
Individual Long Term Internship
57 pages
AI Unit-4
No ratings yet
AI Unit-4
58 pages
Student management system
No ratings yet
Student management system
41 pages
Report MP
No ratings yet
Report MP
26 pages
Minimum Maximum Year Over
No ratings yet
Minimum Maximum Year Over
29 pages
Intro Intern Final Merged
No ratings yet
Intro Intern Final Merged
22 pages
T Report
No ratings yet
T Report
18 pages
Classifying Emergency Patients into Fast-Track and Complex Cases using Machine Learning
No ratings yet
Classifying Emergency Patients into Fast-Track and Complex Cases using Machine Learning
13 pages
Chandigarh
No ratings yet
Chandigarh
6 pages
FAI Lecture - 4-10-2023 PDF
No ratings yet
FAI Lecture - 4-10-2023 PDF
27 pages
B.tech (Cse) Ai + ML + DL
No ratings yet
B.tech (Cse) Ai + ML + DL
184 pages
BCA Cyber 2022 25 Booklet
No ratings yet
BCA Cyber 2022 25 Booklet
136 pages
Benchmarking Llms Via Uncertainty Quantification
No ratings yet
Benchmarking Llms Via Uncertainty Quantification
24 pages
An Artificial Neural Network-Based Earthquake Casualty Estimation Model For Istanbul City
No ratings yet
An Artificial Neural Network-Based Earthquake Casualty Estimation Model For Istanbul City
16 pages
LD7010 - Assignment
No ratings yet
LD7010 - Assignment
9 pages
Computer Practical File
No ratings yet
Computer Practical File
18 pages
Ghanashyam Rohina (Sponsorship Management System)
No ratings yet
Ghanashyam Rohina (Sponsorship Management System)
58 pages
Phase 2 Final Report Depression Detection
No ratings yet
Phase 2 Final Report Depression Detection
48 pages
Synopsis
No ratings yet
Synopsis
22 pages
Lecture 3 Data Mining
No ratings yet
Lecture 3 Data Mining
36 pages
Final LP-VI Lab Manual 23-24
No ratings yet
Final LP-VI Lab Manual 23-24
71 pages
Assessment 2 UEL CN 7000
No ratings yet
Assessment 2 UEL CN 7000
10 pages
#5 MEGNet_A_MEG-Based_Deep_Learning_Model_for_Cognitive_and_Motor_Imagery_Classification
No ratings yet
#5 MEGNet_A_MEG-Based_Deep_Learning_Model_for_Cognitive_and_Motor_Imagery_Classification
8 pages
AI Capstone Project
No ratings yet
AI Capstone Project
24 pages
NWS Assignment Coppa - Latest 2022
No ratings yet
NWS Assignment Coppa - Latest 2022
8 pages
pham2021
No ratings yet
pham2021
14 pages
Classification and Prediction
No ratings yet
Classification and Prediction
41 pages
Presentrtion Final 1
No ratings yet
Presentrtion Final 1
10 pages
GrandChallengeof106-PointFacialLandmarkLocalization-1
No ratings yet
GrandChallengeof106-PointFacialLandmarkLocalization-1
5 pages
Detection and Verification System of Handwriting and Signature Using Raspberry Picopy 220426094934
No ratings yet
Detection and Verification System of Handwriting and Signature Using Raspberry Picopy 220426094934
48 pages
Over 843 - Ai - MS 1 2024-25
No ratings yet
Over 843 - Ai - MS 1 2024-25
5 pages
Acharya, Toya Khatri, Ishan Annamalai, Annamalai Chouikha, Mohamed F (2021)
No ratings yet
Acharya, Toya Khatri, Ishan Annamalai, Annamalai Chouikha, Mohamed F (2021)
7 pages
Introducing Machine Learning To Parameter Estimation
No ratings yet
Introducing Machine Learning To Parameter Estimation
6 pages
Machine Learning With Ridge and Lasso Regression
No ratings yet
Machine Learning With Ridge and Lasso Regression
19 pages
Machine Learning Technical Report
No ratings yet
Machine Learning Technical Report
12 pages
A Hybrid Machine-Learning Ensemble For Anomaly Detection in Real-Time Industry 4.0 Systems
No ratings yet
A Hybrid Machine-Learning Ensemble For Anomaly Detection in Real-Time Industry 4.0 Systems
13 pages
Sih Idea
No ratings yet
Sih Idea
4 pages
Aditya_Rawat_Resume
No ratings yet
Aditya_Rawat_Resume
1 page
Mlip 002 PDF
50% (2)
Mlip 002 PDF
19 pages
JNTUA R20 B.Tech - Artificial Intelligence Data Science III IV Year Draft Course Structure Syllabus
No ratings yet
JNTUA R20 B.Tech - Artificial Intelligence Data Science III IV Year Draft Course Structure Syllabus
113 pages
Data Science
No ratings yet
Data Science
18 pages
MCA Seminar Presentation
No ratings yet
MCA Seminar Presentation
8 pages
Synopsis On
No ratings yet
Synopsis On
8 pages
Nptel
No ratings yet
Nptel
1 page
Open AI Chat GPT and Bias by OBrien and Alsmadi
No ratings yet
Open AI Chat GPT and Bias by OBrien and Alsmadi
15 pages
Ethical Hacking Assignment
No ratings yet
Ethical Hacking Assignment
5 pages
Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis
No ratings yet
Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis
14 pages
Communication Assessment For Cognizant Recruitment Drive - 2024 Graduating Batch
No ratings yet
Communication Assessment For Cognizant Recruitment Drive - 2024 Graduating Batch
18 pages
Summer Training Report - Ishan Patwal
No ratings yet
Summer Training Report - Ishan Patwal
52 pages
Vtu Guidelines For Project Report
No ratings yet
Vtu Guidelines For Project Report
5 pages
Project Report Format-1
No ratings yet
Project Report Format-1
11 pages
Im (FMS) Bhu Sp2020
No ratings yet
Im (FMS) Bhu Sp2020
32 pages
A Big Data Analytics Study Challenges, Unresolved Research Issues, and Techniques
100% (1)
A Big Data Analytics Study Challenges, Unresolved Research Issues, and Techniques
8 pages
Javascript Handwritten Notes
No ratings yet
Javascript Handwritten Notes
28 pages
Mca Project Guidelines Complete
No ratings yet
Mca Project Guidelines Complete
13 pages
Aissce Practical Exam CS Question Paper - 2023-24
No ratings yet
Aissce Practical Exam CS Question Paper - 2023-24
5 pages
Microsoft Dynamics 365 Syllabus
No ratings yet
Microsoft Dynamics 365 Syllabus
2 pages
Mini Project Format With Guidelines
No ratings yet
Mini Project Format With Guidelines
7 pages
Ex No: Web Page Creation Using Image Map Date
No ratings yet
Ex No: Web Page Creation Using Image Map Date
11 pages
B.E. Result Sheet 2532011
No ratings yet
B.E. Result Sheet 2532011
12 pages
"College Portal ": A Project Report On
50% (2)
"College Portal ": A Project Report On
13 pages
7th Sem 1
No ratings yet
7th Sem 1
32 pages
CP4252 Machine Learning Lab Manual
No ratings yet
CP4252 Machine Learning Lab Manual
26 pages
Format - Summer Internship Report
No ratings yet
Format - Summer Internship Report
6 pages
Online Institute Management System: A Project Report ON
No ratings yet
Online Institute Management System: A Project Report ON
25 pages
Education Work Experience: Garima - Cs18@nitp - Ac.in - +91-8218152661 - Garimasingh - Me
No ratings yet
Education Work Experience: Garima - Cs18@nitp - Ac.in - +91-8218152661 - Garimasingh - Me
2 pages
Project Report Format Lingayas Vidyapeeth
No ratings yet
Project Report Format Lingayas Vidyapeeth
13 pages
Result Management System
No ratings yet
Result Management System
11 pages
Assignment On Boolean Algebra
No ratings yet
Assignment On Boolean Algebra
3 pages
Minor Project Synopsis Format CSE PPIMT fINAL - G2 Group
No ratings yet
Minor Project Synopsis Format CSE PPIMT fINAL - G2 Group
12 pages
Multilevel Caches and Replacement Policies
No ratings yet
Multilevel Caches and Replacement Policies
16 pages