0% found this document useful (0 votes)

9 views145 pages

DSBDA

The document outlines an assignment for a computer engineering course focusing on data wrangling using Python. It includes tasks such as importing datasets, data preprocessing, formatting, normalization, and utilizing various Python libraries like Pandas, NumPy, and Matplotlib. Additionally, it provides theoretical content on datasets, data types, and functions for handling data in Pandas.

Uploaded by

sneha bose

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views145 pages

DSBDA

Uploaded by

sneha bose

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 145

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 1
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Wrangling, I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
Import all the required Python Libraries.
1. Locate open source data from the web (e.g. https://www.kaggle.com).
2. Provide a clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into the pandas data frame.
4. Data Preprocessing: check for missing values in the data using pandas insult(), describe()
function to get some initial statistics. Provide variable descriptions. Types of variables
etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data wrangling
operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

2. Concept of Data Preprocessing, Data Formatting , Data Normalization and Data

Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Introduction to Dataset
2. Python Libraries for Data Science
3. Description of Dataset
4. Panda Dataframe functions for load the dataset
5. Panda functions for Data Preprocessing
6. Panda functions for Data Formatting and Normalization
7. Panda Functions for handling categorical variables
---------------------------------------------------------------------------------------------------------------

1. Introduction to Dataset
A dataset is a collection of records, similar to a relational database table. Records are
similar to table rows, but the columns can contain not only strings or numbers, but also
nested data structures such as lists, maps, and other records.

Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation
and is also called an attribute of a data instance. Some features may be inputs to a model
(the predictors) and others may be outputs or the features to be predicted.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Data Type: Features have a data type. They may be real or integer-valued or may have a
categorical or ordinal value. You can have strings, dates, times, and more complex types,
but typically they are reduced to real or categorical values when working with traditional
machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning
methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train
our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not
used to train the model. It may be called the validation dataset.
Data Represented in a Table:
Data should be arranged in a two-dimensional space made up of rows and columns. This
type of data structure makes it easy to understand the data and pinpoint any problems. An
example of some raw data stored as a CSV (comma separated values).

The representation of the same data in a table is as follows:

Pandas Data Types

A data type is essentially an internal construct that a programming language uses to
understand how to store and manipulate data.
A possible confusing point about pandas data types is that there is some overlap between
pandas, python and numpy. This table summarizes the key points:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Pandas Python
NumPy type Usage
dtype type

object str or string_, unicode_, mixed types Text or mixed numeric

mixed and non-numeric values

int64 int int_, int8, int16, int32, int64, uint8, uint16, Integer numbers
uint32, uint64

float64 float float_, float16, float32, float64 Floating point numbers

bool bool bool_ True/False values

datetime64 NA datetime64[ns] Date and time values

timedelta[ns] NA NA Differences between two

datetimes

category NA NA Finite list of text values

2. Python Libraries for Data Science

a. Pandas
Pandas is an open-source Python package that provides high-performance, easy-to-use
data structures and data analysis tools for the labeled data in Python programming
language.
What can you do with Pandas?

1. Indexing, manipulating, renaming, sorting, merging data frame

2. Update, Add, Delete columns from a data frame
3. Impute missing files, handle missing data or NANs
4. Plot data with histogram or box plot
b. NumPy

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

One of the most fundamental packages in Python, NumPy is a general-purpose

array-processing package. It provides high-performance multidimensional array objects
and tools to work with the arrays. NumPy is an efficient container of generic
multi-dimensional data.

NumPy’s main object is the homogeneous multidimensional array. It is a table of

elements or numbers of the same datatype, indexed by a tuple of positive integers. In
NumPy, dimensions are called axes and the number of axes is called rank. NumPy’s
array class is called ndarray aka array.

What can you do with NumPy?

1. Basic array operations: add, multiply, slice, flatten, reshape, index arrays
2. Advanced array operations: stack arrays, split into sections, broadcast arrays
3. Work with DateTime or Linear Algebra
4. Basic Slicing and Advanced Indexing in NumPy Python
c. Matplotlib
This is undoubtedly my favorite and a quintessential Python library. You can create
stories with the data visualized with Matplotlib. Another library from the SciPy Stack,
Matplotlib plots 2D figures.

What can you do with Matplotlib?

Histogram, bar plots, scatter plots, area plot to pie plot, Matplotlib can depict a wide
range of visualizations. With a bit of effort and tint of visualization capabilities, with
Matplotlib, you can create just any visualizations:Line plots

● Scatter plots
● Area plots
● Bar charts and Histograms
● Pie charts
● Stem plots
● Contour plots

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Quiver plots
● Spectrograms

Matplotlib also facilitates labels, grids, legends, and some more formatting entities with
Matplotlib.
d. Seaborn

So when you read the official documentation on Seaborn, it is defined as the data
visualization library based on Matplotlib that provides a high-level interface for drawing
attractive and informative statistical graphics. Putting it simply, seaborn is an extension
of Matplotlib with advanced features.

What can you do with Seaborn?

1. Determine relationships between multiple variables (correlation)

2. Observe categorical variables for aggregate statistics
3. Analyze univariate or bi-variate distributions and compare them between different
data subsets
4. Plot linear regression models for dependent variables
5. Provide high-level abstractions, multi-plot grids
6. Seaborn is a great second-hand for R visualization libraries like corrplot and ggplot.
e. 5. Scikit Learn

Introduced to the world as a Google Summer of Code project, Scikit Learn is a robust
machine learning library for Python. It features ML algorithms like SVMs, random
forests, k-means clustering, spectral clustering, mean shift, cross-validation and more...
Even NumPy, SciPy and related scientific operations are supported by Scikit Learn with
Scikit Learn being a part of the SciPy Stack.

What can you do with Scikit Learn?

1. Classification: Spam detection, image recognition

2. Clustering: Drug response, Stock price

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

3. Regression: Customer segmentation, Grouping experiment outcomes

4. Dimensionality reduction: Visualization, Increased efficiency
5. Model selection: Improved accuracy via parameter tuning
6. Pre-processing: Preparing input data as a text for processing with machine
learning algorithms.

3. Description of Dataset:

The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple
Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning
Repository.
It includes three iris species with 50 samples each as well as some properties about each
flower. One flower species is linearly separable from the other two, but the other two are not
linearly separable from each other.
Total Sample- 150
The columns in this dataset are:
1. Id
2. SepalLengthCm
3. SepalWidthCm
4. PetalLengthCm
5. PetalWidthCm
6. Species
3 Different Types of Species each contain 50 Sample-

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Description of Dataset-

4. Panda Dataframe functions for Load Dataset

# The columns of the resulting DataFrame have different dtypes.

iris.dtypes
1. The dataset is downloads from UCI repository.
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
2. Now Read CSV File as a Dataframe in Python from from path where you saved the same
The Iris data set is stored in .csv format. ‘.csv’ stands for comma separated values. It is
easier to load .csv files in Pandas data frame and perform various analytical operations on
it.
Load Iris.csv into a Pandas data frame —
Syntax-
iris = pd.read_csv(csv_url, header = None)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

3. The csv file at the UCI repository does not contain the variable/column names. They are
located in a separate file.
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']

4. read in the dataset from the UCI Machine Learning Repository link and specify column
names to use
iris = pd.read_csv(csv_url, names = col_names)

5. Panda Dataframe functions for Data Preprocessing :

Dataframe Operations:

Sr. Data Frame Function Description

1 dataset.head(n=5) Return the first n rows.

2 dataset.tail(n=5)
Return the last n rows.

3 dataset.index The index (row labels) of the Dataset.

4 dataset.columns The column labels of the Dataset.

5 dataset.shape Return a tuple representing the dimensionality of the

Dataset.

6 dataset.dtypes Return the dtypes in the Dataset.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

This returns a Series with the data type of each column.

The result’s index is the original Dataset’s columns.
Columns with mixed types are stored with the object
dtype.

7 dataset.columns.values Return the columns values in the Dataset in array

format

8 dataset.describe(include='all') Generate descriptive statistics.

to view some basic statistical details like percentile,
mean, std etc. of a data frame or a series of numeric
values.

Analyzes both numeric and object series, as well as

Dataset column sets of mixed data types.

9 dataset['Column name] Read the Data Column wise.

10 dataset.sort_index(axis=1, Sort object by labels (along an axis).

ascending=False)

11 dataset.sort_values(by="Colu Sort values by column name.

mn name")

12 dataset.iloc[5] Purely integer-location based indexing for selection by

position.

13 dataset[0:3] Selecting via [], which slices the rows.

14 dataset.loc[:, ["Col_name1", Selection by label

"col_name2"]]

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

15 dataset.iloc[:n, :] a subset of the first n rows of the original data

16 dataset.iloc[:, :n] a subset of the first n columns of the original data

17 dataset.iloc[:m, :n] a subset of the first m rows and the first n columns

Few Examples of iLoc to slice data for iris Dataset

Sr. Data Frame Description Output

No Function

1 dataset.iloc[3:5, 0:2] Slice the data

2 dataset.iloc[[1, 2, By lists of integer

4], [0, 2]] position locations, similar
to the NumPy/Python
style:

3 dataset.iloc[1:3, :] For slicing rows

explicitly:

4 dataset.iloc[:, 1:3] For slicing Column

explicitly:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

4 dataset.iloc[1, 1] For getting a value

explicitly:

5 dataset['SepalLeng Accessing Column and

thCm'].iloc[5] Rows by position

6 cols_2_4=dataset.c Get Column Name then

olumns[2:4] get data from column

dataset[cols_2_4]

7 dataset[dataset.col in one Expression answer

umns[2:4]].iloc[5:1 for the above two
0] commands

Checking of Missing Values in Dataset:

● isnull() is the function that is used to check missing values or null values in pandas python.
● isna() function is also used to get the count of missing values of column and row wise count
of missing values
● The dataset considered for explanation is:

a. is there any missing values in dataframe as a whole

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Function: DataFrame.isnull()
Output:

b. is there any missing values across each column

Function: DataFrame.isnull().any()
Output:

c. count of missing values across each column using isna() and isnull()
In order to get the count of missing values of the entire dataframe isnull() function is
used. sum() which does the column wise sum first and doing another sum() will get
the count of missing values of the entire dataframe.
Function: dataframe.isnull().sum().sum()
Output : 8
d. count row wise missing value using isnull()
Function: dataframe.isnull().sum(axis = 1)
Output:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

e. count Column wise missing value using isnull()

Method 1:
Function: dataframe.isnull().sum()
Output:

Method 2:
unction: dataframe.isna().sum()

f. count of missing values of a specific column.

Function:dataframe.col_name.isnull().sum()
df1.Gender.isnull().sum()
Output: 2
g. groupby count of missing values of a column.
In order to get the count of missing values of the particular column by group in
pandas we will be using isnull() and sum() function with apply() and groupby()
which performs the group wise count of missing values as shown below.
Function:
df1.groupby(['Gender'])['Score'].apply(lambda x:
x.isnull().sum())
Output:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

6. Panda functions for Data Formatting and Normalization

The Transforming data stage is about converting the data set into a format that can be

analyzed or modelled effectively, and there are several techniques for this process.

a. Data Formatting: Ensuring all data formats are correct (e.g. object, text, floating
number, integer, etc.) is another part of this initial ‘cleaning’ process. If you are
working with dates in Pandas, they also need to be stored in the exact format to use
special date-time functions.

Functions used for data formatting

Sr. Data Frame Description Output

No Function

1. df.dtypes To check the data

type

2. df['petal length To change the data

(cm)']= df['petal type (data type of
‘petal length
length (cm)'changed to int)
(cm)'].astype("int
")

b. Data normalization: Mapping all the nominal data values onto a uniform scale
(e.g. from 0 to 1) is involved in data normalization. Making the ranges consistent

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

across variables helps with statistical analysis and ensures better comparisons
later on.It is also known as Min-Max scaling.

Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Print iris dataset.
df.head()
Step 5: Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
Step 6: Separate the feature from the class label
x=df.iloc[:,:4]
Step 6: Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(x)
Step 7:Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
Step 8: View the dataframe
df_normalized
Output: After Step 3:

Output after step 8:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

7. Panda Functions for handling categorical variables

● Categorical variables have values that describe a ‘quality’ or ‘characteristic’
of a data unit, like ‘what type’ or ‘which category’.
● Categorical variables fall into mutually exclusive (in one category or in
another) and exhaustive (include all possible options) categories. Therefore,
categorical variables are qualitative variables and tend to be represented by a
non-numeric value.
● Categorical features refer to string type data and can be easily understood by
human beings. But in case of a machine, it cannot interpret the categorical
data directly. Therefore, the categorical data must be translated into numerical
data that can be understood by machine.
There are many ways to convert categorical data into numerical data. Here the three most used
methods are discussed.
a. Label Encoding: Label Encoding refers to converting the labels into a numeric form
so as to convert them into the machine-readable form. It is an important preprocessing
step for the structured dataset in supervised learning.

Example : Suppose we have a column Height in some dataset. After applying label
encoding, the Height column is converted into:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

where 0 is the label for tall, 1 is the label for medium, and 2 is a label for short height.
Label Encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for Label Encoding:
● preprocessing.LabelEncoder : It Encode labels with value between 0
and n_classes-1.
● fit_transform(y):
Parameters: yarray-like of shape (n_samples,)
Target values.
Returns: yarray-like of shape (n_samples,)
Encoded labels.
This transformer should be used to encode target values, and not the input.
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: define label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
Step 5: Encode labels in column 'species'.
df['Species']= label_encoder.fit_transform(df['Species'])
Step 6: Observe the unique values for the Species column.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)

● Use LabelEncoder when there are only two possible values of a categorical feature.
For example, features having value such as yes or no. Or, maybe, gender features
when there are only two possible values including male or female.

Limitation: Label encoding converts the data in machine-readable form, but it assigns a
unique number(starting from 0) to each class of data. This may lead to the generation
of priority issues in the data sets. A label with a high value may be considered to have
high priority than a label having a lower value.
b. One-Hot Encoding:

In one-hot encoding, we create a new set of dummy (binary) variables that is equal to the
number of categories (k) in the variable. For example, let’s say we have a categorical
variable Color with three categories called “Red”, “Green” and “Blue”, we need to use
three dummy variables to encode this variable using one-hot encoding. A dummy
(binary) variable just takes the value 0 or 1 to indicate the exclusion or inclusion of a
category.

In one-hot encoding,
“Red” color is encoded as [1 0 0] vector of size 3.
“Green” color is encoded as [0 1 0] vector of size 3.
“Blue” color is encoded as [0 0 1] vector of size 3.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

One-hot encoding on iris dataset: For iris dataset the target column which is Species. It
contains three species Iris-setosa, Iris-versicolor, Iris-virginica.
Sklearn Functions for One-hot Encoding:
● sklearn.preprocessing.OneHotEncoder(): Encode categorical
integer features using a one-hot aka one-of-K scheme
Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply label_encoder object for label encoding the Observe the unique
values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 5: Remove the target variable from dataset
features_df=df.drop(columns=['Species'])

Step 6: Apply one_hot encoder for Species column.

enc = preprocessing.OneHotEncoder()
enc_df=pd.DataFrame(enc.fit_transform(df[['Species']])).toarray()
Step 7: Join the encoded values with Features variable
df_encode = features_df.join(enc_df)

Step 8: Observe the merge dataframe

df_encode
Step 9: Rename the newly encoded columns.
df_encode.rename(columns = {0:'Iris-Setosa',
1:'Iris-Versicolor',2:'Iris-virginica'}, inplace = True)
Step 10: Observe the merge dataframe
df_encode

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Output after Step 8:

Output after Step 10:

c. Dummy Variable Encoding

Dummy encoding also uses dummy (binary) variables. Instead of creating a number of
dummy variables that is equal to the number of categories (k) in the variable, dummy
encoding uses k-1 dummy variables. To encode the same Color variable with three
categories using the dummy encoding, we need to use only two dummy variables.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

In dummy encoding,
“Red” color is encoded as [1 0] vector of size 2.
“Green” color is encoded as [0 1] vector of size 2.
“Blue” color is encoded as [0 0] vector of size 2.
Dummy encoding removes a duplicate category present in the one-hot encoding.
Pandas Functions for One-hot Encoding with dummy variables:
● pandas.get_dummies(data, prefix=None, prefix_sep='_',
dummy_na=False, columns=None, sparse=False,
drop_first=False, dtype=None): Convert categorical variable into
dummy/indicator variables.
● Parameters:
data:array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefixstr: list of str, or dict of str, default None
String to append DataFrame column names.
prefix_sep: str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as
with prefix.
dummy_nabool: default False
Add a column to indicate NaNs, if False NaNs are ignored.

columns: list:like, default None

Column names in the DataFrame to be encoded. If columns is None then all the
columns with object or category dtype will be converted.

sparse: bool: default False

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Whether the dummy-encoded columns should be backed by a SparseArray (

True) or a regular NumPy array (False).

drop_first:bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype: dtype, default np.uint8

Data type for new columns. Only a single dtype is allowed.
● Return : DataFrame with Dummy-coded data.

Algorithm:
Step 1 : Import pandas and sklearn library for preprocessing
from sklearn import preprocessing
Step 2: Load the iris dataset in dataframe object df
Step 3: Observe the unique values for the Species column.
df['Species'].unique()
output: array(['Iris-setosa', 'Iris-versicolor',
'Iris-virginica'], dtype=object)
Step 4: Apply label_encoder object for label encoding the Observe the unique
values for the Species column.
df['Species'].unique()
Output: array([0, 1, 2], dtype=int64)
Step 6: Apply one_hot encoder with dummy variables for Species column.
one_hot_df = pd.get_dummies(df, prefix="Species",
columns=['Species'], drop_first=True)
Step 7: Observe the merge dataframe
one_hot_df

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Conclusion- In this way we have explored the functions of the python library for Data
Preprocessing, Data Wrangling Techniques and How to Handle missing values on Iris Dataset.
Assignment Question
1. Explain Data Frame with Suitable example.
2. What is the limitation of the label encoding method?
3. What is the need of data normalization?
4. What are the different Techniques for Handling the Missing Data?

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 2
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Wrangling, II
Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with
them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for
better understanding of the variable, to convert a non-linear relation into a linear
one, or to decrease the skewness and convert the distribution into a normal
distribution.
Reason and document your approach properly.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform thedata wrangling
operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Prerequisite:
1. Basic of Python Programming
2. Concept of Data Preprocessing, Data Formatting , Data Normalization and Data
Cleaning.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Creation of Dataset using Microsoft Excel.
2. Identification and Handling of Null Values
3. Identification and Handling of Outliers
4. Data Transformation for the purpose of :
a. To change the scale for better understanding
b. To decrease the skewness and convert distribution into normal distribution
---------------------------------------------------------------------------------------------------------------
Theory:
1. Creation of Dataset using Microsoft Excel.
The dataset is created in “CSV” format.
● The name of dataset is StudentsPerformance
● The features of the dataset are: Math_Score, Reading_Score, Writing_Score,
Placement_Score, Club_Join_Date .
● Number of Instances: 30
● The response variable is: Placement_Offer_Count .
● Range of Values:
Math_Score [60-80], Reading_Score[75-,95], ,Writing_Score [60,80],
Placement_Score[75-100], Club_Join_Date [2018-2021].
● The response variable is the number of placement offers facilitated to particular
students, which is largely depend on Placement_Score
To fill the values in the dataset the RANDBETWEEN is used. Returns a random
integer number between the numbers you specify
Syntax : RANDBETWEEN(bottom, top) Bottom The smallest integer and
Top The largest integer RANDBETWEEN will return.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

For better understanding and visualization, 20% impurities are added into each variable
to the dataset.
The step to create the dataset are as follows:
Step 1: Open Microsoft Excel and click on Save As. Select Other .Formats

Step 2: Enter the name of the dataset and Save the dataset astye CSV(MS-DOS).

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 3: Enter the name of features as column header.

Step 3: Fill the dara by using RANDOMBETWEEN function. For every feature , fill
the data by considering above spectified range.
one example is given:

Scroll down the cursor for 30 rows to create 30 instances.

Repeat this for the features, Reading_Score, Writing_Score, Placement_Score,
Club_Join_Date.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

The placement count largely depends on the placement score. It is considered that if
placement score <75, 1 offer is facilitated; for placement score >75 , 2 offer is facilitated
and for else (>85) 3 offer is facilitated. Nested If formula is used for ease of data filling.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 4: In 20% data, fill the impurities. The range of math score is [60,80], updating a
few instances values below 60 or above 80. Repeat this for Writing_Score [60,80],
Placement_Score[75-100], Club_Join_Date [2018-2021].

Step 5: To violate the ruleof response variable, update few valus . If placement scoreis
greater then 85, facilated only 1 offer.

The dataset is created with the given description.

2. Identification and Handling of Null Values

Missing Data can occur when no information is provided for one or more items or for a
whole unit. Missing Data is a very big problem in real-life scenarios. Missing Data can
also refer to as NA(Not Available) values in pandas. In DataFrame sometimes many
datasets simply arrive with missing data, either because it exists and was not collected or
it never existed. For Example, Suppose different users being surveyed may choose not to
share their income, some users may choose not to share the address in this way many
datasets went missing.
In Pandas missing data is represented by two value:

1. None: None is a Python singleton object that is often used for missing data in
Python code.
2. NaN : NaN (an acronym for Not a Number), is a special floating-point value
recognized by all systems that use the standard IEEE floating-point
representation.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Pandas treat None and NaN as essentially interchangeable for indicating missing
or null values. To facilitate this convention, there are several useful functions for
detecting, removing, and replacing null values in Pandas DataFrame :

● isnull()

● notnull()

● dropna()

● fillna()

● replace()
1. Checking for missing values using isnull() and notnull()

● Checking for missing values using isnull()

In order to check null values in Pandas DataFrame, isnull() function is used. This
function return dataframe of Boolean values which are True for NaN values.

Algorithm:
Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 4: Use isnull() function to check null values in the dataset.

df.isnull()

Step 5: To create a series true for NaN values for specific columns. for example
math score in dataset and display data with only math score as NaN
series = pd.isnull(df["math score"])
df[series]

● Checking for missing values using notnull()

In order to check null values in Pandas Dataframe, notnull() function is used. This
function return dataframe of Boolean values which are False for NaN values.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 4: Use notnull() function to check null values in the dataset.

df.notnull()

Step 5: To create a series true for NaN values for specific columns. for example
math score in dataset and display data with only math score as NaN
series1 = pd.notnull(df["math score"])
df[series1]

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

See that there are also categorical values in the dataset, for this, you need to use
Label Encoding or One Hot Encoding.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
newdf=df
df

2. Filling missing values using dropna(), fillna(), replace()

In order to fill null values in a datasets, fillna(), replace() functions are used.
These functions replace NaN values with some value of their own. All these
functions help in filling null values in datasets of a DataFrame.

● For replacing null values with NaN

missing_values = ["Na", "na"]

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

df = pd.read_csv("StudentsPerformanceTest1.csv", na_values =
missing_values)
df

● Filling null values with a single value

Step 1 : Import pandas and numpy in order to check missing values in Pandas
DataFrame
import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/StudentsPerformanceTest1.csv")
Step 3: Display the data frame
df
Step 4: filling missing value using fillna()
ndf=df
ndf.fillna(0)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 5: filling missing values using mean, median and standard deviation of that
column.

data['math score'] = data['math score'].fillna(data['math score'].mean())

data[''math score''] = data[''math score''].fillna(data[''math

score''].median())

data['math score''] = data[''math score''].fillna(data[''math score''].std())

replacing missing values in forenoon column with minimum/maximum number

of that column

data[''math score''] = data[''math score''].fillna(data[''math score''].min())

data[''math score''] = data[''math score''].fillna(data[''math score''].max())

● Filling null values in dataset

To fill null values in dataset use inplace=true
m_v=df['math score'].mean()
df['math score'].fillna(value=m_v, inplace=True)
df

● Filling a null values using replace() method

Following line will replace Nan value in dataframe with value -99

ndf.replace(to_replace = np.nan, value = -99)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Deleting null values using dropna() method

In order to drop null values from a dataframe, dropna() function is used. This
function drops Rows/Columns of datasets with Null values in different ways.
1. Dropping rows with at least 1 null value
2. Dropping rows if all values in that row are missing
3. Dropping columns with at least 1 null value.
4. Dropping Rows with at least 1 null value in CSV file

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 5: To Drop rows if all values in that row are missing

ndf.dropna(how = 'all')

Step 6: To Drop columns with at least 1 null value.

ndf.dropna(axis = 1)

Step 7 : To drop rows with at least 1 null value in CSV file.

making new data frame with dropped NA values
new_data = ndf.dropna(axis = 0, how ='any')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

new_data

3. Identification and Handling of Outliers

3.1 Identification of Outliers
One of the most important steps as part of data preprocessing is detecting and treating the
outliers as they can negatively affect the statistical analysis and the training process of a
machine learning algorithm resulting in lower accuracy.
1. What are Outliers?
We all have heard of the idiom ‘odd one out' which means something unusual in
comparison to the others in a group.

Similarly, an Outlier is an observation in a given dataset that lies far from the rest
of the observations. That means an outlier is vastly larger or smaller than the remaining
values in the set.

2. Why do they occur?

An outlier may occur due to the variability in the data, or due to experimental
error/human error.

They may indicate an experimental error or heavy skewness in the

data(heavy-tailed distribution).

3. What do they affect?

In statistics, we have three measures of central tendency namely Mean, Median,
and Mode. They help us describe the data.

Mean is the accurate measure to describe the data when we do not have any
outliers present. Median is used if there is an outlier in the dataset. Mode is used if there
is an outlier AND about ½ or more of the data is the same.

‘Mean’ is the only measure of central tendency that is affected by the outliers
which in turn impacts Standard deviation.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Example:
Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]. By
looking at it, one can quickly say ‘101’ is an outlier that is much larger than the other
values.

fig. Computation with and without outlier

From the above calculations, we can clearly say the Mean is more affected than the
Median.
4. Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But
what if we have a huge dataset, how do we identify the outliers then? We need to use
visualization and mathematical techniques.

Below are some of the techniques of detecting outliers

● Boxplots
● Scatterplots
● Z-score
● Inter Quantile Range(IQR)

4.1 Detecting outliers using Boxplot:

It captures the summary of the data effectively and efficiently with only a simple
box and whiskers. Boxplot summarizes sample data using 25th, 50th, and 75th
percentiles. One can just get insights(quartiles, median, and outliers) into the dataset by
just looking at its boxplot.
Algorithm:
Step 1 : Import pandas and numpy libraries
import pandas as pd

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df

Step 4:Select the columns for boxplot and draw the boxplot.

col = ['math score', 'reading score' , 'writing

score','placement score']
df.boxplot(col)

Step 5: We can now print the outliers for each column with reference to the box plot.
print(np.where(df['math score']>90))

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

print(np.where(df['reading score']<25))
print(np.where(df['writing score']<30))

4.2 Detecting outliers using Scatterplot:

It is used when you have paired numerical data, or when your dependent variable
has multiple values for each reading independent variable, or when trying to determine
the relationship between the two variables. In the process of utilizing the scatter plot, one
can also use it for outlier detection.
To plot the scatter plot one requires two variables that are somehow related to
each other. So here Placement score and Placement count features are used.
Algorithm:
Step 1 : Import pandas , numpy and matplotlib libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Load the dataset in dataframe object df

df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df
Step 4: Draw the scatter plot with placement score and placement offer count
fig, ax = plt.subplots(figsize = (18,10))
ax.scatter(df['placement score'], df['placement offer
count'])
plt.show()
Labels to the axis can be assigned (Optional)
ax.set_xlabel('(Proportion non-retail business
acres)/(town)')
ax.set_ylabel('(Full-value property-tax rate)/(
$10,000)')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 5: We can now print the outliers with reference to scatter plot.
print(np.where((df['placement score']<50) & (df['placement
offer count']>1)))
print(np.where((df['placement score']>85) & (df['placement
offer count']<3)))

4.3 Detecting outliers using Z-Score:

Z-Score is also called a standard score. This value/score helps to
understand how far is the data point from the mean. And after setting up a
threshold value one can utilize z score values of data points to define the outliers.
Zscore = (data_point -mean) / std. deviation

Algorithm:
Step 1 : Import numpy and stats from scipy libraries
import numpy as np

from scipy import stats

Step 2: Calculate Z-Score for mathscore column

z = np.abs(stats.zscore(df['math score']))
Step 3: Print Z-Score Value. It prints the z-score values of each data item
of the column
print(z)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 4: Now to define an outlier threshold value is chosen.

threshold = 0.18
Step 5: Display the sample outliers
sample_outliers = np.where(z <threshold)
sample_outliers

4.4 Detecting outliers using Inter Quantile Range(IQR):

IQR (Inter Quartile Range) Inter Quartile Range approach to finding the
outliers is the most commonly used and most trusted approach used in the
research field.
IQR = Quartile3 – Quartile1
To define the outlier base value is defined above and below datasets
normal range namely Upper and Lower bounds, define the upper and the lower
bound (1.5*IQR value is considered) :

upper = Q3 +1.5*IQR
lower = Q1 – 1.5*IQR
In the above formula as according to statistics, the 0.5 scale-up of IQR
(new_IQR = IQR + 0.5*IQR) is taken.

Algorithm:
Step 1 : Import numpy library
import numpy as np

Step 2: Sort Reading Score feature and store it into sorted_rscore.

sorted_rscore= sorted(df['reading score'])
Step 3: Print sorted_rscore
sorted_rscore
Step 4: Calculate and print Quartile 1 and Quartile 3

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

q1 = np.percentile(sorted_rscore, 25)
q3 = np.percentile(sorted_rscore, 75)
print(q1,q3)

Step 5: Calculate value of IQR (Inter Quartile Range)

IQR = q3-q1
Step 6: Calculate and print Upper and Lower Bound to define the
outlier base value.
lwr_bound = q1-(1.5*IQR)
upr_bound = q3+(1.5*IQR)
print(lwr_bound, upr_bound)

Step 7: Print Outliers

r_outliers = []
for i in sorted_rscore:
if (i<lwr_bound or i>upr_bound):
r_outliers.append(i)
print(r_outliers)

3.2 Handling of Outliers:

For removing the outlier, one must follow the same process of removing an entry
from the dataset using its exact position in the dataset because in all the above methods of
detecting the outliers end result is the list of all those data items that satisfy the outlier
definition according to the method used.

Below are some of the methods of treating the outliers

● Trimming/removing the outlier

● Quantile based flooring and capping
● Mean/Median imputation

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Trimming/removing the outlier:

In this technique, we remove the outliers from the dataset. Although it is not a
good practice to follow.
new_df=df
for i in sample_outliers:
new_df.drop(i,inplace=True)
new_df

Here Sample_outliers are So instances with index 0,

12 ,16 and 17 are deleted.

● Quantile based flooring and capping:

In this technique, the outlier is capped at a certain value above the 90th percentile value
or floored at a factor below the 10th percentile value
df=pd.read_csv("/demo.csv")
df_stud=df
ninetieth_percentile = np.percentile(df_stud['math score'], 90)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

b = np.where(df_stud['math score']>ninetieth_percentile,
ninetieth_percentile, df_stud['math score'])
print("New array:",b)

df_stud.insert(1,"m score",b,True)
df_stud

● Mean/Median imputation:
As the mean value is highly influenced by the outliers, it is advised to replace the
outliers with the median value.
1. Plot the box plot for reading score
col = ['reading score']
df.boxplot(col)

2. Outliers are seen in box plot.

3. Calculate the median of reading score by using sorted_rscore

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

median=np.median(sorted_rscore)
median
4. Replace the upper bound outliers using median value
refined_df=df
refined_df['reading score'] = np.where(refined_df['reading
score'] >upr_bound, median,refined_df['reading score'])
5. Display redefined_df

6. Replace the lower bound outliers using median value

refined_df['reading score'] = np.where(refined_df['reading
score'] <lwr_bound, median,refined_df['reading score'])
7. Display redefined_df

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

8. Draw the box plot for redefined_df

col = ['reading score']
refined_df.boxplot(col)

4. Data Transformation for the purpose of :

Data transformation is the process of converting raw data into a format or structure that
would be more suitable for model building and also data discovery in general.The process
of data transformation can also be referred to as extract/transform/load (ETL). The
extraction phase involves identifying and pulling data from the various source systems
that create data and then moving the data to a single repository. Next, the raw data is
cleansed, if needed. It's then transformed into a target format that can be fed into
operational systems or into a data warehouse, a date lake or another repository for use in
business intelligence and analytics applications. The transformation The data are
transformed in ways that are ideal for mining the data. The data transformation involves
steps that are.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Smoothing: It is a process that is used to remove noise from the dataset using some
algorithms It allows for highlighting important features present in the dataset. It
helps in predicting the patterns
● Aggregation: Data collection or aggregation is the method of storing and presenting
data in a summary format. The data may be obtained from multiple data sources to
integrate these data sources into a data analysis description. This is a crucial step
since the accuracy of data analysis insights is highly dependent on the quantity and
quality of the data used.
● Generalization:It converts low-level data attributes to high-level data attributes
using concept hierarchy. For Example Age initially in Numerical form (22, 25) is
converted into categorical value (young, old).
● Normalization: Data normalization involves converting all data variables into a
given range. Some of the techniques that are used for accomplishing normalization
are:
○ Min–max normalization: This transforms the original data linearly.
○ Z-score normalization: In z-score normalization (or zero-mean normalization)
the values of an attribute (A), are normalized based on the mean of A and its
standard deviation.
○ Normalization by decimal scaling: It normalizes the values of an attribute by
changing the position of their decimal points
● Attribute or feature construction.
○ New attributes constructed from the given ones: Where new attributes are
created & applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
In this assignment , The purpose of this transformation should be one of the
following reasons:

a. To change the scale for better understanding (Attribute or feature

construction)
Here the Club_Join_Date is transferred to Duration.
Algorithm:
Step 1 : Import pandas and numpy libraries
SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS
Department of Computer Engineering Subject : DSBDAL

import pandas as pd
import numpy as np
Step 2: Load the dataset in dataframe object df
df=pd.read_csv("/content/demo.csv")
Step 3: Display the data frame
df

Step 3: Change the scale of Joining year to duration.

b. To decrease the skewness and convert distribution into normal distribution

(Normalization by decimal scaling)
Data Skewness: It is asymmetry in a statistical distribution, in which the curve
appears distorted or skewed either to the left or to the right. Skewness can be
quantified to define the extent to which a distribution differs from a normal
distribution.
Normal Distribution: In a normal distribution, the graph appears as a classical,
symmetrical “bell-shaped curve.” The mean, or average, and the mode, or
maximum point on the curve, are equal.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Positively Skewed Distribution

A positively skewed distribution means that the extreme data results are larger.
This skews the data in that it brings the mean (average) up. The mean will be
larger than the median in a Positively skewed distribution.
A negatively skewed distribution means the opposite: that the extreme data
results are smaller. This means that the mean is brought down, and the median is
larger than the mean in a negatively skewed distribution.

Reducing skewness A data transformation may be used to reduce skewness. A

distribution that is symmetric or nearly so is often easier to handle and interpret
than a skewed distribution. The logarithm, x to log base 10 of x, or x to log base e
of x (ln x), or x to log base 2 of x, is a strong transformation with a major effect
on distribution shape. It is commonly used for reducing right skewness and is
often appropriate for measured variables. It can not be applied to zero or negative
values.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Algorithm:
Step 1 : Detecting outliers using Z-Score for the Math_score variable and
remove the outliers.
Step 2: Observe the histogram for math_score variable.
import matplotlib.pyplot as plt
new_df['math score'].plot(kind = 'hist')
Step 3: Convert the variables to logarithm at the scale 10.
df['log_math'] = np.log10(df['math score'])

Step 4: Observe the histogram for math_score variable.

df['log_math'].plot(kind = 'hist')

It is observed that skewness is reduced at some level.

Conclusion: In this way we have explored the functions of the python library for Data
Identifying and handling the outliers. Data Transformations Techniques are explored with the
purpose of creating the new variable and reducing the skewness from datasets.
Assignment Question:
1. Explain the methods to detect the outlier.
2. Explain data transformation methods
3. Write the algorithm to display the statistics of Null values present in the dataset.
4. Write an algorithm to replace the outlier value with the mean of the variable.
.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 3
----------------------------------------------------------------------------------------------------------------
Contents for Theory:

1. Summary statistics

2. Types of Variables

3. Summary statistics of income grouped by the age groups

4. Display basic statistical details on the iris dataset.

---------------------------------------------------------------------------------------------------------------

1. Summary statistics:

● What is Statistics?

Statistics is the science of collecting data and analysing them to infer proportions (sample)
that are representative of the population. In other words, statistics is interpreting data in
order to make predictions for the population.
Branches of Statistics:
There are two branches of Statistics.
DESCRIPTIVE STATISTICS : Descriptive Statistics is a statistics or a measure that
describes the data.
INFERENTIAL STATISTICS : Using a random sample of data taken from a population to
describe and make inferences about the population is called Inferential Statistics.

Descriptive Statistics
Descriptive Statistics is summarising the data at hand through certain numbers like mean,
median etc. so as to make the understanding of the data easier. It does not involve any
generalisation or inference beyond what is available. This means that the descriptive
statistics are just the representation of the data (sample) available and not based on any
theory of probability.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Commonly Used Measures

1. Measures of Central Tendency
2. Measures of Dispersion (or Variability)

● Measures of Central Tendency

A Measure of Central Tendency is a one number summary of the data that
typically describes the centre of the data. This one number summary is of three
types.
a. Mean : Mean is defined as the ratio of the sum of all the observations in the data
to the total number of observations. This is also known as Average. Thus mean is
a number around which the entire data set is spread.
Consider the following data points.
17, 16, 21, 18, 15, 17, 21, 19, 11, 23

b. Median : Median is the point which divides the entire data into two equal
halves. One-half of the data is less than the median, and the other half is greater
than the same. Median is calculated by first arranging the data in either ascending
or descending order.
○ If the number of observations is odd, the median is given by the middle
observation in the sorted form.
○ If the number of observations are even, median is given by the mean of the
two middle observations in the sorted form.
An important point to note is that the order of the data (ascending or
descending) does not affect the median.

To calculate Median, let's arrange the data in ascending order.

11, 15, 16, 17, 17, 18, 19, 21, 21, 23

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Since the number of observations is even (10), median is given by the

average of the two middle observations (5th and 6th here).

c. Mode : Mode is the number which has the maximum frequency in the entire data
set, or in other words,mode is the number that appears the maximum number of
times. A data can have one or more than one mode.

● If there is only one number that appears the maximum number of times,
the data has one mode, and is called Uni-modal.
● If there are two numbers that appear the maximum number of times, the
data has two modes, and is called Bi-modal.
● If there are more than two numbers that appear the maximum number of
times, the data has more than two modes, and is called Multi-modal.

Consider the following data points.

17, 16, 21, 18, 15, 17, 21, 19, 11, 23

Mode is given by the number that occurs the maximum number of times.
Here, 17 and 21 both occur twice. Hence, this is a Bimodal data and the modes
are 17 and 21.

● Measures of Dispersion (or Variability)

Measures of Dispersion describes the spread of the data around the central value (or the
Measures of Central Tendency)

1. Absolute Deviation from Mean — The Absolute Deviation from Mean, also
called Mean Absolute Deviation (MAD), describes the variation in the data set, in
the sense that it tells the average absolute distance of each data point in the set. It
is calculated as

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

2. Variance — Variance measures how far are data points spread out from the mean.
A high variance indicates that data points are spread widely and a small variance
indicates that the data points are closer to the mean of the data set. It is calculated
as

3. Standard Deviation — The square root of Variance is called the Standard

Deviation. It is calculated as

4. Range — Range is the difference between the Maximum value and the Minimum
value in the data set. It is given as

5. Quartiles — Quartiles are the points in the data set that divides the data set into
four equal parts. Q1, Q2 and Q3 are the first, second and third quartile of the data
set.

● 25% of the data points lie below Q1 and 75% lie above it.

● 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing but
Median.

● 75% of the data points lie below Q3 and 25% lie above it.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

6. Skewness — The measure of asymmetry in a probability distribution is defined

by Skewness. It can either be positive, negative or undefined.

Positive Skew — This is the case when the tail on the right side of the curve is
bigger than that on the left side. For these distributions, mean is greater than the
mode.

Negative Skew — This is the case when the tail on the left side of the curve is
bigger than that on the right side. For these distributions, mean is smaller than the
mode.

The most commonly used method of calculating Skewness is

If the skewness is zero, the distribution is symmetrical. If it is negative, the

distribution is Negatively Skewed and if it is positive, it is Positively Skewed.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Python Code:
1. Mean
To find mean of all columns
Syntax:
df.mean()
Output:

To find mean of specific column

Syntax:
df.loc[:,'Age'].mean()
Output:
38.85

To find mean row wise

Syntax:
df.mean(axis=1)[0:4]
Output:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

2. Median
To find median of all columns
Syntax:
df.median()
Output:

To find median of specific column

Syntax:
df.loc[:,'Age'].median()
Output:
36.0
To find median row wise
Syntax:
df.median(axis=1)[0:4]
Output:

3. Mode
To find mode of all columns
Syntax:
df.mode()
Output:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

In the Genre Column mode is Female, for column Age mode is 32 etc. If a
particular column does not have mode all the values will be displayed in
the column.
To find the mode of a specific column.
Syntax:
df.loc[:,'Age'].mode()
Output:
32

4. Minimum
To find minimum of all columns
Syntax:
df.min()
Output:

To find minimum of Specific column

Syntax:
df.loc[:,'Age'].min(skipna = False)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Output:
18

5. Maximum
To find Maximum of all columns
Syntax:
df.max()
Output:

To find Maximum of Specific column

Syntax:
df.loc[:,'Age'].max(skipna = False)
Output:
70

6. Standard Deviation
To find Standard Deviation of all columns
Syntax:
df.std()
Output:

To find Standard Deviation of specific column

Syntax:
df.loc[:,'Age'].std()
Output:
13.969007331558883

To find Standard Deviation row wise

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Syntax:
df.std(axis=1)[0:4]

Output:

2. Types of Variables:

A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.

Variables may be classified into two main categories:

● Categorical and
● Numeric.

Each category is then classified in two subcategories: nominal or ordinal for categorical
variables, discrete or continuous for numeric variables.

● Categorical variables

A categorical variable (also called qualitative variable) refers to a characteristic

that can’t be quantifiable.
Categorical variables can be either nominal or ordinal.
○ Nominal Variable
A nominal variable is one that describes a name, label or category without natural
order. In the given table, the variable “mode of transportation for travel to work”
is also nominal.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

○ Ordinal Variable
An ordinal variable is a variable whose values are defined by an order relation
between the different categories. In the following table, the variable “behaviour”
is ordinal because the category “Excellent” is better than the category “Very
good,” which is better than the category “Good,” etc. There is some natural
ordering, but it is limited since we do not know by how much “Excellent”
behaviour is better than “Very good” behaviour.

● Numerical Variables
A numeric variable (also called quantitative variable) is a quantifiable characteristic
whose values are numbers (except numbers which are codes standing up for categories).
Numeric variables may be either continuous or discrete.

○ Continuous variables

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

A variable is said to be continuous if it can assume an infinite number of real

values within a given interval.
For instance, consider the height of a student. The height can’t take any
values. It can’t be negative and it can’t be higher than three metres. But between 0
and 3, the number of possible values is theoretically infinite. A student may be
1.6321748755 … metres tall.

○ Discrete variables
As opposed to a continuous variable, a discrete variable can assume only a finite
number of real values within a given interval.
An example of a discrete variable would be the score given by a judge to a
gymnast in competition: the range is 0 to 10 and the score is always given to one
decimal (e.g. a score of 8.5)

3. Summary statistics of income grouped by the age groups

Problem Statement: For example, if your categorical variable is age groups and
quantitative variable is income, then provide summary statistics of income grouped by the age
groups. Create a list that contains a numeric value for each response to the categorical
variable.

Categorical Variable: Genre

Quantitative Variable : Age
Syntax:
df.groupby(['Genre'])['Age'].mean()
Output:

Categorical Variable: Genre

Quantitative Variable : Income
Syntax:
df_u=df.rename(columns= {'Annual Income
k$)':'Income'},inplace=False)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

(df_u.groupby(['Genre']).Income.mean())
Output:

To create a list that contains a numeric value for each response to the categorical variable.
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df[['Genre']]).toarray())
enc_df

To concat numerical list to dataframe

df_encode =df_u.join(enc_df)
df_encode

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

4. Display basic statistical details on the iris dataset.

Algorithm:
1. Import Pandas Library
2. The dataset is downloaded from UCI repository.
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
3. Assign Column names
col_names =
['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Species']
4. Load Iris.csv into a Pandas data frame
iris = pd.read_csv(csv_url, names = col_names)
5. Load all rows with Iris-setosa species in variable irisSet
irisSet = (iris['Species']== 'Iris-setosa')

6. To display basic statistical details like percentile, mean,standard deviation etc. for

Iris-setosa use describe

print('Iris-setosa')

print(iris[irisSet].describe())

7. Load all rows with Iris-versicolor species in variable irisVer

irisVer = (iris['Species']== 'Iris-versicolor')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

8. To display basic statistical details like percentile, mean,standard deviation etc. for
Iris-versicolor use describe

print('Iris-versicolor')

print(iris[irisVer].describe())

9. Load all rows with Iris-virginica species in variable irisVir

irisVir = (iris['Species']== 'Iris-virginica')

10. To display basic statistical details like percentile, mean,standard deviation etc. for
Iris-virginica use describe

print('Iris-virginica')

print(iris[irisVir].describe())

Conclusion:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Descriptive statistics summarises or describes the characteristics of a data set. Descriptive

statistics consists of two basic categories of measures:

● measures of central tendency and

● measures of variability (or spread).

Measures of central tendency describe the centre of a data set. It includes the
mean, median, and mode.

Measures of variability or spread describe the dispersion of data within the set and
it includes standard deviation, variance, minimum and maximum variables.

Assignment Questions:
1. Explain Measures of Central Tendency with examples.
2. What are the different types of variables? Explain with examples.
3. Which method is used to statistic the dataframe? write the code.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 4
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Create a Linear Regression Model using Python/R to predict
home prices using Boston Housing Dataset (https://www.kaggle.com/c/boston-housing).
The Boston Housing dataset contains information about various houses in Boston through
different parameters. There are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to data analysis using liner regression
using Python for any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2.Concept of Regresion.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Linear Regression : Univariate and Multivariate
2. Least Square Method for Linear Regression
3. Measuring Performance of Linear Regression
4. Example of Linear Regression
5. Training data set and Testing data set

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

---------------------------------------------------------------------------------------------------------------
1. Linear Regression: It is a machine learning algorithm based on supervised learning. It
targets prediction values on the basis of independent variables.
● It is preferred to find out the relationship between forecasting and variables.
● A linear relationship between a dependent variable (X) is continuous; while
independent variable(Y) relationship may be continuous or discrete. A linear
relationship should be available in between predictor and target variable so known
as Linear Regression.
● Linear regression is popular because the cost function is Mean Squared Error
(MSE) which is equal to the average squared difference between an observation’s
actual and predicted values.
● It is shown as an equation of line like :
Y = m*X + b + e
Where : b is intercepted, m is slope of the line and e is error term.
This equation can be used to predict the value of target variable Y based on given
predictor variable(s) X, as shown in Fig. 1.

Fig. 1: geometry of linear regression

● Fig. 2 shown below is about the relation between weight (in Kg) and height (in
cm), a linear relation. It is an approach of studying in a statistical manner to
summarise and learn the relationships among continuous (quantitative) variables.
● Here a variable, denoted by ‘x’ is considered as the predictor, explanatory, or
independent variable.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Another variable, denoted ‘y’, is considered as the response, outcome, or

dependent variable. While "predictor" and "response" used to refer to these
variables.
● Simple linear regression technique concerned with the study of only one predictor
variable.

Fig.2 : Relation between weight (in Kg) and height (in cm)

MultiVariate Regression :It concerns the study of two or more predictor variables.
Usually a transformation of the original features into polynomial features from a given
degree is preferred and further Linear Regression is applied on it.
● A simple linear model Y = a + bX is in original feature will be transformed into
polynomial feature is transformed and further a linear regression applied to it and
it will be something like
Y=a + bX + cX2
● If a high degree value is used in transformation the curve becomes over-fitted as it
captures the noise from data as well.

2. Least Square Method for Linear Regression

● Linear Regression involves establishing linear relationships between dependent and
independent variables. Such a relationship is portrayed in the form of an equation also
known as the linear model.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● A simple linear model is the one which involves only one dependent and one independent
variable. Regression Models are usually denoted in Matrix Notations.
● However, for a simple univariate linear model, it can be denoted by the regression
equation

𝑦 =β 0
+ β 1
𝑥 (1)

where 𝑦 is the dependent or the response variable

𝑥 is the independent or the input variable
β 0
is the value of y when x=0 or the y intercept

β 1
is the value of slope of the line ε is the error or the noise

● This linear equation represents a line also known as the ‘regression line’. The least square
estimation technique is one of the basic techniques used to guess the values of the
parameters and based on a sample set.
● This technique estimates parameters β
0
and β 1and by trying to minimise the square

of errors at all the points in the sample set. The error is the deviation of the actual sample
● data point from the regression line. The technique can be represented by the equation.
𝑛
2
𝑚𝑖𝑛 ∑ (𝑦 − 𝑦) (2)
𝑖=0

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Using differential calculus on equation 1 we can find the values of β

0
and β 1
such

that the sum of squares (that is equation 2) is minimum.

𝑛 𝑛 2

β 1
= ∑ (𝑥 𝑖
− 𝑥 ) (𝑦 𝑖
− 𝑦 )/ ∑ (𝑥 𝑖−
𝑥) (3)
𝑖=1 𝑖=1

β 0
= 𝑦 −β 1
𝑥 (4)

Once the Linear Model is estimated using equations (3) and (4), we can estimate the
value of the dependent variable in the given range only. Going outside the range is called
extrapolation which is inaccurate if simple regression techniques are used.
3. Measuring Performance of Linear Regression
Mean Square Error:
The Mean squared error (MSE) represents the error of the estimator or predictive model
created based on the given set of observations in the sample. Two or more regression
models created using a given sample data can be compared based on their MSE. The
lesser the MSE, the better the regression model is. When the linear regression model is
trained using a given set of observations, the model with the least mean sum of squares
error (MSE) is selected as the best model. The Python or R packages select the best-fit
model as the model with the lowest MSE or lowest RMSE when training the linear
regression models.
Mathematically, the MSE can be calculated as the average sum of the squared difference
between the actual value and the predicted or estimated value represented by the
regression model (line or plane).

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

An MSE of zero (0) represents the fact that the predictor is a perfect predictor.
RMSE:
Root Mean Squared Error method that basically calculates the least-squares error and takes a
root of the summed values.
Mathematically speaking, Root Mean Squared Error is the square root of the sum of all errors
divided by the total number of values. This is the formula to calculate RMSE

RMSE - Least Squares Regression Method - Edureka

R-Squared :

R-Squared is the ratio of the sum of squares regression (SSR) and the sum of squares total
(SST).
SST : total sum of squares (SST), regression sum of squares (SSR), Sum of square of errors
(SSE) are all showing the variation with different measures.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

A value of R-squared closer to 1 would mean that the regression model covers most part
of the variance of the values of the response variable and can be termed as a good
model.
One can alternatively use MSE or R-Squared based on what is appropriate and the need of the
hour. However, the disadvantage of using MSE rather than R-squared is that it will be difficult
to gauge the performance of the model using MSE as the value of MSE can vary from 0 to
any larger number. However, in the case of R-squared, the value is bounded between 0 and .
4. Example of Linear Regression
Consider following data for 5 students.
Each Xi (i = 1 to 5) represents the score of ith student in standard X and corresponding
Yi (i = 1 to 5) represents the score of ith student in standard XII.
(i) Linear regression equation best predicts standard XIIth score
(ii) Interpretation for the equation of Linear Regression
(iii) If a student's score is 80 in std X, then what is his expected score in XII standard?

Student Score in X standard (Xi) Score in XII standard (Yi)

1 95 85
2 85 95
3 80 70
4 70 65
5 60 70

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

x y 𝑥 −𝑥 𝑦 −𝑦 (𝑥 −𝑥 )2 (𝑥 −𝑥 )(𝑦 − 𝑦 )
95 85 17 8 289 136
85 95 7 18 49 126
80 70 2 -7 4 -14
70 65 -8 -12 64 96
60 70 -18 -7 324 126
𝑥 = 78 𝑦= 77 ε (𝑥 −𝑥 ) = 730
2
ε (𝑥 −𝑥 )(𝑦 − 𝑦 ) = 470

(i) linear regression equation that best predicts standard XIIth score

𝑦 =β 0
+ β 1
𝑥

𝑛 𝑛 2

β 1
= ∑ (𝑥 𝑖
− 𝑥 ) (𝑦 𝑖
− 𝑦 )/ ∑ (𝑥 𝑖−
𝑥)
𝑖=1 𝑖=1

β 1
= 470/730 = 0. 644

β 0
= 𝑦 −β 1
𝑥

β 0
= 77 − (0. 644 * 78) = 26. 768

𝑦 = 26. 76 + 0. 644 𝑥

(ii) Interpretation of the regression line.

Interpretation 1

For an increase in value of x by 0.644 units there is an increase in value of y in one unit.

Interpretation 2

Even if x = 0 value of independent variable, it is expected that value of y is 26.768

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Score in XII standard (Yi) is 0.644 units depending on Score in X standard (Xi) but other
factors will also contribute to the result of XII standard by 26.768 .

(iii) If a student's score is 65 in std X, then his expected score in XII standard is 78.288

For x = 80 the y value will be

𝑦 = 26. 76 + 0. 644 * 65 = 68. 38

5. Training data set and Testing data set

● Machine Learning algorithm has two phases

1. Training and 2. Testing.
● The input of the training phase is training data, which is passed to any machine learning
algorithm and machine learning model is generated as output of the training phase.
● The input of the testing phase is test data, which is passed to the machine learning model
and prediction is done to observe the correctness of mode.

Fig. 1.3.1 : Training and Testing Phase in Machine Learning

(a) Training Phase
● Training dataset is provided as input to this phase.
● Training dataset is a dataset having attributes and class labels and used for training
Machine Learning algorithms to prepare models.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Machines can learn when they observe enough relevant data. Using this one can model
algorithms to find relationships, detect patterns, understand complex problems and make
decisions.
● Training error is the error that occurs by applying the model to the same data from which
the model is trained.
● In a simple way the actual output of training data and predicted output of the model does
not match the training error Ein is said to have occurred.
● Training error is much easier to compute.
(b) Testing Phase
● Testing dataset is provided as input to this phase.
● Test dataset is a dataset for which class label is unknown. It is tested using model
● A test dataset used for assessment of the finally chosen model.
● Training and Testing dataset are completely different.
● Testing error is the error that occurs by assessing the model by providing the unknown
data to the model.
● In a simple way the actual output of testing data and predicted output of the model does
not match the testing error Eout is said to have occurred.
● E out is generally observed larger than Ein.
(c) Generalization
● Generalization is the prediction of the future based on the past system.
● It needs to generalize beyond the training data to some future data that it might not have
seen yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never
seen.
● In general, the dataset is divided into two partition training and test sets.
● The fit method is called on the training set to build the model.
● This fit method is applied to the model on the test set to estimate the target value and
evaluate the model's performance.
● The reason the data is divided into training and test sets is to use the test set to estimate
how well the model trained on the training data and how well it would perform on the
unseen data.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Algorithm (Synthesis Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Create a Dataframe with Dependent Variable(x) and independent variable y.
x=np.array([95,85,80,70,60])
y=np.array([85,95,70,65,70])
Step 3 : Create Linear Regression Model using Polyfit Function:
model= np.polyfit(x, y, 1)
Step 4: Observe the coefficients of the model.
model
Output:
array([ 0.64383562, 26.78082192])
Step 5: Predict the Y value for X and observe the output.
predict = np.poly1d(model)
predict(65)
Output:
68.63
Step 6: Predict the y_pred for all values of x.
y_pred= predict(x)
y_pred
Output:
array([81.50684932, 87.94520548, 71.84931507, 68.63013699, 71.84931507])

Step 7: Evaluate the performance of Model (R-Suare)

R squared calculation is not implemented in numpy… so that one should be borrowed
from sklearn.
from sklearn.metrics import r2_score
r2_score(y, y_pred)
Output:
0.4803218090889323
Step 8: Plotting the linear regression model
y_line = model[1] + model[0]* x
plt.plot(x, y_line, c = 'r')
plt.scatter(x, y_pred)
plt.scatter(x,y,c='r')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Output:

Algorithm (Boston Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 2: Import the Boston Housing dataset
from sklearn.datasets import load_boston
boston = load_boston()
Step 3: Initialize the data frame
data = pd.DataFrame(boston.data)
Step 4: Add the feature names to the dataframe
data.columns = boston.feature_names
data.head()
Step 5: Adding target variable to dataframe
data['PRICE'] = boston.target
Step 6: Perform Data Preprocessing( Check for missing values)
data.isnull().sum()
Step 7: Split dependent variable and independent variables
x = data.drop(['PRICE'], axis = 1)
y = data['PRICE']
Step 8: splitting data to training and testing dataset.
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest =
train_test_split(x, y, test_size =0.2,random_state = 0)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 9: Use linear regression( Train the Machine ) to Create Model

import sklearn
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
model=lm.fit(xtrain, ytrain)
Step 10: Predict the y_pred for all values of train_x and test_x
ytrain_pred = lm.predict(xtrain)
ytest_pred = lm.predict(xtest)
Step 11:Evaluate the performance of Model for train_y and test_y
df=pd.DataFrame(ytrain_pred,ytrain)
df=pd.DataFrame(ytest_pred,ytest)
Step 12: Calculate Mean Square Paper for train_y and test_y
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
mse = mean_squared_error(ytrain_pred,ytrain)
print(mse)
Output:
33.44897999767638
mse = mean_squared_error(ytest, ytest_pred)
print(mse)
Output:
19.32647020358573
Step 13: Plotting the linear regression model
lt.scatter(ytrain ,ytrain_pred,c='blue',marker='o',label='Training data')
plt.scatter(ytest,ytest_pred ,c='lightgreen',marker='s',label='Test data')
plt.xlabel('True values')
plt.ylabel('Predicted')
plt.title("True value vs Predicted value")
plt.legend(loc= 'upper left')
#plt.hlines(y=0,xmin=0,xmax=50)
plt.plot()
plt.show()

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Conclusion:
In this way we have done data analysis using linear regression for Boston Dataset and
predict the price of houses using the features of the Boston Dataset.
Assignment Question:
1) Compute SST, SSE, SSR, MSE, RMSE, R Square for the below example .

Student Score in X standard (Xi) Score in XII standard (Yi)

1 95 85
2 85 95
3 80 70
4 70 65
5 60 70

2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code to calculate the RSquare for Boston Dataset.
(Consider the linear regression model created in practical session)
.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 5
----------------------------------------------------------------------------------------------------------------
Title of the Assignment:
1. Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset..
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to data analysis using logistic
regression using Python for any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2.Concept of Regression.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Logistic Regression
2. Differentiate between Linear and Logistic Regression
3. Sigmoid Function
4. Types of LogisticRegression
5. Confusion Matrix Evaluation Metrics

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

---------------------------------------------------------------------------------------------------------------

1. Logistic Regression: Classification techniques are an essential part of machine learning

and data mining applications. Approximately 70% of problems in Data Science are
classification problems. There are lots of classification problems that are available, but
logistic regression is common and is a useful regression method for solving the binary
classification problem. Another category of classification is Multinomial classification,
which handles the issues where multiple classes are present in the target variable. For
example, the IRIS dataset is a very famous example of multi-class classification. Other
examples are classifying article/blog/document categories.

Logistic Regression can be used for various classification problems such as spam
detection. Diabetes prediction, if a given customer will purchase a particular product or
will they churn another competitor, whether the user will click on a given advertisement
link or not, and many more examples are in the bucket.

Logistic Regression is one of the most simple and commonly used Machine Learning
algorithms for two-class classification. It is easy to implement and can be used as the
baseline for any binary classification problem. Its basic fundamental concepts are also
constructive in deep learning. Logistic regression describes and estimates the relationship
between one dependent binary variable and independent variables.

Logistic regression is a statistical method for predicting binary classes. The outcome or

target variable is dichotomous in nature. Dichotomous means there are only two possible

classes. For example, it can be used for cancer detection problems. It computes the

probability of an event occurring.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

It is a special case of linear regression where the target variable is categorical in nature. It

uses a log of odds as the dependent variable. Logistic Regression predicts the probability

of occurrence of a binary event utilising a logit function.

Linear Regression Equation:

Where, y is a dependent variable and x1, x2 ... and Xn are explanatory variables.
Sigmoid Function:

Apply Sigmoid function on linear regression:

2. Differentiate between Linear and Logistic Regression

Linear regression gives you a continuous output, but logistic regression provides a constant
output. An example of the continuous output is house price and stock price. Example's of the
discrete output is predicting whether a patient has cancer or not, predicting whether the customer
will churn. Linear regression is estimated using Ordinary Least Squares (OLS) while logistic
regression is estimated using Maximum Likelihood Estimation (MLE) approach.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

3. Sigmoid Function
The sigmoid function, also called logistic function, gives an ‘S’ shaped curve that can take any
real-valued number and map it into a value between 0 and 1. If the curve goes to positive infinity,
y predicted will become 1, and if the curve goes to negative infinity, y predicted will become 0.
If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES,
and if it is less than 0.5, we can classify it as 0 or NO. The outputcannotFor example: If the
output is 0.75, we can say in terms of probability as: There is a 75 percent chance that a patient
will suffer from cancer.

4. Types of LogisticRegression
Binary Logistic Regression: The target variable has only two possible outcomes such as
Spam or Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal
categories such as predicting the type of Wine.
Ordinal Logistic Regression: the target variable has three or more ordinal categories
such as restaurant or product rating from 1 to 5.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

5. Confusion Matrix Evaluation Metrics

Contingency table or Confusion matrix is often used to measure the performance of classifiers. A
confusion matrix contains information about actual and predicted classifications done by a
classification system. Performance of such systems is commonly evaluated using the data in the
matrix.

The following table shows the confusion matrix for a two class classifier.

Here each row indicates the actual classes recorded in the test data set and the each column indicates the
classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal concerns
prediction errors.

Some Important measures derived from confusion matrix are:

● Number of positive (Pos) : Total number instances which are labelled as positive in a given
dataset.
● Number of negative (Neg) : Total number instances which are labelled as negative in a given
dataset.
● Number of True Positive (TP) : Number of instances which are actually labelled as positive
and the predicted class by classifier is also positive.
● Number of True Negative (TN) : Number of instances which are actually labelled as negative
and the predicted class by classifier is also negative.
● Number of False Positive (FP) : Number of instances which are actually labelled as negative
and the predicted class by classifier is positive.
● Number of False Negative (FN): Number of instances which are actually labelled as positive
and the class predicted by the classifier is negative.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Accuracy: Accuracy is calculated as the number of correctly classified instances divided by total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positive
and true negative (TP + TN) divided by the total number of instances.

𝑇𝑃+𝑇𝑁 𝑇𝑃+𝑇𝑁
𝑎𝑐𝑐 = 𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁
= 𝑃𝑜𝑠+𝑁𝑒𝑔

● Error Rate: Error Rate is calculated as the number of incorrectly classified instances divided
by total number of instances.
The ideal value of accuracy is 0, and the worst is 1. It is also calculated as the sum of false
positive and false negative (FP + FN) divided by the total number of instances.
𝐹𝑃+𝐹𝑁 𝐹𝑃+𝐹𝑁
𝑒𝑟𝑟 = 𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁
= 𝑃𝑜𝑠+𝑁𝑒𝑔
Or

𝑒𝑟𝑟 = 1 − 𝑎𝑐𝑐
● Precision: It is calculated as the number of correctly classified positive instances divided by the
total number of instances which are predicted positive. It is also called confidence value. The
ideal value is 1, whereas the worst is 0.

𝑇𝑃
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃+𝐹𝑃

● Recall: .It is calculated as the number of correctly classified positive instances divided by the
total number of positive instances. It is also called recall or sensitivity. The ideal value of
sensitivity is 1, whereas the worst is 0.

It is calculated as the number of correctly classified positive instances divided by the total number
of positive instances.

𝑇𝑃
𝑟𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃+𝐹𝑁

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Algorithm (Boston Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
Step 2: Import the Social_Media_Adv Dataset
Step 3: Initialize the data frame
Step 4: Perform Data Preprocessing
● Convert Categorical to Numerical Values if applicable
● Check for Null Value
● Covariance Matrix to select the most promising features
● Divide the dataset into Independent(X) and
Dependent(Y)variables.
● Split the dataset into training and testing datasets
● Scale the Features if necessary.

Step 5: Use Logistic regression( Train the Machine ) to Create Model

# import the class
from sklearn.linear_model import LogisticRegression
# instantiate the model (using the default parameters)
logreg = LogisticRegression()
# fit the model with data
logreg.fit(xtrain,ytrain)
# y_pred=logreg.predict(xtest)

Step 6: Predict the y_pred for all values of train_x and test_x

Step 7:Evaluate the performance of Model for train_y and test_y

Step 8: Calculate the required evaluation parameters

from sklearn.metrics import
precision_score,confusion_matrix,accuracy_score,recall_score
cm= confusion_matrix(ytest, y_pred)

Conclusion:
In this way we have done data analysis using logistic regression for Social Media Adv. and
evaluate the performance of model.
Value Addition: Visualising Confusion Matrix using Heatmap

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Assignment Question:
1) Consider the binary classification task with two classes positive and negative.
Find out TP,TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall

2) Comment on whether the model is best fit or not based on the calculated values.
3) Write python code for the preprocessing mentioned in step 4. and Explain every
step in detail.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 6
----------------------------------------------------------------------------------------------------------------
Title of the Assignment:
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to data analysis using Naïve Bayes
classification algorithm using Python for any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2.Concept of Join and Marginal Probability.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Concepts used in Naïve Bayes classifier
2. Naive Bayes Example
3. Confusion Matrix Evaluation Metrics
---------------------------------------------------------------------------------------------------------------
1. Concepts used in Naïve Bayes classifier

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Naïve Bayes Classifier can be used for Classification of categorical data.

○ Let there be a ‘j’ number of classes. C={1,2,….j}
○ Let, input observation is specified by ‘P’ features. Therefore input observation x
is given , x = {F1,F2,…..Fp}
○ The Naïve Bayes classifier depends on Bayes' rule from probability theory.
● Prior probabilities: Probabilities which are calculated for some event based on no other
information are called Prior probabilities.

For example, P(A), P(B), P(C) are prior probabilities because while calculating P(A),
occurrences of event B or C are not concerned i.e. no information about occurrence of
any other event is used.

Conditional Probabilities:

From equation (1) and (2) ,

Is called the Bayes Rule.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

2. Example of Naive Bayes

We have a dataset with some features Outlook, Temp, Humidity, and Windy, and the
target here is to predict whether a person or team will play tennis or not.

Conditional Probability
Here, we are predicting the probability of class1 and class2 based on the given condition. If I try
to write the same formula in terms of classes and features, we will get the following equation

Now we have two classes and four features, so if we write this formula for class C1, it will be

something like this.

Here, we replaced Ck with C1 and X with the intersection of X1, X2, X3, X4. You might have a

question, It’s because we are taking the situation when all these features are present at the same

time.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

The Naive Bayes algorithm assumes that all the features are independent of each other or in other

words all the features are unrelated. With that assumption, we can further simplify the above

formula and write it in this form

This is the final equation of the Naive Bayes and we have to calculate the probability of both C1

and C2.For this particular example.

P (N0 | Today) > P (Yes | Today) So, the prediction that golf would be played is ‘No’.

Algorithm (Iris Dataset):

Step 1: Import libraries and create alias for Pandas, Numpy and Matplotlib
Step 2: Import the Iris dataset by calling URL.
Step 3: Initialize the data frame
Step 4: Perform Data Preprocessing
● Convert Categorical to Numerical Values if applicable
● Check for Null Value

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Divide the dataset into Independent(X) and

Dependent(Y)variables.
● Split the dataset into training and testing datasets
● Scale the Features if necessary.

Step 5: Use Naive Bayes algorithm( Train the Machine ) to Create Model
# import the class
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)

Step 6: Predict the y_pred for all values of train_x and test_x
Y_pred = gaussian.predict(X_test)

Step 7:Evaluate the performance of Model for train_y and test_y

accuracy = accuracy_score(y_test,Y_pred)
precision =precision_score(y_test, Y_pred,average='micro')
recall = recall_score(y_test, Y_pred,average='micro')

Step 8: Calculate the required evaluation parameters

from sklearn.metrics import
precision_score,confusion_matrix,accuracy_score,recall_score
cm = confusion_matrix(y_test, Y_pred)

Conclusion:
In this way we have done data analysis using Naive Bayes Algorithm for Iris dataset and
evaluated the performance of the model.

Value Addition: Visualising Confusion Matrix using Heatmap

Assignment Question:
1) Consider the observation for the car theft scenario having 3 attributes colour, Type and
origin.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Find the probability of car theft having scenarios Red SUV and Domestic.

2) Write python code for the preprocessing mentioned in step 4. and Explain every step in
detail.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 7
----------------------------------------------------------------------------------------------------------------
Title of the Assignment:
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document
Frequency.
----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform Text Analysis using TF
IDF Algorithm
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2. Basic of English language.
---------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Basic concepts of Text Analytics
2. Text Analysis Operations using natural language toolkit
3. Text Analysis Model using TF-IDF.
4. Bag of Words (BoW)
---------------------------------------------------------------------------------------------------------------

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

1. Basic concepts of Text Analytics

One of the most frequent types of day-to-day conversion is text communication. In our
everyday routine, we chat, message, tweet, share status, email, create blogs, and offer
opinions and criticism. All of these actions lead to a substantial amount of unstructured
text being produced. It is critical to examine huge amounts of data in this sector of the
online world and social media to determine people's opinions.

Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text Mining processes the text itself, while NLP
processes with the underlying metadata. Finding frequency counts of words, length of the
sentence, presence/absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment,
finding entities in the sentence, and category of blog/article. Text mining is preprocessed
data for text analytics. In Text Analytics, statistical and machine learning algorithms are
used to classify information.

2. Text Analysis Operations using natural language toolkit

NLTK(natural language toolkit) is a leading platform for building Python programs to

work with human language data. It provides easy-to-use interfaces and lexical resources
such as WordNet, along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning and many more.
Analysing movie reviews is one of the classic examples to demonstrate a simple NLP
Bag-of-words model, on movie reviews.
2.1. Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text
paragraph into smaller chunks such as words or sentences is called Tokenization.
Token is a single entity that is the building blocks for a sentence or paragraph.

● Sentence tokenization : split a paragraph into list of sentences using

sent_tokenize() method

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Word tokenization : split a sentence into list of words using word_tokenize()

method

2.2. Stop words removal

Stopwords considered as noise in the text. Text may contain stop words such as is,
am, are, this, a, an, the, etc. In NLTK for removing stopwords, you need to create
a list of stopwords and filter out your list of tokens from these words.

2.3. Stemming and Lemmatization

Stemming is a normalization technique where lists of tokenized words are
converted into shortened root words to remove redundancy. Stemming is the
process of reducing inflected (or sometimes derived) words to their word stem,
base or root form.
A computer program that stems word may be called a stemmer.
E.g.
A stemmer reduces the words like fishing, fished, and fisher to the stem fish.
The stem need not be a word, for example the Porter algorithm reduces, argue,
argued, argues, arguing, and argus to the stem argu .

Lemmatization in NLTK is the algorithmic process of finding the lemma of a

word depending on its meaning and context. Lemmatization usually refers to the
morphological analysis of words, which aims to remove inflectional endings. It
helps in returning the base or dictionary form of a word known as the lemma.
Eg. Lemma for studies is study

Lemmatization Vs Stemming

Stemming algorithm works by cutting the suffix from the word. In a broader sense
cuts either the beginning or end of the word.

On the contrary, Lemmatization is a more powerful operation, and it takes into

consideration morphological analysis of the words. It returns the lemma which is
the base form of all its inflectional forms. In-depth linguistic knowledge is

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

required to create dictionaries and look for the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent operation
where the proper form will be looked in the dictionary. Hence, lemmatization
helps in forming better machine learning features.

2.4. POS Tagging

POS (Parts of Speech) tell us about grammatical information of words of the
sentence by assigning specific token (Determiner, noun, adjective , adverb ,
verb,Personal Pronoun etc.) as tag (DT,NN ,JJ,RB,VB,PRP etc) to each words.
Word can have more than one POS depending upon the context where it is used.
We can use POS tags as statistical NLP tasks. It distinguishes a sense of word
which is very helpful in text realization and infer semantic information from text
for sentiment analysis.

3. Text Analysis Model using TF-IDF.

Term frequency–inverse document frequency(TFIDF) , is a numerical statistic
that is intended to reflect how important a word is to a document in a collection or
corpus.
● Term Frequency (TF)
It is a measure of the frequency of a word (w) in a document (d). TF is defined as
the ratio of a word’s occurrence in a document to the total number of words in a
document. The denominator term in the formula is to normalize since all the
corpus documents are of different lengths.

Example:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

The initial step is to make a vocabulary of unique words and calculate TF for each
document. TF will be more for words that frequently appear in a document and
less for rare words in a document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as’ of’, ‘and’, etc. can be
most frequently present but are of little significance. IDF provides weightage to
each word based on its frequency in the corpus D.

In our example, since we have two documents in the corpus, N=2.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● Term Frequency — Inverse Document Frequency (TFIDF)

It is the product of TF and IDF.
TFIDF gives more weightage to the word that is rare in the corpus (all the documents).
TFIDF provides more importance to the word that is more frequent in the document.

After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln can
result in high IDF for some words, thereby dominating the TFIDF. We don’t want that, and
therefore, we use ln so that the IDF should not completely dominate the TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms, but
TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if the
vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text must be
converted into vectors of numbers. In natural language processing, a common technique
for extracting features from text is to place all of the words that occur in the text in a

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

bucket. This approach is called a bag of words model or BoW for short. It’s referred to
as a “bag” of words because any information about the structure of the sentence is lost.

Algorithm for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
Step 1: Download the required packages
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
Step 2: Initialize the text
text= "Tokenization is the first step in text analytics. The
process of breaking down a text paragraph into smaller chunks
such as words or sentences is called Tokenization."
Step 3: Perform Tokenization
#Sentence Tokenization
from nltk.tokenize import sent_tokenize
tokenized_text= sent_tokenize(text)
print(tokenized_text)

#Word Tokenization
from nltk.tokenize import word_tokenize
tokenized_word=word_tokenize(text)
print(tokenized_word)

Step 4: Removing Punctuations and Stop Word

# print stop words of English
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

text= "How to remove stop words with NLTK library in Python?"

text= re.sub('[^a-zA-Z]', ' ',text)
tokens = word_tokenize(text.lower())
filtered_text=[]
for w in tokens:
if w not in stop_words:
filtered_text.append(w)
print("Tokenized Sentence:",tokens)
print("Filterd Sentence:",filtered_text)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 5 : Perform Stemming

from nltk.stem import PorterStemmer
e_words= ["wait", "waiting", "waited", "waits"]
ps =PorterStemmer()
for w in e_words:
rootWord=ps.stem(w)
print(rootWord)
Step 6: Perform Lemmatization
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
print("Lemma for {} is {}".format(w,
wordnet_lemmatizer.lemmatize(w)))

Step 7: Apply POS Tagging to text

import nltk
from nltk.tokenize import word_tokenize
data="The pink sweater fit her perfectly"
words=word_tokenize(data)
for word in words:
print(nltk.pos_tag([word]))

Algorithm for Create representation of document by calculating TFIDF

Step 1: Import the necessary libraries.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Step 2: Initialize the Documents.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'
Step 3: Create BagofWords (BoW) for Document A and B.
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
Step 4: Create Collection of Unique words from Document A and B.
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 5: Create a dictionary of words and their occurrence for each document in the
corpus
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
numOfWordsB[word] += 1
Step 6: Compute the term frequency for each of our documents.
def computeTF(wordDict, bagOfWords):
tfDict = {}
bagOfWordsCount = len(bagOfWords)
for word, count in wordDict.items():
tfDict[word] = count / float(bagOfWordsCount)
return tfDict
tfA = computeTF(numOfWordsA, bagOfWordsA)
tfB = computeTF(numOfWordsB, bagOfWordsB)
Step 7: Compute the term Inverse Document Frequency.
def computeIDF(documents):
import math
N = len(documents)

idfDict = dict.fromkeys(documents[0].keys(), 0)
for document in documents:
for word, val in document.items():
if val > 0:
idfDict[word] += 1

for word, val in idfDict.items():

idfDict[word] = math.log(N / float(val))
return idfDict
idfs = computeIDF([numOfWordsA, numOfWordsB])
idfs
Step 8: Compute the term TF/IDF for all words.
def computeTFIDF(tfBagOfWords, idfs):
tfidf = {}
for word, val in tfBagOfWords.items():
tfidf[word] = val * idfs[word]
return tfidf

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

tfidfA = computeTFIDF(tfA, idfs)

tfidfB = computeTFIDF(tfB, idfs)
df = pd.DataFrame([tfidfA, tfidfB])
df

Conclusion:
In this way we have done text data analysis using TF IDF algorithm
Assignment Question:
1) Perform Stemming for text = "studies studying cries cry". Compare
the results generated with Lemmatization. Comment on your answer how
Stemming and Lemmatization differ from each other.

2) Write Python code for removing stop words from the below documents, conver the
documents into lowercase and calculate the TF, IDF and TFIDF score for each
document.
documentA = 'Jupiter is the largest Planet'
documentB = 'Mars is the fourth planet from the Sun'

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 8
----------------------------------------------------------------------------------------------------------------
Contents for Theory:
1. Seaborn Library Basics
2. Know your Data
3. Finding patterns of data.
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.

---------------------------------------------------------------------------------------------------------------
Theory:
Data Visualisation plays a very important role in Data mining. Various data scientists spent their
time exploring data through visualisation. To accelerate this process we need to have a
well-documentation of all the plots.
Even plenty of resources can’t be transformed into valuable goods without planning and
architecture

1. Seaborn Library Basics

Seaborn is a Python data visualisation library based on matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
For the installation of Seaborn, you may run any of the following in your command line.
pip install seaborn
conda install seaborn
To import seaborn you can run the following command.
import seaborn as sns

2. Know your data

The dataset that we are going to use to draw our plots will be the Titanic dataset, which is
downloaded by default with the Seaborn library. All you have to do is use the load_dataset
function and pass it the name of the dataset.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Let's see what the Titanic dataset looks like. Execute the following script:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('titanic')

dataset.head()

The dataset contains 891 rows and 15 columns and contains information about the passengers
who boarded the unfortunate Titanic ship. The original task is to predict whether or not the
passenger survived depending upon different features such as their age, ticket, cabin they
boarded, the class of the ticket, etc. We will use the Seaborn library to see if we can find any
patterns in the data.

3. Finding patterns of data.

Patterns of data can be find out with the help of different types of plots
Types of plots are:

A. Distribution Plots

a. Dist-Plot

b. Joint Plot

d. Rug Plot

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

B. Categorical Plots

a. Bar Plot

b. Count Plot

c. Box Plot

d. Violin Plot

C. Advanced Plots

a. Strip Plot

b. Swarm Plot

D. Matrix Plots

a. Heat Map

b. Cluster Map

A. Distribution Plots:

These plots help us to visualise the distribution of data. We can use these plots to understand the

mean, median, range, variance, deviation, etc of the data.

a. Distplot

● Dist plot gives us the histogram of the selected continuous variable.

● It is an example of a univariate analysis.
● We can change the number of bins i.e. number of vertical bars in a histogram

import seaborn as sns

sns.distplot(x = dataset['age'], bins = 10)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

The line that you see represents the kernel density estimation. You can remove this line
by passing False as the parameter for the kde attribute as shown below

sns.distplot(dataset['age'], bins = 10,kde=False)

Here the x-axis is the age and the y-axis displays frequency. For example, for bins = 10,

there are around 50 people having age 0 to 10

i.b. Joint Plot

● It is the combination of the distplot of two variables.

● It is an example of bivariate analysis.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● We additionally obtain a scatter plot between the variables to reflect their linear
relationship. We can customise the scatter plot into a hexagonal plot, where, the
more the colour intensity, the more will be the number of observations.

import seaborn as sns

# For Plot 1

sns.jointplot(x = dataset['age'], y = dataset['fare'], kind =

'scatter')

# For Plot 2

sns.jointplot(x = dataset['age'], y = dataset['fare'], kind = 'hex')

● From the output, you can see that a joint plot has three parts. A distribution plot at the top
for the column on the x-axis, a distribution plot on the right for the column on the y-axis
and a scatter plot in between that shows the mutual distribution of data for both the
columns. You can see that there is no correlation observed between prices and the fares.
● You can change the type of the joint plot by passing a value for the kind parameter. For
instance, if instead of a scatter plot, you want to display the distribution of data in the
form of a hexagonal plot, you can pass the value hex for the kind parameter.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

● In the hexagonal plot, the hexagon with the most number of points gets darker colour. So
if you look at the above plot, you can see that most of the passengers are between the
ages of 20 and 30 and most of them paid between 10-50 for the tickets.

a. c. The Rug Plot

b. The rugplot() is used to draw small bars along the x-axis for each point in the dataset. To
plot a rug plot, you need to pass the name of the column. Let's plot a rug plot for fare.
sns.rugplot(dataset['fare'])

From the output, you can see that most of the instances for the fares have values between 0 and
100.

These are some of the most commonly used distribution plots offered by the Python's Seaborn
Library. Let's see some of the categorical plots in the Seaborn library.

2. Categorical Plots
Categorical plots, as the name suggests, are normally used to plot categorical data. The
categorical plots plot the values in the categorical column against another categorical column or
a numeric column. Let's see some of the most commonly used categorical data.

b. The Bar Plot

The barplot() is used to display the mean value for each value in a categorical column, against a
numeric column. The first parameter is the categorical column, the second parameter is the
numeric column while the third parameter is the dataset. For instance, if you want to know the
mean value of the age of the male and female passengers, you can use the bar plot as follows.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

sns.barplot(x='sex', y='age', data=dataset)

From the output, you can clearly see that the average age of male passengers is just less than 40
while the average age of female passengers is around 33.

In addition to finding the average, the bar plot can also be used to calculate other aggregate
values for each category. To do so, you need to pass the aggregate function to the estimator. For
instance, you can calculate the standard deviation for the age of each gender as follows:
import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

sns.barplot(x='sex', y='age', data=dataset, estimator=np.std)

Notice, in the above script we use the std aggregate function from the numpy library to calculate
the standard deviation for the ages of male and female passengers. The output looks like this:

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

c. The Count Plot

The count plot is similar to the bar plot, however it displays the count of the categories in a
specific column. For instance, if we want to count the number of males and women passenger we
can do so using count plot as follows:

sns.countplot(x='sex', data=dataset)

d. The Box Plot

The box plot is used to display the distribution of the categorical data in the form of quartiles.
The centre of the box shows the median value. The value from the lower whisker to the bottom
of the box shows the first quartile. From the bottom of the box to the middle of the box lies the
second quartile. From the middle of the box to the top of the box lies the third quartile and finally
from the top of the box to the top whisker lies the last quartile.

Now let's plot a box plot that displays the distribution for the age with respect to each gender.
You need to pass the categorical column as the first parameter (which is sex in our case) and the
numeric column (age in our case) as the second parameter. Finally, the dataset is passed as the
third parameter, take a look at the following script:

sns.boxplot(x='sex', y='age', data=dataset)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Let's try to understand the box plot for females. The first quartile starts at around 1 and ends at
20 which means that 25% of the passengers are aged between 1 and 20. The second quartile
starts at around 20 and ends at around 28 which means that 25% of the passengers are aged
between20 and 28. Similarly, the third quartile starts and ends between 28 and 38, hence 25%
passengers are aged within this range and finally the fourth or last quartile starts at 38 and ends
around 64.

If there are any outliers or the passengers that do not belong to any of the quartiles, they are
called outliers and are represented by dots on the box plot.

You can make your box plots more fancy by adding another layer of distribution. For instance, if
you want to see the box plots of forage of passengers of both genders, along with the information
about whether or not they survived, you can pass the survived as value to the hue parameter as
shown below:
sns.boxplot(x='sex', y='age', data=dataset, hue="survived")

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Now in addition to the information about the age of each gender, you can also see the
distribution of the passengers who survived. For instance, you can see that among the male
passengers, on average more younger people survived as compared to the older ones. Similarly,
you can see that the variation among the age of female passengers who did not survive is much
greater than the age of the surviving female passengers.

e. The Violin Plot

The violin plot is similar to the box plot, however, the violin plot allows us to display all the
components that actually correspond to the data point. The violinplot() function is used to plot
the violin plot. Like the box plot, the first parameter is the categorical column, the second
parameter is the numeric column while the third parameter is the dataset.

Let's plot a violin plot that displays the distribution for the age with respect to each gender.

sns.violinplot(x='sex', y='age', data=dataset)

You can see from the figure above that violin plots provide much more information about the
data as compared to the box plot. Instead of plotting the quartile, the violin plot allows us to see
all the components that actually correspond to the data. The area where the violin plot is thicker
has a higher number of instances for the age. For instance, from the violin plot for males, it is
clearly evident that the number of passengers with age between 20 and 40 is higher than all the
rest of the age brackets.

Like box plots, you can also add another categorical variable to the violin plot using the hue
parameter as shown below:

sns.violinplot(x='sex', y='age', data=dataset, hue='survived')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Now you can see a lot of information on the violin plot. For instance, if you look at the bottom of
the violin plot for the males who survived (left-orange), you can see that it is thicker than the
bottom of the violin plot for the males who didn't survive (left-blue). This means that the number
of young male passengers who survived is greater than the number of young male passengers
who did not survive

Advanced Plots:

a. The Strip Plot

The strip plot draws a scatter plot where one of the variables is categorical. We have seen scatter
plots in the joint plot and the pair plot sections where we had two numeric variables. The strip
plot is different in a way that one of the variables is categorical in this case, and for each
category in the categorical variable, you will see a scatter plot with respect to the numeric
column.

The stripplot() function is used to plot the violin plot. Like the box plot, the first parameter is the
categorical column, the second parameter is the numeric column while the third parameter is the
dataset. Look at the following script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=False)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

You can see the scattered plots of age for both males and females. The data points look like
strips. It is difficult to comprehend the distribution of data in this form. To better comprehend the
data, pass True for the jitter parameter which adds some random noise to the data. Look at the
following script:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True)

Now you have a better view for the distribution of age across the genders.

Like violin and box plots, you can add an additional categorical column to strip plot using hue
parameter as shown below:

sns.stripplot(x='sex', y='age', data=dataset, jitter=True, hue='survived')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

b. The Swarm Plot

The swarm plot is a combination of the strip and the violin plots. In the swarm plots, the points
are adjusted in such a way that they don't overlap. Let's plot a swarm plot for the distribution of
age against gender. The swarmplot() function is used to plot the violin plot. Like the box plot, the
first parameter is the categorical column, the second parameter is the numeric column while the
third parameter is the dataset. Look at the following script:

sns.swarmplot(x='sex', y='age', data=dataset)

You can clearly see that the above plot contains scattered data points like the strip plot and the
data points are not overlapping. Rather they are arranged to give a view similar to that of a violin
plot.

Let's add another categorical column to the swarm plot using the hue parameter.

sns.swarmplot(x='sex', y='age', data=dataset, hue='survived')

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

From the output, it is evident that the ratio of surviving males is less than the ratio of surviving
females. Since for the male plot, there are more blue points and less orange points. On the other
hand, for females, there are more orange points (surviving) than the blue points (not surviving).
Another observation is that amongst males of age less than 10, more passengers survived as
compared to those who didn't.

1. Matrix Plots
Matrix plots are the type of plots that show data in the form of rows and columns. Heat maps are
the prime examples of matrix plots.

a. Heat Maps
Heat maps are normally used to plot correlation between numeric columns in the form of a
matrix. It is important to mention here that to draw matrix plots, you need to have meaningful
information on rows as well as columns. Let's plot the first five rows of the Titanic dataset to see
if both the rows and column headers have meaningful information. Execute the following script:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

dataset = sns.load_dataset('titanic')

dataset.head()

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

From the output, you can see that the column headers contain useful information such as
passengers surviving, their age, fare etc. However the row headers only contain indexes 0, 1, 2,
etc. To plot matrix plots, we need useful information on both columns and row headers. One way
to do this is to call the corr() method on the dataset. The corr() function returns the correlation
between all the numeric columns of the dataset. Execute the following script:

dataset.corr()

In the output, you will see that both the columns and the rows have meaningful header
information, as shown below:

Now to create a heat map with these correlation values, you need to call the heatmap() function
and pass it your correlation dataframe. Look at the following script:

corr = dataset.corr()

sns.heatmap(corr)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

From the output, it can be seen that what heatmap essentially does is that it plots a box for every
combination of rows and column value. The colour of the box depends upon the gradient. For
instance, in the above image if there is a high correlation between two features, the
corresponding cell or the box is white, on the other hand if there is no correlation, the
corresponding cell remains black.

The correlation values can also be plotted on the heatmap by passing True for the annot
parameter. Execute the following script to see this in action:

corr = dataset.corr()

sns.heatmap(corr, annot=True)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

You can also change the colour of the heatmap by passing an argument for the cmap parameter.
For now, just look at the following script:

corr = dataset.corr()

sns.heatmap(corr)

b. Cluster Map:
In addition to the heat map, another commonly used matrix plot is the cluster map. The
cluster map basically uses Hierarchical Clustering to cluster the rows and columns of the
matrix.
Let's plot a cluster map for the number of passengers who travelled in a specific month of
a specific year. Execute the following script:
4. Checking how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
import seaborn as sns

dataset = sns.load_dataset('titanic')

sns.histplot(dataset['fare'], kde=False, bins=10)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

From the histogram, it is seen that for around 730 passengers the price of the ticket is 50.
For 100 passengers the price of the ticket is 100 and so on.

Conclusion-
Seaborn is an advanced data visualisation library built on top of Matplotlib library. In this
assignment, we looked at how we can draw distributional and categorical plots using the Seaborn
library. We have seen how to plot matrix plots in Seaborn. We also saw how to change plot styles
and use grid functions to manipulate subplots.

Assignment Questions
1. List out different types of plot to find patterns of data
2. Explain when you will use distribution plots and when you will use categorical plots.
3. Write the conclusion from the following swarm plot (consider titanic dataset)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

4. Which parameter is used to add another categorical variable to the violin plot,
Explain with syntax and example.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 9
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether they
survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
-----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data Visualization
operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
--------------------------------------------------------------------------------------------------------------
Assignment Questions
1. Write down the code to use inbuilt dataset ‘titanic’ using seaborn library.
2. Write code to plot a box plot for distribution of age with respect to each gender
along with the information about whether they survived or not.
3. Write the observations from the box plot.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Correctness Documentation Timely Dated Sign of

Write-up Viva Total
of Program of Program Completion Subject Teacher

4 4 4 4 4 20

Expected Date of Completion:..................................... Actual Date of Completion:.......................

----------------------------------------------------------------------------------------------------------------

Group A
Assignment No: 10
----------------------------------------------------------------------------------------------------------------
Title of the Assignment: Data Visualization III
Download the Iris flower dataset or any other dataset into a DataFrame. (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris ). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distributions and identify outliers.
-----------------------------------------------------------------------------------------------------------------
Objective of the Assignment: Students should be able to perform the data Visualization
operation using Python on any open source dataset
---------------------------------------------------------------------------------------------------------------
Prerequisite:
1. Basic of Python Programming
2. Seaborn Library, Concept of Data Visualization.
3. Types of variables
--------------------------------------------------------------------------------------------------------------
Assignment Questions
1. For the iris dataset, list down the features and their types.
2. Write a code to create a histogram for each feature. (iris dataset)

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

3. Write a code to create a boxplot for each feature. (iris dataset)

4. Identify the outliers from the boxplot drawn for iris dataset.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

----------------------------------------------------------------------------------------------------------------

Group B
Assignment No: 1
----------------------------------------------------------------------------------------------------------------
Theory:
● Steps to Install Hadoop
● Java Code for word count
● Input File

Steps to install Hadoop:

Step 1) mkdir words

Step 2) Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce
program. Visit the following
link
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1

Step 3) Put that downloaded jar file into words folder.

Step 4) Implement WordCount.java program.

Step 5) Create input1.txt on home directory with some random text

Step 6) go on words path then compile

javac -classpath /home/vijay/words/hadoop-core-1.2.1.jar /home/vijay/words/WordCount.java

Step 7) jar -cvf words.jar -c words/ .

Step 8) cd .. then use following commands

hadoop fs -mkdir /input

hadoop fs -put input1.txt /input

hadoop fs -ls /input

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

hadoop jar /home/vijay/words/words12.jar WordCount /input/input1.txt /out321

hadoop fs -ls /out321

hadoop fs -cat /out321/part-r-00000

(Otherwise check in Browsing HDFS -> Utilities -> Browse the file System -> /)

Java Code for word count:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool

{
public static void main(String args[]) throws Exception
{
int res = ToolRunner.run(new WordCount(), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
Path inputPath = new Path(args[0]);
Path outputPath = new Path(args[1]);

Configuration conf = getConf();

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Job job = new Job(conf, this.getClass().toString());

job.setJarByClass(WordCount.class);

FileInputFormat.setInputPaths(job, inputPath);
FileOutputFormat.setOutputPath(job, outputPath);

job.setJobName("WordCount");

job.setMapperClass(Map.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

return job.waitForCompletion(true) ? 0 : 1;
}

public static class Map extends Mapper<LongWritable, Text, Text,

IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Mapper.Context

context) throws IOException, InterruptedException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
context.write(word, one);
}

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

}
}

public static class Reduce extends Reducer<Text, IntWritable, Text,

IntWritable>
{

public void reduce(Text key, Iterable<IntWritable> values, Context

context) throws IOException, InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
}
Input File
Pune
Mumbai
Nashik
Pune
Nashik
Kolapur

Assignment Questions
1. What is the map reduce explain with a small example?
2. Write down steps to install hadoop.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

----------------------------------------------------------------------------------------------------------------

Group B
Assignment No: 2
----------------------------------------------------------------------------------------------------------------
Theory:
● Steps to Install Hadoop for distributed environment
● Java Code for processes a log file of a system

Steps to Install Hadoop for distributed environment:

Initially create one folder logfiles1 on desktop. In that folder store input file
(access_log_short.csv), SalesMapper.java, SalesCountryReducer.java, SalesCountryDriver.java
files)

Step 1) Go to Hadoop home directory and format the NameNode.

cd hadoop-2.7.3

bin/hadoop namenode -format

Step 2) Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons/nodes.

cd hadoop-2.7.3/sbin

1) Start NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files
stored in the HDFS and tracks all the file stored across the cluster.

./hadoop-daemon.sh start namenode

2) Start DataNode:

On startup, a DataNode connects to the Namenode and it responds to the requests from the
Namenode for different operations.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

./hadoop-daemon.sh start datanode

3) Start ResourceManager:

ResourceManager is the master that arbitrates all the available cluster resources and thus helps in
managing the distributed applications running on the YARN system. Its work is to manage each
NodeManagers and the each application’s ApplicationMaster.

./yarn-daemon.sh start resourcemanager

4) Start NodeManager:

The NodeManager in each machine framework is the agent which is responsible for managing
containers, monitoring their resource usage and reporting the same to the ResourceManager.

./yarn-daemon.sh start nodemanager

5) Start JobHistoryServer:

JobHistoryServer is responsible for servicing all job history related requests from client.

./mr-jobhistory-daemon.sh start historyserver

Step 3) To check that all the Hadoop services are up and running, run the below command.

jps

Step 4) cd

Step 5) sudo mkdir mapreduce_vijay

Step 6) sudo chmod 777 -R mapreduce_vijay/

Step 7) sudo chown -R vijay mapreduce_vijay/

Step 8) sudo cp /home/vijay/Desktop/logfiles1/* ~/mapreduce_vijay/

Step 9) cd mapreduce_vijay/

Step 10) ls

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 11) sudo chmod +r .

Step 12) export

CLASSPATH="/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-client-co
re-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-client-comm
on-2.7.3.jar:/home/vijay/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar:~/mapre
duce_vijay/SalesCountry/*:$HADOOP_HOME/lib/*"

Step 13) javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

Step 14) ls

Step 15) cd SalesCountry/

Step 16) ls (check is class files are created)

Step 17) cd ..

Step 18) gedit Manifest.txt

(add following lines to it:
Main-Class: SalesCountry.SalesCountryDriver)

Step 19) jar -cfm mapreduce_vijay.jar Manifest.txt SalesCountry/*.class

Step 20) ls

Step 21) cd

Step 22) cd mapreduce_vijay/

Step 23) sudo mkdir /input200

Step 24) sudo cp access_log_short.csv /input200

Step 25) $HADOOP_HOME/bin/hdfs dfs -put /input200 /

Step 26) $HADOOP_HOME/bin/hadoop jar mapreduce_vijay.jar /input200 /output200

Step 27) hadoop fs -ls /output200

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 28) hadoop fs -cat /out321/part-00000

Step 29) Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the
NameNode interface.

Java Code to process logfile

Mapper Class:
package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable,

Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);

public void map(LongWritable key, Text value, OutputCollector<Text,

IntWritable> output, Reporter reporter) throws IOException {

String valueString = value.toString();

String[] SingleCountryData = valueString.split("-");
output.collect(new Text(SingleCountryData[0]), one);
}
}
Reducer Class:
package SalesCountry;

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text,

IntWritable, Text, IntWritable> {

public void reduce(Text t_key, Iterator<IntWritable> values,

OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException
{
Text key = t_key;
int frequencyForCountry = 0;
while (values.hasNext()) {
// replace type of value with the actual type of our value
IntWritable value = (IntWritable) values.next();
frequencyForCountry += value.get();

}
output.collect(key, new IntWritable(frequencyForCountry));
}
}
Driver Class:
package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

public static void main(String[] args) {
JobClient my_client = new JobClient();
// Create a configuration object for the job
JobConf job_conf = new JobConf(SalesCountryDriver.class);

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

// Set a name of the Job

job_conf.setJobName("SalePerCountry");

// Specify data type of output key and value

job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);

// Specify names of Mapper and Reducer Class

job_conf.setMapperClass(SalesCountry.SalesMapper.class);
job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

// Specify formats of the data type of Input and output

job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);

// Set input and output directories using command line arguments,

//arg[0] = name of input directory on HDFS, and arg[1] = name of
output directory to be created to store the output file.

FileInputFormat.setInputPaths(job_conf, new Path(args[0]));

FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}

Input File
Pune
Mumbai
Nashik

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Pune
Nashik
Kolapur

Assignment Questions
1. Write down the steps for Design a distributed application using MapReduce which
processes a log file of a system.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

----------------------------------------------------------------------------------------------------------------

Group B
Assignment No: 3
----------------------------------------------------------------------------------------------------------------
Theory:
● Steps to Install Scala
● Apache Spark Framework Installation
● Souce Code

1) Install Scala

Step 1) java -version

Step 2) Install Scala from the apt repository by running the following commands to search for
scala and install it.

sudo apt search scala ⇒ Search for the package

sudo apt install scala ⇒ Install the package

Step 3) To verify the installation of Scala, run the following command.

scala -version

2) Apache Spark Framework Installation

Apache Spark is an open-source, distributed processing system used for big data workloads. It
utilizes in-memory caching, and optimized query execution for fast analytic queries against data
of any size.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

Step 1) Now go to the official Apache Spark download page and grab the latest version (i.e.
3.2.1) at the time of writing this article. Alternatively, you can use the wget command to
download the file directly in the terminal.

wget https://apachemirror.wuchna.com/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz

Step 2) Extract the Apache Spark tar file.

tar -xvzf spark-3.1.1-bin-hadoop2.7.tgz

Step 3) Move the extracted Spark directory to /opt directory.

sudo mv spark-3.1.1-bin-hadoop2.7 /opt/spark

Configure Environmental Variables for Spark

Step 4) Now you have to set a few environmental variables in .profile file before starting up the
spark.

echo "export SPARK_HOME=/opt/spark" >> ~/.profile

echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile

Step 5) To make sure that these new environment variables are reachable within the shell and
available to Apache Spark, it is also mandatory to run the following command to take recent
changes into effect.

source ~/.profile

Step 6) ls -l /opt/spark

Start Apache Spark in Ubuntu

Step 7) Run the following command to start the Spark master service and slave service.

start-master.sh

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

start-workers.sh spark://localhost:7077
(if workers not starting then remove and install openssh:
sudo apt-get remove openssh-client openssh-server
sudo apt-get install openssh-client openssh-server)

Step 8) Once the service is started go to the browser and type the following URL access spark
page. From the page, you can see my master and slave service is started.

http://localhost:8080/

Step 9) You can also check if spark-shell works fine by launching the spark-shell command.
Spark-shell

sudo apt install snapd

snap find “intellij”

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

sudo snap install intellij-idea-community - - classic

Start Intellij IDE community Edition

Source Code:

/* Sample Code to print Statement */

object ExampleString {
def main(args: Array[String]) {

//declare and assign string variable "text"

val text : String = "You are reading SCALA programming language.";

//print the value of string variable "text"

println("Value of text is: " + text);

}
}

/**Scala program to find a number is positive, negative or positive.*/

object ExCheckNumber {
def main(args: Array[String]) {

/**declare a variable*/
var number= (-100);

if(number==0){
println("number is zero");
}
else if(number>0){

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Department of Computer Engineering Subject : DSBDAL

println("number is positive");
}
else{
println("number is negative");
}
}
}

/Scala program to print your name/

object ExPrintName {
def main(args: Array[String]) {
println("My name is Mike!")
}
}

/**Scala Program to find largest number among two numbers.*/

object ExFindLargest {
def main(args: Array[String]) {
var number1=20;
var number2=30;
var x = 10;

if( number1>number2){
println("Largest number is:" + number1);
}
else{
println("Largest number is:" + number2);
}
}
}

Assignment Questions
1. Write down steps to install scala.

SNJB’s Late Sau. K B Jain College of Engineering, Chandwad Dist. Nashik, MS

Python Libraries and Packages For Data Science
100% (1)
Python Libraries and Packages For Data Science
5 pages
Ass 1 DSBDL
No ratings yet
Ass 1 DSBDL
24 pages
DSBDA Lab Manual
No ratings yet
DSBDA Lab Manual
155 pages
Exp 1
No ratings yet
Exp 1
22 pages
DSBDA Lab Manual
No ratings yet
DSBDA Lab Manual
155 pages
Ass1 DSBDA Writeup
No ratings yet
Ass1 DSBDA Writeup
8 pages
Ass-1 Prac
No ratings yet
Ass-1 Prac
23 pages
Unit 4
No ratings yet
Unit 4
105 pages
Programming For Data Science
No ratings yet
Programming For Data Science
48 pages
Python For Data Science
No ratings yet
Python For Data Science
8 pages
Python Abstract
No ratings yet
Python Abstract
7 pages
CH 4
No ratings yet
CH 4
17 pages
D P Lab Manual
No ratings yet
D P Lab Manual
54 pages
ML File Updated
No ratings yet
ML File Updated
60 pages
Top 18 Python Libraries
100% (1)
Top 18 Python Libraries
11 pages
Cs3361 Data Science Laboratory
No ratings yet
Cs3361 Data Science Laboratory
139 pages
Final Fds Manual Print
No ratings yet
Final Fds Manual Print
55 pages
ML Lab Manual
No ratings yet
ML Lab Manual
12 pages
Numpy Data Analysis and Visualisation With Python
No ratings yet
Numpy Data Analysis and Visualisation With Python
75 pages
CS3361 - Data Science Laboratory
No ratings yet
CS3361 - Data Science Laboratory
31 pages
Machine Learning - Manual
No ratings yet
Machine Learning - Manual
32 pages
Final Fds Manual
No ratings yet
Final Fds Manual
77 pages
Data Analysis Lab - Final - 23-24
No ratings yet
Data Analysis Lab - Final - 23-24
11 pages
Report File
No ratings yet
Report File
40 pages
Fds Merged
No ratings yet
Fds Merged
102 pages
FDS Lab Meterial CS3361
No ratings yet
FDS Lab Meterial CS3361
30 pages
l9 Scientific Python Proc
No ratings yet
l9 Scientific Python Proc
30 pages
01 Introduction To Python
No ratings yet
01 Introduction To Python
36 pages
DAL EXT 1 and 2
No ratings yet
DAL EXT 1 and 2
125 pages
Suraj Report File
No ratings yet
Suraj Report File
17 pages
Unit 5
No ratings yet
Unit 5
27 pages
Data Science Lecture 5 6th Semster
No ratings yet
Data Science Lecture 5 6th Semster
3 pages
Bcse206l Fds Module-5 Smsatapathy
No ratings yet
Bcse206l Fds Module-5 Smsatapathy
74 pages
Q-Step WS 06112019 Data Analysis and Visualisation With Python
No ratings yet
Q-Step WS 06112019 Data Analysis and Visualisation With Python
76 pages
Python Libraries Seminar Report
100% (2)
Python Libraries Seminar Report
16 pages
Unit 5
No ratings yet
Unit 5
11 pages
Lab Manual
No ratings yet
Lab Manual
19 pages
Data Science Assignment 1 Answers
No ratings yet
Data Science Assignment 1 Answers
3 pages
DS Final
No ratings yet
DS Final
46 pages
Data Science I: Charles C.N. Wang
No ratings yet
Data Science I: Charles C.N. Wang
68 pages
Data Analysis and Visualisation With Python
No ratings yet
Data Analysis and Visualisation With Python
75 pages
Unit 5 Python
No ratings yet
Unit 5 Python
10 pages
Data Ty
No ratings yet
Data Ty
59 pages
UNIT-6 (Data Analytics and Visualization With Python)
No ratings yet
UNIT-6 (Data Analytics and Visualization With Python)
41 pages
Tool and Lib in Data Science
No ratings yet
Tool and Lib in Data Science
32 pages
NumPy and Pandas
No ratings yet
NumPy and Pandas
72 pages
Ai, Ds & ML
No ratings yet
Ai, Ds & ML
52 pages
Data Preprocessing and Data Analysis Using Python
No ratings yet
Data Preprocessing and Data Analysis Using Python
32 pages
Python CA2
No ratings yet
Python CA2
11 pages
DAY6 Pandas Seaborn
No ratings yet
DAY6 Pandas Seaborn
97 pages
Data Science Lab Manual
No ratings yet
Data Science Lab Manual
74 pages
Wa0005.
No ratings yet
Wa0005.
29 pages
8 LO5 Lect 1
No ratings yet
8 LO5 Lect 1
16 pages
Lab 2 Report
No ratings yet
Lab 2 Report
6 pages
Data Science Workshop - Day 1
No ratings yet
Data Science Workshop - Day 1
80 pages
DSL Rough Draft
No ratings yet
DSL Rough Draft
34 pages
TY FDS Workbook
No ratings yet
TY FDS Workbook
56 pages
Python Week+1 New
No ratings yet
Python Week+1 New
44 pages
Data Structures For Statistical Computing in Pytho
No ratings yet
Data Structures For Statistical Computing in Pytho
7 pages
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
From Everand
Python Programming: General-Purpose Libraries; NumPy,Pandas,Matplotlib,Seaborn,Requests,os & sys: Python, #2
e3
No ratings yet
IOT Mini Project Report
No ratings yet
IOT Mini Project Report
26 pages
Unit 3 DW
No ratings yet
Unit 3 DW
19 pages
ML (Methodology)
No ratings yet
ML (Methodology)
4 pages
A Novel Approach To Template Filling With Automatic Speech Recognition For Healthcare Professionals
No ratings yet
A Novel Approach To Template Filling With Automatic Speech Recognition For Healthcare Professionals
6 pages
Software Requirements Specification For Fake News Prediction Using Machine Learning
No ratings yet
Software Requirements Specification For Fake News Prediction Using Machine Learning
8 pages
Concepts (PPT) - Data Preprocessing
No ratings yet
Concepts (PPT) - Data Preprocessing
19 pages
Wa0063.
No ratings yet
Wa0063.
40 pages
Raghav's Resume
No ratings yet
Raghav's Resume
2 pages
Module-I NLP
No ratings yet
Module-I NLP
35 pages
DWM Theory
No ratings yet
DWM Theory
37 pages
2021, Gulati - Efficiency Enhancement of Machine Learning Approaches Through The Impact of Preprocessing Techniques
No ratings yet
2021, Gulati - Efficiency Enhancement of Machine Learning Approaches Through The Impact of Preprocessing Techniques
6 pages
Akash Kumar Singh - 23WU0202098
No ratings yet
Akash Kumar Singh - 23WU0202098
6 pages
1 s2.0 S0957417424031506 Main
No ratings yet
1 s2.0 S0957417424031506 Main
17 pages
Sentiment Analysis Report
No ratings yet
Sentiment Analysis Report
31 pages
DSBDA Lab Manual
No ratings yet
DSBDA Lab Manual
167 pages
Marketing Analytics Lab-3
No ratings yet
Marketing Analytics Lab-3
3 pages
Feng Weakly Supervised High-Fidelity Clothing Model Generation CVPR 2022 Paper
No ratings yet
Feng Weakly Supervised High-Fidelity Clothing Model Generation CVPR 2022 Paper
10 pages
Text Classification and Processing Using NLP
No ratings yet
Text Classification and Processing Using NLP
21 pages
Project Report 8th Sem
100% (1)
Project Report 8th Sem
34 pages
Assignment 1
No ratings yet
Assignment 1
4 pages
Bone Cancer Detection at Earlier Stage Using CNN Ijariie13980
No ratings yet
Bone Cancer Detection at Earlier Stage Using CNN Ijariie13980
7 pages
Visual Taxonomy Report
No ratings yet
Visual Taxonomy Report
10 pages
Data Preprocessing
No ratings yet
Data Preprocessing
8 pages
Internship Report
No ratings yet
Internship Report
12 pages
Summer Internship Report.
No ratings yet
Summer Internship Report.
27 pages
Product Helpfulness Detection With Novel Transformer Based BERT Embedding and Class Probability Features
No ratings yet
Product Helpfulness Detection With Novel Transformer Based BERT Embedding and Class Probability Features
13 pages
Codsoft Report
No ratings yet
Codsoft Report
26 pages
Machine Learning Internship Report
No ratings yet
Machine Learning Internship Report
27 pages
Forest Fire Prediction Using Machine Learning
No ratings yet
Forest Fire Prediction Using Machine Learning
28 pages
Mastering Exploratory Data Analysis With Python - A Comprehensive Guide To Unveiling Hidden Insights
No ratings yet
Mastering Exploratory Data Analysis With Python - A Comprehensive Guide To Unveiling Hidden Insights
73 pages