EDAV Manual With Code

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year-2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS(A)

Assignment No.: 1

Title: Program to get statistical characteristics of dataset using pandas

Theory: A large number of methods collectively compute descriptive statistics and other related
operations on a DataFrame. Most of these are aggregations like sum() and mean() that return a
reduced result, but some of them, like cumsum() and cumprod(), produce an object of the same size.
Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the
axis can be specified by name or integer:

• DataFrame − “index” (axis=0, default), “columns” (axis=1)


Let us create a DataFrame and use this object throughout this assignment for all the operations.

sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
mean()
Returns the average value.
std()
Returns the Bessel-corrected standard deviation of the numerical columns.

Functions & Description


Let us now understand the functions under Descriptive Statistics in Python Pandas. The following
table lists the important functions:
Sr.No.  Function    Description
1       count()     Number of non-null observations
2       sum()       Sum of values
3       mean()      Mean of values
4       median()    Median of values
5       mode()      Mode of values
6       std()       Standard deviation of the values
7       min()       Minimum value
8       max()       Maximum value
9       abs()       Absolute value
10      prod()      Product of values
11      cumsum()    Cumulative sum
12      cumprod()   Cumulative product

Note − Since a DataFrame is a heterogeneous data structure, generic operations do not work with all
functions.
• Functions like sum() and cumsum() work with both numeric and character (or string) data
elements without any error. Though in practice character aggregations are rarely used,
these functions do not throw any exception.
• Functions like abs() and cumprod() throw an exception when the DataFrame contains character
or string data, because such operations cannot be performed.

Summarizing Data
The describe() function computes a summary of statistics pertaining to the DataFrame columns.
This function gives the mean, std and IQR values, and it excludes the character columns, giving a
summary about the numeric columns only. 'include' is the argument used to specify which columns
need to be considered for summarizing. It takes a list of values; by default, 'number'.

• object − Summarizes string columns
• number − Summarizes numeric columns
• all − Summarizes all columns together (should not be passed as a list)

Code :

import pandas as pd
import numpy as np

# Create a Dictionary of series


d={
'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack', 'Lee', 'David', 'Gasper',
'Betina', 'Andres']),
'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])
}

# Create a DataFrame
df = pd.DataFrame(d)

# Display various statistics of the DataFrame

print("Sum of columns:")
print(df.sum(numeric_only=True)) # Sum only numeric columns

print("\nSum of rows (only numeric columns):")


print(df[['Age', 'Rating']].sum(axis=1))

print("\nMean of columns:")
print(df[['Age', 'Rating']].mean()) # Mean only for numeric columns

print("\nStandard deviation of columns:")


print(df[['Age', 'Rating']].std()) # Std dev only for numeric columns

print("\nDescription of numeric columns:")


print(df.describe())

print("\nDescription including only object (string) columns:")


print(df.describe(include=['object']))

print("\nDescription of all columns:")


print(df.describe(include='all'))

Observations: Thus students are able to write a program to get statistical characteristics of a dataset
using pandas.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year-2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: T.Y.B.Tech DS(A)

Assignment No.: 2

Title: Programs for analysis of data through different plots (scatter, bubble, area, stacked)
and charts (line, bar, table, pie, histogram)

• Plots

1. Scatter Plot
It is a type of plot using Cartesian coordinates to display values for two variables for a set of

data. It is displayed as a collection of points. Their position on the horizontal axis determines the

value of one variable. The position on the vertical axis determines the value of the other variable.

A scatter plot can be used when one variable can be controlled and the other variable depends on

it. It can also be used when both continuous variables are independent.

According to the correlation of the data points, scatter plots are grouped into different types. These
correlation types are listed below:

Positive Correlation

In these types of plots, an increase in the independent variable indicates an increase in the

variable that depends on it. A scatter plot can have a high or low positive correlation.

Negative Correlation

In these types of plots, an increase in the independent variable indicates a decrease in the

variable that depends on it. A scatter plot can have a high or low negative correlation.

No Correlation

Two groups of data visualized on a scatter plot are said to not correlate if there is no clear

correlation between them.
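As an illustrative sketch (the data below is randomly generated and purely hypothetical), a scatter
plot with a positive correlation can be drawn with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: y increases with x, giving a positive correlation
x = np.random.rand(50)
y = x + np.random.normal(0, 0.1, 50)

plt.scatter(x, y)
plt.xlabel('Independent variable')
plt.ylabel('Dependent variable')
plt.title('Scatter Plot (positive correlation)')
plt.show()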

2. Bubble Chart
A bubble chart displays three attributes of data. They are represented by x location, y location,
and size of the bubble.

Simple Bubble Chart

It is the basic type of bubble chart and is equivalent to the normal bubble chart.

Labelled Bubble Chart

The bubbles on this bubble chart are labelled for easy identification. This is to deal with different

groups of data.

The multivariable Bubble Chart

This chart has four dataset variables. The fourth variable is distinguished with a different colour.

Map Bubble Chart

It is used to illustrate data on a map.

3D Bubble Chart

This is a bubble chart designed in a 3-dimensional space. The bubbles here are spherical.
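A minimal matplotlib sketch of a simple bubble chart using hypothetical random data; the third
attribute is mapped to the bubble size:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: x, y positions plus a third attribute shown as bubble size
x = np.random.rand(30)
y = np.random.rand(30)
size = np.random.rand(30) * 500

plt.scatter(x, y, s=size, alpha=0.5)
plt.title('Bubble Chart')
plt.show()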

Area Chart:
It is represented by the area between the lines and the axis. The area is proportional to the

amount it represents.

These are types of area charts:

Simple Area Chart

In this chart, the coloured segments overlap each other. They are placed above each other.

Stacked Area Chart

In this chart, the coloured segments are stacked on top of one another. Thus they do not intersect.

100% Stacked area Chart

In this chart, the area occupied by each group of data is measured as a percentage of its amount

from the total data. Usually, the vertical axis totals a hundred per cent.

3-D Area Chart


This chart is measured on a 3-dimensional space.
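A short sketch of a stacked area chart with hypothetical monthly values for two groups;
matplotlib's stackplot stacks the coloured segments so they do not intersect:

import matplotlib.pyplot as plt

# Hypothetical monthly values for two groups
months = [1, 2, 3, 4, 5]
group_a = [3, 5, 4, 6, 7]
group_b = [2, 3, 5, 4, 6]

# Stacked area chart: segments are stacked on top of one another
plt.stackplot(months, group_a, group_b, labels=['Group A', 'Group B'], alpha=0.6)
plt.legend(loc='upper left')
plt.title('Stacked Area Chart')
plt.show()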

Stacked Bar Graph:


The stacked bar graphs are used to show dataset subgroups. However, the bars are stacked on top.
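A minimal sketch of a stacked bar graph with hypothetical subgroup values; the second subgroup
is stacked on top of the first using the bottom argument:

import matplotlib.pyplot as plt

# Hypothetical subgroup values for three categories
categories = ['Q1', 'Q2', 'Q3']
subgroup1 = [10, 15, 12]
subgroup2 = [5, 7, 9]

plt.bar(categories, subgroup1, label='Subgroup 1')
plt.bar(categories, subgroup2, bottom=subgroup1, label='Subgroup 2')  # stacked on top
plt.legend()
plt.title('Stacked Bar Graph')
plt.show()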

Charts:
Line Graph

It displays a sequence of data points as markers. The points are ordered typically by their x-axis

value. These points are joined with straight line segments. A line graph is used to visualize a

trend in data over intervals of time.

Here are types of line graphs:

Simple Line Graph

A simple line graph plots only one line on the graph. One of the axes defines the independent

variable. The other axis contains a variable that depends on it.

Multiple Line Graph

Multiple line graphs contain more than one line. They represent multiple variables in a dataset.

This type of graph can be used to study more than one variable over the same period.

Compound Line Graph

It is an extension of a simple line graph. It is used when dealing with different groups of data
from a larger dataset. Each of its line graphs is shaded down to the x-axis, with each group
stacked upon one another.
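A brief sketch of a multiple line graph over hypothetical daily values for two variables:

import matplotlib.pyplot as plt

# Hypothetical values over time for two variables
days = [1, 2, 3, 4, 5]
sales = [100, 120, 90, 140, 160]
visitors = [80, 95, 85, 110, 130]

plt.plot(days, sales, marker='o', label='Sales')
plt.plot(days, visitors, marker='s', label='Visitors')
plt.xlabel('Day')
plt.legend()
plt.title('Multiple Line Graph')
plt.show()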

Bar Graph:
A bar graph is a graph that presents categorical data with rectangle-shaped bars. The heights or

lengths of these bars are proportional to the values that they represent. The bars can be vertical or

horizontal. A vertical bar graph is sometimes called a column graph.

Grouped Bar Graph

Grouped bar graphs are used when the datasets have subgroups that need to be visualized on the

graph. The subgroups are differentiated by distinct colours.

Segmented Bar Graph

This is the type of stacked bar graph where each stacked bar shows the percentage of its discrete

value from the total value. The total percentage is 100%.
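A short sketch of a grouped bar graph with hypothetical data; the two subgroups are drawn side by
side by shifting their x positions:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical subgroup values for three categories
labels = ['A', 'B', 'C']
group1 = [20, 35, 30]
group2 = [25, 32, 34]
x = np.arange(len(labels))
width = 0.35

plt.bar(x - width / 2, group1, width, label='Group 1')
plt.bar(x + width / 2, group2, width, label='Group 2')
plt.xticks(x, labels)
plt.legend()
plt.title('Grouped Bar Graph')
plt.show()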

Pie Chart
A pie chart is a circular statistical graphic. To illustrate numerical proportion, it is divided into

slices. In a pie chart, for every slice, each of its arc lengths is proportional to the amount it

represents. The central angles and areas are also proportional. It is named after a sliced pie.

These are types of pie charts:

Simple Pie Chart

This is the basic type of pie chart. It is often called just a pie chart.

Exploded Pie Chart

One or more sectors of the chart are separated (termed as exploded) from the chart in an

exploded pie chart. It is used to emphasize a particular element in the data set.

Donut Chart

In this pie chart, there is a hole in the centre. The hole makes it look like a donut from which it

derives its name.

Pie of Pie

A pie of pie is a chart that generates an entirely new pie chart detailing a small sector of the

existing pie chart. It can be used to reduce the clutter and emphasize a particular group of

elements.

Bar of Pie

This is similar to the pie of pie, except that a bar chart is what is generated.

3D Pie Chart

This is a pie chart that is represented in a 3-dimensional space
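A minimal sketch of an exploded pie chart with hypothetical proportions; the explode argument
separates one slice from the rest:

import matplotlib.pyplot as plt

# Hypothetical proportions for four groups
sizes = [40, 30, 20, 10]
labels = ['A', 'B', 'C', 'D']
explode = [0.1, 0, 0, 0]  # separate (explode) the first slice

plt.pie(sizes, labels=labels, explode=explode, autopct='%1.1f%%')
plt.title('Exploded Pie Chart')
plt.show()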

Histogram:
A histogram is an approximate representation of the distribution of numerical data. The data is

divided into non-overlapping intervals called bins or buckets. A rectangle is erected over a bin

whose height is proportional to the number of data points in the bin. Histograms give a feel of

the density of the distribution of the underlying data.

It is classified into different parts depending on its distribution as below:

Normal Distribution

This chart is usually bell-shaped.

Bimodal Distribution

In this histogram, there are two groups of histogram charts that are of normal distribution. It is a

result of combining two variables in a dataset.

Skewed Distribution

This is an asymmetric graph with an off-centre peak. The peak tends towards the beginning or

end of the graph. A histogram can be said to be right or left-skewed depending on the direction

where the peak tends towards.

Random Distribution

This histogram does not have a regular pattern. It produces multiple peaks. It can be called a

multimodal distribution

Edge Peak Distribution


This distribution is similar to that of a normal distribution, except for a large peak at one of its
ends.

Comb Distribution

The comb distribution is like a comb. The height of rectangle-shaped bars is alternately tall and

short.
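A brief sketch of a histogram over hypothetical data drawn from a normal distribution, which
produces the bell-shaped pattern described above:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical numerical data from a normal distribution
data = np.random.normal(loc=50, scale=10, size=1000)

plt.hist(data, bins=20, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram (approximately normal distribution)')
plt.show()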

Observations: Thus students are able to analyze data through different plots and charts.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25
Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 3

Title: Implementation of data transformation: reshaping and deduplication of data


Theory:

Data transformation is the process of extracting good, reliable data from source systems. This
involves converting data from one structure (or no structure) to another so you can integrate it
with a data warehouse or with different applications. It allows you to expose the information to
advanced business intelligence tools to create valuable performance reports and forecast future
trends.

Data transformation includes two primary stages: understanding and mapping the data; and
transforming the data.

The best way to select, implement and integrate data deduplication can vary depending on how
the deduplication is performed. Here are some general principles that you can follow in selecting
the right deduplicating approach and then integrating it into your environment.

Step 1: Assess your backup environment

What deduplication ratio a company achieves will depend heavily on the following factors:

• Type of data

• Change rate of the data

• Amount of redundant data

• Type of backup performed (full, incremental or differential)

• Retention length of the archived or backup data

The challenge most companies have is quickly and effectively gathering this data. Agentless data
gathering and information classification tools from Aptare Inc., Asigra Inc., Bocada
Inc. and Kazeon Systems Inc. can assist in performing these assessments while requiring
minimal or no changes to your servers in the form of agent deployments.

Step 2: Establish how much you can change your backup environment

Deploying backup software that uses software agents will require installing agents on each server
or virtual machine and doing server reboots after it's installed. This approach generally results in
faster backup times and higher deduplication ratios than using a data deduplication appliance.
However, it can take more time and require many changes to a company's backup environment.
Using a data deduplication appliance typically requires no changes to servers, though a company
will need to tune its backup software according to whether the appliance is configured as a file
server or a virtual tape library (VTL).

Step 3: Purchase a scalable storage architecture

The amount of data that a company initially plans to back up and what it actually ends up
backing up are usually two very different numbers. A company usually finds deduplication so
effective when it starts using it in its backup process that it quickly scales its use and deployment
beyond initial intentions, so you should confirm that deduplicating hardware appliances can scale
both performance and capacity. You should also verify that the hardware and software
deduplication products can provide global deduplication and replication features to maximize
duplication's benefits throughout the enterprise, facilitate technology refreshes and/or capacity
growth, and efficiently bring in deduplicated data from remote offices.

Step 4: Check the level of integration between backup software and hardware appliances

The level of integration that a hardware appliance has with backup software (or vice versa) can
expedite backups and recoveries. For example, ExaGrid Systems Inc.'s ExaGrid appliances
recognize backup streams from CA ARCserve and can better deduplicate data from that backup
software than streams from backup software they don't recognize. Enterprise backup software
is also starting to better manage disk storage systems so data can be placed on different disk
storage systems with different tiers of disk, so they can back up and recover data more quickly
short term and then more cost-effectively store it long term.

Step 5: Perform the first backup

The first backup using agent-based deduplication software can potentially be a harrowing
experience. It can create a significant amount of overhead on the server and take much longer
than normal to complete because it needs to deduplicate all of the data. However, once the first
backup is complete, it only needs to back up and deduplicate changed data going forward. Using
a hardware appliance, the experience tends to be the opposite. The first backup may occur
quickly but backups may slow over time depending on how scalable the hardware appliance is,
how much data is changing and how much data growth that a company is experiencing.
Code :

import pandas as pd

# Sample data with duplicates


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward', 'Alice', 'Bob'],
'Department': ['HR', 'IT', 'Finance', 'IT', 'Finance', 'HR', 'IT'],
'Salary': [50000, 60000, 55000, 60000, 55000, 50000, 60000]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the original data


print("Original Data:")
print(df)

### Reshaping Data ###

# 1. Pivot the data: Reshape the DataFrame by pivoting on 'Name' and 'Department'
pivot_df = df.pivot_table(index='Name', columns='Department', values='Salary', aggfunc='sum')
print("\nPivoted Data (Reshaped):")
print(pivot_df)

# 2. Melt the data: Reshape the DataFrame back to long format


melted_df = pivot_df.reset_index().melt(id_vars='Name', value_name='Salary')
print("\nMelted Data (Reshaped):")
print(melted_df)

### Deduplication of Data ###

# 3. Remove duplicate rows


deduped_df = df.drop_duplicates()
print("\nData after Deduplication (all duplicates removed):")
print(deduped_df)

# 4. Remove duplicates based on specific columns ('Name')


deduped_by_name_df = df.drop_duplicates(subset=['Name'])
print("\nData after Deduplication based on 'Name':")
print(deduped_by_name_df)

Observations: Thus students are able to implement data transformation: reshaping and
deduplication of data.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 4

Title: Implementation of data transformation: handling missing data, filling missing data


Theory:

1) Mean/Median/Mode Imputation

The rationale for Mode is to replace the population of missing values with the most frequent value,
since this is the most likely occurrence.


2) Last Observation Carried Forward (LOCF)

If data is time-series data, one of the most widely used imputation methods is the last observation
carried forward (LOCF). Whenever a value is missing, it is replaced with the last observed value.
This method is advantageous as it is easy to understand and communicate. Although simple, this
method strongly assumes that the value of the outcome remains unchanged by the missing data,
which seems unlikely in many settings.


3) Next Observation Carried Backward (NOCB)

A similar approach to LOCF works in the opposite direction, taking the first observation after the
missing value and carrying it backward ("next observation carried backward", or NOCB).


4) Linear Interpolation

Interpolation is a mathematical method that fits a function to the data and uses this function to
estimate the missing data. The simplest type of interpolation is linear interpolation, which connects
the known values before and after the gap with a straight line and fills the missing value from that
line. Of course, we could have a pretty complex pattern in the data, and linear interpolation might
not be enough. There are several different types of interpolation. Just in Pandas, we have options
like: ‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘polynomial’,
‘spline’, ‘piecewise polynomial’ and many more.


5) Common-Point Imputation

For a rating scale, using the middle point or most commonly chosen value. For example, on a
five-point scale, substitute a 3, the midpoint, or a 4, the most common value (in many cases). It is
similar to the mean value but more suitable for ordinal values.

6) Adding a category to capture NA

This is perhaps the most widely used method of missing data imputation for categorical variables.
This method consists of treating missing data as an additional label or category of the variable. All
the missing observations are grouped in the newly created label ‘Missing’. It does not assume

anything on the missingness of the values. It is very well suited when the number of missing data
is high.


7) Frequent category imputation

Replacement of missing values by the most frequent category is the equivalent of mean/median
imputation. It consists of replacing all occurrences of missing values within a variable with the
variable's most frequent label or category.


8) Arbitrary Value Imputation

Arbitrary value imputation consists of replacing all occurrences of missing values within a
variable with an arbitrary value. Ideally, the arbitrary value should be different from the
median/mean/mode and not within the normal values of the variable. Typically used arbitrary
values are 0, 999, -999 (or other combinations of 9’s) or -1 (if the distribution is positive).
Sometimes data already contain an arbitrary value from the originator for the missing values. This
works reasonably well for numerical features predominantly positive in value and for tree-based
models in general. This used to be a more common method when the out-of-the-box machine
learning libraries and algorithms were not very adept at working with missing data.


9) Adding a variable to capture NA

When data are not missing completely at random, we can capture the importance of missingness
by creating an additional variable indicating whether the data was missing for that observation (1)
or not (0). The additional variable is binary: it takes only the values 0 and 1, 0 indicating that a
value was present for that observation, and 1 indicating that the value was missing. Typically,
mean/median imputation is done to add a variable to capture those observations where the data
was missing.


10) Random Sampling Imputation

Random sampling imputation is in principle similar to mean/median imputation because it aims to


preserve the statistical parameters of the original variable, for which data is missing. Random
sampling consists of taking a random observation from the pool of available observations and
using that randomly extracted value to fill the NA. In Random Sampling, one takes as many
random observations as missing values are present in the variable. Random sample imputation
assumes that the data are missing completely at random (MCAR). If this is the case, it makes
sense to substitute the missing values with values extracted from the original variable distribution.

Multiple Imputation

Multiple Imputation (MI) is a statistical technique for handling missing data. The key concept of
MI is to use the distribution of the observed data to estimate a set of plausible values for the
missing data. Random components are incorporated into these estimated values to show their
uncertainty. Multiple datasets are created and then analysed individually but identically to obtain
a set of parameter estimates. These estimates are then combined into one set of parameter estimates. The
benefit of the multiple imputations is that restoring the natural variability of the missing values
incorporates the uncertainty due to the missing data, which results in a valid statistical inference.
As a flexible way of handling more than one missing variable, apply a Multiple Imputation by
Chained Equations (MICE) approach. Refer to the reference section to get more information on
MI and MICE.
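As an illustrative sketch (not part of the manual's own code), a MICE-style imputation can be
approximated with scikit-learn's IterativeImputer, assuming scikit-learn is installed; the small
DataFrame below is hypothetical:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Hypothetical data with missing values in both columns
df_mi = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0],
                      'B': [2.0, np.nan, 6.0, 8.0]})

# Each feature with missing values is modelled from the other features, iteratively
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df_mi), columns=df_mi.columns)
print(imputed)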

Predictive/Statistical models that impute the missing data

This should be done in conjunction with some cross-validation scheme to avoid leakage. This can
be very effective and can help with the final model. There are many options for such a predictive
model, including a neural network. Here I am listing a few which are very popular.

Linear Regression

In regression imputation, the existing variables are used to predict, and then the predicted value is
substituted as if it were an actually obtained value. This approach has several advantages because the
imputation retains a great deal of data over the listwise or pairwise deletion and avoids
significantly altering the standard deviation or the shape of the distribution. However, as in a

mean substitution, while a regression imputation substitutes a value predicted from other
variables, no novel information is added, while the sample size has been increased and the
standard error is reduced.

Random Forest

Random forest is a non-parametric imputation method applicable to various variable types that
work well with both data missing at random and not missing at random. Random forest uses
multiple decision trees to estimate missing values and outputs OOB (out-of-bag) imputation
error estimates. One caveat is that random forest works best with large datasets, and using random
forest on small datasets runs the risk of overfitting.

k-NN (k Nearest Neighbour)

k-NN imputes the missing attribute values based on the nearest K neighbour. Neighbours are
determined based on a distance measure. Once K neighbours are determined, the missing value is
imputed by taking mean/median or mode of known attribute values of the missing attribute.
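A minimal sketch of k-NN imputation using scikit-learn's KNNImputer (assuming scikit-learn is
available; the data below is hypothetical):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with missing values
df_knn = pd.DataFrame({'Age': [25, np.nan, 30, np.nan, 45],
                       'Salary': [50000, 52000, 55000, 60000, np.nan]})

# Missing entries are filled from the 2 nearest rows (distance computed on the known values)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)
print(imputed)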

Maximum likelihood

The assumption that the observed data are a sample drawn from a multivariate normal distribution
is relatively easy to understand. After the parameters are estimated using the available data, the
missing data are estimated based on the parameters which have just been estimated. Several
strategies are using the maximum likelihood method to handle the missing data.

Expectation-Maximization

Expectation-Maximization (EM) is the maximum likelihood method used to create a new data set.
All missing values are imputed with values estimated by the maximum likelihood methods. This
approach begins with the expectation step, during which the parameters (e.g., variances,
covariances, and means) are estimated, perhaps using the listwise deletion. Those estimates are
then used to create a regression equation to predict the missing data. The maximization step uses
those equations to fill in the missing data. The expectation step is then repeated with the new
parameters, where the new regression equations are determined to “fill in” the missing data. The
expectation and maximization steps are repeated until the system stabilizes.

Sensitivity analysis

Sensitivity analysis is defined as the study which defines how the uncertainty in the output of a
model can be allocated to the different sources of uncertainty in its inputs. When analysing the
missing data, additional assumptions on the missing data are made, and these assumptions are
often applicable to the primary analysis. However, the assumptions cannot be definitively
validated for correctness. Therefore, the National Research Council has proposed that the
sensitivity analysis be conducted to evaluate the robustness of the results to the deviations from
the MAR assumption.

Algorithms that Support Missing Values

Not all algorithms fail when there is missing data. Some algorithms can be made robust to
missing data, such as k-Nearest Neighbours, that can ignore a column from a distance measure
when a value is missing. Some algorithms can use the missing value as a unique and different
value when building the predictive model, such as classification and regression trees. An
algorithm like XGBoost takes any missing data into consideration. If your imputation does not
work well, try a model that is robust to missing data.
Code:
import pandas as pd
import numpy as np

# 1.1 Sample data with missing values


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
'Age': [25, np.nan, 30, np.nan, 45],
'City': ['New York', 'Los Angeles', 'Chicago', np.nan, 'San Francisco'],
'Salary': [50000, np.nan, 55000, 60000, np.nan]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the original data


print("Original Data:")
print(df)

# 1.2 Handling missing data

### Filling missing data with a specific value

# Each imputation below is stored in a new column so that the original NaN values in
# 'Age', 'Salary' and 'City' remain available for the later methods demonstrated further down.
df['Age_filled_0'] = df['Age'].fillna(0)  # Replace NaN in 'Age' with 0
print("\nAfter filling NaN in 'Age' with 0:")
print(df)

### Filling missing data with the mean (for numerical data)

df['Salary_mean'] = df['Salary'].fillna(df['Salary'].mean())  # Replace NaN in 'Salary' with the mean value
print("\nAfter filling NaN in 'Salary' with the mean value:")
print(df)

### Filling missing data with the most frequent value (for categorical data)

df['City_mode'] = df['City'].fillna(df['City'].mode()[0])  # Replace NaN in 'City' with the most frequent value
print("\nAfter filling NaN in 'City' with the most frequent value:")
print(df)

### Last Observation Carried Forward (LOCF)


df['Age_LOCF'] = df['Age'].fillna(method='ffill') # Fill NaN with the last valid observation
print("\nAfter LOCF (Last Observation Carried Forward) in 'Age':")
print(df)

### Next Observation Carried Backward (NOCB)


df['Age_NOCB'] = df['Age'].fillna(method='bfill') # Fill NaN with the next valid observation
print("\nAfter NOCB (Next Observation Carried Backward) in 'Age':")
print(df)

### Linear Interpolation


df['Age_Linear_Interpolation'] = df['Age'].interpolate(method='linear') # Interpolate missing values
print("\nAfter Linear Interpolation in 'Age':")
print(df)

### Adding a category to capture NA


df['City_with_NA'] = df['City'].fillna('Missing') # Fill NaN in 'City' with a specific category 'Missing'
print("\nAfter filling NaN in 'City' with 'Missing':")
print(df)

### Frequent category imputation


df['City_Frequent'] = df['City'].fillna(df['City'].mode()[0]) # Replace NaN in 'City' with the most frequent value
print("\nAfter filling NaN in 'City' with the most frequent value (Frequent Category Imputation):")
print(df)

### Arbitrary Value Imputation


df['Salary_Arbitrary'] = df['Salary'].fillna(99999) # Replace NaN in 'Salary' with an arbitrary value
print("\nAfter Arbitrary Value Imputation in 'Salary':")
print(df)

### Adding a variable to capture NA


df['Age_was_missing'] = df['Age'].isnull().astype(int) # Create a binary variable to indicate if 'Age' was missing
print("\nAfter adding a variable to capture if 'Age' was missing:")
print(df)

### Random Sampling Imputation


random_sample = df['Age'].dropna().sample(df['Age'].isnull().sum(), random_state=0)
random_sample.index = df[df['Age'].isnull()].index
df.loc[df['Age'].isnull(), 'Age_Random_Sample'] = random_sample
print("\nAfter Random Sampling Imputation in 'Age':")
print(df)

Observations: Thus students are able to implement data transformation: handling missing data and
filling missing data.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 5

Title: Program on Discretization and binning of data

Theory:

1. Discretization:
Discretization is the process of converting continuous data into discrete intervals, often useful when
continuous data needs to be categorized or simplified for analysis.
• Equal-width binning: The range of the data is divided into intervals of equal size. This method
works best when data is uniformly distributed.
o Example: For values between 0 and 100, dividing into 5 bins would create the ranges [0,
20), [20, 40), [40, 60), etc.
• Equal-frequency binning: Each bin contains roughly the same number of data points. This
method ensures that each bin represents a similar number of samples, making it useful when
data is skewed.
o Example: Dividing 100 data points into 5 bins of 20 data points each.
• Custom binning: You can define bins based on domain knowledge or specific needs (e.g.,
specific ranges of interest).

2. Binning:
Binning involves grouping continuous data into a specified number of discrete bins. This is useful in
reducing noise and helping identify trends or patterns that might not be immediately visible in the
continuous data.
• Helps detect patterns: By simplifying the data into bins, you can more easily identify
relationships or trends.
• Used for handling outliers: Extreme values can be placed in larger or "overflow" bins to
minimize their effect on analysis.
• Prepares data for algorithms: Some machine learning algorithms require or benefit from
discrete input data.

1) Binning Continuous Data:
You can use pandas.cut() to bin continuous values into predefined intervals.
By default, bins are intervals like (118, 125], where the left side is open (not inclusive) and the
right side is closed (inclusive).
import pandas as pd

# Example dataset of heights


height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

# Defining bins
bins = [118, 125, 135, 160, 200]

# Using pd.cut() to categorize heights into bins


category = pd.cut(height, bins)

# Display the result


print(category)

2) Adjusting Interval Boundaries:

By default, intervals are right-closed (right=True), meaning the right boundary is inclusive.

You can change this to left-closed intervals (right=False).


# Adjusting interval boundaries to be left-closed
category2 = pd.cut(height, [118, 126, 136, 161, 200], right=False)
print(category2)

3) Counting Values in Each Bin:

The pd.value_counts() method counts the number of values in each bin.


# Count the number of values in each bin
print(pd.value_counts(category))

4) Custom Bin Labels:

You can assign custom labels to bins by passing a list of labels to the labels argument.

# Custom bin labels


bin_names = ['Short Height', 'Average height', 'Good Height', 'Taller']
labeled_category = pd.cut(height, bins, labels=bin_names)
print(labeled_category)
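Equal-frequency binning, mentioned in the theory above, can be done with pd.qcut; a short sketch
using the same height data:

# Equal-frequency binning with pd.qcut: each bin holds roughly the same number of values
quartiles = pd.qcut(height, 4)
print(pd.value_counts(quartiles))

# Equal-frequency bins with custom labels
labeled_quartiles = pd.qcut(height, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(labeled_quartiles)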

Observations: Thus students are able to understand the process of discretization and binning of data.
D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 6

Title: Program on Handling dummy variables


Theory:
A dataset may contain various types of values; sometimes it consists of categorical values. So,
in order to use those categorical values efficiently in programming, we create dummy variables.
A dummy variable is a binary variable that indicates whether a separate categorical variable
takes on a specific value.
Explanation:

For example, three dummy variables are created for the three categorical values of the
temperature attribute (see Example 1 below). We can create dummy variables in Python using the
get_dummies() method.
Syntax: pandas.get_dummies(data, prefix=None, prefix_sep='_')

Parameters:
• data: the input data; it can be a pandas DataFrame, Series, list, NumPy array, etc.
• prefix: string to prepend to the dummy column names
• prefix_sep: separator placed between the prefix and the category value
Return Type: a DataFrame of dummy variables (see the short sketch after this list).
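A short sketch of the prefix and prefix_sep parameters (the category values below are hypothetical):

import pandas as pd

# Dummy columns are named <prefix><prefix_sep><category>, e.g. Temp_Cold, Temp_Hot, Temp_Warm
temp = pd.Series(['Hot', 'Cold', 'Warm'])
print(pd.get_dummies(temp, prefix='Temp', prefix_sep='_'))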

Step-by-step Approach:

• Import necessary modules


• Consider the data
• Perform operations on data to get dummies

Example 1:

import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold']})

# display dataset
print("Original DataFrame:")
print(df)

# create dummy variables


dummy_df = pd.get_dummies(df)
print("\nDataFrame with Dummy Variables:")
print(dummy_df)

Example 2:
import pandas as pd
import numpy as np

# create dataset
s = pd.Series(list('abca'))

# display dataset
print("Original Series:")
print(s)

# create dummy variables


dummy_s = pd.get_dummies(s)
print("\nSeries with Dummy Variables:")
print(dummy_s)

Example 3:

import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'A': ['hello', 'vignan', 'geeks'],
'B': ['vignan', 'hello', 'hello'],
'C': [1, 2, 3]})

# display dataset
print("Original DataFrame:")
print(df)

# create dummy variables


dummy_df = pd.get_dummies(df)
print("\nDataFrame with Dummy Variables:")
print(dummy_df)

Observations: Thus students are able to understand the concept of dummy variables.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 7

Title: Implementation of different distributions (normal, Poisson, uniform, gamma)

Theory:

Normal Distribution :

Normal distribution represents the behavior of most of the situations in the universe (That is

why it’s called a “normal” distribution. I guess!). The large sum of (small) random variables

often turns out to be normally distributed, contributing to its widespread application. Any

distribution is known as Normal distribution if it has the following characteristics:

1. The mean, median and mode of the distribution coincide.


2. The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
3. The total area under the curve is 1.
4. Exactly half of the values are to the left of the center and the other half to the right.

A normal distribution is highly different from Binomial Distribution. However, if the number of

trials approaches infinity then the shapes will be quite similar.

The PDF of a random variable X following a normal distribution is given by:

f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²))

The mean and variance of a random variable X which is said to be normally distributed is given

by:

Mean -> E(X) = µ

Variance -> Var(X) = σ^2

Here, µ (mean) and σ (standard deviation) are the parameters.

The graph of a random variable X ~ N (µ, σ) is shown below.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation

1. For such a case, the PDF becomes:

f(x) = (1 / √(2π)) e^(−x² / 2)

import numpy as np
import matplotlib.pyplot as plt

# Parameters for the normal distribution


mu = 0 # Mean
sigma = 1 # Standard deviation

# Generate normal distribution


normal_data = np.random.normal(mu, sigma, 1000)

# Plot the distribution


plt.hist(normal_data, bins=30, density=True, alpha=0.6, color='b')
plt.title('Normal Distribution')
plt.show()

Poisson Distribution:
Suppose you work at a call center, approximately how many calls do you get in a day? It can be

any number. Now, the entire number of calls at a call center in a day is modeled by Poisson

distribution. Some more examples are

1. The number of emergency calls recorded at a hospital in a day.


2. The number of thefts reported in an area on a day.
3. The number of customers arriving at a salon in an hour.
4. The number of suicides reported in a particular city.
5. The number of printing errors at each page of the book.

You can now think of many examples following the same course. Poisson Distribution is

applicable in situations where events occur at random points of time and space wherein our

interest lies only in the number of occurrences of the event.

A distribution is called Poisson distribution when the following assumptions are valid:

1. Any successful event should not influence the outcome of another successful event.

2. The probability of success over a short interval must equal the probability of success over a

longer interval.

3. The probability of success in an interval approaches zero as the interval becomes smaller.

Now, if any distribution validates the above assumptions then it is a Poisson distribution. Some

notations used in Poisson distribution are:

• λ is the rate at which an event occurs,


• t is the length of a time interval,
• And X is the number of events in that time interval.

Here, X is called a Poisson Random Variable and the probability distribution of X is called

Poisson distribution.

Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

P(X = k) = (e^(−µ) µ^k) / k!, for k = 0, 1, 2, ...

The mean µ is the parameter of this distribution; µ is also defined as λ times the length of the
interval. As the mean increases, the Poisson curve shifts to the right.

The mean and variance of X following a Poisson distribution:

Mean -> E(X) = µ

Variance -> Var(X) = µ

# Parameter for the Poisson distribution
lambda_ = 5 # Expected number of events

# Generate Poisson distribution


poisson_data = np.random.poisson(lambda_, 1000)

# Plot the distribution


plt.hist(poisson_data, bins=30, density=True, alpha=0.6, color='g')
plt.title('Poisson Distribution')
plt.show()

Uniform Distribution:

When you roll a fair die, the outcomes are 1 to 6. The probabilities of getting these outcomes are

equally likely and that is the basis of a uniform distribution. Unlike Bernoulli Distribution, all the

n number of possible outcomes of a uniform distribution are equally likely.

A variable X is said to be uniformly distributed if the density function is:

f(x) = 1 / (b − a) for a ≤ x ≤ b, and 0 otherwise

The shape of the uniform distribution curve is rectangular, which is the reason why the uniform
distribution is also called the rectangular distribution.

For a Uniform Distribution, a and b are the parameters.

The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of

40 and a minimum of 10.


Let’s try calculating the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40 − 20) * (1/(40 − 10)) = 0.667

The mean and variance of X following a uniform distribution is:

Mean -> E(X) = (a+b)/2

Variance -> V(X) = (b-a)²/12

The standard uniform density has parameters a = 0 and b = 1, so the PDF of the standard uniform
density is given by:

f(x) = 1 for 0 ≤ x ≤ 1

# Parameters for the uniform distribution
low = 0
high = 10

# Generate uniform distribution


uniform_data = np.random.uniform(low, high, 1000)

# Plot the distribution


plt.hist(uniform_data, bins=30, density=True, alpha=0.6, color='r')
plt.title('Uniform Distribution')
plt.show()

Gamma distribution:

Gamma(λ, r) or Gamma(α, β). Continuous. In the same Poisson process for the exponential

distribution, the gamma distribution gives the time to the r-th event. Thus, Exponential(λ) =

Gamma(λ, 1). The gamma distribution also has applications when r is not an integer. For that

generality the factorial function is replaced by the gamma function, Γ(x), described above. There

is an alternate parameterization Gamma(α, β) of the family of gamma distributions. The

connection is α = r, and β = 1/λ which is the expected time to the first event in a Poisson process.

Gamma(λ, r):

f(x) = (1 / Γ(r)) λ^r x^(r−1) e^(−λx) = (x^(α−1) e^(−x/β)) / (β^α Γ(α)), for x ∈ [0, ∞)

µ = r/λ = αβ,  σ² = r/λ² = αβ²,  m(t) = (1 − t/λ)^(−r) = (1 − βt)^(−α)

Application to Bayesian statistics. Gamma distributions are used in Bayesian statistics as

conjugate priors for the distributions in the Poisson process. In Gamma(α, β), α counts the

number of occurrences observed, while β keeps track of the elapsed time.

# Parameters for the gamma distribution


shape = 2 # Shape parameter (k)
scale = 2 # Scale parameter (theta)

# Generate gamma distribution


gamma_data = np.random.gamma(shape, scale, 1000)

# Plot the distribution


plt.hist(gamma_data, bins=30, density=True, alpha=0.6, color='y')
plt.title('Gamma Distribution')
plt.show()

Observations: Thus students are able to perform implementation of different distributions
(normal, Poisson, uniform, gamma).

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 8

Title: Program on Data cleaning


Theory:
Missing data is always a problem in real life scenarios. Areas like machine learning and data
mining face severe issues in the accuracy of their model predictions because of poor quality of
data caused by missing values. In these areas, missing value treatment is a major point of focus to
make their models more accurate and valid.
When and Why Is Data Missed?
Let us consider an online survey for a product. Many times, people do not share all the
information related to them. A few people share their experience, but not how long they have been
using the product; a few share how long they have been using the product and their experience, but
not their contact information. Thus, in one way or another, a part of the data is always missing, and
this is very common in real-world data.
Let us now see how we can handle missing values (say NA or NaN) using Pandas.
import pandas as pd
import numpy as np

# Create a DataFrame with random numbers and specified indices and columns
df = pd.DataFrame(np.random.randn(5, 3),
                  index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])

# Reindex the DataFrame to add new rows with specified indices
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Print the DataFrame


print(df)

Using reindexing, we have created a DataFrame with missing values. In the
output, NaN means Not a Number.

Check for Missing Values


To make detecting missing values easier (and across different array dtypes), Pandas provides
the isnull() and notnull() functions, which are also methods on Series and DataFrame objects −
Example

import pandas as pd
import numpy as np

# Create a DataFrame with random numbers and specified indices and columns
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

# Reindex the DataFrame to add new rows with specified indices


df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Check for null (NaN) values in the 'one' column and print the result
print(df['one'].isnull())

Cleaning / Filling Missing Data


Pandas provides various methods for cleaning the missing values. The fillna function can “fill in”
NA values with non-null data in a couple of ways, which we have illustrated in the following
sections.
Replace NaN with a Scalar Value
The following program shows how you can replace "NaN" with "0".

import pandas as pd
import numpy as np
# Create a DataFrame with random numbers and specified indices and columns
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one', 'two', 'three'])

# Reindex the DataFrame to add a new row with index 'b'


df = df.reindex(['a', 'b', 'c'])

# Print the original DataFrame


print(df)

# Replace NaN values with 0 and print the result


print("\nNaN replaced with '0':")
print(df.fillna(0))

Here, we are filling with the value zero; instead, we can also fill in any other value.

Fill NA Forward and Backward
Using the concepts of filling discussed earlier, we will fill the missing values.

Method              Action
pad / ffill         Fill values forward
bfill / backfill    Fill values backward
Example
import pandas as pd
import numpy as np

# Create a DataFrame with random numbers and specified indices and columns
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

# Reindex the DataFrame to add new rows with specified indices, introducing NaN values
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Use forward fill (pad) to fill missing values


print(df.fillna(method='pad'))

Drop Missing Values


If you want to simply exclude the missing values, then use the dropna function along with
the axis argument. By default, axis=0, i.e., along row, which means that if any value within a row
is NA then the whole row is excluded.
Example
import pandas as pd
import numpy as np

# Create a DataFrame with random numbers and specified indices and columns
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

# Reindex the DataFrame to add new rows with specified indices, introducing NaN values
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Print the DataFrame after dropping rows with NaN values


print(df.dropna())

Replace Missing (or) Generic Values
Many times, we have to replace a generic value with some specific value. We can achieve this by
applying the replace method.
Replacing NA with a scalar value is equivalent to the behavior of the fillna() function.
Example
import pandas as pd
import numpy as np

# Create a DataFrame with specified values


df = pd.DataFrame({
'one': [10, 20, 30, 40, 50, 2000],
'two': [1000, 0, 30, 40, 50, 60]
})

# Replace values 1000 with 10, and 2000 with 60


print(df.replace({1000: 10, 2000: 60}))

Observations: Thus students are able to understand the concept of data cleaning.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 9

Title: Implementation of descriptive statistics (variance, skewness, kurtosis, percentile)

Theory:

import numpy as np

A = np.array([[10, 14, 11, 7, 9.5, 15, 19],


[8, 9, 17, 14.5, 12, 18, 15.5],
[15, 7.5, 11.5, 10, 10.5, 7, 11],
[11.5, 11, 9, 12, 14, 12, 7.5]])

B = A.T
a = np.var(B, axis=0) # Variance along columns
b = np.var(B, axis=1) # Variance along rows

print(a) # Variance for each column


print(b) # Variance for each row

from scipy.stats import skew

skewness = skew(B, axis=0) # Skewness along columns


print(skewness)

from scipy.stats import kurtosis

kurtosis_value = kurtosis(B, axis=0, fisher=True) # Using Fisher's definition


print(kurtosis_value)

percentile_27 = np.percentile(B, 27, axis=0, interpolation='lower')


percentile_25 = np.percentile(B, 25, axis=1, interpolation='lower')
percentile_75 = np.percentile(B, 75, axis=0, interpolation='lower')
percentile_50 = np.percentile(B, 50, axis=0, interpolation='lower')

print(percentile_27)
print(percentile_25)
print(percentile_75)
print(percentile_50)

Observations: Thus students are able to implement descriptive statistics (variance, skewness,
kurtosis, percentile).

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 10

Title: Implementation of grouping and groupby.


Theory:
Groupby is a pretty simple concept. We can create a grouping of categories and apply a
function to each category. It is a simple concept, but it is an extremely valuable technique that is
widely used in data science. In real data science projects, you will be dealing with large amounts
of data and trying things over and over, so for efficiency we use the groupby concept. The groupby
concept is really important because its ability to aggregate data efficiently, both in performance and
in the amount of code, is magnificent. Groupby mainly refers to a process involving one or more of
the following steps:

• Splitting: a process in which we split data into groups by applying some conditions on the
dataset.
• Applying: a process in which we apply a function to each group independently.
• Combining: a process in which we combine the results after applying groupby into a data
structure.
The following steps will help in understanding the process involved in the groupby concept.

1. Group the unique values from the Team column

2. Now there’s a bucket for each group

3. Toss the other data into the buckets

4. Apply a function on the weight column of each bucket.

Splitting Data into Groups
Splitting is a process in which we split data into groups by applying some conditions on the
dataset. To split the data, we use the groupby() function, which splits the data into groups based on
some criteria. Pandas objects can be split on any of their axes. The abstract definition of
grouping is to provide a mapping of labels to group names. There are multiple ways to split data,
such as:

• obj.groupby(key)
• obj.groupby(key, axis=1)
• obj.groupby([key1, key2])
Note: In this context we refer to the grouping objects as the keys.

Grouping data with one key:


In order to group data with one key, we pass only one key as an argument in groupby function.

# Importing pandas module


import pandas as pd

# Define a dictionary containing employee data


data1 = {
'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32, 33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj', 'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA']
}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data1)

# Display the DataFrame


print(df)

Now we group the data by Name using the groupby() function.

# Using groupby function with one key


grouped = df.groupby('Name')

# Display the groups


print(grouped.groups)

Now we print the first entries in all the groups formed.
# Applying groupby() function to group the data on Name value
gk = df.groupby('Name')

# Let's print the first entries in all the groups formed


first_entries = gk.first()
print(first_entries)

Grouping data with multiple keys :


In order to group data with multiple keys, we pass multiple keys in groupby function.

# Importing pandas module


import pandas as pd

# Define a dictionary containing employee data


data1 = {
'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32, 33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj', 'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA']
}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data1)

# Display the DataFrame


print(df)

Now we group the data by "Name" and "Qualification" together using multiple keys in the groupby function.

# Using multiple keys in groupby() function


grouped = df.groupby(['Name', 'Qualification'])

# Print the groups formed


print(grouped.groups)

Grouping data by sorting keys :
Group keys are sorted by default using the groupby operation. User can pass sort=False for
potential speedups.

# Importing pandas module


import pandas as pd

# Define a dictionary containing employee data


data1 = {
'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32, 33, 36, 27, 32]
}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data1)

# Display the DataFrame


print(df)

Now we apply groupby() without sort


# Using groupby function without using sort
result = df.groupby(['Name'], sort=False).sum()

# Display the result


print(result)

Now we apply groupby() with the default sorting of group keys

# Using groupby function with sort (default is True)


result_sorted = df.groupby(['Name']).sum()

# Display the result


print(result_sorted)

Grouping data with object attributes:
The groups attribute is like a dictionary whose keys are the computed unique groups and whose
corresponding values are the axis labels belonging to each group.

# Importing pandas module


import pandas as pd

# Define a dictionary containing employee data


data1 = {
'Name': ['Jai', 'Anuj', 'Jai', 'Princi', 'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32, 33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj', 'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'Msc', 'MA']
}

# Convert the dictionary into DataFrame


df = pd.DataFrame(data1)

# Display the DataFrame


print(df)

Now we access the groups attribute, which exposes the grouping like a dictionary keyed by the group names.

# Using keys for grouping


groups = df.groupby('Name').groups

# Display the groups


print(groups)
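Because the groups mapping is dictionary-like, a natural next step (added here as a small sketch, not part of the original listing) is to pull out the rows of a single group with get_group(), again using the df defined above.

# Retrieve the rows belonging to one group (here, the employees named 'Jai')
jai_rows = df.groupby('Name').get_group('Jai')
print(jai_rows)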

Observations: Thus students are able to implement grouping of data using the groupby() function, a useful concept in data science.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 11

Title: Implementation of hypothesis testing - T Test

Theory:
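A t-test checks whether the means of one or two samples differ significantly from a hypothesized value or from each other. The following is a minimal sketch of an independent two-sample t-test using scipy.stats.ttest_ind; the sample values and the 0.05 significance level are invented purely for illustration.

# Minimal sketch: independent two-sample t-test (sample values are hypothetical)
from scipy import stats

group_a = [23, 25, 28, 30, 26, 27, 24, 29]
group_b = [31, 33, 29, 35, 32, 30, 34, 36]

# Compute the t-statistic and the two-tailed p-value
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t-statistic:", t_stat)
print("p-value:", p_value)

# Compare the p-value with a chosen significance level (here 0.05)
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ significantly")
else:
    print("Fail to reject the null hypothesis: no significant difference in means")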

Observations: Thus students are able to implement hypothesis testing – T Test.

D. Y. Patil College of Engineering and Technology, Kolhapur

Department of Computer Science & Engineering(Data Science)


Academic Year 2024-25

Subject: Exploratory Data Analysis and Visualization Laboratory Class: TY BTech DS (A)

Assignment No.: 12

Title: Create a simple dashboard using Tableau.


Theory:

A dashboard is a collection of different kinds of visualizations or views that we create in
Tableau. We can bring together different elements of multiple worksheets and put them on a
single dashboard. The dashboard option enables us to import and add charts and graphs from
worksheets to create a dashboard. On a dashboard, we can place relevant charts and graphs in
one view and analyze them for better insights.

Now, we will learn in a stepwise manner how to create a dashboard in Tableau Desktop.

1. Open a new dashboard


You can open a dashboard window either from the Dashboard option given on the menu bar or
from the Dashboard icon highlighted in red on the bottom bar.

Selecting the New Dashboard option or clicking on the Dashboard icon will open a new
window named Dashboard 1. You can change the name of the dashboard as per your liking.

2. Dashboard pane
In the window where we can create our dashboard, we get a lot of tabs and options related to
dashboarding. On the left, we have a Dashboard pane which shows the dashboard size, list of
available sheets in a workbook, objects, etc.

From the Dashboard tab, we can set the size of our dashboard. We can enter custom dimensions
like the width and height of the dashboard as per our requirements.

Or, you can select from a list of available fixed dashboard sizes as shown in the screenshot
below.

3. Layout pane
Right next to the Dashboard pane is the Layout pane where we can enhance the appearance and
layout of the dashboard by setting the position, size, border, background, and padding.

4. Adding a sheet
Now, we’ll add a sheet onto our empty dashboard. To add a sheet, drag and drop a sheet from
the Sheets column present in the Dashboard tab. It will display all the visualizations we have on
that sheet on our dashboard. If you wish to change or adjust the size and position of the
visual/chart/graph, click on the graph and then click on the small downward arrow at the
right. A drop-down list appears with the option Floating; select it. This unfixes the chart
from its position so that you can adjust it as per your liking.

Have a look at the picture below to see how you can drag a sheet or visual around on the
dashboard and adjust its size.

5. Adding more sheets
In a similar way, we can add as many sheets as we require and arrange them on the dashboard
properly.

Also, you can apply a filter or selections on one graph and treat it as a filter for all the other
visuals on the dashboard. To add a filter to a dashboard in Tableau, select the Use as Filter option
given on the right of every visual.

Then, on the selected visual, we make selections. For instance, we select the data point
corresponding to New Jersey in the heat map shown below. As soon as we select it, all the rest of
the graphs and charts change their information to make it relevant to New Jersey. Notice that in
the Region section, the only region left is East, which is where New Jersey is located.

6. Adding objects
Another set of tools that we get to make our dashboard more interactive and dynamic is in
the Objects section. We can add a wide variety of objects such as a web page, button, text box,
extension, etc.

From the objects pane, we can add a button and also select the action of that button, that is, what
that button should do when you click on it. Select the Edit Button option to explore the options
you can select from for a button object.

For instance, we add a web page of our DataFlair official site as shown in the screenshot below.

7. Final dashboard
Now, we move towards making a final dashboard in Tableau with all its elements in place. As
you can see in the screenshot below, we have three main visualizations on our current dashboard
i.e. a segmented area chart, scatter chart and a line chart showing the sales and profits forecast.
On the right pane, we have the list of legends showing Sub-category names, a forecast indicator
and a list of clusters.

We can add filters on this dashboard by clicking on a visual. For instance, we want to add a filter
based on months on the scatter plot showing sales values for different clusters. To add a months
filter, we click on the small downward arrow and then select the Filters option. Then we
select the Months of Order Date option. You can select any field on which you wish to
create a new filter.

This will give us a slider filter to select a range of months for which we want to see our data.
You can adjust the position of the filter box and drag and drop it at whichever place you want.

You can make more changes to the filter by right-clicking on it. Also, you can change the type
of filter from the drop-down menu, such as Relative Date, Range of Dates, Start Date, End Date,
Browse Periods, etc.

Similarly, you can add and edit more filters on the dashboard.

8. Presentation mode
Once our dashboard is ready, we can view it in the Presentation Mode. To enable the
presentation mode, click on the icon present on the bar at the top as shown in the screenshot
below or press F7.

This opens our dashboard in the presentation mode. So far we were working in the Edit Mode. In
the presentation mode, it neatly shows all the visuals and objects that we have added on the
dashboard. We can see how the dashboard will look when we finally present it to others or share
it with other people for analysis.

Here, we can also apply the filter range to our data. The dashboard is interactive and will change
the data according to the filters we apply or selections we make.

For instance, we selected the brand Pixel from our list of items from the sub-category field. This
instantly changes the information on the visuals and makes it relevant to only Pixel.
9. Share workbook with others
We can also share all the worksheets and dashboards that we create together as a workbook with
other users. To share the workbook with others, click on the share icon (highlighted in red).
Next, you need to enter the server address of a Tableau server.
Note – You must have a Tableau Online or Tableau Server account in order to do this.

Observations: Thus students are able to create a simple dashboard using Tableau.

Prepared by : Miss. Poonam R Patil

