
ccs346 Eda Lab Manual

lab

Uploaded by

Tishbian Meshach



CCS346 - EDA LAB Manual

Data Science and big data analysis (Annamalai University)


Downloaded by Tishbian Meshach (www.stishbian262@gmail.com)

THANTHAI PERIYAR
GOVERNMENT INSTITUTE OF TECHNOLOGY,
VELLORE - 02

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CCS346 - EXPLORATORY DATA ANALYSIS

LAB MANUAL
(R2021)

NAME: ……………………………………………....

REGISTER NUMBER: ………………………….…

YEAR AND SEMESTER: ………………………....

ACADEMIC YEAR: ………………………………..


Vision of the Institution

To provide high quality learning environment through innovative teaching and promote
research to produce globally competitive engineers of excellent quality.

Mission of the Institution

• To offer education programmes that blend intensive technical training with
appropriate guidance inculcating analytical skills and problem-solving ability with
high degree of professionalism.

• To provide healthy environment with excellent facilities for learning, research and
innovative thinking.

• To educate the students to achieve professional excellence with ethical and
social responsibilities.


CONTENTS

Ex. No.   Date   Name of the Experiment                                   Marks   Signature

1                Installing the Data Analysis and Visualization
                 tool: R/Python/Tableau Public/Power BI

2                Performing exploratory data analysis (EDA)
                 with datasets like email data set. Export all
                 your emails as a dataset, import them inside a
                 pandas data frame, visualize them and get
                 different insights from the data

3                Working with Numpy arrays, Pandas data
                 frames, Basic plots using Matplotlib

4                Exploring various variable and row filters in
                 R for cleaning data. Apply various plot
                 features in R on sample data sets and
                 visualize

5                Performing Time Series Analysis and apply
                 the various visualization techniques

6                Performing Data Analysis and representation
                 on a Map using various Map data sets with
                 Mouse Rollover effect, user interaction, etc.

7                Building cartographic visualization for
                 multiple datasets involving countries of the
                 world, states and districts in India, etc.

8                Performing EDA on Wine Quality Data set

9                Case study on a data set and apply the various
                 EDA and visualization techniques and
                 present an analysis report


EX. NO: 01
INSTALLATION OF DATA ANALYSIS AND VISUALIZATION TOOLS
DATE:

AIM
To install data analysis and visualization tools like R, Python, Tableau Public and
Power BI.

PROCEDURE

1. R:
• Download R: Visit the official R website (https://cran.r-project.org/) and download the
installer for your operating system (Windows, macOS, or Linux).
• Install R by following the instructions provided in the installer.

2. Python:
• Download Python: Visit the official Python website (https://www.python.org/) and
download the Python installer for your OS (Windows, macOS, or Linux).
• Install Python by running the installer and making sure to check the option to add
Python to your system's PATH during installation.

I. INSTALL NUMPY WITH PIP

NumPy (Numerical Python) is an open-source core Python library for
scientific computations. It is a general-purpose array and matrices processing
package.

pip install numpy

II. INSTALL JUPYTERLAB

Install JupyterLab with pip:

pip install jupyterlab

Once installed, launch JupyterLab with:

jupyter lab


III. JUPYTER NOTEBOOK

Install the classic Jupyter Notebook with:

pip install notebook

To run the notebook:

jupyter notebook

IV. INSTALL SCIPY

SciPy (Scientific Python) is a Python library for scientific and technical
computing, with routines for optimization, integration, interpolation, linear
algebra and statistics. It is built on top of the NumPy library.

pip install scipy

V. INSTALL PANDAS
pandas is a Python package that provides fast, flexible, and expressive
data structures designed to make working with "relational" or "labeled" data
both easy and intuitive.

pip install pandas

VI. INSTALL MATPLOTLIB

Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python.

pip install matplotlib
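
After the packages above are installed, a short script can confirm that each one is visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name installed_versions is an assumption made for illustration and is not part of the original procedure.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Package names taken from the pip commands in this exercise.
    report = installed_versions(["numpy", "scipy", "pandas", "matplotlib", "notebook"])
    for name, ver in report.items():
        print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED usually means pip ran against a different Python interpreter than the one executing the script.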


3. Tableau Public:
• Tableau Public: Visualizations can be created directly in the browser, so no installation is
required for web authoring. Visit the Tableau Public website (https://public.tableau.com/s/gallery)
and create an account to start using it. A free desktop application is also available for download.
4. Power BI:
• Download Power BI Desktop: Go to the official Power BI website
(https://powerbi.microsoft.com/en-us/desktop/) and download Power BI Desktop.
• Install Power BI Desktop by running the installer.

RESULT
Thus the data analysis and visualization tools R, Python, Tableau Public, and Power
BI have been installed successfully.


EX. NO: 02
PERFORM EXPLORATORY DATA ANALYSIS WITH EMAIL DATA SET
DATE:

AIM
To perform exploratory data analysis (EDA) with datasets like email data set. Export all
your emails as a dataset, import them inside a pandas data frame, visualize them and get different
insights from the data.

ABOUT THE DATASET

• The CSV file contains 5172 rows, one row for each email.
• There are 3002 columns.
• The first column indicates the email name.
• The name has been set with numbers and not recipients' names to protect privacy.
• The last column has the labels for prediction: 1 for spam, 0 for not spam.
• The remaining 3000 columns are the 3000 most common words in all the emails, after
excluding the non-alphabetical characters/words.
• For each row, the count of each word (column) in that email (row) is stored in the respective
cells.
• Thus, information regarding all 5172 emails is stored in a compact dataframe rather than as
separate text files.
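
The layout described above can be pictured with a toy frame. This is only a sketch: the ID column name "Email No.", the three words and all counts are invented for illustration, while the "Prediction" label column follows the program in this exercise.

```python
import pandas as pd

# Toy stand-in for emails.csv: an ID column, word-count columns, then the label.
# The words and counts are invented for illustration only.
toy = pd.DataFrame(
    {
        "Email No.": ["Email 1", "Email 2", "Email 3", "Email 4"],
        "the": [3, 0, 5, 1],
        "free": [0, 4, 0, 2],
        "meeting": [2, 0, 1, 0],
        "Prediction": [0, 1, 0, 1],
    }
)

# Everything between the first and last column is a word count.
word_counts = toy.iloc[:, 1:-1]

# Total occurrences of each word within each class (0 = not spam, 1 = spam).
totals_by_label = word_counts.groupby(toy["Prediction"]).sum()
print(totals_by_label)

# Most frequent word among the spam rows.
top_spam_word = totals_by_label.loc[1].idxmax()
print("Most frequent word in spam rows:", top_spam_word)
```

The slice iloc[:, 1:-1] is the same one the program uses to separate the 3000 word columns from the ID and label columns.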

ALGORITHM
1. Load the Dataset
2. Display Basic Information and First Few Rows
3. Visualize the Distribution of Labels (Spam or Not Spam)
4. Visualize the Correlation Between Features (Word Occurrences)
5. Display Summary Statistics of Word Occurrences
6. Display the Top 10 Most Common Words in Spam/Not Spam Emails


PROGRAM

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

file_path = '/emails.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset

print("Dataset Information:")
print(df.info())

# Display the first few rows of the dataset

print("\nFirst few rows of the dataset:")
print(df.head())

# Visualize the distribution of labels (spam or not spam)

plt.figure(figsize=(8, 6))
sns.countplot(x='Prediction', data=df)
plt.title('Distribution of Spam and Not Spam Emails')
plt.xlabel('Prediction (1: Spam, 0: Not Spam)')
plt.ylabel('Count')
plt.show()

# Visualize the correlation between features (word occurrences)

plt.figure(figsize=(10, 8))
sns.heatmap(df.iloc[:, 1:-1].corr(), cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Word Occurrences')


plt.show()

# Display summary statistics of word occurrences

word_occurrences_summary = df.iloc[:, 1:-1].describe()
print("\nSummary Statistics of Word Occurrences:")
print(word_occurrences_summary)

# Display the most common words in spam emails

spam_word_frequencies = df[df['Prediction'] == 1].iloc[:, 1:-1].sum().sort_values(ascending=False)
print("\nTop 10 Most Common Words in Spam Emails:")
print(spam_word_frequencies.head(10))

# Display the most common words in not spam emails

not_spam_word_frequencies = df[df['Prediction'] == 0].iloc[:, 1:-1].sum().sort_values(ascending=False)

print("\nTop 10 Most Common Words in Not Spam Emails:")

print(not_spam_word_frequencies.head(10))

OUTPUT


RESULT:
Thus exploratory data analysis (EDA) was performed on the email data set: the emails were
exported as a dataset, imported into a pandas data frame, and visualized to obtain different
insights from the data.


EX. NO: 03
WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS
USING MATPLOTLIB
DATE:

AIM
To implement Python programs with NumPy arrays, Pandas data frames, and basic plots using
Matplotlib.

WORKING WITH NUMPY ARRAYS

ALGORITHM
1. Array Creation - Create 1-dimensional (arr1) and 2-dimensional (arr2) NumPy arrays.
2. Basic Operations - Perform basic operations like addition and multiplication on arrays.
3. Accessing Elements and Slicing - Access individual elements and perform slicing operations.
4. Array Shape and Dimensions - Retrieve the shape and number of dimensions of the array.
5. Reshape Array - Reshape a 1D array (arr1) to have 5 rows and 1 column.
6. Displaying Results - Print the original arrays and the results of operations.

PROGRAM
import numpy as np
# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5]) # 1-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]]) # 2-dimensional array
# Basic operations on arrays
arr_sum = arr1 + 10 # Add 10 to each element
arr_product = arr2 * 2 # Multiply each element by 2
# Accessing elements and slicing
element_at_index_2 = arr1[2] # Accessing element at index 2
sliced_arr1 = arr1[1:4] # Slicing indices 1 to 3 (the stop index 4 is exclusive)
sliced_arr2 = arr2[:, 1] # Accessing the second column of arr2


# Array shape and dimensions

shape_of_arr2 = arr2.shape # Shape of arr2
dimensions_of_arr2 = arr2.ndim # Number of dimensions in arr2
# Reshape the array
reshaped_arr1 = arr1.reshape((5, 1)) # Reshape arr1 to have 5 rows and 1 column
# Displaying results
print("Original 1D array:")
print(arr1)
print("\nOriginal 2D array:")
print(arr2)
print("\nArray operations:")
print("arr1 + 10:", arr_sum)
print("arr2 * 2:", arr_product)
print("\nArray slicing:")
print("Element at index 2:", element_at_index_2)
print("Sliced arr1:", sliced_arr1)
print("Second column of arr2:", sliced_arr2)
print("\nArray shape and dimensions:")
print("Shape of arr2:", shape_of_arr2)
print("Number of dimensions in arr2:", dimensions_of_arr2)
print("\nReshaped array:")
print(reshaped_arr1)
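
The slicing and reshaping rules used in the program can be checked with a few assertions. This is a small sketch on the same arr1 and arr2 values; the use of -1 in reshape and the axis-wise sums are additions for illustration and do not appear in the original program.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# A slice includes the start index and excludes the stop index.
assert arr1[1:4].tolist() == [2, 3, 4]

# -1 lets NumPy infer one dimension from the array size.
column = arr1.reshape(-1, 1)
assert column.shape == (5, 1)

# A scalar is broadcast across every element.
assert (arr1 + 10).tolist() == [11, 12, 13, 14, 15]

# axis=0 sums down the rows (one value per column); axis=1 sums across the columns.
col_sums = arr2.sum(axis=0)
row_sums = arr2.sum(axis=1)
print("Column sums:", col_sums)
print("Row sums:", row_sums)
```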

OUTPUT


WORKING WITH PANDAS

ALGORITHM
1. Creating a Pandas DataFrame - Create a simple DataFrame using a dictionary of lists.
2. Displaying the DataFrame - Print the original DataFrame.

3. Basic Operations on DataFrame - Add a new column ('Salary') and increment the 'Age' column
by 1.
4. Filtering Data - Select rows where age is less than 30.
5. Grouping and Aggregation - Calculate the average salary for each unique city in the DataFrame.

6. Displaying Results - Print the updated DataFrame, filtered data, and the result of grouping and
aggregation.


PROGRAM
import pandas as pd
# Creating a Pandas DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
'Age': [25, 30, 22, 35, 28],
'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}
df = pd.DataFrame(data)
# Displaying the DataFrame
print("Original DataFrame:")
print(df)
# Basic operations on DataFrame
df['Salary'] = [60000, 75000, 50000, 90000, 65000] # Adding a new column
df['Age'] = df['Age'] + 1 # Incrementing the 'Age' column by 1
# Filtering data
young_people = df[df['Age'] < 30] # Selecting rows where age is less than 30
# Grouping and aggregation
city_avg_salary = df.groupby('City')['Salary'].mean()
# Displaying the updated DataFrame and filtered data
print("\nUpdated DataFrame:")
print(df)
print("\nYoung people (Age < 30):")
print(young_people)
print("\nAverage Salary by City:")
print(city_avg_salary)
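
Filters can be combined, and several aggregates can be computed in one pass. The sketch below rebuilds the DataFrame as it stands after the program's updates (ages already incremented). The "Band" column and its 30-year cut-off are hypothetical additions, used only to give groupby a key with more than one row per group, since every city in the sample occurs once.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie", "David", "Emily"],
        "Age": [26, 31, 23, 36, 29],
        "City": ["New York", "San Francisco", "Los Angeles", "Chicago", "Boston"],
        "Salary": [60000, 75000, 50000, 90000, 65000],
    }
)

# Combine two conditions with & (each condition needs its own parentheses).
young_high_earners = df[(df["Age"] < 30) & (df["Salary"] >= 60000)]
print(young_high_earners["Name"].tolist())

# Hypothetical grouping key: an age band instead of the city.
df["Band"] = df["Age"].apply(lambda age: "under 30" if age < 30 else "30 and over")

# Several aggregates per group in a single call.
salary_by_band = df.groupby("Band")["Salary"].agg(["count", "mean"])
print(salary_by_band)
```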


OUTPUT

BASIC PLOTS USING MATPLOTLIB

ALGORITHM
1. Create Sample Data - Generate x values from 0 to 10 with 100 data points. Calculate y values
for sin(x) and cos(x).
2. Plotting the Data - Create a figure with a specific size. Plot sin(x) and cos(x) using plt.plot.
3. Plot sampled points using plt.scatter.
4. Adding Labels and Title - Add labels to the X and Y axes. Set the plot title.
5. Legend and Grid - Display a legend to label the plotted lines. Add a grid to the plot.
6. Show the Plot
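
Before plotting, the sample data from steps 1 and 3 can be checked numerically. This is a minimal sketch using NumPy only; the assertions are illustrative additions and are not part of the plotting program.

```python
import numpy as np

x = np.linspace(0, 10, 100)  # both endpoints are included
y1 = np.sin(x)
y2 = np.cos(x)

# 100 points give 99 equal intervals, so the spacing is 10 / 99.
step = x[1] - x[0]
assert np.isclose(step, 10 / 99)

# Taking every 10th point leaves 10 samples for the scatter overlay.
sampled_x = x[::10]
assert sampled_x.size == 10

# Sanity check on the two curves: sin^2 + cos^2 = 1 at every point.
assert np.allclose(y1**2 + y2**2, 1.0)
print("spacing:", step, "sampled points:", sampled_x.size)
```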


PROGRAM
import matplotlib.pyplot as plt
import numpy as np
# Create sample data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y1 = np.sin(x)
y2 = np.cos(x)
# Plotting the data
plt.figure(figsize=(8, 6))
# Line plot
plt.plot(x, y1, label='sin(x)', color='blue', linestyle='-', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='green', linestyle='--', linewidth=2)
# Scatter plot
plt.scatter(x[::10], y1[::10], marker='o', color='red', label='Sampled Points')
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sin and Cos Functions')
plt.legend() # Show legend
plt.grid(True) # Add a grid
# Show the plot
plt.show()

OUTPUT


RESULT
Thus Python programs using NumPy arrays, Pandas data frames, and basic Matplotlib plots
have been executed successfully.


EX. NO: 04
EXPLORE VARIOUS VARIABLE AND ROW FILTERS IN R
DATE:

AIM
To explore various variable and row filters in R for cleaning data and to apply various plot
features in R on sample data sets and visualize.

ALGORITHM:
• Loading necessary libraries (installing if needed).
• Generating a sample dataset (the program below uses synthetic data in place of the
Smoking, Alcohol and (O)esophageal Cancer dataset).
• Displaying the summary of the sample data.
• Applying variable filters and row filters.
• Creating different types of plots using ggplot2.
• Saving the generated plots as image files.

PROGRAM:
# Install and load necessary libraries
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Sample data generation
set.seed(123)
sample_data <- data.frame(
  ID = 1:100,
  Age = sample(18:60, 100, replace = TRUE),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000),
  Score = rnorm(100, mean = 70, sd = 10)
)

# Display summary of the sample data
print("Summary of Sample Data:")
print(summary(sample_data))

# Variable filter: selecting specific columns
selected_columns <- sample_data[, c("ID", "Age", "Income")]

# Row filter: keeping rows that satisfy a condition (Age greater than 30)
filtered_data <- subset(sample_data, Age > 30)

# Data visualization using ggplot2
# Scatter plot of Age vs. Income
scatter_plot <- ggplot(sample_data, aes(x = Age, y = Income)) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs. Income", x = "Age", y = "Income")
print(scatter_plot)

# Histogram of Age
histogram_plot <- ggplot(sample_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency")
print(histogram_plot)

# Boxplot of Score by Gender
box_plot <- ggplot(sample_data, aes(x = Gender, y = Score, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of Score by Gender", x = "Gender", y = "Score")
print(box_plot)

# Save plots as image files. Each plot is passed explicitly, because
# ggsave() otherwise saves only the last plot that was displayed.
ggsave("scatter_plot.png", plot = scatter_plot, height = 5, width = 7, dpi = 300)
ggsave("histogram.png", plot = histogram_plot, height = 5, width = 7, dpi = 300)
ggsave("boxplot.png", plot = box_plot, height = 5, width = 7, dpi = 300)

# Display a message indicating the completion of the program
print("Program completed successfully.")


OUTPUT:

RESULT


EX. NO: 05
TIME SERIES ANALYSIS
DATE:

AIM
To perform Time Series Analysis and apply the various visualization techniques using R
language.

DATA DESCRIPTION:
• The training dataset consists of approximately 145k time series.
• Each of these time series represents the number of daily views of a different Wikipedia article,
starting from July 1st, 2015 up until December 31st, 2016. The leaderboard during the
training stage is based on traffic from January 1st, 2017 up until March 1st, 2017.
• The second stage will use training data up until September 1st, 2017.
• The final ranking of the competition will be based on predictions of daily views between
September 13th, 2017 and November 13th, 2017 for each article in the dataset.
• You will submit your forecasts for these dates by September 12th.
• For each time series, you are provided the name of the article as well as the type of traffic that
this time series represents (all, mobile, desktop, spider).
• You may use this metadata and any other publicly available data to make predictions.
Unfortunately, the data source for this dataset does not distinguish between traffic values of
zero and missing values.
• A missing value may mean the traffic was zero or that the data is not available for that day.
• To reduce the submission file size, each page and date combination has been given a shorter
Id. The mapping between page names and the submission Id is given in the key files.
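
Because zero and missing values are not distinguished, an analysis has to pick a convention before computing statistics. The sketch below uses a small invented series (the dates and view counts are not taken from the competition data) to show how that choice changes the mean.

```python
import numpy as np
import pandas as pd

# Invented daily views for one article; NaN marks a day with no recorded value.
views = pd.Series(
    [120.0, np.nan, 90.0, np.nan, 150.0],
    index=pd.date_range("2015-07-01", periods=5, freq="D"),
)

# Convention 1: a missing value means zero traffic.
mean_if_zero = views.fillna(0).mean()

# Convention 2: a missing value means "not available", so it is skipped
# (pandas ignores NaN in mean() by default).
mean_if_unknown = views.mean()

# Convention 3: estimate the gaps from the neighbouring days.
interpolated = views.interpolate(method="linear")

print("Mean treating gaps as zero:", mean_if_zero)
print("Mean ignoring gaps:", mean_if_unknown)
print("Linearly interpolated series:", interpolated.tolist())
```

Whichever convention is chosen, it should be applied consistently across all series before they are compared.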

ALGORITHM:
1. Generate Synthetic Time Series Data (Web Traffic Time Series Forecasting)
2. Plot the Time Series
3. Decompose Time Series
4. Plot Decomposed Components


5. Autocorrelation Plot
6. Partial Autocorrelation Plot

PROGRAM:
# Set seed for reproducibility
set.seed(123)

# Generate synthetic daily time series data (a random walk)
date_sequence <- seq(as.Date("2022-01-01"), by = "days", length.out = 365)
time_series_data <- data.frame(
  Date = date_sequence,
  Value = cumsum(rnorm(365))
)

# Plot the time series
plot(time_series_data$Date, time_series_data$Value, type = "l",
     xlab = "Date", ylab = "Value", main = "Synthetic Time Series")

# Decompose the time series into trend, seasonality, and remainder components.
# decompose() needs at least two full periods, so a weekly period (frequency = 7)
# is used here; frequency = 365 fails when only 365 observations are available.
decomposition <- decompose(ts(time_series_data$Value, frequency = 7))

# Plot the decomposed components
par(mfrow = c(3, 1))
plot(decomposition$seasonal, main = "Seasonal Component")
plot(decomposition$trend, main = "Trend Component")
plot(decomposition$random, main = "Residual Component")
par(mfrow = c(1, 1)) # Restore the single-plot layout

# Autocorrelation plot
acf(time_series_data$Value, main = "Autocorrelation Plot")

# Partial autocorrelation plot
pacf(time_series_data$Value, main = "Partial Autocorrelation Plot")
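
For readers who want to see what the autocorrelation plot computes, here is a NumPy sketch of the sample autocorrelation at lag k for a random walk built the same way as above. The function name sample_acf is an assumption for illustration, and the random numbers will differ from the R run because the two languages use different generators.

```python
import numpy as np


def sample_acf(values, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered**2)
    acf = []
    for k in range(max_lag + 1):
        # Sum of products between the series and itself shifted by k steps.
        num = np.sum(centered[: x.size - k] * centered[k:])
        acf.append(num / denom)
    return np.array(acf)


rng = np.random.default_rng(123)
random_walk = np.cumsum(rng.normal(size=365))  # cumulative sum of normal steps

r = sample_acf(random_walk, max_lag=5)
print(np.round(r, 3))
```

The value at lag 0 is always 1, since the series is compared with itself; the later lags show how slowly the correlation decays.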


OUTPUT:

RESULT


EX. NO: 06
PERFORM DATA ANALYSIS AND REPRESENTATION ON A MAP
DATE:

AIM
To perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.

DATASET:
The Crimes in Boston dataset is used to explore the following questions:
• What types of crimes are most common?
• Where are different types of crimes most likely to occur?
• Does the frequency of crimes change over the day? Week? Year?
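
The first and third questions reduce to counting rows per category. The sketch below uses a handful of invented records; only the column names OFFENSE_CODE_GROUP and HOUR are taken from the programs in this exercise.

```python
import pandas as pd

# Invented records using the OFFENSE_CODE_GROUP and HOUR column names
# that the programs in this exercise rely on.
toy_crimes = pd.DataFrame(
    {
        "OFFENSE_CODE_GROUP": [
            "Larceny", "Robbery", "Larceny", "Auto Theft", "Larceny", "Robbery",
        ],
        "HOUR": [9, 14, 14, 23, 14, 9],
    }
)

# Which offense groups are most common?
counts_by_type = toy_crimes["OFFENSE_CODE_GROUP"].value_counts()
print(counts_by_type)

# Does the frequency change over the day?
counts_by_hour = toy_crimes.groupby("HOUR").size()
print(counts_by_hour)

busiest_hour = counts_by_hour.idxmax()
print("Busiest hour:", busiest_hour)
```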

ALGORITHM
1. Import Libraries
i. Import the necessary libraries, including pandas and folium, and HeatMap and
MousePosition from folium.plugins.
2. Load Crime Dataset
3. Data Cleaning and Filtering
i. Drop rows with missing values in the 'Lat' or 'Long' columns.
ii. Filter out rows where latitude or longitude is zero.
4. Create Folium Map
5. Add HeatMap Layer
6. Add Mouse Position - Add a MousePosition plugin to display latitude and longitude on
mouseover.
7. Save and Display - Save the map to an HTML file and display it.


PROGRAM:
A)

import pandas as pd
import folium
from folium import Marker

# Base map centered on Boston
base_map = folium.Map(location=[42.32, -71.0589], tiles='openstreetmap', zoom_start=10)
base_map

# Load the crime data and drop rows without coordinates
crimes = pd.read_csv("/content/sample_data/crime.csv", encoding='latin-1')
crimes.dropna(subset=['Lat', 'Long', 'Location'], inplace=True)

# Keep the selected offense groups
crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin([
    'Larceny', 'Auto Theft', 'Robbery', 'Larceny From Motor Vehicle',
    'Residential Burglary', 'Simple Assault', 'Harassment', 'Ballistics',
    'Aggravated Assault', 'Other Burglary', 'Arson', 'Commercial Burglary',
    'HOME INVASION', 'Homicide', 'Criminal Harassment', 'Manslaughter'])]

# Keep records from 2018 onwards
crimes = crimes[crimes.YEAR >= 2018]
crimes

# Robberies reported between 09:00 and 17:59
daytime_robberies = crimes[(crimes.OFFENSE_CODE_GROUP == 'Robbery') &
                           (crimes.HOUR.isin(range(9, 18)))]

# Plot one marker per daytime robbery
robbery_map = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']]).add_to(robbery_map)
robbery_map

B)

import pandas as pd
import folium
from folium.plugins import HeatMap, MousePosition

# Load the crime dataset (replace the path below with the actual file path)
file_path = '/content/sample_data/crime.csv'
crime_df = pd.read_csv(file_path, encoding='latin-1')

# Data cleaning and filtering
crime_df = crime_df.dropna(subset=['Lat', 'Long'])
crime_df = crime_df[(crime_df['Lat'] != 0) & (crime_df['Long'] != 0)]

# Create a Folium map centered around the average coordinates of crimes
map_center = [crime_df['Lat'].mean(), crime_df['Long'].mean()]
crime_map = folium.Map(location=map_center, zoom_start=12)

# Add HeatMap layer to visualize crime density
heat_data = crime_df[['Lat', 'Long']].values.tolist()
HeatMap(heat_data, radius=15).add_to(crime_map)

# Add MousePosition for latitude and longitude display on mouseover
MousePosition().add_to(crime_map)

# Save the map to an HTML file
crime_map.save('crime_map.html')

# Display the map
crime_map

OUTPUT:


RESULT


EX. NO: 07
CARTOGRAPHIC VISUALIZATION FOR MULTIPLE DATASETS
DATE:

AIM
To build cartographic visualization for multiple datasets involving various countries of the
world, states and districts in India, etc.

ALGORITHM:
1. Import Necessary Libraries - Import the required libraries, including pandas, geopandas, and
folium.

2. Load World Countries Shapefile - Use gpd.read_file to load the world countries shapefile. This
shapefile is obtained from the Natural Earth dataset.

3. Load India States Shapefile - Use gpd.read_file to load the India states shapefile. Ensure that all
necessary files (.shp, .shx, .dbf) are present in the specified path.
4. Create a Folium Map - Use folium.Map to create a Folium map centered around the world.

5. Add World Countries to the Map - Use folium.Choropleth to add world countries to the map.
This creates a choropleth layer based on population estimates.

6. Add India States to the Map - Use folium.GeoJson to add India states to the map. This layer
displays the boundaries of Indian states.
7. Save the Map to an HTML File - Use the save method of the Folium map object to save the map
as an HTML file.
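
A choropleth has to match each data row to a map feature; in the program this is what key_on='feature.properties.name' together with the columns argument describes. The sketch below imitates that lookup in plain Python on a hand-written GeoJSON fragment. The country names, coordinates and population figures are made up, and the helper resolve_key is hypothetical, not a folium function.

```python
# Hand-written GeoJSON with two features (coordinates are placeholders).
geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Atlantis"},
            "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        },
        {
            "type": "Feature",
            "properties": {"name": "Lemuria"},
            "geometry": {"type": "Point", "coordinates": [10.0, 10.0]},
        },
    ],
}

# Made-up values keyed by the same names, like a (name, pop_est) column pair.
population = {"Atlantis": 1_000, "Lemuria": 25_000}


def resolve_key(feature, key_on):
    """Follow a dotted path such as 'feature.properties.name' into a feature."""
    node = feature
    for part in key_on.split(".")[1:]:  # skip the leading 'feature'
        node = node[part]
    return node


# Join: look up each feature's value through its key.
joined = {}
for feature in geojson["features"]:
    key = resolve_key(feature, "feature.properties.name")
    joined[key] = population.get(key)
print(joined)
```

A feature whose key has no matching data row would receive None here, which is why the names in the data and in the shapefile need to agree exactly.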

PROGRAM:

import geopandas as gpd
import folium

# Load world countries shapefile (Natural Earth).
# Note: gpd.datasets was removed in GeoPandas 1.0. With newer versions, download the
# Natural Earth countries shapefile and pass its path to gpd.read_file instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Load India states shapefile
india_states = gpd.read_file('/content/sample_data/Indian_States.shp')

# Create a Folium map centered around the world
world_map = folium.Map(location=[0, 0], zoom_start=2)

# Add world countries to the map, shaded by population estimate.
# data must be supplied for the columns and fill_color settings to take effect.
folium.Choropleth(
    geo_data=world,
    name='choropleth',
    data=world,
    columns=['name', 'pop_est'],
    key_on='feature.properties.name',
    fill_color='YlGnBu',
    fill_opacity=0.7,
    line_opacity=0.2,
).add_to(world_map)

# Add India states to the map
folium.GeoJson(india_states).add_to(world_map)

# Save the map to an HTML file
world_map.save('world_map.html')

# Display the map
world_map


OUTPUT:

RESULT:


EX. NO: 08
EDA ON WINE QUALITY DATA SET
DATE:

AIM
To perform EDA on the Wine Quality Data Set.

DESCRIPTION
• This dataset is related to red variants of the Portuguese "Vinho Verde" wine.
• The dataset describes the amount of various chemicals present in wine and their effect on its
quality.
• The dataset can be viewed as a classification or a regression task.
• The classes are ordered and not balanced (e.g. there are many more normal wines than
excellent or poor ones).
• Our task is to predict the quality of wine using the given data.
• It is a simple yet challenging project: the complexity arises because the dataset has few
samples and is highly imbalanced.
• This data frame contains the following columns:

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
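
The class imbalance mentioned above can be quantified before any modelling. The following is a minimal sketch on invented quality scores (not the real file), showing the share of each class and a rough imbalance ratio.

```python
import pandas as pd

# Invented quality scores; the real file has one score per wine sample.
quality = pd.Series([5, 5, 6, 5, 6, 7, 5, 6, 3, 8], name="quality")

# Share of samples in each class, from most to least frequent.
class_share = quality.value_counts(normalize=True)
print(class_share)

# Ratio between the largest and smallest class as a rough imbalance measure.
counts = quality.value_counts()
imbalance_ratio = counts.max() / counts.min()
print("Largest class is", imbalance_ratio, "times the smallest")
```

On the real data the same two calls can be applied to the quality column of the loaded data frame.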


ALGORITHM
1. Load the Data - The Wine Quality Data Set for red variants is loaded from the provided URL.

2. Display Basic Information - Display the first few rows of the dataset, summary statistics, and
check for missing values.
3. Distribution of Wine Quality - Visualize the distribution of wine quality using a count plot.

4. Correlation Matrix - Calculate the correlation matrix of features and visualize it using a
heatmap.
5. Pair Plot - Create a pair plot for selected features to observe relationships and distributions.

6. Box Plots - Use box plots to visualize the distribution of selected features across different
wine qualities.

PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Wine Quality Data Set for red variants

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data_red = pd.read_csv(url, sep=';')

# Display the first few rows of the dataset

print("First few rows of the Wine Quality Data Set for red variants:")
print(wine_data_red.head())

# Summary statistics
print("\nSummary Statistics:")
print(wine_data_red.describe())


# Check for missing values

print("\nMissing Values:")
print(wine_data_red.isnull().sum())

# Distribution of wine quality

plt.figure(figsize=(8, 6))
sns.countplot(x='quality', data=wine_data_red, palette='viridis')
plt.title('Distribution of Wine Quality for Red Variants')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

# Correlation matrix
correlation_matrix_red = wine_data_red.corr()

# Heatmap of the correlation matrix

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_red, annot=True, cmap='coolwarm',
fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix for Red Variants')
plt.show()

# Pair plot for selected features

selected_features_red = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                         'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                         'pH', 'sulphates', 'alcohol', 'quality']
sns.pairplot(wine_data_red[selected_features_red], diag_kind='kde')
plt.suptitle("Pair Plot of Selected Features for Red Variants", y=1.02)
plt.show()

# Box plot for selected features (one subplot per input variable)
plt.figure(figsize=(12, 8))
for i, feature in enumerate(selected_features_red[:-1]):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(x='quality', y=feature, data=wine_data_red, palette='viridis')
    plt.title(f'Box Plot of {feature} by Wine Quality')
    plt.xlabel('Quality')
    plt.ylabel(feature)
plt.tight_layout()
plt.show()

OUTPUT


RESULT


EX. NO: 09
CASE STUDY
DATE:

AIM
To carry out a case study on a data set, apply various EDA and visualization techniques,
and present an analysis report.

LIST OF CASE STUDIES

1. Exploring Airbnb Data in a Specific City
Analyzing price distribution, factors affecting pricing, and popular neighborhoods.
2. Analyzing COVID-19 Trends
Visualizing daily cases, trends over time, and regional variations.
3. Customer Segmentation for an E-commerce Platform
Using purchase history and demographics for segmentation.
4. Predicting Housing Prices
EDA on housing data, identifying features influencing prices, and building a predictive
model.
5. Social Media Sentiment Analysis
Analyzing sentiment trends on platforms like Twitter or Reddit.
6. Exploring Movie Ratings Data
Analyzing movie ratings, genres, and factors affecting popularity.
7. Stock Market Analysis
Visualizing stock prices, trends, and correlation between different stocks.
8. Online Retail Analysis
Understanding customer behavior, popular products, and seasonality.
9. Analyze and Visualize Traffic Patterns
Using GPS or traffic data to understand peak hours and congestion.
10. Food Delivery Analysis
Exploring delivery times, customer satisfaction, and popular cuisines.


11. Credit Card Fraud Detection

EDA on a credit card transactions dataset and building a fraud detection model.
12. Visualizing Global CO2 Emissions
Analyzing emission trends, contributors, and regional variations.
13. Analysis of Mobile App Reviews
Sentiment analysis and feature popularity in app reviews.
14. Exploring Social Network Data
Analyzing user connections, engagement, and influential users.
15. Detecting Anomalies in Network Traffic
Identifying unusual patterns in network data.
16. E-commerce Sales Forecasting
Predicting future sales based on historical data.
17. Analysis of Road Accidents Data
Understanding factors leading to accidents and identifying high-risk areas.
18. Explore Health and Fitness Data
Analyzing fitness tracker data, user habits, and correlations.
19. Survey Data Analysis
Analyzing survey responses and identifying trends or patterns.
20. Predictive Maintenance in Manufacturing
Analyzing equipment failure patterns to predict maintenance needs.
21. Exploring Educational Data
Analyzing student performance, factors affecting grades, and trends.
22. Visualizing Weather Patterns
Analyzing temperature, precipitation, and other weather data.
23. Analysis of Online User Behavior
Understanding user interactions on a website or app.
24. Predictive Analysis for Employee Attrition
Identifying factors leading to employee attrition and predicting future attrition.
25. Analysis of Music Streaming Data
Identifying popular genres, artists, and user preferences.
26. Effect of Ad Campaigns on Sales
Analyzing the impact of advertising on product sales.
27. Explore Traffic Camera Data
Analyzing traffic camera feeds for congestion patterns.
28. Visualizing Population Density
Understanding population distribution and density.
29. Analysis of Social Services Data
Analyzing the utilization of social services in a community.
30. Explore Google Analytics Data
Analyzing website traffic, user interactions, and popular content.
31. Analyze Restaurant Reviews
Sentiment analysis and identifying factors influencing restaurant ratings.
32. Water Quality Analysis
Analyzing water quality parameters over time.
33. Analysis of Online Gaming Data
Understanding player behavior, popular games, and engagement.
34. Explore Flight Delay Data
Analyzing factors contributing to flight delays.
35. Analysis of Kickstarter Campaigns
Understanding successful vs. unsuccessful crowdfunding campaigns.
36. Visualizing Earthquake Data
Analyzing earthquake patterns, magnitudes, and affected regions.
37. Explore Google Trends Data
Analyzing search trends over time.
38. Analyze Call Center Data
Understanding call volumes, response times, and customer satisfaction.
39. Explore Museum Visitor Data
Analyzing visitor demographics and exhibit popularity.
40. Analysis of Cybersecurity Incidents
Identifying patterns in cybersecurity incident data.
41. Explore Government Spending Data
Analyzing government expenditure in different sectors.
42. Visualizing Social Media Influence
Identifying influencers and their impact on social media.
43. Analysis of Online Course Engagement
Understanding factors affecting student engagement in online courses.
44. Explore Retail Store Sales Data
Analyzing sales patterns, promotions, and customer preferences.
45. Analyze Road Infrastructure Data
Identifying traffic bottlenecks and road conditions.
46. Visualizing Wildlife Migration Patterns
Analyzing tracking data for migratory animals.
47. Analysis of Hotel Reviews
Sentiment analysis and identifying factors influencing hotel ratings.
48. Explore Solar Energy Production Data
Analyzing solar energy production patterns.
49. Analysis of Social Housing Data
Understanding housing affordability and distribution.
50. Visualizing Satellite Imagery Data
Analyzing satellite imagery for land use patterns.
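
Several of the ideas above (for example, item 15, detecting anomalies in network traffic) begin with the same first step: flag values that sit far from the rest of the data. The sketch below is one possible starting point, not a prescribed method. It uses a simple z-score rule from Python's standard library; the function name flag_outliers, the threshold of 2.0, and the sample counts are all assumptions made up for illustration.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return the indices of values whose |z-score| exceeds the threshold.

    The threshold of 2.0 is an illustrative assumption, not a standard.
    """
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Made-up requests-per-minute counts, for illustration only.
requests_per_minute = [120, 118, 125, 122, 119, 121, 480, 123, 117, 120]
print(flag_outliers(requests_per_minute))
```

A z-score rule is sensitive to the very outliers it is looking for, so a real project would likely compare it with a more robust rule (such as one based on the interquartile range) before drawing conclusions.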
