
ccs346 Eda Lab Manual

lab

Uploaded by

Tishbian Meshach
CCS346 - EDA LAB Manual

Data Science and big data analysis (Annamalai University)



Downloaded by Tishbian Meshach (www.stishbian262@gmail.com)

THANTHAI PERIYAR
GOVERNMENT INSTITUTE OF TECHNOLOGY,
VELLORE - 02

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CCS346 - EXPLORATORY DATA ANALYSIS


LAB MANUAL
(R2021)

NAME: ……………………………………………....

REGISTER NUMBER: ………………………….…

YEAR AND SEMESTER: ………………………....

ACADEMIC YEAR: ………………………………..


Vision of the Institution

To provide a high-quality learning environment through innovative teaching and to promote research that produces globally competitive engineers of excellent quality.

Mission of the Institution

• To offer education programmes that blend intensive technical training with appropriate guidance, inculcating analytical skills and problem-solving ability with a high degree of professionalism.

• To provide a healthy environment with excellent facilities for learning, research and innovative thinking.

• To educate students to achieve professional excellence with ethical and social responsibilities.


CONTENTS

Ex. No. | Date | Name of the Experiment | Marks | Signature
1 | | Installing the Data Analysis and Visualization tool: R/Python/Tableau Public/Power BI | |
2 | | Performing exploratory data analysis (EDA) with datasets like email data set. Export all your emails as a dataset, import them inside a pandas data frame, visualize them and get different insights from the data | |
3 | | Working with Numpy arrays, Pandas data frames, Basic plots using Matplotlib | |
4 | | Exploring various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and visualize | |
5 | | Performing Time Series Analysis and apply the various visualization techniques | |
6 | | Performing Data Analysis and representation on a Map using various Map data sets with Mouse Rollover effect, user interaction, etc. | |
7 | | Building cartographic visualization for multiple datasets involving countries of the world, states and districts in India, etc. | |
8 | | Performing EDA on Wine Quality Data set | |
9 | | Case study on a data set and apply the various EDA and visualization techniques and present an analysis report | |


EX. NO: 01
INSTALLATION OF DATA ANALYSIS AND VISUALIZATION TOOLS
DATE:

AIM
To install data analysis and visualization tools like R, Python, Tableau Public and
Power BI.

PROCEDURE

1. R:
• Download R: Visit the official R website (https://cran.r-project.org/) and download the
installer for your operating system (Windows, macOS, or Linux).
• Install R by following the instructions provided in the installer.

2. Python:
• Download Python: Visit the official Python website (https://www.python.org/) and
download the Python installer for your OS (Windows, macOS, or Linux).
• Install Python by running the installer and making sure to check the option to add
Python to your system's PATH during installation.

I. INSTALL NUMPY WITH PIP


NumPy (Numerical Python) is an open-source core Python library for
scientific computing. It is a general-purpose package for processing arrays
and matrices.

pip install numpy

II. INSTALL JUPYTERLAB


Install JupyterLab with pip:

pip install jupyterlab

Once installed, launch JupyterLab with:

jupyter lab


III. JUPYTER NOTEBOOK


Install the classic Jupyter Notebook with:

pip install notebook

To run the notebook:

jupyter notebook

IV. INSTALL SCIPY


SciPy (Scientific Python) is a Python library for scientific and technical
computing, with routines for optimization, integration, interpolation, linear
algebra and statistics. It is built on top of the NumPy library.

pip install scipy

V. INSTALL PANDAS
pandas is a Python package that provides fast, flexible, and expressive
data structures designed to make working with "relational" or "labeled" data
both easy and intuitive.

pip install pandas

VI. INSTALL MATPLOTLIB


Matplotlib is a comprehensive library for creating static, animated,
and interactive visualizations in Python.

pip install matplotlib
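
Once the packages above are installed, a quick check can confirm that Python sees them. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used with pip in the steps above.
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "notebook"]


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:12s} {ver if ver else 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with the matching `pip install` command shown above.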


3. Tableau Public:
• Tableau Public: It can be used through web authoring in the browser, so no installation is
strictly required; a free desktop edition is also available for download. Visit the Tableau
Public website (https://public.tableau.com/s/gallery) and create an account to start using it.
4. Power BI:
• Download Power BI Desktop: Go to the official Power BI website
(https://powerbi.microsoft.com/en-us/desktop/) and download Power BI Desktop.
• Install Power BI Desktop by running the installer.

RESULT
Thus, the data analysis and visualization tools R, Python, Tableau Public, and Power
BI have been installed successfully.


EX. NO: 02
PERFORM EXPLORATORY DATA ANALYSIS WITH EMAIL DATA SET
DATE:

AIM
To perform exploratory data analysis (EDA) with datasets like email data set. Export all
your emails as a dataset, import them inside a pandas data frame, visualize them and get different
insights from the data.

ABOUT THE DATASET


• The CSV file contains 5172 rows, one row per email.
• There are 3002 columns.
• The first column holds the email name. The names are numbers rather than recipients' names, to protect privacy.
• The last column holds the label for prediction: 1 for spam, 0 for not spam.
• The remaining 3000 columns are the 3000 most common words across all the emails, after excluding non-alphabetical characters/words.
• For each row, the count of each word (column) in that email (row) is stored in the respective cell.
• Thus, information regarding all 5172 emails is stored in a compact dataframe rather than as separate text files.
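
To make the layout described above concrete, the sketch below builds a miniature table of the same shape from three made-up emails. It is an illustration only: the sample texts, the `Email No.` column name, and the helper names `tokenize` and `build_word_count_frame` are assumptions, and the real file uses the 3000 most common words rather than 5.

```python
from collections import Counter
import re

import pandas as pd

# Toy stand-ins for exported emails: (text, label) with 1 = spam, 0 = not spam.
emails = [
    ("Win a free prize now, claim your free gift", 1),
    ("Meeting moved to Monday, see agenda attached", 0),
    ("Free offer: claim your prize today", 1),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def build_word_count_frame(emails, vocab_size=5):
    """Build one row per email holding counts of the most common words."""
    tokenized = [tokenize(text) for text, _ in emails]
    totals = Counter(word for words in tokenized for word in words)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    rows = []
    for i, (words, (_, label)) in enumerate(zip(tokenized, emails), start=1):
        counts = Counter(words)
        row = {"Email No.": f"Email {i}"}            # first column: email name
        row.update({word: counts[word] for word in vocab})
        row["Prediction"] = label                     # last column: spam label
        rows.append(row)
    return pd.DataFrame(rows)


df_small = build_word_count_frame(emails)
print(df_small)
```

The first column names the email, the last column holds the label, and every column in between is a word count, which mirrors the structure of the full dataset.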

ALGORITHM
1. Load the Dataset
2. Display Basic Information and First Few Rows
3. Visualize the Distribution of Labels (Spam or Not Spam)
4. Visualize the Correlation Between Features (Word Occurrences)
5. Display Summary Statistics of Word Occurrences
6. Display the Top 10 Most Common Words in Spam/Not Spam Emails


PROGRAM

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset


file_path = '/emails.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset


print("Dataset Information:")
df.info()  # info() prints its report directly and returns None

# Display the first few rows of the dataset


print("\nFirst few rows of the dataset:")
print(df.head())

# Visualize the distribution of labels (spam or not spam)


plt.figure(figsize=(8, 6))
sns.countplot(x='Prediction', data=df)
plt.title('Distribution of Spam and Not Spam Emails')
plt.xlabel('Prediction (1: Spam, 0: Not Spam)')
plt.ylabel('Count')
plt.show()

# Visualize the correlation between features (word occurrences)


plt.figure(figsize=(10, 8))
sns.heatmap(df.iloc[:, 1:-1].corr(), cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Between Word Occurrences')

plt.show()

# Display summary statistics of word occurrences


word_occurrences_summary = df.iloc[:, 1:-1].describe()
print("\nSummary Statistics of Word Occurrences:")
print(word_occurrences_summary)

# Display the most common words in spam emails


spam_word_frequencies = df[df['Prediction'] == 1].iloc[:, 1:-1].sum().sort_values(ascending=False)
print("\nTop 10 Most Common Words in Spam Emails:")
print(spam_word_frequencies.head(10))

# Display the most common words in not spam emails


not_spam_word_frequencies = df[df['Prediction'] == 0].iloc[:, 1:-1].sum().sort_values(ascending=False)
print("\nTop 10 Most Common Words in Not Spam Emails:")
print(not_spam_word_frequencies.head(10))

OUTPUT


RESULT:
Thus, exploratory data analysis (EDA) was performed on an email data set: the emails were
exported as a dataset, imported into a pandas data frame, visualized, and examined for
different insights successfully.


EX. NO: 03
WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS
USING MATPLOTLIB
DATE:

AIM
To implement python programs with Numpy arrays, Pandas data frames, Basic plots using
Matplotlib.

WORKING WITH NUMPY ARRAYS


ALGORITHM
1. Array Creation - Create 1-dimensional (arr1) and 2-dimensional (arr2) NumPy arrays.
2. Basic Operations - Perform basic operations like addition and multiplication on arrays.
3. Accessing Elements and Slicing - Access individual elements and perform slicing operations.
4. Array Shape and Dimensions - Retrieve the shape and number of dimensions of the array.
5. Reshape Array - Reshape a 1D array (arr1) to have 5 rows and 1 column.
6. Displaying Results - Print the original arrays and the results of operations.

PROGRAM
import numpy as np
# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5]) # 1-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]]) # 2-dimensional array
# Basic operations on arrays
arr_sum = arr1 + 10 # Add 10 to each element
arr_product = arr2 * 2 # Multiply each element by 2
# Accessing elements and slicing
element_at_index_2 = arr1[2] # Accessing element at index 2
sliced_arr1 = arr1[1:4] # Slicing from index 1 up to, but not including, index 4
sliced_arr2 = arr2[:, 1] # Accessing the second column of arr2


# Array shape and dimensions


shape_of_arr2 = arr2.shape # Shape of arr2
dimensions_of_arr2 = arr2.ndim # Number of dimensions in arr2
# Reshape the array
reshaped_arr1 = arr1.reshape((5, 1)) # Reshape arr1 to have 5 rows and 1 column
# Displaying results
print("Original 1D array:")
print(arr1)
print("\nOriginal 2D array:")
print(arr2)
print("\nArray operations:")
print("arr1 + 10:", arr_sum)
print("arr2 * 2:", arr_product)
print("\nArray slicing:")
print("Element at index 2:", element_at_index_2)
print("Sliced arr1:", sliced_arr1)
print("Second column of arr2:", sliced_arr2)
print("\nArray shape and dimensions:")
print("Shape of arr2:", shape_of_arr2)
print("Number of dimensions in arr2:", dimensions_of_arr2)
print("\nReshaped array:")
print(reshaped_arr1)

OUTPUT


WORKING WITH PANDAS


ALGORITHM
1. Creating a Pandas DataFrame - Create a simple DataFrame using a dictionary of lists.
2. Displaying the DataFrame - Print the original DataFrame.

3. Basic Operations on DataFrame - Add a new column ('Salary') and increment the 'Age' column
by 1.
4. Filtering Data - Select rows where age is less than 30.
5. Grouping and Aggregation - Calculate the average salary for each unique city in the DataFrame.

6. Displaying Results - Print the updated DataFrame, filtered data, and the result of grouping and
aggregation.


PROGRAM
import pandas as pd
# Creating a Pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}
df = pd.DataFrame(data)
# Displaying the DataFrame
print("Original DataFrame:")
print(df)
# Basic operations on DataFrame
df['Salary'] = [60000, 75000, 50000, 90000, 65000] # Adding a new column
df['Age'] = df['Age'] + 1 # Incrementing the 'Age' column by 1
# Filtering data
young_people = df[df['Age'] < 30] # Selecting rows where age is less than 30
# Grouping and aggregation
city_avg_salary = df.groupby('City')['Salary'].mean()
# Displaying the updated DataFrame and filtered data
print("\nUpdated DataFrame:")
print(df)
print("\nYoung people (Age < 30):")
print(young_people)
print("\nAverage Salary by City:")
print(city_avg_salary)


OUTPUT

BASIC PLOTS USING MATPLOTLIB


ALGORITHM
1. Create Sample Data - Generate x values from 0 to 10 with 100 data points. Calculate y values
for sin(x) and cos(x).
2. Plotting the Data - Create a figure with a specific size. Plot sin(x) and cos(x) using plt.plot.
3. Plot sampled points using plt.scatter.
4. Adding Labels and Title - Add labels to the X and Y axes. Set the plot title.
5. Legend and Grid - Display a legend to label the plotted lines. Add a grid to the plot.
6. Show the Plot


PROGRAM
import matplotlib.pyplot as plt
import numpy as np
# Create sample data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y1 = np.sin(x)
y2 = np.cos(x)
# Plotting the data
plt.figure(figsize=(8, 6))
# Line plot
plt.plot(x, y1, label='sin(x)', color='blue', linestyle='-', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='green', linestyle='--', linewidth=2)
# Scatter plot
plt.scatter(x[::10], y1[::10], marker='o', color='red', label='Sampled Points')
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sin and Cos Functions')
plt.legend() # Show legend
# Add a grid and show the plot
plt.grid(True)
plt.show()

OUTPUT


RESULT
Thus, Python programs using NumPy arrays, Pandas data frames, and basic plots with
Matplotlib have been executed successfully.


EX. NO: 04
EXPLORE VARIOUS VARIABLE AND ROW FILTERS IN R
DATE:

AIM
To explore various variable and row filters in R for cleaning data and to apply various plot
features in R on sample data sets and visualize.

ALGORITHM:
• Loading the necessary libraries (installing them if needed).
• Generating a synthetic sample dataset (ID, Age, Gender, Income, Score).
• Displaying the summary of the sample data.
• Applying variable filters and row filters.
• Creating different types of plots using ggplot2.
• Saving the generated plots as image files.
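
For readers who work mainly in Python, the variable (column) and row filters in this experiment can also be sketched with pandas. This is an illustrative equivalent and not part of the R program; the column names mirror the R sample data, the values are randomly generated, and the `women_over_30` example is an added assumption.

```python
import numpy as np
import pandas as pd

# Randomly generated sample data with the same columns as the R example.
rng = np.random.default_rng(123)
n = 100
sample_data = pd.DataFrame({
    "ID": np.arange(1, n + 1),
    "Age": rng.integers(18, 61, size=n),          # ages 18..60 inclusive
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Income": rng.normal(50000, 10000, size=n),
    "Score": rng.normal(70, 10, size=n),
})

# Variable filter: keep only selected columns.
selected_columns = sample_data[["ID", "Age", "Income"]]

# Row filter: keep rows where Age is greater than 30.
filtered_data = sample_data[sample_data["Age"] > 30]

# Combined filter with query(): women older than 30, two columns only.
women_over_30 = sample_data.query("Age > 30 and Gender == 'Female'")[["ID", "Score"]]

print(selected_columns.shape, filtered_data.shape, women_over_30.shape)
```

Column selection plays the role of R's `sample_data[, c(...)]`, and the boolean mask plays the role of `subset(sample_data, Age > 30)`.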

PROGRAM:
# Install and load necessary libraries
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Sample data generation
set.seed(123)
sample_data <- data.frame(
  ID = 1:100,
  Age = sample(18:60, 100, replace = TRUE),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000),
  Score = rnorm(100, mean = 70, sd = 10)
)

# Display summary of the sample data
print("Summary of Sample Data:")
print(summary(sample_data))

# Variable filter: select specific columns
selected_columns <- sample_data[, c("ID", "Age", "Income")]

# Row filter: keep rows that satisfy a condition (here, Age greater than 30)
filtered_data <- subset(sample_data, Age > 30)

# Data visualization using ggplot2
# Scatter plot of Age vs. Income
scatter_plot <- ggplot(sample_data, aes(x = Age, y = Income)) +
  geom_point() +
  labs(title = "Scatter Plot of Age vs. Income", x = "Age", y = "Income")
print(scatter_plot)

# Histogram of Age
histogram <- ggplot(sample_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Age", x = "Age", y = "Frequency")
print(histogram)

# Boxplot of Score by Gender
boxplot_score <- ggplot(sample_data, aes(x = Gender, y = Score, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Boxplot of Score by Gender", x = "Gender", y = "Score")
print(boxplot_score)

# Save plots as image files; without `plot =`, ggsave() saves only the last plot
ggsave("scatter_plot.png", plot = scatter_plot, height = 5, width = 7, dpi = 300)
ggsave("histogram.png", plot = histogram, height = 5, width = 7, dpi = 300)
ggsave("boxplot.png", plot = boxplot_score, height = 5, width = 7, dpi = 300)

# Display a message indicating the completion of the program
print("Program completed successfully.")


OUTPUT:

RESULT


EX. NO: 05
TIME SERIES ANALYSIS
DATE:

AIM
To perform Time Series Analysis and apply the various visualization techniques using R
language.

DATA DESCRIPTION:
• The training dataset consists of approximately 145k time series.
• Each time series represents the number of daily views of a different Wikipedia article, from July 1st, 2015 to December 31st, 2016. The leaderboard during the training stage is based on traffic from January 1st, 2017 to March 1st, 2017.
• The second stage uses training data up to September 1st, 2017.
• The final ranking of the competition is based on predictions of daily views between September 13th, 2017 and November 13th, 2017 for each article in the dataset.
• Forecasts for these dates are submitted by September 12th.
• For each time series, the name of the article is provided, along with the type of traffic that the series represents (all, mobile, desktop, spider).
• This metadata and any other publicly available data may be used to make predictions.
• The data source does not distinguish between traffic values of zero and missing values: a missing value may mean that the traffic was zero or that the data is not available for that day.
• To reduce the submission file size, each page and date combination has been given a shorter Id. The mapping between page names and the submission Id is given in the key files.

ALGORITHM:
1. Generate Synthetic Time Series Data (Web Traffic Time Series Forecasting)
2. Plot the Time Series
3. Decompose Time Series
4. Plot Decomposed Components

5. Autocorrelation Plot
6. Partial Autocorrelation Plot
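
The decomposition step can be illustrated in Python as well. The sketch below is a simplified classical additive decomposition written with NumPy; it is not R's `decompose` implementation, and the function name `decompose_additive`, the weekly pattern, and the noise level are all assumptions made for the example. It also shows why at least two full periods of data are needed: with fewer, the seasonal effect cannot be averaged.

```python
import numpy as np


def decompose_additive(values, period):
    """Split a series into trend, seasonal and remainder parts (additive model)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    if period < 2 or n < 2 * period:
        raise ValueError("need at least two full periods of data")

    # Trend: centred moving average over one period (NaN at the edges).
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)
    trend[half:n - half] = np.convolve(x, weights, mode="valid")

    # Seasonal: average detrended value at each position in the cycle,
    # shifted so the seasonal effects sum to zero over one period.
    detrended = x - trend
    effects = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    effects -= effects.mean()
    seasonal = np.tile(effects, n // period + 1)[:n]

    remainder = x - trend - seasonal
    return trend, seasonal, remainder


# Synthetic daily series: slow upward trend + weekly pattern + noise.
rng = np.random.default_rng(123)
days = np.arange(365)
weekly = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -1.0, -1.0])
series = 0.01 * days + weekly[days % 7] + rng.normal(0, 0.1, size=365)

trend, seasonal, remainder = decompose_additive(series, period=7)
print(np.round(seasonal[:7], 2))
```

The printed seasonal effects should lie close to the weekly pattern that was built into the series, while calling the function with `period=365` on 365 points raises an error.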

PROGRAM:
# Set seed for reproducibility
set.seed(123)

# Generate synthetic time series data (one value per day for a year)
date_sequence <- seq(as.Date("2022-01-01"), by = "days", length.out = 365)
time_series_data <- data.frame(
  Date = date_sequence,
  Value = cumsum(rnorm(365))
)

# Plot the time series
plot(time_series_data$Date, time_series_data$Value, type = "l",
     xlab = "Date", ylab = "Value", main = "Synthetic Time Series")

# Decompose the time series into trend, seasonal, and remainder components.
# decompose() needs at least two full periods, so frequency = 365 fails with
# only 365 observations; a weekly period (frequency = 7) is assumed here.
decomposition <- decompose(ts(time_series_data$Value, frequency = 7))

# Plot the decomposed components
par(mfrow = c(3, 1))
plot(decomposition$seasonal, main = "Seasonal Component")
plot(decomposition$trend, main = "Trend Component")
plot(decomposition$random, main = "Residual Component")
par(mfrow = c(1, 1))  # restore the single-plot layout

# Autocorrelation plot
acf(time_series_data$Value, main = "Autocorrelation Plot")

# Partial autocorrelation plot
pacf(time_series_data$Value, main = "Partial Autocorrelation Plot")


OUTPUT:

RESULT


EX. NO: 06
PERFORM DATA ANALYSIS AND REPRESENTATION ON A MAP
DATE:

AIM
To perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.

DATASET:
The Crimes in Boston dataset is used to explore questions such as:
• What types of crimes are most common?
• Where are different types of crimes most likely to occur?
• Does the frequency of crimes change over the day, week, or year?

ALGORITHM
1. Import Libraries
i. Import the necessary libraries: pandas, folium, and HeatMap and MousePosition from
folium.plugins.
2. Load Crime Dataset
3. Data Cleaning and Filtering
i. Drop rows with missing values in the 'Lat' or 'Long' columns.
ii. Filter out rows where the latitude or longitude is zero.
4. Create Folium Map
5. Add HeatMap Layer
6. Add Mouse Position - Add a MousePosition plugin to display latitude and longitude on
mouseover.
7. Save and Display
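
The cleaning and filtering steps can be tried on a few rows before loading the full file. The sketch below uses a small made-up table with the same column names as the programs in this experiment (`OFFENSE_CODE_GROUP`, `HOUR`, `Lat`, `Long`); the values are invented for illustration, and no map is drawn so that it runs without folium.

```python
import pandas as pd

# Toy rows standing in for the crime file; the values are invented.
crime_df = pd.DataFrame({
    "OFFENSE_CODE_GROUP": ["Robbery", "Larceny", "Robbery", "Arson", "Larceny"],
    "HOUR": [10, 23, 15, 2, 12],
    "Lat": [42.35, None, 42.33, 0.0, 42.36],
    "Long": [-71.06, -71.08, -71.09, 0.0, -71.05],
})

# Drop rows with missing coordinates, then rows with zero placeholders.
clean = crime_df.dropna(subset=["Lat", "Long"])
clean = clean[(clean["Lat"] != 0) & (clean["Long"] != 0)]

# A plain list of [lat, long] pairs, the form a heat map layer takes as input.
heat_data = clean[["Lat", "Long"]].values.tolist()

# Number of crimes per hour of day.
crimes_per_hour = clean.groupby("HOUR").size()

print(heat_data)
print(crimes_per_hour)
```

Building `heat_data` from the two columns directly avoids looping over rows one at a time, which matters when the real dataset has hundreds of thousands of records.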


PROGRAM:
A)

import pandas as pd
import folium
from folium import Marker

# Base map centred on Boston
data = folium.Map(location=[42.32, -71.0589], tiles='openstreetmap', zoom_start=10)
data

# Load the crime data and drop rows without a location
crimes = pd.read_csv("/content/sample_data/crime.csv", encoding='latin-1')
crimes.dropna(subset=['Lat', 'Long', 'Location'], inplace=True)

# Keep selected offence groups from 2018 onwards
crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin([
    'Larceny', 'Auto Theft', 'Robbery', 'Larceny From Motor Vehicle',
    'Residential Burglary', 'Simple Assault', 'Harassment', 'Ballistics',
    'Aggravated Assault', 'Other Burglary', 'Arson', 'Commercial Burglary',
    'HOME INVASION', 'Homicide', 'Criminal Harassment', 'Manslaughter'])]
crimes = crimes[crimes.YEAR >= 2018]
crimes

# Robberies reported between 09:00 and 17:59
daytime_robberies = crimes[(crimes.OFFENSE_CODE_GROUP == 'Robbery') &
                           (crimes.HOUR.isin(range(9, 18)))]

# Add one marker per daytime robbery to a new map
data = folium.Map(location=[42.32, -71.0589], tiles='cartodbpositron', zoom_start=13)
for idx, row in daytime_robberies.iterrows():
    Marker([row['Lat'], row['Long']]).add_to(data)
data

B)

import pandas as pd
import folium
from folium.plugins import HeatMap, MousePosition
from datetime import datetime

# Load the crime dataset (replace the path below with the actual file path)
file_path = '/content/sample_data/crime.csv'
crime_df = pd.read_csv(file_path,encoding='latin-1')

# Data cleaning and filtering


crime_df = crime_df.dropna(subset=['Lat', 'Long'])
crime_df = crime_df[(crime_df['Lat'] != 0) & (crime_df['Long'] != 0)]

# Create a Folium map centered around the average coordinates of crimes
map_center = [crime_df['Lat'].mean(), crime_df['Long'].mean()]
crime_map = folium.Map(location=map_center, zoom_start=12)

# Add HeatMap layer to visualize crime density


heat_data = [[row['Lat'], row['Long']] for index, row in
crime_df.iterrows()]
HeatMap(heat_data, radius=15).add_to(crime_map)

# Add MousePosition for latitude and longitude display on mouseover

MousePosition().add_to(crime_map)

# Save the map to an HTML file


crime_map.save('crime_map.html')

# Display the map


crime_map

OUTPUT:


RESULT


EX. NO: 07
CARTOGRAPHIC VISUALIZATION FOR MULTIPLE DATASETS
DATE:

AIM
To build cartographic visualizations for multiple datasets involving countries of the
world, and states and districts in India.

ALGORITHM:
1. Import Necessary Libraries - Import the required libraries, including pandas, geopandas, and
folium.

2. Load World Countries Shapefile - Use gpd.read_file to load the world countries shapefile. This
shapefile is obtained from the Natural Earth dataset.

3. Load India States Shapefile - Use gpd.read_file to load the India states shapefile. Ensure that all
necessary files (.shp, .shx, .dbf) are present in the specified path.
4. Create a Folium Map - Use folium.Map to create a Folium map centered around the world.

5. Add World Countries to the Map - Use folium.Choropleth to add world countries to the map.
This creates a choropleth layer based on population estimates.

6. Add India States to the Map - Use folium.GeoJson to add India states to the map. This layer
displays the boundaries of Indian states.
7. Save the Map to an HTML File - Use the save method of the Folium map object to save the map
as an HTML file.
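
A choropleth layer works by joining a table of values to map shapes through a shared key, which is what a setting such as `key_on='feature.properties.name'` expresses. The sketch below illustrates that join in plain Python. It is a conceptual illustration and not folium's internal code; the helper names `resolve_key` and `join_values`, the feature list, and the numbers are all made up.

```python
# A tiny GeoJSON-like FeatureCollection (geometry omitted for brevity).
geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "India"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Brazil"}, "geometry": None},
        {"type": "Feature", "properties": {"name": "Atlantis"}, "geometry": None},
    ],
}

# Data rows keyed by the same names; the numbers are invented for illustration.
values_by_name = {"India": 10.0, "Brazil": 4.0}


def resolve_key(feature, key_on):
    """Follow a dotted path such as 'feature.properties.name' into a feature."""
    node = feature
    for part in key_on.split(".")[1:]:  # skip the leading 'feature'
        node = node[part]
    return node


def join_values(geojson, values, key_on):
    """Return {key: value or None} for every feature in the collection."""
    joined = {}
    for feature in geojson["features"]:
        key = resolve_key(feature, key_on)
        joined[key] = values.get(key)   # None when the data has no matching row
    return joined


joined = join_values(geojson, values_by_name, "feature.properties.name")
print(joined)
```

Features without a matching data row come back as `None`; on a real map these are the shapes that cannot be coloured by value, so mismatched names between the shapefile and the data are worth checking first.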

PROGRAM:

import geopandas as gpd
import folium

# Load world countries shapefile (Natural Earth).
# Note: gpd.datasets was removed in GeoPandas 1.0; on newer versions, download the
# Natural Earth countries shapefile and pass its local path to gpd.read_file instead.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Load India states shapefile
india_states = gpd.read_file('/content/sample_data/Indian_States.shp')

# Create a Folium map centered around the world
world_map = folium.Map(location=[0, 0], zoom_start=2)

# Add world countries to the map, shaded by population estimate
folium.Choropleth(
    geo_data=world,
    name='choropleth',
    data=world[['name', 'pop_est']],
    columns=['name', 'pop_est'],
    key_on='feature.properties.name',
    fill_color='YlGnBu',
    fill_opacity=0.7,
    line_opacity=0.2,
).add_to(world_map)

# Add India states to the map
folium.GeoJson(india_states).add_to(world_map)

# Save the map to an HTML file
world_map.save('world_map.html')

# Display the map
world_map


OUTPUT:

RESULT:


EX. NO: 08
EDA ON WINE QUALITY DATA SET
DATE:

AIM
To perform EDA on the Wine Quality data set.

DESCRIPTION
• This dataset is related to the red variants of the Portuguese "Vinho Verde" wine.
• The dataset describes the amounts of various chemicals present in the wine and their effect on its quality.
• The dataset can be treated as a classification or a regression task.
• The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
• The task is to predict the quality of the wine from the given data.
• It is a simple yet challenging project: the complexity arises because the dataset has few samples and is highly imbalanced.
• The data frame contains the following columns:

Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)


ALGORITHM
1. Load the Data - The Wine Quality Data Set for red variants is loaded from the provided URL.

2. Display Basic Information - Display the first few rows of the dataset, summary statistics, and
check for missing values.
3. Distribution of Wine Quality - Visualize the distribution of wine quality using a count plot.

4. Correlation Matrix - Calculate the correlation matrix of features and visualize it using a
heatmap.
5. Pair Plot - Create a pair plot for selected features to observe relationships and distributions.

6. Box Plots - Use box plots to visualize the distribution of selected features across different
wine qualities.
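
Two of the checks in the steps above, class balance and correlation with the target, can be tried on synthetic data before running the full program. The sketch below generates stand-in values; they are NOT taken from the real wine dataset, and the relationship between the columns is an assumption built into the generator purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; these are NOT values from the real wine dataset.
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.5, 1.0, size=n)
volatile_acidity = rng.normal(0.5, 0.15, size=n)
score = (5.5 + 0.6 * (alcohol - 10.5) - 2.0 * (volatile_acidity - 0.5)
         + rng.normal(0, 0.5, size=n))
toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": np.clip(np.round(score), 3, 8).astype(int),
})

# Class balance: share of samples at each quality level.
class_share = toy["quality"].value_counts(normalize=True).sort_index()

# Correlation of every feature with the target, strongest first.
corr_with_quality = (
    toy.corr()["quality"].drop("quality").sort_values(key=abs, ascending=False)
)

print(class_share)
print(corr_with_quality)
```

On the real data, the same two lines applied to `wine_data_red` give a numeric summary to read alongside the count plot and the heatmap.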

PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Wine Quality Data Set for red variants


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data_red = pd.read_csv(url, sep=';')

# Display the first few rows of the dataset


print("First few rows of the Wine Quality Data Set for red variants:")
print(wine_data_red.head())

# Summary statistics
print("\nSummary Statistics:")
print(wine_data_red.describe())

# Check for missing values


print("\nMissing Values:")
print(wine_data_red.isnull().sum())

# Distribution of wine quality


plt.figure(figsize=(8, 6))
sns.countplot(x='quality', data=wine_data_red, palette='viridis')
plt.title('Distribution of Wine Quality for Red Variants')
plt.xlabel('Quality')
plt.ylabel('Count')
plt.show()

# Correlation matrix
correlation_matrix_red = wine_data_red.corr()

# Heatmap of the correlation matrix


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_red, annot=True, cmap='coolwarm',
fmt=".2f", linewidths=.5)
plt.title('Correlation Matrix for Red Variants')
plt.show()

# Pair plot for selected features


selected_features_red = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                         'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                         'pH', 'sulphates', 'alcohol', 'quality']
sns.pairplot(wine_data_red[selected_features_red], diag_kind='kde')
plt.suptitle("Pair Plot of Selected Features for Red Variants", y=1.02)

plt.show()

# Box plot for selected features


plt.figure(figsize=(12, 8))
for i, feature in enumerate(selected_features_red[:-1]):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(x='quality', y=feature, data=wine_data_red, palette='viridis')
    plt.title(f'Box Plot of {feature} by Wine Quality')
    plt.xlabel('Quality')
    plt.ylabel(feature)
plt.tight_layout()
plt.show()

OUTPUT


RESULT


EX. NO: 09
CASE STUDY
DATE:

AIM
To use a case study on a data set and apply the various EDA and visualization techniques
and present an analysis report.

LIST OF CASE STUDIES


1. Exploring Airbnb Data in a Specific City
Analyzing price distribution, factors affecting pricing, and popular neighborhoods.
2. Analyzing COVID-19 Trends
Visualizing daily cases, trends over time, and regional variations.
3. Customer Segmentation for an E-commerce Platform
Using purchase history and demographics for segmentation.
4. Predicting Housing Prices
EDA on housing data, identifying features influencing prices, and building a predictive
model.
5. Social Media Sentiment Analysis
Analyzing sentiment trends on platforms like Twitter or Reddit.
6. Exploring Movie Ratings Data
Analyzing movie ratings, genres, and factors affecting popularity.
7. Stock Market Analysis
Visualizing stock prices, trends, and correlation between different stocks.
8. Online Retail Analysis
Understanding customer behavior, popular products, and seasonality.
9. Analyze and Visualize Traffic Patterns
Using GPS or traffic data to understand peak hours and congestion.
10. Food Delivery Analysis
Exploring delivery times, customer satisfaction, and popular cuisines.


11. Credit Card Fraud Detection


EDA on a credit card transactions dataset and building a fraud detection model.
12. Visualizing Global CO2 Emissions
Analyzing emission trends, contributors, and regional variations.
13. Analysis of Mobile App Reviews
Sentiment analysis and feature popularity in app reviews.
14. Exploring Social Network Data
Analyzing user connections, engagement, and influential users.
15. Detecting Anomalies in Network Traffic
Identifying unusual patterns in network data.
16. E-commerce Sales Forecasting
Predicting future sales based on historical data.
17. Analysis of Road Accidents Data
Understanding factors leading to accidents and identifying high-risk areas.
18. Explore Health and Fitness Data
Analyzing fitness tracker data, user habits, and correlations.
19. Survey Data Analysis
Analyzing survey responses and identifying trends or patterns.
20. Predictive Maintenance in Manufacturing
Analyzing equipment failure patterns to predict maintenance needs.
21. Exploring Educational Data
Analyzing student performance, factors affecting grades, and trends.
22. Visualizing Weather Patterns
Analyzing temperature, precipitation, and other weather data.
23. Analysis of Online User Behavior
Understanding user interactions on a website or app.
24. Predictive Analysis for Employee Attrition
Identifying factors leading to employee attrition and predicting future attrition.
25. Analysis of Music Streaming Data
Identifying popular genres, artists, and user preferences.


26. Effect of Ad Campaigns on Sales


Analyzing the impact of advertising on product sales.
27. Explore Traffic Camera Data
Analyzing traffic camera feeds for congestion patterns.
28. Visualizing Population Density
Understanding population distribution and density.
29. Analysis of Social Services Data
Analyzing the utilization of social services in a community.
30. Explore Google Analytics Data
Analyzing website traffic, user interactions, and popular content.
31. Analyze Restaurant Reviews
Sentiment analysis and identifying factors influencing restaurant ratings.
32. Water Quality Analysis
Analyzing water quality parameters over time.
33. Analysis of Online Gaming Data
Understanding player behavior, popular games, and engagement.
34. Explore Flight Delay Data
Analyzing factors contributing to flight delays.
35. Analysis of Kickstarter Campaigns
Understanding successful vs. unsuccessful crowdfunding campaigns.
36. Visualizing Earthquake Data
Analyzing earthquake patterns, magnitudes, and affected regions.
37. Explore Google Trends Data
Analyzing search trends over time.
38. Analyze Call Center Data
Understanding call volumes, response times, and customer satisfaction.
39. Explore Museum Visitor Data
Analyzing visitor demographics and exhibit popularity.
40. Analysis of Cybersecurity Incidents
Identifying patterns in cybersecurity incident data.


41. Explore Government Spending Data


Analyzing government expenditure in different sectors.
42. Visualizing Social Media Influence
Identifying influencers and their impact on social media.
43. Analysis of Online Course Engagement
Understanding factors affecting student engagement in online courses.
44. Explore Retail Store Sales Data
Analyzing sales patterns, promotions, and customer preferences.
45. Analyze Road Infrastructure Data
Identifying traffic bottlenecks and road conditions.
46. Visualizing Wildlife Migration Patterns
Analyzing tracking data for migratory animals.
47. Analysis of Hotel Reviews
Sentiment analysis and identifying factors influencing hotel ratings.
48. Explore Solar Energy Production Data
Analyzing solar energy production patterns.
49. Analysis of Social Housing Data
Understanding housing affordability and distribution.
50. Visualizing Satellite Imagery Data
Analyzing satellite imagery for land use patterns.
