CCS346 EDA Lab Manual
THANTHAI PERIYAR
GOVERNMENT INSTITUTE OF TECHNOLOGY,
VELLORE - 02
NAME: ……………………………………………....
• To provide a healthy environment with excellent facilities for learning, research and
innovative thinking.
• To educate students to achieve professional excellence with ethical and
social responsibilities.
CONTENTS
Ex. No.   Date   Name of the Experiment   Page No.   Marks   Signature
EX. NO: 01
INSTALLATION OF DATA ANALYSIS AND VISUALIZATION TOOLS
DATE:
AIM
To install data analysis and visualization tools like R, Python, Tableau Public and
Power BI.
PROCEDURE
1. R:
• Download R: Visit the official R website (https://cran.r-project.org/) and download the
installer for your operating system (Windows, macOS, or Linux).
• Install R by following the instructions provided in the installer.
2. Python:
• Download Python: Visit the official Python website (https://www.python.org/) and
download the Python installer for your OS (Windows, macOS, or Linux).
• Install Python by running the installer and making sure to check the option to add
Python to your system's PATH during installation.
jupyter-lab
Downloaded by Tishbian Meshach (www.stishbian262@gmail.com)
jupyter notebook
V. INSTALL PANDAS
pandas is a Python package that provides fast, flexible, and expressive
data structures designed to make working with "relational" or "labeled" data
both easy and intuitive.
3. Tableau Public:
• Tableau Public: It can be used from a web browser, so installation is optional. Visit
the Tableau Public website (https://public.tableau.com/s/gallery) and create an account
to start using it.
4. Power BI:
• Download Power BI Desktop: Go to the official Power BI website
(https://powerbi.microsoft.com/en-us/desktop/) and download Power BI Desktop.
• Install Power BI Desktop by running the installer.
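Once Python is installed, a quick check can confirm that the libraries used in the later experiments are available. The following is a minimal sketch, not part of the original procedure: the package list is an assumption drawn from the imports that appear in this manual, and the helper name missing_packages is hypothetical.

```python
import importlib.util
import sys

# Packages imported by the Python experiments in this manual (assumed list).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "folium", "geopandas"]


def missing_packages(names):
    """Return the names that cannot be found by this Python interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any package reported as not installed can usually be added with pip, for example "pip install pandas".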
RESULT
Thus the data analysis and visualization tools R, Python, Tableau Public, and Power
BI have been installed successfully.
EX. NO: 02
PERFORM EXPLORATORY DATA ANALYSIS WITH EMAIL DATA SET
DATE:
AIM
To perform exploratory data analysis (EDA) on an email dataset: export all your emails
as a dataset, import them into a pandas DataFrame, visualize them, and draw different
insights from the data.
The remaining 3000 columns are the 3000 most common words in all the emails, after
excluding the non-alphabetical characters/words.
For each row, the count of each word (column) in that email (row) is stored in the respective
cell.
Thus, information regarding all 5172 emails is stored in a compact DataFrame rather than as
separate text files.
ALGORITHM
1. Load the Dataset
2. Display Basic Information and First Few Rows
3. Visualize the Distribution of Labels (Spam or Not Spam)
4. Visualize the Correlation Between Features (Word Occurrences)
5. Display Summary Statistics of Word Occurrences
6. Display the Top 10 Most Common Words in Spam/Not Spam Emails
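The numeric steps of this algorithm can be sketched with pandas alone. This is an illustrative sketch under stated assumptions, not the manual's program: the tiny frame below stands in for the real dataset, and the column names (free, meeting, offer, report, label), the 0/1 label coding, and every value are made up.

```python
import pandas as pd

# Tiny stand-in for the email dataset: one row per email, one column per
# word count, plus a label. All names and values here are hypothetical.
emails = pd.DataFrame(
    {
        "free":    [3, 0, 2, 0, 0],
        "meeting": [0, 2, 0, 1, 3],
        "offer":   [1, 0, 2, 0, 0],
        "report":  [0, 1, 0, 2, 1],
        "label":   [1, 0, 1, 0, 0],  # assumed coding: 1 = spam, 0 = not spam
    }
)
word_columns = [c for c in emails.columns if c != "label"]

# Step 2: basic information and first few rows.
print(emails.shape)
print(emails.head())

# Step 3: distribution of labels.
label_counts = emails["label"].value_counts().sort_index()

# Step 4: correlation between word-count columns.
word_corr = emails[word_columns].corr()

# Step 5: summary statistics of word occurrences.
summary = emails[word_columns].describe()

# Step 6: total occurrences of each word per class, most frequent first.
totals_by_label = emails.groupby("label")[word_columns].sum()
top_spam = totals_by_label.loc[1].sort_values(ascending=False).head(2)
top_not_spam = totals_by_label.loc[0].sort_values(ascending=False).head(2)

print(label_counts)
print(top_spam)
print(top_not_spam)
```

On the full dataset, head(10) in place of head(2) would give the ten most common words, and label_counts or word_corr can be passed to a bar plot or heatmap for the visual steps.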
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.show()
OUTPUT
RESULT:
Thus exploratory data analysis (EDA) was performed on an email dataset: the emails were
exported as a dataset, imported into a pandas DataFrame, visualized, and analyzed to get
different insights from the data.
EX. NO: 03
WORKING WITH NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS
USING MATPLOTLIB
DATE:
AIM
To implement Python programs with NumPy arrays, Pandas data frames, and basic plots using
Matplotlib.
PROGRAM
import numpy as np
# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5]) # 1-dimensional array
arr2 = np.array([[1, 2, 3], [4, 5, 6]]) # 2-dimensional array
# Basic operations on arrays
arr_sum = arr1 + 10 # Add 10 to each element
arr_product = arr2 * 2 # Multiply each element by 2
# Accessing elements and slicing
element_at_index_2 = arr1[2] # Accessing element at index 2
sliced_arr1 = arr1[1:4] # Slicing indices 1 to 3 (index 4 is excluded)
sliced_arr2 = arr2[:, 1] # Accessing the second column of arr2
# Displaying the results
print("arr1 + 10:", arr_sum)
print("arr2 * 2:\n", arr_product)
print("Element at index 2:", element_at_index_2)
print("arr1[1:4]:", sliced_arr1)
print("Second column of arr2:", sliced_arr2)
OUTPUT
3. Basic Operations on DataFrame - Add a new column ('Salary') and increment the 'Age' column
by 1.
4. Filtering Data - Select rows where age is less than 30.
5. Grouping and Aggregation - Calculate the average salary for each unique city in the DataFrame.
6. Displaying Results - Print the updated DataFrame, filtered data, and the result of grouping and
aggregation.
PROGRAM
import pandas as pd
# Creating a Pandas DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 22, 35, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston']
}
df = pd.DataFrame(data)
# Displaying the DataFrame
print("Original DataFrame:")
print(df)
# Basic operations on DataFrame
df['Salary'] = [60000, 75000, 50000, 90000, 65000] # Adding a new column
df['Age'] = df['Age'] + 1 # Incrementing the 'Age' column by 1
# Filtering data
young_people = df[df['Age'] < 30] # Selecting rows where age is less than 30
# Grouping and aggregation
city_avg_salary = df.groupby('City')['Salary'].mean()
# Displaying the updated DataFrame and filtered data
print("\nUpdated DataFrame:")
print(df)
print("\nYoung people (Age < 30):")
print(young_people)
print("\nAverage Salary by City:")
print(city_avg_salary)
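In the program above every city appears only once, so the grouped average simply repeats each person's salary. A small variation with repeated cities makes the aggregation step visible. This is a sketch with hypothetical records; the names, cities and salaries are invented for illustration.

```python
import pandas as pd

# Hypothetical records in which cities repeat, so each group has several rows.
staff = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Chen", "Dev", "Eva", "Farid"],
        "City": ["Boston", "Chicago", "Boston", "Chicago", "Boston", "Chicago"],
        "Salary": [60000, 70000, 80000, 90000, 70000, 80000],
    }
)

# Average salary per city: one value for each distinct city.
avg_salary = staff.groupby("City")["Salary"].mean()

# Several aggregates at once: group size, mean and maximum.
salary_summary = staff.groupby("City")["Salary"].agg(["count", "mean", "max"])

print(avg_salary)
print(salary_summary)
```

Here the six rows collapse to two groups, which is the behaviour the grouping step is meant to demonstrate.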
OUTPUT
PROGRAM
import matplotlib.pyplot as plt
import numpy as np
# Create sample data
x = np.linspace(0, 10, 100) # 100 points between 0 and 10
y1 = np.sin(x)
y2 = np.cos(x)
# Plotting the data
plt.figure(figsize=(8, 6))
# Line plot
plt.plot(x, y1, label='sin(x)', color='blue', linestyle='-', linewidth=2)
plt.plot(x, y2, label='cos(x)', color='green', linestyle='--', linewidth=2)
# Scatter plot
plt.scatter(x[::10], y1[::10], marker='o', color='red', label='Sampled Points')
# Adding labels and title
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Sin and Cos Functions')
plt.legend() # Show legend
# Show the plot
plt.grid(True)
plt.show()
OUTPUT
RESULT
Thus Python programs using NumPy arrays, Pandas data frames, and basic plots using
Matplotlib have been executed successfully.
EX. NO: 04
EXPLORE VARIOUS VARIABLE AND ROW FILTERS IN R
DATE:
AIM
To explore various variable and row filters in R for cleaning data and to apply various plot
features in R on sample data sets and visualize.
ALGORITHM:
Loading necessary libraries (installing if needed).
Generating a sample dataset (Smoking, Alcohol and (O)esophageal Cancer).
Displaying the summary of the sample data.
Applying variable filters and row filters.
Creating different types of plots using ggplot2.
Saving the generated plots as image files.
PROGRAM:
# Install and load necessary libraries
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
# Sample data generation
set.seed(123)
sample_data <- data.frame(
  ID = 1:100,
  Age = sample(18:60, 100, replace = TRUE),
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000),
OUTPUT:
RESULT
EX. NO: 05
TIME SERIES ANALYSIS
DATE:
AIM
To perform Time Series Analysis and apply the various visualization techniques using R
language.
DATA DESCRIPTION:
The training dataset consists of approximately 145k time series.
Each of these time series represents the number of daily views of a different Wikipedia article,
starting from July 1st, 2015 up until December 31st, 2016. The leaderboard during the
training stage is based on traffic from January 1st, 2017 up until March 1st, 2017.
The second stage will use training data up until September 1st, 2017.
The final ranking of the competition will be based on predictions of daily views between
September 13th, 2017 and November 13th, 2017 for each article in the dataset.
You will submit your forecasts for these dates by September 12th.
For each time series, you are provided the name of the article as well as the type of traffic that
this time series represents (all, mobile, desktop, spider).
You may use this metadata and any other publicly available data to make predictions.
Unfortunately, the data source for this dataset does not distinguish between traffic values of
zero and missing values.
A missing value may mean the traffic was zero or that the data is not available for that day.
To reduce the submission file size, each page and date combination has been given a shorter
Id. The mapping between page names and the submission Id is given in the key files.
ALGORITHM:
1. Generate Synthetic Time Series Data (Web Traffic Time Series Forecasting)
2. Plot the Time Series
3. Decompose Time Series
4. Plot Decomposed Components
5. Autocorrelation Plot
6. Partial Autocorrelation Plot
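To show what the autocorrelation plot in step 5 measures, the sample autocorrelation can be computed by hand. The sketch below is written in Python with NumPy rather than R, purely as an illustration alongside the Python experiments in this manual; it is not the program for this exercise. The function name sample_acf is hypothetical, and NumPy's seed 123 does not reproduce the values that R's set.seed(123) would give.

```python
import numpy as np


def sample_acf(values, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag.

    r_k = sum_t (x_t - mean)(x_{t+k} - mean) / sum_t (x_t - mean)^2
    """
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    n = len(x)
    return np.array(
        [np.sum(centered[: n - k] * centered[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(123)
noise = rng.normal(size=365)             # independent daily shocks
walk = np.cumsum(rng.normal(size=365))   # random walk built from cumulative sums

acf_noise = sample_acf(noise, 10)
acf_walk = sample_acf(walk, 10)

print("white noise, lags 0-3:", np.round(acf_noise[:4], 3))
print("random walk, lags 0-3:", np.round(acf_walk[:4], 3))
```

A cumulative-sum series such as the one generated in this exercise typically shows autocorrelations that decay slowly from 1, while independent noise stays close to 0 at every non-zero lag.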
PROGRAM:
# Set seed for reproducibility
set.seed(123)
# Generate a synthetic time series data
date_sequence <- seq(as.Date("2022-01-01"), by = "days", length.out = 365)
time_series_data <- data.frame(
Date = date_sequence,
Value = cumsum(rnorm(365))
)
# Plot the time series
OUTPUT:
RESULT
EX. NO: 06
PERFORM DATA ANALYSIS AND REPRESENTATION ON A MAP
DATE:
AIM
To perform Data Analysis and representation on a Map using various Map data sets with
Mouse Rollover effect, user interaction, etc.
DATASET:
Crimes in Boston. Questions to explore:
What types of crimes are most common?
Where are different types of crimes most likely to occur?
Does the frequency of crimes change over the day? Week? Year?
ALGORITHM
1. Import Libraries
i. Import the necessary libraries: pandas, folium, and HeatMap and MousePosition from
folium.plugins.
2. Load Crime Dataset
3. Data Cleaning and Filtering
i. Drop rows with missing values in the 'Latitude' or 'Longitude' columns.
ii. Filter out rows where latitude and longitude are both zero.
4. Create Folium Map
5. Add HeatMap Layer
6. Add Mouse Position - Add a MousePosition plugin to display latitude and longitude on
mouseover.
7. Save and Display:
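The cleaning and filtering in step 3 can be sketched with pandas before any map is drawn. This is a minimal sketch under stated assumptions: the four rows are invented, and the column names Lat and Long follow Program A rather than the 'Latitude' and 'Longitude' wording of the algorithm.

```python
import pandas as pd

# Hypothetical stand-in for the crime records; values are made up.
crimes = pd.DataFrame(
    {
        "OFFENSE_CODE_GROUP": ["Larceny", "Robbery", "Arson", "Larceny"],
        "Lat":  [42.35, None, 0.0, 42.30],
        "Long": [-71.06, -71.10, 0.0, -71.08],
    }
)

# Step 3(i): drop rows with a missing coordinate.
cleaned = crimes.dropna(subset=["Lat", "Long"])

# Step 3(ii): drop rows where latitude and longitude are both zero.
both_zero = (cleaned["Lat"] == 0) & (cleaned["Long"] == 0)
cleaned = cleaned[~both_zero]

# A list of [lat, long] pairs is the form a heat-map layer expects.
heat_points = cleaned[["Lat", "Long"]].values.tolist()

print(len(crimes), "rows before cleaning,", len(cleaned), "after")
print(heat_points)
```

The resulting list can then be passed to HeatMap from folium.plugins and added to the map object.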
PROGRAM:
A)
import pandas as pd
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
data = folium.Map(location=[42.32,-71.0589], tiles='openstreetmap',
zoom_start=10)
data
crimes = pd.read_csv("/content/sample_data/crime.csv",
encoding='latin-1')
crimes.dropna(subset=['Lat', 'Long', 'Location'], inplace=True)
crimes = crimes[crimes.OFFENSE_CODE_GROUP.isin([
'Larceny', 'Auto Theft', 'Robbery', 'Larceny From Motor Vehicle',
'Residential Burglary',
'Simple Assault', 'Harassment', 'Ballistics', 'Aggravated Assault',
'Other Burglary',
'Arson', 'Commercial Burglary', 'HOME INVASION', 'Homicide',
'Criminal Harassment',
'Manslaughter'])]
crimes = crimes[crimes.YEAR>=2018]
crimes
Marker([row['Lat'], row['Long']]).add_to(m_2)
data
B)
import pandas as pd
import folium
from folium.plugins import HeatMap, MousePosition
from datetime import datetime
MousePosition().add_to(crime_map)
OUTPUT:
RESULT
EX. NO: 07
CARTOGRAPHIC VISUALIZATION FOR MULTIPLE DATASETS
DATE:
AIM
To build cartographic visualizations for multiple datasets involving various countries of the
world, states and districts in India, etc.
ALGORITHM:
1. Import Necessary Libraries - Import the required libraries, including pandas, geopandas, and
folium.
2. Load World Countries Shapefile - Use gpd.read_file to load the world countries shapefile. This
shapefile is obtained from the Natural Earth dataset.
3. Load India States Shapefile - Use gpd.read_file to load the India states shapefile. Ensure that all
necessary files (.shp, .shx, .dbf) are present in the specified path.
4. Create a Folium Map - Use folium.Map to create a Folium map centered around the world.
5. Add World Countries to the Map - Use folium.Choropleth to add world countries to the map.
This creates a choropleth layer based on population estimates.
6. Add India States to the Map - Use folium.GeoJson to add India states to the map. This layer
displays the boundaries of Indian states.
7. Save the Map to an HTML File - Use the save method of the Folium map object to save the map
as an HTML file.
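A choropleth layer, as in step 5, depends on matching each region's shape to a data value through a shared key (folium.Choropleth takes this as its key_on argument). The sketch below shows only that matching step in plain Python, so it runs without geopandas or folium. The two square "regions", their names and their values are placeholders, not real boundaries or population figures, and the helper attach_values is hypothetical.

```python
# Placeholder GeoJSON: two unit squares standing in for real region shapes.
regions = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "Region A"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
        },
        {
            "type": "Feature",
            "properties": {"name": "Region B"},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]],
            },
        },
    ],
}

# Placeholder values keyed by region name; "Region C" has no shape.
values = {"Region A": 10, "Region C": 25}


def attach_values(feature_collection, values_by_name, key="name", target="value"):
    """Copy each matching value into its feature; return names left unmatched."""
    unmatched = []
    for feature in feature_collection["features"]:
        name = feature["properties"].get(key)
        if name in values_by_name:
            feature["properties"][target] = values_by_name[name]
        else:
            unmatched.append(name)
    return unmatched


unmatched = attach_values(regions, values)
print("Shapes without a value:", unmatched)
```

Listing the unmatched names is a useful check, because spelling differences between a shapefile and a data table are a common reason for regions appearing blank on the map.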
PROGRAM:
india_states = gpd.read_file('/content/sample_data/Indian_States.shp')
OUTPUT:
RESULT:
EX. NO: 08
EDA ON WINE QUALITY DATA SET
DATE:
AIM
To perform EDA on the Wine Quality Data Set.
DESCRIPTION
This dataset is related to the red variants of the Portuguese "Vinho Verde" wine.
The dataset describes the amount of various chemicals present in wine and their effect on its
quality.
The dataset can be viewed as a classification or a regression task.
The classes are ordered and not balanced (e.g. there are many more normal wines than
excellent or poor ones).
Our task is to predict the quality of wine using the given data.
It is a simple yet challenging project: to anticipate the quality of wine.
The complexity arises because the dataset has few samples and is highly
imbalanced.
ALGORITHM
1. Load the Data - The Wine Quality Data Set for red variants is loaded from the provided URL.
2. Display Basic Information - Display the first few rows of the dataset, summary statistics, and
check for missing values.
3. Distribution of Wine Quality - Visualize the distribution of wine quality using a count plot.
4. Correlation Matrix - Calculate the correlation matrix of features and visualize it using a
heatmap.
5. Pair Plot - Create a pair plot for selected features to observe relationships and distributions.
6. Box Plots - Use box plots to visualize the distribution of selected features across different
wine qualities.
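The non-graphical parts of these steps can be sketched with pandas on a small stand-in frame. This is an illustration under stated assumptions: the six rows below are invented for the example and are not measurements from the Wine Quality Data Set; only the column names alcohol, volatile acidity and quality are taken from that dataset.

```python
import pandas as pd

# Invented rows for illustration only; not real wine measurements.
wine = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
        "volatile acidity": [0.70, 0.66, 0.58, 0.40, 0.36, 0.30],
        "quality": [5, 5, 6, 6, 7, 7],
    }
)

# Step 2: first rows, summary statistics, missing values.
print(wine.head())
print(wine.describe())
missing = wine.isnull().sum()

# Step 3: distribution of wine quality (what a count plot displays).
quality_counts = wine["quality"].value_counts().sort_index()

# Step 4: correlation matrix (what the heatmap displays).
correlation = wine.corr()

# Step 6: per-quality medians (the centre line of each box in a box plot).
median_alcohol = wine.groupby("quality")["alcohol"].median()

print(missing)
print(quality_counts)
print(correlation.round(2))
print(median_alcohol)
```

The same calls apply unchanged to the full dataset once it has been loaded into a DataFrame.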
PROGRAM
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Summary statistics
print("\nSummary Statistics:")
print(wine_data_red.describe())
# Correlation matrix
correlation_matrix_red = wine_data_red.corr()
plt.show()
OUTPUT
RESULT
EX. NO: 09
CASE STUDY
DATE:
AIM
To use a case study on a data set and apply the various EDA and visualization techniques
and present an analysis report.