DSBDA REPORT
ON
“Weather Data Analysis”
SUBMITTED TO THE SAVITRIBAI PHULE PUNE
UNIVERSITY IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS
FOR THE AWARD OF THE
BACHELOR OF IT ENGINEERING BY
Mr. Hitesh Gopal Patil (11902708552)
Mr. Prathamesh Bajirao Chormale (11902708510)
Ms. Tanuja Babasaheb Shinde (11902708569)
Mr. Devidas Kakasaheb Tambe (11902708573)
UNDER THE GUIDANCE OF
Ms. Shital S. Patil
DEPARTMENT OF IT ENGINEERING
Sir Visvesvaraya Institute Of Technology, Nashik
A/p. Chincholi, Tal. Sinnar, Dist. Nashik - 422102 (MS), India
YEAR 2024-2025

DEPARTMENT OF INFORMATION TECHNOLOGY
Sir Visvesvaraya Institute Of Technology, Nashik
A/p. Chincholi, Tal. Sinnar, Dist. Nashik - 422102 (MS), India
Year 2024-25
CERTIFICATE
This is to certify that the DSBDAL report entitled
“Weather Data Analysis”
is submitted in partial fulfillment of the curriculum
of the T.E. of IT Engineering
BY
Mr. Hitesh Gopal Patil (11902708552)
Mr. Prathamesh Bajirao Chormale (11902708510)
Ms. Tanuja Babasaheb Shinde (11902708569)
Mr. Devidas Kakasaheb Tambe (11902708573)
(Ms. Shital S. Patil) (Dr. Pratibha V. Kashid)
Guide Head Of Department
SVIT, Nashik

Certificate By Guide
This is to certify that
Mr. Hitesh Gopal Patil (11902708552)
Mr. Prathamesh Bajirao Chormale (11902708510)
Ms. Tanuja Babasaheb Shinde (11902708569)
Mr. Devidas Kakasaheb Tambe (11902708573)
have completed the DSBDA project under my guidance, and that I have verified the work
for its originality in documentation, problem statement, literature survey, and conclusion
presented in the DSBDA project.
Place: Nashik (Ms. Shital S. Patil)
Date:

Acknowledgement
It is our immense pleasure to work on this project, Weather Data Analysis. It is only the
blessing of our divine master which has prompted and mentally equipped us to undertake the
study of this project.
We would like to thank Prof. Dr. G. B. Shinde, Principal, Sir Visvesvaraya Institute of Technology,
for giving us the opportunity to develop practical knowledge of the subject. We are also
thankful to Dr. Pratibha V. Kashid, Head of the IT Engineering Department, for valuable
encouragement at every phase of our project and its completion.
We offer our sincere thanks to our guide, Ms. Shital S. Patil, who encouraged us to work on
the subject and gave her valuable guidance from time to time. We are very much thankful to her
for her help while preparing this project.
We are also grateful to the entire staff of the IT Engineering Department for their kind cooperation,
which helped us in the successful completion of the project.
SVIT, NASHIK.
Mr. Hitesh Gopal Patil (11902708552)
Mr. Prathamesh Bajirao Chormale (11902708510)
Ms. Tanuja Babasaheb Shinde (11902708569)
Mr. Devidas Kakasaheb Tambe (11902708573)

INDEX
SR. NO.  TITLE
1  Abstract
2  Introduction
3  Implementation
4  Conclusion

ABSTRACT
The aim of this project is to perform exploratory data analysis and predictive modeling on a weather dataset
using Python. The dataset contains hourly weather records for the year 2012, including attributes such as
temperature, humidity, wind speed, visibility, and atmospheric pressure. Through data preprocessing and
visualization techniques, we uncover patterns, seasonal trends, and relationships among the variables.
Additionally, a simple linear regression model is implemented to predict temperature based on selected
features like humidity, wind speed, and pressure. The project highlights the importance of data-driven
insights in understanding weather behavior and sets the foundation for building more accurate predictive
systems in the future.
This project presents a comprehensive analysis of hourly weather data collected over the year 2012. The
objective is to explore, understand, and predict weather patterns using data science tools and techniques. The
dataset includes key weather parameters such as temperature, dew point, relative humidity, wind speed,
visibility, and atmospheric pressure.
The analysis begins with data cleaning and preprocessing, followed by detailed exploratory data analysis
(EDA) using visualizations like line graphs, scatter plots, histograms, and heatmaps. These visualizations
help reveal trends such as seasonal temperature variation, the relationship between temperature and
humidity, and correlations among various weather attributes.

INTRODUCTION
Weather has a significant impact on human life, affecting agriculture, transportation, health, and even the
economy. With the growing availability of large weather datasets and powerful data analysis tools, it is now
possible to understand and predict weather patterns using data science techniques.
This project focuses on analyzing hourly weather data collected throughout the year 2012. The dataset
includes various parameters such as temperature, dew point, humidity, wind speed, visibility, and
atmospheric pressure. By performing exploratory data analysis (EDA), we aim to uncover meaningful
patterns and relationships among these weather attributes.
In addition to EDA, we also implement a basic machine learning model to predict temperature based on
other environmental features. Python libraries like Pandas, Matplotlib, Seaborn, and Scikit-learn are used to
handle data processing, visualization, and modeling.
The objective of this project is not only to gain insights from real-world weather data but also to apply
fundamental data science techniques that are essential for solving practical problems.

IMPLEMENTATION
The implementation of this project was carried out in Python using Jupyter Notebook. It involved multiple
steps, including data loading, cleaning, analysis, visualization, and predictive modeling. Below is a detailed
explanation of each phase:
1. Importing Required Libraries
We started by importing essential libraries:
+ pandas and numpy for data manipulation,
+ matplotlib.pyplot and seaborn for data visualization,
+ scikit-learn for building the machine learning model.
2. Loading and Exploring the Dataset
The dataset Weather Data.csv was loaded using Pandas. We used functions like .info(), .head(), and
.describe() to understand its structure and summary statistics.
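The three inspection calls named above can be tried without the real file. The sketch below is only an illustration: the three rows are made up and merely mimic the column layout that df.info() reports for Weather Data.csv, and the names csv_text and sample are hypothetical.

```python
import io
import pandas as pd

# Three made-up rows that only mimic the column layout of "Weather Data.csv".
csv_text = """Date/Time,Temp_C,Dew Point Temp_C,Rel Hum_%,Wind Speed_km/h,Visibility_km,Press_kPa,Weather
1/1/2012 0:00,-2.0,-4.0,86,4,8.0,101.2,Fog
1/1/2012 1:00,-1.5,-3.5,87,6,8.0,101.2,Fog
1/1/2012 2:00,-1.0,-3.0,89,7,4.0,101.3,Cloudy
"""
sample = pd.read_csv(io.StringIO(csv_text))

sample.info()              # column names, non-null counts, and dtypes
print(sample.head(2))      # first two rows
print(sample.describe())   # summary statistics for the numeric columns only
```

Note that .describe() skips the two text columns (Date/Time and Weather) by default, which is why the summary in the notebook listing has six columns rather than eight.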
3. Data Cleaning and Preprocessing
+ Checked for missing values and found none.
+ Removed any duplicate records.
+ Converted the Date/Time column to datetime format and set it as the index for time-series analysis.
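The cleaning steps above can be sketched on a tiny in-memory table. This is a minimal illustration under stated assumptions, not the notebook's own code: the four rows are invented, the explicit format string is inferred from timestamps such as "12/31/2012 23:00" in the listing, and the completeness check at the end is an extra step the report does not perform. Since 2012 is a leap year, a complete hourly record has 366 × 24 = 8784 rows, which matches the row count reported by df.info().

```python
import pandas as pd

# Tiny stand-in for the real CSV (values are illustrative, not from the dataset).
raw = pd.DataFrame({
    "Date/Time": ["1/1/2012 0:00", "1/1/2012 1:00", "1/1/2012 1:00", "1/1/2012 3:00"],
    "Temp_C": [-1.8, -1.8, -1.8, -1.5],
})

# Drop exact duplicate rows (the second "1:00" row here).
clean = raw.drop_duplicates().copy()

# An explicit format avoids day/month ambiguity in strings such as "1/2/2012".
clean["Date/Time"] = pd.to_datetime(clean["Date/Time"], format="%m/%d/%Y %H:%M")
clean = clean.set_index("Date/Time").sort_index()

# Every hour between the first and last timestamp that should be present.
expected = pd.date_range(clean.index.min(), clean.index.max(), freq=pd.Timedelta(hours=1))
missing_hours = expected.difference(clean.index)

print(clean.isnull().sum().sum())   # total number of missing cell values
print(list(missing_hours))          # hours with no record at all
```

A check for missing cells (isnull) and a check for missing hours answer different questions: a file can have no empty cells and still skip an hour entirely, as the 02:00 gap in this toy example shows.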
4. Data Visualization
Various plots were created to analyze trends and relationships:
+ Line Plot: To visualize temperature trends throughout the year.
+ Histogram: To observe temperature distribution.
+ Heatmap: To understand correlation among numerical features.
+ Scatter Plot: To examine the relationship between humidity and temperature.
+ Daily & Monthly Trends: Focused analysis on May 1st and monthly averages.
5. Feature Engineering
+ Extracted the month from the datetime index for seasonal analysis.
6. Machine Learning Model
A Linear Regression model was implemented to predict temperature using:
+ Relative Humidity
+ Wind Speed
+ Pressure
Steps:
+ Defined input (X) and output (y) features.
+ Split the dataset into training and testing sets.
+ Trained the model and evaluated its performance using R² score and Mean Squared Error (MSE).
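The steps above can be sketched end to end on synthetic data, so that the example runs without the CSV file. Everything numeric here is an assumption made for illustration: the feature ranges, the linear relationship, the noise level, and the seeds are invented and are not taken from the project's dataset or notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)   # arbitrary seed for the synthetic data
n = 1000

# Synthetic stand-ins for humidity (%), wind speed (km/h), and pressure (kPa).
X = np.column_stack([
    rng.uniform(20, 100, n),
    rng.uniform(0, 60, n),
    rng.normal(101.0, 0.8, n),
])
# Assumed linear relationship plus noise, chosen only so the demo has some signal.
y = 30.0 - 0.25 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0.0, 3.0, n)

# 80/20 split; random_state=0 is an arbitrary choice, not the notebook's seed.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R2: {r2:.3f}, MSE: {mse:.2f}")
```

Because the synthetic target really is linear in the inputs, the score here comes out much higher than on the real weather data; the point is only to show the order of the steps.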
Results:
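+ R² Score: 0.177
+ Mean Squared Error: 119.12
This shows that the linear model explains only about 18% of the variation in temperature; with an MSE of 119.12, its typical error is roughly 10.9 °C (the square root of the MSE). It could be improved with
more features or more complex models.

CONCLUSION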
In this project, we successfully analyzed a real-world weather dataset using Python. By applying data
cleaning, preprocessing, and visualization techniques, we were able to uncover meaningful insights about
temperature trends, humidity levels, seasonal patterns, and the relationships between different weather
parameters.
We observed that temperature generally follows a seasonal trend and is related to factors like humidity
and atmospheric pressure. The data visualizations helped us better understand these patterns.
Furthermore, we implemented a simple linear regression model to predict temperature using humidity, wind
speed, and pressure as input features. Although the model provided a basic prediction, the R² score of 0.177 indicated
that more complex models or additional data would be needed to improve accuracy.
This project has strengthened our understanding of exploratory data analysis, time-series data handling, and
regression modeling. It also demonstrates how data science techniques can be applied to gain valuable
insights from environmental data, paving the way for more advanced forecasting systems in the future.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("Weather Data.csv")
df
[Output: preview of df (first and last five rows); 8784 rows x 8 columns: Date/Time, Temp_C, Dew Point Temp_C, Rel Hum_%, Wind Speed_km/h, Visibility_km, Press_kPa, Weather]
df.info()
RangeIndex: 8784 entries, 0 to 8783
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
 0   Date/Time         8784 non-null   object
 1   Temp_C            8784 non-null   float64
 2   Dew Point Temp_C  8784 non-null   float64
 3   Rel Hum_%         8784 non-null   int64
 4   Wind Speed_km/h   8784 non-null   int64
 5   Visibility_km     8784 non-null   float64
 6   Press_kPa         8784 non-null   float64
 7   Weather           8784 non-null   object
dtypes: float64(4), int64(2), object(2)
memory usage: 549.1+ KB
print(df.isnull().sum())
Date/Time           0
Temp_C              0
Dew Point Temp_C    0
Rel Hum_%           0
Wind Speed_km/h     0
Visibility_km       0
Press_kPa           0
Weather             0
dtype: int64
df = df.drop_duplicates()
df.describe()
       Temp_C       Dew Point Temp_C  Rel Hum_%    Wind Speed_km/h  Visibility_km  Press_kPa
count  8784.000000  8784.000000       8784.000000  8784.000000      8784.000000    8784.000000
mean   8.798144     2.555294          67.431694    14.945469        27.664447      101.051623
std    11.687883    10.883072         16.918881    8.688696         12.622688      0.844005
min    -23.300000   -28.500000        18.000000    0.000000         0.200000       97.520000
25%    0.100000     -5.900000         56.000000    9.000000         24.100000      100.560000
50%    9.300000     3.300000          68.000000    13.000000        25.000000      101.070000
75%    18.800000    11.800000         81.000000    20.000000        25.000000      101.590000
max    33.000000    24.400000         100.000000   83.000000        48.300000      103.650000
df['Formatted Date'] = pd.to_datetime(df['Date/Time'])
df.set_index('Formatted Date', inplace=True)
df
[Output: preview of df indexed by Formatted Date, from 2012-01-01 00:00:00 to 2012-12-31 23:00:00; 8784 rows x 8 columns]
plt.figure(figsize=(12, 5))
plt.plot(df.index, df['Temp_C'])
plt.title("Temperature Over Time")
plt.xlabel("Date")
plt.ylabel("Temperature (C)")
plt.grid()
plt.show()
[Figure: Temperature Over Time; x-axis: Date, y-axis: Temperature (C)]
numeric_df = df.select_dtypes(include=['float64', 'int64'])
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
[Figure: Correlation Heatmap of Temp_C, Dew Point Temp_C, Rel Hum_%, Wind Speed_km/h, Visibility_km, and Press_kPa]
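A heatmap is easier to act on when the row for the target is read as a ranked list. The sketch below shows one way to do that; it is not part of the original notebook, and the data is synthetic: one column is deliberately built to track temperature and another to be unrelated, so the values say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # arbitrary seed for the synthetic data
n = 500
temp = rng.normal(9.0, 11.0, n)

demo = pd.DataFrame({
    "Temp_C": temp,
    "Dew Point Temp_C": temp - rng.uniform(2.0, 8.0, n),   # built to track temperature
    "Rel Hum_%": rng.uniform(20, 100, n),                  # built to be unrelated
})

# Take the Temp_C column of the correlation matrix, drop the trivial self-correlation,
# and sort by absolute strength so strong negative links are not pushed to the bottom.
corr_with_temp = (
    demo.corr()["Temp_C"]
    .drop("Temp_C")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_temp.round(2))
```

The same three lines applied to numeric_df from the listing would rank the real columns by how strongly they move with Temp_C, which is one way to choose input features for the regression.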
plt.figure(figsize=(8, 5))
sns.histplot(df['Temp_C'], kde=True, color='orange')
plt.title('Temperature Distribution')
plt.xlabel('Temperature (C)')
plt.ylabel('Frequency')
plt.grid()
plt.show()
[Figure: Temperature Distribution; x-axis: Temperature (C), y-axis: Frequency]
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x='Rel Hum_%', y='Temp_C')
plt.title('Humidity vs Temperature')
plt.grid()
plt.show()
[Figure: Humidity vs Temperature; x-axis: Rel Hum_%, y-axis: Temp_C]
# Get data for the entire day of 1st May 2012
day_data = df.loc['2012-05-01']
plt.figure(figsize=(12, 5))
plt.plot(day_data.index, day_data['Temp_C'], marker='o', color='green')
plt.title('Temperature Throughout the Day (1 May 2012)')
plt.xlabel('Time')
plt.ylabel('Temperature (C)')
plt.xticks(rotation=45)
plt.grid()
plt.show()
[Figure: Temperature Throughout the Day (1 May 2012); x-axis: Time, y-axis: Temperature (C)]
df['Month'] = df.index.month
monthly_avg = df.groupby('Month')['Temp_C'].mean()
plt.figure(figsize=(10, 5))
monthly_avg.plot(marker='o', color='purple')
plt.title('Monthly Average Temperature')
plt.xlabel('Month')
plt.ylabel('Avg Temperature (C)')
plt.grid()
plt.show()
[Figure: Monthly Average Temperature; x-axis: Month, y-axis: Avg Temperature (C)]
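The monthly averaging above groups rows by a month number taken from the index. An equivalent route for a datetime-indexed series is resample, sketched below on a synthetic seasonal curve. The curve and its parameters are assumptions for illustration only, and the variable names are hypothetical; nothing here comes from the real dataset.

```python
import numpy as np
import pandas as pd

# One value per hour for the leap year 2012: 366 * 24 = 8784 timestamps.
idx = pd.date_range("2012-01-01", periods=366 * 24, freq=pd.Timedelta(hours=1))

# Synthetic seasonal curve, coldest in mid-January, for illustration only.
day_of_year = idx.dayofyear.to_numpy()
temp = 9.0 - 15.0 * np.cos(2 * np.pi * (day_of_year - 15) / 366)
series = pd.Series(temp, index=idx, name="Temp_C")

# Route 1: group by the month number extracted from the index (as in the listing).
by_group = series.groupby(series.index.month).mean()

# Route 2: resample into calendar-month bins ("MS" labels each bin by month start).
by_resample = series.resample("MS").mean()

print(by_group.round(2))
print(np.allclose(by_group.to_numpy(), by_resample.to_numpy()))
```

For a single year the two routes give the same twelve numbers. With several years of data they differ: grouping by month number pools every January together, while resample keeps each calendar month separate.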
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Input features and target
X = df[['Rel Hum_%', 'Wind Speed_km/h', 'Press_kPa']]
y = df['Temp_C']
# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=...)  # seed value is cut off in the source listing
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
y_pred
array([ 8.31174767, 10.49252381,  1.6905262 , ...,  9.3846832 ,
       13.71053101, 14.93376871])
print("R2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
R2 Score: 0.17748486570306532
Mean Squared Error: 119.11967208953386
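The two metrics printed above can be put on a more interpretable scale with a little arithmetic. The sketch below is an added illustration, not part of the original notebook. It relies on the standard definition R² = 1 − MSE / Var(y_test), so dividing the MSE by (1 − R²) recovers the error of a baseline that always predicts the test-set mean; the variable names are hypothetical.

```python
import math

r2 = 0.17748486570306532    # R2 score from the output above
mse = 119.11967208953386    # mean squared error from the output above

# Root mean squared error: typical prediction error in degrees Celsius.
rmse = math.sqrt(mse)

# MSE of a baseline that always predicts the mean of the test targets,
# obtained by rearranging R2 = 1 - mse / baseline_mse.
baseline_mse = mse / (1.0 - r2)
baseline_rmse = math.sqrt(baseline_mse)

# How much the model shrinks the typical error relative to that baseline.
reduction_pct = 100.0 * (1.0 - rmse / baseline_rmse)

print(f"RMSE: {rmse:.2f} C")
print(f"Baseline RMSE (always predict the mean): {baseline_rmse:.2f} C")
print(f"Error reduction over baseline: {reduction_pct:.1f}%")
```

Read this way, the model's typical error is about 10.9 °C against about 12.0 °C for the constant baseline, which is a reduction of roughly 9%. That supports the report's conclusion that humidity, wind speed, and pressure alone carry limited information about temperature in a linear model.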