Python Handbook: Converting Time Series Data
to Supervised Learning Models
Table of Contents
1. Introduction
2. Understanding Time Series Data
3. Why Convert Time Series to Supervised Learning?
4. Steps to Convert Time Series Data
• 4.1 Importing Libraries
• 4.2 Loading the Data
• 4.3 Visualizing the Data
• 4.4 Creating Lag Features
• 4.5 Handling Missing Values
• 4.6 Splitting the Data
• 4.7 Training a Supervised Learning Model
• 4.8 Evaluating the Model
5. Advanced Techniques
• 5.1 Handling Stationarity
• 5.2 Incorporating Exogenous Variables
• 5.3 Dealing with Seasonality
6. Practical Example: Forecasting Electricity Consumption
7. Conclusion
1. Introduction
Time series data is ubiquitous across various domains, including finance, economics, environmental science, and engineering. Traditionally, specialized models like ARIMA have been used for forecasting. However, converting time series data into a supervised learning problem opens up powerful machine learning techniques for prediction.
This handbook provides a comprehensive, step-by-step guide to transforming
time series data into a format compatible with machine learning algorithms
using Python.
2. Understanding Time Series Data
Time series data consists of observations recorded sequentially over time. Each
data point is inherently dependent on previous observations, creating temporal
dependencies that must be carefully considered during analysis.
3. Why Convert Time Series to Supervised Learning?
Converting time series to a supervised learning problem offers several advantages:
• Algorithmic Flexibility: Utilize a wide range of machine learning algorithms beyond traditional time series models.
• Feature Incorporation: Incorporate multiple inputs, including external (exogenous) variables.
• Robust Validation: Apply advanced cross-validation techniques.
• Complex Pattern Recognition: Handle intricate, non-linear relationships in the data.
4. Steps to Convert Time Series Data
4.1 Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
4.2 Loading the Data
# Load a CSV file containing time series data
data = pd.read_csv('time_series_data.csv', parse_dates=['Date'], index_col='Date')
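The CSV file name here is a placeholder. To follow along without a file, a synthetic series with a mild trend serves the same purpose (the 'Value' column name and 'Date' index match the snippets below):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for 'time_series_data.csv': one year of daily
# observations with a gentle upward trend plus Gaussian noise.
rng = np.random.default_rng(42)
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = 50 + 0.05 * np.arange(365) + rng.normal(0, 2, 365)
data = pd.DataFrame({'Value': values}, index=dates)
data.index.name = 'Date'
print(data.head())
```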
4.3 Visualizing the Data
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Value'])
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
4.4 Creating Lag Features
def create_lag_features(df, lag=1):
    df_lag = df.copy()
    for i in range(1, lag + 1):
        df_lag[f'lag_{i}'] = df_lag['Value'].shift(i)
    return df_lag
# Create lag features for the previous 3 time steps
data_lagged = create_lag_features(data, lag=3)
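On a toy series the effect of the shift is easy to see: each lag_i column holds the value from i steps earlier, with NaN where no earlier observation exists.

```python
import pandas as pd

def create_lag_features(df, lag=1):
    """Add lag_1..lag_{lag} columns shifted from the 'Value' column."""
    df_lag = df.copy()
    for i in range(1, lag + 1):
        df_lag[f'lag_{i}'] = df_lag['Value'].shift(i)
    return df_lag

# Five observations; lag_1 holds the previous value, lag_2 the one before that.
toy = pd.DataFrame({'Value': [10, 20, 30, 40, 50]})
lagged = create_lag_features(toy, lag=2)
print(lagged)
```

Row 0 has NaN in both lag columns and row 1 has NaN in lag_2, which is exactly what the dropna step in the next section removes.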
4.5 Handling Missing Values
data_lagged.dropna(inplace=True)
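Dropping rows is appropriate here, since these NaNs come from lag columns with no real history behind them. When the raw observations themselves have gaps, forward fill or interpolation are common alternatives, for example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Forward fill: carry the last observed value forward.
filled = s.ffill()
# Linear interpolation: estimate each gap from its neighbouring points.
interp = s.interpolate()

print(filled.tolist())  # [1.0, 1.0, 3.0, 3.0, 5.0]
print(interp.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```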
4.6 Splitting the Data
train_size = int(len(data_lagged) * 0.8)
train, test = data_lagged.iloc[:train_size], data_lagged.iloc[train_size:]
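Note that this is a chronological split, not a random one: every test observation comes after the training window, so no future information leaks into training. The TimeSeriesSplit imported earlier generalizes the same idea into several expanding-window folds for cross-validation; a minimal sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in feature matrix, ordered in time (20 observations, 1 feature).
X = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # Each training window ends before its test window begins.
    assert train_idx[-1] < test_idx[0]
    print(f'fold {fold}: train size {len(train_idx)}, test size {len(test_idx)}')
```

The training window grows across folds (5, 10, 15 samples here) while each test window stays the same size; cross_val_score from the imports accepts this splitter directly through its cv argument.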
4.7 Training a Supervised Learning Model
# Define input and output variables
X_train = train.drop('Value', axis=1)
y_train = train['Value']
X_test = test.drop('Value', axis=1)
y_test = test['Value']
# Initialize the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the model
model.fit(X_train, y_train)
4.8 Evaluating the Model
# Make predictions
y_pred = model.predict(X_test)
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse:.2f}')
# Plot actual vs. predicted values
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test, label='Actual')
plt.plot(y_test.index, y_pred, label='Predicted')
plt.title('Actual vs. Predicted Values')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()
5. Advanced Techniques
5.1 Handling Stationarity
# Differencing to remove trends
data_diff = data.diff().dropna()
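A small example shows why differencing helps: a linear trend, whose mean drifts over time, becomes a constant series after a single difference.

```python
import numpy as np
import pandas as pd

# A series with a linear trend is non-stationary: its mean keeps rising.
trend = pd.Series(2.0 * np.arange(10))  # 0, 2, 4, ...

# First differencing replaces each value with its change from the
# previous step, turning the linear trend into a constant series.
diff = trend.diff().dropna()
print(diff.tolist())  # every entry is 2.0
```

Keep in mind that a model trained on differenced data predicts changes, so forecasts must be cumulatively summed (offset by the last observed value) to recover the original scale.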
5.2 Incorporating Exogenous Variables
# Include an external factor (assumes the source frame has an 'Exogenous_Var' column)
data_lagged['Exogenous_Var'] = data['Exogenous_Var']
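In practice the external factor often arrives as a separate series rather than a column already present in the frame. Joining on the datetime index keeps the rows aligned by timestamp; a sketch with a hypothetical temperature series (both names are illustrative):

```python
import pandas as pd

dates = pd.date_range('2021-01-01', periods=5, freq='D')
sales = pd.DataFrame({'Value': [100, 102, 98, 105, 110]}, index=dates)

# Hypothetical exogenous series (e.g. daily temperature) on the same index.
temperature = pd.Series([3.1, 4.0, 2.5, 5.2, 6.0], index=dates, name='Temp')

# Joining on the index aligns rows by timestamp, which matters when the
# two sources cover different ranges or have gaps.
merged = sales.join(temperature)
print(merged)
```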
5.3 Dealing with Seasonality
# Seasonal lag of 12 for monthly data with yearly seasonality
data_lagged['lag_12'] = data_lagged['Value'].shift(12)
data_lagged.dropna(inplace=True)
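To see what the seasonal shift does, consider two years of monthly values: position 12 (the thirteenth month) receives the value from the same calendar month one year earlier, and the first twelve rows become NaN.

```python
import pandas as pd

# Two years of monthly observations; a seasonal lag of 12 pairs each
# month with the same calendar month one year earlier.
idx = pd.date_range('2020-01-01', periods=24, freq='MS')
s = pd.Series(range(24), index=idx)

lag_12 = s.shift(12)
# January 2021 (position 12) now sees January 2020's value.
print(lag_12.iloc[12])
```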
6. Practical Example: Forecasting Electricity Consumption
Step 1: Load the Dataset
data = pd.read_csv('electricity_consumption.csv', parse_dates=['Month'], index_col='Month')
Step 2: Visualize the Data
plt.figure(figsize=(12, 6))
plt.plot(data.index, data['Consumption'])
plt.title('Monthly Electricity Consumption')
plt.xlabel('Month')
plt.ylabel('Consumption (kWh)')
plt.show()
Step 3: Create Lag and Seasonal Features
data['lag_1'] = data['Consumption'].shift(1)
data['lag_12'] = data['Consumption'].shift(12)
data.dropna(inplace=True)
Step 4: Prepare the Data
X = data[['lag_1', 'lag_12']]
y = data['Consumption']
Step 5: Split the Data
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]
Step 6: Train the Model
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
Step 7: Evaluate the Model
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Root Mean Squared Error: {rmse:.2f}')
Step 8: Plot the Results
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test, label='Actual')
plt.plot(y_test.index, y_pred, label='Predicted')
plt.title('Actual vs. Predicted Electricity Consumption')
plt.xlabel('Month')
plt.ylabel('Consumption (kWh)')
plt.legend()
plt.show()
7. Conclusion
Converting time series data into a supervised learning format empowers data scientists and analysts to leverage a diverse range of machine learning algorithms for forecasting tasks. By strategically creating lag features, addressing stationarity, and incorporating exogenous variables, you can capture temporal dependencies and significantly improve model performance.
Key Takeaways:
• Time series data can be transformed into a supervised learning problem
• Lag features capture temporal dependencies
• Machine learning models can effectively forecast time series data
• Preprocessing techniques like handling stationarity and seasonality are crucial
Next Steps:
• Experiment with different machine learning algorithms
• Try various feature engineering techniques
• Validate models using cross-validation