MINI PROJECT REPORT
CAR ENGINE ANOMALY DETECTOR
PREPARED BY:
YOGESH R-71772217155
PIYUSH KUMAR MAHTO-71772217158
PRINCE KUMAR MAHTO-71772217159
GNANESH G-71772217303
KAMALAKANNAN S-71772217L02
1. Introduction
Predictive maintenance is revolutionizing industries by minimizing downtime and reducing
repair costs. One critical aspect of this is determining the condition of an engine before failure
occurs. In this project, a smart system has been developed to classify an engine as Healthy or
Faulty based on real-time sensor readings using supervised machine learning.
This engine condition classifier uses sensor data such as oil pressure, RPM, fuel pressure, and
temperature to identify patterns and detect faults. The system incorporates data preprocessing,
class balancing with SMOTE, model training, hyperparameter tuning, and visual analytics. With
a user-friendly Streamlit app, this solution brings predictive insights directly to the user.
2. Objective
- To classify engine conditions (Healthy or Faulty) based on sensor input.
- To use machine learning models to accurately predict engine faults.
- To build a pipeline with data preprocessing, model comparison, tuning, and visualization.
- To deploy the trained model using an intuitive web interface.
3. Tools and Technologies Used
- Python – Programming language for development
- Pandas & NumPy – For data handling and numerical operations
- Scikit-learn – For building and tuning ML models
- Matplotlib & Seaborn – For data visualization
- imbalanced-learn (SMOTE) – For handling class imbalance
- Joblib – For model serialization
- Streamlit – For deploying a simple user interface
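All of the above are available from PyPI; a one-line setup (assuming a standard pip environment and current PyPI package names) is:
pip install pandas numpy scikit-learn matplotlib seaborn imbalanced-learn joblib streamlit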
4. Dataset
The dataset engine_data.csv includes real engine sensor readings and labels that indicate the
engine condition:
Features:
- Engine rpm
- Lubricant oil pressure
- Fuel pressure
- Coolant pressure
- Lubricant oil temperature
- Coolant temperature
Target:
- Engine Condition
- 0 – Healthy
- 1 – Faulty
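Because Section 5 balances the classes with SMOTE, it is worth inspecting the label distribution first. A minimal sketch, assuming the CSV headers match the names above (the target column name follows the training script in Section 11):
import pandas as pd

df = pd.read_csv('engine_data.csv')
# Show the share of each class (0 = Healthy, 1 = Faulty)
print(df['Engine Condition'].value_counts(normalize=True))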
5. Model Design
- Preprocessing: StandardScaler was used to scale features to improve model convergence.
- Train-Test Split: 80% for training and 20% for testing.
- Class Balancing: SMOTE was applied to the training data to address class imbalance (a leak-free pipeline sketch follows this list).
- Model Selection: Multiple classifiers were evaluated:
- Random Forest
- Logistic Regression
- Support Vector Machine
- K-Nearest Neighbors
- Tuning: GridSearchCV was used on the Random Forest classifier to identify optimal
hyperparameters.
- Final Evaluation: The best model was evaluated using accuracy, confusion matrix, and
classification report.
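The same steps can also be chained so that scaling and SMOTE are fit only on the training folds during cross-validation. A minimal sketch using imblearn's Pipeline (an alternative structure, not the exact script in Section 11):
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The scaler and SMOTE are refit on each training fold, so no test-fold
# statistics leak into preprocessing.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(random_state=42)),
])

param_grid = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth': [None, 5, 10],
    'clf__min_samples_split': [2, 5],
}
grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
# grid.fit(X_train, y_train)  # fit on the raw (unscaled) training split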
6. Application Workflow
1. Load and explore the dataset.
2. Preprocess and scale features.
3. Split dataset into training and testing sets.
4. Apply SMOTE to balance training data.
5. Train and compare multiple classification models.
6. Tune the best-performing model (Random Forest).
7. Evaluate the final model on the test set.
8. Visualize feature importance and model metrics.
9. Save the trained model and scaler.
10. Use Streamlit to deploy the classifier for real-time predictions.
7. Code Explanation
- Data Loading: Read engine_data.csv using pandas.
- Preprocessing: StandardScaler used to normalize features.
- Model Training: Loop over four models and compare test accuracy.
- SMOTE: Synthetic data generated to oversample the minority class.
- Grid Search: GridSearchCV fine-tunes the Random Forest; the best parameters found were:
{'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200}
- Evaluation Metrics: Accuracy, classification report, confusion matrix, cross-validation score.
- Feature Importance: Bar plot generated using seaborn.
- Model Saving: Trained model and scaler saved as .pkl files (a reload-and-predict sketch follows this list).
- Web Interface: Streamlit app developed for real-time predictions (a minimal app sketch follows the training script in Section 11).
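To show how the saved artifacts are reused at prediction time, here is a minimal sketch; the single sensor reading below is made up purely for illustration:
import joblib
import numpy as np

# Load the artifacts written by the training script
model = joblib.load('engine_condition_model.pkl')
scaler = joblib.load('scaler.pkl')

# One hypothetical reading, in the training column order:
# [rpm, lub oil pressure, fuel pressure, coolant pressure,
#  lub oil temperature, coolant temperature]
reading = np.array([[790.0, 2.9, 11.8, 3.1, 84.1, 81.6]])

prediction = model.predict(scaler.transform(reading))[0]
print("Faulty" if prediction == 1 else "Healthy")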
8. Visual Analytics
- Confusion Matrix: Highlights the model's performance on test data (a plotting sketch follows this list).
- Classification Report: Shows precision, recall, and F1-score for both classes.
- Feature Importance Plot: Visualizes how much each sensor contributes to prediction.
- Cross-Validation: Accuracy averaged over 5 folds provides a check on model generalization.
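A minimal sketch of how the confusion-matrix heatmap can be drawn with seaborn (y_test and y_pred come from the training script in Section 11; class 0 is Healthy, class 1 is Faulty):
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Healthy', 'Faulty'],
            yticklabels=['Healthy', 'Faulty'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()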
9. Strengths
- Handles imbalanced datasets using SMOTE
- Compares multiple models for robustness
- Uses hyperparameter tuning for better accuracy
- Saves model and scaler for future use
- Streamlit integration allows easy interaction
- Provides visual understanding of model and data
10. Limitations & Future Work
- The model is trained on a limited dataset; performance can be improved with more real-world
data.
- Only binary classification is supported (Healthy/Faulty).
- Deep learning models can be explored for further improvement.
- Additional sensor inputs (vibration, acoustic signals) can enhance accuracy.
- Future versions can include real-time sensor data input and dashboard integration.
11. Source Code
GitHub Repository: https://github.com/Gnanesh-Nani/EngineSense
# train_engine_model.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE
import joblib
# Load dataset
df = pd.read_csv('engine_data.csv')
print("📄 First 5 rows:")
print(df.head())
print("\n🔍 Info:")
df.info()  # prints directly; wrapping it in print() would also emit "None"
# Features and target
X = df.drop('Engine Condition', axis=1)
y = df['Engine Condition']
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
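# Note: the scaler is fit on the full dataset before splitting, which can leak
# test-set statistics into preprocessing; the Pipeline sketch in Section 5
# shows a leak-free alternative.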
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Apply SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
print("\n⚖️Class distribution after SMOTE:")
print(pd.Series(y_train_res).value_counts())
# Try different models
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier()
}
print("\n📊 Model Comparison:")
for name, model in models.items():
    model.fit(X_train_res, y_train_res)
    acc = model.score(X_test, y_test)
    print(f"{name}: {acc:.2f}")
# Grid search on Random Forest
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
grid.fit(X_train_res, y_train_res)
print("\n✅ Best Parameters:")
print(grid.best_params_)
# Final evaluation
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("\n📈 Classification Report:")
print(classification_report(y_test, y_pred))
print("🧮 Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
# Cross-validation score
cv_scores = cross_val_score(best_model, X_scaled, y, cv=5)
print("\n📉 Cross-Validated Accuracy: {:.2f}%".format(cv_scores.mean() * 100))
# Feature importance
importances = best_model.feature_importances_
feature_names = X.columns
plt.figure(figsize=(8, 5))
sns.barplot(x=importances, y=feature_names)
plt.title("Feature Importance")
plt.xlabel("Importance Score")
plt.tight_layout()
plt.show()
# Save model and scaler
joblib.dump(best_model, 'engine_condition_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("\n💾 Model and scaler saved as 'engine_condition_model.pkl' and 'scaler.pkl'")
12. Output
The following results were captured as screenshots (images not reproduced here):
- Faulty Condition prediction
- Normal Condition prediction
- Accuracy
- Feature Importance plot
Conclusion
This project demonstrates how machine learning can be effectively applied to monitor and
classify engine conditions. By building a robust classification pipeline with proper
preprocessing, class balancing, and model tuning, we can predict engine faults with promising
accuracy. The deployment-ready solution using Streamlit provides a strong foundation for
industrial applications in predictive maintenance and diagnostics. Future enhancements can make
the system more intelligent, scalable, and adaptive to new data streams.