Contents
01 Data Description
02 Problem Statement
03 Methodology
(i) Handling Null Values
(ii) Handling Missing Values
(iii) Noise Removal
(iv) Split
(v) Classification
04 Coding
05 Result and Discussion
06 Conclusion
07 References
Data Description
The Influenza dataset, extracted from the UCI Machine Learning
Repository, contains information on patients with influenza, detailing
both clinical symptoms and lab results.
Total Features (Attributes): 6, namely flu_X_tr, flu_Y_tr, flu_X_te,
flu_Y_te, flu_locs, and flu_keywords.
Target Variable: Class, which indicates whether influenza is
present (1) or absent (0).
Problem Statement
The purpose of this project is to build a classification model that
predicts the presence of influenza in patients based on their clinical
and laboratory data. By predicting influenza early, healthcare
professionals can make timely decisions, potentially improving
patient outcomes.
Methodology
The methodology outlines steps used to prepare the data, clean it,
and build a classification model.
I. Null Value Method:
The dataset is checked for null values. Any null or empty cells are
identified and replaced using statistical methods, typically the mean
or median of that feature's column, which preserves data completeness
and the integrity of the model without introducing bias.
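As an illustration, a minimal pandas sketch of median filling (the
column name and values here are hypothetical, not from the dataset):

import pandas as pd

# Hypothetical toy frame with a null value in a numeric column
df = pd.DataFrame({"temperature": [101.2, None, 99.8, 102.5]})
# Fill nulls in each numeric column with that column's median
df = df.fillna(df.median(numeric_only=True))
print(df)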
II. Missing Value Method:
The dataset may contain missing entries denoted by "?". These entries
are replaced with NaN values, which are then filled with the median
value for numerical features. This fills the gaps while preserving each
feature's distribution and the integrity of the dataset.
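A sketch of this step (the column name and values are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical column where missing entries are recorded as "?"
df = pd.DataFrame({"wbc_count": ["5.1", "?", "6.3", "?"]})
# Replace "?" with NaN, convert to numeric, then fill with the median
df["wbc_count"] = pd.to_numeric(df["wbc_count"].replace("?", np.nan))
df["wbc_count"] = df["wbc_count"].fillna(df["wbc_count"].median())
print(df)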
III. Noise Removal Method:
Noise (outliers or inconsistent values) in the data may affect the
model's performance. To minimize its impact, the dataset is reviewed
and numerical features are scaled where necessary; for example, very
high or low test values are adjusted by scaling so that the data is
uniform. Since the dataset is relatively clean, minimal processing is
needed.
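A minimal scaling sketch, assuming the numeric features sit in a
DataFrame (the column name is hypothetical; StandardScaler is one
common choice, not necessarily the exact method used here):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric feature containing one extreme value
X = pd.DataFrame({"test_value": [1.0, 1.2, 0.9, 25.0]})
# Rescale to zero mean and unit variance so extreme values
# no longer dominate the feature's range
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)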
IV. Split:
The data is divided into two sets:
Training Set (80%): Used to train the SVM model.
Test Set (20%): Used to evaluate the model’s performance. This split
helps the model generalize and perform well on unseen data.
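A hedged sketch of this split on toy data (the arrays are synthetic
stand-ins; stratify=y is an added refinement, not stated above, that
keeps the class proportions equal in both sets):

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the influenza features and labels
X = np.random.rand(100, 6)
y = np.array([0] * 50 + [1] * 50)
# 80% training / 20% test; stratify=y (an added assumption) keeps the
# class balance identical across the two sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)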
V. Method For Classification:
A Support Vector Machine (SVM) is used to classify patients based on
their medical attributes. SVM finds the best line or decision boundary
that segregates n-dimensional space into classes, so that new data
points can easily be placed in the correct category in the future.
This best decision boundary is called a hyperplane.
An SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence
the algorithm is termed a Support Vector Machine (SVM).
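A minimal sketch of this idea on synthetic points (the data below is
illustrative only; support_vectors_ is the scikit-learn attribute that
exposes the extreme points defining the hyperplane):

import numpy as np
from sklearn.svm import SVC

# Two small synthetic clusters standing in for two patient classes
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])
# A linear-kernel SVM; after fitting, the model keeps only the
# extreme points (support vectors) that define the hyperplane
model = SVC(kernel="linear").fit(X, y)
print("Support vectors:\n", model.support_vectors_)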
Coding
Here is the Python code implementing the above steps for data
preparation, model training, and evaluation.
Code:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance
# Step 1: Load the data
df = pd.read_csv("C:/Users/KIIT0001/Downloads/influenza_outbreak_dataset.csv")
print(df.head())  # preview the first rows
# Step 2: Assign column names
df.columns = ["flu_X_tr", "flu_Y_tr", "flu_X_te",
              "flu_Y_te", "flu_locs", "flu_keywords"]
# Step 3: Data Cleaning
# Fill null values with a single value using fillna(); the result is
# assigned back, since fillna() does not modify the frame in place
ndf = df.fillna(0)
# Drop any rows that still contain at least one null value
ndf = ndf.dropna()
# Step 4: Split data into features (X) and target (y),
# then into training and testing sets
X = ndf.drop('target', axis=1)  # 'target' is the label column (name assumed)
y = ndf['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
# Step 5: Train the Support Vector Machine (SVM)
svc_model = SVC()
svc_model.fit(X_train, y_train)  # fit the classifier on the training set
# Step 6: Evaluate the model
y_pred = svc_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Step 7: Feature Importance
# SVC has no feature_importances_ attribute (that belongs to tree
# ensembles), so permutation importance is used here instead
result = permutation_importance(svc_model, X_test, y_test,
                                n_repeats=10, random_state=42)
feature_importance = result.importances_mean
indices = np.argsort(feature_importance)[::-1]
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), feature_importance[indices], align="center")
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.show()
Result and Discussion
Accuracy: The model's accuracy on the test set, indicating its ability to
classify influenza cases correctly.
Classification Report: The classification report includes metrics like
precision, recall, and F1-score for each class (influenza and
non-influenza). High F1-scores show balanced performance.
Feature Importance: Permutation importance highlights which features
most influence the SVM's predictions, revealing key health indicators
for predicting influenza.
Conclusion
This project demonstrates how an SVM model can effectively classify
influenza cases based on clinical and laboratory data. With good
accuracy and interpretability (via permutation-based feature
importance), this model can help medical professionals understand and
diagnose influenza. However, further refinement or alternative models
(e.g., boosting techniques) could improve performance.
References
UCI Machine Learning Repository: Influenza Outbreak Dataset.
Scikit-learn Documentation: for the model functions and metrics used.
General resources on data preprocessing, classification models, and
Support Vector Machine methodology.