
SVM


Contents

01 Data Description
02 Problem Statement
03 Methodology
    (i) Handling Null Value
    (ii) Handling Missing Value
    (iii) Noise Removal
    (iv) Split
    (v) Classification
04 Coding
05 Result and Discussion
06 Conclusion
07 Reference

01 Data Description

The Influenza dataset extract, taken from the UCI Machine Learning
Repository, contains information on patients with Influenza, detailing
both clinical symptoms and lab results.

• Total Features (Attributes): 6, namely flu_X_tr, flu_Y_tr, flu_X_te,
flu_Y_te, flu_locs and flu_keywords.
• Target Variable: Class, which indicates whether Influenza is
present (1) or absent (0).

02 Problem Statement

The purpose of this project is to build a classification model to
predict the presence of Influenza in patients based on their clinical
and laboratory data. By predicting Influenza early, healthcare
professionals can make timely decisions, potentially improving
patient outcomes.

03 Methodology

The methodology outlines steps used to prepare the data, clean it,
and build a classification model.

I. Null Value Method:


• The dataset is checked for null values and empty cells in every
column, because most models, including SVM, cannot be trained on
incomplete rows.
• Each null value is replaced with the mean or median of that feature's
column. This keeps the dataset complete and leaves the typical value of
each feature largely unchanged.
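
The null-value step above could be sketched as follows. This is a minimal illustration only: the column names (temperature, wbc_count) and the values are made up for the example and are not taken from the Influenza dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two numerical columns of the dataset.
df = pd.DataFrame({
    "temperature": [38.5, np.nan, 39.1, 37.2],
    "wbc_count": [7.1, 6.4, np.nan, 11.0],
})

# Count the null values in each column before filling.
null_counts = df.isna().sum()
print(null_counts)

# Fill each column's nulls with that column's median.
filled = df.fillna(df.median(numeric_only=True))
print(filled)
```

Replacing df.median(...) with df.mean(...) gives mean filling instead; the median is usually the safer choice when a column contains extreme values.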

II. Missing Value Method:


• The dataset may contain missing entries denoted by “?”. These entries
are replaced with NaN values, which are then filled with the median
value of each numerical feature. This fills the gaps while keeping
each feature's typical value unchanged.
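
A small sketch of the “?” handling described above, assuming a hypothetical column named lab_value that was read as text because of the placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical column read as text because it contains "?" placeholders.
raw = pd.DataFrame({"lab_value": ["4.0", "?", "6.0", "10.0"]})

# Replace the "?" marker with NaN, then convert the column to a numeric dtype.
cleaned = raw.replace("?", np.nan)
cleaned["lab_value"] = pd.to_numeric(cleaned["lab_value"])

# Fill the resulting NaN with the column median.
cleaned["lab_value"] = cleaned["lab_value"].fillna(cleaned["lab_value"].median())
print(cleaned)
```

When the file is loaded with pd.read_csv, passing na_values="?" should achieve the same conversion at load time.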

III. Noise Removal Method:


• Noise (outliers or incorrect values) in the data may affect the model's
performance. The dataset is reviewed for such values, and numerical
features are scaled where necessary so that they share a comparable
range. Since the dataset is relatively clean, minimal processing is
needed.
• Scaling does not delete outliers such as very high or low test values.
It keeps a feature with a large numeric range from dominating the
others, which matters for a distance-based model like SVM.
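
The scaling mentioned above might look like the sketch below. It uses scikit-learn's StandardScaler as one possible choice (the report does not name a specific scaler), and the feature values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales.
X = np.array([
    [36.5, 4000.0],
    [38.0, 7000.0],
    [39.5, 12000.0],
    [37.0, 5000.0],
])

# Standardize each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

In a full pipeline the scaler would normally be fitted on the training set only and then applied to the test set with scaler.transform, so that no information from the test data leaks into training.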

IV. Split:
The data is divided into two sets:

• Training Set (80%): Used to train the SVM model.
• Test Set (20%): Used to evaluate the model’s performance. Holding out
this data shows how well the model generalizes to unseen data.
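
As a rough illustration of the 80/20 split, the sketch below uses synthetic arrays in place of the real features and labels. The stratify argument is an optional addition, not something the report specifies; it keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features, balanced binary labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# 80% training, 20% test; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
```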

V. Method For Classification:


• A Support Vector Machine (SVM) is used to classify patients based on
their medical attributes. An SVM finds the best line or decision
boundary that separates n-dimensional space into classes, so that new
data points can be placed in the correct category. This best decision
boundary is called a hyperplane.
• The SVM chooses the extreme points (vectors) that help define the
hyperplane. These extreme cases are called support vectors, which is
why the algorithm is called a Support Vector Machine (SVM).
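
To make the idea of a hyperplane and its support vectors concrete, here is a small sketch on six made-up 2-D points. A linear kernel is chosen here only so that the hyperplane coefficients can be printed; it is an assumption for the example, not the report's configuration.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters (hypothetical data, not patient records).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel yields a hyperplane of the form w . x + b = 0.
model = SVC(kernel="linear")
model.fit(X, y)

# The extreme points that define the margin.
print("Support vectors:\n", model.support_vectors_)
# The hyperplane parameters w and b.
print("w =", model.coef_, "b =", model.intercept_)

# Classify two new points, one near each cluster.
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
predictions = model.predict(new_points)
print("Predictions:", predictions)
```

Only the support vectors influence the position of the boundary; removing any of the other training points would leave the fitted hyperplane unchanged.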

04 Coding

Here’s the Python code implementing the above steps for data preparation,
training, and evaluating the model.

Code:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance

# Step 1: Load the data
df = pd.read_csv("C:/Users/KIIT0001/Downloads/influenza_outbreak_dataset.csv")
print(df.head())

# Step 2: Assign column names
df.columns = ["flu_X_tr", "flu_Y_tr", "flu_X_te", "flu_Y_te", "flu_locs", "flu_keywords"]

# Step 3: Data cleaning
# fillna() and dropna() return new DataFrames, so the results must be assigned.
ndf = df.fillna(0)    # fill null values with a single value
ndf = ndf.dropna()    # drop rows that still contain at least one null value

# Step 4: Split data into features (X) and target (y), then into training and test sets
# NOTE: 'target' is the label column name used in this report's code. It must match
# the label column actually present in ndf (the data description calls it "Class").
X = ndf.drop('target', axis=1)
y = ndf['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Support Vector Machine (SVM)
svc_model = SVC()
svc_model.fit(X_train, y_train)

# Step 6: Evaluate the model
y_pred = svc_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Step 7: Feature importance
# SVC with the default RBF kernel has no feature_importances_ attribute (that belongs
# to tree-based models), so permutation importance is used here instead.
result = permutation_importance(svc_model, X_test, y_test, n_repeats=10, random_state=42)
feature_importance = result.importances_mean
indices = np.argsort(feature_importance)[::-1]

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), feature_importance[indices], align="center")
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.show()

05 Result and Discussion

• Accuracy: The model's accuracy on the test set, indicating its ability to
classify Influenza cases correctly.

• Classification Report: The classification report includes metrics like
precision, recall, and F1-score for each class (Influenza and non-Influenza).
A high F1-score for both classes indicates that precision and recall are
balanced.

• Feature Importance: Ranking the features by their influence on the
predictions shows which clinical and laboratory attributes are the key
indicators for predicting Influenza.
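
As an aid to reading these metrics, the sketch below computes them on a handful of made-up labels. These are illustrative values only and are not results from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions (1 = present, 0 = absent).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Here: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
accuracy = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```

For a medical screening task, recall on the positive class is often the metric to watch, because a false negative means a missed case.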

06 Conclusion

This project demonstrates how an SVM model can classify Influenza cases
based on clinical and laboratory data. Given good accuracy on the test set
and a feature-importance analysis for interpretability, this model can be
helpful for medical professionals in understanding and diagnosing Influenza.
However, further refinement or alternative models (e.g., boosting
techniques) could improve performance.

07 Reference

• UCI Machine Learning Repository: Influenza dataset.
• Scikit-Learn Documentation: For model functions and metrics used.
• General resources on data preprocessing, classification models, and
SVM methodology.
