

DMML Lab Report 05

This lab report focuses on feature selection for predicting stroke using data mining techniques. It details the process of dividing the dataset into independent and dependent variables, applying SelectKBest with ANOVA for feature selection, and visualizing the results through bar plots and correlation heatmaps. The report concludes by identifying the top 8 features most correlated with the target variable 'stroke'.

Uploaded by

Atick Arman

Lab report

Course code: CSE326


Course Title: Data Mining and Machine Learning Lab
Lab report: 05
Topic: Feature Selection.

Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University

Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University

Submission Date: 15-03-2025


Code: Dividing the dataset into input and target

# Features: every column except the target
x = df2.drop('stroke', axis=1)
# Target: the 'stroke' column
y = df2['stroke']

Explanation:
Here the dataset is divided into two parts: the independent variables (x) and the dependent
variable, i.e. the target class (y).
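
As a minimal sketch of this split, assuming a small hypothetical stand-in for df2 (the column names other than stroke and all values below are invented for illustration, not taken from the lab dataset):

```python
import pandas as pd

# Hypothetical stand-in for df2; values are illustrative only.
df2 = pd.DataFrame({
    "age": [67, 45, 80, 33],
    "hypertension": [0, 1, 1, 0],
    "avg_glucose_level": [228.7, 95.1, 105.9, 88.0],
    "stroke": [1, 0, 1, 0],
})

x = df2.drop("stroke", axis=1)  # every column except the target
y = df2["stroke"]               # the target column only

print(x.shape, y.shape)
```

Because drop returns a new DataFrame by default, df2 itself still contains the stroke column after the split.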

Code: Feature selection using SelectKBest and the ANOVA F-test

import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Score every feature against the target with the ANOVA F-test
fit_features = SelectKBest(score_func=f_classif)
fit_features.fit(x, y)

# Collect the scores in a DataFrame indexed by feature name
fs = pd.DataFrame(fit_features.scores_, index=x.columns,
                  columns=['score values'])

# Show the 7 highest-scoring features
fs.nlargest(7, 'score values')

Output:
Explanation:
This code scores each feature with SelectKBest to identify the most important features for
predicting stroke.
Steps:
1. x = df2.drop('stroke', axis=1): Drops the target column (stroke), storing the features in x.
2. y = df2['stroke']: Stores the target column (stroke) in y.
3. SelectKBest with f_classif: Scores each feature by its ANOVA F-value against the target. No k is
passed, so the selector keeps its default k=10; the code only reads scores_, which covers every feature.
4. fit_features.fit(x, y): Fits the selector on the data, computing one score per feature.
5. pd.DataFrame(fit_features.scores_, ...): Creates a DataFrame of feature scores indexed by column name.
6. fs.nlargest(7, 'score values'): Returns the 7 features with the highest scores.
Purpose:
• Identifies the most relevant features for predicting stroke.
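
The scoring step can be illustrated on synthetic data. This is only a sketch: the feature names informative and noise, the sample size, and the random seed are assumptions made for the example and are unrelated to the stroke dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="target")

# Synthetic features (hypothetical): one shifts with the class, one is pure noise.
x = pd.DataFrame({
    "informative": y * 2.0 + rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# k=1 keeps only the single best-scoring feature.
selector = SelectKBest(score_func=f_classif, k=1).fit(x, y)

# scores_ and pvalues_ hold one entry per input feature, selected or not.
scores = pd.DataFrame(
    {"score values": selector.scores_, "p values": selector.pvalues_},
    index=x.columns,
)
selected = list(x.columns[selector.get_support()])

print(scores)
print(selected)
```

get_support() returns a boolean mask over the input columns, which is one way to recover the names of the features the selector would keep.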

Code:
import matplotlib.pyplot as plt

# Horizontal bar plot of the 7 highest-scoring features
fs.nlargest(7, 'score values').plot(kind="barh", figsize=(10, 5),
                                    color='y', edgecolor='black')
plt.title('Top 7 Features by Score Value', fontsize=16)
plt.xlabel('Score Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()

Output:
Explanation:
This code generates a horizontal bar plot to visualize the top 7 features based on their score values from
the SelectKBest feature selection.
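
One detail of horizontal bar plots is that the first row is drawn at the bottom, so the descending output of nlargest places the best feature last on the axis. The sketch below shows one possible way to put the highest score at the top. The scores and feature names are placeholders invented for illustration, and the Agg backend is an assumption so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores, invented for illustration only.
fs = pd.DataFrame(
    {"score values": [3.2, 41.5, 12.0, 0.7]},
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# nlargest returns rows in descending order; barh draws the first row at the
# bottom, so re-sort ascending to place the highest score at the top.
top = fs.nlargest(3, "score values").sort_values("score values")

ax = top.plot(kind="barh", figsize=(10, 5), color="y",
              edgecolor="black", legend=False)
ax.set_title("Top 3 Features by Score Value")
ax.set_xlabel("Score Value")
ax.set_ylabel("Feature")
ax.grid(True, axis="x")
plt.tight_layout()
```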

Code: Correlation

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.heatmap(df2.corr(), annot=True, cmap="coolwarm", linewidths=1)
plt.title("Feature Correlation Matrix")
plt.show()

Output:

Explanation:
This code generates a correlation heatmap to visualize the pairwise relationships between the columns of df2.
1. plt.figure(figsize=(12, 8)): Sets the figure size to 12x8 inches.
2. sns.heatmap(df2.corr(), annot=True, cmap="coolwarm", linewidths=1):
• df2.corr() computes the correlation matrix of df2 (Pearson correlation by default).
• annot=True annotates each cell with its correlation value.
• cmap="coolwarm" sets the color map for visualization (cool colors for
negative, warm for positive correlations).
• linewidths=1 adds separation lines between cells.
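
To make the values in the heatmap concrete, the sketch below computes one Pearson coefficient by hand and compares it with the corresponding cell of DataFrame.corr(). The columns a, b, and c are synthetic and hypothetical, and pearson is a helper written for this example, not a library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"a": rng.normal(size=50)})
df["b"] = 0.8 * df["a"] + rng.normal(scale=0.5, size=50)  # positively related to a
df["c"] = -df["a"] + rng.normal(scale=0.1, size=50)       # negatively related to a


def pearson(u, v):
    """Pearson r: co-variation of u and v divided by the product of their spreads."""
    u_c, v_c = u - u.mean(), v - v.mean()
    return float((u_c * v_c).sum() / np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum()))


corr = df.corr()  # Pearson correlation by default
manual_ab = pearson(df["a"], df["b"])

print(corr.round(3))
print(round(manual_ab, 3))
```

The diagonal of the matrix is always 1, since every column is perfectly correlated with itself, and the matrix is symmetric.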

Code: Top 8 columns

corr_matrix = df2.corr()

# Absolute correlation of each feature with the target; the target's own
# entry (always 1.0) is dropped so it does not take one of the 8 slots
corr_with_target = corr_matrix['stroke'].drop('stroke').abs()

top_8 = corr_with_target.sort_values(ascending=False).head(8)
print(f"Top 8 Features based on Correlation with Target: \n{top_8}")

Output:

Explanation:
This code calculates and displays the 8 features with the highest absolute correlation to the target
variable stroke.
1. df2.corr(): Generates the correlation matrix for all columns in df2.
2. corr_matrix['stroke']: Extracts the correlation of every column with stroke.
3. .drop('stroke'): Removes the target's correlation with itself (always 1.0), which would otherwise
rank first.
4. .abs(): Converts correlations to absolute values to focus on the strength of the
relationships, ignoring the direction.
5. .sort_values(ascending=False).head(8): Sorts the correlation values in descending order and
selects the 8 features with the highest correlation.
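
The ANOVA ranking and the correlation ranking used in this report are closely related when the target has exactly two classes: with n samples and Pearson correlation r between a feature and the target, the one-way ANOVA F-value equals r^2 (n - 2) / (1 - r^2), so both methods order the features identically. The sketch below checks this on synthetic data; the feature names strong, weak, and noise and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(2)
n = 300
y = pd.Series(rng.integers(0, 2, size=n), name="target")

# Synthetic features (hypothetical) with decreasing dependence on the class.
x = pd.DataFrame({
    "strong": 1.5 * y + rng.normal(size=n),
    "weak": 0.3 * y + rng.normal(size=n),
    "noise": rng.normal(size=n),
})

f_scores, _ = f_classif(x, y)                 # ANOVA F-value per feature
r = x.corrwith(y)                             # Pearson r of each feature with the target
f_from_r = (r ** 2) * (n - 2) / (1 - r ** 2)  # two-class identity linking F and r

rank_by_f = list(pd.Series(f_scores, index=x.columns).sort_values(ascending=False).index)
rank_by_r = list(r.abs().sort_values(ascending=False).index)

print(rank_by_f)
print(rank_by_r)
```

This identity holds only for a binary target; with three or more classes the two rankings can differ.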
