Lab report
Course code: CSE326
Course Title: Data Mining and Machine Learning Lab
Lab report: 05
Topic: Feature Selection.
Submitted To:
Name: Sadman Sadik Khan
Designation: Lecturer
Department: CSE
Daffodil International University
Submitted By:
Name: Fardus Alam
ID: 222-15-6167
Section: 62-G
Department: CSE
Daffodil International University
Submission Date: 15-03-2025
Code: Dividing the dataset into input and target
x = df2.drop('stroke', axis=1, inplace=False)
y = df2['stroke']
Explanation:
Here I divide the whole dataset into two parts: the independent variables (x) and the
dependent variable, i.e. the target class (y).
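As a minimal sketch of the same split on a tiny, made-up DataFrame (the column names age and hypertension and all values are assumptions for illustration, not taken from df2):

```python
import pandas as pd

# Hypothetical miniature stand-in for df2; the values are invented.
toy = pd.DataFrame({
    "age": [67, 45, 80, 33],
    "hypertension": [0, 1, 1, 0],
    "stroke": [1, 0, 1, 0],
})

# drop() returns a new DataFrame by default, so inplace=False is optional.
x_toy = toy.drop("stroke", axis=1)
y_toy = toy["stroke"]

print(list(x_toy.columns))      # ['age', 'hypertension']
print(y_toy.tolist())           # [1, 0, 1, 0]
print("stroke" in toy.columns)  # True: the original frame is unchanged
```

Because the original frame is left untouched, it can still be used later for steps that need the target column, such as the correlation matrix.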
Code: Feature Selection using SelectKBest and ANOVA test
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

# Score every feature in x against y with the ANOVA F-test
fit_features = SelectKBest(score_func=f_classif)
fit_features.fit(x, y)

# One row per feature, holding its F-score
fs = pd.DataFrame(fit_features.scores_, index=x.columns,
                  columns=['score values'])

# Seven highest-scoring features
fs.nlargest(7, 'score values')
Output:
Explanation:
This code performs feature selection to identify the most important features for predicting stroke using
SelectKBest.
Steps:
1. SelectKBest with f_classif: Scores each feature by its ANOVA F-value against the target. k is left at its default (10); the ranking is read directly from scores_.
2. x = df2.drop('stroke', axis=1): Drops the target column (stroke), storing features in x.
3. y = df2['stroke']: Stores the target column (stroke) in y.
4. fit_features.fit(x, y): Fits the SelectKBest model on the data to score the features.
5. pd.DataFrame(fit_features.scores_): Creates a DataFrame of feature scores.
6. fs.nlargest(7, 'score values'): Selects the top 7 features with the highest scores.
Purpose:
Identifies the most relevant features for predicting stroke.
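To make the ANOVA F-value concrete, the sketch below computes it by hand for a single feature and compares it with f_classif. The data is invented for illustration (it is not from the stroke dataset). The F-value is the between-class variance divided by the within-class variance, so a feature whose class means are far apart relative to its spread gets a high score.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Made-up data: one feature, two classes (not taken from the stroke dataset).
feature = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
target = np.array([0, 0, 0, 1, 1, 1])

groups = [feature[target == c] for c in np.unique(target)]
grand_mean = feature.mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: (classes - 1) and (samples - classes).
df_between = len(groups) - 1
df_within = len(feature) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# f_classif expects a 2-D feature matrix and returns (F-values, p-values).
f_sklearn, p_values = f_classif(feature.reshape(-1, 1), target)
print(f_manual, f_sklearn[0])  # both 37.5, up to floating-point rounding
```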
Code:
import matplotlib.pyplot as plt

# Horizontal bar chart of the seven highest-scoring features
fs.nlargest(7, 'score values').plot(kind="barh",
    figsize=(10, 5), color='y', edgecolor='black')
plt.title('Top 7 Features by Score Value', fontsize=16)
plt.xlabel('Score Value', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()
Output:
Explanation:
This code generates a horizontal bar plot to visualize the top 7 features based on their score values from
the SelectKBest feature selection.
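One detail of horizontal bar plots is that the first row is drawn at the bottom, so the highest score ends up lowest on the chart. A minimal sketch of one way to put the best feature on top, using invented scores and hypothetical feature names (f_a to f_d), and the non-interactive Agg backend so it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up scores with hypothetical feature names, for illustration only.
scores = pd.DataFrame(
    {"score values": [10.0, 30.0, 20.0, 5.0]},
    index=["f_a", "f_b", "f_c", "f_d"],
)

top = scores.nlargest(3, "score values")  # rows ordered f_b, f_c, f_a
ax = top.plot(kind="barh", legend=False)
ax.invert_yaxis()  # flip the axis so the highest score is drawn at the top
ax.set_xlabel("Score Value")
plt.tight_layout()
```

In a notebook, plt.show() would follow as in the code above.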
Code: Correlation
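import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
# Annotated heatmap of the pairwise correlations between all columns of df2
sns.heatmap(df2.corr(), annot=True, cmap="coolwarm", linewidths=1)
plt.title("Feature Correlation Matrix")
plt.show()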
Output:
Explanation:
This code generates a correlation heatmap to visualize the relationships between features in df2.
Steps:
1. plt.figure(figsize=(12, 8)): Sets the figure size to 12x8 inches.
2. sns.heatmap(df2.corr(), annot=True, cmap="coolwarm", linewidths=1):
   df2.corr() computes the pairwise (Pearson) correlation matrix of df2.
   annot=True annotates each cell with its correlation value.
   cmap="coolwarm" sets the color map (cool colors for negative, warm for positive correlations).
   linewidths=1 adds separation lines between cells.
3. plt.title("Feature Correlation Matrix") and plt.show(): Set the title and display the figure.
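To show what the heatmap cells contain, here is a minimal sketch of DataFrame.corr() on invented data (columns a, b and c are hypothetical). A value of +1 means two columns rise together perfectly, and -1 means one falls exactly as the other rises.

```python
import pandas as pd

# Made-up columns chosen so the correlations are easy to read off.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # increases exactly with a
    "c": [5, 4, 3, 2, 1],   # decreases exactly as a increases
})

corr = toy.corr()  # Pearson correlation is the default method
print(corr.round(2))
# a vs b is 1.0, a vs c is -1.0, and the diagonal is always 1.0
```

The matrix is symmetric, so each pair of columns appears twice in the heatmap, once on each side of the diagonal.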
Code: top 8 columns
corr_matrix = df2.corr()

corr_with_target = corr_matrix['stroke'].abs()

top_8 = corr_with_target.sort_values(ascending=False).head(8)
print(f"Top 8 Features based on Correlation with Target: \n{top_8}")
Output:
Explanation:
This code calculates and displays the 8 columns of df2 with the highest absolute correlation to the
target variable stroke. Because stroke correlates perfectly with itself (1.0), it appears first in the
list, so the output contains the target plus the 7 most correlated features.
1. df2.corr(): Generates the correlation matrix for all columns in df2.
2. corr_matrix['stroke']: Extracts the correlation of every column with stroke.
3. .abs(): Converts correlations to absolute values to focus on the strength of the
relationships, ignoring the direction.
4. .sort_values(ascending=False).head(8): Sorts the correlation values in descending order and
selects the first 8 entries.
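Since corr_matrix['stroke'] also holds the correlation of stroke with itself, the ranking includes the target column. A minimal sketch of a variant that drops the target before ranking is shown below. The helper top_correlated and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def top_correlated(df, target, k):
    """Return the k columns most correlated (in absolute value) with target."""
    # Drop the target's own entry (always 1.0) before ranking.
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k)

# Made-up data: b tracks y most closely (negatively), then a, then c.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [9, 8, 9, 2, 1, 2],
    "c": [1, 0, 1, 0, 1, 0],
    "y": [0, 0, 0, 1, 1, 1],
})

top = top_correlated(toy, "y", 2)
print(top.index.tolist())  # ['b', 'a']
```

Applied to the lab data, the call would be top_correlated(df2, 'stroke', 8), which returns 8 predictor columns rather than the target plus 7.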