Practical No-2
Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations
using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques
to deal with them.
3. Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the following reasons: to change the scale
for better understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution. Reason and
document your approach properly.
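As a starting point for the three tasks above, the scan itself might be sketched as follows. This is a minimal sketch, not the practical's own code: the sample records and the valid ranges in `valid_ranges` are assumptions made up for illustration, and only the column names are taken from this practical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample records; values are invented purely for illustration.
sample = pd.DataFrame({
    "Math_Score": [78, 85, np.nan, 92, -5, 67],
    "Attendance_Percentage": [91, 105, 88, np.nan, 76, 95],
    "Study_Hours_Per_Day": [2.0, 3.5, 1.0, 4.0, 2.5, 14.0],
})

# Task 1: count missing values per column.
missing_counts = sample.isna().sum()

# Task 1: flag inconsistencies, here values outside an assumed valid range.
valid_ranges = {
    "Math_Score": (0, 100),
    "Attendance_Percentage": (0, 100),
    "Study_Hours_Per_Day": (0, 24),
}
out_of_range = {
    col: int((~sample[col].between(lo, hi) & sample[col].notna()).sum())
    for col, (lo, hi) in valid_ranges.items()
}

# Task 3: positive skewness suggests a variable may benefit from a log transformation.
skewness = sample.skew(numeric_only=True)

print(missing_counts)
print(out_of_range)
print(skewness)
```

Checking ranges separately from missing values keeps the two problems distinct: a NaN is absent data, while a score of -5 or an attendance of 105 is present but invalid.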
Python Code:
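import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# 'data' holds the student records; its definition is not shown in this excerpt
academic_df = pd.DataFrame(data)
# Intermediate steps are not shown; the right-hand panel plots the log-transformed variable
plt.subplot(1, 2, 2)
sns.histplot(academic_df['Log_Study_Hours'], kde=True)
plt.title('Log_Study_Hours Distribution')
plt.show()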
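The fragment above draws only the second of two panels. A minimal sketch of a full before-and-after comparison could look like the following. It is an illustration under stated assumptions: the `hours` values are invented, plain matplotlib histograms stand in for `sns.histplot` with a KDE curve, `np.log1p` is assumed as the log transformation, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical right-skewed study hours, invented for illustration.
hours = pd.Series([1, 1, 1.5, 2, 2, 2.5, 3, 3.5, 4, 5, 6, 8],
                  name="Study_Hours_Per_Day")
# log1p(x) = log(1 + x), which stays defined when x is 0.
log_hours = np.log1p(hours).rename("Log_Study_Hours")

# One row, two panels: original variable on the left, transformed on the right.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, series in zip(axes, [hours, log_hours]):
    ax.hist(series, bins=6)
    ax.set_title(f"{series.name} Distribution")
    ax.set_xlabel(series.name)
    ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig("study_hours_before_after.png")  # hypothetical output file name
```

In an interactive session, replacing the `savefig` call with `plt.show()` and dropping the `matplotlib.use("Agg")` line displays the figure instead of writing it to disk.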
Explanation:
• The code starts by creating a sample "Academic Performance" dataset with variables
such as Math_Score, English_Score, Science_Score, Attendance_Percentage, and
Study_Hours_Per_Day.
• Some missing values and inconsistencies are introduced for demonstration purposes.
• Inconsistencies are handled by replacing negative values with NaN, and missing values
are handled by mean imputation, where each NaN is replaced with the mean of its column.
• Outliers are identified using Z-scores, which measure how many standard deviations a
value lies from the column mean, and extreme values are replaced with NaN.
• A log transformation is applied to the 'Study_Hours_Per_Day' variable to reduce
skewness and bring the distribution closer to a normal shape. The result is stored in
the 'Log_Study_Hours' column.
• The code includes visualizations to compare the distribution before and after the log
transformation.
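The steps described above can be sketched end to end as follows. This is a hedged illustration rather than the practical's actual code: the records are invented, the Z-score cutoff of 3 and the use of `np.log1p` are assumptions, and the ordering (mark invalid and extreme values as NaN first, impute last) is a choice made here so that no NaN is left behind after imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical records invented for illustration; not the practical's dataset.
df = pd.DataFrame({
    "Math_Score": [78, 85, np.nan, 92, -5, 67, 71, 88, 90, 64,
                   73, 81, 79, 86, 69, 75, 83, 77, 80, 72],
    "Study_Hours_Per_Day": [1, 1, 1, 1.5, 1.5, 1.5, 2, 2, 2, 2.5,
                            2.5, 3, 3.5, 4, 5, 6, 7, 8, np.nan, 30],
})
numeric_cols = ["Math_Score", "Study_Hours_Per_Day"]

# Step 1: treat negative values as inconsistencies and mark them missing.
df[numeric_cols] = df[numeric_cols].mask(df[numeric_cols] < 0)

# Step 2: mark values more than 3 standard deviations from the mean as missing
# (the cutoff of 3 is an assumed, commonly used threshold).
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
df[numeric_cols] = df[numeric_cols].mask(z_scores.abs() > 3)

# Step 3: mean imputation, done last so every NaN created above gets filled.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Step 4: log transformation; log1p(x) = log(1 + x) stays defined at x = 0.
skew_before = df["Study_Hours_Per_Day"].skew()
df["Log_Study_Hours"] = np.log1p(df["Study_Hours_Per_Day"])
skew_after = df["Log_Study_Hours"].skew()

print(df.isna().sum())
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Comparing the skewness before and after the transformation gives a numeric justification to document alongside the histograms. A log transformation helps only when the variable is right-skewed, so it is worth checking the sign of the skewness before applying it.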
Output:
[5 rows x 6 columns]
[5 rows x 7 columns]