Practical No-2

Uploaded by

Deep Tayade

Date of Conduction: Date of Checking:

Data Wrangling II
Create an “Academic performance” dataset of students and perform the following operations
using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques
to deal with them.
3. Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the following: to change the scale
for better understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and bring the distribution closer to a normal distribution. Explain your
reasoning and document your approach properly.
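
The scan described in task 1 can be sketched as below before any fix is applied. This is a minimal illustration on a small hypothetical DataFrame; the sample values and the assumed valid range of 0 to 100 are not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values and the valid range below are assumptions
sample = pd.DataFrame({
    'Math_Score': [88, np.nan, 64, -5, 120],
    'Attendance_Percentage': [81.2, 98.2, np.nan, 78.5, 79.2],
})

# Count missing values per column
missing_counts = sample.isnull().sum()

# Flag values outside the assumed valid range of 0 to 100
# (comparisons with NaN are False, so missing values are not flagged here)
out_of_range = (sample < 0) | (sample > 100)
inconsistent_counts = out_of_range.sum()

print("Missing values per column:")
print(missing_counts)
print("\nOut-of-range values per column:")
print(inconsistent_counts)
```

Counting first makes it easier to choose a technique: a handful of gaps may suit mean imputation, while a column with many gaps or many out-of-range values may need a different approach.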

Python Code:

# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a random seed for reproducibility

np.random.seed(42)

# 1. Create the "Academic Performance" dataset

# Scores are stored as floats so that NaN can be assigned later without changing the column dtype
data = {
    'Student_ID': range(1, 101),
    'Math_Score': np.random.randint(50, 100, size=100).astype(float),
    'English_Score': np.random.randint(40, 95, size=100).astype(float),
    'Science_Score': np.random.randint(55, 98, size=100).astype(float),
    'Attendance_Percentage': np.random.uniform(70, 100, size=100),
    'Study_Hours_Per_Day': np.random.uniform(1, 6, size=100),
}

academic_df = pd.DataFrame(data)

# Introduce missing values for demonstration
# (.loc slices include both endpoints, so each range blanks 11 rows)

academic_df.loc[10:20, 'Math_Score'] = np.nan
academic_df.loc[30:40, 'English_Score'] = np.nan
academic_df.loc[50:60, 'Science_Score'] = np.nan
academic_df.loc[70:80, 'Attendance_Percentage'] = np.nan

# Display first few rows of the dataset
print("First few rows of the Academic Performance dataset:")
print(academic_df.head())

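# 1. Scan all variables for missing values and inconsistencies

# Mark negative values (inconsistent for scores, percentages and hours) as NaN first,
# then fill every NaN with its column mean so that no gaps remain after this step
value_cols = academic_df.columns.drop('Student_ID')
academic_df[value_cols] = academic_df[value_cols].mask(academic_df[value_cols] < 0)
academic_df[value_cols] = academic_df[value_cols].fillna(academic_df[value_cols].mean())

# Display the updated dataset after handling missing values and inconsistencies
print("\nUpdated dataset after handling missing values and inconsistencies:")
print(academic_df.head())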

# 2. Scan all numeric variables for outliers

# Use Z-scores to identify outliers: values more than 3 standard deviations from the column mean
numeric_vars = ['Math_Score', 'English_Score', 'Science_Score',
                'Attendance_Percentage', 'Study_Hours_Per_Day']

z_scores = ((academic_df[numeric_vars] - academic_df[numeric_vars].mean())
            / academic_df[numeric_vars].std())
outliers = z_scores.abs() > 3

# Replace outliers with NaN (only the numeric columns are touched)

academic_df[numeric_vars] = academic_df[numeric_vars].mask(outliers)

# Display the dataset after handling outliers

print("\nDataset after handling outliers:")
print(academic_df.head())

# 3. Apply data transformations

# Log transformation on 'Study_Hours_Per_Day' to decrease skewness
# (np.log1p computes log(1 + x), which is also defined at x = 0)
academic_df['Log_Study_Hours'] = np.log1p(academic_df['Study_Hours_Per_Day'])

# Display the dataset after the log transformation

print("\nDataset after log transformation:")
print(academic_df.head())

# Visualize the distribution before and after the transformation

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(academic_df['Study_Hours_Per_Day'], kde=True)
plt.title('Study_Hours_Per_Day Distribution')

plt.subplot(1, 2, 2)
sns.histplot(academic_df['Log_Study_Hours'], kde=True)
plt.title('Log_Study_Hours Distribution')

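plt.show()

Explanation:

• The code starts by creating a sample "Academic Performance" dataset of 100 students with
the variables Math_Score, English_Score, Science_Score, Attendance_Percentage, and
Study_Hours_Per_Day.
• Missing values are introduced for demonstration purposes: 11 consecutive rows are set to NaN
in each of Math_Score, English_Score, Science_Score, and Attendance_Percentage. No
inconsistent values are injected, so the inconsistency check finds nothing in this dataset.
• Missing values are filled by mean imputation, and negative values, which would be
inconsistent for scores, percentages, and hours, are replaced with NaN.
• Outliers are identified using Z-scores: any value more than 3 standard deviations from its
column mean is replaced with NaN. Because every column is drawn from a bounded range, no
value in this dataset lies that far from its mean, so the step leaves the data unchanged.
• A log transformation (np.log1p) is applied to the 'Study_Hours_Per_Day' variable and stored
as Log_Study_Hours. The technique is normally used to decrease right skew and bring a
distribution closer to a normal shape. The synthetic values here come from a uniform
distribution, which is already symmetric, so the step demonstrates the method rather than
correcting real skew.
• The code plots side-by-side histograms to compare the distribution before and after the log
transformation.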
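
Whether a log transformation actually reduces skewness can be checked numerically instead of only by eye. The sketch below is an illustration under stated assumptions: it uses a hypothetical right-skewed sample drawn from an exponential distribution, not the uniform data generated in the script, and compares the sample skewness reported by pandas before and after np.log1p.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample (exponential), not the dataset from the script
hours = pd.Series(rng.exponential(scale=2.0, size=1000))

# Sample skewness: about 0 for a symmetric distribution, positive for a right skew
skew_before = hours.skew()
skew_after = np.log1p(hours).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

Running the same two skew() calls on Study_Hours_Per_Day and Log_Study_Hours is one way to document whether the transformation was justified for a given variable.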

Output:

"C:\Users\Ram Kumar Solanki\PycharmProjects\pythonProject\venv\Scripts\python.exe"

"C:\Users\Ram Kumar Solanki\PycharmProjects\MBA_BFS\main.py"
First few rows of the Academic Performance dataset:
   Student_ID  Math_Score  ...  Attendance_Percentage  Study_Hours_Per_Day
0           1        88.0  ...              81.168483             5.847684
1           2        78.0  ...              98.204003             4.572976
2           3        64.0  ...              99.209915             1.205338
3           4        92.0  ...              78.517629             2.994105
4           5        57.0  ...              79.160916             3.167604

[5 rows x 6 columns]

Updated dataset after handling missing values and inconsistencies:
   Student_ID  Math_Score  ...  Attendance_Percentage  Study_Hours_Per_Day
0           1        88.0  ...              81.168483             5.847684
1           2        78.0  ...              98.204003             4.572976
2           3        64.0  ...              99.209915             1.205338
3           4        92.0  ...              78.517629             2.994105
4           5        57.0  ...              79.160916             3.167604

[5 rows x 6 columns]

Dataset after handling outliers:
   Student_ID  Math_Score  ...  Attendance_Percentage  Study_Hours_Per_Day
0           1        88.0  ...              81.168483             5.847684
1           2        78.0  ...              98.204003             4.572976
2           3        64.0  ...              99.209915             1.205338
3           4        92.0  ...              78.517629             2.994105
4           5        57.0  ...              79.160916             3.167604

[5 rows x 6 columns]

Dataset after log transformation:
   Student_ID  Math_Score  ...  Study_Hours_Per_Day  Log_Study_Hours
0           1        88.0  ...             5.847684         1.923911
1           2        78.0  ...             4.572976         1.717929
2           3        64.0  ...             1.205338         0.790881
3           4        92.0  ...             2.994105         1.384819
4           5        57.0  ...             3.167604         1.427341

[5 rows x 7 columns]
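
The head() output above shows only the first five rows, so it cannot confirm whether any value was flagged as an outlier. Counting the flags is more informative. The sketch below uses a small hypothetical series with one implausible score and applies both the Z-score rule from the script and the interquartile range (IQR) rule, which is a swapped-in alternative and not part of the original code.

```python
import pandas as pd

# Hypothetical scores with one implausible value (250) at the end
scores = pd.Series([62, 65, 68, 70, 71, 73, 75, 77, 80, 250], dtype=float)

# Z-score rule: flag values more than 3 standard deviations from the mean
z = (scores - scores.mean()) / scores.std()
z_flags = z.abs() > 3

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3
q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1
iqr_flags = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)

print("Flagged by Z-score rule:", int(z_flags.sum()))
print("Flagged by IQR rule:", int(iqr_flags.sum()))
```

In this small sample the extreme value inflates the standard deviation enough that its Z-score stays below 3, while the IQR rule still flags it. This is one reason to compare more than one rule when choosing an outlier technique.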
