ASSIGNMENT NO.
01
GROUP MEMBERS
MUHAMMAD FARAZ – 221980007
(LEAD)
HASSAN RASHEE – 221980038
COURSE
DATA MINING LAB (A)
INSTRUCTOR
MAHAB KHADDIM
ASSIGNMENT-REPORT
Introduction:
The primary purpose of this task was to analyze the PSLM-2020 dataset using the
World Bank’s definition of poverty as the foundation for exploration. According to
the World Bank, a person lives in extreme poverty if they live on less than $2.15
per day, measured in 2017 purchasing power parity (PPP) dollars. This analysis
processed the PSLM-2020 dataset, reviewed its instruction manual, and prepared the
data for poverty prediction and estimation.
Steps to Complete the Task:
1. Understanding the World Bank Definition of Poverty
The World Bank defines extreme poverty as living on less than $2.15 per person per
day (adjusted for PPP). This threshold determined which features were built and
which criterion was used to classify households as poor in this task.
2. Reviewing the PSLM-2020 Dataset
The PSLM-2020 dataset and its instruction manual were reviewed to:
- Understand the meaning and context of variables.
- Identify income-related fields and other relevant data for poverty analysis, such as
  household size, remittances, and value in kind.
- Gain clarity on the dataset structure and missing-data policies.
3. Data Loading
The datasets (SecE.sav and roster.sav) were loaded into an analytical environment
for inspection. Initial examination included understanding data types, column names,
and the extent of missing values. The instruction manual was used to interpret
variable meanings and ensure accurate data handling.
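The inspection described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the helper names `load_sav` and `missing_share` are hypothetical, `load_sav` assumes that pandas and its optional pyreadstat backend are installed, and the toy rows only stand in for the real roster file.

```python
def load_sav(path):
    """Read an SPSS .sav file into a list of row dicts (assumes pandas + pyreadstat)."""
    import pandas as pd  # imported lazily so missing_share works without pandas
    frame = pd.read_spss(path, convert_categoricals=False)
    return frame.to_dict(orient="records")


def missing_share(rows):
    """Return {column: fraction of rows where the value is None or NaN}."""
    columns = sorted({key for row in rows for key in row})

    def is_missing(value):
        return value is None or value != value  # NaN is the only value unequal to itself

    return {
        col: sum(is_missing(row.get(col)) for row in rows) / len(rows)
        for col in columns
    }


# Toy rows standing in for roster.sav; real column names come from the manual.
sample = [
    {"hhcode": 1, "age": 34.0},
    {"hhcode": 1, "age": None},
    {"hhcode": 2, "age": float("nan")},
    {"hhcode": 2, "age": 8.0},
]
shares = missing_share(sample)
```

With the real files, `missing_share(load_sav("SecE.sav"))` would give the same per-column overview for the income data.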
4. Data Exploration
Exploratory data analysis (EDA) was conducted to:
- Examine distributions of income-related variables.
- Identify relationships between household size, total income, and poverty.
- Highlight anomalies or irregularities in data entries.
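A minimal sketch of this kind of exploration is shown below. The function names are hypothetical, the income values are toy numbers rather than PSLM data, and treating negative amounts as anomalies is an assumed rule, since the report does not say which irregularities were found.

```python
import statistics


def describe(values):
    """Summarise a numeric column: count, min, median, mean, max."""
    return {
        "count": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


def flag_anomalies(values):
    """Return indices of negative entries (an assumed anomaly rule)."""
    return [i for i, v in enumerate(values) if v < 0]


monthly_income = [12000.0, 0.0, 45000.0, -500.0, 18000.0]  # toy values
summary = describe(monthly_income)
suspect = flag_anomalies(monthly_income)
```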
5. Data Cleaning
To prepare the data for analysis:
- Missing values in income-related columns were replaced with zero, assuming
  the absence of income data indicated no income from that source.
- Irrelevant variables were removed based on the instruction manual.
- Descriptive column names were assigned for clarity and consistency.
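The three cleaning rules above could be sketched as follows. The raw column codes (`x1` to `x4`) are placeholders, not actual PSLM variable names, and `clean_row` is a hypothetical helper; the real mapping would come from the instruction manual.

```python
import math

# Placeholder raw-to-descriptive mapping; the real codes come from the manual.
RENAME = {
    "x1": "monthly_income",
    "x2": "annual_income",
    "x3": "remittances",
    "x4": "value_in_kind",
}
KEEP = {"hhcode", *RENAME}  # every other column is treated as irrelevant


def clean_row(row):
    """Keep relevant columns, rename them, and zero-fill missing income values."""
    cleaned = {}
    for key, value in row.items():
        if key not in KEEP:
            continue  # drop irrelevant variables
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        # Only income columns are zero-filled; a missing hhcode is left untouched.
        cleaned[RENAME.get(key, key)] = 0.0 if (missing and key in RENAME) else value
    return cleaned


raw = {"hhcode": 7, "x1": 15000.0, "x2": None, "x3": float("nan"),
       "x4": 2000.0, "misc": "x"}
cleaned = clean_row(raw)
```

Zero-filling encodes the stated assumption that a missing entry means no income from that source; if the manual marks some missing values as refusals or non-responses, those rows would need different handling.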
6. Data Transformation
Key transformations were applied to prepare the data for poverty prediction:
- Household size was calculated by grouping individuals by their household ID
  (hhcode).
- Datasets were merged using a unique household identifier to consolidate
  income data with demographic details.
- Income components, such as monthly income, annual income, remittances,
  and value in kind, were normalized for uniform analysis.
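The household-size and merge steps above can be illustrated with a short sketch. It assumes one roster row per person and one income row per household, which the report does not state explicitly; the function names are hypothetical and the rows are toy data.

```python
from collections import Counter


def household_sizes(roster):
    """Count roster rows per household ID (one row per person is assumed)."""
    return Counter(person["hhcode"] for person in roster)


def merge_on_hhcode(income_rows, sizes):
    """Attach household size to each income row; rows with no roster match are dropped."""
    return [
        {**row, "household_size": sizes[row["hhcode"]]}
        for row in income_rows
        if row["hhcode"] in sizes
    ]


roster = [{"hhcode": 1}, {"hhcode": 1}, {"hhcode": 1}, {"hhcode": 2}]
income = [
    {"hhcode": 1, "monthly_income": 30000.0},
    {"hhcode": 2, "monthly_income": 9000.0},
    {"hhcode": 3, "monthly_income": 5000.0},  # no roster entry, so it is dropped
]
merged = merge_on_hhcode(income, household_sizes(roster))
```

Dropping unmatched households corresponds to an inner join; keeping them would leave household size undefined and make a per-person income impossible to compute.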
7. Feature Engineering
Derived features were created for poverty estimation:
- Total income was calculated as the sum of all income components.
- Each household’s daily income per person was calculated by dividing total
  income by household size and then by the number of days in a month.
- A binary poverty indicator was created based on whether the daily income per
  person was below $2.15.
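The derived features above amount to a few lines of arithmetic, sketched below under several assumptions that are not in the report: all income components are already on a monthly basis, a month has 30 days, and incomes are recorded in local currency and therefore need a PPP conversion factor before comparison with the dollar threshold. The function names and the `ppp_local_per_usd` parameter are hypothetical, and the conversion value in the example is a placeholder, not a real PPP factor.

```python
DAYS_PER_MONTH = 30        # assumption; the report only says "days in a month"
POVERTY_LINE_USD = 2.15    # World Bank threshold cited in the report
INCOME_COMPONENTS = ("monthly_income", "annual_income", "remittances", "value_in_kind")


def total_income(row):
    """Sum all income components (assumed already converted to a monthly basis)."""
    return sum(row[name] for name in INCOME_COMPONENTS)


def daily_income_per_person_usd(row, ppp_local_per_usd):
    """Household monthly income -> PPP dollars per person per day."""
    local_per_person_per_day = total_income(row) / row["household_size"] / DAYS_PER_MONTH
    return local_per_person_per_day / ppp_local_per_usd


def is_poor(row, ppp_local_per_usd):
    """Binary poverty indicator: True if below the $2.15/day threshold."""
    return daily_income_per_person_usd(row, ppp_local_per_usd) < POVERTY_LINE_USD


household = {"monthly_income": 24000.0, "annual_income": 6000.0,
             "remittances": 0.0, "value_in_kind": 0.0, "household_size": 5}
PLACEHOLDER_PPP = 100.0  # illustrative only, NOT an actual PPP conversion factor
daily = daily_income_per_person_usd(household, PLACEHOLDER_PPP)
poor = is_poor(household, PLACEHOLDER_PPP)
```

In this toy case the household has 30,000 per month, or 200 per person per day, which is 2.0 in placeholder PPP dollars and therefore below the threshold.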
8. Validation
The pre-processed data was validated by:
- Verifying calculations for total income and daily income per person.
- Sampling data entries to confirm consistency with the original dataset.
- Ensuring compliance with the World Bank’s poverty threshold criteria.
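Checks of this kind can be expressed as simple per-row rules. The sketch below is one possible set, not the validation actually performed: the column names `total_income`, `daily_income_pp`, and `is_poor` and the `validate` helper are hypothetical, and the rows are toy data.

```python
import math

COMPONENTS = ("monthly_income", "annual_income", "remittances", "value_in_kind")


def validate(row, threshold=2.15):
    """Return a list of human-readable problems found in one processed row."""
    problems = []
    if not math.isclose(row["total_income"], sum(row[c] for c in COMPONENTS)):
        problems.append("total_income does not match its components")
    if row["household_size"] < 1:
        problems.append("household_size must be at least 1")
    if row["is_poor"] != (row["daily_income_pp"] < threshold):
        problems.append("is_poor disagrees with the threshold")
    return problems


good = {"monthly_income": 24000.0, "annual_income": 6000.0, "remittances": 0.0,
        "value_in_kind": 0.0, "total_income": 30000.0, "household_size": 5,
        "daily_income_pp": 2.0, "is_poor": True}
bad = {**good, "total_income": 31000.0, "is_poor": False}

good_problems = validate(good)
bad_problems = validate(bad)
```

Running `validate` over a random sample of rows and comparing the flagged households with the original files is one way to carry out the sampling check listed above.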
9. Analysis and Visualization
Key insights were drawn, focusing on:
- The proportion of households living below the poverty line.
- Variations in income distribution across regions.
- The relationship between household size and poverty status.
Visualizations were generated to highlight these findings, such as bar charts for
poverty proportions and histograms for income distributions.
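The numbers behind a poverty-proportion bar chart can be computed with a small grouping helper. This is a sketch only: `poverty_rate_by` and the `region` column are hypothetical names, and the four households are toy data, not results from the PSLM-2020 analysis.

```python
from collections import defaultdict


def poverty_rate_by(rows, key):
    """Share of households flagged poor within each value of `key`."""
    poor = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row[key]] += 1
        poor[row[key]] += bool(row["is_poor"])
    return {group: poor[group] / total[group] for group in total}


# Toy households; values are illustrative and carry no empirical meaning.
households = [
    {"region": "urban", "household_size": 4, "is_poor": False},
    {"region": "urban", "household_size": 6, "is_poor": True},
    {"region": "rural", "household_size": 8, "is_poor": True},
    {"region": "rural", "household_size": 7, "is_poor": True},
]
by_region = poverty_rate_by(households, "region")
overall = sum(h["is_poor"] for h in households) / len(households)
```

Passing `"household_size"` as the key gives the corresponding breakdown for the household-size comparison, and the resulting dictionary can be handed directly to any plotting library.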
Pre-Processing Approach
1. Guided by the Instruction Manual:
Variable selection, handling, and transformations were informed by the
PSLM-2020 dataset instruction manual.
2. World Bank Poverty Definition as Benchmark:
All calculations and features, such as per-person daily income, were
benchmarked against the $2.15/day PPP threshold.
3. Data Integrity:
Steps were taken to ensure no critical data was lost during cleaning. Columns
were renamed and structured for clarity.
4. Feature Engineering:
Income was aggregated across various sources and normalized to a consistent
scale for effective analysis.
5. Validation:
Results were cross-verified to ensure alignment with the World Bank's poverty
criteria.
Conclusion
This task explored poverty using the PSLM-2020 dataset and the World Bank’s
definition of poverty. Guided by the dataset’s instruction manual, the
pre-processing steps prepared the data for poverty prediction and estimation. The
derived features, including total household income, daily income per person, and
the binary poverty indicator, give a basis for examining household income
disparities and poverty levels, which may help policymakers and stakeholders
address poverty-related challenges.