Datasets
General Datasets
1. Titanic Dataset
• Data Cleaning:
o Handle missing values (e.g., Age, Cabin).
o Remove duplicates if any.
• Preprocessing:
o Convert categorical variables like Sex and Embarked to numerical values using one-hot encoding or label encoding.
• Visualization:
o Plot survival rates by gender, class, or embarkation point.
o Visualize age distribution using histograms.
https://www.kaggle.com/datasets/brendan45774/test-file
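A minimal pandas sketch of the Titanic steps listed above. The rows are invented so the snippet runs without downloading anything, and the column names (Survived, Pclass, Sex, Age, Cabin, Embarked) are assumed to match the linked CSV. Reducing Cabin to a flag is one possible choice, not the only one.

```python
import pandas as pd

# Toy rows standing in for the Kaggle CSV; the values are invented.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 1, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, None, 35.0, None, 22.0],
    "Cabin":    [None, "C85", None, None, "C123", None],
    "Embarked": ["S", "C", "S", None, "S", "S"],
})

# Remove exact duplicate rows (the last toy row repeats the first).
df = df.drop_duplicates().reset_index(drop=True)

# Age: fill gaps with the median of the observed ages.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Cabin is mostly empty, so keep only a "has a cabin record" flag.
df["HasCabin"] = df["Cabin"].notna().astype(int)
df = df.drop(columns="Cabin")

# Embarked: fill with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Survival rate by gender: the numbers a bar chart would display.
survival_by_sex = df.groupby("Sex")["Survived"].mean()

# One-hot encode the categorical columns.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])
```

With the real file, the DataFrame literal would be replaced by pd.read_csv on the downloaded CSV, and survival_by_sex.plot.bar() or df["Age"].plot.hist() would draw the charts.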
2. House Prices
• Data Cleaning:
o Handle missing values in attributes like LotFrontage or GarageYrBlt.
o Standardize and correct inconsistencies in categorical features like Neighborhood.
• Preprocessing:
o Encode categorical variables like Condition or Style.
o Create new features like price per square foot.
• Visualization:
o Plot price distributions using histograms.
o Visualize correlations between house prices and features like lot size or neighborhood.
https://www.kaggle.com/datasets/lespin/house-prices-dataset
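A sketch of the House Prices steps above, assuming the linked file uses the column names of the Kaggle House Prices competition data (Neighborhood, LotFrontage, GrLivArea, SalePrice). The rows are invented, and filling LotFrontage with the neighborhood median is just one reasonable imputation choice.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "Neighborhood": ["NAmes", " names", "CollgCr", "collgcr ", "NAMES"],
    "LotFrontage":  [80.0, None, 65.0, 70.0, None],
    "GrLivArea":    [1000, 1200, 1500, 2000, 800],
    "SalePrice":    [150000, 180000, 240000, 300000, 100000],
})

# Standardize inconsistent category spellings: trim whitespace, unify case.
df["Neighborhood"] = df["Neighborhood"].str.strip().str.lower()

# Fill missing LotFrontage with the median of the same neighborhood.
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)

# New feature: price per square foot of above-ground living area.
df["PricePerSqFt"] = df["SalePrice"] / df["GrLivArea"]

# Correlation between sale price and living area (input for a scatter plot
# or a heatmap cell).
corr = df["SalePrice"].corr(df["GrLivArea"])
```

Cleaning the Neighborhood labels first matters here: the group-wise median would otherwise treat "NAmes" and " names" as different neighborhoods.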
3. Amazon Product Reviews
• Data Cleaning:
o Handle missing reviews or ratings.
o Remove duplicates and irrelevant reviews.
• Preprocessing:
o Perform sentiment analysis on review text.
o Group data by product categories or brands.
• Visualization:
o Plot distributions of ratings using histograms.
o Create word clouds for commonly used terms in reviews.
https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
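A sketch of the cleaning and word-count steps above, assuming the review file has Score and Text columns. The rows and the tiny stopword list are invented for illustration. Sentiment analysis is left out because it needs an extra library or a labelled lexicon.

```python
import re
from collections import Counter

import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "Score": [5, 5, 1, 4, None, 5],
    "Text": [
        "Great taste, great price",
        "Great taste, great price",    # exact duplicate
        "Stale and bland",
        None,                           # missing review text
        "Good coffee",                  # missing rating
        "Great coffee, fast delivery",
    ],
})

# Drop rows missing either field, then drop exact duplicates.
df = df.dropna(subset=["Score", "Text"]).drop_duplicates().reset_index(drop=True)

# Rating distribution: the counts a histogram of ratings would show.
rating_counts = df["Score"].astype(int).value_counts().sort_index()

# Word frequencies: the input a word cloud would be drawn from.
STOPWORDS = {"and", "the", "a", "of"}  # deliberately tiny, illustrative list
words = Counter(
    w
    for text in df["Text"]
    for w in re.findall(r"[a-z']+", text.lower())
    if w not in STOPWORDS
)
top_words = words.most_common(3)
```

A word-cloud library can typically be fed the words counter directly; the counting step is the part that depends on the cleaning above.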
4. Retail Store Sales
• Data Cleaning:
o Handle missing sales or customer data.
o Correct inconsistencies in product categories.
• Preprocessing:
o Aggregate sales by region, category, or time period.
o Create new metrics like average sales per customer.
• Visualization:
o Plot sales trends over time using line charts.
o Use pie charts to show sales contributions by product category.
o Create bar charts for top-performing regions.
https://www.kaggle.com/datasets/kyanyoga/sample-sales-data
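A sketch of the aggregation steps above. The column names (CUSTOMERNAME, PRODUCTLINE, TERRITORY, SALES) are assumptions about the linked file, and the rows are invented.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "CUSTOMERNAME": ["Acme", "Acme", "Bolt", "Bolt", "Cira"],
    "PRODUCTLINE":  ["Classic Cars", "classic cars", "Trucks", "Trucks ", "Classic Cars"],
    "TERRITORY":    ["EMEA", "EMEA", "APAC", "APAC", "EMEA"],
    "SALES":        [100.0, 200.0, 300.0, None, 400.0],
})

# Rows without a sales amount cannot be aggregated; drop them.
df = df.dropna(subset=["SALES"]).copy()

# Correct inconsistent category labels (stray spaces, mixed case).
df["PRODUCTLINE"] = df["PRODUCTLINE"].str.strip().str.title()

# Total sales by region: bar-chart input.
sales_by_territory = df.groupby("TERRITORY")["SALES"].sum()

# Share of total sales per product line: pie-chart input.
share_by_line = df.groupby("PRODUCTLINE")["SALES"].sum() / df["SALES"].sum()

# Average sales per customer: total sales divided by distinct customers.
avg_sales_per_customer = df["SALES"].sum() / df["CUSTOMERNAME"].nunique()
```

Dropping the rows with missing sales is the simplest option; whether imputing them is more appropriate depends on why the values are missing.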
5. Student Performance Dataset
• Data Cleaning:
o Handle missing entries for gender, parental education, or test scores.
o Standardize categories like test preparation status.
• Preprocessing:
o Create new metrics like average test score.
o Group data by gender or parental education level.
• Visualization:
o Plot test score distributions using histograms.
o Use bar charts to compare performance across genders or parental education levels.
o Create box plots comparing scores by test preparation status, and scatter plots showing correlations between subject scores (e.g., math vs. reading).
https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
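A sketch of the Student Performance steps above. The column names ("gender", "test preparation course", "math score", "reading score", "writing score") are assumed to match the linked CSV; the rows are invented.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "test preparation course": ["completed", "none", " Completed", "None"],
    "math score": [70, 60, 90, 50],
    "reading score": [80, 60, 90, 40],
    "writing score": [90, 60, 90, 30],
})

# Standardize the test preparation labels (stray spaces, mixed case).
prep = "test preparation course"
df[prep] = df[prep].str.strip().str.lower()

# New metric: average of the three test scores per student.
score_cols = ["math score", "reading score", "writing score"]
df["average score"] = df[score_cols].mean(axis=1)

# Group comparisons: the values a bar chart would show.
avg_by_prep = df.groupby(prep)["average score"].mean()
avg_by_gender = df.groupby("gender")[score_cols].mean()
```

Because test preparation is a category rather than a number, a grouped mean (or a box plot per group) is usually more informative than a raw scatter plot against it.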
Real-World Applications
6. Supermarket Sales
• Data Cleaning:
o Check for duplicates and remove them.
o Handle inconsistent entries in Customer Type or City.
• Preprocessing:
o Convert Date column to datetime format.
o Aggregate sales by month or category.
• Visualization:
o Visualize sales trends over time using line charts.
o Create a pie chart showing sales distribution by payment type.
https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
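A sketch of the date handling and aggregation above. It assumes Date, Payment, and Total columns and month/day/year date strings; both the names and the date format should be checked against the actual file. The rows are invented.

```python
import pandas as pd

# Toy rows; values are invented, column names and date format are assumed.
df = pd.DataFrame({
    "Date": ["1/5/2019", "1/20/2019", "2/3/2019", "2/3/2019", "3/9/2019"],
    "Payment": ["Cash", "Ewallet", "Cash", "Cash", "Credit card"],
    "Total": [100.0, 50.0, 200.0, 200.0, 150.0],
})

# Check for duplicates and remove them.
df = df.drop_duplicates().reset_index(drop=True)

# Parse the month/day/year strings into real datetimes.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Aggregate sales by calendar month: line-chart input.
monthly_sales = df.groupby(df["Date"].dt.to_period("M"))["Total"].sum()

# Share of sales by payment type: pie-chart input.
payment_share = df.groupby("Payment")["Total"].sum() / df["Total"].sum()
```

Passing an explicit format avoids silent day/month swaps on ambiguous dates such as 2/3/2019.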
Health and Social Sciences
7. World Happiness Report
• Data Cleaning:
o Handle missing socioeconomic indicators.
o Remove duplicates or invalid country records.
• Preprocessing:
o Scale happiness scores and other indicators.
o Create regional aggregates (e.g., average happiness by continent).
• Visualization:
o Visualize happiness scores on a world map.
o Create scatter plots comparing happiness to GDP or freedom scores.
https://www.kaggle.com/datasets/unsdsn/world-happiness
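A sketch of the World Happiness steps above. Column names differ between the yearly files in this dataset; the ones used here (Country, Region, Happiness Score, Economy (GDP per Capita)) are assumptions, the rows are invented, and Region stands in for the continent grouping. Min-max scaling is one common choice for the scaling step.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["North", "North", "South", "South"],
    "Happiness Score": [7.0, 6.0, 4.0, 5.0],
    "Economy (GDP per Capita)": [1.4, None, 0.6, 1.0],
})

gdp = "Economy (GDP per Capita)"

# Fill a missing indicator with the mean of the country's region.
df[gdp] = df[gdp].fillna(df.groupby("Region")[gdp].transform("mean"))

# Min-max scale each indicator to the [0, 1] range.
for col in ["Happiness Score", gdp]:
    lo, hi = df[col].min(), df[col].max()
    df[col + " (scaled)"] = (df[col] - lo) / (hi - lo)

# Regional aggregate: average happiness per region.
regional_avg = df.groupby("Region")["Happiness Score"].mean()

# Correlation between happiness and GDP: summarizes the scatter plot.
corr = df["Happiness Score"].corr(df[gdp])
```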
8. Diabetes Dataset
• Data Cleaning:
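o Treat zeros in columns like Glucose, BloodPressure, or BMI as missing values (a zero reading is not physiologically plausible) and replace them with the column mean or median.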
• Preprocessing:
o Normalize health metrics (e.g., Glucose, BMI) so that features measured on different scales are comparable.
• Visualization:
o Visualize distributions of features like BMI and glucose levels.
o Use heatmaps to show correlations between features.
https://www.kaggle.com/datasets/mathchi/diabetes-data-set
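A sketch of the Diabetes steps above, assuming Glucose, BloodPressure, BMI, and Outcome columns; the rows are invented. The zeros are first converted to missing values so they do not distort the median, and z-score standardization is used here as one common form of normalization.

```python
import numpy as np
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0],
    "Outcome": [1, 0, 1, 0, 1],
})

# A zero in these columns is not a real measurement; treat it as missing.
cols = ["Glucose", "BloodPressure", "BMI"]
df[cols] = df[cols].replace(0, np.nan)

# Impute with the column median, computed from the non-missing values only.
df[cols] = df[cols].fillna(df[cols].median())

# Standardize to zero mean and unit variance (z-scores).
z = (df[cols] - df[cols].mean()) / df[cols].std()

# Correlation matrix: the values a heatmap would color.
corr = df.corr()
```

Outcome is left untouched on purpose: there a zero is a valid class label, not a missing value.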
Entertainment and Media
9. Netflix Movies and TV Shows
• Data Cleaning:
o Handle missing values in attributes like Director or Cast.
o Remove duplicate records for movies or shows.
• Preprocessing:
o Encode categorical variables like Genre and Country.
o Create new features like the release decade or duration category (e.g., short, medium, long).
• Visualization:
o Plot counts of content types (movies vs. TV shows).
o Visualize the distribution of genres using bar charts.
o Show trends in content release over years using line plots.
https://www.kaggle.com/datasets/shivamb/netflix-shows
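A sketch of the feature-creation steps above. It assumes type, title, director, release_year, and duration columns, with duration stored as text such as "90 min" or "2 Seasons". The rows are invented, and the 60 and 120 minute cut points are arbitrary illustrative choices.

```python
import pandas as pd

# Toy rows; values are invented, column names and formats are assumed.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
    "title": ["A", "B", "C", "D", "A"],
    "director": ["X", None, None, "Y", "X"],
    "release_year": [1994, 2008, 2019, 2021, 1994],
    "duration": ["55 min", "100 min", "2 Seasons", "150 min", "55 min"],
})

# Remove duplicate records and label missing directors explicitly.
df = df.drop_duplicates().reset_index(drop=True)
df["director"] = df["director"].fillna("Unknown")

# Release decade, e.g. 1994 -> 1990.
df["decade"] = df["release_year"] // 10 * 10

# Movie runtime in minutes; TV shows are measured in seasons and stay missing.
minutes = df["duration"].str.extract(r"(\d+) min")[0].astype(float)

# Bucket runtimes into short / medium / long.
df["duration_category"] = pd.cut(
    minutes, bins=[0, 60, 120, float("inf")], labels=["short", "medium", "long"]
)

# Counts for the bar chart (movies vs. TV shows) and the yearly line plot.
type_counts = df["type"].value_counts()
titles_per_year = df.groupby("release_year").size()
```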
10. Spotify Tracks Dataset
• Data Cleaning:
o Remove duplicate tracks.
o Handle missing genres or artist data.
• Preprocessing:
o Scale numerical columns like Popularity or Duration_ms.
o Group data by artists or genres for aggregation.
• Visualization:
o Plot top genres using bar charts.
o Visualize trends in track popularity over time.
https://www.kaggle.com/datasets/zaheenhamidani/ultimate-spotify-tracks-db
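A sketch of the Spotify steps above. The lowercase column names (track_id, artist_name, genre, popularity, duration_ms) are assumptions and may differ in capitalization from the actual file; the rows are invented.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t3", "t4"],
    "artist_name": ["Ann", "Ann", "Bo", None, "Bo"],
    "genre": ["Pop", "Dance", "Rock", "Pop", "Rock"],
    "popularity": [80, 80, 40, 60, 20],
    "duration_ms": [180000, 180000, 240000, 200000, 300000],
})

# Keep the first row per track_id; a track listed under two genres
# would otherwise be counted twice.
df = df.drop_duplicates(subset="track_id").reset_index(drop=True)
df["artist_name"] = df["artist_name"].fillna("Unknown")

# Min-max scale the numeric columns to [0, 1].
for col in ["popularity", "duration_ms"]:
    lo, hi = df[col].min(), df[col].max()
    df[col + "_scaled"] = (df[col] - lo) / (hi - lo)

# Track count and mean popularity per genre: bar-chart input.
genre_stats = (
    df.groupby("genre")
      .agg(tracks=("track_id", "count"), mean_popularity=("popularity", "mean"))
      .sort_values("tracks", ascending=False)
)
```

Deduplicating on track_id discards the extra genre labels; if genre analysis is the goal, deduplicating on the (track_id, genre) pair may be the better choice.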
Geographical and Environmental
11. Global Temperature Data
• Data Cleaning:
o Handle missing temperature records.
o Ensure proper datetime formatting.
• Preprocessing:
o Aggregate data by year or decade for trend analysis.
• Visualization:
o Plot temperature trends over time.
o Use choropleth maps to show regional temperature changes.
https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data
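A sketch of the Global Temperature steps above, assuming a date column named dt and a LandAverageTemperature column; the rows are invented. Missing records are simply dropped here, and interpolation would be an alternative.

```python
import pandas as pd

# Toy rows; values are invented, column names are assumed.
df = pd.DataFrame({
    "dt": ["1990-01-01", "1990-07-01", "1999-01-01",
           "2000-01-01", "2000-07-01", "2009-07-01"],
    "LandAverageTemperature": [2.0, 14.0, None, 3.0, 15.0, 16.0],
})

# Ensure proper datetime formatting and use the date as the index.
df["dt"] = pd.to_datetime(df["dt"])
df = df.sort_values("dt").set_index("dt")

# Handle missing temperature records by dropping them.
df = df.dropna(subset=["LandAverageTemperature"])

# Aggregate by year and by decade for trend analysis.
yearly = df["LandAverageTemperature"].groupby(df.index.year).mean()
decade = df.index.year // 10 * 10
decadal = df["LandAverageTemperature"].groupby(decade).mean()
```

Either yearly or decadal can be passed to a line plot. Note that a year with only summer or only winter readings left after dropping rows will have a biased mean, so it is worth checking the record count per period before plotting.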