GD Topics & Datasets
Dataset 2:
The dataset on life expectancy and health factors for 193 countries was collected from the same
WHO data repository website, and the corresponding economic data was collected from the United Nations
website. The dataset covers immunization factors, mortality factors, economic factors, social factors
and other health-related factors. Since the observations in this dataset are organized by
country, it becomes easier for a country to determine which predicting factor is contributing to a lower
value of life expectancy. Provide suggestions on which areas should be given importance in order to
efficiently improve the life expectancy of a country's population.
https://www.kaggle.com/kumarajarshi/life-expectancy-who
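One possible way to compare how strongly each factor is associated with life expectancy is to fit a linear regression on standardized variables and compare the coefficients. The sketch below is only an illustration: the choice of linear regression, the two factor names (schooling, adult_mortality) and the toy numbers are assumptions, not values taken from the dataset.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Least-squares coefficients after z-scoring every column of X and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    yz = (y - y.mean()) / y.std()
    # Centered data needs no intercept term.
    coef, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return coef

# Toy, made-up values for five hypothetical countries (not real data).
factor_names = ["schooling", "adult_mortality"]
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
life_expectancy = 60.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

coef = standardized_coefficients(X, life_expectancy)
# Rank factors by the absolute size of their standardized coefficient.
ranking = sorted(zip(factor_names, coef), key=lambda item: -abs(item[1]))
for name, value in ranking:
    print(f"{name}: {value:+.3f}")
```

A negative coefficient marks a factor associated with lower life expectancy. With real data, correlated factors can make individual coefficients unstable, so the ranking should be read with care.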
Dataset 3:
The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart
diseases. Half the deaths in the United States and other developed countries are due to cardiovascular
diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in
high-risk patients and in turn reduce complications. This dataset contains items such as sex, age, smoking
habits, previous diseases and heart rate. Analyse the data to pinpoint the most relevant risk factors of
heart disease and to predict the overall risk using logistic regression.
Dataset 4:
A large company named XYZ employs, at any given point in time, around 4000 employees. However,
every year, around 15% of its employees leave the company and need to be replaced from the talent
pool available in the job market. The management believes that this level of attrition (employees
leaving, either on their own or because they were fired) is bad for the company, for the
following reasons:
1. The former employees’ projects get delayed, which makes it difficult to meet timelines,
resulting in a reputation loss among consumers and partners
2. A sizeable department has to be maintained, for the purposes of recruiting new talent
3. More often than not, the new employees have to be trained for the job and/or given time to
acclimatise themselves to the company
Hence, the management has contracted an HR analytics firm to understand what factors they should
focus on, in order to curb attrition. In other words, they want to know what changes they should
make to their workplace, in order to get most of their employees to stay. Also, they want to know
which of these variables is most important and needs to be addressed right away.
Since you are one of the star analysts at the firm, this project has been given to you.
Goal of the case study
You are required to model the probability of attrition using a logistic regression. The results thus
obtained will be used by the management to understand what changes they should make to their
workplace, in order to get most of their employees to stay.
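As a rough illustration of the modelling step, the sketch below fits a logistic regression by plain gradient descent and ranks the predictors by the absolute size of their coefficients. The two predictors (satisfaction, overtime) are hypothetical names, and the data are synthetic, generated only to make the example runnable; they are not columns or values from the case-study files.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted attrition probability
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of the mean log-loss
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    """Probability of attrition for each row of X."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ w + b)))

# Synthetic, standardized predictors (hypothetical names, not real data).
rng = np.random.default_rng(0)
n = 500
satisfaction = rng.normal(size=n)
overtime = rng.normal(size=n)
true_logit = -1.5 * satisfaction + 0.5 * overtime - 1.0
attrition = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([satisfaction, overtime])
w, b = fit_logistic(X, attrition)
# Larger absolute coefficient = stronger association on standardized inputs.
ranking = sorted(zip(["satisfaction", "overtime"], w), key=lambda t: -abs(t[1]))
print(ranking)
```

Comparing coefficient sizes is only meaningful when the predictors are on a common scale, which is why the synthetic inputs here are standardized.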
Dataset 5: Factors affecting stroke
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally,
responsible for approximately 11% of total deaths. Analyse this dataset to predict whether a patient is likely
to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each
row in the data provides relevant information about the patient.
https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Dataset 3 source: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression
Dataset 4 source: https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study
Context
The Challenge - One challenge of modeling retail data is the need to make decisions based on limited
history. Holidays and select major events come once a year, and so does the chance to see how
strategic decisions impacted the bottom line. In addition, markdowns are known to affect sales – the
challenge is to predict which departments will be affected and to what extent.
Content
You are provided with historical sales data for 45 stores located in different regions - each store
contains a number of departments. The company also runs several promotional markdown events
throughout the year. These markdowns precede prominent holidays, the four largest of which are
the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are
weighted five times higher in the evaluation than non-holiday weeks.
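The five-fold holiday weighting can be sketched as a weighted error measure. The text states only the weighting; using mean absolute error as the underlying metric, and the function name weighted_mae, are assumptions made for illustration.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error where holiday weeks count holiday_weight times as much."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total / sum(weights)

# Made-up numbers: one ordinary week and one holiday week.
score = weighted_mae(
    actual=[100.0, 200.0],
    predicted=[110.0, 180.0],
    is_holiday=[False, True],
)
print(score)
```

Under such a measure, an error made in a holiday week costs as much as the same error repeated over five ordinary weeks, so forecast accuracy around the holidays dominates the score.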
Within the Excel sheet, there are 3 tabs: Stores, Features and Sales
Stores
Anonymized information about the 45 stores, indicating the type and size of store
Features
Contains additional data related to the store, department, and regional activity for the given dates.
• Store - the store number
• Date - the week
• Temperature - average temperature in the region
• Fuel_Price - cost of fuel in the region
• MarkDown1-5 - anonymized data related to promotional markdowns. MarkDown data is
only available after Nov 2011, and is not available for all stores all the time. Any missing
value is marked with an NA
• CPI - the consumer price index
• Unemployment - the unemployment rate
• IsHoliday - whether the week is a special holiday week
Sales
Historical sales data, which covers 2010-02-05 to 2012-11-01. Within this tab you will find the
following fields:
• Store - the store number
• Dept - the department number
• Date - the week
• Weekly_Sales - sales for the given department in the given store
• IsHoliday - whether the week is a special holiday week
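The three tabs can be combined into one modelling table, for example with pandas. The sketch below assumes the tabs have already been loaded as DataFrames; the Stores column names Type and Size are assumptions based on the description above, and all values are made up.

```python
import pandas as pd

def join_tabs(stores, features, sales):
    """Attach weekly features and store attributes to each sales row."""
    merged = sales.merge(features, on=["Store", "Date", "IsHoliday"], how="left")
    return merged.merge(stores, on="Store", how="left")

# Tiny made-up frames standing in for the three tabs (not real data).
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [150000]})
features = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.3, 38.5],
    "IsHoliday": [False, True],
})
sales = pd.DataFrame({
    "Store": [1, 1, 1],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-12"],
    "Weekly_Sales": [1000.0, 2000.0, 1500.0],
    "IsHoliday": [False, False, True],
})

table = join_tabs(stores, features, sales)
print(table)
```

A left join keeps every sales row even when no matching feature row exists, which matters because the MarkDown fields are missing for part of the period.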
The Task
1. Predict the department-wide sales for each store for the following month
2. Model the effects of markdowns on holiday weeks
3. Provide recommended actions based on the insights drawn, with prioritization placed on
largest business impact
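For the first task, a simple baseline helps judge whether a more elaborate model is worth its complexity. The sketch below forecasts each store and department by the mean of its most recent weeks; the four-week window, the function name and the sample records are assumptions for illustration, not part of the brief.

```python
from collections import defaultdict

def trailing_mean_forecast(records, window=4):
    """Forecast weekly sales per (store, dept) as the mean of the last `window` weeks.

    records: iterable of (store, dept, date_iso, weekly_sales) tuples,
    with dates as ISO strings such as "2012-10-26".
    """
    history = defaultdict(list)
    # ISO date strings sort chronologically, so a plain sort orders the weeks.
    for store, dept, _date, sales in sorted(records, key=lambda r: r[2]):
        history[(store, dept)].append(sales)
    return {
        key: sum(values[-window:]) / len(values[-window:])
        for key, values in history.items()
    }

# Made-up records for illustration.
records = [
    (1, 1, "2012-09-28", 10.0),
    (1, 1, "2012-10-05", 20.0),
    (1, 1, "2012-10-12", 30.0),
    (1, 1, "2012-10-19", 40.0),
    (1, 1, "2012-10-26", 50.0),
    (1, 2, "2012-10-26", 7.0),
]
forecast = trailing_mean_forecast(records)
print(forecast)
```

This baseline ignores holidays and markdowns entirely, so any model built for the second task should at least outperform it on holiday weeks.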
Dataset 2
LOGISTIC REGRESSION - HEART DISEASE PREDICTION
Introduction
The World Health Organization has estimated that 12 million deaths occur worldwide every year due to
heart diseases. Half the deaths in the United States and other developed countries are due to
cardiovascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on
lifestyle changes in high-risk patients and in turn reduce complications. This dataset is intended to
pinpoint the most relevant risk factors of heart disease and to predict the overall risk using
logistic regression.
Variables
Each attribute is a potential risk factor. There are demographic, behavioral, and medical risk
factors.
Demographic:
• Sex: male or female (Nominal)
• Age: age of the patient (Continuous; although the recorded ages have been truncated to whole
numbers, the concept of age is continuous)
Behavioral:
• Current Smoker: whether or not the patient is a current smoker (Nominal)
• Cigs Per Day: the number of cigarettes that the person smoked on average in one day (can be
considered continuous, as one can smoke any number of cigarettes, even half a cigarette)
Medical (history):
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)
• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)
• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
• Diabetes: whether or not the patient had diabetes (Nominal)
Medical (current):
• Tot Chol: total cholesterol level (Continuous)
• Sys BP: systolic blood pressure (Continuous)
• Dia BP: diastolic blood pressure (Continuous)
• BMI: Body Mass Index (Continuous)
• Heart Rate: heart rate (Continuous; in medical research, variables such as heart rate, though in fact
discrete, are considered continuous because of the large number of possible values)
• Glucose: glucose level (Continuous)
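Before fitting a logistic regression, the nominal and continuous variables listed above are usually prepared differently. The sketch below is a minimal illustration of that step: the key names are hypothetical spellings of the listed variables (the actual column names in the file may differ), nominal values are assumed to be already coded as 0 or 1 (with 1 standing for male in sex), and the two patient records are made up.

```python
import numpy as np

# Hypothetical key names mirroring the variable list above.
NOMINAL = ["sex", "current_smoker", "bp_meds",
           "prevalent_stroke", "prevalent_hyp", "diabetes"]
CONTINUOUS = ["age", "cigs_per_day", "tot_chol", "sys_bp",
              "dia_bp", "bmi", "heart_rate", "glucose"]

def build_design_matrix(rows):
    """Return a matrix with 0/1 nominal columns followed by z-scored continuous columns."""
    nominal = np.array([[float(r[k]) for k in NOMINAL] for r in rows])
    cont = np.array([[float(r[k]) for k in CONTINUOUS] for r in rows])
    mean = cont.mean(axis=0)
    std = cont.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return np.hstack([nominal, (cont - mean) / std])

# Two made-up patient records (not real data).
rows = [
    {"sex": 1, "current_smoker": 0, "bp_meds": 0, "prevalent_stroke": 0,
     "prevalent_hyp": 1, "diabetes": 0, "age": 40, "cigs_per_day": 0,
     "tot_chol": 190, "sys_bp": 110, "dia_bp": 70, "bmi": 25.0,
     "heart_rate": 80, "glucose": 75},
    {"sex": 0, "current_smoker": 1, "bp_meds": 0, "prevalent_stroke": 0,
     "prevalent_hyp": 0, "diabetes": 0, "age": 60, "cigs_per_day": 20,
     "tot_chol": 230, "sys_bp": 150, "dia_bp": 90, "bmi": 29.0,
     "heart_rate": 65, "glucose": 100},
]
X = build_design_matrix(rows)
print(X.shape)
```

Putting the continuous variables on a common scale makes the fitted coefficients easier to compare when looking for the most relevant risk factors.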