SPA Group 13 - Assignment 2 Problem Statement

Uploaded by: 2023dc04090

Objective of Assignment:

• To apply Machine Learning models to the given dataset.
• To prepare a Jupyter notebook (or a Google Colab notebook) that builds, trains and evaluates
Machine Learning models using MLlib with PySpark DataFrames on Databricks for the given dataset.
• To provide an appropriate analysis of the models, run predictions on the test data and display
the results for inference.

Please read the instructions carefully.

Dataset - https://www.kaggle.com/datasets/hosammhmdali/heart-disease-dataset

1. Import Libraries/Dataset
a. Download the dataset
b. Import the required libraries

2. Data Visualization and Exploration

a. Print at least 5 rows as a sanity check to identify all the features present in the dataset and
to confirm that the target column is consistent with them.
b. Print the description (summary statistics) and the shape of the dataset.
c. Provide appropriate visualizations to gain insight into the dataset.
d. Explore the data and state what insights can be drawn from it.

3. Data Pre-processing and cleaning

a. Pre-process the data appropriately: identify NULL or missing values (if any), handle outliers
if present, and address skewed data. Apply appropriate feature engineering techniques for each.
b. Apply feature transformation techniques such as standardization and normalization.
You are free to choose the transformations that suit the structure and complexity of your
dataset.
c. Perform a correlation analysis on the dataset and provide a visualization of it.

4. Data Preparation
a. Do the final feature selection and extract the selected features into Column X and the class
label into Column Y.
b. Split the dataset into training and test sets.

5. Model Building
a. Develop at least three models, each trained separately. You are free to apply any
Machine Learning models to the dataset using MLlib with PySpark. Deep Learning models are
strictly not allowed.
b. Train each model and print its training accuracy and loss values.

6. Performance Evaluation
a. Print the confusion matrix and provide an appropriate analysis of it.
b. Run predictions on the test data and display the results for inference.

Instructions for Assignment Evaluation
1. Since this is a group assignment, only one ZIP file needs to be uploaded to Canvas. It must
contain two files: the HTML export and the .ipynb notebook.
2. Please follow the naming convention <Group no>_<Dataset name>.ipynb and
<Group no>_<Dataset name>.html
E.g., for group 1 with a weather dataset, the files should be named
Group01_WeatherDataset.ipynb and Group01_WeatherDataset.html
3. Inside each Jupyter notebook, you are required to mention your name, group details and the
assignment dataset you are working on.
4. Organize your code in separate sections for each task. Add comments to make the code
readable.
5. Deep Learning models are strictly not allowed. You are encouraged to learn classical Machine
Learning techniques and experience their behaviour.
6. Notebooks without output will not be considered for evaluation.

Mark Allocation - 10 Marks
1. Import Libraries/Dataset - 1 mark
2. Data Visualization and Exploration - 2 marks
3. Data Pre-processing and cleaning - 2 marks
4. Data Preparation - 2 marks
5. Model Building - 2 marks
6. Performance Evaluation - 1 mark

References:
https://docs.databricks.com/getting-started/dataframes-python.html
https://www.kaggle.com/code/towhidultonmoy/end-to-end-pyspark-project
https://www.kaggle.com/code/tientd95/advanced-pyspark-for-exploratory-data-analysis
