Questionnaire - Case Study
Dataset:
You can get the dataset from the URL below:
Alzheimers-Disease-and-Healthy-Aging-Data
Tasks:
I. Data Ingestion:
o Create an S3 bucket to store the dataset.
o Use an AWS mechanism (for example, AWS Glue) to extract the data from the S3 bucket
and load it into an Amazon Redshift data warehouse (a SQL sketch follows below).
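For the load step, a hedged alternative sketch (not the Glue job named above, just a compact illustration) is Redshift's COPY command once the file sits in S3. The table definition, bucket path, and IAM role below are assumptions, not values supplied by the assignment.

-- Hypothetical target table in Redshift; column names are assumed from the
-- public Alzheimer's Disease and Healthy Aging dataset and should be adjusted.
CREATE TABLE IF NOT EXISTS aging_raw (
    year_start            INT,
    year_end              INT,
    locationdesc          VARCHAR(100),
    class                 VARCHAR(100),
    topic                 VARCHAR(200),
    question              VARCHAR(500),
    data_value            DECIMAL(10,2),
    low_confidence_limit  DECIMAL(10,2),
    high_confidence_limit DECIMAL(10,2)
);

-- Load the CSV straight from S3; the bucket name and IAM role are placeholders.
COPY aging_raw
FROM 's3://my-aging-data-bucket/alzheimers_healthy_aging.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
CSV
IGNOREHEADER 1;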
II. Data Cleaning and Transformation:
o Use a data processing framework on AWS EMR (for example, Apache Spark) to clean and
transform the data. This may involve tasks such as (a SQL sketch follows this list):
▪ Handling missing values (if needed)
▪ Removing outliers (if needed)
▪ Normalizing data (if needed)
▪ Creating derived features
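A minimal, hedged sketch of these cleaning steps as one SQL statement, runnable as Spark SQL on an EMR cluster (or directly in Redshift). The aging_raw/aging_clean table names and the data_value, low_confidence_limit, and high_confidence_limit columns are assumptions about the dataset's schema.

-- Drop rows with missing measurements, discard outliers beyond 3 standard
-- deviations, min-max normalize the value, and derive a confidence_range feature.
CREATE TABLE aging_clean AS
SELECT
    year_start,
    locationdesc,
    class,
    topic,
    question,
    data_value,
    (data_value - mn) / NULLIF(mx - mn, 0)       AS data_value_norm,   -- normalization
    high_confidence_limit - low_confidence_limit AS confidence_range   -- derived feature
FROM (
    SELECT
        *,
        MIN(data_value)    OVER () AS mn,
        MAX(data_value)    OVER () AS mx,
        AVG(data_value)    OVER () AS avg_v,
        STDDEV(data_value) OVER () AS sd
    FROM aging_raw
    WHERE data_value IS NOT NULL                 -- handle missing values
) t
WHERE ABS(data_value - avg_v) <= 3 * sd;         -- remove outliers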
III. Data Analysis:
o Use Power BI or a machine learning framework like Amazon SageMaker to perform
exploratory data analysis and extract insights. This may include (an example query follows this list):
▪ Calculating summary statistics
▪ Creating visualizations
▪ Building predictive models
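For the summary-statistics item, a hedged example query over the cleaned table from the previous step (same assumed column names):

-- Per-topic summary statistics.
SELECT
    topic,
    COUNT(*)           AS n_rows,
    AVG(data_value)    AS mean_value,
    MIN(data_value)    AS min_value,
    MAX(data_value)    AS max_value,
    STDDEV(data_value) AS std_dev
FROM aging_clean
GROUP BY topic
ORDER BY mean_value DESC;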
IV. Data Visualization:
o Use a tool like Amazon QuickSight to create interactive dashboards and
visualizations to communicate the insights to stakeholders.
Deliverables:
A detailed design document outlining the data pipeline architecture, data ingestion and
transformation steps, and analysis techniques.
The AWS project and code used to implement the pipeline.
A presentation summarizing the key findings and insights from the data analysis.
Note: Remember to monitor costs when using AWS services. For example, running an EMR
cluster 24/7 can be expensive.
Objective:
Design and implement a data pipeline using a relational database management system (RDBMS)
to ingest, transform, and analyze a health data dataset to derive key insights.
Dataset:
Alzheimers-Disease-and-Healthy-Aging-Data
Tasks:
1. Data Ingestion:
2. Data Cleaning and Transformation:
Use SQL queries to clean and transform the data, including tasks like:
o Handling missing values (if needed)
o Removing outliers (if needed)
o Normalizing the data (if needed)
o Creating new columns based on existing data (derived features)
3. Data Analysis:
4. Data Visualization:
Deliverables:
A design document describing the pipeline, steps for data ingestion, transformation, and
analysis techniques.
The SQL scripts used for data cleaning, transformation, and analysis.
A summary presentation highlighting key findings and visualizations.
WORKING ON ASSIGNMENT 2
Objective:
Design and implement a data pipeline using a relational database management system (RDBMS)
to ingest, transform, and analyze a health data dataset to derive key insights.
Dataset:
Alzheimers-Disease-and-Healthy-Aging-Data
NOTE: This file contains 284,143 records, so I am taking only the first 100 rows to apply
the functions.
Tasks:
1. Data Ingestion:
Fig: 1.1 Load the dataset into the database using MySQL Workbench.
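Fig 1.1 shows the load being done through MySQL Workbench; an equivalent, hedged SQL sketch is below. The file path is a placeholder, and the column list is an assumption about the CSV's schema (only a subset of columns is shown).

-- Staging table for the raw CSV (assumed columns only).
CREATE TABLE alzheimer_raw (
    year_start            INT,
    locationdesc          VARCHAR(100),
    class                 VARCHAR(100),
    topic                 VARCHAR(200),
    question              VARCHAR(500),
    data_value            DECIMAL(10,2),
    data_value_unit       VARCHAR(50),
    low_confidence_limit  DECIMAL(10,2),
    high_confidence_limit DECIMAL(10,2)
);

-- Bulk-load the CSV; requires local_infile to be enabled and the path adjusted.
LOAD DATA LOCAL INFILE '/path/to/Alzheimers_Disease_and_Healthy_Aging_Data.csv'
INTO TABLE alzheimer_raw
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;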
2. Data Cleaning and Transformation:
Use SQL queries to clean and transform the data, including tasks like:
o Handling missing values (if needed)
o Removing outliers (if needed)
o Normalizing the data (if needed)
o Creating new columns based on existing data (derived features).
Copy all the raw data into a new table named “Alzheimer”. The copy is made to preserve the
original data: if any change goes wrong, we still have the original to fall back on.
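A minimal sketch of that copy step, assuming the raw import landed in a table named alzheimer_raw (as in the ingestion sketch above):

-- Work on a copy so the original import stays untouched.
CREATE TABLE Alzheimer AS
SELECT *
FROM alzheimer_raw
LIMIT 100;   -- only the first 100 rows are used in this exercise (see the note above)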
A “Row_Num” of 1 marks the first occurrence of a row; a value of 2 or higher means the row is a duplicate.
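A hedged sketch of that duplicate check using ROW_NUMBER() (MySQL 8+), partitioned by the Class, Topic, and Question columns referenced in Fig 2.2.1.1; the column names are assumed to exist in the table.

-- Row_Num = 1 marks the first occurrence; 2 or more marks a duplicate.
SELECT *
FROM (
    SELECT
        a.*,
        ROW_NUMBER() OVER (
            PARTITION BY class, topic, question
            ORDER BY locationdesc
        ) AS Row_Num
    FROM Alzheimer a
) ranked
WHERE Row_Num >= 2;   -- returns no rows when there are no duplicates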
Fig 2.2.1.1: There are no duplicate values in the table for the combination of Class, Topic,
and Question.
So, before making any modification, first check how important the affected values are.
Here every row holds data that relates to its other column values. The rows could be deleted,
which is not a good option, or the values could be updated without modifying the original table.
So, instead of deleting the rows, I will update them with new values (a hypothetical scenario).
First, check how many values are affected.
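A hedged sketch of counting those blank (empty-string) values and then updating them; data_value_unit is used here purely as a hypothetical example column.

-- Count rows whose text column is blank rather than NULL.
SELECT COUNT(*) AS blank_rows
FROM Alzheimer
WHERE TRIM(data_value_unit) = '';

-- Replace the blanks with a placeholder value instead of deleting the rows.
UPDATE Alzheimer
SET data_value_unit = 'Unknown'
WHERE TRIM(data_value_unit) = '';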
Fig 2.3.4: There are 150 rows with blank values matching this join.
Now, check the values for NULL to decide whether to delete the data or change it; to delete it,
I need to be sure that I really want to. Honestly, I am not 100% sure how useful this NULL data
is, so I am deleting the rows that contain NULL values.
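A minimal sketch of the NULL check and deletion, again with data_value as an assumed column name.

-- How many rows have a NULL measurement?
SELECT COUNT(*) AS null_rows
FROM Alzheimer
WHERE data_value IS NULL;

-- Remove them, since their usefulness for this analysis is unclear.
DELETE FROM Alzheimer
WHERE data_value IS NULL;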
Break down complex datasets into smaller, digestible segments to prevent information
overload. Use multiple layers to present intricate relationships.
Include only the most relevant data points that support the insights. Avoid overcrowding
visualizations with unnecessary data.
1. Enhanced Understanding.
2. Insight Generation.
3. Effective Communication.
4. Storytelling.
5. Improved Decision-making.