BIG DATA ANALYTICS
Assignment: 1
Submitted To:
Dr. Manisha Jailia Ma’am
Submitted By:
Srishti Kiran (2216079)
Srishty Kashyap(2216080)
Suhana Singh(2216081)
Btech CS-AI 4th year
Q.1 Discuss at least 3 visualization tools in detail; also give the merits and demerits
of each.
Do descriptive/diagnostic analytics on any secondary data set with either
Plotly, Orange, Tableau, or any other visualization tool.
(*** Download any data set from github/kaggle related to
healthcare/agriculture/football match etc. ***)
Note: Visualization must be clear and generate a summarized report.
Solution:
Data visualization tools help to represent raw data in graphical or pictorial form,
making patterns, trends, and insights easier to understand. Below are three widely
used tools:
1.) Tableau: It is one of the most widely used tools for turning raw data into
meaningful insights. Instead of writing complex code, users can simply drag and
drop data fields to create charts, graphs, and dashboards. It connects easily with
multiple data sources like Excel sheets, SQL databases, or cloud platforms, which
makes it very flexible for different industries. The main strength of Tableau is its
ability to let people explore data visually and interactively, which helps in spotting
patterns and trends quickly.
MERITS:
● Tableau is very easy to use because of its drag-and-drop feature, so even
people without technical skills can create charts and dashboards.
● It can connect with many different data sources such as Excel, SQL, and
cloud platforms, which makes it flexible in real projects.
● Dashboards in Tableau are interactive, which helps in exploring data and
telling a clear story through visuals.
● It is widely accepted in industry, so learning it gives good career benefits.
DEMERITS:
● Tableau is a paid tool, and the full version can be quite expensive for
students or small organizations.
● While basic charts are easy, using advanced features needs extra practice and
training.
● It may slow down when working with very large datasets, which sometimes
affects performance.
2.) Plotly: It is an open-source visualization library that allows users to create
highly interactive charts with just a few lines of code. It supports multiple
programming languages such as Python, R, and JavaScript, making it popular
among data scientists and developers. One of its strengths is the ability to build
dynamic dashboards and 3D visualizations, which are very useful in fields like
finance, healthcare, and research.
MERITS:
● Free and open-source, with professional features available in its commercial
version.
● Produces modern, interactive, and web-friendly charts.
● Supports advanced visualizations like heatmaps, 3D surfaces, maps, and
animations.
● Works seamlessly with Python libraries, making it a favorite in data science.
DEMERITS:
● Requires coding knowledge, which may discourage complete beginners.
● More complex customizations may take additional time and effort.
● Dashboards are functional but not as polished or business-oriented as
Tableau’s.
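For illustration, a minimal Plotly sketch in Python shows how an interactive chart can be produced in a few lines. It uses the iris sample dataset that ships with Plotly Express; the specific chart and columns are only an example, not part of the assignment dataset:
import plotly.express as px

# Load the sample iris dataset bundled with Plotly Express
df = px.data.iris()

# Build an interactive scatter plot; hovering over points shows their values
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 title="Iris sepal measurements")
fig.show()  # renders the chart in a browser tab or notebook output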
3.) Power BI: Developed by Microsoft, it is a powerful business analytics tool
that helps transform raw numbers into meaningful and interactive visuals. It allows
users to build dashboards and reports without needing advanced technical skills,
making it beginner-friendly. The tool connects seamlessly with Excel, databases,
and even cloud services, enabling real-time insights. With its simple drag-and-drop
interface, both professionals and newcomers can easily explore data. Overall,
Power BI supports organizations in making smarter, data-driven decisions by
turning complex information into clear and engaging stories.
MERITS:
● Very easy to use, especially for those already familiar with Excel.
● Provides real-time dashboards, so updates are seen instantly.
● Works well with Microsoft products and many data sources.
DEMERITS:
● Can feel a bit complex when using advanced formulas or DAX.
● Performance may slow down with very large datasets.
● Licensing plans may confuse beginners due to multiple versions.
REPORT
1. Introduction
This project report provides a comprehensive analysis of a Netflix user-based
dataset. The primary objective is to perform both descriptive and diagnostic
analytics to uncover key insights about user behavior, demographics, and
subscription trends. The analysis was conducted using Power BI, a powerful
business intelligence tool, which allowed for the creation of interactive and
informative visualizations.
2. Dataset Overview
The dataset for this project contains detailed information on individual Netflix
users. The key fields include:
● User ID: A unique identifier for each user.
● Subscription Type: The plan the user is subscribed to (e.g., Basic, Standard,
Premium).
● Monthly Revenue: The amount of revenue generated from each user per
month.
● Joining Date: The date the user began their subscription.
● Country: The user's country of residence.
● Age: The age of the user.
● Gender: The user's gender.
● Device: The device used for streaming (e.g., Mobile, TV, Desktop).
3. Descriptive Analytics: What the Data Shows
Descriptive analytics helps us summarize and describe the main features of the
dataset.
● Customer Demographics: The analysis shows a clear picture of our user
base. A look at the age distribution reveals the most common age ranges
among Netflix customers. The distribution of customers by gender is also
clearly illustrated.
● Revenue and Subscription Analysis: We also analyzed the financial data,
providing insights into our monthly revenue and the different types of
subscription plans our users are on. The distribution of monthly revenue is
visualized below, which helps us understand typical user spending.
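Although the dashboards were built in Power BI, the same descriptive summaries could be reproduced in Python with pandas. The sketch below assumes the dataset is saved as netflix_userbase.csv with the column names listed in the Dataset Overview; the file name and exact column spellings are assumptions and should be adjusted to the downloaded file:
import pandas as pd

# Load the Netflix user dataset (file name and column names are assumptions)
df = pd.read_csv("netflix_userbase.csv")

# Customer demographics: age distribution and gender split
print(df["Age"].describe())
print(df["Gender"].value_counts())

# Revenue and subscription analysis: plan popularity and average revenue per plan
print(df["Subscription Type"].value_counts())
print(df.groupby("Subscription Type")["Monthly Revenue"].mean())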
4. Diagnostic Analytics: The Why Behind the Data
Diagnostic analytics goes beyond a simple summary to help us understand the
relationships and reasons behind certain trends.
● Geographical Insights: By analyzing the Country field, we can see where
our largest customer bases are located. The map visualization provides a
clear, at-a-glance view of customer concentration across different regions.
● A more detailed breakdown reveals which devices are most popular in
different countries, which can be valuable for targeted marketing efforts.
● User Behavior and Preferences: This part of the analysis explores how
different variables relate to each other. For example, by comparing
Subscription Type with Gender, we can see if there are any differences in the
plans men and women choose.
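A rough pandas equivalent of these diagnostic views, under the same file-name and column-name assumptions as the descriptive sketch above, could be:
import pandas as pd

df = pd.read_csv("netflix_userbase.csv")  # assumed file name, as above

# Geographical insights: number of users and device preferences per country
print(df["Country"].value_counts())
print(pd.crosstab(df["Country"], df["Device"]))

# User behavior: do men and women choose different subscription plans?
print(pd.crosstab(df["Gender"], df["Subscription Type"], normalize="index"))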
5. Conclusion
The descriptive and diagnostic analysis of the Netflix dataset provides a solid
foundation for understanding our user base. We have identified key characteristics
of our customers, from their age and gender to their geographical location and
preferred devices. The insights gained from this project can be used to inform
future business strategies, such as developing new subscription plans, launching
targeted advertising campaigns, or focusing on growth in specific regions.
Q.2 What do you mean by statistical inference? Why does statistical inference
play an important role in data analytics? Explain the chi-square test method with
suitable examples.
Statistical inference is the process of drawing conclusions or making predictions
about a population based on data collected from a sample of that population. It
involves using statistical methods to analyze sample data and make inferences or
predictions about parameters or characteristics of the entire population from which the
sample was drawn.
Statistical inference is based on probability theory and probability distributions. It
involves making assumptions about the population and the sample, and using
statistical models to analyze the data.
Key Points:
1. Population vs. Sample
○ Population: The entire dataset (all records, events, or users).
○ Sample: A smaller, manageable subset of the data.
○ Since analyzing the whole population in big data is often impractical, we
infer properties from samples.
2. Purpose in Big Data Analytics
○ Helps understand trends, patterns, and relationships in large datasets.
○ Supports decision-making under uncertainty.
○ Useful in prediction, hypothesis testing, and model validation.
3. Main Techniques of Statistical Inference
○ Estimation: Predicting population parameters (e.g., mean, variance)
using sample statistics.
○ Hypothesis Testing: Checking if assumptions about data (e.g., "Does a
new algorithm improve accuracy?") are statistically valid.
○ Regression & Correlation Analysis: Inferring relationships between
variables.
○ Bayesian Inference: Updating beliefs as more data becomes available.
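The first two techniques above, estimation and hypothesis testing, can be sketched in Python with SciPy. The sample values below are made up purely for illustration:
import numpy as np
from scipy import stats

# Hypothetical sample: response times (ms) measured on a random sample of requests
sample = np.array([120, 135, 128, 140, 118, 132, 125, 138, 130, 127])

# Estimation: 95% confidence interval for the population mean (t-distribution)
mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")

# Hypothesis testing: is the population mean different from 125 ms?
t_stat, p_value = stats.ttest_1samp(sample, popmean=125)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")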
Important role:
1. Generalization Beyond the Sample: We usually can’t analyze the entire population (too large, costly, or time-consuming). Inference helps extend insights from a smaller dataset to the whole population. Example: testing a drug on 500 patients to infer its effects on millions.
2. Decision-Making Under Uncertainty: Data always contains randomness and noise. Inference quantifies this uncertainty using confidence intervals, p-values, and probability models, helping managers and scientists make data-driven decisions instead of guesses.
3. Hypothesis Testing: Lets us check whether observed effects are real or due to chance. Example: “Does a new marketing strategy actually improve sales, or is it random variation?”
4. Prediction & Forecasting: Inference forms the basis of predictive modeling (regression, time series, machine learning). It estimates parameters and validates whether predictions are statistically sound.
Without statistical inference, data analytics would only describe the past (descriptive statistics). With inference, we can predict, test, and make confident decisions about the future and the unseen population.
Chi-Square Test
The Chi-Square test (χ² test) is a statistical hypothesis testing method used to
determine whether there is a significant association between categorical variables or
whether the observed data fits an expected distribution.
It is based on the comparison between observed frequencies (O) and expected
frequencies (E).
Formula:
χ² = Σ [(O − E)² / E]
Where:
● O = Observed frequency
● E = Expected frequency
Types of Chi-Square Tests
1. Chi-Square Test of Independence – Checks whether two categorical variables are
related.
2. Chi-Square Goodness-of-Fit Test – Checks whether an observed frequency
distribution fits an expected distribution.
3. Chi-Square Test for Homogeneity – Compares categorical distributions across
multiple populations.
Test of Independence Example
A company wants to check if gender and preference for a product are related.
Gender    Like    Dislike    Total
Male      30      20         50
Female    25      25         50
Total     55      45         100
● Null Hypothesis (H₀): Gender and product preference are independent.
● Alternative Hypothesis (H₁): Gender and product preference are related.
● Calculate expected frequencies using E = (row total × column total) / grand total:
E(Male, Like) = 50×55/100 = 27.5, E(Male, Dislike) = 50×45/100 = 22.5
E(Female, Like) = 50×55/100 = 27.5, E(Female, Dislike) = 50×45/100 = 22.5
● Compute χ²:
χ² = (30−27.5)²/27.5 + (20−22.5)²/22.5 + (25−27.5)²/27.5 + (25−22.5)²/22.5 ≈ 1.01
● Degrees of freedom = (rows − 1)(columns − 1) = 1; at 5% significance the critical χ² = 3.84.
● Since 1.01 < 3.84 → Fail to reject H₀: the data gives no evidence that gender and product preference are related.
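The same test can be run programmatically; a minimal SciPy sketch is shown below (correction=False is passed so the result matches the plain chi-square formula used in the hand calculation):
from scipy.stats import chi2_contingency

# Observed 2x2 table from the example: rows = Male/Female, columns = Like/Dislike
observed = [[30, 20],
            [25, 25]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.3f}")
# chi2 ≈ 1.01 with p ≈ 0.31 > 0.05, so we fail to reject H₀ (no evidence of association)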
The Chi-Square Test helps us find relationships or differences between categories. Its
main uses are:
1. Feature Selection in Machine Learning: It helps decide if a categorical feature
(like color or product type) is important for predicting the target (like sales or
satisfaction), improving model performance.
2. Testing Independence: It checks if two categorical variables are related or
independent. For example, whether age or gender affects product preferences.
3. Assessing Model Fit: It helps check if a model’s predicted categories match the
actual data, which is useful to improve classification models.
Chi-Square Goodness-of-Fit Test Example
A die is rolled 60 times, and the outcomes are recorded as:
1: 8, 2: 10, 3: 9, 4: 12, 5: 11, 6: 10
Is the die fair?
Step 1: Hypotheses
H0: The die is fair (all outcomes are equally likely).
H1: The die is not fair.
Step 2: Expected Frequencies
If fair, expected frequency = 60/6 = 10 for each outcome.
Step 3: Compute χ²
χ² = (8-10)²/10 + (10-10)²/10 + (9-10)²/10 + (12-10)²/10 + (11-10)²/10 + (10-10)²/10
= (4/10) + 0 + (1/10) + (4/10) + (1/10) + 0
= 1.0
Step 4: Degrees of Freedom
df = k - 1 = 6 - 1 = 5
Step 5: Decision
At 5% significance, critical χ² = 11.07.
Since 1.0 < 11.07 → Fail to reject H0.
Conclusion: The die appears to be fair.
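The same goodness-of-fit test can be checked in Python; a short SciPy sketch of the dice example (SciPy's chisquare assumes a uniform expected distribution when none is given):
from scipy.stats import chisquare

# Observed counts of faces 1-6 from 60 rolls; expected is 10 for each face
observed = [8, 10, 9, 12, 11, 10]
stat, p = chisquare(observed)
print(f"chi2 = {stat:.2f}, p-value = {p:.3f}")
# chi2 = 1.00, p ≈ 0.96 > 0.05, so we fail to reject H0: the die appears fair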
Applications of Chi-Square Test in Data Analytics
1. Market Research – Testing if product preferences differ across demographics.
2. Healthcare Analytics – Checking whether treatment types and patient outcomes are
related.
3. Social Sciences – Analyzing relationships like education level vs job type.
4. Quality Control – Ensuring defect rates align with expected values.
5. E-commerce – Studying differences in purchase behavior across regions.