BIG DATA ANALYTICS
Assignment: 1
Submitted To:
Dr. Manisha Jailia Ma’am
Submitted By:
Srishti Kiran (2216079)
Srishty Kashyap(2216080)
Suhana Singh(2216081)
Btech CS-AI 4th year
Q.1 Discuss at least 3 visualization tools in detail; also give the merits and demerits
of each.
Do descriptive/diagnostic analytics on any secondary data set with either
Plotly, Orange, Tableau, or any other visualization tool.
(*** Download any data set from github/kaggle related to
healthcare/agriculture/football match etc. ***)
Note: Visualization must be clear and generate a summarized report.
Solution:
Data visualization tools help to represent raw data in graphical or pictorial form,
making patterns, trends, and insights easier to understand. Below are three widely
used tools:
1.) Tableau: It is one of the most widely used tools for turning raw data into
meaningful insights. Instead of writing complex code, users can simply drag and
drop data fields to create charts, graphs, and dashboards. It connects easily with
multiple data sources like Excel sheets, SQL databases, or cloud platforms, which
makes it very flexible for different industries. The main strength of Tableau is its
ability to let people explore data visually and interactively, which helps in spotting
patterns and trends quickly.
MERITS:
● Tableau is very easy to use because of its drag-and-drop feature, so even
people without technical skills can create charts and dashboards.
● It can connect with many different data sources such as Excel, SQL, and
cloud platforms, which makes it flexible in real projects.
● Dashboards in Tableau are interactive, which helps in exploring data and
telling a clear story through visuals.
● It is widely accepted in industry, so learning it gives good career benefits.
DEMERITS:
● Tableau is a paid tool, and the full version can be quite expensive for
students or small organizations.
● While basic charts are easy, using advanced features needs extra practice and
training.
● It may slow down when working with very large datasets, which sometimes
affects performance.
2.) Plotly: It is an open-source visualization library that allows users to create
highly interactive charts with just a few lines of code. It supports multiple
programming languages such as Python, R, and JavaScript, making it popular
among data scientists and developers. One of its strengths is the ability to build
dynamic dashboards and 3D visualizations, which are very useful in fields like
finance, healthcare, and research.
MERITS:
● Free and open-source, with professional features available in its commercial
version.
● Produces modern, interactive, and web-friendly charts.
● Supports advanced visualizations like heatmaps, 3D surfaces, maps, and
animations.
● Works seamlessly with Python libraries, making it a favorite in data science.
DEMERITS:
● Requires coding knowledge, which may discourage complete beginners.
● More complex customizations may take additional time and effort.
● Dashboards are functional but not as polished or business-oriented as
Tableau’s.
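For illustration, a minimal Plotly sketch in Python shows how an interactive chart can be produced in a few lines. It uses the iris sample dataset that ships with Plotly Express; the specific chart and columns are only an example, not part of the assignment dataset:
import plotly.express as px

# Load the sample iris dataset bundled with Plotly Express
df = px.data.iris()

# Build an interactive scatter plot; hovering over points shows their values
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 title="Iris sepal measurements")
fig.show()  # renders the chart in a browser tab or notebook output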
3.) Power BI: Developed by Microsoft, it is a powerful business analytics tool
that helps transform raw numbers into meaningful and interactive visuals. It allows
users to build dashboards and reports without needing advanced technical skills,
making it beginner-friendly. The tool connects seamlessly with Excel, databases,
and even cloud services, enabling real-time insights. With its simple drag-and-drop
interface, both professionals and newcomers can easily explore data. Overall,
Power BI supports organizations in making smarter, data-driven decisions by
turning complex information into clear and engaging stories.
MERITS:
● Very easy to use, especially for those already familiar with Excel.
● Provides real-time dashboards, so updates are seen instantly.
● Works well with Microsoft products and many data sources.
DEMERITS:
● Can feel a bit complex when using advanced formulas or DAX.
● Performance may slow down with very large datasets.
● Licensing plans may confuse beginners due to multiple versions.
REPORT
1. Introduction
This project report provides a comprehensive analysis of a Netflix user-based
dataset. The primary objective is to perform both descriptive and diagnostic
analytics to uncover key insights about user behavior, demographics, and
subscription trends. The analysis was conducted using Power BI, a powerful
business intelligence tool, which allowed for the creation of interactive and
informative visualizations.
2. Dataset Overview
The dataset for this project contains detailed information on individual Netflix
users. The key fields include:
● User ID: A unique identifier for each user.
● Subscription Type: The plan the user is subscribed to (e.g., Basic, Standard,
Premium).
● Monthly Revenue: The amount of revenue generated from each user per
month.
● Joining Date: The date the user began their subscription.
● Country: The user's country of residence.
● Age: The age of the user.
● Gender: The user's gender.
● Device: The device used for streaming (e.g., Mobile, TV, Desktop).
3. Descriptive Analytics: What the Data Shows
Descriptive analytics helps us summarize and describe the main features of the
dataset.
● Customer Demographics: The analysis shows a clear picture of our user
base. A look at the age distribution reveals the most common age ranges
among Netflix customers. The distribution of customers by gender is also
clearly illustrated.
● Revenue and Subscription Analysis: We also analyzed the financial data,
providing insights into our monthly revenue and the different types of
subscription plans our users are on. The distribution of monthly revenue is
visualized below, which helps us understand typical user spending.
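Although the dashboards were built in Power BI, the same descriptive summaries could be reproduced in Python with pandas. The sketch below assumes the dataset is saved as netflix_userbase.csv with the column names listed in the Dataset Overview; the file name and exact column spellings are assumptions and should be adjusted to the downloaded file:
import pandas as pd

# Load the Netflix user dataset (file name and column names are assumptions)
df = pd.read_csv("netflix_userbase.csv")

# Customer demographics: age distribution and gender split
print(df["Age"].describe())
print(df["Gender"].value_counts())

# Revenue and subscription analysis: plan popularity and average revenue per plan
print(df["Subscription Type"].value_counts())
print(df.groupby("Subscription Type")["Monthly Revenue"].mean())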
4. Diagnostic Analytics: The Why Behind the Data
Diagnostic analytics goes beyond a simple summary to help us understand the
relationships and reasons behind certain trends.
● Geographical Insights: By analyzing the Country field, we can see where
our largest customer bases are located. The map visualization provides a
clear, at-a-glance view of customer concentration across different regions.
● A more detailed breakdown reveals which devices are most popular in
different countries, which can be valuable for targeted marketing efforts.
● User Behavior and Preferences: This part of the analysis explores how
different variables relate to each other. For example, by comparing
Subscription Type with Gender, we can see if there are any differences in the
plans men and women choose.
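A rough pandas equivalent of these diagnostic views, under the same file-name and column-name assumptions as the descriptive sketch above, could be:
import pandas as pd

df = pd.read_csv("netflix_userbase.csv")  # assumed file name, as above

# Geographical insights: number of users and device preferences per country
print(df["Country"].value_counts())
print(pd.crosstab(df["Country"], df["Device"]))

# User behavior: do men and women choose different subscription plans?
print(pd.crosstab(df["Gender"], df["Subscription Type"], normalize="index"))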
5. Conclusion
The descriptive and diagnostic analysis of the Netflix dataset provides a solid
foundation for understanding our user base. We have identified key characteristics
of our customers, from their age and gender to their geographical location and
preferred devices. The insights gained from this project can be used to inform
future business strategies, such as developing new subscription plans, launching
targeted advertising campaigns, or focusing on growth in specific regions.
Q.2 What do you mean by statistical inference? Why does statistical inference
play an important role in data analytics? Explain the chi-square test method with
suitable examples.
Statistical inference is the process of drawing conclusions or making predictions
about a population based on data collected from a sample of that population. It
involves using statistical methods to analyze sample data and make inferences or
predictions about parameters or characteristics of the entire population from which the
sample was drawn.
Statistical inference is based on probability theory and probability distributions. It
involves making assumptions about the population and the sample, and using
statistical models to analyze the data.
Key Points:
1. Population vs. Sample
○ Population: The entire dataset (all records, events, or users).
○ Sample: A smaller, manageable subset of the data.
○ Since analyzing the whole population in big data is often impractical, we
infer properties from samples.
2. Purpose in Big Data Analytics
○ Helps understand trends, patterns, and relationships in large datasets.
○ Supports decision-making under uncertainty.
○ Useful in prediction, hypothesis testing, and model validation.
3. Main Techniques of Statistical Inference
○ Estimation: Predicting population parameters (e.g., mean, variance)
using sample statistics.
○ Hypothesis Testing: Checking if assumptions about data (e.g., "Does a
new algorithm improve accuracy?") are statistically valid.
○ Regression & Correlation Analysis: Inferring relationships between
variables.
○ Bayesian Inference: Updating beliefs as more data becomes available.
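The first two techniques above, estimation and hypothesis testing, can be sketched in Python with SciPy. The sample values below are made up purely for illustration:
import numpy as np
from scipy import stats

# Hypothetical sample: response times (ms) measured on a random sample of requests
sample = np.array([120, 135, 128, 140, 118, 132, 125, 138, 130, 127])

# Estimation: 95% confidence interval for the population mean (t-distribution)
mean = sample.mean()
sem = stats.sem(sample)
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")

# Hypothesis testing: is the population mean different from 125 ms?
t_stat, p_value = stats.ttest_1samp(sample, popmean=125)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")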
Important role:
1. Generalization Beyond the Sample: We usually can’t analyze the entire population (too large, costly, or time-consuming). Inference helps extend insights from a smaller dataset to the whole population. Example: testing a drug on 500 patients to infer its effects on millions.
2. Decision-Making Under Uncertainty: Data always contains randomness and noise. Inference quantifies this uncertainty using confidence intervals, p-values, and probability models, helping managers and scientists make data-driven decisions instead of guesses.
3. Hypothesis Testing: Lets us check whether observed effects are real or due to chance. Example: “Does a new marketing strategy actually improve sales, or is it random variation?”
4. Prediction & Forecasting: Inference forms the basis of predictive modeling (regression, time series, machine learning). It estimates parameters and validates whether predictions are statistically sound.
Without statistical inference, data analytics would only describe the past (descriptive statistics). With inference, we can predict, test, and make confident decisions about the future and the unseen population.
Chi-Square Test
The Chi-Square test (χ² test) is a statistical hypothesis testing method used to
determine whether there is a significant association between categorical variables or
whether the observed data fits an expected distribution.
It is based on the comparison between observed frequencies (O) and expected
frequencies (E).
Formula:
χ² = Σ [(O − E)² / E]
Where:
● O = Observed frequency
● E = Expected frequency
Types of Chi-Square Tests
1. Chi-Square Test of Independence – Checks whether two categorical variables are
related.
2. Chi-Square Goodness-of-Fit Test – Checks whether an observed frequency
distribution fits an expected distribution.
3. Chi-Square Test for Homogeneity – Compares categorical distributions across
multiple populations.
Test of Independence Example
A company wants to check if gender and preference for a product are related.
Gender    Like    Dislike    Total
Male      30      20         50
Female    25      25         50
Total     55      45         100
● Null Hypothesis (H₀): Gender and product preference are independent.
● Alternative Hypothesis (H₁): Gender and product preference are related.
● Calculate expected frequencies using E = (row total × column total) / grand total:
E(Male, Like) = 50×55/100 = 27.5, E(Male, Dislike) = 50×45/100 = 22.5
E(Female, Like) = 50×55/100 = 27.5, E(Female, Dislike) = 50×45/100 = 22.5
● Compute χ²:
χ² = (30−27.5)²/27.5 + (20−22.5)²/22.5 + (25−27.5)²/27.5 + (25−22.5)²/22.5 ≈ 1.01
● Degrees of freedom = (rows − 1)(columns − 1) = 1; at 5% significance the critical χ² = 3.84.
● Since 1.01 < 3.84 → Fail to reject H₀: the data gives no evidence that gender and product preference are related.
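The same test can be run programmatically; a minimal SciPy sketch is shown below (correction=False is passed so the result matches the plain chi-square formula used in the hand calculation):
from scipy.stats import chi2_contingency

# Observed 2x2 table from the example: rows = Male/Female, columns = Like/Dislike
observed = [[30, 20],
            [25, 25]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p:.3f}")
# chi2 ≈ 1.01 with p ≈ 0.31 > 0.05, so we fail to reject H₀ (no evidence of association)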
The Chi-Square Test helps us find relationships or differences between categories. Its
main uses are:
1. Feature Selection in Machine Learning: It helps decide if a categorical feature
(like color or product type) is important for predicting the target (like sales or
satisfaction), improving model performance.
2. Testing Independence: It checks if two categorical variables are related or
independent. For example, whether age or gender affects product preferences.
3. Assessing Model Fit: It helps check if a model’s predicted categories match the
actual data, which is useful to improve classification models.
Chi-Square Goodness-of-Fit Test Example
A die is rolled 60 times, and the outcomes are recorded as:
1: 8, 2: 10, 3: 9, 4: 12, 5: 11, 6: 10
Is the die fair?
Step 1: Hypotheses
H0: The die is fair (all outcomes are equally likely).
H1: The die is not fair.
Step 2: Expected Frequencies
If fair, expected frequency = 60/6 = 10 for each outcome.
Step 3: Compute χ²
χ² = (8-10)²/10 + (10-10)²/10 + (9-10)²/10 + (12-10)²/10 + (11-10)²/10 + (10-10)²/10
= (4/10) + 0 + (1/10) + (4/10) + (1/10) + 0
= 1.0
Step 4: Degrees of Freedom
df = k - 1 = 6 - 1 = 5
Step 5: Decision
At 5% significance, critical χ² = 11.07.
Since 1.0 < 11.07 → Fail to reject H0.
Conclusion: The die appears to be fair.
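The same goodness-of-fit test can be checked in Python; a short SciPy sketch of the dice example (SciPy's chisquare assumes a uniform expected distribution when none is given):
from scipy.stats import chisquare

# Observed counts of faces 1-6 from 60 rolls; expected is 10 for each face
observed = [8, 10, 9, 12, 11, 10]
stat, p = chisquare(observed)
print(f"chi2 = {stat:.2f}, p-value = {p:.3f}")
# chi2 = 1.00, p ≈ 0.96 > 0.05, so we fail to reject H0: the die appears fair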
Applications of Chi-Square Test in Data Analytics
1. Market Research – Testing if product preferences differ across demographics.
2. Healthcare Analytics – Checking whether treatment types and patient outcomes are
related.
3. Social Sciences – Analyzing relationships like education level vs job type.
4. Quality Control – Ensuring defect rates align with expected values.
5. E-commerce – Studying differences in purchase behavior across regions.