What Is Hypothesis Testing in Statistics?
Hypothesis testing is a structured method used to determine if the findings
of a study provide evidence to support a specific theory relevant to a larger
population.
Hypothesis Testing is a fundamental statistical technique used to make
decisions or inferences about a population based on a sample of data. You
start by making an assumption (hypothesis) about a population
parameter (like the mean or proportion), and then test whether this
assumption is supported by the sample data.
In simple terms, hypothesis testing helps us decide whether a
statement about a population is likely to be true, based on data
collected from a sample.
Examples:
● A teacher assumes that 60% of his college's students come from
lower-middle-class families.
● A doctor believes that 3D (Diet, Dose, and Discipline) is 90%
effective for diabetic patients.
Statistical analysts validate assumptions by collecting and evaluating a
representative sample from the data set under study.
Why Do We Need Hypothesis Testing?
We use hypothesis testing to:
● Validate or reject our assumptions
● Make data-driven decisions
● Avoid guessing or making decisions based on bias
Real-Life Analogy:
Imagine you're a doctor testing a new medicine. You can't assume it works
just because one person said so. You need to test it on many patients and
analyze the results statistically — that’s hypothesis testing!
There are usually two types of hypotheses:
● Null Hypothesis (H₀) – the default or no-change assumption.
● Alternative Hypothesis (H₁ or Ha) – the claim you're trying to
prove.
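To make the idea concrete, here is a minimal sketch of testing the teacher's 60% claim from the examples above, using Python's scipy. The sample counts (110 out of 200 students) are hypothetical, purely for illustration:

from scipy.stats import binomtest

# H0: the true proportion of lower-middle-class students is 0.60
# H1: the true proportion differs from 0.60
# Hypothetical sample: 110 of 200 surveyed students qualify
result = binomtest(k=110, n=200, p=0.60, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Reject H0: the data contradict the 60% assumption")
else:
    print("Fail to reject H0: the data are consistent with 60%")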
Each type of data needs a different kind of test. If you don’t use the
correct test for the type of data, your result may be wrong or misleading.
So before testing a hypothesis, you must first understand the type of
data you’re working with.
1. Nominal Data
Definition: Data in the form of categories, with no order between
them.
Examples:
● Gender: Male, Female, Other
● Blood Group: A, B, AB, O
● Car Brand: Honda, BMW, Toyota
Example:
You want to test whether equal numbers of students prefer Python, Java,
or C++. This is nominal → use the Chi-square test.
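A minimal sketch of that Chi-square test in Python's scipy, with hypothetical preference counts (under H0, the expected counts are equal across the three languages):

from scipy.stats import chisquare

# Hypothetical observed preference counts: Python, Java, C++
observed = [50, 30, 40]
# chisquare defaults to equal expected counts, matching H0 here
stat, pvalue = chisquare(observed)
print(f"chi-square = {stat:.2f}, p-value = {pvalue:.4f}")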
2. Ordinal Data
Definition: Data in categories with order, but no clear difference in
amount between them.
Examples:
● Customer Ratings: Poor, Average, Good, Excellent
● Pain Level: Mild, Moderate, Severe
● Education Level: High School, Bachelor’s, Master’s, PhD
Why important?
Since the order matters but the distance between categories is
unknown, we use non-parametric tests like Mann-Whitney U test or
Kruskal-Wallis test.
Note: a binary category such as Student Feedback Type (Satisfied / Not
Satisfied) is nominal, not ordinal. If you want to test how many people
prefer different brands, or which blood group is most common, use the
Chi-Square test or proportion tests, which are designed for nominal data.
Example: You want to compare customer satisfaction (rated as Poor, Good,
Excellent) between two banks.
This is ordinal → use the Mann-Whitney U test.
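A minimal sketch of that comparison in scipy, with satisfaction encoded on an ordinal scale (Poor = 1, Good = 2, Excellent = 3) and made-up ratings for the two banks:

from scipy.stats import mannwhitneyu

# Hypothetical ordinal ratings: Poor=1, Good=2, Excellent=3
bank_a = [1, 2, 2, 3, 2, 1, 3, 2]
bank_b = [2, 3, 3, 2, 3, 3, 2, 3]
stat, pvalue = mannwhitneyu(bank_a, bank_b, alternative='two-sided')
print(f"U = {stat}, p-value = {pvalue:.4f}")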
3. Interval Data (Numerical – No True Zero)
Definition: Numbers where you can measure the difference, but zero
doesn’t mean "none".
Examples: Temperature in Celsius, Calendar Years
You can calculate a mean, but ratios don’t make sense (e.g., 20°C is not
twice as hot as 10°C). Think of thermometer readings.
Example: Compare the average daily temperatures of Chennai and Delhi for June.
This is interval → use a t-test.
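A minimal sketch of that t-test in scipy, using small made-up samples of June daily temperatures in °C:

from scipy.stats import ttest_ind

# Hypothetical June daily temperatures (°C)
chennai = [34.2, 35.1, 33.8, 36.0, 34.9, 35.4]
delhi = [38.5, 39.2, 37.8, 40.1, 39.0, 38.7]
stat, pvalue = ttest_ind(chennai, delhi)
print(f"t = {stat:.2f}, p-value = {pvalue:.4f}")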
4. Ratio Data (Numerical – Has True Zero)
Definition: Like interval data, but zero means absence of quantity.
Examples: Salary, Age, Height, Marks, Distance
You can compare ratios (e.g., ₹40,000 is twice ₹20,000)
Example: Test if the average salary of IT employees in Bangalore is different
from that of employees in Mumbai.
This is ratio → use a t-test or ANOVA.
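For two cities a t-test is enough; one-way ANOVA extends the comparison to three or more groups. A minimal sketch with hypothetical salaries for three cities:

from scipy.stats import f_oneway

# Hypothetical monthly salaries (₹) of IT employees in three cities
bangalore = [82000, 91000, 78000, 88000, 95000]
mumbai = [75000, 80000, 72000, 85000, 79000]
chennai = [70000, 74000, 69000, 77000, 72000]
stat, pvalue = f_oneway(bangalore, mumbai, chennai)
print(f"F = {stat:.2f}, p-value = {pvalue:.4f}")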
Data Sources
What is a Data Source:
A data source refers to the origin or provider of the data that is used in
analysis to support decision-making. In Business Analytics, knowing where
the data comes from, how reliable it is, and how it can be used is
foundational.
Example:
A customer database in a retail company acts as a data source.
It contains details like customer names, contact info, purchase history, etc.
This database is accessed by other applications like a recommendation
engine or a sales dashboard — making it a data source.
Real-Time Analogy:
1. Think of a water tank in a house.
● The water tank stores water (data).
● Taps (analytics tools, dashboards) draw water from the tank.
● The water tank is the data source, even if the water came from
somewhere else earlier (e.g., a river or a well).
So, just like a tap doesn't care where the water originally came
from, a program accessing a data source doesn't need to know where the
data originated — only where to find it now.
2. Imagine a library as your data source
The books are the data stored in an organized way (like tables or files).
● A student comes in to borrow a book — this is like a software
application or report accessing the data.
● It doesn’t matter whether the book was donated, bought, or printed
in-house — once it's in the library, it serves as the source for
others.
Even if a book has gone through editing or printing (data
transformation), once it’s on the shelf, it's now the source
anyone can use.
TYPES OF DATA SOURCES:
1. Primary Data Sources:
A Primary Data Source refers to data collected directly from the
original source, typically for a specific purpose or study, without any prior
processing or transformation.
It is first-hand data collected by the analyst for a specific study, meaning it is:
● Collected directly from people, events, or processes
● Gathered using methods like surveys, interviews, sensors, or
experiments
● Raw and original, and has not been previously published or analyzed
Key Characteristics:
● Directly collected
● Specific to the problem
● High accuracy and control
Example: A company conducts an employee satisfaction survey by
emailing a custom questionnaire to all employees.
The responses collected from the survey become primary data, as they
were gathered directly from the source (employees), for the first time.
Real-Time Use Case:
Use Case: Agricultural Yield Prediction
● Scenario: A government department wants to predict crop yield in
rural areas.
● Primary Data Source: Officials go to farms and collect data
through sensors and farmer interviews:
○ Soil moisture levels (via IoT sensors)
○ Farmer-reported seed usage
○ Rainfall patterns via local weather stations
Another Use Case: A startup wants to know why users uninstall their app
→ sends a feedback form post-uninstallation.
2. Secondary Data Sources:
A Secondary Data Source refers to data that has already been
collected, processed, and possibly analyzed by someone else for another
purpose, and is now being reused.
This data is:
● Not original to the user
● Often found in reports, publications, databases, websites
● Used for analysis, reference, or to supplement primary data
It may be aggregated, cleaned, or transformed, and is usually
easier and quicker to access, but may not be perfectly tailored to your
needs.
Key Characteristics:
● Less expensive
● Available faster
● Less control over data quality
Example:
A student writing a thesis on population trends uses Census data
published by the government.
● The data was not collected by the student.
● It was collected for a different (governmental/statistical) purpose.
● It is being reused.
Real-Time Use Case: Health Research Using WHO Data
Scenario:
A university research team is studying the global spread of diabetes
over the last two decades.
Secondary Data Sources Used:
● World Health Organization (WHO) datasets on diabetes rates
by country
● Published journal articles summarizing prior clinical trials
● Health Ministry reports from various governments
● Online health statistics portals (e.g., Data.gov, Statista)
Even if the data is useful and reliable, the researchers have no
control over how it was collected, so it qualifies as secondary data.
Another Use Case: A company expanding to Brazil uses World
Bank data on internet penetration to estimate e-commerce potential.
Data sources play a vital role in helping organisations make better decisions.
There are two main sources of data collection:
1. Internal sources
2. External sources
Internal Data
The data which is generated within an organization is called internal data. It
is easy to access and analyse for business insights, which are then used in
business decisions.
Some of the most commonly used internal data sources are as follows:
1. Operational Data
Operational data includes day-to-day business operations like sales
transactions, customer data, inventory records, and production data.
Example:
A retail store's Point of Sale (POS) system records all product sales
and updates stock levels instantly.
2. Customer Data
It is one of the most crucial types of data, collected directly from customers
using CRM systems, feedback forms, surveys, and customer support systems. It is
used in data analysis to find customers' opinions or sentiments.
Example:
An e-commerce website gathers customer satisfaction ratings and product
reviews after every purchase.
3. Employee Data
It includes human resources data, which can be further analysed for employee
performance, payroll information, and employee satisfaction.
Example:
A company uses an HRMS (Human Resource Management System) to track
employee attendance, monthly salary, and annual performance appraisals.
4. Financial Data
It includes financial data generated through financial systems, such as budgets,
profit and loss statements, balance sheets, and cash flow statements.
Example:
A business uses Tally or QuickBooks to generate quarterly profit & loss
reports for internal financial planning and audits.
5. Marketing Data
It is also considered internal data, collected from marketing campaigns,
website analytics, email marketing, and social media channels.
Example:
A digital marketing team uses Google Analytics to monitor website visitor
behavior and Mailchimp to track email campaign performance.
6. Production Data
Production data is also collected from an organisation's internal sources, such
as manufacturing processes. It covers machine performance, production
output, and quality control metrics.
Example:
A car manufacturing plant uses IoT sensors on machines to collect real-time
production efficiency data and downtime metrics.
Tools: SQL, SAP, Excel, Power BI
External Data
The data which is collected from outside the boundaries of an organisation is
known as external data. External data is used by organisations to assess and
model economic, political, social, and environmental factors that influence
the business.
Some of the most commonly used external data sources are as follows:
1. Public Data
The data which is collected from public platforms like government databases,
journals, magazines, industry reports, and newspapers. It is freely available
from public sources such as government and non-profit agencies. Common
examples of public data include census data, economic indicators, and public
health data.
Example:
The Government of India’s Census portal provides demographic data like
population, literacy rate, and employment statistics which businesses can use
for market research.
2. Social Media Data
The data available on social media platforms like Twitter, Facebook, LinkedIn,
and Instagram. This encompasses user-generated content, user behavior,
trends, sentiment, and engagement metrics.
Example:
A brand analyzes Twitter hashtags and comments during a product launch to
measure customer sentiment and campaign impact.
3. Web Scraping
Data extracted automatically from websites using tools or scripts, such as
product reviews, competitor data, and price comparisons.
Example:
An e-commerce company uses Python and BeautifulSoup to scrape Amazon
and Flipkart product prices for competitor analysis and dynamic pricing.
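A minimal sketch of that kind of scrape, assuming a hypothetical page URL and CSS class names (real sites differ, and many restrict scraping in their terms of service):

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and class names, purely for illustration
url = "https://example.com/products"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull each product's name and price out of the parsed HTML
for item in soup.find_all("div", class_="product"):
    name = item.find("span", class_="name").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    print(name, price)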
4. IoT Data
Data collected from sensors, smart devices, and wearables, providing real-time
data on environmental conditions, usage patterns, and more.
Example:
A smart thermostat collects temperature and energy usage data from homes
and sends it to a central system to optimize energy efficiency.
5. Partner Data
Data shared between business partners, such as suppliers, distributors, or
strategic partners in a supply chain, to improve mutual understanding of
market conditions or customer needs.
Example:
A retail chain receives real-time inventory and delivery status data from
its logistics partner to manage stock and predict supply chain bottlenecks.
6. Open data
Open data is freely usable and available to everybody, often raw or
unstructured. It might not be very relevant to you, though: it may be
high-level or substantially summarized and aggregated, it might not be in
the format you require, and it might be genuinely challenging to understand.
Making the data usable may require a long preparation time before it can be
used for analysis.
Example:
An environmental scientist uses NASA’s open climate datasets to study
global temperature trends—but first needs to clean and reformat the raw data
before analysis.
Tools: APIs, Web Scraping (Python), Postman
Classification Based on Structure:
A. Structured Data:
Structured data refers to highly organized information that is easily
searchable and stored in a predefined schema — usually in rows and
columns, like in spreadsheets or relational databases.
Key Characteristics of Structured Data
● Predefined Schema – Data is organized under columns with specified data types (e.g., int, varchar).
● Tabular Format – Stored in tables (rows = records, columns = attributes).
● Easily Searchable – Queries can be run using SQL to filter, sort, and join data quickly.
● Machine-Readable – Easy for machines to process and for BI tools to visualize.
Real-Time Use Case
Use Case: Fraud Detection in Banking
A bank stores account data (Name, Account Number, Balance,
Transaction Type) in a relational database like Oracle.
Using SQL queries and Python scripts, analysts run anomaly
detection algorithms on transaction patterns (e.g., sudden large
withdrawals at odd hours) to detect fraud.
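A minimal sketch of such a rule-based check, assuming a hypothetical transactions table loaded into pandas (a production system would run statistical or ML-based anomaly detection over far richer features):

import pandas as pd

# Hypothetical transaction records pulled from the relational database
df = pd.DataFrame({
    "account": ["A1", "A1", "A2", "A2"],
    "type": ["withdrawal", "withdrawal", "deposit", "withdrawal"],
    "amount": [2000, 95000, 15000, 120000],
    "hour": [14, 3, 11, 2],  # hour of day (0-23)
})

# Flag large withdrawals made at odd hours (a simple illustrative rule)
suspicious = df[(df["type"] == "withdrawal") &
                (df["amount"] > 50000) &
                (df["hour"].between(0, 5))]
print(suspicious)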
B. Semi-Structured Data
Semi-structured data has some organizational properties like tags or keys,
but it doesn’t follow a rigid tabular format like structured data. It sits
between structured and unstructured data.
Example: NoSQL databases such as MongoDB store data in flexible JSON-like
documents.
Key Characteristics of Semi-Structured Data
● Key-Value Pairing – Data is grouped using tags, keys, or labels (e.g., XML, JSON).
● Flexible Schema – Structure is not fixed but can still be parsed programmatically.
● Human & Machine Readable – Understandable by both humans and machines.
● Partial Query Support – Not as powerful as SQL, but still searchable with XPath, JSONPath, etc.
Real-Time Use Case
An e-commerce platform like Amazon stores customer preferences in JSON
format from clickstream behavior.
This semi-structured data is analyzed using Python or NoSQL tools to
personalize recommendations in real time.
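A minimal sketch of reading such a record with Python's built-in json module; the clickstream document below is hypothetical:

import json

# Hypothetical JSON clickstream record (keys and nesting, no fixed table)
raw = '''
{
  "user_id": "U123",
  "viewed": ["laptop", "mouse"],
  "preferences": {"category": "electronics", "price_max": 50000}
}
'''
record = json.loads(raw)

# Values are reached by key, without a relational schema
print(record["user_id"], record["preferences"]["category"])
for product in record["viewed"]:
    print("viewed:", product)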
C. Unstructured Data
Unstructured data has no predefined format or schema and is often
text-heavy or multimedia. It’s the most abundant form of data today and
challenging to store, search, and analyze.
Key Characteristics of Unstructured Data
● No Predefined Schema – No table-like structure; stored as-is.
● Text & Media-Based – Includes text, images, audio, video, documents, logs, etc.
● Needs AI/ML to Process – Traditional tools don’t work; requires NLP, computer vision, etc.
● High Business Value – Extracting insights can give a huge competitive advantage.
Examples of Unstructured Data
1. Emails – Body, attachments, headers
2. Social Media Posts – Tweets, Instagram comments, memes
3. Customer Reviews – Amazon, TripAdvisor feedback
4. Call Center Transcripts – Recorded calls in MP3
5. Images and Videos – Product demos, CCTV footage
Real-Time Use Case
Use Case: Sentiment Analysis in Social Media
A company collects thousands of customer reviews and social media posts.
Using NLP and machine learning, it analyzes the tone
(positive/negative) to adjust product strategy.
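A minimal sketch of that kind of analysis using NLTK's VADER sentiment analyzer; the review texts are hypothetical, and the lexicon must be downloaded once:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Hypothetical customer reviews
reviews = [
    "Absolutely love this product, works perfectly!",
    "Terrible experience, it broke after two days.",
]
for text in reviews:
    scores = sia.polarity_scores(text)  # compound score in [-1, 1]
    label = "positive" if scores["compound"] > 0 else "negative"
    print(label, scores["compound"], text)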
Data Type | Structure | Tools | Example Use Case
Structured | Fixed (tables) | SQL, Excel, Power BI | Bank transactions, HR database
Semi-Structured | Flexible (JSON/XML) | MongoDB, Python, Kafka | E-commerce personalization
Unstructured | No structure | NLP, Hadoop, OpenCV, AI tools | Sentiment analysis, image recognition
Classification Based on Data Type
A. Discrete Data
Countable values, often integers.
Examples:
● Number of items sold
● Number of clicks
● No. of employees
Use Case: An e-commerce website tracks the number of purchases per customer
per month.
B. Continuous Data
Measurable values that can take infinite values within a range.
Examples:
● Temperature
● Time
● Revenue
Use Case: Google Fit records heart rate continuously from a smartwatch.
Classification Based on Time Frame
A. Cross-Sectional Data
Collected at one point in time across different entities.
Examples:
● Annual income of 1000 families in 2025
● Ratings of 10 restaurants on July 1
Use Case: Flipkart compares customer reviews across product categories on Big
Billion Day.
B. Time Series Data
Collected over a period of time for one variable/entity.
Examples:
● Daily sales from Jan to June
● Weekly footfall in a store.
Use Case: Swiggy forecasts future orders using the past six months’ daily
order trends.
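A minimal sketch of the simplest form of such a forecast, a moving average over hypothetical daily order counts (real demand forecasting would use seasonality-aware models):

# Hypothetical daily order counts for the most recent week
orders = [120, 135, 128, 150, 142, 160, 155]

# Naive forecast: tomorrow's orders = average of the last 7 days
window = orders[-7:]
forecast = sum(window) / len(window)
print(f"Forecast for tomorrow: {forecast:.0f} orders")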
1. What is Data Extraction?
Definition:
Data extraction is the process of retrieving data from various sources — such
as databases, flat files, web APIs, or spreadsheets — for the purpose of
analysis, reporting, or migration.
Data extraction primarily refers to writing SQL queries using SELECT, JOIN,
WHERE, and GROUP BY to pull relevant data from relational tables.
SQL:
SELECT emp_id, emp_name, salary
FROM employees
WHERE department_id = 10;
Use case – A retail company extracts sales data for the month using:
SELECT product_id, SUM(sales_amount)
FROM sales
WHERE sale_date BETWEEN '01-JUN-2025' AND '30-JUN-2025'
GROUP BY product_id;
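A minimal sketch of running that extraction from Python, assuming a local SQLite file named sales.db with a matching sales table (the date literals are rewritten in ISO format, which SQLite compares correctly as text):

import sqlite3

# Hypothetical local database; production systems would use Oracle/Postgres drivers
conn = sqlite3.connect("sales.db")
query = """
    SELECT product_id, SUM(sales_amount) AS total_sales
    FROM sales
    WHERE sale_date BETWEEN '2025-06-01' AND '2025-06-30'
    GROUP BY product_id
"""
for product_id, total_sales in conn.execute(query):
    print(product_id, total_sales)
conn.close()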
2. What is Data Cleaning?
Definition:
Data cleaning (or data cleansing) is the process of identifying and correcting
(or removing) inaccurate, inconsistent, or incomplete data from a database
to improve data quality and integrity.
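A minimal sketch of common cleaning steps in pandas, applied to a small hypothetical customer table with typical quality problems:

import pandas as pd

# Hypothetical raw customer data: a duplicate row, a missing name,
# and inconsistently formatted city values
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", None],
    "age": [29, 41, 41, 35],
    "city": ["chennai", "Mumbai ", "Mumbai ", "Delhi"],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df = df.dropna(subset=["name"])                  # drop rows missing a key field
df["city"] = df["city"].str.strip().str.title()  # standardize text values
print(df)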