What Is Hypothesis Testing in Statistics?
Hypothesis testing is a structured method used to determine if the findings
of a study provide evidence to support a specific theory relevant to a larger
population.
Hypothesis Testing is a fundamental statistical technique used to make
decisions or inferences about a population based on a sample of data. You
start by making an assumption (hypothesis) about a population
parameter (like the mean or proportion), and then test whether this
assumption is supported by the sample data.
In simple terms, hypothesis testing helps us decide whether a
statement about a population is likely to be true, based on data
collected from a sample.
Examples:
● A teacher assumes that 60% of his college's students come from
lower-middle-class families.
● A doctor believes that 3D (Diet, Dose, and Discipline) is 90%
effective for diabetic patients.
Statistical analysts validate assumptions by collecting and evaluating a
representative sample from the data set under study.
Why Do We Need Hypothesis Testing?
We use hypothesis testing to:
● Validate or reject our assumptions
● Make data-driven decisions
● Avoid guessing or making decisions based on bias
Real-Life Analogy:
Imagine you're a doctor testing a new medicine. You can't assume it works
just because one person said so. You need to test it on many patients and
analyze the results statistically — that’s hypothesis testing!
There are usually two types of hypotheses:
● Null Hypothesis (H₀) – the default or no-change assumption.
● Alternative Hypothesis (H₁ or Ha) – the claim you're trying to
prove.
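To make the idea concrete, here is a minimal sketch of testing the teacher's 60% claim from the examples above, using Python's scipy. The sample counts (110 out of 200 students) are hypothetical, purely for illustration:

from scipy.stats import binomtest

# H0: the true proportion of lower-middle-class students is 0.60
# H1: the true proportion differs from 0.60
# Hypothetical sample: 110 of 200 surveyed students qualify
result = binomtest(k=110, n=200, p=0.60, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")
if result.pvalue < 0.05:
    print("Reject H0: the data contradict the 60% assumption")
else:
    print("Fail to reject H0: the data are consistent with 60%")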
Each type of data needs a different kind of test. If you don’t use the
correct test for the type of data, your result may be wrong or misleading.
So before testing a hypothesis, you must first understand the type of
data you’re working with.
1. Nominal Data
Definition: Data in the form of categories, with no order between
them.
Examples:
● Gender: Male, Female, Other
● Blood Group: A, B, AB, O
● Car Brand: Honda, BMW, Toyota
Example:
You want to test whether equal numbers of students prefer Python, Java,
or C++. This is nominal → use the Chi-square test.
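A minimal sketch of that Chi-square test in Python's scipy, with hypothetical preference counts (under H0, the expected counts are equal across the three languages):

from scipy.stats import chisquare

# Hypothetical observed preference counts: Python, Java, C++
observed = [50, 30, 40]
# chisquare defaults to equal expected counts, matching H0 here
stat, pvalue = chisquare(observed)
print(f"chi-square = {stat:.2f}, p-value = {pvalue:.4f}")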
2. Ordinal Data
Definition: Data in categories with order, but no clear difference in
amount between them.
Examples:
● Customer Ratings: Poor, Average, Good, Excellent
● Pain Level: Mild, Moderate, Severe
● Education Level: High School, Bachelor’s, Master’s, PhD
Why important?
Since the order matters but the distance between categories is
unknown, we use non-parametric tests like Mann-Whitney U test or
Kruskal-Wallis test.
Note: a binary category such as Student Feedback Type (Satisfied / Not
Satisfied) is nominal, not ordinal. If you want to test how many people
prefer different brands, or which blood group is most common, use the
Chi-Square test or proportion tests, which are designed for nominal data.
Example: You want to compare customer satisfaction (rated as Poor, Good,
Excellent) between two banks.
This is ordinal → use the Mann-Whitney U test.
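A minimal sketch of that comparison in scipy, with satisfaction encoded on an ordinal scale (Poor = 1, Good = 2, Excellent = 3) and made-up ratings for the two banks:

from scipy.stats import mannwhitneyu

# Hypothetical ordinal ratings: Poor=1, Good=2, Excellent=3
bank_a = [1, 2, 2, 3, 2, 1, 3, 2]
bank_b = [2, 3, 3, 2, 3, 3, 2, 3]
stat, pvalue = mannwhitneyu(bank_a, bank_b, alternative='two-sided')
print(f"U = {stat}, p-value = {pvalue:.4f}")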
3. Interval Data (Numerical – No True Zero)
Definition: Numbers where you can measure the difference, but zero
doesn’t mean "none".
Examples: Temperature in Celsius, Calendar Years
You can calculate a mean, but ratios don’t make sense (e.g., 20°C is not
twice as hot as 10°C). Think of thermometer readings.
Example: Compare the average daily temperatures of Chennai and Delhi for June.
This is interval → use a t-test.
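A minimal sketch of that t-test in scipy, using small made-up samples of June daily temperatures in °C:

from scipy.stats import ttest_ind

# Hypothetical June daily temperatures (°C)
chennai = [34.2, 35.1, 33.8, 36.0, 34.9, 35.4]
delhi = [38.5, 39.2, 37.8, 40.1, 39.0, 38.7]
stat, pvalue = ttest_ind(chennai, delhi)
print(f"t = {stat:.2f}, p-value = {pvalue:.4f}")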
4. Ratio Data (Numerical – Has True Zero)
Definition: Like interval data, but zero means absence of quantity.
Examples: Salary, Age, Height, Marks, Distance
You can compare ratios (e.g., ₹40,000 is twice ₹20,000)
Example: Test if the average salary of IT employees in Bangalore is different
from that of employees in Mumbai.
This is ratio → use a t-test or ANOVA.
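For two cities a t-test is enough; one-way ANOVA extends the comparison to three or more groups. A minimal sketch with hypothetical salaries for three cities:

from scipy.stats import f_oneway

# Hypothetical monthly salaries (₹) of IT employees in three cities
bangalore = [82000, 91000, 78000, 88000, 95000]
mumbai = [75000, 80000, 72000, 85000, 79000]
chennai = [70000, 74000, 69000, 77000, 72000]
stat, pvalue = f_oneway(bangalore, mumbai, chennai)
print(f"F = {stat:.2f}, p-value = {pvalue:.4f}")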
Data Sources
What is a Data Source:
A data source refers to the origin or provider of the data that is used in
analysis to support decision-making. In Business Analytics, knowing where
the data comes from, how reliable it is, and how it can be used is
foundational.
Example:
A customer database in a retail company acts as a data source.
It contains details like customer names, contact info, purchase history, etc.
This database is accessed by other applications like a recommendation
engine or a sales dashboard — making it a data source.
Real-Time Analogy:
1. Think of a water tank in a house.
● The water tank stores water (data).
● Taps (analytics tools, dashboards) draw water from the tank.
● The water tank is the data source, even if the water came from
somewhere else earlier (e.g., a river or a well).
So, just like a tap doesn't care where the water originally came
from, a program accessing a data source doesn't need to know where the
data originated — only where to find it now.
2. Imagine a library as your data source
The books are the data stored in an organized way (like tables or files).
● A student comes in to borrow a book — this is like a software
application or report accessing the data.
● It doesn’t matter whether the book was donated, bought, or printed
in-house — once it's in the library, it serves as the source for
others.
Even if a book has gone through editing or printing (data
transformation), once it’s on the shelf, it's now the source
anyone can use.
TYPES OF DATA SOURCES:
1. Primary Data Sources:
A Primary Data Source refers to data collected directly from the
original source, typically for a specific purpose or study, without any prior
processing or transformation.
It is first-hand data collected by the analyst for a specific study, meaning it is:
● Collected directly from people, events, or processes
● Gathered using methods like surveys, interviews, sensors, or
experiments
● Raw and original, and has not been previously published or analyzed
Key Characteristics:
● Directly collected
● Specific to the problem
● High accuracy and control
Example: A company conducts an employee satisfaction survey by
emailing a custom questionnaire to all employees.
The responses collected from the survey become primary data, as they
were gathered directly from the source (employees), for the first time.
Real-Time Use Case:
Use Case: Agricultural Yield Prediction
● Scenario: A government department wants to predict crop yield in
rural areas.
● Primary Data Source: Officials go to farms and collect data
through sensors and farmer interviews:
○ Soil moisture levels (via IoT sensors)
○ Farmer-reported seed usage
○ Rainfall patterns via local weather stations
Another Use Case: A startup wants to know why users uninstall their app
→ sends a feedback form post-uninstallation.
2. Secondary Data Sources:
A Secondary Data Source refers to data that has already been
collected, processed, and possibly analyzed by someone else for another
purpose, and is now being reused.
This data is:
● Not original to the user
● Often found in reports, publications, databases, websites
● Used for analysis, reference, or to supplement primary data
It may be aggregated, cleaned, or transformed, and is usually
easier and quicker to access, but may not be perfectly tailored to your
needs.
Key Characteristics:
● Less expensive
● Available faster
● Less control over data quality
Example:
A student writing a thesis on population trends uses Census data
published by the government.
● The data was not collected by the student.
● It was collected for a different (governmental/statistical) purpose.
● It is being reused.
Real-Time Use Case: Health Research Using WHO Data
Scenario:
A university research team is studying the global spread of diabetes
over the last two decades.
Secondary Data Sources Used:
● World Health Organization (WHO) datasets on diabetes rates
by country
● Published journal articles summarizing prior clinical trials
● Health Ministry reports from various governments
● Online health statistics portals (e.g., Data.gov, Statista)
Even if the data is useful and reliable, the researchers have no
control over how it was collected, so it qualifies as secondary data.
Another Use Case: A company expanding to Brazil uses World
Bank data on internet penetration to estimate e-commerce potential.
Data sources play a vital role in helping organisations make better decisions.
There are two main sources of data collection:
1. Internal sources
2. External sources
Internal Data
The data which is generated within an organization is called internal data. It
is easy to access and analyse for business insights, which are then used in
business decisions.
Some of the most commonly used internal data sources are as follows:
1. Operational Data
Operational data includes day-to-day business operations like sales
transactions, customer data, inventory records, and production data.
Example:
A retail store's Point of Sale (POS) system records all product sales
and updates stock levels instantly.
2. Customer Data
It is one of the most crucial types of data, collected directly from customers
using CRM systems, feedback forms, surveys, and customer support systems. It is
used in data analysis to find customers' opinions or sentiments.
Example:
An e-commerce website gathers customer satisfaction ratings and product
reviews after every purchase.
3. Employee Data
It includes human resources data, which can be further analysed for employee
performance, payroll information, and employee satisfaction.
Example:
A company uses an HRMS (Human Resource Management System) to track
employee attendance, monthly salary, and annual performance appraisals.
4. Financial Data
It includes financial data generated through financial systems, such as budgets,
profit and loss statements, balance sheets, and cash flow statements.
Example:
A business uses Tally or QuickBooks to generate quarterly profit & loss
reports for internal financial planning and audits.
5. Marketing Data
It is also considered internal data, collected from marketing campaigns,
website analytics, email marketing, and social media channels.
Example:
A digital marketing team uses Google Analytics to monitor website visitor
behavior and Mailchimp to track email campaign performance.
6. Production Data
Production data is also collected from an organisation's internal sources, such
as manufacturing processes. It covers machine performance, production
output, and quality control metrics.
Example:
A car manufacturing plant uses IoT sensors on machines to collect real-time
production efficiency data and downtime metrics.
Tools: SQL, SAP, Excel, Power BI
External Data
The data which is collected from outside the boundaries of an organisation is
known as external data. External data is used by organisations to assess and
model economic, political, social, and environmental factors that influence
the business.
Some of the most commonly used external data sources are as follows:
1. Public Data
The data which is collected from public platforms like government databases,
journals, magazines, industry reports, and newspapers. It is freely available
from public sources such as government and non-profit agencies. Common
examples of public data include census data, economic indicators, and public
health data.
Example:
The Government of India’s Census portal provides demographic data like
population, literacy rate, and employment statistics which businesses can use
for market research.
2. Social Media Data
The data available on social media platforms like Twitter, Facebook, LinkedIn,
and Instagram. This encompasses user-generated content, user behavior,
trends, sentiment, and engagement metrics.
Example:
A brand analyzes Twitter hashtags and comments during a product launch to
measure customer sentiment and campaign impact.
3. Web Scraping
Data extracted automatically from websites using tools or scripts, such as
product reviews, competitor data, and price comparisons.
Example:
An e-commerce company uses Python and BeautifulSoup to scrape Amazon
and Flipkart product prices for competitor analysis and dynamic pricing.
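A minimal sketch of that kind of scrape, assuming a hypothetical page URL and CSS class names (real sites differ, and many restrict scraping in their terms of service):

import requests
from bs4 import BeautifulSoup

# Hypothetical URL and class names, purely for illustration
url = "https://example.com/products"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull each product's name and price out of the parsed HTML
for item in soup.find_all("div", class_="product"):
    name = item.find("span", class_="name").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    print(name, price)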
4. IoT Data
Data collected from sensors, smart devices, and wearables, providing real-time
data on environmental conditions, usage patterns, and more.
Example:
A smart thermostat collects temperature and energy usage data from homes
and sends it to a central system to optimize energy efficiency.
5. Partner Data
Data shared between business partners, such as suppliers, distributors, or
strategic partners in a supply chain, to improve mutual understanding of
market conditions or customer needs.
Example:
A retail chain receives real-time inventory and delivery status data from
its logistics partner to manage stock and predict supply chain bottlenecks.
6. Open data
Open data is freely usable and available to everybody, often raw or
unstructured. It might not be very relevant to you, though: it may be
high-level or substantially summarized and aggregated, it might not be in
the format you require, and it might be genuinely challenging to understand.
Making the data usable may require a long preparation time before it can be
used for analysis.
Example:
An environmental scientist uses NASA’s open climate datasets to study
global temperature trends—but first needs to clean and reformat the raw data
before analysis.
Tools: APIs, Web Scraping (Python), Postman
Classification Based on Structure:
A. Structured Data:
Structured data refers to highly organized information that is easily
searchable and stored in a predefined schema — usually in rows and
columns, like in spreadsheets or relational databases.
Key Characteristics of Structured Data
● Predefined Schema – Data is organized under columns with specified data types (e.g., int, varchar).
● Tabular Format – Stored in tables (rows = records, columns = attributes).
● Easily Searchable – Queries can be run using SQL to filter, sort, and join data quickly.
● Machine-Readable – Easy for machines to process and for BI tools to visualize.
Real-Time Use Case
Use Case: Fraud Detection in Banking
A bank stores account data (Name, Account Number, Balance,
Transaction Type) in a relational database like Oracle.
Using SQL queries and Python scripts, analysts run anomaly
detection algorithms on transaction patterns (e.g., sudden large
withdrawals at odd hours) to detect fraud.
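A minimal sketch of such a rule-based check, assuming a hypothetical transactions table loaded into pandas (a production system would run statistical or ML-based anomaly detection over far richer features):

import pandas as pd

# Hypothetical transaction records pulled from the relational database
df = pd.DataFrame({
    "account": ["A1", "A1", "A2", "A2"],
    "type": ["withdrawal", "withdrawal", "deposit", "withdrawal"],
    "amount": [2000, 95000, 15000, 120000],
    "hour": [14, 3, 11, 2],  # hour of day (0-23)
})

# Flag large withdrawals made at odd hours (a simple illustrative rule)
suspicious = df[(df["type"] == "withdrawal") &
                (df["amount"] > 50000) &
                (df["hour"].between(0, 5))]
print(suspicious)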
B. Semi-Structured Data
Semi-structured data has some organizational properties like tags or keys,
but it doesn’t follow a rigid tabular format like structured data. It sits
between structured and unstructured data.
Example: NoSQL databases such as MongoDB store data in flexible JSON-like
documents.
Key Characteristics of Semi-Structured Data
● Key-Value Pairing – Data is grouped using tags, keys, or labels (e.g., XML, JSON).
● Flexible Schema – Structure is not fixed but can still be parsed programmatically.
● Human & Machine Readable – Understandable by both humans and machines.
● Partial Query Support – Not as powerful as SQL, but still searchable with XPath, JSONPath, etc.
Real-Time Use Case
An e-commerce platform like Amazon stores customer preferences in JSON
format from clickstream behavior.
This semi-structured data is analyzed using Python or NoSQL tools to
personalize recommendations in real time.
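A minimal sketch of reading such a record with Python's built-in json module; the clickstream document below is hypothetical:

import json

# Hypothetical JSON clickstream record (keys and nesting, no fixed table)
raw = '''
{
  "user_id": "U123",
  "viewed": ["laptop", "mouse"],
  "preferences": {"category": "electronics", "price_max": 50000}
}
'''
record = json.loads(raw)

# Values are reached by key, without a relational schema
print(record["user_id"], record["preferences"]["category"])
for product in record["viewed"]:
    print("viewed:", product)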
C. Unstructured Data
Unstructured data has no predefined format or schema and is often
text-heavy or multimedia. It’s the most abundant form of data today and
challenging to store, search, and analyze.
Key Characteristics of Unstructured Data
● No Predefined Schema – No table-like structure; stored as-is.
● Text & Media-Based – Includes text, images, audio, video, documents, logs, etc.
● Needs AI/ML to Process – Traditional tools don’t work; requires NLP, computer vision, etc.
● High Business Value – Extracting insights can give a huge competitive advantage.
Examples of Unstructured Data
1. Emails – Body, attachments, headers
2. Social Media Posts – Tweets, Instagram comments, memes
3. Customer Reviews – Amazon, TripAdvisor feedback
4. Call Center Transcripts – Recorded calls in MP3
5. Images and Videos – Product demos, CCTV footage
Real-Time Use Case
Use Case: Sentiment Analysis in Social Media
A company collects thousands of customer reviews and social media posts.
Using NLP and machine learning, it analyzes the tone
(positive/negative) to adjust product strategy.
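A minimal sketch of that kind of analysis using NLTK's VADER sentiment analyzer; the review texts are hypothetical, and the lexicon must be downloaded once:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Hypothetical customer reviews
reviews = [
    "Absolutely love this product, works perfectly!",
    "Terrible experience, it broke after two days.",
]
for text in reviews:
    scores = sia.polarity_scores(text)  # compound score in [-1, 1]
    label = "positive" if scores["compound"] > 0 else "negative"
    print(label, scores["compound"], text)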
Data Type | Structure | Tools | Example Use Case
Structured | Fixed (tables) | SQL, Excel, Power BI | Bank transactions, HR database
Semi-Structured | Flexible (JSON/XML) | MongoDB, Python, Kafka | E-commerce personalization
Unstructured | No structure | NLP, Hadoop, OpenCV, AI tools | Sentiment analysis, image recognition
Classification Based on Data Type
A. Discrete Data
Countable values, often integers.
Examples:
● Number of items sold
● Number of clicks
● No. of employees
Use Case: An e-commerce website tracks the number of purchases per customer
per month.
B. Continuous Data
Measurable values that can take infinite values within a range.
Examples:
● Temperature
● Time
● Revenue
Use Case: Google Fit records heart rate continuously from a smartwatch.
Classification Based on Time Frame
A. Cross-Sectional Data
Collected at one point in time across different entities.
Examples:
● Annual income of 1000 families in 2025
● Ratings of 10 restaurants on July 1
Use Case: Flipkart compares customer reviews across product categories on Big
Billion Day.
B. Time Series Data
Collected over a period of time for one variable/entity.
Examples:
● Daily sales from Jan to June
● Weekly footfall in a store.
Use Case: Swiggy forecasts future orders using the past six months’ daily
order trends.
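A minimal sketch of the simplest form of such a forecast, a moving average over hypothetical daily order counts (real demand forecasting would use seasonality-aware models):

# Hypothetical daily order counts for the most recent week
orders = [120, 135, 128, 150, 142, 160, 155]

# Naive forecast: tomorrow's orders = average of the last 7 days
window = orders[-7:]
forecast = sum(window) / len(window)
print(f"Forecast for tomorrow: {forecast:.0f} orders")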
1. What is Data Extraction?
Definition:
Data extraction is the process of retrieving data from various sources — such
as databases, flat files, web APIs, or spreadsheets — for the purpose of
analysis, reporting, or migration.
Data extraction primarily refers to writing SQL queries using SELECT, JOIN,
WHERE, and GROUP BY to pull relevant data from relational tables.
SQL:
SELECT emp_id, emp_name, salary
FROM employees
WHERE department_id = 10;
Use case – A retail company extracts sales data for the month using:
SELECT product_id, SUM(sales_amount)
FROM sales
WHERE sale_date BETWEEN '01-JUN-2025' AND '30-JUN-2025'
GROUP BY product_id;
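A minimal sketch of running that extraction from Python, assuming a local SQLite file named sales.db with a matching sales table (the date literals are rewritten in ISO format, which SQLite compares correctly as text):

import sqlite3

# Hypothetical local database; production systems would use Oracle/Postgres drivers
conn = sqlite3.connect("sales.db")
query = """
    SELECT product_id, SUM(sales_amount) AS total_sales
    FROM sales
    WHERE sale_date BETWEEN '2025-06-01' AND '2025-06-30'
    GROUP BY product_id
"""
for product_id, total_sales in conn.execute(query):
    print(product_id, total_sales)
conn.close()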
2. What is Data Cleaning?
Definition:
Data cleaning (or data cleansing) is the process of identifying and correcting
(or removing) inaccurate, inconsistent, or incomplete data from a database
to improve data quality and integrity.
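A minimal sketch of common cleaning steps in pandas, applied to a small hypothetical customer table with typical quality problems:

import pandas as pd

# Hypothetical raw customer data: a duplicate row, a missing name,
# and inconsistently formatted city values
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", None],
    "age": [29, 41, 41, 35],
    "city": ["chennai", "Mumbai ", "Mumbai ", "Delhi"],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df = df.dropna(subset=["name"])                  # drop rows missing a key field
df["city"] = df["city"].str.strip().str.title()  # standardize text values
print(df)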