BA - Unit 1

What Is Hypothesis Testing in Statistics?

Hypothesis testing is a structured method used to determine if the findings of a study provide evidence to support a specific theory relevant to a larger population.

Hypothesis Testing is a fundamental statistical technique used to make decisions or inferences about a population based on a sample of data. You start by making an assumption (hypothesis) about a population parameter (like the mean or proportion), and then test whether this assumption is supported by the sample data.

In simple terms, hypothesis testing helps us decide whether a statement about a population is likely to be true, based on data collected from a sample.

Examples:
●​ A teacher assumes that 60% of his college's students come from
lower-middle-class families.
●​ A doctor believes that 3D (Diet, Dose, and Discipline) is 90%
effective for diabetic patients.

Statistical analysts validate assumptions by collecting and evaluating a representative sample from the data set under study.
Why Do We Need Hypothesis Testing?

We use hypothesis testing to:

● Validate or reject our assumptions
● Make data-driven decisions
● Avoid guessing or making decisions based on bias

Real-Life Analogy:

Imagine you're a doctor testing a new medicine. You can't assume it works
just because one person said so. You need to test it on many patients and
analyze the results statistically — that’s hypothesis testing!

There are usually two types of hypotheses:

● Null Hypothesis (H₀) – the default or no-change assumption.
● Alternative Hypothesis (H₁ or Ha) – the claim you're trying to prove.
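For instance, the teacher's claim above (60% of students from lower-middle-class families) can be tested as a one-sample proportion test. A minimal sketch in Python with scipy, where the sample size of 200 and the count of 132 are invented numbers for illustration:

    from scipy.stats import binomtest

    # H0: 60% of students come from lower-middle-class families (p = 0.60)
    # H1: the true proportion is different from 0.60
    # Hypothetical sample: 132 such students out of 200 surveyed
    result = binomtest(k=132, n=200, p=0.60, alternative='two-sided')
    print(result.pvalue)  # if the p-value is below 0.05, reject H0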

Each type of data needs a different kind of test. If you don't use the correct test for the type of data, your result may be wrong or misleading.

So before testing a hypothesis, you must first understand the type of data you're working with.

1. Nominal Data

Definition: Data in the form of categories, with no order between them.

Examples:

● Gender: Male, Female, Other
● Blood Group: A, B, AB, O
● Car Brand: Honda, BMW, Toyota

Example:
You want to test if equal numbers of students prefer Python, Java, or C++. This is nominal → use Chi-square test.
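A minimal sketch of this chi-square goodness-of-fit test in Python (the counts below are invented for illustration):

    from scipy.stats import chisquare

    # Observed preferences: Python, Java, C++ (hypothetical counts)
    observed = [50, 30, 20]
    # By default, chisquare tests H0 that all categories have equal counts
    result = chisquare(observed)
    print(result.pvalue)  # a small p-value means preferences are not equal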

2. Ordinal Data

Definition: Data in categories with order, but no clear difference in amount between them.

Examples:

● Customer Ratings: Poor, Average, Good, Excellent
● Pain Level: Mild, Moderate, Severe
● Education Level: High School, Bachelor’s, Master’s, PhD

Why important?

Since the order matters but the distance between categories is unknown, we use non-parametric tests like the Mann-Whitney U test or the Kruskal-Wallis test.

(By contrast, a two-category field like Student Feedback Type: Satisfied / Not Satisfied is nominal. If you want to test how many people prefer different brands, or which blood group is most common, you use the Chi-Square Test or proportion tests, which are designed for nominal data.)

Example:
You want to compare customer satisfaction (rated as Poor, Good, Excellent) between two banks. This is ordinal → use Mann-Whitney U.
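A minimal sketch of that Mann-Whitney U test in Python, with ordinal ratings coded as numbers (1 = Poor, 2 = Average, 3 = Good, 4 = Excellent); the ratings themselves are invented:

    from scipy.stats import mannwhitneyu

    bank_a = [1, 2, 2, 3, 3, 3, 4, 4]   # hypothetical ratings for bank A
    bank_b = [1, 1, 2, 2, 2, 3, 3, 4]   # hypothetical ratings for bank B
    result = mannwhitneyu(bank_a, bank_b, alternative='two-sided')
    print(result.pvalue)  # a small p-value means the two banks' ratings differ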

3. Interval Data (Numerical – No True Zero)

Definition: Numbers where you can measure the difference, but zero
doesn’t mean "none".

Examples: Temperature in Celsius, Calendar Years

You can calculate a mean, but ratios don’t make sense (e.g., 20°C is not twice as hot as 10°C). Think of thermometer readings.

Compare the average daily temperatures of Chennai and Delhi for June. This is interval → use t-test.
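A minimal sketch of this two-sample t-test in Python (the temperature readings are invented):

    from scipy.stats import ttest_ind

    chennai = [34.1, 35.0, 33.8, 34.6, 35.2, 34.9]   # hypothetical June temps, °C
    delhi   = [38.5, 39.2, 40.1, 38.9, 39.7, 40.3]
    result = ttest_ind(chennai, delhi)
    print(result.pvalue)  # a small p-value means the average temperatures differ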

4. Ratio Data (Numerical – Has True Zero)

Definition: Like interval data, but zero means absence of quantity.

Examples: Salary, Age, Height, Marks, Distance

You can compare ratios (e.g., ₹40,000 is twice ₹20,000).

Test if the average salary of IT employees in Bangalore is different from those in Mumbai. This is ratio data → use t-test or ANOVA.
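When more than two groups are compared, ANOVA applies. A minimal sketch in Python (the salaries, in ₹ per month, are invented):

    from scipy.stats import f_oneway

    bangalore = [65000, 72000, 80000, 58000, 75000]
    mumbai    = [60000, 68000, 71000, 55000, 66000]
    chennai   = [52000, 61000, 58000, 49000, 63000]
    result = f_oneway(bangalore, mumbai, chennai)
    print(result.pvalue)  # small p-value: at least one city's mean salary differs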


Data Sources

What is a Data Source?

A data source refers to the origin or provider of the data that is used in analysis to support decision-making. In Business Analytics, knowing where the data comes from, how reliable it is, and how it can be used is foundational.

Example:

A customer database in a retail company acts as a data source. It contains details like customer names, contact info, purchase history, etc. This database is accessed by other applications like a recommendation engine or a sales dashboard, making it a data source.

Real-Time Analogy:

1. Think of a water tank in a house.

● The water tank stores water (data).
● Taps (analytics tools, dashboards) draw water from the tank.
● The water tank is the data source, even if the water came from somewhere else earlier (e.g., a river or a well).

So, just like a tap doesn't care where the water originally came from, a program accessing a data source doesn't need to know where the data originated, only where to find it now.
2. Imagine a library as your data source.

The books are the data stored in an organized way (like tables or files).

● A student comes in to borrow a book – this is like a software application or report accessing the data.
● It doesn’t matter whether the book was donated, bought, or printed in-house – once it's in the library, it serves as the source for others.

Even if a book has gone through editing or printing (data transformation), once it’s on the shelf, it's now the source anyone can use.

TYPES OF DATA SOURCES:

1. Primary Data Sources:

A Primary Data Source refers to data collected directly from the original source, typically for a specific purpose or study, without any prior processing or transformation. In other words, it is data collected first-hand by the analyst for a specific study.

It is first-hand data, meaning it:

● Is collected directly from people, events, or processes
● Is gathered using methods like surveys, interviews, sensors, or experiments
● Is raw and original, and has not been previously published or analyzed

Key Characteristics:

● Directly collected
● Specific to the problem
● High accuracy and control

Example: A company conducts an employee satisfaction survey by emailing a custom questionnaire to all employees.

The responses collected from the survey become primary data, as they were gathered directly from the source (employees), for the first time.

Real-Time Use Case:

Use Case: Agricultural Yield Prediction

● Scenario: A government department wants to predict crop yield in rural areas.
● Primary Data Source: Officials go to farms and collect data through sensors and farmer interviews:
  ○ Soil moisture levels (via IoT sensors)
  ○ Farmer-reported seed usage
  ○ Rainfall patterns via local weather stations

Another Use Case: A startup wants to know why users uninstall their app → sends a feedback form post-uninstallation.

2. Secondary Data Sources:

A Secondary Data Source refers to data that has already been collected, processed, and possibly analyzed by someone else, and is now being reused for another purpose.

This data is:

● Not original to the user
● Often found in reports, publications, databases, websites
● Used for analysis, reference, or to supplement primary data

It may be aggregated, cleaned, or transformed, and is usually easier and quicker to access, but may not be perfectly tailored to your needs.

Key Characteristics:

●​ Less expensive
●​ Available faster
●​ Less control over data quality

Example:

A student writing a thesis on population trends uses Census data published by the government.

● The data was not collected by the student.
● It was collected for a different (governmental/statistical) purpose.
● It is being reused.

Real-Time Use Case: Health Research Using WHO Data

Scenario:
A university research team is studying the global spread of diabetes over the last two decades.

Secondary Data Sources Used:

● World Health Organization (WHO) datasets on diabetes rates by country
● Published journal articles summarizing prior clinical trials
● Health Ministry reports from various governments
● Online health statistics portals (e.g., Data.gov, Statista)

Even if the data is useful and reliable, the researchers have no control over how it was collected, so it qualifies as secondary data.

Another Use Case: A company expanding to Brazil uses World Bank data on internet penetration to estimate e-commerce potential.

Data sources play a vital role in helping organisations make better decisions. There are two main sources of data collection:

1. Internal source
2. External source

Internal Data

The data which is generated within an organization is called internal data. It is readily accessible, and analysing it yields business insights that are used in business decisions.

Some of the most commonly used internal data sources are as follows:

​ 1. Operational Data
​ Operational data includes day-to-day business operations like sales
transactions, customer data, inventory records, and production data.
​ Example:
​ A retail store's Point of Sale (POS) system records all product sales
and updates stock levels instantly.

2. Customer Data
Customer data is one of the most crucial types of data, collected directly from customers using CRM systems, feedback forms, surveys, and customer support systems. It is used in data analysis to understand customers' opinions and sentiments.

Example:

An e-commerce website gathers customer satisfaction ratings and product reviews after every purchase.

3. Employee Data
It includes human resources data, which can be further analysed to assess employee performance, payroll information, and employee satisfaction.

Example:

A company uses an HRMS (Human Resource Management System) to track employee attendance, monthly salary, and annual performance appraisals.

4. Financial Data
It includes data generated through financial systems, such as budgets, profit and loss statements, balance sheets, and cash flow statements.

Example:

A business uses Tally or QuickBooks to generate quarterly profit & loss reports for internal financial planning and audits.

5. Marketing Data
Marketing data is internal data collected from marketing campaigns, website analytics, email marketing, and social media channels.

Example:

A digital marketing team uses Google Analytics to monitor website visitor behavior and Mailchimp to track email campaign performance.

6. Production Data
Production data is collected from internal sources such as manufacturing processes. It covers machine performance, production output, and quality control metrics.

Example:

A car manufacturing plant uses IoT sensors on machines to collect real-time production efficiency data and downtime metrics.

Tools: SQL, SAP, Excel, Power BI

External Data
Data collected from outside the boundaries of an organisation is known as external data. Organisations use external data to assess and model economic, political, social, and environmental factors that influence the business.

Some of the most commonly used external data sources are as follows:

1. Public Data
Public data is collected from public platforms like government databases, journals, magazines, industry reports, and newspapers. It is freely available from public sources such as government and non-profit agencies. Common examples of public data are census data, economic indicators, and public health data.

Example:

The Government of India’s Census portal provides demographic data like population, literacy rate, and employment statistics, which businesses can use for market research.

2. Social Media Data

Social media data is available on platforms like Twitter, Facebook, LinkedIn, and Instagram. It encompasses user-generated content and engagement metrics, capturing user behavior, trends, and sentiments for sentiment analysis.

Example:

A brand analyzes Twitter hashtags and comments during a product launch to measure customer sentiment and campaign impact.

3. Web Scraping
Web scraping is automated data extraction from websites using tools or scripts, for example pulling product reviews and competitor prices for comparison.

Example:
An e-commerce company uses Python and BeautifulSoup to scrape Amazon
and Flipkart product prices for competitor analysis and dynamic pricing.
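A minimal scraping sketch in Python with requests and BeautifulSoup; the URL and the "price" CSS class are hypothetical, and real sites differ (always check a site's robots.txt and terms of use first):

    import requests
    from bs4 import BeautifulSoup

    # Fetch a (hypothetical) product listing page and parse the HTML
    html = requests.get("https://example.com/products").text
    soup = BeautifulSoup(html, "html.parser")

    # Print the text of every element with class="price" (hypothetical markup)
    for tag in soup.select(".price"):
        print(tag.get_text(strip=True))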

4. IoT Data
Data collected from sensors, smart devices, and wearables, providing real-time
data on environmental conditions, usage patterns, and more.

Example:

A smart thermostat collects temperature and energy usage data from homes
and sends it to a central system to optimize energy efficiency.

5. Partner Data
Partner data is shared between business partners, such as suppliers, distributors, or strategic partners in a supply chain, to improve mutual understanding of market conditions or client needs.

Example:

A retail chain receives real-time inventory and delivery status data from
its logistics partner to manage stock and predict supply chain bottlenecks.

6. Open Data
Open data is freely usable and available to everybody, often raw or unstructured. However, it may be high-level or substantially summarized and aggregated, and therefore not very relevant to your problem. It may also not be in the format you require, or be genuinely hard to interpret, so making it usable for analysis can take a long time.
Example:

An environmental scientist uses NASA’s open climate datasets to study global temperature trends, but first needs to clean and reformat the raw data before analysis.

Tools: APIs, Web Scraping (Python), Postman

Classification Based on Structure:

A. Structured Data:

Structured data refers to highly organized information that is easily searchable and stored in a predefined schema, usually in rows and columns, like in spreadsheets or relational databases.

Key Characteristics of Structured Data

● Predefined Schema – Data is organized under columns with specified data types (e.g., int, varchar).
● Tabular Format – Stored in tables (rows = records, columns = attributes).
● Easily Searchable – Queries can be run using SQL to filter, sort, and join data quickly.
● Machine-Readable – Easy for machines to process and for BI tools to visualize.
Real-Time Use Case

Use Case: Fraud Detection in Banking

A bank stores account data (Name, Account Number, Balance, Transaction Type) in a relational database like Oracle. Using SQL queries and Python scripts, analysts run anomaly detection algorithms on transaction patterns (e.g., sudden large withdrawals at odd hours) to detect fraud.
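A minimal sketch of that kind of query in Python with sqlite3 (the database file, table, columns, and the ₹1,00,000 threshold are all hypothetical):

    import sqlite3

    conn = sqlite3.connect("bank.db")  # hypothetical database file
    rows = conn.execute(
        "SELECT account_no, amount, txn_time FROM transactions "
        "WHERE amount > 100000 "                               # unusually large
        "AND strftime('%H', txn_time) BETWEEN '00' AND '04'"   # at odd hours
    ).fetchall()
    for row in rows:
        print(row)  # candidate transactions for a fraud analyst to review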

Semi-Structured Data

Semi-structured data has some organizational properties like tags or keys, but it doesn’t follow a rigid tabular format like structured data. It sits between structured and unstructured data.

NoSQL Databases

MongoDB stores data in flexible JSON-like documents.

Key Characteristics of Semi-Structured Data

● Key-Value Pairing – Data is grouped using tags, keys, or labels (e.g., XML, JSON).
● Flexible Schema – Structure is not fixed but can still be parsed programmatically.
● Human & Machine Readable – Understandable by both humans and machines.
● Partial Query Support – Not as powerful as SQL, but still searchable with XPath, JSONPath, etc.

An e-commerce platform like Amazon stores customer preferences in JSON format from clickstream behavior. This semi-structured data is analyzed using Python or NoSQL tools to personalize recommendations in real time.
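A minimal sketch of handling such a JSON document in Python (the record below is invented to show the key-value structure):

    import json

    doc = '{"customer": "C123", "clicks": [{"page": "shoes", "ts": "2025-06-01T10:00"}]}'
    record = json.loads(doc)           # parse JSON into nested dicts and lists

    print(record["customer"])          # key-based access instead of a fixed schema
    for click in record["clicks"]:     # nested fields can vary record to record
        print(click["page"], click["ts"])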

Unstructured Data

Unstructured data has no predefined format or schema and is often text-heavy or multimedia. It’s the most abundant form of data today and challenging to store, search, and analyze.

Key Characteristics of Unstructured Data

● No Predefined Schema – No table-like structure; stored as-is.
● Text & Media-Based – Includes text, images, audio, video, documents, logs, etc.
● Needs AI/ML to Process – Traditional tools don’t work; requires NLP, computer vision, etc.
● High Business Value – Extracting insights can give a huge competitive advantage.

Examples of Unstructured Data

1. Emails – Body, attachments, headers
2. Social Media Posts – Tweets, Instagram comments, memes
3. Customer Reviews – Amazon, TripAdvisor feedback
4. Call Center Transcripts – Recorded calls in MP3
5. Images and Videos – Product demos, CCTV footage

Real-Time Use Case

Use Case: Sentiment Analysis in Social Media

A company collects thousands of customer reviews and social media posts. Using NLP and machine learning, it analyzes the tone (positive/negative) to adjust product strategy.
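As a toy illustration of the idea, here is a rule-based sentiment sketch in pure Python; real systems use NLP libraries or trained ML models, but the goal of scoring tone is the same (the word lists are invented):

    POSITIVE = {"good", "great", "love", "excellent"}
    NEGATIVE = {"bad", "poor", "hate", "terrible"}

    def sentiment(text):
        words = text.lower().split()
        # Count positive hits minus negative hits to get a crude tone score
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentiment("I love this product with great quality"))  # -> positive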

Summary – Data Type / Structure / Tools / Example Use Case:

● Structured – Fixed (tables) – SQL, Excel, Power BI – Bank transactions, HR database
● Semi-Structured – Flexible (JSON/XML) – MongoDB, Python, Kafka – E-commerce personalization
● Unstructured – No structure – NLP, Hadoop, OpenCV, AI tools – Sentiment analysis, image recognition

Classification Based on Data Type

A. Discrete Data

Countable values, often integers.

Examples:

● Number of items sold
● Number of clicks
● No. of employees

Use Case: E-commerce website tracks number of purchases per customer per
month.
B. Continuous Data

Measurable values that can take infinite values within a range.

Examples:

●​ Temperature
●​ Time
●​ Revenue

Use Case: Google Fit records heart rate continuously from a smartwatch.

Classification Based on Time Frame

A. Cross-Sectional Data

Collected at one point in time across different entities.

Examples:

● Annual income of 1000 families in 2025
● Ratings of 10 restaurants on July 1

Use Case: Flipkart compares customer reviews across product categories on Big
Billion Day.

B. Time Series Data

Collected over a period of time for one variable/entity.

Examples:

● Daily sales from Jan to June
● Weekly footfall in a store

Use Case: Swiggy forecasts future orders using the past six months’ daily order trends.
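A minimal sketch of a naive forecast on such time-series data, using a 3-day moving average in Python with pandas (the daily order counts are invented):

    import pandas as pd

    orders = pd.Series([120, 135, 128, 140, 150, 145, 160])  # daily orders
    forecast = orders.rolling(window=3).mean().iloc[-1]      # mean of the last 3 days
    print(round(forecast))  # naive forecast for the next day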
1. What is Data Extraction?

Definition:​
Data extraction is the process of retrieving data from various sources — such
as databases, flat files, web APIs, or spreadsheets — for the purpose of
analysis, reporting, or migration.

Data extraction primarily refers to writing SQL queries using SELECT, JOIN,
WHERE, and GROUP BY to pull relevant data from relational tables.

SQL example:

SELECT emp_id, emp_name, salary
FROM employees
WHERE department_id = 10;

Use case – A retail company extracts sales data for the month using:

SQL:

SELECT product_id, SUM(sales_amount)

FROM sales

WHERE sale_date BETWEEN '01-JUN-2025' AND '30-JUN-2025'

GROUP BY product_id;
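In Python, the same extraction is often done through a database connector. A minimal sketch with sqlite3 and pandas, where the sales.db file and its table layout are hypothetical (note SQLite uses ISO dates rather than Oracle's '01-JUN-2025' style):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("sales.db")  # hypothetical database file
    df = pd.read_sql(
        "SELECT product_id, SUM(sales_amount) AS total_sales "
        "FROM sales "
        "WHERE sale_date BETWEEN '2025-06-01' AND '2025-06-30' "
        "GROUP BY product_id",
        conn,
    )
    print(df.head())  # extracted data is now a DataFrame ready for analysis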

2. What is Data Cleaning?

Definition:​
Data cleaning (or data cleansing) is the process of identifying and correcting
(or removing) inaccurate, inconsistent, or incomplete data from a database
to improve data quality and integrity.
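A minimal data-cleaning sketch in Python with pandas; the customers.csv file and its columns are hypothetical:

    import pandas as pd

    df = pd.read_csv("customers.csv")                   # hypothetical input file
    df = df.drop_duplicates()                           # remove duplicate rows
    df["age"] = df["age"].fillna(df["age"].median())    # fill missing ages
    df = df[df["age"].between(0, 120)]                  # drop impossible ages
    df["email"] = df["email"].str.strip().str.lower()   # normalize text fields
    df.to_csv("customers_clean.csv", index=False)       # save the cleaned data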
