Data Analytics - Unit 1 (Detailed Notes)
1.1 Data Analytics: An Overview
Data Analytics is the process of collecting, organizing, analyzing, and interpreting data to extract useful
insights.
It helps in making informed decisions based on facts, patterns, and trends.
Importance of Data Analytics:
- Helps in improving business strategies.
- Identifies customer behavior and preferences.
- Reduces operational costs by optimizing processes.
- Assists in fraud detection and risk management.
- Supports scientific research and healthcare analytics.
Example: An e-commerce company like Amazon uses data analytics to recommend products, shorten
delivery times, and personalize the user experience.
1.2 Types of Data Analytics
1. Descriptive Analytics: What happened? (e.g., Monthly sales report)
2. Diagnostic Analytics: Why did it happen? (e.g., Low sales due to fewer leads)
3. Predictive Analytics: What is likely to happen? (e.g., Forecasting future sales)
4. Prescriptive Analytics: What should we do? (e.g., Recommend discount strategy)
5. Visual Analytics: Uses dashboards, graphs, and charts to display complex data in simple visual formats.
1.3 Life Cycle of Data Analytics
The Data Analytics Life Cycle involves the following phases:
1. Data Discovery: Identify the problem and define objectives.
2. Data Collection: Collect data from sources like sensors, web, databases, or files.
3. Data Preparation: Clean the data (remove errors, fill missing values).
4. Data Analysis: Use tools such as Excel, Python, or R to analyze the data.
5. Data Visualization: Represent data using charts, dashboards, or plots.
6. Actionable Insights: Generate reports and suggest improvements.
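The preparation, analysis, and insight phases above can be sketched in a few lines of Python. This is only a minimal illustration: the sales figures are made up, the `prepare` helper is hypothetical, and filling missing values with the mean is just one of several possible cleaning choices.

```python
from statistics import mean

# Hypothetical raw monthly sales; None marks a missing value, -1 an entry error.
raw_sales = [120, None, 135, -1, 150, 160]

def prepare(values):
    """Data Preparation: drop invalid entries, fill missing values with the mean."""
    valid = [v for v in values if v is not None and v >= 0]
    fill = mean(valid)
    # Keep missing entries (replaced by the fill value) and valid entries only.
    return [fill if v is None else v for v in values if v is None or v >= 0]

clean_sales = prepare(raw_sales)      # Data Preparation
average_sale = mean(clean_sales)      # Data Analysis

# Actionable Insights: a one-line report.
print(f"Cleaned records: {len(clean_sales)}, average sales: {average_sale:.2f}")
```

In practice the cleaning rule (drop, fill with the mean, fill with the median, and so on) depends on the problem defined in the Data Discovery phase.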
1.4 Types of Data & Statistical Concepts
Types of Data:
- Structured: Organized in rows and columns (e.g., Excel, SQL database).
- Unstructured: Raw, not organized (e.g., emails, videos, social media).
- Semi-Structured: Partially organized using tags or keys (e.g., XML, JSON, documents in NoSQL databases).
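The difference between structured and semi-structured data shows up as soon as the data is read into a program. The sketch below uses Python's standard `csv` and `json` modules; the names and fields in the sample text are invented for illustration.

```python
import csv
import io
import json

# Structured: every row has the same fixed columns.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: records share some keys, but fields may be nested or missing.
json_text = '[{"id": 1, "name": "Asha", "tags": ["new"]}, {"id": 2, "name": "Ravi"}]'
records = json.loads(json_text)

print(rows[0]["city"])               # fixed column, present in every row
print(records[1].get("tags", []))    # optional field, so a default is needed
```

Unstructured data (such as free-text emails or video) has no such keys or columns and usually needs extra processing before it can be analyzed this way.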
Measures of Central Tendency:
- Mean: Average of all values.
- Median: Middle value in a sorted list.
- Mode: Value that appears most frequently.
Measures of Dispersion:
- Range: Difference between the highest and lowest value.
- Variance: Average of squared differences from the mean.
- Standard Deviation: Square root of variance; tells how spread out the data is.
Example:
For values 5, 7, 7, 8, 9:
- Mean = 36 / 5 = 7.2, Median = 7, Mode = 7, Range = 9 - 5 = 4
- Variance = 8.8 / 5 = 1.76, Standard Deviation = square root of 1.76 ≈ 1.33
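The same example can be checked with Python's standard `statistics` module. The population versions (`pvariance`, `pstdev`) are used here because they match the definition above (average of squared differences); that choice is an assumption about which variance the notes intend.

```python
import statistics

values = [5, 7, 7, 8, 9]

mean = statistics.mean(values)            # 7.2
median = statistics.median(values)        # 7
mode = statistics.mode(values)            # 7
value_range = max(values) - min(values)   # 4

# Population variance: average of squared differences from the mean.
variance = statistics.pvariance(values)   # 1.76
std_dev = statistics.pstdev(values)       # about 1.33

print(mean, median, mode, value_range, variance, round(std_dev, 2))
```

If the five values are treated as a sample rather than the whole population, `statistics.variance` divides by n - 1 instead of n and gives 2.2.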
1.5 Sampling Concepts & Probability
Sampling Funnel: The step-by-step narrowing from a population to a smaller subset (sample), which saves the
time and cost of studying every member. The sample should be drawn at random and be representative of the
population.
Central Limit Theorem (CLT): When the sample size is large (a common rule of thumb is 30 or more), the
sampling distribution of the sample mean becomes approximately normal, even if the original data is not
normally distributed.
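A small simulation can make the theorem concrete. The sketch below is an illustration under assumptions: the population is artificial (exponential, which is strongly skewed), the `sample_means` helper is hypothetical, and the gap between mean and median is used only as a rough symmetry check, not a formal normality test.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# A skewed, non-normal population: exponential values with mean 1.
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(sample_size, n_samples=2_000):
    """Draw repeated random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(n_samples)
    ]

small = sample_means(2)    # tiny samples: means are still skewed
large = sample_means(50)   # larger samples: means look far more symmetric

# Rough symmetry check: for a normal shape, mean and median nearly coincide.
gap_small = abs(statistics.fmean(small) - statistics.median(small))
gap_large = abs(statistics.fmean(large) - statistics.median(large))

print("population mean:", statistics.fmean(population))
print("mean of sample means (n=50):", statistics.fmean(large))
print("spread of sample means:", statistics.stdev(small), statistics.stdev(large))
print("mean-median gap:", gap_small, gap_large)
```

Each entry in `small` and `large` is the mean of a different random sample, so the spread of those lists is also a direct picture of how much results vary from sample to sample.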
Confidence Interval (CI): A range, computed from sample data, in which a population value (such as the mean)
is expected to fall with a stated confidence level (e.g., 95%).
Sampling Variation: Different samples drawn from the same population may produce slightly different results.
Example:
If the average test score of a class (sample) is 70 with a margin of error of ±5 at the 95% confidence level,
then the 95% confidence interval is 65 to 75: we are 95% confident that the real population mean lies in
this range.
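A confidence interval like this can be computed from raw scores. The sketch below is a minimal example under stated assumptions: the ten scores are made up (chosen so the sample mean is 70), and a normal approximation with a z-value of about 1.96 is used. For a sample this small, a t-distribution would normally give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical test scores from a sample of 10 students (made-up data).
scores = [62, 75, 68, 71, 80, 66, 73, 59, 77, 69]

n = len(scores)
sample_mean = statistics.mean(scores)

# Standard error of the mean: sample standard deviation / sqrt(n).
std_error = statistics.stdev(scores) / math.sqrt(n)

# z-value for a 95% confidence level (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

margin = z * std_error
lower, upper = sample_mean - margin, sample_mean + margin
print(f"mean = {sample_mean}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

A larger sample shrinks the standard error, so the interval becomes narrower at the same confidence level.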