Data and Excel for Analytics
Types of Data
1. Structured Data
o Well-organized, stored in tables with rows and columns.
o Easy to store, search, and analyze.
o Example: Sales records, employee databases, bank transactions.
o In Excel: Appears as spreadsheets with headers like Name, Age, Salary.
2. Semi-structured Data
o Does not fit neatly into tables, but still has some organizational tags or
markers.
o Examples: JSON files, XML data from web APIs.
o In Excel: Can be imported and converted into tables using Power Query.
3. Unstructured Data
o No fixed format; messy and harder to analyze.
o Examples: Videos, audio files, social media comments, images, emails.
o In Excel: Needs preprocessing or text mining tools before analysis.
In short:
• Structured = Tables
• Semi-structured = Tagged data
• Unstructured = Freeform/messy data
Modifying Data in Excel
Excel provides many built-in tools for cleaning and preparing data for analytics:
• Sorting & Filtering:
o Sorting arranges data in ascending/descending order (e.g., sort sales by
amount).
o Filtering shows only relevant rows (e.g., filter customers from “Chennai”).
• VLOOKUP / XLOOKUP:
o Searches for a value in one column and returns related info from another
column.
o Example: Find employee salary by looking up their ID.
• Pivot Tables:
o Summarize large datasets quickly.
o Example: Show total sales by region or average marks by class.
• Data Cleaning Tools:
o Remove duplicates.
o Replace or fill missing values.
o Use “Text to Columns” to split combined data (like “FirstName LastName”).
These steps ensure data is clean, consistent, and ready for analysis.
Creating Distributions from Data
• Histograms (Frequency Distribution):
o Show how often values fall into ranges (bins).
o Example: Number of students scoring 0–10, 11–20, … in an exam.
o In Excel: Insert → Charts → Histogram.
• Probability Distributions:
o Represent the likelihood of outcomes.
o Example: Demand for a product may follow a Normal distribution.
o In Excel: Use NORM.DIST(), BINOM.DIST() functions for probability
calculations.
Histograms = visualize past data; Probability distributions = forecast future outcomes.
Measures of Location (Central Tendency)
1. Mean (Average):
o Formula: =AVERAGE(range)
o Example: If sales = {10, 20, 30}, mean = (10+20+30)/3 = 20.
o Use: General trend, but sensitive to outliers.
2. Median (Middle Value):
o Formula: =MEDIAN(range)
o Example: For {10, 20, 500}, median = 20 (better than mean when outliers
exist).
3. Mode (Most Frequent Value):
o Formula: =MODE.SNGL(range)
o Example: For {2, 2, 3, 4}, mode = 2.
o Use: Identifies most common category/value.
4. Percentiles/Quartiles:
o Formula: =PERCENTILE.EXC(range, k) or =QUARTILE.EXC(range, q)
o Example: 90th percentile = score below which 90% of students fall.
o Quartiles divide data into 4 equal parts (Q1, Q2, Q3).
Together, these measures help describe the center and spread of data.
Summary
• Data Types: Structured (tables), Semi-structured (XML/JSON), Unstructured
(media/text).
• Excel Modifications: Sorting, Filtering, VLOOKUP, Pivot Tables, Data Cleaning.
• Distributions: Histograms (past frequency), Probability Distributions (forecasting).
• Measures of Location: Mean, Median, Mode, Percentiles/Quartiles.