Descriptive Analytics
Descriptive analytics is the simplest form of analytics. It mainly uses simple descriptive statistics, data visualization techniques, and
business-related queries to understand past data. One of its primary objectives is to find innovative ways of summarizing data.
Descriptive analytics is used to understand trends in past data, which can be useful for generating insights.
Various tools and techniques are used to describe the data. Descriptive statistics such as measures of central tendency, measures of
variation, and measures of shape can provide useful insights. Plots such as the histogram, bar chart, pie chart, box plot,
scatter plot, and tree diagram can provide insights about past data and assist further analysis by generating new
hypotheses.
Descriptive analytics is about finding “what has happened” by summarizing the data using innovative methods and analysing past
data using simple queries. Analysing past data can provide insights that help organizations make appropriate decisions.
Trends obtained through descriptive analytics can be used to derive actionable items. For example, when Hurricane Charley struck
the U.S. in 2004, Linda M. Dillman, Walmart’s Chief Information Officer, wanted to understand the purchasing behaviour of
Walmart's customers (Hays, 2004). Using data mining techniques, Walmart found that the demand for strawberry pop-tarts went up over 7 times
during the hurricane compared to the normal sales rate; the pre-hurricane top-selling item was found to be beer. Walmart used these
insights when the next hurricane (Hurricane Frances) hit the U.S. in August−September 2004; most of the items
predicted by Walmart sold quickly. Although the high pre-hurricane demand for beer can be predicted intuitively, the demand for
strawberry pop-tarts was a complete surprise.
DATA TYPES AND SCALES
Structured and Unstructured Data
Data at a macro level can be classified as structured or unstructured. Structured data is data described in matrix form with
labelled rows and columns. Any data that is not originally in matrix form with rows and columns is unstructured data. Examples include
e-mails, click streams, textual data, images (photos and images generated by medical devices), log data, and videos. Machine-generated data
such as satellite images, magnetic resonance imaging (MRI), electrocardiogram (ECG), and thermography are a few more examples of
unstructured data.
Cross-sectional, Time Series, and Panel Data
Another important classification is based on the type of data collected. On this basis, data is grouped into the
following three classes:
1. Cross-Sectional Data: Data collected on many variables of interest at the same time or over the same duration is called cross-sectional data. For
example, consider data on movies released during the year 2017, such as budget, box-office collection, actors, directors, and genre.
2. Time Series Data: Data collected for a single variable, such as demand for smartphones, over several time intervals (weekly,
monthly, etc.) is called time series data.
3. Panel Data: Data collected on several variables (multiple dimensions) over several time intervals is called panel data (also known as
longitudinal data). An example of panel data is data collected on variables such as gross domestic product (GDP), Gini index, and unemployment
rate for several countries over several years.
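As a rough illustration of how the three classes differ in shape, the sketch below stores each one in plain Python structures. All names and numbers (Movie A, Country A, the demand figures, and so on) are made up for illustration and do not come from any real data set.

```python
# Cross-sectional data: many variables for many units in one period (one year here).
cross_sectional = [
    {"movie": "Movie A", "genre": "Drama", "budget": 10.0, "box_office": 25.0},
    {"movie": "Movie B", "genre": "Comedy", "budget": 8.0, "box_office": 12.0},
]

# Time series data: one variable observed over several time intervals.
monthly_demand = {"2017-01": 120, "2017-02": 135, "2017-03": 128}

# Panel data: several variables for several units over several time intervals.
# Each key is a (unit, period) pair.
panel = {
    ("Country A", 2016): {"gdp_growth": 2.1, "unemployment": 5.0},
    ("Country A", 2017): {"gdp_growth": 2.4, "unemployment": 4.8},
    ("Country B", 2016): {"gdp_growth": 1.2, "unemployment": 7.5},
    ("Country B", 2017): {"gdp_growth": 1.5, "unemployment": 7.1},
}

# Fixing one unit and one variable in the panel gives a time series ...
country_a_unemployment = {
    year: row["unemployment"]
    for (country, year), row in panel.items()
    if country == "Country A"
}

# ... and fixing one period gives a cross-section.
cross_section_2017 = {
    country: row for (country, year), row in panel.items() if year == 2017
}

print(country_a_unemployment)
print(cross_section_2017)
```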
TYPES OF DATA MEASUREMENT SCALES
Structured data can be either numeric or alphanumeric and may follow different scales of measurement (level of measurement). It is
important to understand the type of variables within the data with respect to the measurement scale since the model specification while
building analytics models such as regression may depend on the scale of measurement.
Nominal Scale (Qualitative Data)
Nominal scale refers to variables that are basically names (qualitative data); such variables are also known as categorical variables. For example,
variables such as marital status (single, married, divorced) and industry type (manufacturing, healthcare, banking and finance) fall
under the nominal scale. During data collection, it is usual to assign a numerical code to each category of a nominal variable.
Ordinal Scale
An ordinal scale applies to a variable whose values come from an ordered set, recorded in order of magnitude.
For example, many surveys use a Likert scale. A Likert scale is finite (usually a 5-point scale), and the data collector defines
the order of preference. For example, assume that feedback on a training program is collected using a 5-point Likert scale in
which 1 = Poor, 2 = Fair, 3 = Good, 4 = Very Good, and 5 = Excellent.
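The sketch below shows one way to summarize ordinal data, using the 5-point scale above. The response list is hypothetical. Because the codes are ordered, the median and mode are meaningful summaries; the gaps between codes are not necessarily equal, so an average of the codes is harder to interpret.

```python
import statistics

# Ordered categories of the 5-point Likert scale described in the text.
LIKERT = {1: "Poor", 2: "Fair", 3: "Good", 4: "Very Good", 5: "Excellent"}

# Hypothetical feedback codes from nine participants.
responses = [4, 5, 3, 4, 2, 5, 4, 3, 4]

# median_low always returns a value that was actually observed,
# so the result maps back to a valid category label.
median_code = statistics.median_low(responses)
mode_code = statistics.mode(responses)

print("Median rating:", LIKERT[median_code])
print("Most common rating:", LIKERT[mode_code])
```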
Interval Scale
Interval scale corresponds to a variable whose value is chosen from an interval set. Variables such as temperature measured in
degrees Celsius (°C) or intelligence quotient (IQ) score are examples of interval scale. On an interval scale, ratios do not make sense. For
example, 40°C is not twice as hot as 20°C. Similarly, a person with an IQ score of 160 is not twice as smart as a person with an IQ
score of 80. Differences, however, are meaningful: 40°C is 20°C more than 20°C, and an IQ score of 160 is 80 more than an IQ score of 80.
Ratio Scale
Any variable for which ratios can be computed and are meaningful is said to be on a ratio scale. Most variables come under this type; for
example, demand for a product, market share of a brand, sales, salary, and so on. If Ms Hawai Sundari’s salary is 40,000 per month
and Ms Dawai Sundari’s salary is 90,000 per month, then we can say that Dawai Sundari earns 2.25 times the salary of Hawai
Sundari.
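The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses the temperature and salary figures from the text; the conversion to Kelvin is added here as an illustration of a temperature scale with a true zero and is not part of the original example.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15

# Interval scale: differences are meaningful, ratios are not.
difference = 40 - 20   # 20 degrees, a meaningful statement
naive_ratio = 40 / 20  # 2.0, but only because of where 0 degrees Celsius is placed
kelvin_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Ratio scale: a salary of zero means "no salary", so the ratio is meaningful.
salary_ratio = 90_000 / 40_000

print(f"Naive Celsius ratio: {naive_ratio:.2f}")
print(f"Ratio on the Kelvin scale: {kelvin_ratio:.3f}")
print(f"Salary ratio: {salary_ratio:.2f}")
```

The same two temperatures give a ratio of 2 in Celsius but only about 1.07 in Kelvin, which is why ratios on an interval scale should not be interpreted.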
POPULATION AND SAMPLE
Population is the set of all possible observations (often called cases, records, subjects or data points) for a given context
of the problem. The size of the population can be very large in many cases. For example, in 2014, close to 834.08 million
people were eligible to vote in the Indian general elections (Source: Election Commission of India). Thus, the population
of voters in 2014 was 834.08 million, which included all eligible voters. During every election, media and other
organizations collect data through opinion polls to predict the likely winner (and they rarely get it right, owing to the
complexities of collecting the right sample). It is practically impossible to collect data from
all 834.08 million eligible voters about their choice of candidate, so opinion polls are based on the opinions expressed by a
subset of voters called a sample.
Population (also known as the universal set) is the set of all possible data for a given context, whereas a sample is a subset
taken from the population. In many analytical problems, we make inferences about the population based on the sample data.
There are many challenges in sampling (the process of selecting observations from the population). An incorrect sample
may result in bias and incorrect inferences about the population.
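A minimal sketch of drawing a simple random sample and using it to estimate a population value is given below. The population is entirely hypothetical (100,000 voters, of whom 52% prefer candidate A), and simple random sampling is only one of several sampling schemes.

```python
import random

random.seed(42)  # fixed seed so that the sketch is repeatable

# Hypothetical population: 1 = prefers candidate A, 0 = prefers someone else.
population = [1] * 52_000 + [0] * 48_000
population_share = sum(population) / len(population)

# Simple random sample of 1,000 voters, drawn without replacement.
sample = random.sample(population, k=1_000)
sample_share = sum(sample) / len(sample)

print(f"Population share for A: {population_share:.3f}")
print(f"Sample share for A:     {sample_share:.3f}")
```

The sample share will usually be close to, but not exactly equal to, the population share. A sample that is not drawn at random (for example, only from one region) could be far off, which is the bias mentioned above.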
MEASURES OF CENTRAL TENDENCY
Measures of central tendency are the measures that are used for describing the data using a single value. Mean, median and mode are
the three measures of central tendency and are frequently used to compare different data sets. Measures of central tendency help users to
summarize and comprehend the data.
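The three measures can be computed with Python's standard `statistics` module, as in the sketch below. The data set is hypothetical.

```python
import statistics

# Hypothetical data: number of items bought by six customers.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # arithmetic average: 30 / 6
median = statistics.median(data)  # middle value; here the average of 3 and 5
mode = statistics.mode(data)      # most frequently occurring value

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
```

Here the mean (5) is larger than the median (4) because the single large value, 10, pulls the average upward; the median is less sensitive to such extreme values.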
PERCENTILE, DECILE, AND QUARTILE
Percentile, decile and quartile are frequently used to identify the position of the observation in the data set. Percentile score is frequently
used in education to identify the position of a student in the group. Another frequent application of percentile is the percentile life used
in asset management.
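The sketch below computes quartiles, deciles, and percentiles for a hypothetical set of test scores using `statistics.quantiles`. The helper `percentile_rank` is an illustrative function written for this example, not a library call. Several conventions exist for interpolating between observations, so other tools may report slightly different cut points for the same data.

```python
import statistics

# Hypothetical test scores of 11 students.
scores = [50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

quartiles = statistics.quantiles(scores, n=4)       # 3 cut points: Q1, Q2, Q3
deciles = statistics.quantiles(scores, n=10)        # 9 cut points
percentiles = statistics.quantiles(scores, n=100)   # 99 cut points

def percentile_rank(values, score):
    """Percentage of observations that are less than or equal to `score`."""
    at_or_below = sum(1 for v in values if v <= score)
    return 100 * at_or_below / len(values)

print("Quartiles:", quartiles)
print("Percentile rank of a score of 90:", round(percentile_rank(scores, 90), 1))
```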
MEASURES OF VARIATION
One of the primary objectives of analytics is to understand the variability in the data. Predictive analytics techniques such as regression
attempt to explain variation in the outcome variable (Y) using predictor variables (X). Variability in the data is measured using the
following measures:
1. Range
2. Inter-Quartile Distance (IQD)
3. Variance
4. Standard Deviation
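The four measures listed above can be computed as in the following sketch, which uses a hypothetical data set and Python's `statistics` module. Note that variance and standard deviation come in two versions: the population version divides by n, while the sample version divides by n − 1.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# 1. Range: distance between the largest and smallest value.
data_range = max(data) - min(data)

# 2. Inter-quartile distance: distance between the 3rd and 1st quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqd = q3 - q1

# 3. Variance and 4. standard deviation, population and sample versions.
population_variance = statistics.pvariance(data)
population_sd = statistics.pstdev(data)
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

print("Range:", data_range)
print("IQD:", iqd)
print("Population variance and SD:", population_variance, population_sd)
print("Sample variance and SD:", round(sample_variance, 3), round(sample_sd, 3))
```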
MEASURES OF SHAPE − SKEWNESS AND KURTOSIS
Skewness is a measure of symmetry or lack of symmetry.
Kurtosis is another measure of shape. It describes the tails of the data distribution, that is, whether they are heavy or light.
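The text does not give formulas for these measures. The sketch below assumes the common moment-based definitions (skewness as the third central moment divided by the 1.5th power of the second, and excess kurtosis as the fourth central moment divided by the square of the second, minus 3); other definitions with small-sample corrections also exist. The data sets are hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so that a normal distribution scores about 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("Skewness (symmetric):", round(skewness(symmetric), 3))
print("Skewness (right-skewed):", round(skewness(right_skewed), 3))
print("Excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 3))
```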
DATA VISUALIZATION
Data visualization is an integral part of descriptive analytics and provides decision makers with useful insights. Many
charts, such as the histogram, bar chart, pie chart, and box plot, help data scientists visualize the data.
Data visualization is crucial in making sense of data, especially in large volumes, for the following reasons:
1. Simplifies Complex Data
● Visual representations (like charts, graphs, maps) help convert complex datasets into easily understandable formats. This is
particularly helpful when dealing with large amounts of data that are hard to interpret just by looking at raw numbers.
2. Faster Decision Making
● Well-crafted visualizations allow decision-makers to quickly grasp insights and trends, leading to quicker and more informed
decisions. For example, sales trends can be identified at a glance from a line graph.
3. Identifies Trends and Patterns
● Patterns, correlations, or anomalies that are otherwise hard to see become apparent when data is visualized. Tools like scatter
plots or line charts reveal trends over time or relationships between variables.
4. Improves Data Storytelling
● Visualization helps in telling a story with data, making it more engaging and easier for audiences to remember key insights. It allows data
to be presented in a narrative format that resonates with stakeholders.
5. Enhances Data Accuracy
● By visualizing data, it becomes easier to spot errors, inconsistencies, or outliers that could skew analysis, helping ensure data accuracy and
reliability.
6. Supports Predictive Analytics
● Predictive trends or forecasting can be better understood with visual aids, as they clearly present historical data and predictions about future
outcomes, such as with time series charts.
7. Engages Audience
● Visual elements are more engaging than plain text or numbers. They can hold the attention of an audience and make presentations more
impactful, ensuring better communication of key findings.
8. Better Comparison
● Charts and graphs enable easier comparison between categories, time periods, or variables. Bar charts, pie charts, and histograms are well suited
to showing comparisons in sales, performance, or demographic data.
9. Facilitates Collaboration
● When multiple teams are involved in decision-making, visualizing data ensures that everyone, regardless of expertise, can understand the
data and contribute effectively.
Histogram
A histogram is a visual representation of data that can be used to assess the probability distribution (frequency
distribution) of the data. It is a frequency distribution of data arranged in consecutive, non-overlapping intervals.
Histograms are created for continuous (numerical) data. The following steps are used in constructing histograms:
1. Divide the data into finite number of non-overlapping and consecutive bins (intervals). The total number of bins to be
used can be calculated using Eqs
2. Count the number of observations from the data that fall under each bin (interval).
3. Create a frequency distribution (bin in the horizontal axis and frequency in
the vertical axis) using the information obtained in steps 1 and 2.
A histogram is very useful because it helps data scientists identify the following:
1. The shape of the distribution, which helps in assessing the probability distribution of the data.
2. Measures of central tendency such as median and mode.
3. Measures of variability such as spread.
4. Measures of shape such as skewness.
Histograms are also useful in identifying the presence of outliers. One of the first steps in constructing a histogram is
identifying the number of bins. Many different formulas for choosing the number of bins are used in the literature.
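The formula referred to in this section is not shown in the text. One simple and widely used choice is Sturges' rule, N = 1 + 3.322 × log10(n), where n is the number of observations; treating it as the intended formula is an assumption. The sketch below applies that rule and then follows the construction steps above: divide the range into equal-width bins, count the observations in each bin, and display the frequencies. The data and function names are hypothetical.

```python
import math

def sturges_bins(n):
    """Number of bins by Sturges' rule, 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))

def histogram_counts(data, num_bins):
    """Count observations in equal-width, consecutive, non-overlapping bins."""
    low, high = min(data), max(data)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for x in data:
        if width == 0:
            index = 0  # all values identical: everything goes in the first bin
        else:
            # The maximum value would land just past the last bin, so clamp it.
            index = min(int((x - low) / width), num_bins - 1)
        counts[index] += 1
    return counts, width

# Hypothetical ages of 20 customers.
ages = [22, 25, 27, 29, 31, 33, 34, 35, 36, 38,
        39, 40, 41, 43, 45, 47, 50, 52, 58, 64]

num_bins = sturges_bins(len(ages))
counts, width = histogram_counts(ages, num_bins)

# Text histogram: bins on the left, frequency shown as a row of marks.
for i, count in enumerate(counts):
    start = min(ages) + i * width
    print(f"{start:5.1f} to {start + width:5.1f} | {'#' * count}")
```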
Bar Chart
A bar chart uses rectangular bars to compare different categories. The length of each bar represents the value of the category. Bars can
be oriented vertically or horizontally.
A bar chart is a frequency chart for a qualitative (categorical) variable. Histograms cannot be used when the variable is
qualitative. A bar chart can be used to identify the most frequently and least frequently occurring categories within a data set.
How it's useful in Data Analytics:
Bar charts are great for comparing quantities across different categories, helping analysts easily identify the highest and lowest values.
Example:
You want to compare the sales of different products: Product A, B, and C.
Each product has a bar showing its sales volume.
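The counting behind a bar chart can be sketched as follows. The sales records are hypothetical, and the bars are drawn as rows of text marks rather than with a plotting library.

```python
from collections import Counter

# Hypothetical sales records: one entry per unit sold.
sales = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

# Frequency of each category.
frequency = Counter(sales)

# Text bar chart, with categories ordered from most to least frequent.
for product, count in frequency.most_common():
    print(f"Product {product} | {'#' * count} {count}")

most_sold = frequency.most_common(1)[0][0]
least_sold = frequency.most_common()[-1][0]
print("Most sold:", most_sold, "- least sold:", least_sold)
```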
Pie Chart
A pie chart is a circular chart, mainly used for categorical data, that displays the proportion of each category in the data set. It
helps to visualize the proportion (percentage) of each category as a sector of a circle. The pie chart for the movie genre based on
the Bollywood movie data set is shown in Figure
Example:
● If you want to show how a company's revenue is split across different
departments (e.g., marketing, sales, R&D), each slice of the pie will
represent the percentage of total revenue for each department.
How it's useful in Data Analytics:
● Pie charts are helpful in visualizing the relative proportions of parts to
the whole, making it easier to understand distributions.
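Each sector of a pie chart is sized by converting a category's value into a share of the total and then into a central angle (share × 360°). The sketch below shows this calculation for hypothetical revenue figures.

```python
# Hypothetical revenue by department, all in the same currency unit.
revenue = {"Marketing": 30, "Sales": 50, "R&D": 20}

total = sum(revenue.values())

# Central angle of each sector, in degrees.
angles = {name: 360 * amount / total for name, amount in revenue.items()}

for department, amount in revenue.items():
    share = amount / total
    print(f"{department:10s} {share:6.1%} {angles[department]:6.1f} degrees")
```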
Scatter Plot
A scatter plot is a plot of two variables that helps data scientists understand whether there is any relationship between them. The
relationship could be linear or non-linear. A scatter plot is also useful for assessing the strength of the relationship and for finding
outliers in the data.
Example:
● If you plot the height and weight of students, each student is represented by a point on the scatter plot, showing the relationship
between height and weight.
How it's useful in Data Analytics:
● Scatter plots are useful for identifying correlations between
variables and spotting trends or clusters in data.
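The strength of a linear relationship seen in a scatter plot is often summarized by the Pearson correlation coefficient, which is an addition to the text rather than something it prescribes. The sketch below computes it by hand for hypothetical height and weight data. A value near +1 or −1 indicates a strong linear relationship; a non-linear relationship may give a low value even when the plot shows a clear pattern.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical heights (cm) and weights (kg) of six students.
heights = [150, 155, 160, 165, 170, 175]
weights = [45, 50, 54, 61, 65, 72]

r = pearson_r(heights, weights)
print(f"Correlation between height and weight: {r:.3f}")
```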
Coxcomb Chart
A Coxcomb chart (also known as a polar area chart or rose chart) is an extension of the pie chart made popular by Florence Nightingale (Lewi, 2006). In a
Coxcomb chart, the area of each sector represents the magnitude of the category. The main difference between a regular pie chart and a Coxcomb chart is that
in a pie chart every sector has the same radius and the angle varies, whereas in a Coxcomb chart the radius of each sector is adjusted so that its area
reflects the magnitude.
Florence Nightingale collected data from the Crimean war (fought between the British and French on one side and the Russians on the other) on causes of
mortality among soldiers. She classified the causes into three categories:
1. Preventable diseases
2. Wounds sustained in the war
3. Other causes
In Figure (originally prepared by Florence Nightingale), the largest area of the chart corresponds to the cause ‘preventable diseases’.
How it's useful in Data Analytics:
● Coxcomb charts are useful when you want to compare categories over time or when pie charts might not visually differentiate categories
well.
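The radius adjustment described above can be worked out from the area of a circular sector, which is 0.5 × r² × θ for radius r and angle θ in radians. The sketch below gives every category the same angle and solves for the radius that makes each sector's area equal to its value. The counts are hypothetical and are not Nightingale's actual figures.

```python
import math

# Hypothetical counts for the three categories of causes.
values = {"Preventable diseases": 900, "Wounds": 100, "Other causes": 225}

# Every sector gets the same angle (in radians).
angle = 2 * math.pi / len(values)

def sector_radius(value, angle):
    """Radius for which the sector area, 0.5 * r**2 * angle, equals `value`."""
    return math.sqrt(2 * value / angle)

for name, value in values.items():
    r = sector_radius(value, angle)
    area = 0.5 * r ** 2 * angle
    print(f"{name:22s} radius = {r:6.2f}  area = {area:7.1f}")
```

Because area grows with the square of the radius, a category that is 9 times larger gets a radius only 3 times longer.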
Box Plot (or Box and Whisker Plot)
A box plot (also known as a box and whisker plot) is a graphical representation of numerical data that can be used to understand the variability of
the data and the existence of outliers. A box plot is constructed from the following descriptive statistics:
1. Lower quartile (1st quartile), median, and upper quartile (3rd quartile).
2. Lowest and highest value.
3. Inter-quartile range (IQR).
Example:
● You want to compare test scores in a class. A box plot would show the median score, the range of the middle 50% of scores,
and any outliers.
How it's useful in Data Analytics:
● Box plots are helpful for showing the spread and identifying outliers in data, giving a quick overview of the data distribution.
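The statistics behind a box plot can be computed as in the sketch below, using hypothetical test scores. The text does not say how outliers are identified; the sketch assumes the common convention that any value more than 1.5 × IQR below the lower quartile or above the upper quartile is an outlier.

```python
import statistics

# Hypothetical test scores for a class of 11 students.
scores = [20, 61, 64, 66, 68, 70, 72, 75, 78, 80, 82]

# The box: lower quartile, median, upper quartile.
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

# The whiskers reach the most extreme values that are not outliers.
inside = [s for s in scores if lower_fence <= s <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```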
Treemap
A treemap is a hierarchical map made up of nested rectangles. It is frequently used in business intelligence reports and helps
organizations understand data hierarchically. To construct a treemap, the data should be hierarchical with several levels. The size
and colour of each rectangle are used to describe or differentiate the characteristics of the data.
● A treemap is a chart that displays data in nested rectangles, where the size of each rectangle is proportional to the value it
represents.
Example:
● If you want to show the market share of different smartphone brands, each brand would be a rectangle, and the size would
indicate its market share.
How it's useful in Data Analytics:
● Treemaps are excellent for displaying hierarchical data or comparing the
relative size of different categories in a clear and visual way.
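The idea of nested rectangles with areas proportional to values can be sketched with a very simple layout: split the canvas into side-by-side strips for the top level, then split each strip into stacked pieces for the next level. This is a deliberately simplified layout, not the algorithm used by any particular business intelligence tool, and the market-share numbers and names are hypothetical.

```python
# Hypothetical two-level hierarchy: market share (%) by brand, then by model line.
market = {
    "Brand A": {"Budget": 10, "Flagship": 30},
    "Brand B": {"Budget": 25, "Flagship": 15},
    "Brand C": {"Budget": 20},
}

def split(values, x, y, width, height, vertical):
    """Split a rectangle into strips whose areas are proportional to `values`.

    Returns a dict mapping each name to an (x, y, width, height) tuple.
    """
    total = sum(values.values())
    rects = {}
    for name, value in values.items():
        share = value / total
        if vertical:  # side-by-side strips
            rects[name] = (x, y, width * share, height)
            x += width * share
        else:         # stacked strips
            rects[name] = (x, y, width, height * share)
            y += height * share
    return rects

# Level 1: one strip per brand on a 100 x 100 canvas.
brand_totals = {brand: sum(models.values()) for brand, models in market.items()}
brand_rects = split(brand_totals, 0, 0, 100, 100, vertical=True)

# Level 2: nest each brand's model lines inside that brand's rectangle.
for brand, (x, y, w, h) in brand_rects.items():
    model_rects = split(market[brand], x, y, w, h, vertical=False)
    for model, (mx, my, mw, mh) in model_rects.items():
        print(f"{brand} / {model}: area = {mw * mh:.0f}")
```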