Unit I-V
Editors
Ms. Asha Yadav
Dr. Charu Gupta
Content Writers
Lavkush Gupta
Content Reviewer from the DDCE/COL/SOL
Dr. Reema Thareja
Ms. Aishwarya Anand Arora
Academic Coordinator
Mr. Deekshant Awasthi
Published by:
Department of Distance and Continuing Education
Campus of Open Learning/School of Open Learning,
University of Delhi, Delhi-110 007
Printed by:
School of Open Learning, University of Delhi
Printed at: Taxmann Publications Pvt. Ltd., 21/35, West Punjabi Bagh, New Delhi - 110026 (500 Copies, 2025)
SYLLABUS
Data Interpretation and Visualization using Python
Unit-I: Introduction: Motivation for using Python for Data Visualization, Essential Python. Libraries: NumPy, Pandas, Matplotlib, Import and Export of Data, Import and Export of data using files.
Syllabus Mapping: Lesson 1: Introduction to Data Analysis; Lesson 2: Statistical Foundations & Python Libraries (Pages 3–48)

Unit-II: Array manipulation using Numpy: Numpy array: Creating Numpy arrays, Data Types for Numpy arrays, Arithmetic with NumPy Arrays, Basic Indexing and Slicing, swapping axes, transposing arrays.
Syllabus Mapping: Lesson 3: NumPy: The Art of Array Manipulation (Pages 51–77)

Unit-III: Data Manipulation using Pandas: Data Structures in Pandas: Series, DataFrame, Index objects, Loading data into Pandas DataFrame. Working with DataFrames. Grouped and aggregate calculations.
Syllabus Mapping: Lesson 4: Pandas Power Play: Mastering Data Manipulation (Pages 81–126)

Unit-IV: Plotting and Visualization: Using matplotlib to plot data: figures, subplots, markings, colour and line styles, labels and legends, plotting functions in Pandas: Line, bar, Scatter plots, histograms, stacked bars, boxplot.
Syllabus Mapping: Lesson 5: Plotting Perfection: Mastering Plotting & Visualization (Pages 129–151)

Unit-V: Data Aggregation and Group operations: Group by Mechanics, Data aggregation, General split-apply-combine, Pivot tables and cross tabulation.
Syllabus Mapping: Lesson 6: Data Unification: Exploring Aggregation and Grouping (Pages 155–171)
CONTENTS
UNIT I
LESSON 1 INTRODUCTION TO DATA ANALYSIS 3–17
LESSON 2 STATISTICAL FOUNDATIONS & PYTHON LIBRARIES 19–48
UNIT II
LESSON 3 NUMPY: THE ART OF ARRAY MANIPULATION 51–77
UNIT III
LESSON 4 PANDAS POWER PLAY: MASTERING DATA
MANIPULATION 81–126
UNIT IV
LESSON 5 PLOTTING PERFECTION: MASTERING PLOTTING &
VISUALIZATION 129–151
UNIT V
LESSON 6 DATA UNIFICATION: EXPLORING AGGREGATION AND
GROUPING 155–171
LESSON 1 INTRODUCTION TO DATA ANALYSIS
Structure
1.1 Learning Objectives
1.2 Introduction
1.3 What is Data?
1.4 Significance of Data
1.5 Why Data Analysis?
1.6 Types of Data
1.7 Sources of Data Collection
1.8 Data Preparation
1.9 Exploratory Data Analysis
1.10 Summary
1.11 Glossary
1.12 Answers to In-text Questions
1.13 Self-Assessment Questions
1.14 References
1.15 Suggested Readings
1.2 INTRODUCTION
The word ‘data’ is the plural form of the word ‘datum’, which means a single piece of information. The word ‘data’ is used to represent the raw information of any organisation, company, population, etc. Raw information is information that is not yet organised into a usable form but still carries meaning. The scientific meaning of data is “the facts and statistics collected for reference or analysis”.
Data always have significance because we can extract meaningful information from a collection of data. We are all living in the digital era and have grown up with huge amounts of data; without data, we could not meet our daily needs. Data is like fuel for us. Every day, we generate huge amounts of data by using smart devices like smartwatches and smart TVs, various mobile applications on phones, purchasing and selling items, billing or payments, e-mail writing, and various social media platforms (Facebook, Instagram, WhatsApp, etc.).
1.5 WHY DATA ANALYSIS?
Data analysis is a basic need for every organisation. Every organisation, such as a school, a business, a technology firm, or a hospital, has bulks of information related to its day-to-day operations. This information is called data. Typically, data are available in a raw format, so the raw data must be organised into a specific format that is useful to people. To do this, we need to apply data handling methods such as data collection, data preparation and data analysis repetitively. Data analysis helps us in making important decisions, predictions, operations, and improvements as per the existing knowledge domain. Data analysis has become an inevitable part of all business operations as it helps in understanding the customer's requirements, improving sales, optimising costs, and creating better problem-solving strategies. In research, a huge amount of data provides insights to build accurate and reliable models and develop graphical representations.
In data analysis, various types of data are encountered, each requiring distinct
approaches for effective exploration and interpretation. These types can be
broadly categorised into two main groups: quantitative data and qualitative data.
Quantitative Data: Quantitative data represents measurable quantities and is
expressed in numerical terms. This type of data is often associated with objective
observations and is suitable for statistical analysis. Key subtypes include:
• Discrete Data: Comprising distinct, separate values, such as counts of items
or whole numbers.
• Continuous Data: Involving measurements that can take any value within
a given range, often associated with real numbers.
Quantitative data allows for mathematical operations, making it suitable for statistical techniques like mean, median, and regression analysis. Common sources of quantitative data include sensor readings, sales figures, and test scores.
Categorical Data: Categorical data represents distinct categories and is often used to group observations based on qualitative attributes. Analysing this type of data involves techniques like chi-square tests or logistic regression.
Example: Suppose you collect data on the preferred smartphone operating
systems among a group of individuals, and the categories include “iOS”,
“Android”, and “Other.” This information is categorical, and you can use methods
like chi-square tests to determine if there are significant associations between
preferences and other factors.
Understanding the type of data at hand is essential for selecting appropriate analysis methods and drawing meaningful conclusions. Often, a combination of quantitative and qualitative analyses is necessary to gain a comprehensive understanding of complex phenomena.
IN-TEXT QUESTIONS
1. Plural of Datum is ________.
2. In a class, the students were categorised into first, second, and third.
This type of data is called __________.
3. The full form of EDA is ________.
4. The full form of GIS is __________.
5. The data collected for GIS is called ____________.
1.7 SOURCES OF DATA COLLECTION
Data collection is a fundamental step in the data analysis process, and there are various sources from which valuable data can be gathered. These sources can be classified into primary and secondary sources, each providing unique insights into different aspects of the subject under study.
Primary Sources: Primary sources involve the direct collection of data from original or first-hand sources. These sources are often specific to the research or analysis at hand and provide data that has not been previously processed. Key examples include:
Example: A retail company analysing sales performance may use its internal sales records to identify trends and make informed decisions about inventory and marketing strategies.
External Sources: External sources encompass data collected by entities
outside the organisation. These sources provide context and external
benchmarks for analysis. Key examples include:
• Government Data: Census data, economic indicators, and regulatory
information provided by government agencies.
• Industry Reports: Data and analyses produced by industry associations or
market research firms.
• Open Data: Publicly available datasets shared by organisations,
governments, or communities.
Example: A researcher studying economic trends may use external sources such
as government reports on unemployment rates and GDP growth.
Effective data collection involves a thoughtful combination of these
sources, depending on the research objectives, available resources, and the
nature of the analysis. Researchers and analysts must carefully consider the
strengths and limitations of each source to ensure the reliability and validity of
the collected data.
1.8 DATA PREPARATION
Data preparation is the second major step in the data analysis process, since the data we collect from any source is not necessarily available in a properly organised format. Various kinds of defects occur in collected data, such as semantic mistakes, logical mistakes, value errors, poor formatting, partially available data, and irrelevant or misconfigured data. Before applying data analysis, we format the collected data so that it has correct semantics, is logical and relevant, and is configured for analysis. Data preparation, a crucial facet of the data analysis process, is devoted to the meticulous cleaning, organising, and transformation of raw data into a format conducive to meaningful analysis. This preparatory phase is instrumental in guaranteeing the accuracy, completeness, and relevance of the data in alignment with the research or analysis objectives.
1. Data Cleansing:
Data cleansing is a meticulous process involving the identification and rectification
of errors, inconsistencies, and missing values within the dataset. This critical step
safeguards the integrity of the data, preventing inaccuracies from impeding the
analysis. Techniques such as imputation, outlier removal, and rectification of
data entry errors are frequently applied during this stage.
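As an illustration, a minimal pandas sketch of such cleansing steps (the file name, column names and the outlier rule below are hypothetical, not taken from a specific dataset) might look like this:

import pandas as pd

df = pd.read_csv('survey.csv')                       # hypothetical raw dataset
df = df.drop_duplicates()                            # remove duplicate rows
df['age'] = df['age'].fillna(df['age'].median())     # impute missing values
df = df[df['age'].between(0, 120)]                   # remove impossible (outlier) ages
df['name'] = df['name'].str.strip()                  # rectify simple data entry errors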
2. Data Conversion:
Data conversion is the art of transforming raw data into a standardised format
suitable for analysis. This encompasses the normalisation or scaling of numerical
variables, encoding of categorical variables, and the creation of novel derived
features to encapsulate underlying patterns in the data in a better way.
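A hedged sketch of such conversions with pandas (the column names and values are illustrative) could be:

import pandas as pd

df = pd.DataFrame({'height_cm': [150, 160, 170, 180],
                   'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune']})

# normalise a numerical variable to the 0-1 range (min-max scaling)
h = df['height_cm']
df['height_scaled'] = (h - h.min()) / (h.max() - h.min())

# encode a categorical variable as indicator (dummy) columns
df = pd.get_dummies(df, columns=['city'])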
4. Feature Engineering:
Feature engineering is a creative process involving the creation of new features
or the modification of existing ones to enhance the performance of machine
learning models or facilitate the interpretation of statistical analyses. This may
encompass the generation of interaction features, aggregation of information, or extraction of meaningful indicators from existing variables.
5. Data Integration:
Data integration entails the amalgamation of data from diverse sources to construct a unified dataset. Resolving inconsistencies in variable names, types, or units is often necessary in this process. The overarching goal is to forge a comprehensive database that encapsulates all pertinent information for analysis.
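For example, a minimal merge of two hypothetical sources with pandas might be:

import pandas as pd

sales = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [250, 400, 150]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'region': ['North', 'South', 'East']})

# combine both sources into one unified dataset on the common key
combined = sales.merge(customers, on='customer_id', how='left')
print(combined)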
1.9 EXPLORATORY DATA ANALYSIS
Exploratory data analysis (EDA) is an important step in the data analysis process, aiming to understand the important features of the dataset before modelling or evaluation. The main purpose of EDA is to find patterns, relationships, inconsistencies and important insights in the data. Using visualization and analysis techniques, EDA helps gain a deeper understanding of the underlying structure of the data, allowing informed decisions to be made and the next steps to be planned.
EDA works as an investigative process, allowing analysts and data scientists to get to know the data while exploring it. This phase is accomplished through a combination of statistical concepts, graphical representation, and data visualization techniques that help uncover patterns and trends, thereby generating hypotheses for further investigation. The key components are:
Descriptive Statistics: Descriptive statistics provide a summary of the main
characteristics of the dataset.
• Mean: The arithmetic average of a variable.
• Median: The middle value of a dataset, separating it into two halves.
• Time Series Analysis: Identifies trends, seasonality, and cycles in time-dependent data.
• Spatial Analysis: Uncovers geographic patterns and relationships in spatial
datasets.
• Cluster Analysis: Groups similar observations based on certain
characteristics.
Fig 1.1: Data Analysis Flowchart
IN-TEXT QUESTIONS
6. Grouping observations based on similar characteristics is called
_________.
7. A chart that can be used to represent multivariate data visually is
__________.
8. Kurtosis measures __________ of the data distribution.
9. The full form of CRM is ____________.
10. The most occurring value in a data is called _________.
1.10 SUMMARY
1.11 GLOSSARY
• Data: Data is the plural of the word ‘datum’, which means a piece of information. It can be available in any format, like text, table, graph, etc.
• Semantics: It describes the meaning of data.
• Database: It is the collection of related data that have implicit meaning.
• Flowchart: It is a pictorial representation of some algorithmic task or problem.
• Dataset: It is the collection of data in a single file with an ordered and
well-structured format. It helps to summarise the properties of data.
• Data Wrangling: It is the process of cleaning, transforming and structuring
data from one raw form into a desired format to improve data quality and
make it more useful for data analysis.
1.12 ANSWERS TO IN-TEXT QUESTIONS
1. Data
2. Ordinal Data
1.13 SELF-ASSESSMENT QUESTIONS
9. Select any topic (educational, industrial, organisational) as per your interest, explore it, and perform the following activities upon it:
• Identify different sources of public-use open data repositories (Kaggle, Google Data, GitHub, etc.).
• Download the dataset and identify the file format (.xml, .csv, .xls, .txt).
• Identify the number of rows (data) and columns (attributes) from the
data set downloaded.
• Extract the useful information from collected data.
• Define the attributes according to the properties of specific information.
• Insert the specific information into the correct attribute.
• Identify the columns or rows with missing values, duplicate rows, and
different formats (e.g. varied date formats, etc.)
1.14 REFERENCES
LESSON 2 STATISTICAL FOUNDATIONS & PYTHON LIBRARIES
Structure
2.1 Learning Objectives
2.2 Introduction
2.3 Importance of Statistics
2.4 Population
2.5 Sampling
2.6 Types of Statistics
2.7 Measures of Central Tendency
2.8 Measures of Dispersion
2.9 Scaling Features of Data
2.10 Relationship between random variables – Covariance & Correlation
2.11 Regression Analysis
2.12 Statistical Hypothesis Generation and Testing
2.13 Essentials and Motivation for using Python for Data Visualization
2.14 Python Libraries
2.15 Summary
2.16 Glossary
2.17 Answers to In-text Questions
2.18 Self-Assessment Questions
2.19 References
2.20 Suggested Readings
2.2 INTRODUCTION
The scope and importance of statistics extend across diverse domains, influencing the planning of economic development and the analysis of business data, and making substantial contributions to fields such as biology, medical science, and industry. Serving as the lifeline of data analysis, statistics delineates the methods for accurate data collection, handling, and analysis. It introduces a plethora of measures that enhance the precision, predictions, accuracy, and estimation of data, making it an indispensable tool for anyone navigating the vast landscape of information and insights within diverse fields.
2.4 POPULATION
In statistics, a “population” refers to the entire group that is the subject of a study
or analysis. It includes all possible individuals, items, or observations that share
a common characteristic or a set of characteristics. The key distinction is that
a population encompasses every element of interest, not just a subset. Some
examples are:
• Suppose you want to study the average height of all students in a school.
The population, in this case, would be every student enrolled in that school,
regardless of their age, grade, or other characteristics.
• Suppose a company wants to understand the average monthly spending of all its customers. The population would be every customer who has ever purchased from the company, regardless of when they started or how often they buy.
• In a medical study examining the prevalence of a genetic trait in a region,
the population would be all individuals living in that region.
• During a national census, the population comprises every individual living
in the country, regardless of age, gender, or any other specific criteria.
Understanding the population is crucial in statistical analysis because it helps researchers draw accurate and generalisable conclusions about the entire group based on a subset known as a sample. Due to practical constraints, it is often more feasible to study a subset of the population rather than the entire group. The insights gained from analysing a sample are then extrapolated to make inferences about the population.
2.5 SAMPLING
2.6 TYPES OF STATISTICS
1. Descriptive Statistics:
Descriptive statistics, as its name implies, focuses on the elucidation of data.
This branch aims to portray sample data in a manner that is not only meaningful
but also easily comprehensible. Various methods, such as graphs, charts, tables,
or spreadsheets, are employed to represent the data meaningfully. The goal is to
provide a clear and representative depiction of the sample.
2. Inferential Statistics:
Inferential statistics, on the other hand, involves making inferences about
populations based on data drawn from that population. This branch applies to a
subset of the population and attempts to derive results, subsequently extending
those findings to the entire population. Inferential statistics encompass activities
like comparing data, testing hypotheses, and making predictions.
For Example, consider a scenario where you want to understand if 80%
to 90% of a larger population (say, 500 people nearby) favour online shopping
at Amazon. Conducting a survey with a sample of 500 people can result in
descriptive statistics, like a bar chart representing “yes” or “no” responses.
Alternatively, inferential statistics come into play when you analyse the samples
and infer that most of the entire population likely shares the same preference
for shopping at Amazon. Inferential statistics can be broadly categorised into
two areas:
• Estimating Parameters: This involves utilising sample data to estimate population parameters. Sample statistics, such as the sample mean, sample median, or sample mode, are used to make inferences about population parameters like the population mean, median, or mode.
• Hypothesis Testing: Hypothesis testing is a statistical method used to make decisions based on experimental data. It involves forming assumptions about population parameters, and statistical conclusions are drawn to either accept or reject these assumptions.
2.7 MEASURES OF CENTRAL TENDENCY
In the realm of data analysis, measures of central tendency play a pivotal role in distilling the essence of a dataset, revealing the central or representative values around which the data tends to cluster. These measures, comprising the mean, median, and mode, provide valuable insights into the typical characteristics of a dataset, aiding analysts in understanding its central tendencies.
Mean: The mean, often referred to as the average, is a fundamental measure of
central tendency. It is calculated by summing all values in the dataset and then
dividing the total by the number of observations. The mean is particularly useful
when seeking a balance point in the data.
For Example, let us consider a dataset representing the daily incomes of a group
of individuals:
Income = [100, 200, 300, 500, 600]
The mean is calculated as:
Mean = (100 + 200 + 300 + 500 + 600)/5 = 340
It is important to note that the mean is sensitive to extreme values. In cases
where outliers exist, the mean may be skewed towards these extremes.
Median: The median is another crucial measure of central tendency. It represents the middle value of a dataset when it is arranged in ascending or descending order. Unlike the mean, the median is less influenced by extreme values, making it a robust indicator of the central point.
For the income data above, the median is the middle value (300 in this case) after sorting the data. Here, the median = 300.
The median is particularly valuable when dealing with skewed distributions or datasets containing outliers.
Mode: The mode represents the most frequently occurring value in a dataset.
A dataset can be unimodal (one mode), bimodal (two modes), or multimodal
(more than two modes). The mode is especially useful for identifying peaks or
concentrations within the data.
For Example: In a dataset of exam scores:
Scores = [75, 85, 90, 75, 92, 85, 88, 90]
Here, 75, 85, and 90 each occur twice, more frequently than the other values, so the dataset is multimodal with modes 75, 85, and 90.
While the mode is straightforward to identify, it may not exist in every dataset, and a dataset can have more than one mode.
Collectively, these measures offer a nuanced understanding of the central
tendencies within a dataset. Analysts choose the appropriate measure based on
the characteristics of the data and the specific insights sought, recognising that
each measure brings a unique perspective to the analysis.
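These measures can also be computed directly in Python; a minimal sketch using the income and score data from the examples above (NumPy and the standard statistics module are assumed to be available) is:

import numpy as np
from statistics import multimode

income = [100, 200, 300, 500, 600]
scores = [75, 85, 90, 75, 92, 85, 88, 90]

print(np.mean(income))       # 340.0
print(np.median(income))     # 300.0
print(multimode(scores))     # [75, 85, 90]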
2.8 MEASURES OF DISPERSION
Range: The range is a measure of dispersion defined as the difference between the highest and lowest observed values in a given dataset. The range does not depend upon the frequencies of the observations in the dataset. To find the range of any dataset, all the individual data of the set should be in the same unit. This measure gives information about the spread of values but does not give any information about how the data are dispersed around the centre or mean.
Range = highest value – lowest value
For Example: Calculate the range of given data.
Data (x) 10 29 12 11 16
Frequency (f) 4 5 7 9 1
Firstly, check the highest and lowest values in the given data table, highest
value = 29 and lowest value = 10, irrespective of the frequency of data.
We know that range = highest value – lowest value
= 29 - 10 = 19
Variance: As we discussed, the range does not give detailed information about how the data are dispersed. The variance measures dispersion around the mean and overcomes this limitation of the range. It is denoted by σ². The variance is computed as the average of the squared differences of the observations from the mean:
σ² = Σᵢ (xᵢ − x̄)² / n
For Example: Calculate the variance of the given data.
Data (x): 10, 29, 12, 11, 18
Variance is calculated by σ² = Σᵢ (xᵢ − x̄)² / n.
First, calculate the mean:
x̄ = Σᵢ xᵢ / n = (10 + 29 + 12 + 11 + 18) / 5 = 80 / 5 = 16
Now, variance:
σ² = [(−6)² + (13)² + (−4)² + (−5)² + (2)²] / 5
   = (36 + 169 + 16 + 25 + 4) / 5
   = 250 / 5 = 50
Standard Deviation: It overcomes a limitation of the variance: because the variance is computed from the squared differences of the observations from the mean, it is not in the same units as the data. The standard deviation is the square root of the variance and is denoted by σ:
σ = √( Σᵢ (xᵢ − x̄)² / n )
For example: Find the standard deviation of the above example.
In the above example, we calculated a variance of 50, so
σ = √50 ≈ 7.07
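These computations can be checked in Python; a minimal NumPy sketch for the same data is shown below (np.var and np.std use the population formulas with n in the denominator by default, matching the formulas above):

import numpy as np

x = np.array([10, 29, 12, 11, 18])
print(x.mean())    # 16.0
print(np.var(x))   # 50.0
print(np.std(x))   # 7.0710678..., i.e. √50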
Coefficient of Variation: Unlike the variance and standard deviation, the coefficient of variation is a relative measure of dispersion, and it has no units. It is the ratio of the standard deviation to the mean of the observations:
Coefficient of variation = standard deviation / mean
For example: Calculate the coefficient of variation of the given data.
Data (x): 10, 29, 12, 11, 18
We know that coefficient of variation = standard deviation / mean. First calculate the mean and the variance, take the square root of the variance to obtain the standard deviation, and finally take the ratio of the standard deviation to the mean.
Mean: x̄ = Σᵢ xᵢ / n = (10 + 29 + 12 + 11 + 18) / 5 = 80 / 5 = 16
Variance: σ² = Σᵢ (xᵢ − x̄)² / n = [(−6)² + (13)² + (−4)² + (−5)² + (2)²] / 5 = (36 + 169 + 16 + 25 + 4) / 5 = 250 / 5 = 50
Now, standard deviation σ = √50 ≈ 7.07
Therefore, coefficient of variation = 7.07 / 16 ≈ 0.44
2.9 SCALING FEATURES OF DATA
Scaling of data is needed because comparing data is difficult when the values are distributed in different measurement units or have very different magnitudes. Scaling helps in the comparison of different data values by bringing them onto a common scale. For example, suppose a dataset has a few columns that are in different measurement units, like metres and kilograms. In that case, it will be challenging to compare the data of both columns due to the different measuring units. Here, we discuss two methods of scaling data:
1. Min-Max scaling.
In min-max scaling of data, we find the range using the minimum and maximum values of the data. For every data point x of the dataset, we subtract the minimum value of the dataset and then take the ratio of the result to the range. The data are then scaled (normalised), and we can make comparisons. The formula of min-max scaling is:
x_scaled = (x − x_minimum) / range
2. Standardisation method
In the standardisation method of scaling the data, we take the ratio of the difference of each data point from the mean to the standard deviation. The formula is given below:
z = (x − x̄) / s.d.
Here, z is the new (scaled) value, x is an observation (old value or data point), x̄ is the mean, and s.d. is the standard deviation.
For example, suppose we have a dataset of a few vehicles containing data about their model, volume, weight, etc. We want to compare the volume and weight, but both columns have different measurement units, and the values differ hugely in magnitude: a volume of 1.0 is very far from a weight of 600, and 1.5 is likewise far from 700.
First, calculate the mean and standard deviation of the weight column of the dataset by using the NumPy library. Similarly, find the mean and standard deviation of the volume column, and then standardise each value, as sketched in the code below.
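A minimal sketch of this standardisation with hypothetical weight and volume values (the numbers below are illustrative, not the book's dataset) is:

import numpy as np

weight = np.array([600, 700, 750, 800, 850])   # hypothetical weights
volume = np.array([1.0, 1.3, 1.5, 1.6, 1.9])   # hypothetical volumes

# z = (x - mean) / standard deviation
weight_scaled = (weight - weight.mean()) / weight.std()
volume_scaled = (volume - volume.mean()) / volume.std()

# the first entries of both columns are now on a comparable scale
print(weight_scaled[0], volume_scaled[0])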
If you take the weight column from the data set above, the first value is
600, and the scaled value will be (600 - 740.0) / 86.02 = -1.6275. If
you take the volume column from the data set above, the first value is 1.0, and
the scaled value will be (1.0 – 1.46) / 0.2870 = -1.6027.
Now, we can compare -1.6275 and -1.6027 (since both values are now
very close to each other) instead of comparing 600 and 1.0.
IN-TEXT QUESTIONS
1. How are mean, median, and mode related to each other?
2. Which measure of central tendency includes the magnitude of scores?
3. Mode refers to the value within a series that occurs ________ number
of times.
a) Maximum b) Minimum c) Zero d) Infinite
4. The sum of deviations from the _________ is always zero.
a) Median b) Mode c) Mean d) None of the above
2.10 RELATIONSHIP BETWEEN RANDOM
VARIABLES – COVARIANCE & CORRELATION
Covariance
Covariance describes how two random variables depend on each other. If a change occurs in one of them, then it affects the other dependent variable too. The mathematical values of covariance lie between −∞ and +∞.
For example, suppose Mr. X wants to purchase a house; he asks for the price
of different houses in the same area from a property dealer. The property dealer
provides a price list and size of houses to him, like below.
Size     Price
1000     1500000
1200     1800000
1500     2000000
2000     2500000
In the above list of prices and sizes, we can see that as the size of the
houses grows, prices also increase. So, we can say the variable price depends
on the other variable size. It means both variables are quantifying to each other
in a way that if one increases (or decreases), then the other also increases (or
decreases). It is the case of positive covariance.
NOTE: If two variables depend on each other in a way that if the value of one increases
(or decreases), then the value of others decreases (or increases), then the covariance will be
negative.
The mathematical equation of covariance is:
cov(X, Y) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / n
where n is the number of data points, X̄ is the mean of variable X, Ȳ is the mean of variable Y, and Xᵢ and Yᵢ are the individual data points of the sample set.
Correlation:
Correlation is the statistical term that tells how a pair of random variables are
strongly related to each other. The mathematical values of correlation lie in a
closed interval of [-1, 1]. It means if the values are close to -1, then data points
will have a strong negative correlation, and if values are close to +1, then data
points will have a strong positive correlation among them. In special cases, if
values are nearly 0, then data points will not be correlated.
The mathematical equation of correlation is:
ρ(X, Y) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / √( Σᵢ (Xᵢ − X̄)² · Σᵢ (Yᵢ − Ȳ)² )
where X̄ is the mean of variable X, Ȳ is the mean of variable Y, and Xᵢ and Yᵢ are the individual data points of the sample set.
The measure of correlation is known as the correlation coefficient or
correlation index.
ρ(X, Y) = cov(X, Y) / (σX σY)
The value of the coefficient lies in −1 ≤ ρ ≤ 1, where cov(X, Y) is the covariance between X and Y, σX represents the standard deviation of X, and σY represents the standard deviation of Y.
There are some cases of the correlation coefficient, which is shown in the
XY plane graph given below. In the diagram, the * symbol represents the data
point, and X and Y are axes, respectively.
Fig 2.3: Pearson Correlation
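These quantities are easy to compute in Python; a minimal NumPy sketch using the house size and price data from the covariance example above is:

import numpy as np

size = np.array([1000, 1200, 1500, 2000])
price = np.array([1500000, 1800000, 2000000, 2500000])

# covariance matrix (bias=True uses the population formula with n in the denominator)
print(np.cov(size, price, bias=True))

# correlation coefficient matrix; the off-diagonal entry is rho(size, price)
print(np.corrcoef(size, price))   # close to +1: a strong positive correlation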
2.11 REGRESSION ANALYSIS
The equation that describes simple linear regression is the slope form of a line in mathematics:
y = mx + c
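A minimal sketch of fitting such a line in Python, reusing the house size and price data from the covariance example (np.polyfit with degree 1 fits y = mx + c by least squares), is:

import numpy as np

size = np.array([1000, 1200, 1500, 2000])
price = np.array([1500000, 1800000, 2000000, 2500000])

m, c = np.polyfit(size, price, 1)   # slope m and intercept c
print(m, c)
print(m * 1300 + c)                 # predicted price for a hypothetical house of size 1300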
2.12 STATISTICAL HYPOTHESIS GENERATION AND TESTING
Hypothesis Testing
• Null Hypothesis (H0): The null hypothesis is a general statement or default position that there is no relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption made on the basis of knowledge of the problem.
• Alternative Hypothesis (H1): The alternative hypothesis is the hypothesis
used in hypothesis testing that is contrary to the null hypothesis.
• Level of significance: It refers to the degree of significance at which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting a hypothesis, a level of significance, usually 5%, is selected. It is normally denoted by α and is generally 0.05 or 5%, which means we should be 95% confident of obtaining a similar kind of result in each sample.
• P-value: The P-value, or calculated probability, is the probability of finding the observed or more extreme results when the null hypothesis (H0) of a given problem is true. If the P-value is less than the chosen significance level, then we reject the null hypothesis, i.e., we accept that the sample provides evidence in support of the alternative hypothesis.
• Steps in Hypothesis Testing:
Step 1: Identify the problem and make assumption statements such that the assumption statements (the null and alternative hypotheses) are contradictory to one another.
Step 2: Consider statistical assumptions such as whether the data is normal
or not and statistical independence between the data.
Step 3: Decide the test data on which the hypothesis will be tested.
Step 4: The data for the tests are evaluated. Evaluate various scores like the z-score and mean values.
Step 5: Decide whether to accept the null hypothesis or reject the null hypothesis.
• Formula for Hypothesis Testing
To validate our hypothesis about a population parameter, we use statistical functions. We use the z-score, p-value, and level of significance (alpha) to provide evidence for or against our hypothesis. For example, the one-sample z-test statistic is z = (x̄ − μ) / (σ/√n), where x̄ is the sample mean, μ is the hypothesised population mean, σ is the population standard deviation, and n is the sample size.
IN-TEXT QUESTIONS
5. The statistical term which tells about how the pair of random variables
are strongly related to each other is called __________.
6. The values of the coefficient lie in _________.
7. The ________ approach can be used to estimate the house prices based
on various input features like size, number of rooms, number of floors,
locality, etc.
8. In a hypothesis test, the p-value is compared to the significance level
(α) to make a decision. What happens if the p-value is less than α?
a. Reject the null hypothesis. b. Fail to reject the null hypothesis.
c. Accept the null hypothesis. d. The significance level is adjusted.
2.13 ESSENTIALS AND MOTIVATION FOR USING
PYTHON FOR DATA VISUALIZATION
Python’s rich ecosystem, led by libraries like Matplotlib and Seaborn, empowers you to create insightful and compelling visualizations effortlessly.
versatility, and readability make Python accessible for beginners while offering
advanced capabilities for seasoned professionals. Harness the seamless integration
with data manipulation libraries like Pandas, ensuring a smooth workflow. Enjoy
the vibrant community support, extensive documentation, and constant updates,
ensuring you stay at the forefront of visualization techniques. Whether you’re
a data scientist, analyst, or enthusiast, Python is your key to transforming raw
data into meaningful, impactful insights.
The use of Python is guided by Python Enhancement Proposals (PEPs) and
the Zen of Python, as they collectively represent the core values and evolutionary
mechanisms that guide the development and growth of the Python programming
language. PEPs are formal design documents providing information to the
Python community or describing a new feature for Python or its processes. These
proposals undergo rigorous review and discussion within the community before
being accepted or rejected. On the other hand, the Zen of Python, a collection
of aphorisms by Tim Peters, encapsulates the philosophy that shapes Python’s
design principles. It serves as a set of guiding ideals, emphasizing simplicity,
readability, and practicality in code. PEPs and the Zen of Python together
exemplify the commitment to transparency, collaboration, and a mindful approach
to software development that has contributed to Python’s success and popularity
in the programming world. They provide a framework that not only governs the
language’s evolution but also fosters a sense of community and shared values
among Python developers worldwide.
A few aphorisms from the Zen of Python are given below; every programmer using Python should keep them in mind:
• Beautiful is better than ugly.
• Explicit is better than implicit.
• Simple is better than complex.
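The full list of these aphorisms can be printed from any Python interpreter with a single import:

import this   # prints "The Zen of Python, by Tim Peters" followed by all the aphorisms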
2.14 PYTHON LIBRARIES
Python libraries are pre-written sets of code that extend the functionality of the Python programming language, making it more powerful and versatile for a
wide range of tasks. These libraries encapsulate reusable modules, functions,
and classes, allowing developers to save time and effort by leveraging existing
code rather than building everything from scratch. Python’s extensive library
ecosystem is a key factor in its popularity, contributing to its versatility across
diverse domains such as data science, machine learning, web development,
and more. Some well-known Python libraries include NumPy for numerical
computing, Pandas for data manipulation, Matplotlib for data visualization,
TensorFlow and PyTorch for machine learning, and Django for web development.
Numpy:
NumPy, short for Numerical Python, is a fundamental library for numerical
computing in Python. It provides support for large, multi-dimensional arrays
and matrices, along with a collection of mathematical functions to operate on
these arrays. NumPy is a cornerstone in the Python data science ecosystem
and is widely used for tasks involving numerical computations, linear algebra,
statistics, and more.
To install NumPy, you can use Python’s package manager, pip. Open a
terminal or command prompt and run the following command:
pip install numpy
This command fetches the latest version of NumPy from the Python
Package Index (PyPI) and installs it on your system. Alternatively, if you’re
using a Jupyter Notebook or an Integrated Development Environment (IDE) like
Anaconda, you can install NumPy using their respective package management
systems.
Once installed, you can import NumPy into your Python scripts or
notebooks using:
import numpy as np
This standard import statement is a convention, and it allows you to use
the alias np when referring to NumPy in your code, making it more concise
and readable. With NumPy, you can efficiently perform array manipulations,
mathematical operations, and other numerical tasks in a Pythonic and efficient
manner.
Pandas
Pandas is a powerful and widely-used open-source data manipulation and analysis
library for Python. It provides data structures like Series and DataFrame, which are designed to handle and manipulate structured data seamlessly. Pandas excels
in tasks related to cleaning, transforming, and analyzing data, making it an
essential tool in the data science and analytics toolkit.
To install Pandas, you can use Python’s package manager, pip. Open a
terminal or command prompt and run the following command:
pip install pandas
This command fetches the latest version of Pandas from the Python Package
Index (PyPI) and installs it on your system. If you’re using a Jupyter Notebook
or an integrated development environment (IDE) like Anaconda, you can install
Pandas using their respective package management systems.
Once installed, you can import Pandas into your Python scripts or notebooks
using:
import pandas as pd
The standard import statement uses the alias pd for Pandas, making it a
common and convenient convention. With Pandas, you can efficiently handle
and manipulate tabular data, perform operations like filtering, grouping, and
aggregation, and seamlessly integrate data from various sources for in-depth
analysis.
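As a quick illustration of those operations, a minimal sketch (the column names and values are hypothetical) is:

import pandas as pd

df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi'],
                   'sales': [250, 400, 150]})

print(df[df['sales'] > 200])               # filtering rows
print(df.groupby('city')['sales'].sum())   # grouping and aggregation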
Matplotlib
Matplotlib is a popular 2D plotting library for Python that produces high-quality
static, animated, and interactive visualizations. It provides a wide range of
customizable charts and plots, making it a versatile tool for data visualization.
To install Matplotlib, you can use Python’s package manager, pip. Open a
terminal or command prompt and run the following command:
pip install matplotlib
This command fetches the latest version of Matplotlib from the Python
Package Index (PyPI) and installs it on your system. If you’re using a Jupyter
Notebook or an integrated development environment (IDE) like Anaconda, you
can install Matplotlib using their respective package management systems.
Once installed, you can import Matplotlib into your Python scripts or notebooks using:
import matplotlib.pyplot as plt
The standard import statement uses the alias plt for Matplotlib’s pyplot
module, a widely adopted convention for brevity. With Matplotlib, you can
create line plots, scatter plots, bar charts, histograms, and more, enabling you
to effectively visualize and communicate insights from your data. Its flexibility
and integration with other libraries like NumPy make it a go-to choice for data
scientists, researchers, and analysts for generating high-quality visualizations.
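For instance, a minimal line plot (the values below are illustrative) can be produced as follows:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y, marker='o')   # simple line plot with point markers
plt.xlabel('x values')
plt.ylabel('y values')
plt.title('A simple Matplotlib line plot')
plt.show()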
Importing Data:
1. Using Pandas:
Importing data from a CSV file:
import pandas as pd
df = pd.read_csv('filename.csv')
Importing data from an Excel file:
import pandas as pd
df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')
Importing data from a SQL database:
import pandas as pd
import sqlite3
conn = sqlite3.connect('database.db')
query = 'SELECT * FROM table_name'
df = pd.read_sql(query, conn)
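Mirroring the NumPy export examples given later in this section, purely numeric data can also be imported directly into a NumPy array (a hedged sketch; np.loadtxt assumes the file contains only numbers):

2. Using NumPy:
Importing data from a text file:
import numpy as np
data = np.loadtxt('filename.txt')
Importing data from a CSV file:
import numpy as np
data = np.loadtxt('filename.csv', delimiter=',')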
Exporting Data:
1. Using Pandas:
Exporting data to a CSV file:
import pandas as pd
df.to_csv('output_filename.csv', index=False)
Exporting data to an Excel file:
import pandas as pd
df.to_excel('output_filename.xlsx', index=False, sheet_name='Sheet1')
Exporting data to a SQL database:
import pandas as pd
import sqlite3
conn = sqlite3.connect('output_database.db')
df.to_sql('output_table_name', conn, index=False, if_exists='replace')
2. Using NumPy:
Exporting data to a text file:
import numpy as np
np.savetxt('output_filename.txt', data)
Exporting data to a CSV file:
import numpy as np
np.savetxt('output_filename.csv', data, delimiter=',')
These examples showcase the basic usage of Pandas and NumPy for
importing and exporting data in various formats. Depending on your specific use
case and the type of data you are working with, you may choose the appropriate
library and method for your needs.
2.15 SUMMARY
2.16 GLOSSARY
• Observations: These are all the individual values of any data collection
(sample or population) or dataset.
• Variables: The properties or characteristics of any object that will be
analysed by using statistical techniques.
• Frequency: It is the occurrences of any data item(s) in a given data
collection/dataset.
• Absolute measure: It is the measure of dispersion that can be measurable
in some units with their data. It can be in meters, cm, grams, kg, etc.
• Relative measure: It is a measure of dispersion that has no measurable units. These are all coefficients, such as the coefficient of variation, coefficient of range, coefficient of mean deviation, coefficient of standard deviation, etc.
• Dataset: It is a collection of a large number of data items/data points, each with its own values. Most of the time, we use the CSV (comma-separated values) or XLS (Excel spreadsheet) format of the dataset to solve problems.
• Distribution: It means spreading of data across any point or region.
• Libraries: Python has a wide collection of built-in libraries, which contain
bunches of modules or groups of functions which help solve real problems.
It makes Python a more popular programming language for data analysis.
2.19 REFERENCES
LESSON 3 NUMPY: THE ART OF ARRAY MANIPULATION
Structure
3.1 Learning Objectives
3.2 Introduction
3.3 NumPy Array
3.4 Creating Nd-array
3.5 Attributes of Nd-array
3.6 Data types of Nd-array
3.7 Mathematical operations in Nd-arrays
3.8 Random modules & their usage
3.9 Indexing & Slicing
3.10 Reshaping and Transposing Operations
3.11 Swapping Axes
3.12 Summary
3.13 Glossary
3.14 Answers to In-text Questions
3.15 Self-Assessment Questions
3.16 References
3.17 Suggested Readings
3.2 INTRODUCTION
Python is a very popular programming language which is widely used for data analysis. Python has various well-defined, essential, and popular libraries, which are the reason behind its use for data analysis and visualization. These libraries save users time and are well-optimised for analysis and visualization tasks. NumPy refers to “Numerical Python”.
In this lesson, we delve into the powerful capabilities of NumPy, a fundamental
library for numerical computing in Python. NumPy is an essential tool for many
data manipulation tasks, including cleaning, sub-setting, and transformation,
because it offers quick and vectorised array operations. We examine common
array algorithms such as set operations, unique identification, and sorting, as
well as effective methods for data summarisation and descriptive statistics. This
lesson also discusses relational manipulations and data alignment, showing how
NumPy makes it easier to combine and merge disparate datasets.
We can create the Nd-array object by using the array() function and passing a list to it. The steps in creating a NumPy array are as follows (a minimal sketch is shown below):
NOTE- We use the array() function to convert the list into Numpy array. Check the type of
numpy array using type(array_name)
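A minimal sketch of these steps (the list values are illustrative) is:

import numpy as np

my_list = [1, 2, 3, 4, 5]    # step 1: start from a Python list
arr = np.array(my_list)      # step 2: convert the list into a NumPy nd-array
print(arr)
print(type(arr))             # <class 'numpy.ndarray'>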
3.6 DATA TYPES OF ND-ARRAY
NumPy supports a variety of data types for its Nd-array objects, allowing you to choose the appropriate type based on the nature of your data. Some common data types in NumPy are given below:
Integer Types:
np.int8, np.int16, np.int32, np.int64: Signed integers with 8, 16, 32, or 64 bits.
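For example, the data type can be set explicitly when an array is created and inspected afterwards (a small sketch):

import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)
print(arr.dtype)                # int32
print(arr.astype(np.float64))   # convert to another data type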
3.7 MATHEMATICAL OPERATIONS IN ND-ARRAYS
Element-wise Operations
NumPy allows you to perform element-wise operations, meaning that the
operation is applied to each element in the array.
Aggregation Functions: Functions that are computed over the whole array (or along an axis), producing summary or cumulative results.
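A minimal sketch of both kinds of operations is:

import numpy as np

arr = np.array([1, 2, 3, 4])

# element-wise operations: applied to each element
print(arr + 10)       # [11 12 13 14]
print(arr * arr)      # [ 1  4  9 16]

# aggregation functions: computed over the whole array
print(arr.sum())      # 10
print(arr.mean())     # 2.5
print(arr.cumsum())   # cumulative sums: [ 1  3  6 10]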
3.8 RANDOM MODULES AND THEIR USAGE
Random module: Python's standard library provides a built-in random module, and NumPy provides its own random module (numpy.random), used in this lesson, with similar and extended functionality for arrays. These modules help in the generation of random numbers for statistical tasks where simulated or predictive numeric data are required, and they offer many functions and submodules that help in deeper data analysis, visualization and interpretation. A few important functions of the random module are discussed below:
rand( ): This function is used to generate the real numbers randomly; it returns
a real number between 0 and 1.
modulename.rand()
The above real number generated using the random module lies between
0 and 1.
randint( ): This function returns the integer between the specified range
inside the function as an argument.
modulename.randint(start, end)
where start refers to the starting value of the specified range, and end refers
to the last value of the specified range.
The above integer number, generated using the random module, lies within the specified range of 1 to 5.
NOTE: The numbers generated using the random module would not be static. It means that
as you execute the code again and again, it is not necessary to give the current output equal
to the previous output.
NOTE: The elements of the generated array might be positive or negative real numbers.
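A minimal sketch of these functions using NumPy's random module (note that np.random.randint excludes the upper bound) is:

import numpy as np

print(np.random.rand())          # one real number between 0 and 1
print(np.random.randint(1, 5))   # one integer from 1 up to (but not including) 5
print(np.random.randn(5))        # array of 5 values that may be positive or negative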
IN-TEXT QUESTIONS
1. What is the term used to describe the number of elements in an array?
2. What is the index of the first element in a one-dimensional array in
most programming languages?
3. In NumPy, what does the SHAPE of an array mean?
4. What is the correct syntax to check the number of dimensions in an
array?
3.9 INDEXING & SLICING
Suppose a list or array A has the following elements:
Index     0   1   2   3   4
Elements  10  12  14  16  18
Now, we can apply slicing. Suppose we want to extract the elements 12, 14, and 16, whose indices are 1 to 3; then we write A[1:4], where A is the name of the list and the colon (:) denotes the slice operation between the starting index 1 (inclusive) and the ending index 4 (exclusive).
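A minimal NumPy sketch of the same slice (the array values are illustrative) is:

import numpy as np

arr = np.array([10, 12, 14, 16, 18])
print(arr[1:4])   # [12 14 16]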
In the above code, arr[1:4] slices the elements from index 1 to index 3 (i.e., up to last_index − 1), since index 4 is not included.
Like Python list, Nd-arrays are also mutable, i.e. we can change or modify
the elements of Nd-arrays.
In a 2-D (for example, 2×3) Nd-array, the elements can be represented in a tabular format of rows and columns. The general slicing syntax is:
numpy.array_name[start_index : last_index]
where start_index is the first index included in the slice and the slice extends only up to last_index − 1 (last_index itself is excluded).
Program: Create a 6 × 6 matrix using Nd-array and illustrate the indexing and
slicing operation
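A minimal sketch of such a program (one possible solution, not the book's original listing) is:

import numpy as np

mat = np.arange(1, 37).reshape(6, 6)   # 6 x 6 matrix with values 1..36
print(mat[0, 0])       # indexing: element in the first row, first column
print(mat[2])          # indexing: the whole third row
print(mat[1:4, 2:5])   # slicing: rows 1-3 and columns 2-4
print(mat[:, 0])       # slicing: the entire first column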
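The boolean-indexing example that the following explanation refers to can be reconstructed with a sketch like this (any array whose elements greater than 5 are 10, 18, 26 and 34 behaves the same way):

import numpy as np

arr = np.array([2, 10, 4, 18, 26, 5, 34])
print(arr[arr > 5])   # [10 18 26 34]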
Explanation: In the above example, the name of the numpy array is arr, and
the boolean expression ‘arr > 5’ is used as the index of the array. The condition
is true for the elements whose values are greater than 5, i.e., 10, 18, 26, and 34.
For the true value of the array index, elements are displayed, and for false,
they are not.
Numpy.empty( ) function: It returns a new array of a given shape and type without initialising the entries, so the values are arbitrary.
Syntax: numpy.empty(shape, dtype = float)
where shape represents the dimensions of the array (e.g., the number of rows and columns) and dtype is an optional attribute that represents the data type (float by default) of the returned array.
Fancy Indexing:
Fancy indexing in NumPy provides an advanced facility to retrieve a group of elements from an array by passing a list (or array) of indices. This kind of indexing enables convenient operations on a NumPy array, such as sorting, filtering, and conditional access.
NOTE: One difference between slicing and fancy indexing is that a slice does not create a new array object (it returns a view of the original data), whereas fancy indexing creates a new NumPy array and copies the selected data into it.
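A minimal sketch contrasting the two (the values are illustrative) is:

import numpy as np

arr = np.array([5, 10, 15, 20, 25])

view = arr[1:4]           # slicing: a view of the original data
fancy = arr[[0, 2, 4]]    # fancy indexing: a new copied array -> [ 5 15 25]

view[0] = 99              # modifies arr as well, because the view shares its data
fancy[0] = -1             # does not affect arr
print(arr)                # [ 5 99 15 20 25]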
3.10 RESHAPING AND TRANSPOSING OPERATIONS
Reshaping
Reshaping in NumPy refers to changing the shape or dimensions of an array while
maintaining the total number of elements. This operation is useful in various
scenarios, such as preparing data for specific algorithms, combining or splitting
arrays, or aligning data for mathematical operations. Below are some examples:
The reshape function is a convenient way to change the shape of an array.
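A minimal sketch of reshaping (the total number of elements must stay the same) is:

import numpy as np

arr = np.arange(12)        # 12 elements: 0..11
mat = arr.reshape(3, 4)    # reshape into 3 rows and 4 columns
print(mat.shape)           # (3, 4)
print(mat.reshape(-1))     # flatten back into one dimension (12 elements)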
Transposing
Transposing in NumPy involves changing the arrangement of elements in an
array by flipping the array along its main diagonal. This operation is quite useful
in various mathematical and computational tasks, such as matrix operations,
data manipulation, and linear algebra. The transpose of a matrix is obtained by
switching its rows with columns.
Even though transposing a 1D array does not have a significant effect,
NumPy allows it. The result will be the same as the original array.
NumPy arrays also have a .T attribute, which can be used to obtain the transpose of an array.
3.11 SWAPPING AXES
Swapping axes simply means interchanging two axes of an array:
numpy.swapaxes(arr, axis1, axis2)
Where the parameters are described as follows: arr is the name of the array,
axis1 represents the first axis, and axis2 represents the second axis.
NOTE: When applying the swapaxes function to a 2-D array, pass two different axes, i.e., (0, 1) or (1, 0). If both axis arguments are the same, such as (1, 1) or (0, 0), then no swapping of rows and columns is performed.
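A minimal sketch of transposing and swapping axes on a 2-D array (for a 2-D array the operations give the same result) is:

import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])       # shape (2, 3)

print(arr.T)                      # transpose via the .T attribute, shape (3, 2)
print(np.transpose(arr))          # same result using the transpose function
print(np.swapaxes(arr, 0, 1))     # same result by swapping axis 0 with axis 1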
IN-TEXT QUESTIONS
5. What does the reshape function in NumPy do?
6. What is the purpose of the transpose function in NumPy?
7. In NumPy, how is slicing different from indexing?
8. What does the following NumPy code accomplish: arr[::2]?
9. How can you flatten a multi-dimensional NumPy array?
3.12 SUMMARY
In this lesson, we discussed the numpy library and the basic NumPy mathematical
operations. The NumPy is a short form of Numerical Python, which is used for
numerical operations. The NumPy array is the main data structure of this library;
by using this, we can perform various operations like the creation of a multi-
dimensional array and matrix structure and maintaining the records in tabular
format. Indexing is an important feature of the NumPy array, which facilitates
frequent access to the data. Indexing can be of various types, such as simple
indexing, boolean indexing, and fancy or advanced indexing. According to the
requirements of data processing, we apply a specific type of indexing. Slicing
is also an important feature of a NumPy array, which we can use to retrieve the
subset or part of data or records stored in the array. The random module is used
frequently for creating random numbers in Python; this concept facilitates solving
problems on random password generation, game winner predictions, lucky draw,
etc.
3.13 GLOSSARY
• Attributes: The properties of a NumPy array that we inspect or use (such as its shape, size, and data type) represent its attributes.
• Transposing: It is an operation on a matrix or Nd-array in which we exchange the positions of the row and column elements with each other.
3.14 ANSWERS TO IN-TEXT QUESTIONS
1. size
2. 0
3. The shape is the number of elements in each dimension.
4. arr.ndim
5. Reshapes the dimensions of an array
6. Swaps the axes of an array
7. Indexing is used for accessing individual elements while slicing extracts
a subarray
8. Select every second element of the array
9. Using the ravel function
3.15 SELF-ASSESSMENT QUESTIONS
1. Explain any two functions that can be used to create numpy array objects, with suitable examples.
2. Define the following attributes in the context of a numpy array :
i. Data type
ii. Shape
iii. Reshape
iv. Dimension
3. Differentiate between numpy array and array.
4. What do you mean by indexing and slicing? Discuss types of indexing and
illustrate with examples.
5. Describe the random modules and any two functions of it.
6. Write a Python code to create a numpy array of 5×5 and display the shape,
dimension, and data type.
7. Write a function in Python to take the list as input and return a numpy
array object as output.
8. Write the Python code to generate a 5×5 identity matrix, and a 5×5 matrix whose elements are all ones.
9. Write a Python code to find the cube of all elements of the numpy array.
10. Write a Python code to generate the numpy array of any twenty random
numbers.
3.16 REFERENCES
LESSON 4 PANDAS POWER PLAY: MASTERING DATA MANIPULATION
Structure
4.1 Learning Objectives
4.2 Introduction
4.3 Pandas Series
4.3.1 Creating a Pandas Series
4.3.2 Accessing Elements in a Pandas Series
4.3.3 Operations on Pandas Series
4.4 DataFrame
4.5 Index Objects
4.6 Working with DataFrame
4.6.1 Arithmetic Operations
4.6.2 Statistical Functions
4.7 Binning
4.8 Indexing and Reindexing
4.8.1 Indexing
4.8.2 Reindexing
4.9 Filtering
4.10 Handling Missing Data
4.11 Hierarchical Indexing
4.12 Data Wrangling
4.13 Summary
4.14 Glossary
4.15 Answers to In-text Questions
4.16 Self-Assessment Questions
4.17 References
4.18 Suggested Readings
4.1 LEARNING OBJECTIVES
4.2 INTRODUCTION
Python is a very powerful language, and it is increasingly being used for scientific
applications. Matrix and vector manipulations, which involve storing and
analysing data in single as well as multi-dimensional arrays, form the backbone
of scientific computations. We often use Pandas (“Python Data Analysis Library”)
as an essential library for applications, including machine learning and data
sciences, due to its extensive functionalities that support high-performance
matrix computation capabilities.
Before using this module, it must first be installed. Once installed, the
module can be imported through the Python script. Importing a module means
loading it in memory to use its functionalities. Pandas can be imported by writing
the following statement:
import pandas as pd
4.3 PANDAS SERIES
In Pandas, a Series is a one-dimensional labelled array that can hold any data type,
such as integers, strings, floats, or even Python objects. It provides a powerful
and flexible data structure for manipulating and analysing data. All Pandas data
structures are value-mutable, which means that their values can be changed. However, a Series is size-immutable: once created, its length cannot be changed, although its values can still be modified.
Series in Python is a one-dimensional array that can store data of any type
(integer, string, float, python objects, etc.). A pandas Series can be created using
the following constructor:
pandas.Series(data, index, dtype, copy)
where data can be a list of values, an nd-array, a dictionary, or even a scalar constant; index values must be unique and hashable and have the same length as data (the default is np.arange(n) if no index is passed); dtype specifies the data type; and copy indicates whether to copy the data (False by default).
Below are some examples of creating a series:
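A few minimal sketches of creating a Series (the values and labels are illustrative) are:

import pandas as pd
import numpy as np

s1 = pd.Series([10, 20, 30, 40])                      # from a list, default integer index
s2 = pd.Series(np.array([1.5, 2.5, 3.5]))             # from an nd-array
s3 = pd.Series({'a': 1, 'b': 2, 'c': 3})              # from a dictionary (keys become labels)
s4 = pd.Series([10, 20, 30], index=['x', 'y', 'z'])   # custom labels
print(s4)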
There are several ways to access elements in a Pandas Series, including by position (index), by label, or conditionally.
Accessing by Label
You can also access elements in a Series using their label instead of the index.
To do this, you need to assign labels to the elements when creating the series.
Accessing Conditionally
You can also access elements in a Series based on a condition. For example, to
access all elements greater than 30, you can use the following code:
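A minimal sketch (the values and labels are illustrative) is:

import pandas as pd

s = pd.Series([10, 25, 40, 55], index=['a', 'b', 'c', 'd'])
print(s['b'])      # access by label -> 25
print(s[s > 30])   # conditional access -> the elements 40 and 55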
In this case, the condition ‘s > 30’ returns a Boolean Series where each element is compared to 30. Using this Boolean Series as an index for the original Series ‘s’, we can access only the elements that satisfy the condition.
Arithmetic Operations
You can perform arithmetic operations on Pandas Series, such as addition,
subtraction, multiplication, and division. These operations are performed
element-wise.
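The following sketch, with assumed values, illustrates this:
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([10, 20, 30, 40])
s3 = s1 + s2      # element-wise addition -> 11, 22, 33, 44
s4 = s2 - s1      # element-wise subtraction
s5 = s1 * s2      # element-wise multiplication
s6 = s2 / s1      # element-wise division
print(s3)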
In this example, we add two Series ‘s1’ and ‘s2’ using the ‘+’ operator.
The resulting Series ‘s3’ contains the element-wise addition of the corresponding
elements in ‘s1’ and ‘s2’.
Logical Operations
You can also perform logical operations on Pandas Series, such as equality (‘==’),
inequality (‘!=’), greater than (‘>’), and less than (‘<’). These operations
return a Boolean Series representing the result of the logical operation.
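A minimal sketch with assumed values:
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([1, 5, 3, 0])
print(s1 == s2)   # equality -> True, False, True, False
print(s1 != s2)   # inequality
print(s1 > s2)    # greater than
print(s1 < s2)    # less than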
Statistical Calculations
Pandas Series provides various statistical functions to compute descriptive
statistics for the data. Some commonly used functions include ‘mean()’,
‘median()’, ‘sum()’, ‘min()’, ‘max()’, ‘std()’, ‘var()’, etc.
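For example (values assumed):
import pandas as pd

s = pd.Series([12, 7, 3, 21, 14])
print(s.mean())            # average of the values
print(s.median())          # middle value
print(s.sum())             # total of the values
print(s.min(), s.max())    # smallest and largest values
print(s.std(), s.var())    # standard deviation and variance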
IN-TEXT QUESTIONS
1. What is a Pandas Series?
2. How can you create a Pandas Series from a Python list?
3. What is the primary purpose of the index in a Pandas Series?
4.4 DATAFRAMES
In the realm of data analysis with Python, the Pandas library stands out for its
robust capabilities. At the core of Pandas lies the DataFrame, a versatile two-
dimensional data structure that simplifies the handling and manipulation of
structured data. It can be likened to a table or spreadsheet, where information is
organised into rows and columns. DataFrames offer a powerful toolset for data
manipulation, cleaning, and analysis. They provide a structured and efficient way
to manage diverse types of data, making them essential for tasks ranging from
exploratory data analysis to complex statistical modelling. The basic features of
DataFrame can be given as:
• Columns of different data types.
• Size is mutable.
• DataFrame has labelled axes for rows and columns.
• Arithmetic operations can be performed on rows and columns.
Creating a DataFrame
A pandas DataFrame can be created by using the pandas.DataFrame function.
The syntax of this function can be given as:
pandas.DataFrame( data, index, columns, dtype, copy)
where,
• data can be an ndarray, Series, map, list, dict, constants or another DataFrame.
• index denotes the row labels for the resulting frame. It is an optional argument; by default, if no index is passed, its value will be np.arange(n).
• columns denotes the column labels; like index, it is optional and defaults to np.arange(n).
• dtype specifies the data type of each column.
• copy indicates whether the input data should be copied; its default value is False.
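A small sketch of creating DataFrames (the column names and values are illustrative):
import pandas as pd

data = {'Name': ['Asha', 'Ravi', 'Meena'],
        'Age': [21, 24, 22],
        'Marks': [88.5, 76.0, 91.2]}
df = pd.DataFrame(data, index=['s1', 's2', 's3'])   # dict of lists; keys become columns
print(df)

df2 = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])   # list of dicts; each dict is a row
print(df2)    # missing entries are filled with NaN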
4.5 INDEX OBJECTS
Hierarchical Index
You can create a DataFrame with a hierarchical index, which allows for multi-level indexing.
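A minimal sketch using assumed region and year labels:
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['North', 'North', 'South', 'South'], [2023, 2024, 2023, 2024]],
    names=['Region', 'Year'])
df = pd.DataFrame({'Sales': [250, 300, 180, 220]}, index=idx)
print(df)
print(df.loc['North'])   # partial indexing on the outer level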
Understanding and effectively using index objects in Pandas is crucial for efficient
data manipulation and analysis. Whether it is customising indexes, using date
ranges, or creating hierarchical indexes, the choice of index can significantly
impact how you interact with your data.
4.6 WORKING WITH DATAFRAME
Broadcasting Operations
Broadcasting allows operations between a DataFrame and a scalar value. The
scalar value is broadcasted to all elements in the DataFrame.
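For example (values assumed):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
print(df + 5)   # the scalar 5 is broadcast to every element
print(df * 2)   # every element is doubled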
Mathematical Functions
Pandas provides a range of mathematical functions that can be applied to entire
DataFrames.
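A small sketch (values assumed); note that NumPy ufuncs also operate element-wise on DataFrames:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, -4, 9], 'B': [2.345, -7.891, 16.0]})
print(df.abs())            # absolute value of every element
print(df.round(1))         # round to one decimal place
print(np.sqrt(df.abs()))   # NumPy ufunc applied element-wise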
5. min() and max(): Compute the minimum and maximum values along a specified axis.
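For example (values assumed):
import pandas as pd

df = pd.DataFrame({'A': [3, 7, 1], 'B': [9, 2, 6]})
print(df.min())          # column-wise minimum (axis=0 is the default)
print(df.max(axis=1))    # row-wise maximum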
These are just a few examples of the statistical functions available in
Pandas. Pandas provides a wide range of statistical functions that can be applied to
DataFrames. Here is a list of some commonly used statistical functions in Pandas:
Descriptive Statistics
Aggregation
Missing Data
• fillna(): Fill missing values.
Ranking
Miscellaneous
• idxmax(), idxmin(): Return the row labels of the maximum and
minimum values.
• mad(): Compute the mean absolute deviation.
For example, idxmax() on the DataFrame below will give an output of 2, since 3 is the
maximum value and it first occurs at index 2 in the DataFrame.
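A minimal sketch of such a DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 3, 0]})
print(df['A'].idxmax())   # 2 -> the maximum value 3 first occurs at index 2
print(df['A'].idxmin())   # 4 -> the minimum value 0 occurs at index 4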
IN-TEXT QUESTIONS
4. How can you calculate the mean of a specific column ‘A’ in a Pandas
DataFrame named df?
5. What does the describe() function in Pandas do when applied to a
DataFrame?
6. What is the purpose of the corr() function in Pandas when applied to a
DataFrame?
4.7 BINNING
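Binning groups a continuous variable into discrete intervals (bins). The sketch below (the ages are illustrative) produces the DataFrame discussed in the steps that follow:
import pandas as pd

df = pd.DataFrame({'Age': [18, 22, 27, 34, 41, 45, 59]})
bins = [18, 25, 35, 45, 60]
labels = ['18-25', '26-35', '36-45', '46-60']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
print(df)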
In this example:
1. Create Sample DataFrame: We start with a simple DataFrame, ‘df’ that
contains a column named ‘Age’ with continuous age values.
2. Define Bins and Labels: We specify the bin edges using the ‘bins’ list. In
this case, the bins are ‘[18, 25, 35, 45, 60]’, indicating the age intervals.
We provide corresponding labels for each bin in the ‘labels’ list, such as
[‘18-25’, ‘26-35’, ‘36-45’, ‘46-60’].
3. Use the ‘cut’ Function: The pd.cut function takes the continuous variable
(df[‘Age’]) and bins it based on the specified intervals and labels. The
right=False argument means that the intervals are left-closed, i.e., the right
endpoint is excluded from each interval.
4. Create a New Column: We create a new column, ‘AgeGroup’, in the DataFrame
to store the age group information obtained from the binning process.
4.8 INDEXING AND REINDEXING
4.8.1 Indexing
Basic Indexing: You can access specific columns or rows using their labels.
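For example (column names and values assumed):
import pandas as pd

df = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Meena'],
                   'Marks': [88, 76, 91]},
                  index=['s1', 's2', 's3'])
print(df['Name'])              # a single column as a Series
print(df[['Name', 'Marks']])   # several columns as a DataFrame
print(df.loc['s2'])            # a row selected by its label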
4.8.2 Reindexing
Fill Values during Reindexing: Specify a fill value or a fill method to fill missing
values during reindexing.
Setting a New Index: Set an existing column as the index. When reindexing, labels in
the new index that are not present in the DataFrame are assigned NaN.
4.9 FILTERING
• Isin Method: Filter rows based on values present in a list or another DataFrame.
Filtering Columns
Selecting Specific Columns: Choose specific columns from the DataFrame.
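The following sketch (data assumed) shows both row filtering and column selection:
import pandas as pd

df = pd.DataFrame({'Name': ['Asha', 'Ravi', 'Meena', 'Karan'],
                   'City': ['Delhi', 'Mumbai', 'Delhi', 'Pune'],
                   'Marks': [88, 76, 91, 64]})
print(df[df['Marks'] > 80])                     # boolean condition on rows
print(df[df['City'].isin(['Delhi', 'Pune'])])   # isin(): keep rows whose City is in the list
print(df[['Name', 'Marks']])                    # select specific columns only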
‘loc’ and ‘iloc’ Methods: Use ‘loc’ for label-based indexing and ‘iloc’ for
integer-based indexing.
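For example (labels and values assumed):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df.loc['y', 'B'])        # label-based: row 'y', column 'B' -> 5
print(df.loc['x':'y', ['A']])  # label slices include both end points
print(df.iloc[1, 1])           # integer-based: second row, second column -> 5
print(df.iloc[0:2, 0])         # integer slices exclude the end position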
4.10 HANDLING MISSING DATA
Filling Missing Values: Use ‘fillna()’ to replace missing values with specified values.
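For example (values assumed):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, np.nan]})
print(df.fillna(0))                       # replace every NaN with 0
print(df.fillna({'A': df['A'].mean()}))   # column-specific replacement values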
4.11 HIERARCHICAL INDEXING
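The sketch below reconstructs the multi-indexed Series described next (the stored values are illustrative):
import pandas as pd
import numpy as np

outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 3, 1, 1, 2, 2, 3]
s = pd.Series(np.arange(10), index=[outer, inner])
print(s)
print(s.index)    # displays the MultiIndex of the Series
print(s['b'])     # partial indexing: all entries under the outer label 'b'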
In the above example, the first three inner indices 1, 2, 3 are clubbed with the name ‘a’,
the next three indices 1, 3, 1 are clubbed with the name ‘b’, the next two indices 1, 2 are
clubbed with the name ‘c’, and the last two indices 2, 3 are clubbed with the name ‘d’.
How do we print the multi-index? To display the multi-index, we use the index
attribute on the variable where the Series is stored, as shown in the sketch above.
We can also access a subset of a multi-indexed data series, which is called
partial indexing.
4.12 DATA WRANGLING
When we apply the merge operation over DataFrame objects, it is sometimes
difficult to understand how the operation is applied. So, for the readability of the
Python code, we can use the ‘on’ attribute with the value ‘key’, which indicates
that the merge operation is performed over the key column. This attribute is used
when we have the same column name in both DataFrames. For this, we can write
the merge statement as below:
pd.merge(df1, df2, on = ‘key’)
If the two DataFrames have different column names for the join key, then it is
better to use the attributes left_on and right_on to name the key column of the
left and right DataFrame, respectively.
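A minimal sketch (key values assumed):
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'val2': [10, 20, 40]})
print(pd.merge(df1, df2, on='key'))     # same key column name in both DataFrames

df3 = pd.DataFrame({'lkey': ['a', 'b'], 'val3': [5, 6]})
df4 = pd.DataFrame({'rkey': ['a', 'b'], 'val4': [7, 8]})
print(pd.merge(df3, df4, left_on='lkey', right_on='rkey'))   # different key column names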
Outer Join
Left Join
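A minimal sketch covering both the outer and the left join (data assumed):
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'val2': [10, 20, 40]})
print(pd.merge(df1, df2, on='key', how='outer'))   # outer join: keys from both sides, NaN where absent
print(pd.merge(df1, df2, on='key', how='left'))    # left join: every key of df1 is kept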
4.13 SUMMARY
In this lesson, we discussed the Pandas library and its basic mathematical
operations. It introduced two main tools in Pandas: Series (for one-dimensional
data) and DataFrame (for two-dimensional, tabular data). A Series can be thought of
as a list with labels, and DataFrame as a table where you can easily organise and
analyse your data. This lesson also covers practical skills like loading different
types of data into Pandas, doing math and statistics with your data, and dealing
with missing information. Towards the end, the lesson introduces data wrangling,
which is like getting your data into the perfect shape for analysis.
4.14 GLOSSARY
• Index: It is an object in Pandas that provides labels to the data, facilitating
quick and efficient data retrieval and alignment.
• Data Wrangling: It is the process of cleaning, transforming, merging, and
reshaping raw data to make it suitable for analysis, often using tools like
Pandas.
• Binning: It means grouping continuous data into discrete intervals or bins
for better analysis or Visualization.
• Indexing: It means selecting specific rows or columns in a DataFrame
based on labels or positions.
• Reindexing: It means creating a new index for a DataFrame or altering
the existing index to match a specified set of labels.
• Attribute: The properties of any object are called attributes; here, the
properties which we are performing over Pandas objects represent the
attribute.
• Transposing: It is the operation on a matrix or nd-array in which the positions
of the row and column elements are exchanged with each other.
4.16 SELF-ASSESSMENT QUESTIONS
1. Explain the data series and DataFrame objects with suitable examples.
2. Define the following attributes in the context of pandas:
4.17 REFERENCES
LESSON 5
Structure
5.1 Learning Objectives
5.2 Introduction
5.3 Matplotlib
5.3.1 Pyplot
5.3.2 Concept of figure, plot, and subplot
5.4 Plotting Functions with Examples
5.4.1 Basic Plot Functions
5.4.2 Colours, Markers, and Line Styles
5.4.3 Label and Legend
5.4.4 Saving a Plot
5.5 Plotting Functions in Pandas
5.6 Summary
5.7 Glossary
5.8 Answers to In-text Questions
5.9 Self-Assessment Questions
5.10 References
5.11 Suggested Readings
5.1 LEARNING OBJECTIVES
By the end of this lesson, learners should be equipped with the knowledge and
skills needed to create, customise, and interpret a variety of plots using both
Matplotlib and Pandas, enhancing their ability to communicate insights derived
from data effectively.
5.2 INTRODUCTION
In this lesson, we delve into the art and science of data visualization using the
powerful tools of Matplotlib and Pandas. Visualization plays a pivotal role in
data analysis, aiding in the exploration, interpretation, and communication of
complex datasets. Matplotlib, a widely used plotting library in Python, serves
as our primary tool for crafting visually engaging figures and subplots. We
embark on a journey to understand the intricacies of Matplotlib, exploring its
diverse capabilities, such as customising colours, line styles, and annotations.
Learners will grasp the fundamentals of constructing informative plots that not
only convey data trends but also cater to the aesthetic considerations crucial for
effective communication.
Moving beyond Matplotlib, this lesson introduces the plotting functions
available in Pandas, offering a seamless integration of visualization into data
manipulation workflows. From basic line plots to intricate heatmaps, we navigate
through the spectrum of Pandas plotting functions. The emphasis is not only
on creating diverse visualizations but also on understanding when to employ
each type of plot for optimal representation of different data scenarios. By the
end of this lesson, learners will be equipped with the skills to not only generate
compelling plots but also to discern the most suitable visualization techniques
for their specific analytical objectives. Whether it is depicting trends over time,
comparing distributions, or showcasing relationships between variables, the
mastery of Matplotlib and Pandas plotting functions empowers data enthusiasts
to unlock meaningful insights from their datasets.
5.3 MATPLOTLIB
5.3.1 PyPlot
PyPlot (the matplotlib.pyplot module) provides a collection of functions that offer a
simple, MATLAB-like interface to Matplotlib. With it, you can create figures and plots,
add labels, legends, and titles, and manipulate various plot elements, all within a single
interface. Whether used for quick exploratory data analysis or the creation
of publication-ready visuals, PyPlot serves as a valuable tool for harnessing
Matplotlib’s power while streamlining the plotting process in Python.
5.3.2 Concept of Figure, Plot, and Subplot
Figure
• A Figure in Matplotlib is the top-level container that represents the entire
window or page where your plots are drawn.
• When you create a new plot using plt.plot() or a set of subplots using
plt.subplots(), you are creating them within a figure.
• You can explicitly create a new figure using plt.figure().
Plot
• A “plot” refers to the graphical representation of data within a figure. It
could be a line plot, scatter plot, or bar plot.
• Functions like plt.plot(), plt.scatter(), or plt.bar() are used to
create specific types of plots within a figure.
Subplot
• A “subplot” is a smaller plot that exists within a single figure. It allows
you to organise multiple plots in a grid-like fashion.
• The plt.subplots() function is commonly used to create subplots. It
returns a figure and an array of subplot axes.
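For example, a figure with a 1×2 grid of subplots can be sketched as follows (the data are illustrative):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # a figure containing two subplot axes
axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title('Left subplot')
axes[1].bar(['a', 'b', 'c'], [3, 7, 2])
axes[1].set_title('Right subplot')
plt.show()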
5.4 PLOTTING FUNCTIONS WITH EXAMPLES
5.4.1 Basic Plot Functions
plot() function: This function is used to plot the points in graphical form. It
comes after initialising the data points in the variables. It draws lines from
one point to another over the axes. It typically takes two arguments, denoting
the x-axis and y-axis values, respectively.
show() function: This function is used to display the graph on the user’s output
screen.
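A minimal sketch (the data points are illustrative):
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)   # draw a line through the points (x, y)
plt.show()       # display the graph on the screen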
5.4.2 Colours, Markers, and Line Styles
The plot() function in the Matplotlib library offers parameters for customising
colours, line styles, and markers. These parameters allow explicit control over
the appearance of the plotted data. The color, linestyle, and marker parameters
enable users to tailor the visual representation of the plot to their preferences.
Color Parameter: The color parameter is used to specify the colour of the
line or marker. It accepts various colour notations, such as named colours (‘red’,
‘blue’), hexadecimal RGB values (‘#FF5733’), or RGB tuples ((1.0, 0.34, 0.20)).
plt.plot(x, y, color='green', label='Green Line')
Markers Parameter: The marker parameter allows the selection of different
marker styles to represent data points. Common markers include ‘*’ for a star,
‘o’ for a circle, and ‘d’ for a diamond shape.
plt.plot(x, y, marker='o', label='Circle Markers')
Marker Size (ms) Parameter: The ms parameter, short for marker size, controls
the size of the markers on the plot. Numeric values can be assigned to adjust the
marker size according to specific requirements.
plt.plot(x, y, marker='*', ms=8, label='Star Markers (Size: 8)')
Line Style Parameter: The linestyle parameter defines the style of the line
connecting the data points. Common line styles include ‘-’, ‘--’, and ‘:’ for solid,
dashed, and dotted lines, respectively.
plt.plot(x, y, linestyle='--', label='Dashed Line')
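Putting these parameters together, a complete runnable sketch (data assumed) looks like this:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
plt.plot(x, y, color='green', marker='*', ms=8, linestyle='--', label='Squares')
plt.legend()    # show the label in a legend box
plt.show()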
5.4.3 Label and Legend
NOTE: If you want to shift the title location to the left or right, then use loc = ‘left’
or loc = ‘right’ in the title() function. Also, if you want to give a common title for a
group of subplots, then use the suptitle() function.
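A small sketch of a shifted title and a common suptitle (data assumed):
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot([1, 2, 3], [1, 2, 3])
ax1.set_title('First subplot', loc='left')     # title shifted to the left
ax2.plot([1, 2, 3], [3, 2, 1])
ax2.set_title('Second subplot', loc='right')   # title shifted to the right
fig.suptitle('Common title for the group of subplots')
plt.show()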
5.4.4 Saving a Plot
After creating the plot, if you want to save the plotted output of your data, you
can apply the savefig() function. This function is provided by the matplotlib
module and saves the figure in a format of your choice. The example below shows
how to save the plotted figure in SVG (scalable vector graphics) format. Similarly,
you can save the figure in PDF, JPG or PNG formats.
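For example (the file names are assumed):
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 6])
plt.savefig('line_plot.svg')   # save in SVG format; the extension decides the file format
plt.savefig('line_plot.png')   # the same figure saved as PNG
plt.show()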
NOTE: The figure (plotted image file) will be saved in the same path or folder on your
system where your Python source code file is located.
IN-TEXT QUESTIONS
1. What is Matplotlib?
2. Which module in Matplotlib is commonly used for creating basic plots?
3. What does the plot function in Matplotlib do?
4. How can you create multiple subplots in a single figure in Matplotlib?
5. Which parameter is used to customise the colour of a plot in Matplotlib?
6. What does the xlabel function in Matplotlib do?
7. How can you save a Matplotlib plot to a file?
8. Which type of plot is best suited for visualising the distribution of a
continuous variable?
9. What does the tight_layout() function in Matplotlib do?
5.5 PLOTTING FUNCTIONS IN PANDAS
Pandas is the Python library used for analysing data. It uses the concept of
DataFrames. It has various functionalities and offers solutions for data cleaning,
analysis, exploration, and editing. Apart from data manipulation with Pandas, you
can visually represent your findings through expressive plots. Whether you are
exploring data for insights or presenting results, the combination of Pandas and
Matplotlib provides a powerful toolkit for effective data visualization. Pandas
simplifies the process of working with structured data, providing a high-level
interface to organise, filter, and analyse datasets. When coupled with Matplotlib,
Pandas allows for efficient plotting without the need for extensive code. We
discuss various types of plots, for example:
Line plot: A group of lines, one per column, can be plotted easily using Pandas.
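For example (column names and values assumed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'maths': [65, 70, 80, 78],
                   'science': [60, 75, 72, 85]},
                  index=['T1', 'T2', 'T3', 'T4'])
df.plot()     # one line per column, with the index on the x-axis
plt.show()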
Bar plot: The key elements of a bar plot are rectangular bars whose lengths (or
heights) represent the data values. The bar plot is also called a bar chart or bar
graph. The bar chart can be drawn in a horizontal or vertical manner. The
plot.bar() function is used for vertical bar charts and plot.barh() for horizontal
bar charts. In plotting a bar chart, we can use the pandas module with the
Matplotlib library (its pyplot submodule), or NumPy with Matplotlib. With the
pandas module, we use either a DataFrame or a Series for the indexing of the bar plot.
Bar plot using DataFrame: First, we use the DataFrame() function to create the
DataFrame, and then plot the bar chart using the dataframe.plot.bar() function.
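A minimal sketch (data assumed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'sales': [250, 300, 180],
                   'profit': [40, 55, 25]},
                  index=['Jan', 'Feb', 'Mar'])
print(df)        # view the values from which the bars are drawn
df.plot.bar()    # vertical bar chart, one group of bars per index label
plt.show()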
NOTE: We can view the DataFrame values from which the bar plots are drawn using the
statement: print(dataframe).
Horizontal bar plot: The horizontal bar plot draws the bars horizontally using the plot.barh() function; it can also be drawn as a stacked bar plot.
NOTE – We can pass the stacked attribute with a True boolean value to produce a stacked
horizontal bar plot. For a stacked vertical bar plot, we can use the df.plot.bar(stacked=True)
function.
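For example (data assumed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'sales': [250, 300, 180],
                   'profit': [40, 55, 25]},
                  index=['Jan', 'Feb', 'Mar'])
df.plot.barh(stacked=True)   # horizontal stacked bar plot
df.plot.bar(stacked=True)    # vertical stacked bar plot
plt.show()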
Histogram: Like a bar plot, the histogram is a plot which groups data points into
specific ranges; these ranges are called bins. Creating a histogram provides a visual
representation of the data distribution for a large set of data. In a histogram, the
horizontal axis (X-axis) represents the bins, and the vertical axis (Y-axis) represents
the number of occurrences of data in each bin, i.e., the frequency. We can use the
function plot.hist() to draw the histogram.
Density plot: A density plot is the continuous, simple and smoothed version of
the histogram estimated from the observed data. It is estimated through a kernel,
which is a simple distribution such as the normal probability distribution; that is
why density plots are also called Kernel Density Estimation (KDE) plots. In this
method, a kernel (continuous curve) is drawn at every individual data point, and
then all these curves are added together to make a single smoothed density
estimate. A histogram falls short when we want to compare the data distribution
of a single variable over multiple categories; in that case, a density plot is useful
for visualising the data.
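A minimal sketch (values assumed; plot.kde() requires the SciPy package to be installed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'marks': [35, 42, 48, 55, 58, 61, 67, 70, 72, 88, 91]})
df.plot.kde()   # smoothed kernel density estimate of the distribution
plt.show()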
Scatter Plot: A scatter plot is used to visualise the relationship between two
continuous variables. In Pandas, you can create a scatter plot using the `plot`
function with `kind='scatter'`.
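For example (column names and values assumed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'height': [150, 160, 165, 170, 180],
                   'weight': [50, 58, 63, 68, 80]})
df.plot(kind='scatter', x='height', y='weight')
plt.show()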
Stacked Bar Plot: A stacked bar plot is useful to show the composition of
multiple categories. In Pandas, you can use the `plot` function with `kind='bar'`
and set `stacked=True`.
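For example (data assumed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'online': [120, 150, 90],
                   'in_store': [200, 180, 160]},
                  index=['Q1', 'Q2', 'Q3'])
df.plot(kind='bar', stacked=True)   # the columns are stacked on top of each other
plt.show()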
IN-TEXT QUESTIONS
10. Which Pandas plotting function is used to create a line plot?
11. How do you create a histogram in Pandas using the plot function?
12. To create a scatter plot in Pandas, which parameters can be used in the
plot function?
13. What does stacked=True achieve when creating a bar plot with Pandas?
5.6 SUMMARY
5.7 GLOSSARY
5.10 REFERENCES
LESSON 6
Structure
6.1 Learning Objectives
6.2 Introduction
6.3 Data Aggregation
6.4 GroupBy Mechanics
6.5 Pivot Tables
6.6 Cross-Tabulation
6.7 Summary
6.8 Glossary
6.9 Answers to In-text Questions
6.10 Self-Assessment Questions
6.11 References
6.12 Suggested Readings
6.2 INTRODUCTION
We are all living in the digital era and have evolved with the huge amount of
data. Without data, we cannot presume our daily needs. Data is like fuel for us.
The data aggregation process helps in statistical analysis for the collection of
objects and provides useful and summarised information. It helps in industrial
analysis, organisational data processing, and many other areas. Often, aggregation is
applied over a large amount of data; it helps in splitting and grouping data in
various ways. The programs used for the aggregation of data are known
as data aggregators.
There are groups of operations used in data aggregation. This process is
also called groupby mechanics. The groupby mechanics are similar to
divide-and-conquer algorithms; they are performed as a sequence of split, apply
and combine steps. The first step, split, breaks or divides the data into sublists
or sub-sequences according to the grouping keys, giving small parts of the data
that are easier to work with. In the second step, apply, we apply the required
operation to each subpart of the divided data (each split group). In the last step,
combine, the results for all the groups are combined into a single result.
There are various aggregation functions used with the groupby() function
in performing statistical operations and analysis. A few aggregation functions,
which are applicable to the groupby method, are given below:
• mean
• median
• max
• min
• std
• var
• count
• sum
The purpose of the GroupBy operation is to organise an unordered data sequence
into groups. It also works like a filter operation, in which we can fetch the
required data of a particular group only.
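A minimal sketch of grouping and aggregating (column names and values assumed):
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B', 'A'],
                   'Sales': [100, 200, 150, 250, 300]})
grouped = df.groupby('Category')
print(grouped['Sales'].mean())    # mean per group
print(grouped['Sales'].sum())     # sum per group
print(grouped['Sales'].count())   # number of rows per group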
Iteration over the Grouped data: The iteration process helps while retrieving
the grouped elements one group at a time. Iterating over grouped data in Pandas
involves looping through the groups created by the groupby function. Each group
is a subset of the original DataFrame based on the unique values in the grouping
column. Here is an example of how to iterate over grouped data:
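A sketch of such a loop (the data values are assumed):
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Value': [10, 20, 30, 40]})
for group_name, group_data in df.groupby('Category'):
    print(group_name)    # 'A', then 'B'
    print(group_data)    # the rows belonging to that group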
In this example, the groupby operation creates two groups based on the
unique values in the ‘Category’ column (‘A’ and ‘B’). The loop iterates over
these groups, and for each group, it prints the group name (group_name) and the
corresponding DataFrame (group_data). You can then perform specific operations
or analyses within the loop for each group.
Additionally, you can use aggregate functions within the loop to calculate
group-specific statistics or apply custom functions to each group. The flexibility
of iteration over grouped data allows for dynamic and customised analyses based
on the unique characteristics of each group.
Applying more than one aggregate function: You can apply multiple aggregate
functions simultaneously using the agg method. The agg method allows you to
specify different aggregation functions for each column.
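For example (columns and functions assumed):
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'B'],
                   'Sales': [100, 200, 150, 250],
                   'Units': [3, 5, 2, 7]})
print(df.groupby('Category').agg({'Sales': ['mean', 'sum'], 'Units': 'max'}))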
IN-TEXT QUESTIONS
1. In Python’s Pandas library, what function is commonly used for grouping
data?
2. Which method is used to apply aggregation functions to grouped data?
3. Which aggregation function in pandas calculates the median of a
numeric column within each group?
4. In the context of pandas and data grouping, what does the term “multi-
level indexing” refer to?
6.5 PIVOT TABLES
DataFrame: It is the main object in pandas, used to represent data with rows and
columns, like the tabular form of Excel data. Let us consider a DataFrame of weather
data that we will use in our next set of examples:
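A sketch of such a weather DataFrame (the column names and values are illustrative), reshaped with pivot():
import pandas as pd

weather = pd.DataFrame({
    'date': ['01-05', '02-05', '03-05', '01-05', '02-05', '03-05'],
    'city': ['Delhi', 'Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Mumbai'],
    'temperature': [40, 42, 41, 33, 34, 32],
    'humidity': [30, 28, 32, 80, 82, 78]})
print(weather)

# Reshape so that each city becomes a column
print(weather.pivot(index='date', columns='city'))

# Show only the temperature values by passing the 'values' argument
print(weather.pivot(index='date', columns='city', values='temperature'))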
NOTE: If you want to display specific column values only, then specify those in the ‘values’
argument.
Here, only temperature values are displayed, while humidity values are
not displayed.
Pivot Table: The pivot table is used to summarise and aggregate data inside
the DataFrame. It aggregates the data, similar to the groupby method, using one
or more keys.
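A sketch using the illustrative weather DataFrame created above:
# By default, pivot_table aggregates with the mean
print(weather.pivot_table(index='city', columns='date'))

# An explicit aggregation function can be passed with aggfunc
print(weather.pivot_table(index='city', columns='date', aggfunc='count'))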
Now, if we have to show only one attribute, temperature, we use the values
argument with the value ‘temperature’. This hides the humidity attribute.
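Continuing the same sketch:
print(weather.pivot_table(index='city', columns='date', values='temperature'))   # humidity is hidden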
Note: You can replace aggfunc with any other aggregate function, such as sum, mean, or count,
as discussed in earlier sections.
NOTE: If any value is missing or not given in any column of the pivot table, then the cell
would be NaN (Not a Number) by default, but if you want to apply the aggregation over the
table, then you can use fill_value = 0 (the missing value will be filled with the digit 0) as an
argument to the pivot_table() function.
6.6 CROSS-TABULATIONS
This is another form of the pivot table. It gives a compact summary of a larger table.
Cross-tables are often used for the computation of group frequencies. In short, a
cross-tabulation is also called a cross table or contingency table. In the example
below, a small dataset on the performance of students from different streams is
given:
Apply the cross-table function: The crosstab function is used to create a cross-
tabulation from the DataFrame.
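A small sketch (the column names and values of the student dataset are assumed):
import pandas as pd

students = pd.DataFrame({
    'Stream': ['Science', 'Commerce', 'Science', 'Arts', 'Commerce', 'Science'],
    'Result': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass']})
print(pd.crosstab(students['Stream'], students['Result']))   # frequency of each Result within each Stream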
Apply the margins with cross-tabulations: The margins parameter adds the total
aggregate (sum) of each row and each column separately.
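Continuing the same sketch:
print(pd.crosstab(students['Stream'], students['Result'], margins=True))   # adds an 'All' row and column with totals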
IN-TEXT QUESTIONS
5. In pandas, which method is used to create a pivot table?
6. What is the primary purpose of a pivot table?
7. In pandas, what does the margins parameter in the pivot_table() function
do?
8. In pandas, what is the primary function of the crosstab() function?
6.7 SUMMARY
Data aggregation and group operations involve organising data based on specific
criteria, applying operations within these groups, and combining the results. The
“Group by” mechanic in databases helps in categorising data for aggregation,
allowing the use of functions like sum or average within each group. A broader
approach, known as split-apply-combine, involves breaking down data, applying
operations independently to each segment, and then merging the outcomes.
This flexible method accommodates various analyses and transformations.
Pivot tables, commonly found in spreadsheet tools and programming libraries
like pandas, simplify data analysis by reshaping and summarising information.
6.8 GLOSSARY
6.9 ANSWERS TO IN-TEXT QUESTIONS
1. groupby()
2. agg()
3. median()
4. Grouping data at multiple levels of hierarchy
5. pivot_table()
6. Reshaping and summarising data based on specified criteria
7. Adds an extra column and row for subtotals
8. Generating contingency tables
6.10 SELF-ASSESSMENT QUESTIONS
1. How do you use the groupby function in the pandas library to group a
DataFrame by a specific column?
2. Write a Python code snippet to calculate the sum of a numeric column
named ‘sales’ for each group in a pandas DataFrame after using the groupby
function.
3. How do you create a pivot table from a DataFrame and show the average values
for two columns, ‘A’ and ‘B’, based on the grouping of column ‘X’?
4. How would you use the crosstab function from the pandas library to create
a contingency table for two categorical variables ‘Category’ and ‘Region’
from a DataFrame?
6.11 REFERENCES