UNIT – II (2nd Half)
Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There
are two primary actions:
stack
This “rotates” or pivots from the columns in the data to the rows
unstack
This pivots from the rows into the columns
Consider a small DataFrame with string arrays as row and column indexes:
In [120]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
   .....:                     index=pd.Index(['Ohio', 'Colorado'], name='state'),
   .....:                     columns=pd.Index(['one', 'two', 'three'], name='number'))
In [121]: data
Out[121]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
Using the stack method on this data pivots the columns into the rows, producing a Series:
In [122]: result = data.stack()
In [123]: result
Out[123]:
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame
with unstack:
In [124]: result.unstack()
Out[124]:
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
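By default, unstack pivots the innermost level of the index. You can pass a level number or name to unstack a different level instead. The following is a minimal sketch (not part of the original example run) that pivots the state level; the expected result is shown in comments:
# Unstack the 'state' level instead of the innermost 'number' level
result.unstack('state')      # equivalent to result.unstack(0)

# Expected result: the state labels become the columns
# state   Ohio  Colorado
# number
# one        0         3
# two        1         4
# three      2         5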
PIVOTING
Pivoting is a fundamental data transformation technique used in data analysis, especially
when working with tabular data. It involves rearranging or rotating data from one format
to another to make it more readable, meaningful, or suitable for analysis. Pivoting is
typically used to summarize data, organize information more effectively, and prepare it
for reporting or visualization.
The two main types of pivoting are:
1. Pivoting (long to wide): Converting rows into columns.
2. Unpivoting (wide to long): Converting columns back into rows.
Why Pivot Data?
Raw data is often collected in a “long” format, which is excellent for storing and
processing, especially in databases. However, long format data can be harder for humans
to interpret or analyze directly.
Pivoting helps:
● Summarize large datasets
● Compare values across different categories
● Create dashboards and reports
● Facilitate data visualization
● Prepare data for machine learning or statistical models
The Long vs. Wide Format
Understanding these two data formats is essential before diving into pivoting.
Long Format:
● Each row is a single observation.
Product Month Sales
A Jan 100
A Feb 150
B Jan 200
B Feb 180
Wide Format:
● Data is spread across columns, often with months (or another variable) as
columns and values aggregated under them.
Product Jan Feb
A 100 150
B 200 180
Real-World Example: Sales Data
Suppose you are a data analyst working for a retail company, and you are given monthly
sales data in the long format (as shown above). You want to create a report comparing
sales figures across products and months. Looking at the long table isn’t very intuitive,
especially as the number of rows grows.
Pivoting the Data
By pivoting the data, you restructure it so each Product becomes a row, and each Month
becomes a column, with Sales as the values. This format is far easier to read, especially
when comparing figures side by side.
How Pivoting Works (Step-by-Step)
Let’s take the earlier long format data:
Product Month Sales
A Jan 100
A Feb 150
B Jan 200
B Feb 180
Step 1: Identify the Pivot Components
● Index (rows): Product
● Columns: Month
● Values: Sales
Step 2: Apply the Pivot
Product Jan Feb
A 100 150
B 200 180
This gives a cleaner, side-by-side view of each product’s performance over different
months.
Tools That Use Pivoting
1. Excel Pivot Tables
Excel offers a built-in feature called a Pivot Table that allows users to drag and drop
fields to create pivoted views of their data.
Steps in Excel:
1. Select your dataset.
2. Go to Insert → Pivot Table.
3. Choose your fields: place "Product" in the Rows area, "Month" in Columns, and
"Sales" in Values.
4. Excel generates the pivoted table for you automatically.
2. Pandas in Python
Python’s pandas library provides powerful pivot functionality through the pivot() or
pivot_table() methods.
import pandas as pd

# Sample data
data = {
    'Product': ['A', 'A', 'B', 'B'],
    'Month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'Sales': [100, 150, 200, 180]
}
df = pd.DataFrame(data)

# Pivot the data: Product as rows, Month as columns, Sales as values
pivoted = df.pivot(index='Product', columns='Month', values='Sales')
print(pivoted)
OUTPUT
Month Feb Jan
Product
A 150 100
B 180 200
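Note that pivot() works only when each index/column pair occurs exactly once; if the data contains duplicate entries, pivot_table() can aggregate them instead. The snippet below is a brief sketch on hypothetical data (two 'Jan' rows for product A), not part of the dataset above:
# Hypothetical data in which product A has two Jan sales rows
data2 = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'B'],
    'Month': ['Jan', 'Jan', 'Feb', 'Jan'],
    'Sales': [100, 120, 150, 200]
})

# pivot() would raise an error here; pivot_table() aggregates the duplicates
summary = data2.pivot_table(index='Product', columns='Month',
                            values='Sales', aggfunc='sum')
print(summary)
# Approximate display:
# Month    Feb  Jan
# Product
# A        150  220
# B        NaN  200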
Benefits of Pivoting
● Data Clarity: Makes complex data easier to understand.
● Quick Summaries: Helps summarize trends and totals quickly.
● Better Visualization: Useful for plotting charts like heatmaps, bar charts, etc.
● Reporting: Essential for business dashboards and KPI tracking.
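The reverse operation, unpivoting (wide to long), is available in pandas through melt. A minimal sketch, reusing the pivoted table from the example above (reset_index turns Product back into an ordinary column):
# Unpivot (wide to long): the month columns collapse back into a single column
wide = pivoted.reset_index()
long_again = wide.melt(id_vars='Product', var_name='Month', value_name='Sales')
print(long_again)
#   Product Month  Sales
# 0       A   Feb    150
# 1       B   Feb    180
# 2       A   Jan    100
# 3       B   Jan    200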
Handling Missing Data
Missing data occurs commonly in many data analysis applications. One of the goals of
pandas is to make working with missing data as painless as possible. For example, all of
the descriptive statistics on pandas objects exclude missing data by default. The way that
missing data is represented in pandas objects is somewhat imperfect, but it is functional
for a lot of users. For numeric data, pandas uses the floating-point value NaN (Not a
Number) to represent missing data. We call this a sentinel value that can be easily
detected:
In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [11]: string_data
Out[11]:
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
In [12]: string_data.isnull()
Out[12]:
0 False
1 False
2 True
3 False
dtype: bool
In pandas, we’ve adopted a convention used in the R programming language by referring
to missing data as NA, which stands for not available.
Filtering Out Missing Data
There are a few ways to filter out missing data. While you always have the option to do it
by hand using pandas.isnull and boolean indexing, the dropna method can be helpful. On a
Series, it returns the Series with only the non-null data and index values:
In [15]: from numpy import nan as NA
In [16]: data = pd.Series([1, NA, 3.5, NA, 7])
In [17]: data.dropna()
Out[17]:
0 1.0
2 3.5
4 7.0
dtype: float64
This is equivalent to:
In [18]: data[data.notnull()]
Out[18]:
0 1.0
2 3.5
4 7.0
dtype: float64
With DataFrame objects, things are a bit more complex. You may want to drop rows or
columns that are all NA or only those containing any NAs. dropna by default drops any
row containing a missing value:
In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])
In [20]: cleaned = data.dropna()
In [21]: data
Out[21]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [22]: cleaned
Out[22]:
0 1 2
0 1.0 6.5 3.0
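As noted above, you may instead want to drop only rows that are all NA, or drop columns rather than rows. A brief sketch of the relevant dropna options, using the same data DataFrame:
# how='all' drops only the rows in which every value is NA (row 2 here)
data.dropna(how='all')
#      0    1    2
# 0  1.0  6.5  3.0
# 1  1.0  NaN  NaN
# 3  NaN  6.5  3.0

# To drop columns instead of rows, pass axis=1 (or axis='columns')
data.dropna(axis=1, how='all')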
Filling In Missing Data
Rather than filtering out missing data (and potentially discarding other data along with
it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna
method is the workhorse function to use.
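The DataFrame df used in the next example is not constructed in this excerpt. A minimal sketch of how a comparable frame with missing values could be set up is shown below; since the values are random, the numbers in the output that follows will differ from run to run:
import numpy as np
import pandas as pd

# Hypothetical setup: random values with NaN holes in columns 1 and 2
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = np.nan   # column 1 missing in the first four rows
df.iloc[:2, 2] = np.nan   # column 2 missing in the first two rows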
Calling fillna with a constant replaces missing values with that value:
In [33]: df.fillna(0)
Out[33]:
0 1 2
0 -0.204708 0.000000 0.000000
1 -0.555730 0.000000 0.000000
2 0.092908 0.000000 0.769023
3 1.246435 0.000000 -1.296221
4 0.274992 0.228913 1.352917
5 0.886429 -2.001637 -0.371843
6 1.669025 -0.438570 -0.539741
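Calling fillna with a dict lets you use a different fill value for each column, and values can also be propagated rather than replaced with a constant. A brief sketch of both options:
# Per-column fill values: fill column 1 with 0.5 and column 2 with -1
df.fillna({1: 0.5, 2: -1})

# Forward fill: propagate the last valid observation down each column
df.ffill()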
Data Transformation
Filtering, cleaning, and other transformations are another class of important operations.
Removing Duplicates
Duplicate rows may be found in a DataFrame for any number of reasons.
Here is an example:
In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})
In [46]: data
Out[46]:
k1 k2
0 one 1
1 two 1
2 one 2
3 two 3
4 one 3
5 two 4
6 two 4
The DataFrame method duplicated returns a boolean Series indicating whether each row
is a duplicate (has been observed in a previous row) or not:
In [47]: data.duplicated()
Out[47]:
0 False
1 False
2 False
3 False
4 False
5 False
6 True
dtype: bool
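Relatedly, drop_duplicates returns a DataFrame with the rows flagged by duplicated removed. A brief sketch (the subset variant is shown as well):
# Drop the rows where duplicated() is True (row 6 in this example)
data.drop_duplicates()
#     k1  k2
# 0  one   1
# 1  two   1
# 2  one   2
# 3  two   3
# 4  one   3
# 5  two   4

# Duplicates can also be judged on a subset of columns, e.g. only 'k1'
data.drop_duplicates(['k1'])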
Transforming Data Using a Function or Mapping
For many datasets, you may wish to perform some transformation based on the values in
an array, Series, or column in a DataFrame.
Consider the following hypothetical data collected about various kinds of meat:
In [52]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
   ....:                               'corned beef', 'Bacon', 'pastrami',
   ....:                               'honey ham', 'nova lox'],
   ....:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [53]: data
Out[53]:
food ounces
0 bacon 4.0
1 pulled pork 3.0
2 bacon 12.0
3 Pastrami 6.0
4 corned beef 7.5
5 Bacon 8.0
6 pastrami 3.0
7 honey ham 5.0
8 nova lox 6.0
Suppose you wanted to add a column indicating the type of animal that each food came
from. Let’s write down a mapping of each distinct meat type to the kind of animal:
meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}
The map method on a Series accepts a function or dict-like object containing a mapping,
but here we have a small problem in that some of the meats are capitalized and others
are not. Thus, we need to convert each value to lowercase using the str.lower Series
method:
In [55]: lowercased = data['food'].str.lower()
In [56]: lowercased
Out[56]:
0 bacon
1 pulled pork
2 bacon
3 pastrami
4 corned beef
5 bacon
6 pastrami
7 honey ham
8 nova lox
Name: food, dtype: object
In [57]: data['animal'] = lowercased.map(meat_to_animal)
In [58]: data
Out[58]:
food ounces animal
0 bacon 4.0 pig
1 pulled pork 3.0 pig
2 bacon 12.0 pig
3 Pastrami 6.0 cow
4 corned beef 7.5 cow
5 Bacon 8.0 pig
6 pastrami 3.0 cow
7 honey ham 5.0 pig
8 nova lox 6.0 salmon
We could also have passed a function that does all the work:
In [59]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[59]:
0 pig
1 pig
2 pig
3 cow
4 cow
5 pig
6 cow
7 pig
8 salmon
Name: food, dtype: object
Using map is a convenient way to perform element-wise transformations and other data
cleaning–related operations.
String Manipulation
Python has long been a popular raw data manipulation language in part due to its ease of
use for string and text processing. Most text operations are made simple with the string
object’s built-in methods. For more complex pattern matching and text manipulations,
regular expressions may be needed. pandas adds to the mix by enabling you to apply
string and regular expressions concisely on whole arrays of data, additionally handling
the annoyance of missing data.
String Object Methods
In many string munging and scripting applications, built-in string methods are sufficient.
As an example, a comma-separated string can be broken into pieces with split:
In [134]: val = 'a,b, guido'
In [135]: val.split(',')
Out[135]: ['a', 'b', ' guido']
split is often combined with strip to trim whitespace (including line breaks):
In [136]: pieces = [x.strip() for x in val.split(',')]
In [137]: pieces
Out[137]: ['a', 'b', 'guido']
These substrings could be concatenated together with a two-colon delimiter using
addition:
In [138]: first, second, third = pieces
In [139]: first + '::' + second + '::' + third
Out[139]: 'a::b::guido'
A faster and more Pythonic way is to pass a list or tuple to the join method on the string
'::':
In [140]: '::'.join(pieces)
Out[140]: 'a::b::guido'
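As mentioned at the start of this section, pandas can apply these string operations across a whole Series while skipping missing values. A brief sketch using the Series .str accessor (the sample values are made up for illustration):
import numpy as np
import pandas as pd

s = pd.Series(['a,b, guido', 'c,d', np.nan])

# Vectorized string methods skip NaN instead of raising an error
s.str.split(',')          # each non-null string is split; NaN stays NaN
s.str.contains('guido')   # boolean Series, NaN where the value was missing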
Discretization and Binning
Continuous data is often discretized or otherwise separated into “bins” for analysis.
Suppose you have data about a group of people in a study, and you need to group them
into discrete age buckets:
In [75]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older.
To do so, you have to use cut, a function in pandas:
In [76]: bins = [18, 25, 35, 60, 100]
In [77]: cats = pd.cut(ages, bins)
In [78]: cats
Out[78]:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25,
35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
The object pandas returns is a special Categorical object. The output you see describes
the bins computed by pandas.cut. You can treat it like an array of strings indicating the
bin name; internally it contains a categories array specifying the distinct category names
along with a labeling for the ages data in the codes attribute:
In [79]: cats.codes
Out[79]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
In [80]: cats.categories
Out[80]:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
In [81]: pd.value_counts(cats)
Out[81]:
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
Note that pd.value_counts(cats) gives the bin counts for the result of pandas.cut.
Consistent with mathematical notation for intervals, a parenthesis means that the side is
open, while the square bracket means it is closed (inclusive). You can change which side
is closed by passing right=False:
In [82]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[82]:
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26,
36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
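You can also supply your own bin names by passing a list of labels to cut, and qcut bins the data by sample quantiles instead of explicit edges. A brief sketch (the label strings below are made up for illustration):
# Readable bin names instead of interval notation
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

# qcut computes the bin edges from sample quantiles (here, quartiles)
pd.qcut(ages, 4)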
Outliers
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
For ease of presentation we may refer to data objects that are not outliers as “normal” or
expected data. Similarly, we may refer to outliers as “abnormal” data.
In the figure below, most objects follow a roughly Gaussian distribution. However, the
objects in region R are significantly different. It is unlikely that they follow the same
distribution as the other objects in the data set. Thus, the objects in R are outliers in the
data set.
Outliers are different from noisy data.
Noise is a random error or variance in a measured variable. In general, noise is not
interesting in data analysis, including outlier detection.
For example, in credit card fraud detection, a customer’s purchase behavior can be
modeled as a random variable. A customer may generate some “noise transactions” that
may seem like “random errors” or “variance,” such as by buying a bigger lunch one day,
or having one more cup of coffee than usual. Such transactions should not be treated as
outliers; otherwise, the credit card company would incur heavy costs from verifying that
many transactions. The company may also lose customers by bothering them with
multiple false alarms.
As in many other data analysis and data mining tasks, noise should be removed before
outlier detection. Outliers are interesting because they are suspected of not being
generated by the same mechanisms as the rest of the data. Therefore, in outlier detection,
it is important to justify why the outliers detected are generated by some other
mechanisms. This is often achieved by making various assumptions on the rest of the data
and showing that the outliers detected violate those assumptions significantly.
Outlier detection is also related to novelty detection in evolving data sets. For example,
by monitoring a social media web site where new content is incoming, novelty detection
may identify new topics and trends in a timely manner. Novel topics may initially appear
as outliers. To this extent, outlier detection and novelty detection share some similarity
in modeling and detection methods. However, a critical difference between the two is that
in novelty detection, once new topics are confirmed, they are usually incorporated into
the model of normal behavior so that follow-up instances are not treated as outliers
anymore.
Types of Outliers
In general, outliers can be classified into three categories, namely
● Global outliers
● Contextual (or conditional) outliers, and
● Collective outliers.
Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from the rest
of the data set. Global outliers are sometimes called point anomalies, and are the simplest
type of outliers. Most outlier detection methods are aimed at finding global outliers.
Consider the points in the above figure again. The points in region R significantly deviate
from the rest of the data set, and hence are examples of global outliers. To detect global
outliers, a critical issue is to find an appropriate measurement of deviation with respect
to the application in question. Various measurements are proposed, and, based on these,
outlier detection methods are partitioned into different categories. We will come to this
issue in detail later. Global outlier detection is important in many applications. Consider
intrusion detection in computer networks, for example. If the communication behavior of
a computer is very different from the normal patterns (e.g., a large number of packages is
broadcast in a short time), this behavior may be considered as a global outlier and the
corresponding computer is a suspected victim of hacking. As another example, in trading
transaction auditing systems, transactions that do not follow the regulations are
considered as global outliers and should be held for further examination.
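As one concrete illustration of a deviation measurement (a common convention, not something prescribed by the text), a simple z-score rule flags points that lie many standard deviations from the mean. A minimal sketch on synthetic data:
import numpy as np

# Synthetic data: 200 "normal" values plus one injected extreme value
rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=1.0, size=200)
values[50] = 25.0

# z-score: how many standard deviations each point lies from the mean
z = (values - values.mean()) / values.std()

# Illustrative rule: treat |z| > 3 as a global outlier
outliers = values[np.abs(z) > 3]   # contains only the injected 25.0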
Contextual Outliers
“The temperature today is 28°C. Is it exceptional (i.e., an outlier)?” It depends, for
example, on the time and location! If it is in winter in Toronto, yes, it is an outlier. If it is a
summer day in Toronto, then it is normal. Unlike global outlier detection, in this case,
Outlier Detection whether or not today’s temperature value is an outlier depends on the
context—the date, the location, and possibly some other factors.
In a given data set, a data object is a contextual outlier if it deviates
significantly with respect to a specific context of the object. Contextual outliers are also
known as conditional outliers because they are conditional on the selected context.
Therefore, in contextual outlier detection, the context has to be specified as part of the
problem definition. Generally, in contextual outlier detection, the attributes of the data
objects in question are divided into two groups:
Contextual attributes: The contextual attributes of a data object define the object’s
context. In the temperature example, the contextual attributes may be date and location.
Behavioral attributes: These define the object’s characteristics, and are used to evaluate
whether the object is an outlier in the context to which it belongs. In the temperature
example, the behavioral attributes may be the temperature, humidity, and pressure.
Unlike global outlier detection, in contextual outlier detection,
whether a data object is an outlier depends on not only the behavioral attributes but also
the contextual attributes. A configuration of behavioral attribute values may be
considered an outlier in one context (e.g., 28°C is an outlier for a Toronto winter), but
not an outlier in another context (e.g., 28°C is not an outlier for a Toronto summer).
Contextual outliers are a generalization of local outliers, a notion introduced in density-
based outlier analysis approaches. An object in a data set is a local outlier if its density
significantly deviates from the local area in which it occurs.
Collective Outliers
Suppose you are a supply-chain manager of AllElectronics. You handle thousands of
orders and shipments every day. If the shipment of an order is delayed, it may not be
considered an outlier because, statistically, delays occur from time to time. However, you
have to pay attention if 100 orders are delayed on a single day. Those 100 orders as a
whole form an outlier, although each of them may not be regarded as an outlier if
considered individually. You may have to take a close look at those orders collectively to
understand the shipment problem.
Given a data set, a subset of data objects forms a collective outlier if the
objects as a whole deviate significantly from the entire data set. Importantly, the
individual data objects may not be outliers.
In the figure below, the black objects as a whole form a collective outlier because the
density of those objects is much higher than the rest in the data set. However, every black
object individually is not an outlier with respect to the whole data set.