Windonction
Windonction
Windonction
A Window Function performs a Data Analysis calculation across a set of table rows that are
somehow related gives a single row or grouped by condition (refer to Figure 1).
Window function does not cause rows to become grouped into a single output row. Rows retain
their separate identities also able to access more than just the current row of the query result.
(refer to Figure 1).
For more information please visit my last blog, I Introduced the SQL Window Function-Part I
and gave an overview of different types of SQL Window Functions.
In this blog, I am explaining all three major types of SQL Window Function along with
examples.
a. AVG():
Example:
Query:
SELECT order_id, product_id, price, AVG(price) OVER (PARTITION BY order_id) AS
Average_Order_Price
FROM retails
To calculate average order price, window function used AVG() on price columns and
partition by on order_id
Consider order id: 1112 — consist 3 products (i.e. 1,2,5). The average value of those 3
products are (866 + 163 +173) / 3 = 400.667
b. COUNT():
Calculates the number of rows with NULL values too if available in column or
expression.
This window function is helpful while creating a new feature in the dataset. Like count
number of entries belong to each customer.
Example:
Query 2: Calculate the number of product sales purchased (running total) by the customer.
Explanation:
Query 1:
Partition by order_id counts number of records belongs to particular order_id.
In the output, we can see several products for each order displayed.
Query 2:
Order by order_id counts a number of records and particular order_id and then add
number records of consecutive order.
Output: We can see the count is increased by the number of records related to particular
order_id.
c. Min() or Max():
Min() or Max() return Minimum or Maximum value of the expression across the input
values respectively.
Both window function works with Numeric values and ignores NULL values.
Example:
Below query add a new feature into the result set, Minumum and Maximum price of the product
purchased in respective order.
Query:
SELECT order_id, name, product_id, price,
MIN(price) OVER (Partition BY order_id) AS Minimum_Price_Product,
MAX(price) OVER (Partition BY order_id) AS Maximum_Price_Product
FROM retails
Explanation:
For each order_id record respective minimum and maximum price of the product has
been added.
We can use each function separately too.
d. Sum():
Example:
Query:
SELECT order_id, name, product_id, price,
SUM(price) OVER (PARTITION BY order_id) AS Average_Order_Price
FROM retails
Explanation:
a. RANK():
The Rank of a value in a group of values based on the ORDER BY expression in the
OVER clause (refer Query 1).
Each value is ranked within its PARTITION BY expression (refer Query 2).
Rows with equal values for the ranking criteria receive the same rank.
Tie or same rank skip the consecutive rank eg. Rank (): 1,1,3,4,5.
Example:
Query 2: Rank the product based on their prices in each order (i.e. partition by order_id).
Explanation:
As we can see in both query, ORDER BY states the expression used to rank the values.
Query 1:
The ranking is done based on product_price.
Also note, 9 rows with the same value has tie rank 1.
So next Rank value starts with 10.
Query 2:
b. DENSE_RANK():
Similarly to Rank() function, Rank of a value in a group of values based on the ORDER
BY expression and the OVER clause and each value is ranked within its PARTITION
BY expression.
The difference is, Rows with equal values receive the same rank and Tie or the same rank
not skip the consecutive rank.
Example: Dense_Rank(): 1,1,2,3,4
EXAMPLE:
Query: Dense_Rank the product based on their prices in each order (i.e. partition by order_id).
Explanation:
As we can see ranking to each row done based on ORDER BY expression i.e. price
values also within each order_id i.e. (PARTITION BY order_id).
order_id 1114, have 5 products out of which 2 products having the same price hence rank
tie i.e. 1.
The next rank starts with 2 (this is the major difference between Rank() and
Desne_Rank() function).
Dense_Rank() not skip the consecutive rank number.
c. CUME_DIST():
Calculates Relative Rank of the current row within a window partition based on below
Formula:
EXAMPLE:
Query: CUME_DIST i.e. Relative rank the product based on their prices in each order (i.e.
partition by order_id).
Explanation:
Let's consider order_id 1112 having 3 products, relative rank calculated as discussed below
formula used:
Similarly, if product having same value or price then relative rank also same check order_id
1114 in output screenshot.
d. ROW_NUMBER():
An ordinal number of the current row within its partition based on ORDER BY
expression in the OVER clause.
Each value is ordered within its PARTITION BY expression.
Rows with equal values for the ORDER BY expressions receive different row numbers
non deterministically.
EXAMPLE:
Query: Assign Row_Number to the product based on their prices in each order (i.e. partition by
order_id).
Explanation:
As we can see in output screenshot, row number assign based on price (ORDER BY
expression) within each order (PARTITION BY order_id).
Not consider value same or not, just assign row_number to each row in the expression.
e. NTILE():
Divides the rows for each window partition, as equally as possible, into a specified
number of ranked groups.
Requires ORDER BY clause in the OVER clause.
The column or expression specified in ORDER BY clause, first all values has been sorted
in ascending order and then equally assign group number.
Example:
Query: Assign Group/cluster/bucket number to all row into 10 different groups based on the
product price.
Explanation:
Example:
Query: Calculate or assign percentage rank to all row based on the product price.
Explanation:
Let’s consider order_id 1114 having 5 products, relative rank calculated as discussed below
formula used:
Row no. 9— First product: (1–1)/(5–1) = 0
Row no. 10— Second product: (1–1)/(5–1) = 0
Row no. 11— Third product: (3–1)/(5–1) = 0.5
Row no. 12 — Forth product: (4–1)/(5–1) = 0.75
Row no. 13 — Fifth product: (5–1)/(5–1) = 1
a. LAG() or LEAD():
Syntax:
LAG or LEAD returns, value for the row value before or after the current row in a
partition respectively.
If no row exists, null is returned.
Example:
Query: Add new feature 1 step LAG or LEAD product price within each order (i.e.
PARTITION BY order_id)
Explanation:
As we can see both lag and lead column give 1 step value respectively within the
partition of order_id.
When its first row in LAG() gets imputed with NULL.
Similarly, in LEAD() the last row gets imputed with NULL.
b. FIRST_VALUE() or LAST_VALUE():
Syntax:
Example:
Query: Add new feature FIRST_VALUE or LAST_VALUE product price within each order
(i.e. PARTITION BY order_id).
Explanation:
As we can see both First_Value() and Last_Value() column give first-row value and last-
row value respectively.
c. NTH_VALUE():
Syntax:
NTH_VALUE return nth value assigned with expression with respect to the window
frame (partition) respectively.
If nth_value is not available then its get imputed with NULL.
Example:
Query: Add new feature NTH_VALUE product price within each order (i.e. PARTITION BY
order_id).
Explanation:
Conclusion:
In this blog, I tried to explain the major types of Window Function in SQL.
Window Functions are very useful while doing data analysis using SQL and easy to use.
Window Functions revolves around OVER, PARTITION BY and ROW or RANGE clauses.