Sorting and Aggregating in HiveQL
• Understanding
– ORDER BY,
– SORT BY,
– GROUP BY, and more
Introduction to Sorting
• Purpose: Arrange results in a specific
order
• Key Clauses: ORDER BY, SORT BY,
DISTRIBUTE BY, CLUSTER BY
ORDER BY
• Global sort across all data
• Uses a single reducer (slower for large
data)
• Example:
– SELECT name, salary FROM employees
ORDER BY salary DESC;
SORT BY
• Sorts data within each reducer
• Faster, but output not globally sorted
• Example:
– SELECT name, salary FROM employees
SORT BY salary ASC;
DISTRIBUTE BY + SORT BY
• Distributes data by a column, sorts inside
partitions
• Example:
– SELECT name, department, salary
– FROM employees
– DISTRIBUTE BY department
– SORT BY salary DESC;
CLUSTER BY
• Shortcut for DISTRIBUTE BY + SORT BY
(same column, ASC)
• Example:
– SELECT name, department FROM
employees CLUSTER BY department;
Aggregating in HiveQL
• Purpose: Summarize data
• Functions: COUNT(), SUM(), AVG(),
MIN(), MAX()
GROUP BY
• Groups rows and applies aggregation
• Example:
– SELECT department, AVG(salary) AS
avg_salary
– FROM employees
– GROUP BY department;
HAVING Clause
• Filters groups after aggregation
• Example:
– SELECT department, COUNT(*) AS
emp_count
– FROM employees
– GROUP BY department
– HAVING COUNT(*) > 5;
Summary & Key Points
• ORDER BY → global, slow for large data
• SORT BY → local per reducer sort
• GROUP BY + aggregates for summaries
• HAVING for post-aggregation filters