[go: up one dir, main page]

100% found this document useful (1 vote)
110 views10 pages

Sorting and Aggregating in HiveQL

The document explains sorting and aggregating in HiveQL, detailing key clauses such as ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY for sorting data, as well as GROUP BY and HAVING for data aggregation. ORDER BY provides a global sort but is slower for large datasets, while SORT BY sorts within each reducer. The document also highlights the use of aggregation functions like COUNT(), SUM(), AVG(), MIN(), and MAX() for summarizing data.

Uploaded by

kanishqchezian
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
110 views10 pages

Sorting and Aggregating in HiveQL

The document explains sorting and aggregating in HiveQL, detailing key clauses such as ORDER BY, SORT BY, DISTRIBUTE BY, and CLUSTER BY for sorting data, as well as GROUP BY and HAVING for data aggregation. ORDER BY provides a global sort but is slower for large datasets, while SORT BY sorts within each reducer. The document also highlights the use of aggregation functions like COUNT(), SUM(), AVG(), MIN(), and MAX() for summarizing data.

Uploaded by

kanishqchezian
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 10

Sorting and Aggregating in HiveQL

• Understanding
– ORDER BY,
– SORT BY,
– GROUP BY, and more
Introduction to Sorting
• Purpose: Arrange results in a specific
order
• Key Clauses: ORDER BY, SORT BY,
DISTRIBUTE BY, CLUSTER BY
ORDER BY
• Global sort across all data
• Uses a single reducer (slower for large
data)
• Example:
– SELECT name, salary FROM employees
ORDER BY salary DESC;
SORT BY
• Sorts data within each reducer
• Faster, but output not globally sorted
• Example:
– SELECT name, salary FROM employees
SORT BY salary ASC;
DISTRIBUTE BY + SORT BY
• Distributes data by a column, sorts inside
partitions
• Example:
– SELECT name, department, salary
– FROM employees
– DISTRIBUTE BY department
– SORT BY salary DESC;
CLUSTER BY
• Shortcut for DISTRIBUTE BY + SORT BY
(same column, ASC)
• Example:
– SELECT name, department FROM
employees CLUSTER BY department;
Aggregating in HiveQL
• Purpose: Summarize data
• Functions: COUNT(), SUM(), AVG(),
MIN(), MAX()
GROUP BY
• Groups rows and applies aggregation
• Example:
– SELECT department, AVG(salary) AS
avg_salary
– FROM employees
– GROUP BY department;
HAVING Clause
• Filters groups after aggregation
• Example:
– SELECT department, COUNT(*) AS
emp_count
– FROM employees
– GROUP BY department
– HAVING COUNT(*) > 5;
Summary & Key Points
• ORDER BY → global, slow for large data
• SORT BY → local per reducer sort
• GROUP BY + aggregates for summaries
• HAVING for post-aggregation filters

You might also like