MapReduce Scripts in HiveQL
• Using TRANSFORM, MAP, and REDUCE
clauses with external scripts.
• Integrates HiveQL with custom logic
written in languages like Python, Perl, or
Bash.
Introduction
• Hive executes queries as MapReduce jobs
internally.
• Sometimes built-in Hive functions are not
enough, so external scripts can be used.
• Keywords:
• - TRANSFORM: Pass data to external script and
get processed output.
• - MAP: Apply an external script in the mapper
phase.
• - REDUCE: Apply an external script in the
reducer phase.
TRANSFORM Clause
• Definition: TRANSFORM sends Hive query results to an
external script for processing.
• Syntax:
• SELECT TRANSFORM (col1, col2, ...)
• USING 'script_name'
• AS (new_col1, new_col2, ...)
• FROM table_name;
• Keywords:
• - USING: Specifies the script/command to run.
• - AS: Defines the schema for the output returned from
the script.
Example: Python Script with
TRANSFORM
• Python script (process_data.py):
• #!/usr/bin/env python
• import sys
• for line in sys.stdin:
• name, salary = line.strip().split('\t')
• print(f"{name}\t{float(salary)*1.1}")
• HiveQL:
• ADD FILE process_data.py; -- Make script available on cluster
• SELECT TRANSFORM (name, salary)
• USING 'python process_data.py'
• AS (name STRING, new_salary FLOAT)
• FROM employees;
MAP Clause
• Definition: Applies an external script during the map
phase of a MapReduce job.
• Syntax:
• MAP col1, col2 USING 'script_name' AS (output_cols...)
• Example:
• FROM (
• MAP employees.name, employees.salary
• USING 'python mapper.py'
• AS name STRING, salary FLOAT
• ) map_output
REDUCE Clause
• Definition: Applies an external script during the reduce
phase.
• Often used after MAP to aggregate or combine data.
• Syntax:
• REDUCE col1, col2 USING 'script_name' AS
(output_cols...)
• Example:
• REDUCE map_output.name, map_output.salary
• USING 'python reducer.py'
• AS name STRING, total_salary FLOAT;
Best Practices
• - Always use ADD FILE to distribute
scripts.
• - Scripts read input from STDIN and write
output to STDOUT.
• - Default field delimiter is TAB ('\t') —
handle it carefully.
• - Ensure output data types match Hive
table definitions.
• - Prefer TRANSFORM over
MAP/REDUCE for simplicity.
Summary
• TRANSFORM: Sends data to an external
script for row-wise processing.
• MAP: Custom processing in mapper
phase.
• REDUCE: Custom processing in reducer
phase.
• Best for advanced analytics, data
cleaning, text parsing, or ML integration.