[go: up one dir, main page]

0% found this document useful (0 votes)
15 views8 pages

MapReduce Scripts in HiveQL With Keywords

The document explains how to use MapReduce scripts in HiveQL with TRANSFORM, MAP, and REDUCE clauses to integrate external scripts written in languages like Python, Perl, or Bash. It provides syntax examples and best practices for implementing these clauses to enhance data processing capabilities in Hive. The use of external scripts is recommended for advanced analytics, data cleaning, and machine learning integration.

Uploaded by

kanishqchezian
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views8 pages

MapReduce Scripts in HiveQL With Keywords

The document explains how to use MapReduce scripts in HiveQL with TRANSFORM, MAP, and REDUCE clauses to integrate external scripts written in languages like Python, Perl, or Bash. It provides syntax examples and best practices for implementing these clauses to enhance data processing capabilities in Hive. The use of external scripts is recommended for advanced analytics, data cleaning, and machine learning integration.

Uploaded by

kanishqchezian
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 8

MapReduce Scripts in HiveQL

• Using TRANSFORM, MAP, and REDUCE


clauses with external scripts.
• Integrates HiveQL with custom logic
written in languages like Python, Perl, or
Bash.
Introduction
• Hive executes queries as MapReduce jobs
internally.
• Sometimes built-in Hive functions are not
enough, so external scripts can be used.
• Keywords:
• - TRANSFORM: Pass data to external script and
get processed output.
• - MAP: Apply an external script in the mapper
phase.
• - REDUCE: Apply an external script in the
reducer phase.
TRANSFORM Clause
• Definition: TRANSFORM sends Hive query results to an
external script for processing.
• Syntax:
• SELECT TRANSFORM (col1, col2, ...)
• USING 'script_name'
• AS (new_col1, new_col2, ...)
• FROM table_name;
• Keywords:
• - USING: Specifies the script/command to run.
• - AS: Defines the schema for the output returned from
the script.
Example: Python Script with
TRANSFORM
• Python script (process_data.py):
• #!/usr/bin/env python
• import sys
• for line in sys.stdin:
• name, salary = line.strip().split('\t')
• print(f"{name}\t{float(salary)*1.1}")

• HiveQL:
• ADD FILE process_data.py; -- Make script available on cluster
• SELECT TRANSFORM (name, salary)
• USING 'python process_data.py'
• AS (name STRING, new_salary FLOAT)
• FROM employees;
MAP Clause
• Definition: Applies an external script during the map
phase of a MapReduce job.
• Syntax:
• MAP col1, col2 USING 'script_name' AS (output_cols...)
• Example:
• FROM (
• MAP employees.name, employees.salary
• USING 'python mapper.py'
• AS name STRING, salary FLOAT
• ) map_output
REDUCE Clause
• Definition: Applies an external script during the reduce
phase.
• Often used after MAP to aggregate or combine data.
• Syntax:
• REDUCE col1, col2 USING 'script_name' AS
(output_cols...)
• Example:
• REDUCE map_output.name, map_output.salary
• USING 'python reducer.py'
• AS name STRING, total_salary FLOAT;
Best Practices
• - Always use ADD FILE to distribute
scripts.
• - Scripts read input from STDIN and write
output to STDOUT.
• - Default field delimiter is TAB ('\t') —
handle it carefully.
• - Ensure output data types match Hive
table definitions.
• - Prefer TRANSFORM over
MAP/REDUCE for simplicity.
Summary
• TRANSFORM: Sends data to an external
script for row-wise processing.
• MAP: Custom processing in mapper
phase.
• REDUCE: Custom processing in reducer
phase.
• Best for advanced analytics, data
cleaning, text parsing, or ML integration.

You might also like