
Unit-4

Apache HIVE
What is Hive?

Apache Hive is a data warehouse software that facilitates reading, writing, and
managing large datasets residing in distributed storage using SQL. Hive is built on
top of Apache Hadoop and allows users to query and analyze large datasets
stored in Hadoop Distributed File System (HDFS) using a SQL-like language called
HiveQL (Hive Query Language). Hive provides a familiar SQL interface for users
who are not familiar with Hadoop, making it easier to analyze large datasets.

Key Features of Hive

SQL-like language (HiveQL): Hive uses a SQL-like language called HiveQL, which is
easy to learn and use for users who are familiar with SQL.

Scalability: Hive is designed to scale to handle large datasets, making it suitable
for big data analytics.

Performance: Hive can query large datasets efficiently by leveraging Hadoop's
distributed processing capabilities.

Flexibility: Hive supports various storage formats, including plain text, ORC, and
Parquet, and can integrate with other Hadoop ecosystem tools (a short sketch
appears after this list of features).

Ease of Use: Hive's SQL-like interface makes it easy for users who are familiar with
SQL to query and analyze large datasets in Hadoop.

Data Management: Hive provides data management capabilities, such as table
creation, data partitioning, and data summarization, making it easier to organize
and analyze large datasets.

Integration with Hadoop Ecosystem: Hive integrates seamlessly with other
Hadoop ecosystem tools, such as HDFS, MapReduce, and Spark, enabling a unified
big data analytics platform.
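To illustrate the storage-format flexibility noted above, the following HiveQL
sketch creates tables backed by different file formats; all table and column names
are hypothetical.

-- Hypothetical tables illustrating different storage formats.
CREATE TABLE events_text (event_id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE events_orc (event_id BIGINT, payload STRING)
STORED AS ORC;

CREATE TABLE events_parquet (event_id BIGINT, payload STRING)
STORED AS PARQUET;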

Use Cases of Hive

Data Warehousing: Hive can be used to create and manage data warehouses for
storing and analyzing large datasets.

ETL (Extract, Transform, Load): Hive can be used to extract data from various
sources, transform it into a suitable format, and load it into Hadoop for further
analysis.

Ad-hoc Analysis: Hive allows users to perform ad-hoc analysis on large datasets
using SQL-like queries.

Reporting: Hive can be used to generate reports and visualizations based on large
datasets.

Overall, Hive is a valuable tool for big data analytics, providing a user-friendly SQL
interface for querying and analyzing large datasets stored in Hadoop.
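As a small illustration of the ETL and reporting use cases, the HiveQL sketch
below builds a daily summary table from a raw source table; all table, column,
and partition names are hypothetical.

-- Hypothetical summary table stored as ORC.
CREATE TABLE IF NOT EXISTS daily_page_views (
  page_url STRING,
  view_count BIGINT
)
STORED AS ORC;

-- Transform and load one day of hypothetical raw log data.
INSERT OVERWRITE TABLE daily_page_views
SELECT page_url, COUNT(*) AS view_count
FROM raw_web_logs
WHERE log_date = '2024-01-01'
GROUP BY page_url;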

Hive Shell
What is Hive Shell?

Hive Shell, also known as the Hive CLI (command-line interface), is a primary way
to interact with Apache Hive. It provides a text-based interface for executing
HiveQL statements, managing Hive metadata, and interacting with HiveServer2.
Hive Shell is a legacy tool that was introduced in the early versions of Hive and is
still supported in newer versions. However, it is recommended to use Beeline, a
more modern and feature-rich CLI for Hive, for most use cases.
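A minimal sketch of an interactive session follows; the statements are typed at
the hive> prompt, and the same statements could be saved in a file and run in
batch mode (for example, hive -f queries.hql). The database and table names are
hypothetical.

-- Typed interactively at the hive> prompt.
SHOW DATABASES;
USE sales_db;                 -- hypothetical database
SHOW TABLES;
DESCRIBE orders;              -- hypothetical table
SELECT COUNT(*) FROM orders;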

Features of Hive Shell


Interactive mode: Hive Shell allows users to interactively execute HiveQL
statements and view the results immediately.

Batch mode: Hive Shell can also be used to execute HiveQL statements from
scripts or files.

Metadata management: Hive Shell provides commands for managing Hive
metadata, such as creating, dropping, and altering tables and databases.

HiveServer2 interaction: Hive Shell can be used to connect to and interact with
HiveServer2, a remote Hive service that allows clients to execute HiveQL
statements and manage metadata.

Benefits of Using Hive Shell

Simplicity: Hive Shell is a simple and easy-to-use tool for executing HiveQL
statements and managing Hive metadata.

Flexibility: Hive Shell can be used in both interactive and batch modes, making it
suitable for a variety of use cases.

Maturity: Hive Shell is a mature tool that has been well-tested and is supported
by the Apache Hive community.

Hive Services
Apache Hive offers a comprehensive suite of services to facilitate data
warehousing and analytics in the Hadoop ecosystem. These services work
together to provide a seamless and efficient way to query, manage, and analyze
large datasets stored in distributed storage systems.

Core Hive Services


Hive Metastore: The Hive Metastore is a central repository that stores metadata
about Hive objects, such as tables, partitions, and UDFs (user-defined functions).
This metadata is essential for Hive to function properly and is used by various
Hive components, including the Hive Shell, HiveServer2, and Hive Web UI.

Hive Driver: The Hive Driver is the component responsible for receiving HiveQL
statements from clients and translating them into MapReduce jobs. It parses the
HiveQL statement, checks syntax and access permissions, and generates the
appropriate MapReduce job configuration.

Hive Compiler: The Hive Compiler transforms the HiveQL statement into an
execution plan, which is a series of MapReduce tasks that represent the steps
required to execute the query. The compiler optimizes the execution plan to
improve performance and takes into account factors such as data partitioning and
table schema.
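The plan the compiler produces can be inspected with the EXPLAIN statement, as
in the sketch below; the orders table and its columns are hypothetical.

-- Show the execution plan Hive generates for an aggregation query.
EXPLAIN
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id;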

Hive Execution Engine: The Hive Execution Engine executes the MapReduce jobs
generated by the Hive Compiler. It interacts with Hadoop to submit the jobs,
monitor their execution, and collect the results. The execution engine handles
various aspects of job execution, such as handling failures and managing task
dependencies.

Hive Server: Hive Server provides a remote service for clients to execute HiveQL
statements and manage Hive metadata. It acts as an intermediary between Hive
clients and the Hive Metastore, Hive Driver, and Hive Compiler. Clients can
connect to Hive Server using various protocols, such as JDBC/ODBC or Thrift.

Hive Web UI: The Hive Web UI provides a web-based interface for interacting
with Hive. It allows users to submit HiveQL statements, view query results,
browse Hive metadata, and manage Hive objects. The Hive Web UI is a
convenient tool for users who prefer a graphical interface.

Hive Metastore


Hive Metastore is a crucial component of the Apache Hive data warehouse
system. It serves as a central repository for storing metadata about Hive objects,
such as tables, partitions, and user-defined functions (UDFs). This metadata is
essential for Hive to function properly and is used by various Hive components,
including the Hive Shell, HiveServer2, and Hive Web UI.

Core Functions of Hive Metastore

Metadata Storage: The Hive Metastore stores metadata about Hive objects,
including table names, column names, data types, partitions, and UDFs.

Metadata Access: The Hive Metastore provides an interface for Hive components
and clients to access and retrieve metadata about Hive objects. This allows clients
to query and analyze data without having to know the details of the underlying
storage format or schema.

Metadata Management: The Hive Metastore provides operations for managing
Hive metadata, such as creating, dropping, altering, and querying tables,
partitions, and UDFs. This allows users to maintain the integrity and consistency
of Hive metadata (a few such commands are sketched at the end of this section).

Transaction Management: The Hive Metastore supports ACID (Atomicity,
Consistency, Isolation, Durability) transactions to ensure the integrity of metadata
updates. This is crucial for maintaining the consistency of Hive metadata across
multiple Hive clients and operations.

Security: The Hive Metastore supports authentication and authorization
mechanisms to control access to metadata. This ensures that only authorized
users can create, modify, or delete Hive objects.
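The metadata held in the Metastore is what answers everyday HiveQL commands
such as those sketched below; the sales table name is hypothetical.

SHOW DATABASES;
SHOW TABLES;
DESCRIBE FORMATTED sales;     -- columns, storage format, location, owner
SHOW PARTITIONS sales;        -- partition metadata, if the table is partitioned
SHOW FUNCTIONS;               -- built-in and registered user-defined functions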

Traditional Databases vs. Hive
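In brief, the commonly cited differences are:

• Schema: traditional databases enforce a schema on write (data is validated at
load time); Hive applies a schema on read (data is interpreted at query time).
• Workload: traditional databases target OLTP-style transactional workloads;
Hive targets batch-oriented analytical (OLAP) workloads over very large datasets.
• Latency: traditional databases answer small queries in milliseconds; Hive
translates queries into distributed jobs that typically take seconds to minutes.
• Data volume: traditional databases commonly manage gigabytes to terabytes;
Hive scales to petabytes of data stored in HDFS.
• Updates: traditional databases support fine-grained row-level updates; Hive is
primarily append-oriented, with row-level UPDATE and DELETE limited to
transactional (ACID) tables.
• Query language: traditional databases use SQL; Hive uses HiveQL, a SQL-like
dialect with extensions for partitioning and distributed storage.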


HiveQL
HiveQL (Hive Query Language) is a dialect of SQL used to query and manage data
stored in Apache Hive, a data warehouse system built on top of Apache Hadoop.
HiveQL is similar to standard SQL, but it also includes extensions to handle the
unique characteristics of Hive, such as its support for distributed storage and its
ability to process large datasets.

Key Features of HiveQL


• SQL-like syntax: HiveQL uses a syntax that is very similar to standard SQL,
making it easy for users familiar with SQL to learn and use.
• Data definition statements (DDL): HiveQL supports DDL statements for
creating, dropping, and altering Hive objects, such as tables, partitions, and
UDFs (user-defined functions).
• Data manipulation statements (DML): HiveQL supports DML statements
for querying, inserting, updating, and deleting data in Hive tables.
• Data control language (DCL): HiveQL supports DCL statements for granting
and revoking access privileges to Hive objects.
• Support for distributed data: HiveQL can handle data stored in a
distributed manner across multiple nodes, making it suitable for querying
and analyzing large datasets.
• Support for custom data types and UDFs: HiveQL allows users to define
their own custom data types and UDFs, providing flexibility for data
analysis.

Common HiveQL Statements

• SELECT: Used to retrieve data from Hive tables.
• WHERE: Filters the rows to be retrieved based on certain conditions.
• GROUP BY: Groups the retrieved rows based on specified columns.
• HAVING: Filters the groups of rows based on certain conditions.
• ORDER BY: Sorts the retrieved rows based on specified columns.
• INSERT: Inserts data into a Hive table.
• UPDATE: Updates existing data in a Hive table (supported on transactional,
i.e. ACID, tables).
• DELETE: Removes rows from a Hive table (supported on transactional, i.e.
ACID, tables).
• CREATE TABLE: Creates a new Hive table.
• ALTER TABLE: Modifies an existing Hive table.
• DROP TABLE: Drops an existing Hive table (see the sketch after this list).
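The sketch below strings several of these statements together; all names are
hypothetical, and the UPDATE and DELETE statements assume the table was
created as a transactional (ACID) table.

-- Hypothetical table; 'transactional' = 'true' enables UPDATE and DELETE.
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

INSERT INTO employees VALUES (1, 'Asha', 55000.0), (2, 'Ravi', 61000.0);

UPDATE employees SET salary = salary * 1.10 WHERE id = 1;

DELETE FROM employees WHERE id = 2;

ALTER TABLE employees ADD COLUMNS (department STRING);

DROP TABLE IF EXISTS employees;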

Hive Tables
Hive tables are the fundamental data storage units in Apache Hive, a data
warehouse system built on top of Apache Hadoop. Hive tables are used to
organize and store large datasets in a structured manner, making them suitable
for querying and analyzing data using HiveQL, a dialect of SQL.

Types of Hive Tables


• Managed Tables: Managed tables are the default type of Hive tables. Hive
manages the metadata and data files associated with these tables.
• External Tables: External tables reference data files that are stored outside
of Hive's control. Hive only manages the metadata associated with these
tables (a DDL sketch contrasting managed and external tables appears after
this list).
• Index Tables: Index tables provide faster access to data by maintaining
indexes for specific columns.
• Materialized Views: Materialized views are pre-computed summaries of
data from other tables, enabling faster querying and analysis.
• Virtual Views: Virtual views are dynamically generated views that do not
store their own data but instead refer to the data from other tables.
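A minimal DDL sketch contrasting managed and external tables follows; the table
names, columns, and HDFS location are hypothetical.

-- Managed table: Hive owns both the metadata and the data files;
-- dropping it deletes the data as well.
CREATE TABLE managed_sales (
  sale_id BIGINT,
  amount DOUBLE
)
STORED AS ORC;

-- External table: Hive owns only the metadata; dropping it leaves the
-- files at the given location untouched (path is hypothetical).
CREATE EXTERNAL TABLE external_sales (
  sale_id BIGINT,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/sales';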

Benefits of Using Hive Tables


• Structured Data Organization
• Scalability
• Query Performance Optimization
• Integration with HiveQL
• Integration with Hadoop Ecosystem

Querying Data and User Defined Functions


Querying data in Hive is primarily done using HiveQL (Hive Query Language), a
dialect of SQL that is specifically designed for interacting with Hive data stored in
distributed storage systems like HDFS (Hadoop Distributed File System). HiveQL
provides a familiar and powerful way to retrieve, manipulate, and analyze data
stored in Hive tables.
A basic HiveQL query typically consists of the following clauses:

• SELECT: Specifies the columns to be retrieved from the table(s).
• FROM: Identifies the table(s) from which data should be retrieved.
• WHERE: Filters the rows to be retrieved based on specific conditions.
• ORDER BY: Sorts the retrieved rows based on specified columns.
• LIMIT: Limits the number of rows to be retrieved.
Example
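A minimal sketch combining these clauses; the table and column names are
hypothetical.

-- Top five customers by total spend in 2024 (all names are hypothetical).
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_year = 2024
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 5;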

User-Defined Functions (UDFs) in Hive

Hive provides a mechanism for users to define their own custom functions, known
as UDFs (user-defined functions). UDFs extend the capabilities of HiveQL by
allowing users to implement complex logic or specialized operations that are not
built into HiveQL.

Creating and Using UDFs

UDFs can be written in various programming languages, such as Java, Python, or
Scala. Java and Scala UDF code is compiled and packaged into a JAR file that is
then registered with Hive (Python logic is typically invoked through Hive's
TRANSFORM streaming mechanism). Once registered, the UDF can be used in
HiveQL queries like any other built-in function.

Example
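A sketch of registering and calling a Java UDF from HiveQL; the JAR path, class
name, and function name are all hypothetical.

-- Register a hypothetical UDF packaged in a JAR and use it like a built-in.
ADD JAR /tmp/udfs/string-utils.jar;

CREATE TEMPORARY FUNCTION mask_email
AS 'com.example.hive.udf.MaskEmail';      -- hypothetical Java class

SELECT user_id,
       mask_email(email) AS masked_email  -- UDF called like any built-in
FROM users;                               -- hypothetical table
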
Benefits of Using UDFs

• Extensibility: UDFs allow users to extend HiveQL's capabilities to
implement complex logic or specialized operations.
• Reusability: UDFs can be written once and reused in multiple queries and
applications.
• Encapsulation: UDFs encapsulate complex logic, making queries more
concise and easier to understand.
