
Eti Appa

The micro-project report focuses on the use of Apache Hive for real-time queries and analytics in big data, specifically aimed at predicting credit card application approval statuses. It outlines the project's aims, methodology, and the skills developed, emphasizing the importance of real-time data analysis and the integration of big data technologies. The report concludes that the project successfully demonstrates the capabilities of Apache Hive for efficient, near real-time analytics, providing valuable insights for businesses.

SHREEYASH PRATISHTHAN’S

SHREEYASH COLLEGE OF ENGINEERING AND TECHNOLOGY (POLYTECHNIC),


CHH. SAMBHAJINAGAR

MICRO-PROJECT REPORT

NAME OF DEPARTMENT:- ARTIFICIAL INTELLIGENCE & MACHINE LEARNING


ACADEMIC YEAR:- 2024-25
SEMESTER:- SIXTH
COURSE NAME:- BIG DATA ANALYTICS
COURSE CODE:- 22684
MICRO-PROJECT TITLE:- APACHE HIVE FOR REAL-TIME QUERIES AND ANALYTICS

PREPARED BY:-
1) SUSHANT DUDHMAL EN. NO.2210920385

UNDER THE GUIDANCE OF:- Prof. P. N. GOPALE

MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION, MUMBAI


CERTIFICATE
This is to certify that Mr./Ms. Sushant Daulat Dudhmal of the 6th Semester of the Diploma in
Artificial Intelligence & Machine Learning of the institute Shreeyash College of Engineering &
Technology, Chh. Sambhajinagar, has successfully completed the Micro-Project work in the course
Big Data Analytics for the academic year 2024-25, as prescribed in the I-Scheme Curriculum.

Date:-                                Enrollment No:- 2210920385
Place:- CHH. SAMBHAJINAGAR            Exam Seat No.:-

Signature          Signature          Signature
Guide              HOD                Principal

Seal of Institute

ACKNOWLEDGEMENT
We wish to express our profound gratitude to our guide, Prof. P. N. GOPALE, who guided us
tirelessly in framing and completing this Micro-Project. He/She guided us on all the main
points of the Micro-Project. We are indebted to his/her constant encouragement, cooperation
and help. It was his/her enthusiastic support that helped us overcome various obstacles in
the Micro-Project.
We are also thankful to our Principal, HOD, Faculty Members and classmates for extending
their support and motivation in the completion of this Micro-Project.

Annexure-1
Micro-Project Proposal

Title of Micro-Project:-
APACHE HIVE FOR REAL-TIME QUERIES AND ANALYTICS
1.0 Aims/Benefits of the Micro-Project
The aim of this microproject is to develop a predictive analytics system using Big Data technologies
and machine learning algorithms to accurately forecast the approval status of credit card
applications. By analyzing large-scale applicant data, the project seeks to automate decision-
making, improve accuracy, and assist financial institutions in minimizing risk and enhancing
operational efficiency.

2.0 Course Outcomes Addressed


a. Describe Big data and Big Data Analytics.
b. Apply the Big data Analytics procedure to work on datasets.
c. Describe Hadoop Distributed File System.
d. Analyze structured data using HIVE.

3.0 Proposed Methodology

• Data Collection:
  Gather historical credit card application data with relevant features like income, age, credit
  score, and approval status.
• Data Preprocessing:
  Clean and transform the data by handling missing values, encoding categorical variables, and
  normalizing numerical features.
• Big Data Integration:
  Store and process the data using Big Data tools like Apache Spark or Hadoop for scalability and
  efficiency.
• Model Development:
  Apply machine learning algorithms (e.g., Logistic Regression, Decision Tree, Random Forest)
  to build a predictive model.
• Model Evaluation:
  Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
• Prediction:
  Use the trained model to predict credit card approval status for new applicants.
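
The evaluation metrics named in the methodology can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch in plain Python; the function name `classification_metrics` and the sample labels are hypothetical and are not taken from the project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = approved, 0 = rejected.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when approvals and rejections are unbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.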

4.0 Action Plan

Sr. No. | Week  | Details of activity | Planned start date | Planned finish date
1  | 1 & 2 | Discussion & finalization of topic | 02/02/2025 | 05/02/2025
2  | 3     | Preparation of the abstract | 05/02/2025 | 06/02/2025
3  | 4     | Literature review | 06/02/2025 | 08/02/2025
4  | 5     | Submission of Micro-Project proposal (Annexure-I) | 09/02/2025 | 10/02/2025
5  | 6     | Collection of information about the topic | 11/02/2025 | 13/02/2025
6  | 7     | Collection of relevant content / materials for the execution of the Micro-Project | 14/02/2025 | 16/02/2025
7  | 8     | Discussion and submission of outline of the Micro-Project | 17/02/2025 | 19/02/2025
8  | 9     | Analysis / execution of collected data / information and preparation of prototypes / drawings / photos / charts / graphs / tables / circuits / models / programs etc. | 20/02/2025 | 25/02/2025
9  | 10    | Completion of contents of project report | 26/02/2025 | 03/02/2025
10 | 11    | Completion of weekly progress report | 01/03/2025 | 05/03/2025
11 | 12    | Completion of project report (Annexure-II) | 06/03/2025 | 10/03/2025
12 | 13    | Viva voce / delivery of presentation | |

5.0 Resources Required

Sr. No. | Name of Resources / Materials | Specification | Qty | Remarks
1 | Computer | RAM minimum 4 GB, i5 7th gen | 1 |
2 | Operating system | Windows 10 | 1 |
3 | Internet | Google | 1 |

Names of Team Members with En. Nos.


1. SUSHANT DUDHMAL EN.NO:- 2210920385

Annexure-II

Micro-Project Report

Title of Micro-Project:- APACHE HIVE FOR REAL-TIME QUERIES AND ANALYTICS

1.0 Rationale

In the age of big data, organizations are increasingly seeking ways to make timely, data-driven
decisions. Traditional batch-processing systems often introduce delays that hinder responsiveness. This
microproject aims to bridge that gap by leveraging Apache Hive to perform near real-time analytics
on incoming data streams.

Apache Hive, though originally designed for batch processing, has evolved to support faster querying
through partitioning, bucketing, and integration with newer data formats like ORC and Parquet. By
simulating real-time data ingestion and applying optimized query techniques, Hive can be transformed
into a powerful engine for time-sensitive analytics on large-scale datasets.
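
To make the role of partitioning, bucketing and the ORC format concrete, the sketch below builds a HiveQL `CREATE TABLE` statement as a Python string. It is an illustration under assumptions: the table name `sales`, the column names, the partition column `sale_date` and the bucket count are all hypothetical and do not come from the project's own scripts.

```python
def build_sales_table_ddl(table="sales", buckets=8):
    """Build HiveQL for a partitioned, bucketed ORC table (names are hypothetical)."""
    columns = [
        ("order_id", "STRING"),
        ("product_id", "STRING"),
        ("product_name", "STRING"),
        ("quantity", "INT"),
        ("unit_price", "DOUBLE"),
        ("order_ts", "TIMESTAMP"),
    ]
    column_sql = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{column_sql}\n)\n"
        # One directory per day, so a date filter can skip other days.
        "PARTITIONED BY (sale_date STRING)\n"
        # Rows are hashed on product_id into a fixed number of files.
        f"CLUSTERED BY (product_id) INTO {buckets} BUCKETS\n"
        # Columnar storage format.
        "STORED AS ORC"
    )


ddl = build_sales_table_ddl()
print(ddl)
```

The partition column is declared separately from the regular columns because Hive stores its value in the directory name rather than inside the data files.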

2.0 Aims/Benefits of the Micro-Project:-

• Provides near real-time insights.
• Handles large-scale data efficiently.
• Fast queries using partitioned tables.
• Cost-effective and scalable.
• Offers hands-on experience with big data tools.
• Can be integrated with other analytics platforms.

3.0 Course Outcomes Achieved

• CO1: Understand Apache Hive and its role in big data.
• CO2: Create and manage Hive tables for structured data.
• CO3: Use partitioning to speed up queries.
• CO4: Perform near real-time data analysis with HiveQL.
• CO5: Analyze and interpret sales trends from data.
• CO6: Apply emerging tech to real-world analytics problems.

4.0 Literature Review:-

Big data technologies have transformed the way organizations handle and analyze vast volumes of
structured and unstructured data. Apache Hive, developed at Facebook and open-sourced in 2008, was
created to simplify querying large datasets stored in the Hadoop Distributed File System (HDFS).
Hive originally translated SQL-like queries (HiveQL) into MapReduce jobs, and later releases also
support engines such as Apache Tez, so analysts familiar with SQL can query big data without
writing low-level programs.

Over the years, researchers and practitioners have explored Hive's capabilities for improving
performance in analytical workloads. Studies show that data partitioning, bucketing, and the use of
optimized file formats like ORC and Parquet can significantly reduce query latency. While Hive is
traditionally batch-oriented, recent developments like Hive LLAP (Live Long and Process) have
aimed to reduce query execution time, enabling near real-time analytical capabilities.
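
The latency benefit of partitioning comes from reading fewer directories. The sketch below imitates that effect in plain Python; it is not Hive's query planner, and the warehouse path `/warehouse/sales` and the partition column `sale_date` are assumptions chosen for illustration. Only the `column=value` directory naming follows Hive's actual layout.

```python
from datetime import date, timedelta


def partition_dirs(table_root, start, days):
    """List Hive-style partition directories, one per day."""
    return [
        f"{table_root}/sale_date={(start + timedelta(days=i)).isoformat()}"
        for i in range(days)
    ]


def prune(dirs, wanted_date):
    """Keep only the directories whose partition value matches the filter."""
    suffix = f"sale_date={wanted_date.isoformat()}"
    return [d for d in dirs if d.endswith(suffix)]


# Hypothetical table holding 28 daily partitions.
all_dirs = partition_dirs("/warehouse/sales", date(2025, 2, 1), 28)
# A query filtered on one day only needs to read one of them.
scanned = prune(all_dirs, date(2025, 2, 20))
print(f"{len(scanned)} of {len(all_dirs)} partitions scanned: {scanned}")
```

A query that does not filter on the partition column gets no such benefit, which is why the partition key should match the most common filter, here the sale date.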

Literature also highlights the importance of integrating Hive with tools like Apache Flume, Kafka,
and Spark for streaming data processing. These integrations allow Hive to work as part of a hybrid
architecture, where real-time data is ingested, stored, and analyzed quickly.

In the context of business applications, Hive is used in e-commerce, telecommunications, and
financial services for near real-time reporting on transactions, sales, and customer behavior.
This makes it a useful component for building scalable and responsive analytics platforms.

5.0 Actual Methodology Followed

• Problem Definition: Identified the need for near real-time analysis of product sales data.
• Data Generation: Simulated sales data using a Python script (order_id, product details, timestamp).
• Data Ingestion: Ingested data into HDFS using scripts to simulate real-time data flow.
• Hive Table Creation: Created external and partitioned tables in Hive for efficient data storage.
• Data Transformation: Cleaned and loaded data into partitioned tables using HiveQL queries.
• Query Execution: Performed near real-time analytical queries (e.g., total sales, top products).
• Visualization (Optional): Used Apache Superset for real-time data visualization.
• Testing: Evaluated query performance and data processing efficiency.
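
The data generation step can be sketched as follows. This is not the project's actual script, which is not reproduced in the report; the product catalogue, the field names beyond `order_id` and the timestamp, and the fixed random seed are assumptions made so the example is reproducible.

```python
import csv
import io
import random
from datetime import datetime, timedelta

# Hypothetical product catalogue: (product_id, product_name, unit_price).
PRODUCTS = [
    ("P001", "Keyboard", 799.0),
    ("P002", "Mouse", 399.0),
    ("P003", "Monitor", 8999.0),
]
FIELDS = ["order_id", "product_id", "product_name",
          "quantity", "unit_price", "order_ts"]


def generate_orders(n, day_start, seed=42):
    """Create n simulated orders with timestamps inside one day."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        product_id, name, price = rng.choice(PRODUCTS)
        ts = day_start + timedelta(seconds=rng.randint(0, 24 * 3600 - 1))
        rows.append({
            "order_id": f"ORD{i + 1:06d}",
            "product_id": product_id,
            "product_name": name,
            "quantity": rng.randint(1, 5),
            "unit_price": price,
            "order_ts": ts.strftime("%Y-%m-%d %H:%M:%S"),
        })
    return rows


def to_csv(rows):
    """Serialize the orders as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


orders = generate_orders(5, datetime(2025, 2, 20))
csv_text = to_csv(orders)
print(csv_text)
```

Writing one such batch at a regular interval and copying each file into HDFS, for example with `hdfs dfs -put`, is one way to imitate a continuous data flow without a streaming platform.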

6.0 Actual Resources Used

Sr. No. | Name of Resources / Materials | Specification | Qty | Remarks
1 | Computer | RAM minimum 4 GB, i5 7th gen | 1 |
2 | Operating system | Windows 10 | 1 |
3 | Internet | Google | 1 |

7.0 Outputs of the Micro-Project

• Near real-time sales analytics (total sales, top products, hourly trends).
• Optimized Hive queries for fast data processing.
• Simulated data ingested periodically into HDFS.
• HiveQL scripts for data transformation and querying.
• Real-time dashboard for visualization (optional).
• Performance metrics for query response and data processing.
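
The three analytics named in the outputs (total sales, top products, hourly trends) are simple aggregations. The sketch below computes them in plain Python over a handful of hypothetical rows so the logic can be checked by hand. The HiveQL string shows what an equivalent top-products query might look like against an assumed `sales` table; it is illustrative and is not executed here.

```python
from collections import defaultdict

# Hypothetical rows: (product_name, quantity, unit_price, order_ts).
SALES = [
    ("Keyboard", 2, 800.0, "2025-02-20 09:15:00"),
    ("Mouse", 1, 400.0, "2025-02-20 09:40:00"),
    ("Monitor", 1, 9000.0, "2025-02-20 10:05:00"),
    ("Mouse", 3, 400.0, "2025-02-20 10:30:00"),
    ("Keyboard", 1, 800.0, "2025-02-20 11:45:00"),
]

# Assumed HiveQL counterpart of top_products(); table and columns are hypothetical.
TOP_PRODUCTS_HIVEQL = """
SELECT product_name, SUM(quantity * unit_price) AS revenue
FROM sales
WHERE sale_date = '2025-02-20'
GROUP BY product_name
ORDER BY revenue DESC
LIMIT 2
"""


def total_sales(rows):
    """Sum of quantity * unit_price over all rows."""
    return sum(qty * price for _, qty, price, _ in rows)


def top_products(rows, n=2):
    """The n products with the highest revenue, best first."""
    revenue = defaultdict(float)
    for name, qty, price, _ in rows:
        revenue[name] += qty * price
    return sorted(revenue.items(), key=lambda kv: kv[1], reverse=True)[:n]


def hourly_trend(rows):
    """Revenue per hour of day, keyed by the hour parsed from the timestamp."""
    per_hour = defaultdict(float)
    for _, qty, price, ts in rows:
        per_hour[int(ts[11:13])] += qty * price
    return dict(sorted(per_hour.items()))


print(total_sales(SALES))
print(top_products(SALES))
print(hourly_trend(SALES))
```

Filtering on the partition column first, as the `WHERE sale_date = ...` clause does, keeps each of these aggregations limited to a single day's data.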

8.0 Skill Developed/Learning outcome of this Micro-Project

• Apache Hive Proficiency
  - Hands-on experience in creating tables, partitioning, and querying data in Hive.
• Big Data Management
  - Ability to manage large datasets in HDFS and optimize storage and querying with Hive.
• Data Ingestion & ETL
  - Skills in data ingestion from various sources and ETL (Extract, Transform, Load) processes
    using HiveQL.
• Real-Time Analytics
  - Practical knowledge of setting up near real-time data analysis using Hive, with partitioning for
    time-sensitive queries.

9.0 Applications of this Micro-Project:-

• Real-Time Sales Analytics
  - Used by e-commerce and retail businesses to analyze and track product sales in near real-time,
    enabling faster decision-making.
• Customer Behavior Analysis
  - Helps businesses understand buying patterns and trends, improving marketing strategies and
    customer targeting.
• Inventory Management
  - Facilitates monitoring of inventory levels in near real-time, helping businesses manage stock
    and avoid shortages or overstocking.

Conclusion

This microproject demonstrates the potential of Apache Hive for near real-time analytics on
large-scale datasets. By simulating a continuous stream of product sales data, we were able to
run time-sensitive queries efficiently and derive actionable business insights. Partitioning and
query optimization in Hive allowed faster data processing, showing that Hive can handle large
volumes of data while still providing near real-time insights.

Through this project, we developed essential skills in big data management, data ingestion, query
optimization, and near real-time analytics. The system built here is applicable across industries
like e-commerce, retail, and finance, where quick access to data insights can drive business
decisions. Ultimately, the project highlights how technologies like Apache Hive can be adapted for
modern, near real-time data analysis, creating value for businesses looking to stay agile and
data-driven.


Annexure-IV
MICRO-PROJECT EVALUATION SHEET

Name of Student:- SUSHANT DUDHMAL    En. No.:- 2210920385
Name of Program:- Artificial Intelligence & Machine Learning
Semester:- 6TH
Course Name:- BIG DATA ANALYTICS
Course Code:- 22684
Title of The Micro-Project:- APACHE HIVE FOR REAL-TIME QUERIES AND ANALYTICS
Course Outcomes Achieved:-
a. Describe Big data and Big Data Analytics.
b. Apply the Big data Analytics procedure to work on datasets.
c. Describe Hadoop Distributed File System.
d. Analyze structured data using HIVE.

Sr. No. | Characteristic to be assessed | Poor (Marks 1-3) | Average (Marks 4-5) | Good (Marks 6-8) | Excellent (Marks 9-10) | Sub Total
(A) Process and Product Assessment (convert below total marks out of 6 marks)
1 | Relevance to the course | | | | |
2 | Literature review / information collection | | | | |
3 | Completion of the target as per project proposal | | | | |
4 | Analysis of data and representation | | | | |
5 | Quality of prototype / model | | | | |
6 | Report preparation | | | | |
(B) Individual Presentation / Viva (convert below total marks out of 4 marks)
7 | Presentation | | | | |
8 | Viva | | | | |

(A) Process and Product Assessment (6 marks) | (B) Individual Presentation / Viva (4 marks) | Total Marks (10)
 | |

Comments/Suggestions about team work/leadership/inter-personal communication

Name of Course Teacher:-

Dated Signature:-
