Setting up Apache Kafka on an Amazon EC2 instance is a powerful way to handle real-time data streaming. This guide walks you through installing Kafka on an EC2 instance, configuring Zookeeper, a producer, and a consumer, and streaming consumer logs to an S3 bucket. We'll then use AWS Glue to create a data catalog and run SQL queries with Amazon Athena to analyze the data.

Prerequisites:
- An AWS account
- An EC2 instance (t2.medium recommended)
- SSH access to your EC2 instance
- Basic knowledge of Kafka and AWS services
Steps:

- Create an EC2 Instance:
  - Type: t2.medium
  - OS: Amazon Linux 2 AMI
  - Security Group: allow inbound traffic on ports 22 (SSH), 2181 (Zookeeper), and 9092 (Kafka)
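If you prefer to script the security group rules instead of using the console, the minimal boto3 sketch below shows one way to open the three ports. The security group ID and region are placeholders, and opening 2181 and 9092 to 0.0.0.0/0 is for demonstration only; restrict the CIDR ranges for anything beyond a test setup.

```python
import boto3

# Placeholder region and security group ID: replace with your own values.
ec2 = boto3.client('ec2', region_name='ap-south-1')

# Open SSH (22), Zookeeper (2181), and Kafka (9092). For a real deployment,
# replace 0.0.0.0/0 with the CIDR ranges that actually need access.
for port in (22, 2181, 9092):
    ec2.authorize_security_group_ingress(
        GroupId='sg-0123456789abcdef0',  # placeholder ID of the instance's security group
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': port,
            'ToPort': port,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
        }],
    )
```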
- SSH into Your EC2 Instance:
  ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
- Install Java:
  sudo yum install java-1.8.0-openjdk -y
  java -version
- Download and Extract Kafka:
  wget https://archive.apache.org/dist/kafka/3.3.1/kafka_2.12-3.3.1.tgz
  tar -xvf kafka_2.12-3.3.1.tgz
  cd kafka_2.12-3.3.1
- Start Zookeeper:
  bin/zookeeper-server-start.sh config/zookeeper.properties
- Open a New SSH Session and Connect to EC2 Instance:
  ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
- Start Kafka Server:
  export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M"
  cd kafka_2.12-3.3.1
  bin/kafka-server-start.sh config/server.properties
- Configure Kafka for Public Access:
  - Edit server.properties:
    sudo nano config/server.properties
  - Update advertised.listeners to your EC2 instance's public IP:
    advertised.listeners=PLAINTEXT://your-ec2-public-ip:9092
  - Restart the Kafka server so the new listener configuration takes effect.
- Open a New SSH Session and Connect to EC2 Instance:
  ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
- Create the Topic:
  cd kafka_2.12-3.3.1
  bin/kafka-topics.sh --create --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092 --replication-factor 1 --partitions 1
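The same topic can also be created programmatically with kafka-python's admin client, which is handy if you script the whole setup. A minimal sketch, using the placeholder broker address and the topic name from the command above:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the broker (replace the placeholder with your EC2 public IP).
admin = KafkaAdminClient(bootstrap_servers='your-ec2-public-ip:9092')

# One partition, replication factor 1: the same settings as the
# kafka-topics.sh command above.
admin.create_topics([
    NewTopic(name='SAPDEMOTESTING', num_partitions=1, replication_factor=1)
])

admin.close()
```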
- Open a New SSH Session and Connect to EC2 Instance:
  ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
- Start Kafka Producer:
  cd kafka_2.12-3.3.1
  bin/kafka-console-producer.sh --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092
- Open a New SSH Session and Connect to EC2 Instance:
  ssh -i "path/to/your-key.pem" ec2-user@your-ec2-public-ip
- Start Kafka Consumer:
  cd kafka_2.12-3.3.1
  bin/kafka-console-consumer.sh --topic SAPDEMOTESTING --bootstrap-server your-ec2-public-ip:9092
- Configure the Consumer to Stream Logs to S3:
  - Create an S3 bucket and grant the consumer permission to write to it (for example, via an IAM role attached to the EC2 instance).
  - Use a script or application to redirect consumer output to the S3 bucket; the Python consumer script shown later in this guide does exactly this.
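Creating the bucket can also be scripted. Below is a minimal boto3 sketch; the ap-south-1 region is an assumed placeholder, and since S3 bucket names are globally unique you will need to substitute your own name for your-s3-bucket (the same placeholder used in the scripts later in this guide).

```python
import boto3

# Placeholder region and bucket name: replace with your own values.
s3 = boto3.client('s3', region_name='ap-south-1')

s3.create_bucket(
    Bucket='your-s3-bucket',
    # Outside us-east-1, S3 requires an explicit LocationConstraint.
    CreateBucketConfiguration={'LocationConstraint': 'ap-south-1'},
)
```

The identity running the consumer also needs s3:PutObject on the bucket, typically granted through an IAM role or user policy.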
- Set Up an AWS Glue Crawler:
  - Configure the crawler to crawl your S3 bucket and create a data catalog; a boto3 script for this appears later in this guide.
- Run the Crawler:
  - Start the crawler, then verify that it has created tables in the Glue Data Catalog.
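The crawler can also be started and monitored from boto3. A minimal sketch, assuming the crawler and database names (kafka-crawler, kafka-db) used in the scripts later in this guide:

```python
import time
import boto3

glue = boto3.client('glue')

# Start the crawler created earlier.
glue.start_crawler(Name='kafka-crawler')

# Wait until the crawler returns to the READY state.
while glue.get_crawler(Name='kafka-crawler')['Crawler']['State'] != 'READY':
    time.sleep(30)

# List the tables the crawler registered in the Glue database.
tables = glue.get_tables(DatabaseName='kafka-db')
print([t['Name'] for t in tables['TableList']])
```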
- Configure Athena to Use the Glue Data Catalog:
  - In the Athena console, set the Glue Data Catalog as the data source and choose an S3 location for query results.
- Run SQL Queries:
  - Use SQL queries in Athena to analyze the Kafka consumer logs stored in S3.
The repository KAFKA-PYTHON-AWS-CRAWLER-AMAZON-ATHENA contains scripts and configurations that automate the steps mentioned above. Here’s an explanation of the key files and their content:
This script sets up a Kafka producer that sends data to the specified Kafka topic.
from kafka import KafkaProducer
from json import dumps
import time

# Connect to the Kafka broker (replace with your EC2 instance's public IP)
# and serialize each message as UTF-8 encoded JSON.
producer = KafkaProducer(
    bootstrap_servers=['13.201.13.205:9092'],
    value_serializer=lambda x: dumps(x).encode('utf-8')
)

# Send one small JSON message per second to the SAPDEMOTESTING topic.
for j in range(999):
    data = {'number': j}
    producer.send('SAPDEMOTESTING', value=data)
    time.sleep(1)

# Make sure any buffered messages are delivered before the script exits.
producer.flush()
This script sets up a Kafka consumer that reads data from a specified Kafka topic and writes it to an S3 bucket.
from kafka import KafkaConsumer
from json import loads, dumps
import boto3

s3 = boto3.client('s3')

# Connect to the Kafka broker (replace with your EC2 instance's public IP)
# and deserialize each message from UTF-8 encoded JSON.
consumer = KafkaConsumer(
    'SAPDEMOTESTING',
    bootstrap_servers=['13.201.13.205:9092'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: loads(x.decode('utf-8'))
)

# Write each consumed message to S3 as its own JSON object.
for message in consumer:
    record = message.value
    s3.put_object(
        Bucket='your-s3-bucket',
        Key=f'{record["number"]}.json',
        Body=dumps(record)
    )
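Writing one S3 object per message is fine for a demo, but at higher volumes you may prefer to batch records before uploading. The sketch below shows one possible approach; the batch size, object key scheme, and consumer group name are arbitrary illustrative choices, not part of the original repository.

```python
import json
import time
import boto3
from kafka import KafkaConsumer

s3 = boto3.client('s3')

consumer = KafkaConsumer(
    'SAPDEMOTESTING',
    bootstrap_servers=['13.201.13.205:9092'],
    auto_offset_reset='earliest',
    group_id='my-group-batched',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= 100:
        # Upload the batch as newline-delimited JSON, which Glue and Athena
        # handle well, then start a new batch.
        s3.put_object(
            Bucket='your-s3-bucket',
            Key=f'batch-{int(time.time())}.json',
            Body='\n'.join(json.dumps(record) for record in batch)
        )
        batch = []
```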
This script sets up an AWS Glue Crawler that creates a data catalog from the files stored in an S3 bucket.
import boto3

client = boto3.client('glue')

# Create a crawler that scans the S3 bucket of Kafka messages once an hour
# (at 15 minutes past) and registers the schema in the 'kafka-db' database.
response = client.create_crawler(
    Name='kafka-crawler',
    Role='your-iam-role',
    DatabaseName='kafka-db',
    Description='Crawler for Kafka logs in S3',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://your-s3-bucket/'
            },
        ]
    },
    Schedule='cron(15 * * * ? *)'
)
This script uses AWS Athena to run SQL queries on the data catalog created by AWS Glue.
import boto3

client = boto3.client('athena')

# Run a query against the table the Glue crawler created in 'kafka-db';
# Athena writes the results to the S3 output location below.
response = client.start_query_execution(
    QueryString='SELECT * FROM kafka_logs',
    QueryExecutionContext={
        'Database': 'kafka-db'
    },
    ResultConfiguration={
        'OutputLocation': 's3://your-athena-query-results/'
    }
)
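start_query_execution returns immediately, so to read the results in Python you wait for the query to reach a terminal state and then fetch the rows. A minimal sketch that continues from the script above (it reuses its client and response objects):

```python
import time

# 'response' and 'client' come from the start_query_execution script above.
query_id = response['QueryExecutionId']

# Poll until Athena reports a terminal state for the query.
state = 'QUEUED'
while state in ('QUEUED', 'RUNNING'):
    status = client.get_query_execution(QueryExecutionId=query_id)
    state = status['QueryExecution']['Status']['State']
    time.sleep(2)

# Print the result rows; the first row contains the column headers.
if state == 'SUCCEEDED':
    results = client.get_query_results(QueryExecutionId=query_id)
    for row in results['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])
```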
By following these steps and utilizing the provided scripts, you can set up a Kafka environment on an Amazon EC2 instance, stream logs to S3, create a data catalog with AWS Glue, and query the data with Amazon Athena. This setup enables real-time data processing and powerful data analytics capabilities, leveraging the scalability and flexibility of AWS services.