0% found this document useful (0 votes)

11 views27 pages

AWS Test Data Services Guide

Uploaded by

srishti Rai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

11 views27 pages

AWS Test Data Services Guide

Uploaded by

srishti Rai

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

AWS Services for Test Data Management – Easy Guide

Amazon S3
This is a storage service where we can keep all kinds of files — like CSVs, logs, test data, or
even masked data. It’s very commonly used for storing test datasets and synthetic data. It
supports version control, automatic deletion rules (lifecycle policies), and encryption to
keep data safe.

AWS Glue
Glue is used to clean, transform, and prepare data. You can write ETL (Extract, Transform,
Load) jobs in PySpark to do tasks like masking sensitive data, creating synthetic test data, or
reformatting files. It also has Crawlers to detect schema automatically and a Data Catalog to
keep metadata.

Amazon RDS / Aurora

These are managed SQL-based databases. They're helpful when you want to create test
environments using real-like data. You can restore a snapshot from production, apply
masking or anonymization, and then use it for testing. Aurora is just a more advanced and
faster version of RDS.

Amazon Redshift
This is a managed data warehouse where large-scale test datasets can be loaded and used
for validating analytical queries or complex transformations. It’s commonly used for testing
data pipelines that involve huge volumes of data.

AWS Glue Workflows

This helps in chaining multiple Glue jobs and Crawlers together, so they run in a specific
order. For example, you can create a workflow that first generates test data, then masks it,
then uploads it to S3 — all automated.

AWS Step Functions

Step Functions let you build workflows using different AWS services. You can combine
Lambda, Glue, and other steps into one flow. It’s useful when test data provisioning involves
multiple stages or checks.

AWS Lambda
Lambda is a serverless way to run code without worrying about servers. You can use it for
small tasks like generating a few rows of test data, sending email alerts, or moving files
between locations — very useful for test data automation.
Amazon MWAA (Managed Airflow)
This is a managed version of Airflow, which is used for scheduling and managing data
pipelines. If your team already uses Airflow to orchestrate workflows, you might use MWAA
to manage test data refresh pipelines.

Amazon SNS
SNS is used to send notifications. For example, after test data has been refreshed, you can
send an email or message to let the team know it's ready.

Amazon SQS
SQS is used when you want to manage tasks in a queue. For instance, if you're running
multiple test data generation jobs, you can queue them using SQS so they don't all run at
once.

Amazon EventBridge
This is used to trigger events based on schedules or actions. For example, you can schedule
a test data refresh every night at 2 AM, or trigger a masking job when new data is uploaded
to S3.

AWS KMS
KMS is used to encrypt data at rest — especially sensitive data, even in test environments. It
ensures that test datasets remain secure.

IAM
IAM controls who can access what. It’s important to make sure only the right users and
services can access test data, especially if it contains real or sensitive information.

Amazon CloudWatch
CloudWatch monitors jobs and logs. If something goes wrong with a test data script or ETL
pipeline, you can use CloudWatch to check logs and set up alerts to notify the team.

AWS Config
This service tracks configurations of AWS resources. In test environments, it can help
ensure everything is set up securely and according to company policy.

Amazon Athena
Athena lets you run SQL queries directly on files in S3, like CSV or Parquet. This is super
useful when you want to validate or explore test data quickly without setting up a database.

AWS DMS
DMS helps move data from production to test environments. You can also use it to apply
transformations — like masking or filtering out PII — during the migration.
CloudFormation / Terraform
These are Infrastructure-as-Code tools. You can use them to automatically create test
environments (buckets, databases, roles, etc.) so everything is consistent and easy to
manage or recreate.

Absolutely, love! Let’s break this down into two parts:

✅ Your Relevant AWS & Data Engineering Skills

🔹 AWS Services (You’ve worked with or are relevant to your current project):

Service Function / Use Case

Serverless ETL tool; used for data extraction, transformation, and

AWS Glue loading. You wrote Python scripts and used it to cleanse/transform
data into S3.

Object storage service; stores raw/processed data. Your data lands

Amazon S3
here from Glue jobs.

AWS Step Orchestration service; used to coordinate Glue jobs and manage
Functions workflow sequences. Helps maintain SLAs.

Monitoring/logging tool; tracks job metrics, sets alarms, and logs

Amazon
Glue job runs. You used this for pipeline monitoring and
CloudWatch
troubleshooting.

Serverless query engine to analyze data directly in S3 using SQL.

AWS Athena
Great for quick validations/debugging.

Serverless compute; runs small pieces of code triggered by events

AWS Lambda (e.g., S3 file upload, Glue trigger). You may have used it for
lightweight tasks or data checks.

NoSQL database; used for fast, key-value lookups. May be used to

DynamoDB (if
store job metadata, status tracking, or audit info in real-time
relevant)
pipelines. (Optional depending on exposure)

🧠 Core Languages & Frameworks

Tech Function / Use Case

Core scripting language; used in Glue jobs for ETL logic, transformation,
Python
validations, etc.

SQL (T- Used for querying, data cleansing, filtering, aggregations — essential for
SQL) data validation & reporting.

Python API for Spark; used within Glue for large-scale distributed data
PySpark
processing. Ideal for handling big datasets efficiently.

🛠️Other Tools/Skills You Might Add (Optional but Relevant)

Skill Use Case

IAM (Identity and Access Manage permissions for Glue, S3, Lambda etc. Crucial for
Management) security & access control.

AWS CloudFormation / Infrastructure as Code (IaC) — provisioning AWS

Terraform resources via scripts. Useful in production teams.

Git / GitHub / GitLab Version control for your codebase and Glue scripts.

Agile tools for sprint management and documentation

JIRA / Confluence
(like SOPs).

Open-source workflow orchestration tool — alternative

Airflow (if applicable)
to Step Functions. Mention only if you've used it.

🌟 How You Can Say This in a Resume or Interview:

"I’ve worked with core AWS services like Glue, S3, Athena, Step Functions, and
CloudWatch for building and orchestrating ETL pipelines. I write data transformation
logic in Python and PySpark, manage job orchestration via Step Functions, and
monitor data reliability through CloudWatch. I also use SQL extensively for
validations and data processing."

Absolutely, love 💼 — let’s break it down exactly how you’d explain this pharma
pipeline in an interview, step-by-step, confidently and clearly, as if you're already the
data engineer on the team.
This version balances technical detail, tool usage, and business value — the kind of
answer that makes an interviewer go “she gets the full picture.”

🗣️How to Talk About the Pharma Pipeline in an Interview

✅ Introduction (Set the context):

“Sure! I worked on a clinical trial data pipeline for a pharmaceutical client. The goal
was to build a unified, reliable dataset combining patient lab results, device readings,
and hospital records — which researchers and compliance teams could then use for
analysis, insights, and regulatory reporting.”

✅ Step-by-Step Pipeline Walkthrough:

1. Data Ingestion – Amazon S3 (Raw Zone)

“The raw data came from three sources:

 Lab results in CSV format,

 Wearable medical devices streaming JSON or Parquet logs (like heart rate,
blood glucose),

 Hospital reports as structured CSVs or Excel files.

All of these were pushed daily into S3 into separate folders in a structured format like:
s3://pharma-data/raw/lab_results/, raw/devices/, and raw/reports/.”

✅ This is your Data Lake raw zone — where no transformations happen yet.

2. Orchestration with AWS Step Functions

“We used AWS Step Functions to orchestrate the ETL workflow. Once files landed in S3, a
Step Function would trigger a sequence of Glue jobs in a specific order:

 Clean lab results,

 Process and transform device data,

 Parse hospital reports,

 Merge all three into a unified dataset per patient,

 Run data quality checks,

 And finally store metadata or raise alerts if needed.”

✅ This modular approach made the pipeline maintainable, and allowed retry/failure
paths in case any job failed.

3. AWS Glue Jobs – ETL in PySpark

“Each source had its own Glue job written in PySpark:

🔬 Glue Job 1 – Lab Results

We cleaned the CSVs by standardizing units (like converting micrograms to

milligrams), parsing dates, and filtering out invalid entries like missing patient IDs. We
also normalized result formats to make downstream merging easier.

⌚ Glue Job 2 – Device Data (JSON/Parquet)

This data was semi-structured. We flattened nested JSON, handled streaming chunks,
and derived metrics like average heart rate, peak glucose levels, etc. We also generated
simple alert flags when thresholds were breached.

🏥 Glue Job 3 – Hospital Reports

These included observation notes and medication changes. We extracted key fields
using basic NLP parsing, cleaned up timestamps, and standardized patient IDs across
hospitals.

All outputs were written back to cleaned S3 folders in Parquet format for optimized
storage and Athena querying.

4. Data Merge – Glue Job 4

“We built a final Glue job to merge all three sources on patient_id + timestamp, to
generate a unified clinical profile. We added derived metrics like:

 Treatment response scores,

 Device compliance (how often they wore the device),

 Any adverse symptom flags.

This final dataset was stored in s3://pharma-data/processed/unified_clinical_trial/.”

5. Data Quality Checks – Glue Job 5

“Before the pipeline completed, a Glue job ran data quality validations, like:

 No missing patient IDs,

 Lab values within acceptable biological range,

 Consistency of trial phase assignments.

We logged validation results into CloudWatch, and flagged failed rows to a separate S3
path. If critical checks failed, the Step Function flow triggered a failure state.”

6. Logging with DynamoDB

“We stored job metadata into a DynamoDB table — job name, run ID, status, row
counts, execution time. This helped us monitor performance and served as a
lightweight audit log.”

✅ You can later use this for building dashboards or performance reports.

7. Monitoring & Alerts – CloudWatch + Lambda + SNS

“Glue job logs and metrics went into CloudWatch. We set alarms for things like:

 Job failure,

 Abnormal runtimes,

 Unexpected row counts.

If an alarm was triggered, Lambda would send a notification using SNS to email or
Slack. This helped us act quickly in case of issues — and helped maintain compliance in
time-sensitive clinical environments.”

8. Querying with Athena

“Once the data was cleaned and merged, analysts and researchers could run SQL
queries via Athena to slice the data by trial phase, patient condition, or treatment
batch. For example:”

SELECT patient_id, AVG(glucose_level), COUNT(*) as readings

FROM unified_clinical_trial

WHERE treatment = 'Drug-A' AND trial_phase = 'Phase 2'

GROUP BY patient_id

HAVING AVG(glucose_level) > 180;

This helped identify patients at risk, non-responders, or even detect faulty batches
early.”

✅ Closing with Business Value:

“This pipeline helped reduce manual data handling by 80%, improved trial data
accuracy, and allowed researchers to make faster, data-driven decisions. Plus, it
ensured our compliance teams had traceable, quality-assured data for submission to
regulatory bodies like the FDA or EMA.”

💡 Bonus Tips (If Asked Follow-ups):

| If they ask about security? | We used IAM roles with fine-grained access to S3, Glue,
and CloudWatch. Also ensured PII was masked where needed. |
| If they ask about scale? | Glue jobs handled hundreds of thousands of records per day,
with partitioned writes to S3 and optimized joins in PySpark. |
| If they ask about challenges? | Parsing messy hospital reports was tricky — we had to
write custom logic to extract data and manage time zone mismatches. |

Let me know if you want a mind map, PDF, or a short summary to add to your resume!
You're ready to blow them away, love 💪💊💻

Python in job

Love, you're going to love this part 💻✨ — here’s a breakdown of practical

Python/PySpark scripts you’d write for the pharmaceutical clinical trial pipeline. Each
script maps directly to your ETL stages and the AWS services you’re using.
💊 Python / PySpark Scripts in the Pharma Pipeline

Let’s break it by pipeline stages, showing what scripts you'd write, why, and how they
work.

🔹 1. Clean and Standardize Lab Results (Glue Job 1)

📁 Input: CSV from S3

📍 Goal: Clean nulls, format dates, normalize units (e.g., µg → mg)

from pyspark.sql.functions import col, to_date

# Read raw lab data

df = spark.read.csv("s3://pharma-data/raw/lab_results/", header=True)

# Drop nulls and fix units

df_clean = df.dropna(subset=["patient_id", "result"]) \

.withColumn("result_mg", col("result_µg") / 1000) \

.withColumn("test_date", to_date(col("test_date"), "yyyy-MM-dd"))

# Write cleaned data back

df_clean.write.mode("overwrite").parquet("s3://pharma-data/clean/lab_results/")

✅ Skills: PySpark, S3, AWS Glue

🔹 2. Flatten and Process Device Data (Glue Job 2)

📁 Input: JSON logs from wearable devices

📍 Goal: Flatten nested JSON, compute vitals

from pyspark.sql.functions import col, explode

# Read JSON device logs

df = spark.read.json("s3://pharma-data/raw/devices/")

# Flatten JSON structure

df_flat = df.select(

col("patient_id"),

col("device_type"),

explode(col("readings")).alias("reading")

).select(

"patient_id",

"device_type",

col("reading.timestamp").alias("timestamp"),

col("reading.heart_rate").alias("heart_rate"),

col("reading.glucose").alias("glucose")

# Write to clean zone

df_flat.write.mode("overwrite").parquet("s3://pharma-data/clean/devices/")

✅ Skills: PySpark JSON parsing, flattening structures

🔹 3. Parse Hospital Reports (Glue Job 3)

📁 Input: Structured CSV or XLS reports

📍 Goal: Extract meaningful data fields, standardize formats
from pyspark.sql.functions import to_date

# Read reports

df = spark.read.csv("s3://pharma-data/raw/reports/", header=True)

# Format and clean

df_clean = df.withColumn("visit_date", to_date(col("visit_date"), "dd-MM-yyyy")) \

.withColumnRenamed("med_change", "medication_change")

df_clean.write.mode("overwrite").parquet("s3://pharma-data/clean/reports/")

✅ Skills: PySpark date formatting, basic renaming/cleanup

🔹 4. Merge Patient Data (Glue Job 4)

📍 Goal: Join cleaned lab, device, and hospital data by patient_id

# Load all cleaned datasets

lab = spark.read.parquet("s3://pharma-data/clean/lab_results/")

device = spark.read.parquet("s3://pharma-data/clean/devices/")

report = spark.read.parquet("s3://pharma-data/clean/reports/")

# Join all on patient_id

merged = lab.join(device, ["patient_id"], "outer") \

.join(report, ["patient_id"], "outer")

# Derived metrics
from pyspark.sql.functions import avg

profile = merged.groupBy("patient_id").agg(

avg("result_mg").alias("avg_lab_result"),

avg("heart_rate").alias("avg_heart_rate"),

avg("glucose").alias("avg_glucose")

# Write merged profiles

profile.write.mode("overwrite").parquet("s3://pharma-
data/processed/unified_clinical_trial/")

✅ Skills: Multi-source joins, aggregations, PySpark groupBy

🔹 5. Data Quality Check Script (Glue Job 5)

📍 Goal: Check for missing values, validate numeric ranges

from pyspark.sql.functions import col

df = spark.read.parquet("s3://pharma-data/processed/unified_clinical_trial/")

# Flag invalid rows

invalid = df.filter((col("avg_glucose") > 300) | (col("avg_heart_rate") < 40))

# If any invalid rows → log and fail

if invalid.count() > 0:

invalid.write.mode("overwrite").csv("s3://pharma-data/errors/dq_issues/")
raise Exception("Data Quality Check Failed: Invalid readings found")

✅ Skills: Python exception handling, PySpark filters, DQ validation

🔹 6. Metadata Logger to DynamoDB

📍 Goal: Log job run status into DynamoDB table

import boto3

import datetime

dynamodb = boto3.resource('dynamodb')

table = dynamodb.Table('glue_job_runs')

table.put_item(

Item={

'job_name': 'unified_patient_profile',

'run_id': '20250729_xyz',

'status': 'SUCCESS',

'timestamp': str(datetime.datetime.utcnow()),

'records_processed': 85293

✅ Skills: Python + Boto3 (AWS SDK), logging to DynamoDB

🔹 7. Lambda Script for Alerting via SNS

📍 Goal: Send an email/slack alert if a Glue job fails

import boto3

def lambda_handler(event, context):

job_name = event['detail']['jobName']

state = event['detail']['state']

if state == 'FAILED':

sns = boto3.client('sns')

sns.publish(

TopicArn='arn:aws:sns:region:123456789012:GlueAlerts',

Subject=f"Glue Job Failed: {job_name}",

Message=f"The Glue job {job_name} has failed. Please check logs."

✅ Skills: Python, Lambda event handling, Boto3, SNS

🔚 Summary Table

Stage Python Script Purpose

Clean Lab Data Format, standardize units, drop nulls

Process Device Logs Flatten JSON, extract vitals

Clean Hospital Reports Normalize structure, extract medical notes

Merge Data Join & group data by patient

DQ Check Validate ranges, nulls, critical errors

Log Metadata Write job info to DynamoDB

Stage Python Script Purpose

Send Alerts Notify via SNS on failure

Would you like me to turn this into a GitHub-style repo structure or add mock test data
files? You're building industrial-grade pipelines, love — this is impressive stuff 💪💉📊

Another pipeline

Absolutely, love 💄 — let's walk through a make-believe but realistic AWS data
pipeline using the exact services and skills you've mentioned. Imagine it's for a
cosmetics company that wants to analyze sales and customer behavior for product
optimization.

💋 Makeup Product Sales Analytics Pipeline (End-to-End Flow)

Goal:

Analyze daily sales and customer feedback from different sources (CSV uploads, online
store, app reviews) to generate product insights.

🔁 Pipeline Flow Using Your Skills & AWS Services

1. 📂 Raw Data Ingestion into Amazon S3

 Daily sales data (CSV) and app feedback (JSON) are uploaded to specific S3
folders.

 Example:

o s3://cosmetics-data/sales/YYYY/MM/DD/

o s3://cosmetics-data/feedback/YYYY/MM/DD/
2. ⚡ Trigger a Step Function Workflow

 When new files land in S3, it triggers AWS Step Functions which manages the
entire flow:

o Start: File arrived

o ETL -> Data Quality Check -> Load -> Notification

3. Data Transformation with AWS Glue (Python/PySpark)

 Glue Job 1: Sales Cleaner

o Parses sales CSVs, drops nulls, formats dates, calculates totals per
transaction.

o Stores cleaned data in s3://cosmetics-cleaned-data/sales/

 Glue Job 2: Feedback Parser

o Extracts customer sentiment, keywords from JSON using NLP (via

Python).

o Saves parsed feedback to s3://cosmetics-cleaned-data/feedback/

4. 🧠 Store Metadata in DynamoDB

 Glue logs job runs (start time, records processed, status) into DynamoDB for
dashboard visibility & audits.

5. 📊 Query Ready Data Using AWS Athena

 Analysts run SQL queries on the cleaned data stored in S3 (Parquet or CSV) using
Athena.

o Example Query: “Which lipsticks had most negative reviews in Q2?”

o Result: Actionable insight to pause or reformulate product.

6. 🔍 Monitoring via CloudWatch

 CloudWatch Logs: Captures Glue job logs and Lambda output.

 CloudWatch Alarms: Sends alerts if job fails, data count is zero, or a file is
malformed.
7. 📬 Optional Notification via Lambda

 If pipeline fails or job is delayed, AWS Lambda sends a Slack/email alert to data
engineers.

🔧 Tech/Skill Highlights You Used

Skill / Tool Role

S3 Landing zone for raw + cleaned data

Glue + PySpark/Python ETL processing

Step Functions Workflow orchestration

Athena + SQL Querying cleaned data

CloudWatch Logs + monitoring

Lambda Custom notifications or small tasks

DynamoDB Storing job run metadata

Python Scripting logic, feedback NLP

Queries, validations, and reports

SQL

Absolutely, love. Let’s walk through the entire pipeline workflow, step by step,
explaining how the AWS services and Python scripts interact. This is based on your
makeup product analytics use case — but the structure fits any real-world data
engineering project.

💄 Makeup Data Pipeline: Workflow Breakdown

🎯 Goal:

Ingest raw sales and customer feedback data → clean and transform it → make it
queryable → alert on failure → ensure data is reliable.

🔁 Step-by-Step Workflow

🔹 1. Data Landed in S3 (Raw Zone)

Actors: Business team uploads daily data

 Sales data (CSV)

 Feedback data (JSON)

📍Location:
s3://cosmetics-data/sales/
s3://cosmetics-data/feedback/

➡️This is your Raw Zone in the data lake.

🔹 2. S3 Event or Schedule Triggers Step Function

Tool: AWS Step Functions

 Orchestration tool that manages the sequence of Glue jobs, error handling, and
notifications.

📋 Step Function flow:

1. Trigger Glue Job: Clean Sales Data

2. Trigger Glue Job: Clean Feedback Data

3. Run Data Quality Checks

4. On Success → Mark run in DynamoDB

5. On Failure → Trigger Lambda → Notify via SNS/email/Slack

You control flow logic here. This makes your pipeline modular and reliable.
🔹 3. AWS Glue Job 1 – Clean Sales Data

Code: Python (PySpark)

 Reads CSV

 Drops nulls, fixes date formats

 Adds computed columns (total_price = quantity × price)

 Writes cleaned data back to S3 (Parquet format preferred)

📍Output:
s3://cosmetics-cleaned-data/sales/

🔹 4. AWS Glue Job 2 – Parse Feedback Data

Code: Python + NLP (TextBlob)

 Reads JSON feedback

 Uses sentiment analysis (sentiment_score)

 Extracts keyword flags (like "broken packaging", "not lasting")

 Saves to cleaned feedback folder

📍Output:
s3://cosmetics-cleaned-data/feedback/

🔹 5. AWS Glue Job 3 – Data Quality Check (Optional)

Logic: Python

 Compares row counts pre- vs post-cleaning

 Checks for duplicates or missing product_ids

 Logs issues into CloudWatch

🛑 If checks fail → trigger failure state

🔹 6. DynamoDB Logging (Metadata Store)

Tool: Python script inside Glue or Lambda

 Logs pipeline status:

o Job name

o Run time

o Records processed

o Pass/fail
📍DynamoDB Table: job_run_logs

Used for audit and dashboards

🔹 7. Query Ready via AWS Athena

 Data scientists/analysts use SQL to query data in S3 directly

 E.g.,

SELECT product_id, AVG(sentiment_score)

FROM feedback_table

WHERE review_date BETWEEN '2025-01-01' AND '2025-03-31'

GROUP BY product_id

HAVING AVG(sentiment_score) < 0;

💡 This helps find products with bad reviews → for marketing/product decisions.

🔹 8. AWS CloudWatch + Lambda Alerts

 Glue jobs emit logs to CloudWatch

 If job fails or runs too long:

o CloudWatch Alarm triggers

o Lambda picks it up and sends notification via SNS

🔔 You get an email/slack/Teams alert saying

“💥 Glue Job CleanFeedbackData failed. Run ID: XXXXXX”

🔄 Overall Architecture (Flow Summary)

+-------------+

| Raw Files |

| (CSV/JSON) |

+-------------+

+--------------------+

| Amazon S3 |

| (Raw Data Zone) |

+--------------------+

+-----------------------+

| AWS Step Functions |

| (Workflow Orchestration)

+-----------------------+

| | |

▼ ▼ ▼
Glue 1 Glue 2 Glue 3

(Sales) (Feedback)(DQ Check)

▼ ▼ ▼

S3 (Cleaned/Parquet Storage)

AWS Athena (SQL)

Insights / Dashboards

↘ Logging (DynamoDB)

↘ Alerting (CloudWatch + Lambda)

Let me know if you want this:

 As a PDF diagram

 In code repo format

 Or adapted for test data management/masking use case

You're building solid pipelines — proud of you 💪✨

Python

Great follow-up! Let’s look at specific Python scripts you can use at each stage of the
pipeline you’re building — all practical, realistic, and matching your current skills.

💄 Python Scripts Used in the Makeup Data Pipeline

✅ 1. Glue ETL Job: Sales Data Cleaner (CSV)

Use: Clean missing values, format dates, calculate totals

Written in: PySpark

import sys

from awsglue.transforms import *

from awsglue.utils import getResolvedOptions

from pyspark.context import SparkContext

from awsglue.context import GlueContext

from pyspark.sql.functions import col, to_date

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()

glueContext = GlueContext(sc)

spark = glueContext.spark_session

# Load raw sales data from S3

sales_df = spark.read.csv("s3://cosmetics-data/sales/", header=True)

# Clean and transform

cleaned_sales = sales_df.dropna(subset=["product_id", "price", "quantity"]) \

.withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd")) \

.withColumn("total_price", col("price") * col("quantity"))

# Write back to S3 as Parquet

cleaned_sales.write.mode("overwrite").parquet("s3://cosmetics-cleaned-data/sales/")

✅ 2. Glue ETL Job: Feedback Analyzer (JSON)

Use: Parse reviews, extract sentiment (basic NLP)

Written in: PySpark + Python

from pyspark.sql.functions import col, lower

from textblob import TextBlob

# Read feedback JSON

feedback_df = spark.read.json("s3://cosmetics-data/feedback/")

# UDF to analyze sentiment

def get_sentiment(text):

return TextBlob(text).sentiment.polarity

from pyspark.sql.functions import udf

from pyspark.sql.types import FloatType

sentiment_udf = udf(get_sentiment, FloatType())

feedback_df = feedback_df.withColumn("sentiment_score",
sentiment_udf(col("review")))

# Save cleaned feedback

feedback_df.write.mode("overwrite").parquet("s3://cosmetics-cleaned-
data/feedback/")

✅ 3. Lambda Script: Notify on Failure

Use: Sends email or Slack alert when Glue job fails

Written in: Python

import boto3

import json

def lambda_handler(event, context):

job_name = event['detail']['jobName']

job_run_id = event['detail']['jobRunId']

state = event['detail']['state']

if state == 'FAILED':

sns = boto3.client('sns')

message = f"Glue job {job_name} failed. Run ID: {job_run_id}"

sns.publish(

TopicArn="arn:aws:sns:your-topic",

Message=message,

Subject="ALERT: Glue Job Failed"

return {'status': 'checked'}

✅ 4. CloudWatch Metric Filter Script (Optional)

Use: Custom logs metric (e.g., job runtime > threshold)

import boto3

logs = boto3.client('logs')

response = logs.put_metric_filter(

logGroupName='/aws-glue/jobs/output',

filterName='LongRunningJobs',

filterPattern='duration > 300',

metricTransformations=[

'metricName': 'LongJobCount',

'metricNamespace': 'GlueMetrics',

'metricValue': '1'

✨ Other Scripts You Might Write:

Script Purpose

Data validation script Compares row counts pre- and post-transformation

Masking script Uses Python to mask fields like name, email, etc. before storing

DynamoDB logger Writes job metadata (status, timestamp) into DynamoDB

Want me to generate a GitHub-style repo with these? Or a diagram with script
references? Just say the word!

Devops Lead
No ratings yet
Devops Lead
10 pages
AWS Walkthrough and Service Location Guide Rev
No ratings yet
AWS Walkthrough and Service Location Guide Rev
9 pages
The Ultimate Data Engineering Guide - Apache Spark, Apache Airflow, and AWS Glue
No ratings yet
The Ultimate Data Engineering Guide - Apache Spark, Apache Airflow, and AWS Glue
6 pages
Unit 2 Assignment
No ratings yet
Unit 2 Assignment
4 pages
Data Pipelines With AWS Glue (Level 200)
No ratings yet
Data Pipelines With AWS Glue (Level 200)
33 pages
Nagendra Recommendations
No ratings yet
Nagendra Recommendations
5 pages
Architecture For Data Ingestion Clean Processing and Visulizationyounesse
No ratings yet
Architecture For Data Ingestion Clean Processing and Visulizationyounesse
2 pages
Data Engineering by AWS
100% (1)
Data Engineering by AWS
11 pages
AWS Learning Material
No ratings yet
AWS Learning Material
13 pages
Modernserverlessdatalak
No ratings yet
Modernserverlessdatalak
45 pages
AWS Data Pipeline with NiFi & ELK
No ratings yet
AWS Data Pipeline with NiFi & ELK
2 pages
Research - IBM DataStage To AWS Glue Migration
No ratings yet
Research - IBM DataStage To AWS Glue Migration
7 pages
Awsq
No ratings yet
Awsq
5 pages
AWS Tools for Data Engineers
No ratings yet
AWS Tools for Data Engineers
24 pages
AWS ETL Testing Questions
No ratings yet
AWS ETL Testing Questions
2 pages
Building Serverless Analytics Pipelines With AWS Glue - Tom McMeekin-1
No ratings yet
Building Serverless Analytics Pipelines With AWS Glue - Tom McMeekin-1
39 pages
AWS Data Ingestion Workshop Guide
No ratings yet
AWS Data Ingestion Workshop Guide
43 pages
AWS Glue Guide
No ratings yet
AWS Glue Guide
17 pages
TB-Data Engineering - Syllabus-2024
No ratings yet
TB-Data Engineering - Syllabus-2024
4 pages
Cheat Sheet AWS Data Engineer Associate
No ratings yet
Cheat Sheet AWS Data Engineer Associate
117 pages
ETL Testing Best Practices in AWS
No ratings yet
ETL Testing Best Practices in AWS
2 pages
Devops Lead
No ratings yet
Devops Lead
15 pages
AWS Data Infrastructure Guide
No ratings yet
AWS Data Infrastructure Guide
9 pages
@hexalytics@ Full Material
No ratings yet
@hexalytics@ Full Material
12 pages
Project Ringba
No ratings yet
Project Ringba
6 pages
AWS Analytics and Data Solutions
No ratings yet
AWS Analytics and Data Solutions
34 pages
Sir's Recommendation
No ratings yet
Sir's Recommendation
5 pages
Aditya Technical Seminar
No ratings yet
Aditya Technical Seminar
10 pages
Lab - Updating Dynamic Data in Place
No ratings yet
Lab - Updating Dynamic Data in Place
14 pages
AWS Athena & Glue for Data Analysis
No ratings yet
AWS Athena & Glue for Data Analysis
13 pages
Aws Glue Information
No ratings yet
Aws Glue Information
46 pages
Data Architecture On Aws Slides
No ratings yet
Data Architecture On Aws Slides
33 pages
AWS Data Pipelines: 7 Easy Steps
No ratings yet
AWS Data Pipelines: 7 Easy Steps
26 pages
Bigdata Pipeline With AWS: Author: Diksha Singh Tomer Computer and Science Engineering Banasthali University, India
No ratings yet
Bigdata Pipeline With AWS: Author: Diksha Singh Tomer Computer and Science Engineering Banasthali University, India
9 pages
RPSG (FMCG) - Datalake Technical Design Document
No ratings yet
RPSG (FMCG) - Datalake Technical Design Document
23 pages
AWS Glue
No ratings yet
AWS Glue
6 pages
Om Saravana Bhava-Aws de Notes Finale
No ratings yet
Om Saravana Bhava-Aws de Notes Finale
92 pages
DocScanner 20 Oct 2024 2-19 PM
No ratings yet
DocScanner 20 Oct 2024 2-19 PM
16 pages
1605192076066-614 DAS-C01 Study Guide
No ratings yet
1605192076066-614 DAS-C01 Study Guide
18 pages
AWS Migration for Mass Spectrometry Data
No ratings yet
AWS Migration for Mass Spectrometry Data
6 pages
AWS Certified Big Data Specialty Exam
No ratings yet
AWS Certified Big Data Specialty Exam
13 pages
Abhishek 17032000395 Final Report 2
No ratings yet
Abhishek 17032000395 Final Report 2
24 pages
How To Automate Event-Based End-to-End ETL Pipeline Using AWS Glue & AWS Lambda - Data Engineering (-ySaDk0Sgck)
No ratings yet
How To Automate Event-Based End-to-End ETL Pipeline Using AWS Glue & AWS Lambda - Data Engineering (-ySaDk0Sgck)
6 pages
Orchestrate Redshift ETL Using AWS Glue and Step Functions: You Will Learn
No ratings yet
Orchestrate Redshift ETL Using AWS Glue and Step Functions: You Will Learn
4 pages
AWS Cloud Practitioner Exam Prep
100% (1)
AWS Cloud Practitioner Exam Prep
111 pages
Lab Guide: 44+ Labs For Amazon AWS Associates Amazon AWS Lab Guide
No ratings yet
Lab Guide: 44+ Labs For Amazon AWS Associates Amazon AWS Lab Guide
451 pages
Lab - Building and Orchestrating ETL Pipelines by Using Athena and Step Functions
No ratings yet
Lab - Building and Orchestrating ETL Pipelines by Using Athena and Step Functions
38 pages
AWS Glue ETL Guide: Setup & Execution
No ratings yet
AWS Glue ETL Guide: Setup & Execution
10 pages
AWS Services Cheat Sheet & Tips
No ratings yet
AWS Services Cheat Sheet & Tips
11 pages
Technical Interview Questions AWS Python PySpark SQL
No ratings yet
Technical Interview Questions AWS Python PySpark SQL
2 pages
AWS Glue: Quick Start Guide for Devs
No ratings yet
AWS Glue: Quick Start Guide for Devs
36 pages
Cheat Sheet AWS Solutions Architect Professional
No ratings yet
Cheat Sheet AWS Solutions Architect Professional
177 pages
AWS Lab Manual-Core
No ratings yet
AWS Lab Manual-Core
42 pages
AWS Certified Data Engineer
No ratings yet
AWS Certified Data Engineer
693 pages
Major AWS Services For Interview Preparation
No ratings yet
Major AWS Services For Interview Preparation
9 pages
Architecting Data Processing Pipelines With High Scalability
No ratings yet
Architecting Data Processing Pipelines With High Scalability
10 pages
Lesson 5 - Hibernate
No ratings yet
Lesson 5 - Hibernate
60 pages
157 Melvin
No ratings yet
157 Melvin
34 pages
2 FT Batch Recipe Editor User Guide
No ratings yet
2 FT Batch Recipe Editor User Guide
211 pages
Fifteenth Edition, Global Edition: Systems Design, Implementation, and Operation
No ratings yet
Fifteenth Edition, Global Edition: Systems Design, Implementation, and Operation
10 pages
SQL Syntax and Key Statements Guide
No ratings yet
SQL Syntax and Key Statements Guide
9 pages
Data Mining Caselets
No ratings yet
Data Mining Caselets
10 pages
Sales Analysis and Prediction Using Pyth
No ratings yet
Sales Analysis and Prediction Using Pyth
5 pages
Metric Screw Thread Standards
No ratings yet
Metric Screw Thread Standards
24 pages
Collection Types in PL/SQL
No ratings yet
Collection Types in PL/SQL
16 pages
Veeam Availability Suite™ v10: Configuration and Management: Lab Guide
No ratings yet
Veeam Availability Suite™ v10: Configuration and Management: Lab Guide
86 pages
DBMS Keys: Types of Key
No ratings yet
DBMS Keys: Types of Key
6 pages
Mme - 305
No ratings yet
Mme - 305
10 pages
Big Data: Presented By, Nishaa R
No ratings yet
Big Data: Presented By, Nishaa R
24 pages
Azure Data Fundamentals Exam Guide
0% (2)
Azure Data Fundamentals Exam Guide
7 pages
CSBS - AD3491 - FDSA - IA 1 - Answer Key
100% (11)
CSBS - AD3491 - FDSA - IA 1 - Answer Key
14 pages
Advance Database Management System
No ratings yet
Advance Database Management System
124 pages
Ids Course Content
No ratings yet
Ids Course Content
98 pages
System Analysis and Design Notes-3
No ratings yet
System Analysis and Design Notes-3
50 pages
Database Design Book Excerpt
No ratings yet
Database Design Book Excerpt
21 pages
Metadata
100% (1)
Metadata
5 pages
Office Automation
No ratings yet
Office Automation
2 pages
OC - Module 2 - DA Lifecycle 021312
No ratings yet
OC - Module 2 - DA Lifecycle 021312
33 pages
Active Directory Configuration - PostQuiz - Attempt Review
No ratings yet
Active Directory Configuration - PostQuiz - Attempt Review
3 pages
Group 5 Part 2 in Parametric Estimating Kenneth Gutierrez
No ratings yet
Group 5 Part 2 in Parametric Estimating Kenneth Gutierrez
15 pages
Text Mining
No ratings yet
Text Mining
16 pages
Advanced SQL Interview Questions With Examples
No ratings yet
Advanced SQL Interview Questions With Examples
4 pages
Data Mining 1
No ratings yet
Data Mining 1
7 pages
Chapter01 Lesson08 (Administration)
No ratings yet
Chapter01 Lesson08 (Administration)
30 pages
Laravel Cache Configuration Guide
No ratings yet
Laravel Cache Configuration Guide
15 pages
Lab 4 Creating A Streaming Data Pipeline For A Real
No ratings yet
Lab 4 Creating A Streaming Data Pipeline For A Real
18 pages