AWS EMR Interview Questions and Answers
AWS Interview Questions and Answers – Section 1: EMR (Elastic MapReduce)
Q: How many master nodes? Type of nodes in the cluster?
A: Typically one master node (for cluster management; EMR also supports three master nodes for high
availability), core nodes (for data storage and processing), and task nodes (for transient, compute-only
processing). Core nodes persist data on HDFS, whereas task nodes are ephemeral.
Q: What is the difference between core, master, and task nodes?
A:
Master Node: Manages the cluster and resource allocation.
Core Nodes: Handle both data storage and processing (via HDFS).
Task Nodes: Perform compute-only operations and can be removed anytime.
Q: What is the size of your EMR nodes?
A: We typically used m5.xlarge or r5.2xlarge instances depending on workloads. Memory-intensive
jobs used r5 family.
Q: Guidelines to choose size of nodes (Horizontal vs Vertical Scaling)?
A:
Vertical Scaling: Increase CPU/RAM of single node.
Horizontal Scaling: Add more nodes for distributed processing — preferred in Spark jobs.
Q: Type of cluster scaling? (Auto-scaling)
A: We used EMR auto scaling (managed scaling or custom scaling policies) to scale core/task capacity
based on metrics such as available YARN memory and CPU usage.
Q: How do you distribute PySpark load across nodes?
A: By using --deploy-mode cluster in spark-submit, so the driver runs on a cluster node rather than the
client machine, and YARN distributes the executors across core and task nodes for balanced execution.
Q: How do you submit Spark jobs?
A: Submitted using EMR Steps, either from the AWS Console, CLI, or programmatically via boto3.
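For illustration, a minimal boto3 sketch of adding a Spark step to a running cluster; the cluster ID, script path, and bucket name are placeholders:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Submit a spark-submit command as an EMR step (placeholder cluster ID and S3 paths)
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[
            {
                "Name": "daily-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",  # driver runs on the cluster, not the client
                        "s3://my-bucket/scripts/etl_job.py",
                    ],
                },
            }
        ],
    )
    print(response["StepIds"])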
Q: Applications used on EMR?
A: Spark, Hive, Hadoop, Livy, Hue.
Q: Uses of these applications?
A:
Spark: Data transformation
Hive: SQL-like queries
Hadoop: Underlying YARN/HDFS
Livy: Spark REST API
Hue: UI-based job and query management
Q: Use of Hive (Hive Metastore)?
A: Used to store metadata of Hive tables (schemas, locations). Hive Metastore enables schema
sharing across tools like Presto, Athena.
Q: Difference between external and managed Hive tables?
A:
Managed: Hive manages table metadata and data (deleting table deletes data).
External: Data stored outside Hive’s control (deleting table keeps data intact).
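As a small Spark SQL sketch (table names and the S3 path are placeholders): dropping the managed table removes its data, while dropping the external table leaves the S3 files in place.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Managed table: Hive owns both the metadata and the data files
    spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)")

    # External table: only the metadata is managed; data stays at the S3 location
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
        STORED AS PARQUET
        LOCATION 's3://my-bucket/warehouse/sales/'
    """)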
Q: Versions used (EMR, Spark, Python)?
A: EMR: v6.6.0
Spark: 3.x
Python: 3.x
Q: Cost optimization of EMR?
A:
Auto scaling
Spot instances
Transient clusters (terminate after job ends)
Fine-tuned cluster sizing
Q: What is a transient cluster?
A: A cluster that is created temporarily for a job and terminated post completion to save cost.
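A minimal boto3 sketch of launching a transient cluster that runs one step and then terminates itself; the release label, instance types, roles, and S3 paths are placeholders and would vary by environment.

    import boto3

    emr = boto3.client("emr")

    cluster = emr.run_job_flow(
        Name="transient-etl",
        ReleaseLabel="emr-6.6.0",
        Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
        LogUri="s3://my-bucket/emr-logs/",
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate once all steps finish
        },
        Steps=[
            {
                "Name": "spark-etl",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             "s3://my-bucket/scripts/etl_job.py"],
                },
            }
        ],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(cluster["JobFlowId"])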
Q: Types of nodes provisioned (On-demand/Spot/Reserved)?
A:
Master: On-Demand
Core: Mixed (On-Demand + Spot)
Task: Spot for cost savings
Q: When to use On-demand, Spot, Reserved nodes?
A:
On-Demand: Predictable and stable workloads
Spot: For flexible, fault-tolerant jobs
Reserved: For long-term, stable environments (Prod)
Q: What is serverless EMR? EMR Studio?
A:
EMR Serverless: Fully managed runtime for Spark and Hive jobs without provisioning or managing clusters
EMR Studio: Web-based IDE to visually develop and monitor jobs
Q: When do you choose EMR vs Glue/Athena/Kinesis?
A:
EMR: Large-scale batch/stream processing
Glue: ETL with managed scheduling
Athena: Ad-hoc querying
Kinesis: Real-time streaming
Q: Where are EMR logs stored?
A: In Amazon S3 or CloudWatch Logs (configurable in EMR settings)
Q: How to monitor EMR applications?
A:
CloudWatch Metrics
Spark History Server
Ganglia
Resource Manager UI
Q: How to debug EMR jobs?
A:
Analyze logs (stderr/stdout)
Use Spark UI
Check EMR Step output and failure logs
Q: What is Spark application and resource monitoring UI?
A:
Spark History Server: Tracks past jobs
YARN UI: Shows active job resources
Resource Manager UI: Overall cluster health
Q: How do monitoring UIs help debug jobs?
A: Identify skewed stages, memory issues, stragglers, and executor failures.
Q: Spark processing architecture?
A: Driver (DAG Scheduler → Task Scheduler) → Cluster Manager (YARN) → Executors on worker nodes →
HDFS/S3 for reads and writes
Q: Terminate vs Stop EMR?
A:
Terminate: Deletes the cluster and data (unless persisted on S3)
Stop: Not applicable; EMR doesn’t support “stop” like EC2
Q: What is AMI and how is it used in EMR?
A: Amazon Machine Image (AMI) provides pre-installed software and OS for clusters.
Q: EMR disk volume type and size?
A: Used EBS gp2/gp3 volumes. Size depends on job and data, usually 100–500 GB per node.
Q: Was data stored on HDFS or S3?
A: Data stored primarily on Amazon S3 using EMRFS. Temporary data may go to HDFS.
Q: What is EMRFS?
A: EMR File System (EMRFS) enables EMR clusters to use S3 as the primary data source.
Q: What are tags in EMR?
A: Metadata key-value pairs for organizing and cost tracking (e.g., Project, Env, Owner)
Q: What are bootstrap actions in EMR?
A: Shell scripts that run during cluster startup to install libraries, configure environment.
Q: What is instance fleet and when do we use it?
A: Instance Fleet enables use of multiple instance types within a role (master/core/task) for cost and
availability optimization.
Q: How do you increase EMR instance quota in an AZ?
A: Raise a service quota increase request via the Service Quotas console or an AWS Support case for the
relevant EC2 instance limits.
Q: What is auto termination in EMR?
A: Automatically terminates cluster if idle for specified time. Useful for cost savings.
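As a small sketch (the cluster ID is a placeholder), the auto-termination policy can be set with boto3:

    import boto3

    emr = boto3.client("emr")

    # Terminate the cluster automatically after one hour of idleness
    emr.put_auto_termination_policy(
        ClusterId="j-XXXXXXXXXXXXX",
        AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds
    )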
Q: Use of security groups in EMR?
A: Control network access to cluster nodes via inbound/outbound rules.
Q: Data volume and format?
A: Processed 100s of GBs to TBs. Formats: Parquet, CSV, JSON, ORC.
Q: What is Parquet/ORC file format? Columnar vs Row-based formats?
A:
Parquet/ORC: Columnar formats, better for analytical queries
CSV/JSON: Row-based, simpler but slower for large data
Q: How do you tune EMR performance?
A:
Partition tuning
Data coalescing
Executor memory and core tuning
Using broadcast joins and caching
Choosing optimized file formats (Parquet/ORC)
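To make a few of these concrete, a hedged PySpark sketch (paths, column names, and partition values are placeholders) showing partition pruning on read, a broadcast join for a small dimension table, and compacting output files before writing Parquet:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

    # Partition pruning: only the requested partition is scanned
    events = spark.read.parquet("s3://my-bucket/events/").where("dt = '2024-06-01'")

    # Broadcast join: avoids shuffling the large side when the dimension table is small
    dims = spark.read.parquet("s3://my-bucket/dim_customer/")
    joined = events.join(broadcast(dims), "customer_id")

    # Coalesce before writing to avoid producing many small output files
    joined.coalesce(32).write.mode("overwrite").parquet("s3://my-bucket/output/")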
AWS Glue Interview Questions and Answers
AWS Interview Questions and Answers – Section 2: AWS Glue
Q: What is AWS Glue? Why do we use the Glue service?
A: AWS Glue is a serverless data integration and ETL (Extract, Transform, Load) service. It is used to
prepare and transform data for analytics, machine learning, and application development by
automating the ETL process.
Q: What type of Glue engine have you used?
A: We used the Spark-based Glue engine (version 2.0 and 3.0), which supports distributed
processing using PySpark.
Q: What Python version have you used in Glue?
A: Python 3.x, primarily with Glue version 2.0 and above.
Q: What is a DPU (Data Processing Unit)?
A: A DPU provides the processing resources for Glue. 1 DPU = 4 vCPU + 16 GB RAM. You can scale
jobs by allocating more DPUs.
Q: What transformations have you used in your Glue job?
A:
DropFields, ApplyMapping, ResolveChoice, SelectFields, Join, Relationalize, Unnest
Also used custom transformations using PySpark and DynamicFrame functions.
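A minimal Glue script sketch using ApplyMapping on a DynamicFrame; the database, table, and output path below are placeholders.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the Data Catalog (placeholder database/table)
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders"
    )

    # Rename and cast columns: (source, source_type, target, target_type)
    mapped = ApplyMapping.apply(
        frame=orders,
        mappings=[
            ("order_id", "string", "order_id", "long"),
            ("order_ts", "string", "order_date", "timestamp"),
            ("amt", "double", "amount", "double"),
        ],
    )

    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/curated/orders/"},
        format="parquet",
    )
    job.commit()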
Q: Have you used custom scripts for your Glue job?
A: Yes, we wrote custom PySpark scripts for complex ETL logic and reusable components.
Q: What libraries and packages have you used?
A:
pandas, boto3, pyarrow, numpy, requests
Additionally, custom .whl and .zip packages uploaded to S3
Q: How do you use custom packages or libraries in Glue?
A:
Upload the package to S3
Use job parameter: --additional-python-modules or --extra-py-files
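For example, a hedged boto3 sketch that sets these parameters when defining the job; the module versions, S3 paths, role name, and worker settings are placeholders.

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="orders-etl",
        Role="GlueServiceRole",  # placeholder IAM role
        GlueVersion="3.0",
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py",
            "PythonVersion": "3",
        },
        DefaultArguments={
            "--additional-python-modules": "pyarrow==7.0.0,requests",
            "--extra-py-files": "s3://my-bucket/libs/my_utils.zip",
        },
        NumberOfWorkers=10,
        WorkerType="G.1X",
    )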
Q: How do we pass parameters to a Glue job?
A:
Pass key-value pairs with the --arguments flag in the CLI, the Arguments parameter in boto3
start_job_run, or as Default Arguments / trigger arguments on the job
Read them inside the script with getResolvedOptions (see the sketch below)
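A hedged sketch of both sides (the job name and argument names are placeholders): the caller passes Arguments via boto3, and the Glue script reads them with getResolvedOptions.

    # --- Caller side: start the job with runtime parameters ---
    import boto3

    glue = boto3.client("glue")
    glue.start_job_run(
        JobName="orders-etl",
        Arguments={"--env": "dev", "--run_date": "2024-06-01"},
    )

    # --- Inside the Glue script: read the same parameters ---
    import sys
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "env", "run_date"])
    print(args["env"], args["run_date"])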
Q: What is the use of Tags in Glue?
A: Tags are used to organize, identify, and control access or costs associated with Glue resources.
Q: How do you track changes in a Glue job (versioning)?
A:
Track via source control (Git)
Document changes using job description/comments
Enable job history tracking in AWS
Q: How do we monitor a Glue job?
A:
CloudWatch Logs
Glue Job Run History
Metrics like DPU hours, job duration, success/failure states
Q: How do we troubleshoot a Glue job?
A:
Analyze logs in CloudWatch
Use try-except in PySpark code
Use job metrics and bookmarks to resume jobs from failure
Q: How do we develop Glue jobs locally?
A:
Use AWS-provided Glue Docker image for local testing
Develop in PyCharm/Jupyter and simulate job runs with local inputs
Q: How can we push the Glue job code to a repository?
A:
Use Git or GitHub
Deploy with CI/CD tools like CodePipeline, Jenkins, or manually through boto3 scripts
Q: What if a Glue job takes too long to execute?
A:
Tune transformations
Use pushdown predicates
Increase DPU allocation
Optimize joins and repartition data
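For instance, a pushdown predicate limits the partitions Glue lists and reads from the Data Catalog; the database, table, and partition keys below are placeholders.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Only partitions matching the expression are listed and read, not the full table
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="raw_orders",
        push_down_predicate="year == '2024' and month == '06'",
    )

    # Repartition before shuffle-heavy transformations to balance the work
    orders = orders.repartition(64)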
Q: How do I set or increase job timeout?
A:
Set the job timeout (default 2,880 minutes, i.e. 48 hrs) in the Glue console/job config
Also configurable through boto3 or CLI
Q: What is a Glue workflow and its use?
A:
Glue Workflow orchestrates multiple jobs, crawlers, and triggers
Ensures dependency management and end-to-end ETL pipeline automation
Q: Glue workflow vs Apache Airflow?
A:
Glue Workflow: Simple, AWS-native, serverless, tightly integrated with Glue
Airflow: Complex DAGs, cross-service orchestration, supports retries/conditionals
Q: What is a Glue Crawler? When do we use it?
A:
A crawler scans a data store (S3, JDBC, etc.), infers schema, and populates the Glue Data
Catalog
Used before querying via Athena or ETL
Q: What is Glue Data Catalog? How do you update it?
A:
Centralized metadata repository
Updated using crawlers, scripts, or boto3 API
Q: How do you detect schema changes automatically?
A:
Enable “Update in-place” or “Add new columns” in crawler settings
Monitor schema evolution logs
Q: What is a Schema Registry? Why is it used?
A:
AWS Glue Schema Registry manages schema versions for streaming data (e.g., Kafka)
Ensures backward/forward compatibility
Q: How do you performance tune Glue jobs?
A:
Use pushdown predicates
Avoid shuffles/re-partitions unless necessary
Read data in partitioned format (e.g., Parquet)
Optimize DPU allocation and parallelism
AWS S3 Interview Questions and Answers
AWS Interview Questions and Answers – Section 3: Amazon S3 (Simple Storage Service)
Q: What is the purpose of S3? What is an object store?
A: Amazon S3 is a highly scalable, durable object storage service used to store any amount of data.
An object store stores data as objects (key, value, metadata) rather than in blocks or files.
Q: What is the difference between object store and block store?
A:
Object Store (S3): Stores data with metadata and unique key. Ideal for backups, media, logs.
Block Store (EBS): Divides data into fixed-size blocks. Ideal for databases, file systems, etc.
Q: What is the maximum size of a single object in S3?
A: Up to 5 TB per object. Multipart uploads are recommended for objects larger than 100 MB.
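A small boto3 sketch (bucket, key, and file names are placeholders); upload_file switches to multipart upload automatically above the configured threshold.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Use multipart upload above 100 MB, with up to 8 parts uploaded in parallel
    config = TransferConfig(multipart_threshold=100 * 1024 * 1024, max_concurrency=8)

    s3.upload_file("local/big_file.parquet", "my-bucket", "data/big_file.parquet", Config=config)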
Q: What is the total size limit for S3?
A: There is no limit. S3 scales automatically to store any amount of data.
Q: Can two buckets have the same name?
A: No. Bucket names are globally unique across all AWS accounts and regions.
Q: Can two different accounts have buckets with the same name in different regions?
A: No. Bucket names must be globally unique regardless of region or account.
Q: What is bucket versioning? How can I recover a deleted file?
A:
Versioning retains all versions of an object.
If a file is deleted, it creates a delete marker. You can restore the file by deleting the delete
marker.
Q: What is a delete marker?
A: A delete marker is a placeholder indicating the deletion of the latest version.
Removing it restores the previous version as the latest.
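A hedged boto3 sketch of restoring a deleted object in a versioned bucket by removing its delete marker (bucket and key are placeholders):

    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "reports/2024/sales.csv"  # placeholders

    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)

    # Deleting the delete marker makes the previous version the latest again
    for marker in versions.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])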
Q: How is S3 data encrypted?
A:
At rest: SSE-S3 (AES-256), SSE-KMS, SSE-C
In transit: HTTPS (TLS/SSL)
Q: What are the different S3 storage classes (cost layers)?
A:
S3 Standard
S3 Intelligent-Tiering
S3 Standard-IA
S3 One Zone-IA
S3 Glacier Instant Retrieval
S3 Glacier Flexible Retrieval
S3 Glacier Deep Archive
Q: How do I log and get notified when someone accesses an object in a bucket?
A:
Enable Access Logs or CloudTrail
Configure Event Notifications using SNS, Lambda, or SQS for actions like PUT, DELETE, etc.
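As one hedged example (the bucket name, prefix, and Lambda function ARN are placeholders), an event notification that invokes a Lambda function whenever an object is created:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_notification_configuration(
        Bucket="my-bucket",
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [
                {
                    "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:on-upload",
                    "Events": ["s3:ObjectCreated:*"],
                    "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}},
                }
            ]
        },
    )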
Q: How to classify and move data to save cost?
A:
Use S3 Lifecycle Policies to transition data across storage classes based on age or access
pattern.
Q: What is Glacier storage? When is it used?
A: Glacier is archival storage used for infrequently accessed data (compliance, backups).
Retrieval times range from milliseconds to hours depending on the tier.
Q: How do I prevent users from deleting objects in S3?
A:
Use IAM policies with Deny s3:DeleteObject
Use S3 Object Lock with governance or compliance mode
Q: How do I restrict access to specific users?
A:
Use Bucket Policies, IAM Policies, and/or Access Control Lists (ACLs)
Define who can access, what actions are allowed, and under what conditions
Q: How do you give access to users from different AWS accounts (CORS)?
A:
Use Bucket Policy or Cross-Account IAM Role
CORS (Cross-Origin Resource Sharing) applies to browser-based access
Q: How do you monitor S3 usage (size, objects, etc.)?
A:
Use S3 Storage Lens
CloudWatch metrics
AWS Cost Explorer
CloudTrail for audit logs
Q: I want to delete objects older than 180 days. How?
A:
Configure an S3 Lifecycle Rule to delete objects based on their last modified date (e.g., after
180 days).
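A hedged boto3 sketch of such a lifecycle rule (bucket and prefix are placeholders); the same rule can also transition objects to a cheaper storage class before they expire.

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-logs",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                    "Expiration": {"Days": 180},  # delete 180 days after last modification
                }
            ]
        },
    )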
Q: How to ensure data is protected if a region fails?
A:
Enable Cross-Region Replication (CRR) to replicate objects to another region automatically.
AWS RDS Interview Questions and Answers
AWS Interview Questions and Answers – Section 4: Amazon RDS (Relational Database Service)
Q: What is the difference between RDS and hosting a database on-premises or on EC2?
A:
RDS is a fully managed service. AWS handles backups, patching, failover, and scaling.
On EC2 or on-premises, you must manage everything yourself including high availability,
backups, and software updates.
Q: What types of databases are available in RDS?
A:
Amazon Aurora (MySQL/PostgreSQL compatible)
MySQL
PostgreSQL
MariaDB
Oracle
SQL Server
Q: What are the differences between these databases? Which one have you used?
A:
Aurora: AWS-optimized, high performance, fault-tolerant
MySQL/PostgreSQL: Open-source, widely supported
SQL Server/Oracle: Enterprise-grade with licensing
I have used PostgreSQL and Aurora for transactional and analytical workloads.
Q: How do I ensure high availability in RDS?
A:
Use Multi-AZ deployment, which maintains a synchronous standby replica in another AZ.
Automatic failover is handled by AWS in case of failure.
Q: How do you handle load on an RDS instance?
A:
Use Read Replicas to offload read traffic
Scale up instance class
Optimize query performance and indexes
Q: How do you manage growing storage in RDS?
A:
Enable storage autoscaling
Monitor via CloudWatch and plan upgrades as needed
Q: How do you back up your database automatically?
A:
Enable automated backups (retention up to 35 days), which support point-in-time recovery
Take manual snapshots for long-term copies that persist until deleted
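A hedged boto3 sketch (instance and snapshot identifiers are placeholders): set the automated backup retention period and take a manual snapshot.

    import boto3

    rds = boto3.client("rds")

    # Automated backups: retained up to 35 days, enable point-in-time recovery
    rds.modify_db_instance(
        DBInstanceIdentifier="prod-postgres",
        BackupRetentionPeriod=14,
        ApplyImmediately=True,
    )

    # Manual snapshot: kept until explicitly deleted
    rds.create_db_snapshot(
        DBInstanceIdentifier="prod-postgres",
        DBSnapshotIdentifier="prod-postgres-pre-release",
    )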
Q: How do you secure RDS data?
A:
Enable encryption at rest using AWS KMS
Use SSL for in-transit encryption
Restrict access using VPC Security Groups and IAM roles
Q: Who handles patching and updates for RDS?
A: AWS automatically handles patching during the preferred maintenance window you configure.
Q: Can you define maintenance windows for AWS RDS?
A: Yes, you can set a maintenance window in which AWS applies patches or upgrades.
Q: How do you monitor and troubleshoot slow queries?
A:
Enable Performance Insights
Use Enhanced Monitoring, CloudWatch, and query logs
Optimize slow queries using EXPLAIN plans
Q: How do you restrict access to RDS instances?
A:
Use VPC security groups, DB subnet groups, and IAM policies
Ensure only approved applications or users can connect
Q: What’s the difference between SQL and NoSQL databases?
A:
SQL (Relational): Structured schema, ACID compliant (e.g., MySQL, PostgreSQL)
NoSQL: Schema-less, scalable horizontally (e.g., DynamoDB, MongoDB)
Q: What’s the difference between a snapshot and a backup?
A:
Automated backups are periodic and support point-in-time recovery
Snapshots are user-initiated and persist until manually deleted
Q: What RDS instance types have you used and for what environments?
A:
db.t3.medium or db.t3.micro for dev/test
db.r5.large or Aurora Serverless for production