
AWS EMR Interview Questions and Answers

AWS Interview Questions and Answers – Section 1: EMR (Elastic MapReduce)

Q: How many master nodes? Type of nodes in the cluster?

A: Typically, we use 1 master node (for cluster management), core nodes (for data storage and
processing), and task nodes (for transient processing). Core nodes persist data using HDFS, whereas
task nodes are ephemeral.

Q: What is the difference between core, master, and task nodes?

A:

 Master Node: Manages the cluster and resource allocation.

 Core Nodes: Handle both data storage and processing (via HDFS).

 Task Nodes: Perform compute-only operations and can be removed anytime.

Q: What is the size of your EMR nodes?

A: We typically used m5.xlarge or r5.2xlarge instances depending on the workload. Memory-intensive
jobs used the r5 family.

Q: Guidelines to choose size of nodes (Horizontal vs Vertical Scaling)?

A:

 Vertical Scaling: Increase the CPU/RAM of a single node.

 Horizontal Scaling: Add more nodes for distributed processing — preferred in Spark jobs.

Q: Type of cluster scaling? (Auto-scaling)

A: We used Auto Scaling Policies on instance fleets to scale core/task nodes based on metrics like
YARN memory and CPU usage.

Q: How do you distribute PySpark load across nodes?

A: By using --deploy-mode cluster in spark-submit. This runs the driver on the cluster alongside
the executors, instead of on the submitting node, for balanced execution.
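
For illustration, a minimal submission sketch (the script path and resource values are hypothetical):

spark-submit --master yarn --deploy-mode cluster \
    --num-executors 10 --executor-memory 4g \
    s3://my-bucket/scripts/etl_job.py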

Q: How do you submit Spark jobs?

A: Submitted using EMR Steps, either from the AWS Console, the CLI, or programmatically via boto3.
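
A minimal boto3 sketch of adding a Spark step (the cluster ID and script path are hypothetical):

import boto3

emr = boto3.client("emr")
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[{
        "Name": "daily-etl",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/scripts/etl_job.py"],
        },
    }],
)
print(response["StepIds"])  # step IDs to poll for completion status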

Q: Applications used on EMR?

A: Spark, Hive, Hadoop, Livy, Hue.

Q: Uses of these applications?

A:

 Spark: Data transformation

 Hive: SQL-like queries

 Hadoop: Underlying YARN/HDFS

 Livy: Spark REST API

 Hue: UI-based job and query management

Q: Use of Hive (Hive Metastore)?

A: Used to store metadata of Hive tables (schemas, locations). The Hive Metastore enables schema
sharing across tools like Presto and Athena.

Q: Difference between external and managed Hive tables?

A:

 Managed: Hive manages table metadata and data (deleting table deletes data).

 External: Data stored outside Hive’s control (deleting table keeps data intact).
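
To make the distinction concrete, a PySpark sketch (table names and the S3 location are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Managed table: Hive owns data and metadata; DROP TABLE removes both
spark.sql("CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)")

# External table: DROP TABLE removes only metadata; the S3 data survives
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_external (id INT, amount DOUBLE)
    LOCATION 's3://my-bucket/sales/'
""")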

Q: Versions used (EMR, Spark, Python)?

A: EMR: v6.6.0
Spark: 3.x
Python: 3.x

Q: Cost optimization of EMR?

A:

 Auto scaling

 Spot instances

 Transient clusters (terminate after job ends)

 Fine-tuned cluster sizing


Q: What is a transient cluster?

A: A cluster that is created temporarily for a job and terminated post completion to save cost.

Q: Types of nodes provisioned (On-demand/Spot/Reserved)?

A:

 Master: On-Demand

 Core: Mixed (On-Demand + Spot)

 Task: Spot for cost savings

Q: When to use On-demand, Spot, Reserved nodes?

A:

 On-Demand: Predictable and stable workloads

 Spot: For flexible, fault-tolerant jobs

 Reserved: For long-term, stable environments (Prod)

Q: What is serverless EMR? EMR Studio?

A:

 EMR Serverless: Fully managed runtime for Spark/Hive without provisioning clusters

 EMR Studio: Web-based IDE to visually develop and monitor jobs

Q: When do you choose EMR vs Glue/Athena/Kinesis?

A:

 EMR: Large-scale batch/stream processing

 Glue: ETL with managed scheduling

 Athena: Ad-hoc querying

 Kinesis: Real-time streaming

Q: Where are EMR logs stored?

A: In Amazon S3 or CloudWatch Logs (configurable in EMR settings)


Q: How to monitor EMR applications?

A:

 CloudWatch Metrics

 Spark History Server

 Ganglia

 Resource Manager UI

Q: How to debug EMR jobs?

A:

 Analyze logs (stderr/stdout)

 Use Spark UI

 Check EMR Step output and failure logs

Q: What is Spark application and resource monitoring UI?

A:

 Spark History Server: Tracks past jobs

 YARN UI: Shows active job resources

 Resource Manager UI: Overall cluster health

Q: How do monitoring UIs help debug jobs?

A: Identify skewed stages, memory issues, stragglers, and executor failures.

Q: Spark processing architecture?

A: Driver → DAG Scheduler → Task Scheduler → Cluster Manager → Executors → HDFS/S3

Q: Terminate vs Stop EMR?

A:

 Terminate: Deletes the cluster and data (unless persisted on S3)

 Stop: Not applicable; EMR doesn’t support “stop” like EC2

Q: What is an AMI and how is it used in EMR?

A: An Amazon Machine Image (AMI) provides the pre-installed OS and software for the cluster nodes.

Q: EMR disk volume type and size?

A: Used EBS gp2/gp3 volumes. Size depends on job and data, usually 100–500 GB per node.

Q: Was data stored on HDFS or S3?

A: Data stored primarily on Amazon S3 using EMRFS. Temporary data may go to HDFS.

Q: What is EMRFS?

A: EMR File System (EMRFS) enables EMR clusters to use S3 as the primary data source.

Q: What are tags in EMR?

A: Metadata key-value pairs for organizing and cost tracking (e.g., Project, Env, Owner)

Q: What are bootstrap actions in EMR?

A: Shell scripts that run during cluster startup to install libraries or configure the environment.
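
A hedged boto3 sketch of attaching a bootstrap script at cluster creation (names, roles, and the script path are hypothetical):

import boto3

emr = boto3.client("emr")
emr.run_job_flow(
    Name="etl-cluster",
    ReleaseLabel="emr-6.6.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    BootstrapActions=[{
        "Name": "install-python-libs",
        "ScriptBootstrapAction": {
            # Shell script in S3 that runs on every node at startup
            "Path": "s3://my-bucket/bootstrap/install_libs.sh",
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)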

Q: What is instance fleet and when do we use it?

A: Instance Fleet enables use of multiple instance types within a role (master/core/task) for cost and
availability optimization.

Q: How do you increase EMR instance quota in an AZ?

A: Raise a Service Quota Increase request in AWS Support.

Q: What is auto termination in EMR?

A: Automatically terminates the cluster after it has been idle for a specified time. Useful for cost savings.
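
A short boto3 sketch (the cluster ID is hypothetical; IdleTimeout is in seconds):

import boto3

emr = boto3.client("emr")
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    AutoTerminationPolicy={"IdleTimeout": 3600},  # terminate after 1 idle hour
)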

Q: Use of security groups in EMR?

A: Control network access to cluster nodes via inbound/outbound rules.

Q: Data volume and format?

A: Processed hundreds of GBs to TBs. Formats: Parquet, CSV, JSON, ORC.

Q: What is Parquet/ORC file format? Columnar vs Row-based formats?

A:

 Parquet/ORC: Columnar formats, better for analytical queries

 CSV/JSON: Row-based, simpler but slower for large data

Q: How do you tune EMR performance?

A:

 Partition tuning

 Data coalescing

 Executor memory and core tuning

 Using broadcast joins and caching

 Choosing optimized file formats (Parquet/ORC)
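
Several of these combine naturally; a PySpark sketch (paths and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("tuning-sketch").getOrCreate()

large_df = spark.read.parquet("s3://my-bucket/facts/")  # partitioned Parquet input
small_df = spark.read.parquet("s3://my-bucket/dims/")

# Broadcast the small dimension table to avoid a shuffle join
joined = large_df.join(broadcast(small_df), "customer_id")

# Coalesce before writing to avoid producing many small output files
joined.coalesce(32).write.mode("overwrite").parquet("s3://my-bucket/output/")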


AWS Glue Interview Questions and Answers

AWS Interview Questions and Answers – Section 2: AWS Glue

Q: What is AWS Glue? Why do we use the Glue service?

A: AWS Glue is a serverless data integration and ETL (Extract, Transform, Load) service. It is used to
prepare and transform data for analytics, machine learning, and application development by
automating the ETL process.

Q: What type of Glue engine have you used?

A: We used the Spark-based Glue engine (versions 2.0 and 3.0), which supports distributed
processing with PySpark.

Q: What Python version have you used in Glue?

A: Python 3.x, primarily with Glue version 2.0 and above.

Q: What is a DPU (Data Processing Unit)?

A: A DPU provides the processing resources for Glue. 1 DPU = 4 vCPU + 16 GB RAM. You can scale
jobs by allocating more DPUs.

Q: What transformations have you used in your Glue job?

A:

 DropFields, ApplyMapping, ResolveChoice, SelectFields, Join, Relationalize, Unnest

 Also used custom transformations using PySpark and DynamicFrame functions.
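
A minimal sketch chaining two of these transforms (database, table, and field names are hypothetical):

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping, DropFields

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# Rename/cast columns, then drop an unwanted field
mapped = ApplyMapping.apply(
    frame=dyf,
    mappings=[("order_id", "string", "order_id", "long"),
              ("amount", "string", "amount", "double")])
cleaned = DropFields.apply(frame=mapped, paths=["_corrupt_record"])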

Q: Have you used custom scripts for your Glue job?

A: Yes, we wrote custom PySpark scripts for complex ETL logic and reusable components.

Q: What libraries and packages have you used?

A:

 pandas, boto3, pyarrow, numpy, requests

 Additionally, custom .whl and .zip packages uploaded to S3


Q: How do you use custom packages or libraries in Glue?

A:

 Upload the package to S3

 Use job parameter: --additional-python-modules or --extra-py-files
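
For example (module versions and the S3 path are illustrative):

--additional-python-modules  pyarrow==7.0.0,openpyxl==3.0.10
--extra-py-files  s3://my-bucket/libs/my_helpers.zip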

Q: How do we pass parameters to a Glue job?

A:

 Using the --arguments flag (AWS CLI) or job parameters in the job configuration

 Programmatically via the Arguments map in boto3's start_job_run, or through Triggers
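
A sketch of both sides (job and argument names are hypothetical):

import sys
import boto3
from awsglue.utils import getResolvedOptions

# Inside the Glue script: resolve the arguments passed to this run
args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path"])
print(args["input_path"])

# From a caller: trigger the job with arguments via boto3
glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-etl-job",
    Arguments={"--input_path": "s3://my-bucket/raw/2024-01-01/"},
)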

Q: What is the use of Tags in Glue?

A: Tags are used to organize, identify, and control access or costs associated with Glue resources.

Q: How do you track changes in a Glue job (versioning)?

A:

 Track via source control (Git)

 Document changes using job description/comments

 Enable job history tracking in AWS

Q: How do we monitor a Glue job?

A:

 CloudWatch Logs

 Glue Job Run History

 Metrics like DPU hours, job duration, success/failure states

Q: How do we troubleshoot a Glue job?

A:

 Analyze logs in CloudWatch

 Use try-except in PySpark code

 Use job metrics and bookmarks to resume jobs from failure


Q: How do we develop Glue jobs locally?

A:

 Use AWS-provided Glue Docker image for local testing

 Develop in PyCharm/Jupyter and simulate job runs with local inputs

Q: How can we push the Glue job code to a repository?

A:

 Use Git or GitHub

 Deploy with CI/CD tools like CodePipeline, Jenkins, or manually through boto3 scripts

Q: What if a Glue job takes too long to execute?

A:

 Tune transformations

 Use pushdown predicates (see the sketch after this list)

 Increase DPU allocation

 Optimize joins and repartition data
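
For the pushdown-predicate point, a hedged sketch (database, table, and partition keys are hypothetical):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Only partitions matching the predicate are read from S3
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
    push_down_predicate="year == '2024' and month == '06'",
)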

Q: How do I set or increase job timeout?

A:

 Set job timeout (max 48 hrs) in the Glue console/job config

 Also configurable through boto3 or CLI
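
A boto3 sketch setting the timeout at job creation (name, role, and script path are hypothetical; Timeout is in minutes):

import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="my-etl-job",
    Role="GlueServiceRole",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    Timeout=2880,  # 48 hours, the maximum
)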

Q: What is a Glue workflow and its use?

A:

 Glue Workflow orchestrates multiple jobs, crawlers, and triggers

 Ensures dependency management and end-to-end ETL pipeline automation

Q: Glue workflow vs Apache Airflow?

A:

 Glue Workflow: Simple, AWS-native, serverless, tightly integrated with Glue

 Airflow: Complex DAGs, cross-service orchestration, supports retries/conditionals

Q: What is a Glue Crawler? When do we use it?

A:

 A crawler scans a data store (S3, JDBC, etc.), infers schema, and populates the Glue Data
Catalog

 Used before querying via Athena or ETL

Q: What is Glue Data Catalog? How do you update it?

A:

 Centralized metadata repository

 Updated using crawlers, scripts, or boto3 API
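
For example, re-running a crawler from code (the crawler name is hypothetical):

import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="orders-crawler")  # refreshes table definitions in the catalog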

Q: How do you detect schema changes automatically?

A:

 Enable “Update in-place” or “Add new columns” in crawler settings

 Monitor schema evolution logs

Q: What is a Schema Registry? Why is it used?

A:

 AWS Glue Schema Registry manages schema versions for streaming data (e.g., Kafka)

 Ensures backward/forward compatibility

Q: How do you performance tune Glue jobs?

A:

 Use pushdown predicates

 Avoid shuffles/re-partitions unless necessary

 Read data in partitioned format (e.g., Parquet)

 Optimize DPU allocation and parallelism


AWS S3 Interview Questions and Answers

AWS Interview Questions and Answers – Section 3: Amazon S3 (Simple Storage Service)

Q: What is the purpose of S3? What is an object store?

A: Amazon S3 is a highly scalable, durable object storage service used to store any amount of data.
An object store stores data as objects (key, value, metadata) rather than in blocks or files.

Q: What is the difference between object store and block store?

A:

 Object Store (S3): Stores data with metadata and unique key. Ideal for backups, media, logs.

 Block Store (EBS): Divides data into fixed-size blocks. Ideal for databases, file systems, etc.

Q: What is the maximum size of a single object in S3?

A: Up to 5 TB per object. Multipart uploads are recommended for objects larger than 100 MB.

Q: What is the total size limit for S3?

A: There is no limit. S3 scales automatically to store any amount of data.

Q: Can two buckets have the same name?

A: No. Bucket names are globally unique across all AWS accounts and regions.

Q: Can two different accounts have buckets with the same name in different regions?

A: No. Bucket names must be globally unique regardless of region or account.

Q: What is bucket versioning? How can I recover a deleted file?

A:

 Versioning retains all versions of an object.

 If a file is deleted, it creates a delete marker. You can restore the file by deleting the delete
marker.

Q: What is a delete marker?

A: A delete marker is a placeholder indicating the deletion of the latest version.
Removing it restores the previous version as the latest.
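
A boto3 sketch of restoring a deleted object by removing its delete marker (bucket and key are hypothetical):

import boto3

s3 = boto3.client("s3")
versions = s3.list_object_versions(Bucket="my-bucket", Prefix="reports/q1.csv")
for marker in versions.get("DeleteMarkers", []):
    if marker["IsLatest"]:
        # Deleting the delete marker makes the previous version current again
        s3.delete_object(Bucket="my-bucket",
                         Key=marker["Key"],
                         VersionId=marker["VersionId"])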

Q: How is S3 data encrypted?

A:

 At rest: SSE-S3 (AES-256), SSE-KMS, SSE-C

 In transit: HTTPS (TLS/SSL)

Q: What are the different S3 storage classes (cost layers)?

A:

 S3 Standard

 S3 Intelligent-Tiering

 S3 Standard-IA

 S3 One Zone-IA

 S3 Glacier Instant Retrieval

 S3 Glacier Flexible Retrieval

 S3 Glacier Deep Archive

Q: How do I log and get notified when someone accesses an object in a bucket?

A:

 Enable Access Logs or CloudTrail

 Configure Event Notifications using SNS, Lambda, or SQS for actions like PUT, DELETE, etc.

Q: How to classify and move data to save cost?

A:

 Use S3 Lifecycle Policies to transition data across storage classes based on age or access
pattern.
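
A boto3 sketch of such a policy (bucket, prefix, and day counts are hypothetical):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-old-logs",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
    }]},
)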

Q: What is Glacier storage? When is it used?

A: Glacier is archival storage used for infrequently accessed data (compliance, backups).
Retrieval times range from milliseconds to hours depending on the tier.

Q: How do I prevent users from deleting objects in S3?

A:

 Use IAM policies with Deny s3:DeleteObject

 Use S3 Object Lock with governance or compliance mode
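
A hedged sketch of the policy approach (the bucket name is hypothetical; a blanket Deny like this blocks every principal, so scope it to specific users in practice):

import json
import boto3

s3 = boto3.client("s3")
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteObject",
        "Resource": "arn:aws:s3:::my-bucket/*",
    }],
}
s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))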

Q: How do I restrict access to specific users?

A:

 Use Bucket Policies, IAM Policies, and/or Access Control Lists (ACLs)

 Define who can access, what actions are allowed, and under what conditions

Q: How do you give access to users from different AWS accounts (CORS)?

A:

 Use Bucket Policy or Cross-Account IAM Role

 CORS (Cross-Origin Resource Sharing) applies to browser-based access

Q: How do you monitor S3 usage (size, objects, etc.)?

A:

 Use S3 Storage Lens

 CloudWatch metrics

 AWS Cost Explorer

 CloudTrail for audit logs

Q: I want to delete objects older than 180 days. How?

A:

 Configure an S3 Lifecycle Rule to delete objects based on their last modified date (e.g., after
180 days).
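
The same lifecycle API handles expiry; a short sketch (bucket and prefix are hypothetical):

import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "expire-after-180-days",
        "Status": "Enabled",
        "Filter": {"Prefix": "staging/"},
        "Expiration": {"Days": 180},  # delete 180 days after creation
    }]},
)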

Q: How to ensure data is protected if a region fails?

A:

 Enable Cross-Region Replication (CRR) to replicate objects to another region automatically.


AWS RDS Interview Questions and Answers

AWS Interview Questions and Answers – Section 4: Amazon RDS (Relational Database Service)

Q: What is the difference between RDS and hosting a database on-premises or on EC2?

A:

 RDS is a fully managed service. AWS handles backups, patching, failover, and scaling.

 On EC2 or on-premises, you must manage everything yourself, including high availability,
backups, and software updates.

Q: What types of databases are available in RDS?

A:

 Amazon Aurora (MySQL/PostgreSQL compatible)

 MySQL

 PostgreSQL

 MariaDB

 Oracle

 SQL Server

Q: What are the differences between these databases? Which one have you used?

A:

 Aurora: AWS-optimized, high performance, fault-tolerant

 MySQL/PostgreSQL: Open-source, widely supported

 SQL Server/Oracle: Enterprise-grade, with commercial licensing

I have used PostgreSQL and Aurora for transactional and analytical workloads.

Q: How do I ensure high availability in RDS?

A:

 Use Multi-AZ deployment, which maintains a synchronous standby replica in another AZ.

 Automatic failover is handled by AWS in case of failure.
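
A boto3 sketch of the Multi-AZ flag at creation time (identifiers and credentials are hypothetical):

import boto3

rds = boto3.client("rds")
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    Engine="postgres",
    DBInstanceClass="db.r5.large",
    AllocatedStorage=100,
    MasterUsername="dbadmin",
    MasterUserPassword="REPLACE_ME",
    MultiAZ=True,  # synchronous standby in another AZ with automatic failover
)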

Q: How do you handle load on an RDS instance?

A:

 Use Read Replicas to offload read traffic

 Scale up instance class

 Optimize query performance and indexes
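
For the read-replica point, a short boto3 sketch (identifiers are hypothetical):

import boto3

rds = boto3.client("rds")
# Create a replica of the source instance; read traffic can be routed to it
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="orders-db-replica-1",
    SourceDBInstanceIdentifier="orders-db",
)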

Q: How do you manage growing storage in RDS?

A:

 Enable storage autoscaling

 Monitor via CloudWatch and plan upgrades as needed

Q: How do you back up your database automatically?

A:

 Enable automated backups (retention up to 35 days)

 Use manual snapshots for long-term retention (automated backups enable point-in-time recovery)
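
A boto3 sketch of adjusting the retention window (the identifier is hypothetical; the maximum is 35 days):

import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="orders-db",
    BackupRetentionPeriod=14,  # days of automated backups to keep
    ApplyImmediately=True,
)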

Q: How do you secure RDS data?

A:

 Enable encryption at rest using AWS KMS

 Use SSL for in-transit encryption

 Restrict access using VPC Security Groups and IAM roles

Q: Who handles patching and updates for RDS?

A: AWS automatically handles patching during the preferred maintenance window you configure.

Q: Can you define maintenance windows for AWS RDS?

A: Yes, you can set a maintenance window in which AWS applies patches or upgrades.

Q: How do you monitor and troubleshoot slow queries?

A:

 Enable Performance Insights

 Use Enhanced Monitoring, CloudWatch, and query logs

 Optimize slow queries using EXPLAIN plans

Q: How do you restrict access to RDS instances?

A:

 Use VPC security groups, DB subnet groups, and IAM policies

 Ensure only approved applications or users can connect

Q: What’s the difference between SQL and NoSQL databases?

A:

 SQL (Relational): Structured schema, ACID compliant (e.g., MySQL, PostgreSQL)

 NoSQL: Schema-less, scalable horizontally (e.g., DynamoDB, MongoDB)

Q: What’s the difference between a snapshot and a backup?

A:

 Automated backups are periodic and support point-in-time recovery

 Snapshots are user-initiated and persist until manually deleted

Q: What RDS instance types have you used and for what environments?

A:

 db.t3.medium or db.t3.micro for dev/test

 db.r5.large or Aurora Serverless for production
