**Top Data Engineering Interview Questions with Answers**
---
## Python Interview Questions
### Intermediate
1. **What are Python’s key features?**
**Answer:** Python is:
- Interpreted and dynamically typed, so no need for compiling or declaring variables.
- Object-oriented and supports multiple programming paradigms.
- Highly readable and concise.
- Comes with a large standard library.
- Portable and supports integration with other languages.
2. **Explain list comprehension with an example.**
**Answer:** It's a concise way to create lists.
```python
nums = [x for x in range(10) if x % 2 == 0] # [0, 2, 4, 6, 8]
```
3. **What is the difference between `is` and `==`?**
**Answer:**
- `==` compares values.
- `is` compares memory locations (object identity).
```python
a = [1, 2]; b = [1, 2]
a == b # True
a is b # False
```
4. **What are *args and **kwargs?**
**Answer:** They allow variable numbers of arguments.
```python
def example(*args, **kwargs):
print(args, kwargs)
example(1, 2, a=3, b=4) # (1, 2) {'a': 3, 'b': 4}
```
5. **How is memory managed in Python?**
**Answer:**
- Python uses reference counting and a garbage collector to manage memory.
- Memory is managed in private heap space.
### Advanced
1. **Explain Python's GIL.**
**Answer:** Global Interpreter Lock allows only one thread to execute at a time in
CPython. It simplifies memory management but limits CPU-bound multi-threading.
2. **What are decorators?**
**Answer:** Decorators wrap functions to add functionality.
```python
def decorator(func):
def wrapper():
print("Before")
func()
print("After")
return wrapper
@decorator
def greet():
print("Hello")
greet() # Output: Before, Hello, After
```
3. **Difference between deep copy and shallow copy.**
**Answer:**
- Shallow copy creates a new object but copies references.
- Deep copy copies everything recursively.
```python
import copy
a = [[1, 2]]
shallow = copy.copy(a)
deep = copy.deepcopy(a)
```
4. **Python OOP concepts:**
- Inheritance: Acquiring properties of parent class.
- Polymorphism: Same interface, different behavior.
- Encapsulation: Data hiding.
- Abstraction: Hiding internal implementation.
5. **Generators vs Iterators.**
**Answer:**
- Iterators: Objects implementing `__iter__()` and `__next__()`.
- Generators: Functions using `yield` to return an iterator.
```python
def gen():
yield 1
yield 2
for i in gen(): print(i)
```
---
## PySpark Interview Questions
### Intermediate
1. **Transformations vs Actions?**
- Transformations are lazy (e.g., `filter`, `map`), actions trigger computation (e.g., `show`,
`collect`).
2. **Wide vs Narrow Transformations:**
- Narrow: No shuffle, e.g., `map`, `filter`
- Wide: Data shuffle, e.g., `reduceByKey`, `join`
3. **Joins in PySpark:**
```python
df1.join(df2, df1.id == df2.id, 'inner')
```
Types: inner, left, right, outer, semi, anti
4. **Using UDFs:**
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
upper_udf = udf(lambda x: x.upper(), StringType())
df.withColumn("upper_name", upper_udf(df.name)).show()
```
5. **Broadcast variables:**
```python
bc_var = sc.broadcast([1, 2, 3])
print(bc_var.value)
```
Used for performance improvement in joins.
### Advanced
1. **Catalyst Optimizer:** Optimizes logical and physical plans.
2. **Tungsten Engine:** Improves memory and CPU efficiency.
3. **Coalesce vs Repartition:**
- `coalesce(n)`: Less shuffling
- `repartition(n)`: Full shuffle
4. **Skew handling:** Add salt keys, use broadcast joins.
5. **Delta Lake & Streaming:** ACID compliance + real-time ingestion.
---
## SQL Interview Questions
### Intermediate
1. **CTEs:**
```sql
WITH emp_cte AS (SELECT * FROM employees)
SELECT * FROM emp_cte;
```
2. **Recursive Queries:** Useful for hierarchies.
```sql
WITH RECURSIVE nums AS (
SELECT 1 AS n
UNION ALL
SELECT n+1 FROM nums WHERE n < 5
) SELECT * FROM nums;
```
3. **Pivot/Unpivot:** Transform rows to columns and vice versa.
4. **Indexes:** Improve search speed, but slow down insert/update.
5. **Explain/Analyze:** Used to view query execution plans.
### Advanced
1. **Second Highest Salary:**
```sql
SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM
employees);
```
2. **OLAP vs OLTP:** OLAP is analytical, OLTP is transactional.
3. **Temporal Queries:**
```sql
SELECT id, LAG(salary) OVER (PARTITION BY id ORDER BY date) FROM salaries;
```
4. **Star vs Snowflake Schema:**
- Star: Denormalized, fast.
- Snowflake: Normalized, less redundancy.
5. **Materialized Views:** Store precomputed data for fast reads.
---
## AWS Interview Questions
### Intermediate
1. **S3 vs EBS:**
- S3: Object storage.
- EBS: Block storage.
2. **IAM Concepts:**
- Users, Groups, Roles, Policies
3. **Glue vs EMR:**
- Glue: Serverless ETL
- EMR: Hadoop/Spark cluster
4. **Athena SQL on S3:** Query directly using SQL without ETL.
5. **Security Best Practices:**
- Least privilege
- Encryption at rest & transit
### Advanced
1. **Data Pipeline using S3, Kinesis, Glue, Athena:**
Real-time ingestion with Kinesis → Glue transform → S3 → Athena query
2. **Redshift Tuning:** Sort keys, dist keys, vacuum, analyze
3. **Lake Formation:** Secure data lake on top of S3
4. **CloudWatch vs CloudTrail:**
- CloudWatch: Monitoring
- CloudTrail: API activity logging
5. **Kinesis Real-Time Example:**
```python
import boto3
kinesis = boto3.client('kinesis')
kinesis.put_record(StreamName="my_stream", Data=b"my_data", PartitionKey="key")
```