Top Data Engineering Interview
Questions with Answers
## Python Interview Questions
### Intermediate
1. **What are Python’s key features?**
**Answer:** Interpreted, dynamically typed, object-oriented, portable, and has extensive
libraries.
2. **Explain list comprehension with an example.**
```python
nums = [x for x in range(10) if x % 2 == 0]
```
3. **What is the difference between `is` and `==`?**
- `==` checks value equality.
- `is` checks object identity.
4. **What are *args and **kwargs?**
```python
def example(*args, **kwargs):
print(args, kwargs)
```
5. **How is memory managed in Python?**
- Managed using reference counting and garbage collection.
### Advanced
1. **Explain Python's GIL.**
**Answer:** Global Interpreter Lock allows only one thread to execute Python bytecode at
a time.
2. **What are decorators?**
```python
def decorator(func):
def wrapper():
print("Before")
func()
print("After")
return wrapper
@decorator
def greet():
print("Hello")
```
3. **Difference between deep copy and shallow copy.**
- `copy()` creates a shallow copy.
- `deepcopy()` creates a full independent copy.
4. **Python OOP concepts:** Inheritance, polymorphism, encapsulation, abstraction.
5. **Generators vs Iterators.**
```python
def gen():
yield 1
yield 2
```
## PySpark Interview Questions
### Intermediate
1. **Transformations vs Actions?**
```python
df.filter(df.age > 30) # Transformation
df.show() # Action
```
2. **Wide vs Narrow Transformations:**
- Narrow: `filter`, `map` (no shuffling)
- Wide: `groupBy`, `join` (requires shuffle)
3. **Joins in PySpark:**
```python
df1.join(df2, df1.id == df2.id, 'inner')
```
4. **Using UDFs:**
```python
udf_func = udf(lambda x: x.upper(), StringType())
df.withColumn("upper", udf_func(df.name))
```
5. **Broadcast variables:**
```python
broadcast_var = sc.broadcast([1, 2])
acc = sc.accumulator(0)
```
### Advanced
1. **Catalyst Optimizer & Tungsten Engine**
2. **Coalesce vs Repartition**
```python
df.coalesce(1), df.repartition(10)
```
3. **Skew handling:** Salt keys, broadcast small tables
4. **Performance tuning:** Cache, partitioning, pruning
5. **Delta Lake + Structured Streaming**
## SQL Interview Questions
### Intermediate
1. **CTEs:**
```sql
WITH temp AS (SELECT * FROM employees)
SELECT * FROM temp;
```
2. **Recursive Queries**
3. **Pivot/Unpivot**
4. **Indexes:** Speed up search queries
5. **Explain/Analyze** for query plans
### Advanced
1. **Second Highest Salary:**
```sql
SELECT MAX(salary) FROM emp WHERE salary < (SELECT MAX(salary) FROM emp);
```
2. **OLAP vs OLTP**
3. **Temporal Queries**
```sql
SELECT id, LAG(salary) OVER (...) FROM table;
```
4. **Star vs Snowflake Schema**
5. **Materialized Views**
## AWS Interview Questions
### Intermediate
1. **S3 vs EBS**
2. **IAM concepts**
3. **Glue vs EMR**
4. **Athena SQL on S3**
5. **Security Best Practices**
### Advanced
1. **Data Pipeline using S3, Kinesis, Glue, Athena**
2. **Redshift Tuning**
3. **AWS Lake Formation**
4. **CloudWatch vs CloudTrail**
5. **Kinesis Real-Time Example:**
```python
boto3.client('kinesis').put_record(...)
```