
# Top Data Engineering Interview Questions with Answers

---

## Python Interview Questions

### Intermediate

1. **What are Python’s key features?**


**Answer:** Python is:
- Interpreted and dynamically typed: no separate compilation step and no explicit type declarations.
- Object-oriented and supports multiple programming paradigms.
- Highly readable and concise.
- Comes with a large standard library.
- Portable and supports integration with other languages.

2. **Explain list comprehension with an example.**


**Answer:** It's a concise way to create lists.
```python
nums = [x for x in range(10) if x % 2 == 0] # [0, 2, 4, 6, 8]
```

3. **What is the difference between `is` and `==`?**


**Answer:**
- `==` compares values.
- `is` compares memory locations (object identity).
```python
a = [1, 2]; b = [1, 2]
a == b # True
a is b # False
```

4. **What are `*args` and `**kwargs`?**


**Answer:** They allow variable numbers of arguments: `*args` collects extra positional arguments into a tuple; `**kwargs` collects extra keyword arguments into a dict.
```python
def example(*args, **kwargs):
    print(args, kwargs)

example(1, 2, a=3, b=4)  # (1, 2) {'a': 3, 'b': 4}
```

5. **How is memory managed in Python?**


**Answer:**
- Python uses reference counting plus a cyclic garbage collector to reclaim objects.
- All objects live in a private heap managed by the interpreter.
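
A quick way to observe both mechanisms, as a minimal sketch:

```python
import gc
import sys

a = []
b = a                      # a second reference to the same list
print(sys.getrefcount(a))  # counts a, b, and getrefcount's own argument

a.append(a)                # create a reference cycle
del a, b
print(gc.collect())        # the cyclic collector reclaims the cycle
```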

### Advanced

1. **Explain Python's GIL.**


**Answer:** The Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time in CPython. It simplifies memory management (reference counting is not thread-safe on its own) but limits CPU-bound multi-threading; I/O-bound threads are largely unaffected.
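
For CPU-bound work, `multiprocessing` sidesteps the GIL by using separate interpreter processes, each with its own lock; a minimal sketch:

```python
from multiprocessing import Pool

def cpu_bound(n):
    # Threads would serialize on the GIL here;
    # separate processes run truly in parallel.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(cpu_bound, [10**6] * 4))
```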

2. **What are decorators?**


**Answer:** Decorators wrap functions to add functionality.
```python
def decorator(func):
    def wrapper():
        print("Before")
        func()
        print("After")
    return wrapper

@decorator
def greet():
    print("Hello")

greet()  # Output: Before, Hello, After
```

3. **Difference between deep copy and shallow copy.**


**Answer:**
- Shallow copy creates a new outer object but keeps references to the nested objects.
- Deep copy recursively copies every nested object.
```python
import copy

a = [[1, 2]]
shallow = copy.copy(a)    # new outer list, same inner list
deep = copy.deepcopy(a)   # fully independent copy

a[0].append(3)
print(shallow)  # [[1, 2, 3]] (shares the inner list)
print(deep)     # [[1, 2]] (unaffected)
```

4. **Python OOP concepts** (see the sketch below):

- Inheritance: acquiring attributes and methods from a parent class.
- Polymorphism: the same interface with different behavior per class.
- Encapsulation: restricting direct access to internal state (data hiding).
- Abstraction: hiding internal implementation behind a simple interface.
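
A minimal sketch tying the four concepts together (class names are illustrative):

```python
from abc import ABC, abstractmethod

class Animal(ABC):              # abstraction: abstract base class
    def __init__(self, name):
        self._name = name       # encapsulation: internal by convention

    @abstractmethod
    def speak(self): ...

class Dog(Animal):              # inheritance
    def speak(self):            # polymorphism: per-class behavior
        return f"{self._name} says Woof"

class Cat(Animal):
    def speak(self):
        return f"{self._name} says Meow"

for pet in (Dog("Rex"), Cat("Mia")):
    print(pet.speak())
```
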
5. **Generators vs Iterators.**

**Answer:**
- Iterators: objects implementing `__iter__()` and `__next__()`.
- Generators: functions using `yield`; calling one returns an iterator that produces values lazily.
```python
def gen():
    yield 1
    yield 2

for i in gen():
    print(i)  # prints 1 then 2
```

---

## PySpark Interview Questions

### Intermediate

1. **Transformations vs Actions?**
- Transformations are lazy and only build the execution plan (e.g., `filter`, `map`); actions trigger the actual computation (e.g., `show`, `collect`). See the sketch below.
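
A minimal sketch of the lazy/eager split (assumes an existing `SparkSession` named `spark`):

```python
df = spark.range(10)               # transformation: nothing runs yet
evens = df.filter(df.id % 2 == 0)  # still lazy: only extends the plan
evens.show()                       # action: triggers the actual job
```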

2. **Wide vs Narrow Transformations:**

- Narrow: no shuffle; each output partition depends on a single input partition (e.g., `map`, `filter`).
- Wide: requires shuffling data across partitions (e.g., `reduceByKey`, `join`).

3. **Joins in PySpark:**
```python
df1.join(df2, df1.id == df2.id, 'inner')
```
Types: `inner`, `left`, `right`, `outer`, `semi`, `anti`

4. **Using UDFs:**
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# None-safe: a bare x.upper() would raise on null values
upper_udf = udf(lambda x: x.upper() if x else None, StringType())

df.withColumn("upper_name", upper_udf(df.name)).show()
```
Prefer built-in functions (e.g., `pyspark.sql.functions.upper`) over UDFs where possible; UDFs are opaque to the Catalyst optimizer.

5. **Broadcast variables:**
```python
bc_var = sc.broadcast([1, 2, 3])
print(bc_var.value)
```
Broadcast variables ship a read-only value to every executor once, instead of with every task; the same idea underlies broadcast joins (see the sketch below).
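
For joins specifically, the DataFrame API offers a broadcast hint; a minimal sketch (assumes DataFrames `df_large` and `df_small` sharing an `id` column):

```python
from pyspark.sql.functions import broadcast

# Ship the small table to every executor; the large one is never shuffled
result = df_large.join(broadcast(df_small), "id")
```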

### Advanced

1. **Catalyst Optimizer:** Spark SQL's query optimizer; applies rule-based and cost-based rewrites to the logical plan and selects an efficient physical plan.
2. **Tungsten Engine:** improves memory and CPU efficiency through off-heap memory management and whole-stage code generation.
3. **Coalesce vs Repartition:**
- `coalesce(n)`: reduces the partition count by merging partitions, avoiding a full shuffle.
- `repartition(n)`: can increase or decrease the count; always performs a full shuffle.
4. **Skew handling:** salt the hot keys, broadcast the small side of the join, or enable Adaptive Query Execution (see the salting sketch below).
5. **Delta Lake & Streaming:** ACID transactions on the data lake plus real-time ingestion via Structured Streaming.
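
A minimal salting sketch for a skewed join key (DataFrame and column names are illustrative; assumes an existing `SparkSession` named `spark`):

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 8

# Spread each hot key over SALT_BUCKETS sub-keys on the large side...
big = df_big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate each row of the small side once per bucket
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
small = df_small.crossJoin(salts)

joined = big.join(small, ["key", "salt"]).drop("salt")
```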

---

## SQL Interview Questions

### Intermediate

1. **CTEs:**
```sql
WITH emp_cte AS (SELECT * FROM employees)
SELECT * FROM emp_cte;
```

2. **Recursive Queries:** useful for hierarchies.

```sql
WITH RECURSIVE nums AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 5
)
SELECT * FROM nums;
```

3. **Pivot/Unpivot:** transform rows to columns and vice versa (sketch below).
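
A portable pivot via conditional aggregation (table and column names are illustrative; some engines also provide a native `PIVOT` clause):

```sql
-- One row per department, one column per year
SELECT department,
       SUM(CASE WHEN year = 2023 THEN revenue END) AS revenue_2023,
       SUM(CASE WHEN year = 2024 THEN revenue END) AS revenue_2024
FROM sales
GROUP BY department;
```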


4. **Indexes:** speed up lookups and range scans, but slow down `INSERT`/`UPDATE`/`DELETE` because each index must be maintained.
5. **Explain/Analyze:** used to view query execution plans; `EXPLAIN ANALYZE` also runs the query and reports actual timings (example below).
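
A minimal example (PostgreSQL syntax; the filter column is illustrative):

```sql
EXPLAIN ANALYZE
SELECT * FROM employees WHERE department_id = 42;
```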

### Advanced

1. **Second Highest Salary:**

```sql
SELECT MAX(salary)
FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
```
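
An alternative with a window function, which generalizes to the N-th highest salary and handles ties explicitly:

```sql
SELECT salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
    FROM employees
) ranked
WHERE rnk = 2;
```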

2. **OLAP vs OLTP:** OLAP serves analytical workloads (large scans and aggregations over history); OLTP serves transactional workloads (many small, concurrent reads and writes).


3. **Temporal Queries:**
```sql
-- LAG returns the previous row's salary within each id's history
SELECT id,
       LAG(salary) OVER (PARTITION BY id ORDER BY date) AS prev_salary
FROM salaries;
```

4. **Star vs Snowflake Schema:**

- Star: denormalized dimensions; fewer joins, faster queries.
- Snowflake: normalized dimensions; less redundancy, more joins.

5. **Materialized Views:** store precomputed query results for fast reads, refreshed on demand or on a schedule (example below).
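
A minimal example (PostgreSQL syntax; names are illustrative):

```sql
CREATE MATERIALIZED VIEW dept_salary AS
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

-- Recompute when the underlying data changes
REFRESH MATERIALIZED VIEW dept_salary;
```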

---

## AWS Interview Questions

### Intermediate

1. **S3 vs EBS:**
- S3: object storage, accessed over HTTP(S); virtually unlimited scale.
- EBS: block storage, attached as a volume to a single EC2 instance.

2. **IAM Concepts:**
- Users, Groups, Roles, Policies

3. **Glue vs EMR:**
- Glue: Serverless ETL
- EMR: Hadoop/Spark cluster

4. **Athena SQL on S3:** query data in S3 directly using standard SQL, with no ETL or loading step (sketch below).
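
A minimal sketch (bucket path and columns are illustrative; Athena uses Hive-style DDL for external tables):

```sql
CREATE EXTERNAL TABLE access_logs (
    ts     string,
    status int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/logs/';

SELECT status, COUNT(*) AS hits
FROM access_logs
GROUP BY status;
```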


5. **Security Best Practices:**
- Least privilege
- Encryption at rest & transit

### Advanced

1. **Data Pipeline using S3, Kinesis, Glue, Athena:**
Real-time ingestion with Kinesis → Glue transform → S3 → Athena query.
2. **Redshift Tuning:** choose sort keys and distribution keys to match query patterns; run `VACUUM` and `ANALYZE` to reclaim space and refresh planner statistics (sketch below).
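
A minimal sketch of the DDL side of Redshift tuning (table and column names are illustrative):

```sql
CREATE TABLE sales (
    sale_id   BIGINT,
    store_id  INT,
    sale_date DATE
)
DISTKEY (store_id)    -- co-locate rows that join on store_id
SORTKEY (sale_date);  -- speed up date-range scans

VACUUM sales;   -- reclaim space and re-sort rows
ANALYZE sales;  -- refresh planner statistics
```
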
3. **Lake Formation:** Secure data lake on top of S3
4. **CloudWatch vs CloudTrail:**
- CloudWatch: Monitoring
- CloudTrail: API activity logging

5. **Kinesis Real-Time Example:**

```python
import boto3

kinesis = boto3.client('kinesis')
# PartitionKey determines which shard receives the record
kinesis.put_record(StreamName="my_stream", Data=b"my_data", PartitionKey="key")
```
