
Interview Questions and Answers for Data Engineering Role

SECTION 1: Python (Pandas, NumPy, Matplotlib) - Data Exploration & Visualization

Q: What are the key differences between Pandas Series and NumPy arrays?

A: A Pandas Series has axis labels (an index) and can hold heterogeneous data types. NumPy arrays are more efficient for numerical computations but lack axis labels.
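
A minimal illustration (the values and labels are made up):

import numpy as np
import pandas as pd

# A Series carries an index (axis labels) and can mix data types
s = pd.Series([10, "text", 3.5], index=["a", "b", "c"])
print(s["b"])        # label-based access -> "text"

# A NumPy array is homogeneous and positional, but faster for numeric work
arr = np.array([10, 20, 30])
print(arr * 2)       # vectorized arithmetic -> [20 40 60]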

Q: How would you handle missing values in a dataset using Pandas?

A: Use isnull() to detect missing values, dropna() to remove rows or columns that contain them, or fillna() to fill them with a constant or a statistic such as the mean.
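
For example, a quick sketch on a toy DataFrame (column names are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["NY", "LA", None]})

print(df.isnull().sum())   # count missing values per column
cleaned = df.dropna()      # drop rows containing any missing value
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})  # fill per column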

Q: Write a Pandas code to group a dataset by a column and calculate the mean of each group.

A: df.groupby('column_name').mean()

Q: How do you merge two dataframes in Pandas?

A: Using pd.merge(df1, df2, on='key') with join types like 'inner', 'left', 'right', or 'outer'.

Q: What is the difference between .loc[] and .iloc[] in Pandas?

A: .loc[] is label-based, while .iloc[] is integer-position based.
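
A short example (the index labels are hypothetical):

import pandas as pd

df = pd.DataFrame({"salary": [50, 60, 70]}, index=["alice", "bob", "carol"])

print(df.loc["bob", "salary"])   # label-based: row 'bob', column 'salary' -> 60
print(df.iloc[1, 0])             # position-based: second row, first column -> 60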

SECTION 2: Data Transformations using Python / PySpark / SQL


Q: How would you transform a wide dataset into a long format using Pandas?

A: Using pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])

Q: Write a SQL query to extract the top 3 highest-paid employees in each department.

A: SELECT * FROM (
       SELECT *, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
       FROM employees
   ) ranked
   WHERE rnk <= 3;

Q: What's the difference between map(), apply(), and applymap() in Pandas?

A: map() works element-wise on a Series; apply() works on a Series or along the rows/columns of a DataFrame; applymap() applies a function element-wise to an entire DataFrame.
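
A compact sketch of the three (toy data; note that recent pandas versions deprecate applymap() in favour of DataFrame.map()):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df["a"].map(lambda x: x * 10)             # element-wise on a single Series
df.apply(lambda col: col.sum(), axis=0)   # per column (axis=1 for per row)
df.applymap(lambda x: x + 1)              # element-wise on the whole DataFrame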

Q: How would you handle duplicate data entries in Pandas?

A: Using df.duplicated() to find and df.drop_duplicates() to remove duplicates.
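
For example (toy data, with a hypothetical key column 'id'):

import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "value": ["x", "x", "y"]})

print(df.duplicated())                                           # boolean mask of repeated rows
deduped = df.drop_duplicates()                                   # drop full-row duplicates
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")  # dedupe on a key column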


Q: Describe how you can write a custom UDF in PySpark and apply it to a DataFrame.

A: Define a plain Python function, wrap it with pyspark.sql.functions.udf() (specifying the return type), and apply it to a column with withColumn().
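
A minimal sketch, assuming a running SparkSession and a hypothetical 'name' column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# 1. Define a plain Python function
def shout(s):
    return s.upper() + "!"

# 2. Wrap it with udf(), declaring the return type
shout_udf = udf(shout, StringType())

# 3. Apply it to a column with withColumn()
df = df.withColumn("name_shouted", shout_udf(df["name"]))
df.show()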

SECTION 3: Apache Airflow / Workflow Orchestration


Q: What is Apache Airflow and why is it used in data engineering?

A: Airflow is a workflow orchestration tool used to author, schedule, and monitor data workflows.

Q: What is a DAG in Airflow and how is it defined?

A: A Directed Acyclic Graph defined in Python code representing task dependencies.

Q: How do you schedule tasks in Airflow using cron expressions?

A: Set the 'schedule_interval' parameter in the DAG definition using cron syntax.

Q: What is the role of task dependencies in Airflow and how do you define them?

A: Dependencies ensure tasks run in order using bitshift operators (>> or <<).
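
A minimal DAG sketch covering both the cron schedule and the bitshift dependencies (task names are made up; Airflow 2.4+ also accepts 'schedule' in place of 'schedule_interval'):

from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",   # cron expression: every day at 06:00
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract")
    transform = EmptyOperator(task_id="transform")
    load = EmptyOperator(task_id="load")

    extract >> transform >> load     # run in order: extract -> transform -> load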

Q: Explain the purpose of the @dag and @task decorators in Airflow 2.x.

A: They are part of the TaskFlow API: @dag turns a Python function into a DAG definition and @task turns plain Python functions into tasks, with data passing between tasks handled automatically via XComs.
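
A short TaskFlow-style sketch (function names and return values are made up; assumes Airflow 2.4 or later):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def taskflow_pipeline():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())   # the dependency (and XCom passing) is inferred from the call

taskflow_pipeline()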

SECTION 4: Optional - Cloud Data Pipeline Tools (ADF / AWS Glue / GCP Dataflow)

Q: What is Azure Data Factory and what are its main components?

A: ADF is a cloud-based ETL service. Main components: Pipelines, Activities, Linked Services, Datasets.

Q: Compare Azure Data Factory and Apache Airflow.

A: ADF is managed, GUI-based. Airflow is open-source and code-based with more customization.

Q: What is AWS Glue and how does it differ from traditional ETL tools?

A: AWS Glue is serverless and can automatically generate ETL code, whereas traditional ETL tools require you to provision and manage infrastructure manually.

Q: Explain the concept of Dataflow in GCP and its use cases.

A: Dataflow is a serverless stream and batch processing tool based on Apache Beam.

Q: What are the advantages of using managed cloud orchestration services over Airflow?

A: Scalability, less maintenance, integrated monitoring, and better cloud integration.

SECTION 5: Writing and Debugging Python Scripts for Data Preprocessing and Modeling

Q: What are the common steps involved in data preprocessing using Python?

A: Typical steps include handling missing values, encoding categorical variables, feature scaling, and splitting the data into training and test sets.

Q: How do you handle missing values in Python?

A: Using Pandas: df.dropna(), df.fillna(), or imputation with sklearn's SimpleImputer.
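
For instance, a small sketch with sklearn's SimpleImputer (column names are hypothetical):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50000, 62000, np.nan]})

imputer = SimpleImputer(strategy="mean")   # also "median", "most_frequent", "constant"
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)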

Q: What is the difference between Label Encoding and One-Hot Encoding?

A: Label Encoding assigns a unique integer to each category, while One-Hot Encoding creates a separate binary column for each category.
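
A quick illustration of both approaches (the category values are made up):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "green", "red", "blue"])

labels = LabelEncoder().fit_transform(colors)      # one integer per category, e.g. [2, 1, 2, 0]
one_hot = pd.get_dummies(colors, prefix="color")   # one binary column per category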

Q: How do you normalize or standardize a dataset in Python?

A: Use sklearn.preprocessing.StandardScaler for standardization (zero mean, unit variance) or MinMaxScaler for normalization (rescaling to a fixed range such as [0, 1]).
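
For instance, on a toy feature matrix:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)    # zero mean, unit variance per column
X_norm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]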

Q: How would you debug a Python script that is throwing an error in the model training phase?

A: Use print/logging statements, the traceback module, and an IDE debugger to inspect data types, model parameters, and the training flow.

Q: What are some common issues you might face while training a machine learning model?

A: Overfitting, underfitting, data imbalance, feature leakage, incorrect data preprocessing.

Q: How do you split a dataset into training and testing sets in Python?

A: Using train_test_split() from sklearn.model_selection.
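
For example, with dummy features and labels:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features (dummy data)
y = np.array([0, 1] * 5)           # dummy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y   # stratify keeps class proportions
)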

Q: How do you evaluate the performance of a classification model?

A: Using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
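
A short sketch, assuming you already have true labels, predicted labels, and predicted probabilities:

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]   # predicted probability of the positive class

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))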

Q: What is cross-validation and why is it useful?

A: Cross-validation evaluates model performance by repeatedly splitting the data into multiple train-test folds and averaging the scores, which gives a more reliable estimate than a single split.
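
For example, 5-fold cross-validation with cross_val_score (the model and dataset are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)   # 5 train/test folds
print(scores.mean(), scores.std())
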
Q: How do you save and load a trained model in Python?

A: Use the joblib or pickle libraries to serialize the trained model object to a file (commonly .pkl) and load it back later for inference.
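
For instance (the model and file name are arbitrary):

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

joblib.dump(model, "model.pkl")      # save the trained model to disk
loaded = joblib.load("model.pkl")    # load it back later for predictions
print(loaded.predict(X[:3]))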
