Interview Questions and Answers for Data Engineering Role
SECTION 1: Python (Pandas, NumPy, Matplotlib) - Data Exploration & Visualization
Q: What are the key differences between Pandas Series and NumPy arrays?
A: A Pandas Series carries an index (axis labels) and can hold heterogeneous data; NumPy arrays are
homogeneous and more efficient for numerical computation, but support only positional indexing.
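A minimal sketch of the difference (values and labels are arbitrary):
    import numpy as np
    import pandas as pd
    arr = np.array([10, 20, 30])                         # positional access only: arr[1]
    s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
    print(s['b'])                                        # label-based access -> 20
    print(s.values)                                      # the underlying NumPy array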
Q: How would you handle missing values in a dataset using Pandas?
A: Using methods like isnull() to detect, dropna() to remove, or fillna() to fill missing values.
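For example, a short sketch with hypothetical columns:
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['NY', 'LA', None]})
    print(df.isnull().sum())                             # missing values per column
    df_clean = df.dropna()                               # drop rows containing any NaN
    df_filled = df.fillna({'age': df['age'].mean(), 'city': 'unknown'})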
Q: Write a Pandas code to group a dataset by a column and calculate the mean of each group.
A: df.groupby('column_name').mean()
Q: How do you merge two dataframes in Pandas?
A: Using pd.merge(df1, df2, on='key') with join types like 'inner', 'left', 'right', or 'outer'.
Q: What is the difference between .loc[] and .iloc[] in Pandas?
A: .loc[] is label-based, while .iloc[] is integer-position based.
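A quick sketch of the contrast (index labels are arbitrary):
    import pandas as pd
    df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
    print(df.loc['b', 'x'])      # label-based: row 'b', column 'x' -> 2
    print(df.iloc[1, 0])         # position-based: second row, first column -> 2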
SECTION 2: Data Transformations using Python / PySpark / SQL
Q: How would you transform a wide dataset into a long format using Pandas?
A: Using pd.melt(df, id_vars=['id'], value_vars=['col1', 'col2'])
Q: Write a SQL query to extract the top 3 highest-paid employees in each department.
A: SELECT * FROM (SELECT *, RANK() OVER (PARTITION BY department ORDER BY salary
DESC) AS rnk FROM employees) ranked WHERE rnk <= 3;
Q: What's the difference between map(), apply(), and applymap() in Pandas?
A: map() works element-wise on a Series; apply() applies a function along an axis of a DataFrame
(or to a Series); applymap() applies a function element-wise to every cell of a DataFrame.
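A short sketch of the three (note that applymap() is deprecated in favour of DataFrame.map() in recent pandas releases):
    import pandas as pd
    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    s = df['a'].map(lambda v: v * 10)                    # element-wise on a Series
    col_sums = df.apply(lambda col: col.sum(), axis=0)   # function applied per column
    doubled = df.applymap(lambda v: v * 2)               # element-wise on the whole DataFrame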
Q: How would you handle duplicate data entries in Pandas?
A: Using df.duplicated() to find and df.drop_duplicates() to remove duplicates.
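For instance, with a hypothetical key column:
    import pandas as pd
    df = pd.DataFrame({'id': [1, 1, 2], 'value': ['a', 'a', 'b']})
    print(df.duplicated())                               # boolean mask of duplicate rows
    deduped = df.drop_duplicates(subset=['id'], keep='first')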
Q: Describe how you can write a custom UDF in PySpark and apply it to a DataFrame.
A: Define a Python function, wrap it with pyspark.sql.functions.udf() (specifying the return type),
and apply it to a column with withColumn() or select().
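A minimal sketch, assuming an existing DataFrame df with a hypothetical 'name' column:
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    def shout(value):
        return value.upper() if value is not None else None
    shout_udf = udf(shout, StringType())                 # wrap the Python function as a UDF
    df = df.withColumn('name_upper', shout_udf(df['name']))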
SECTION 3: Apache Airflow / Workflow Orchestration
Q: What is Apache Airflow and why is it used in data engineering?
A: Airflow is a workflow orchestration tool used to author, schedule, and monitor data workflows.
Q: What is a DAG in Airflow and how is it defined?
A: A Directed Acyclic Graph defined in Python code representing task dependencies.
Q: How do you schedule tasks in Airflow using cron expressions?
A: Set the 'schedule_interval' parameter (the 'schedule' parameter in newer Airflow releases) in the DAG definition using cron syntax.
Q: What is the role of task dependencies in Airflow and how do you define them?
A: Dependencies ensure tasks run in order using bitshift operators (>> or <<).
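A minimal sketch tying the previous three answers together (the DAG id, task ids, and cron string are arbitrary, and the parameter names follow Airflow 2.x):
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    with DAG(
        dag_id='example_etl',
        start_date=datetime(2024, 1, 1),
        schedule_interval='0 6 * * *',                   # run every day at 06:00
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id='extract', bash_command='echo extract')
        load = BashOperator(task_id='load', bash_command='echo load')
        extract >> load                                  # load runs only after extract succeeds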
Q: Explain the purpose of the @dag and @task decorators in Airflow 2.x.
A: They implement the TaskFlow API: decorated Python functions become DAGs and tasks, reducing boilerplate.
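For example, a TaskFlow-style sketch (function names are hypothetical; parameter names follow Airflow 2.x):
    from datetime import datetime
    from airflow.decorators import dag, task
    @dag(start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
    def taskflow_example():
        @task
        def extract():
            return {'rows': 42}
        @task
        def load(payload):
            print(payload['rows'])
        load(extract())                                  # dependency inferred from the data flow
    taskflow_example()                                   # calling the function registers the DAG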
SECTION 4: Optional - Cloud Data Pipeline Tools (ADF / AWS Glue / GCP Dataflow)
Q: What is Azure Data Factory and what are its main components?
A: ADF is a cloud-based ETL service. Main components: Pipelines, Activities, Linked Services,
Datasets.
Q: Compare Azure Data Factory and Apache Airflow.
A: ADF is managed, GUI-based. Airflow is open-source and code-based with more customization.
Q: What is AWS Glue and how does it differ from traditional ETL tools?
A: AWS Glue is serverless and can generate ETL code automatically; traditional ETL tools require you
to provision and manage your own infrastructure.
Q: Explain the concept of Dataflow in GCP and its use cases.
A: Dataflow is a serverless stream and batch processing tool based on Apache Beam.
Q: What are the advantages of using managed cloud orchestration services over Airflow?
A: Scalability, less maintenance, integrated monitoring, and better cloud integration.
SECTION 5: Writing and Debugging Python Scripts for Data Preprocessing and Modeling
Q: What are the common steps involved in data preprocessing using Python?
A: Typical steps include handling missing values, encoding categorical variables, feature scaling,
and data splitting.
Q: How do you handle missing values in Python?
A: Using Pandas: df.dropna(), df.fillna(), or imputation with sklearn's SimpleImputer.
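For example, a sketch using sklearn's SimpleImputer on hypothetical numeric data:
    import numpy as np
    from sklearn.impute import SimpleImputer
    X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])
    imputer = SimpleImputer(strategy='mean')             # replace NaN with the column mean
    X_imputed = imputer.fit_transform(X)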
Q: What is the difference between Label Encoding and One-Hot Encoding?
A: Label Encoding assigns a unique number to each category, One-Hot Encoding creates binary
columns for each category.
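A short sketch contrasting the two (category values are arbitrary):
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    colors = pd.Series(['red', 'green', 'red'])
    labels = LabelEncoder().fit_transform(colors)        # e.g. [1, 0, 1]
    one_hot = pd.get_dummies(colors, prefix='color')     # one binary column per category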
Q: How do you normalize or standardize a dataset in Python?
A: Use sklearn.preprocessing.StandardScaler for standardization or MinMaxScaler for
normalization.
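A minimal sketch of both scalers on a toy feature column:
    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    X = np.array([[1.0], [5.0], [10.0]])
    X_std = StandardScaler().fit_transform(X)            # zero mean, unit variance
    X_norm = MinMaxScaler().fit_transform(X)             # rescaled to the [0, 1] range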
Q: How would you debug a Python script that is throwing an error in the model training phase?
A: Use print/logging statements, traceback, and IDE debuggers to check data types, model
parameters, and training flow.
Q: What are some common issues you might face while training a machine learning model?
A: Overfitting, underfitting, data imbalance, feature leakage, incorrect data preprocessing.
Q: How do you split a dataset into training and testing sets in Python?
A: Using train_test_split() from sklearn.model_selection.
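For example, assuming a feature matrix X and label vector y already exist:
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)   # stratify keeps class proportions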
Q: How do you evaluate the performance of a classification model?
A: Using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.
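A sketch of these metrics, assuming y_test, predicted labels y_pred, and predicted probabilities y_proba exist:
    from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
    print(accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))         # precision, recall, F1 per class
    print(roc_auc_score(y_test, y_proba))                # needs probability scores, not labels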
Q: What is cross-validation and why is it useful?
A: Cross-validation evaluates a model by training and testing it on several different train/test splits
(folds), which gives a more reliable performance estimate than a single split.
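For example, assuming X and y exist (the estimator here is an arbitrary choice):
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean(), scores.std())                   # average performance across 5 folds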
Q: How do you save and load a trained model in Python?
A: Use joblib or pickle libraries to save (.pkl) and load the model objects.
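A minimal joblib sketch, assuming a fitted estimator named model and test data X_test:
    import joblib
    joblib.dump(model, 'model.pkl')                      # persist the trained model to disk
    loaded_model = joblib.load('model.pkl')
    predictions = loaded_model.predict(X_test)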