Azure Data Engineer Mock Interview - Project Special

Raj has five years of experience in data engineering and ETL consulting, primarily working with Azure Data Factory and databases. He is currently involved in an insurance project utilizing Delta Lakehouse architecture, managing data from various sources and implementing a three-layer structure for data processing. Raj has faced challenges in scheduling pipelines around holidays and has developed solutions to ensure data processing aligns with business days.


[00:02] Interviewer: So, welcome, Raj, to this interview for the Azure data engineering role. Can you please tell me about yourself?
[00:13] Raj: Yes. Hi, my name is Raj. I completed my engineering degree in 2019 from Mumbai University. Since then I have a total of five years of experience, out of which I spent my initial 2.5 years as an ETL consultant, basically working with databases and data warehousing. My most recent 2.5 years have been as a data engineer, in which I am using Azure Data Factory and Databricks, and I have a good command of writing PySpark, Spark, and Python. I have a good background in SQL as well; in total I have been working with SQL for all five years. That is a crisp brief about myself.

[00:56] Interviewer: Great. So can you tell me about your project, please?
[01:00] Raj: My project is in the insurance domain. We follow a Delta Lakehouse architecture, and on top of the Delta Lakehouse we have a medallion architecture, so we basically have three layers: bronze, silver, and gold. We get our data from multiple sources. One of them is an on-prem SQL DB, which is nothing but our ODC; from on-premises we get our three basic datasets: customer data, branch data, and agent data. Other data we get directly from the upstream systems in the form of JSON and CSV, and because in an insurance company we also need a good understanding of the weather, we get weather information from a REST API. Once this information is available, we pull the data with the help of ADF and load it into our ADLS location. Once the data is in the ADLS location, we perform different cleansing and optimization activities, and according to the business logic we finally load the data into our gold layer. From there the Power BI people or analysts can take the data.
[02:27] Interviewer: So what is your team size?

[02:29] Raj: We are seven people. There are also testers, but as data engineers we are seven, including our team lead.

[02:37] Interviewer: Okay. And how many pipelines do you have in your project?

[02:40] Raj: In our project we have 30 to 35 pipelines.

[02:47] Interviewer: Okay. So it sounds like a recently started project, because you are seven people and you have just 30 to 35 pipelines.

[02:53] Raj: Oh, no, I mean I have made 30 pipelines myself. In total there are more than 150 to 200 pipelines.
[03:06] Interviewer: Oh, awesome, that sounds like a big project. Okay, so what is the most challenging situation you have faced? You have created a couple of pipelines, so you might have encountered some problems. What do you feel was one of the most challenging situations you have come across?
[03:28] Raj: Yes, a challenging situation: recently we got a requirement from the client side that a certain pipeline should run only on business days, that is, only when there are no holidays for the company. In the schedule trigger we can only set something like Monday to Friday, but the problem we were facing was: suppose 15th August is a holiday, how do we handle that? That was the challenge, and I took the initiative myself. What I did was prepare an Excel file in which we put all the holidays for our company from the company's calendar. Just before running that pipeline we had a Lookup activity; that Lookup activity would go into that file and check whether the date is a holiday. Then we put an If condition, and only if that particular date is not in the file will the pipeline run, while in the schedule trigger we simply set Monday to Friday. In this manner we made it possible.

[04:37] Interviewer: Okay, got it. Great.
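The check Raj describes (a Lookup over a holiday calendar file, followed by an If Condition that gates the run) can be sketched in Python roughly as follows; the file name, column name, and weekday rule are assumptions for illustration, not details taken from the project:

```python
# Illustrative sketch of the "run only on business days" check Raj describes.
# The holiday file name and column name are assumptions, not from the project.
import csv
from datetime import date


def is_business_day(run_date: date, holiday_file: str = "company_holidays.csv") -> bool:
    """True when run_date falls on Mon-Fri and is not in the holiday calendar."""
    if run_date.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    with open(holiday_file, newline="") as f:
        holidays = {row["holiday_date"] for row in csv.DictReader(f)}  # e.g. "2024-08-15"
    return run_date.isoformat() not in holidays


if is_business_day(date.today()):
    print("Proceed: run the downstream copy/processing activities.")
else:
    print("Skip: weekend or company holiday, pipeline does not run.")
```

In ADF itself the weekday part sits in the schedule trigger (Monday to Friday) and the holiday part in the Lookup plus If Condition, exactly as Raj explains.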
[04:43] Interviewer: That's great. So you said that you have an on-premises ODS system from which you are pulling the data. Can you tell me exactly what that on-prem data source is, what kind of database it is, and how exactly you pull the data from it?

[05:01] Raj: Yes, it is a SQL database, an Azure SQL database, and we are pulling the data with the help of a copy activity.

[05:13] Interviewer: When you say Azure SQL database, is it on-prem or is it a cloud one?

[05:19] Raj: Oh, my bad, my bad. I was thinking about another data source. No, the on-prem one was an Oracle database; it was an on-prem Oracle database.

[05:30] Interviewer: Okay, an on-prem Oracle database. Okay, now how do you pull the data from the on-prem Oracle to the cloud?

[05:39] Raj: With the help of a copy activity.

[05:43] Interviewer: Is just a copy activity sufficient to do this? Are you sure?
[05:48] Raj: Yes, so basically we have to set up a pipeline to pull the data from on-prem to the ADLS location. In this pipeline we will use a runtime; actually, the AutoResolve integration runtime will not work in this case, because we are pulling the data from on-prem, so to pull the data...

[06:12] Interviewer: Hold it, hold it. You say the AutoResolve integration runtime will normally not work. Can you tell me why it is not going to work?

[06:21] Raj: Because it is the kind of integration runtime that pulls data only from the web, only on the cloud, something like that. We have to use the self-hosted integration runtime, which sits on the on-premises machine and provides the compute power there, on-premises, and only then can we pull the data.

[06:47] Interviewer: So, generally, the question remains the same: why do you need this self-hosted integration runtime? Why will the normal AutoResolve integration runtime not work when you are pulling from this on-prem source?

[07:11] Raj: (pauses)

[07:16] Interviewer: Okay, no worries, no worries.
[07:23] Interviewer: Now, that's okay. Let's say you pull the data; I assume you create your self-hosted integration runtime and you pull the data. Now what next? Your table data comes over from your SQL Server, or, as you said, Oracle server. Where are you pushing it, where are you keeping it, and what is the next step in your project?

[07:48] Raj: Yes. As soon as the data comes in, we are just dumping all of it into our landing layer, and once the data is in our landing layer, from there we will...

[08:00] Interviewer: What is this landing layer? Where do you keep this landing layer?

[08:04] Raj: This is basically in our ADLS account; it is storage, Azure Data Lake Storage. There we have a container which is named "landing", and all the data comes directly into the landing container.

[08:24] Interviewer: Okay, then?

[08:26] Raj: Once the data is there in the landing container, we are again using the copy activity and we are pushing this data into our bronze layer.

[08:35] Interviewer: Okay, so you use a copy activity to push the data into the bronze layer. Where is this bronze layer: within ADLS, or a database, or a REST API? What kind of thing is your bronze layer?

[08:50] Raj: The bronze layer is also in ADLS only.

[08:53] Interviewer: Okay, the bronze layer is also in ADLS only. So you copy the data from the landing zone to the bronze layer; done, copied. Okay, now then?

[09:04] Raj: Yes, once our data is there in our bronze layer, we add a flag to it, so that when it goes to the next layer we can identify it. This flag is just added to distinguish which data we have pushed further and which is remaining. That is how the flag works.
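A processed-flag of the kind Raj mentions might be added in PySpark roughly like this; the paths, layout, and column name are hypothetical, not taken from the project:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-flag").getOrCreate()

# Freshly landed Parquet files from the ADF copy (mounted path is an assumption).
landed_df = spark.read.parquet("/mnt/landing/customer/")

# Add a flag so downstream jobs can tell which rows have already been pushed on.
flagged_df = landed_df.withColumn("pushed_to_silver", F.lit(False))

flagged_df.write.format("delta").mode("append").save("/mnt/bronze/customer/")
```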
[09:28] Interviewer: Okay, so if my understanding is correct, you said that you first pull the data from on-prem into ADLS. Is that in the form of a file or a table?

[09:40] Raj: Yes, it is a table.

[09:44] Interviewer: So how do you keep a table in ADLS?

[09:49] Raj: We save it as a table; in ADLS, underneath, it is getting saved in the form of a Parquet file.

[09:55] Interviewer: So is it a table or is it a Parquet file?

[10:01] Raj: It is a Parquet file in the ADLS environment.

[10:04] Interviewer: Okay, so it is a file in ADLS. Okay, then what about your bronze layer: is that also only files? You said you copy it.

[10:14] Raj: Yes, it is a Parquet file only.

[10:19] Interviewer: Okay, it is a Parquet file, but is it a table or is it a file?

[10:22] Raj: It is a file, just that we can access the file in the form of a table because of this data lakehouse architecture we are using. Lakehouse architecture, yes; we are implementing Delta Lake on top of it, and Delta Lake is like a framework...

[10:46] Interviewer: Sorry to interrupt, but let's go through what you said step by step; I don't want to confuse you. You said you move the data from on-prem to the cloud using a copy activity and keep it in ADLS; then again you use a copy activity to move the data from your landing zone to the bronze layer, right? That is also a copy activity. Now where does Delta Lake come in here?

[11:13] Raj: You asked about the format, right? So...

[11:17] Interviewer: So you keep it there; using the copy activity you are copying it as a file, a Parquet file, right? So how does that become a data lakehouse architecture?
[11:37] Raj: So we are saving this file in Parquet format, and this Parquet we are saving in our ADLS location. But Parquet is not directly accessible in that way; we cannot do DML operations on the Parquet file as it is. That is why we are using Delta Lake, in which we can do DML operations, ACID properties are there, and this file will behave like a table, so we can run all the commands we used to run on a normal warehouse table.

[12:15] Interviewer: Okay, sorry, but it seems like there is some confusion. You might be talking about Databricks; you have started talking about Databricks, not only ADF, because you cannot do these DML operations and so on except on the tables you are going to create, like Spark tables. Okay? Are you getting my point? I'll come back to that in a few minutes. Okay, so assume that all of this gets done.
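A minimal PySpark sketch of the point being made here: plain Parquet files support only append or overwrite, but once the same data is rewritten as a Delta table, row-level DML with ACID guarantees becomes available. Paths, table, and column names below are assumptions:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("parquet-vs-delta").getOrCreate()

# Plain Parquet in the landing zone: no row-level UPDATE/DELETE/MERGE.
raw_df = spark.read.parquet("/mnt/landing/policy/")

# Rewrite it as a Delta table in the bronze layer (path is an assumption).
raw_df.write.format("delta").mode("overwrite").save("/mnt/bronze/policy/")

# Now row-level DML works, with ACID guarantees and time travel.
bronze = DeltaTable.forPath(spark, "/mnt/bronze/policy/")
bronze.update(
    condition="policy_status = 'LAPSED '",   # hypothetical data-quality fix
    set={"policy_status": "'LAPSED'"},
)
```

On Databricks the Delta libraries are available out of the box; elsewhere the delta-spark package has to be configured on the Spark session.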
[12:45] Interviewer: Now assume that all of this is working fine. I assume you are using Databricks as well, because, as you said, you are using Databricks for cleaning and so on. What size of data are you getting in?

[12:58] Raj: We are getting, I think, roughly 5 GB of data.

[13:03] Interviewer: Are you talking on a per-pipeline basis, or overall?

[13:08] Raj: Per pipeline we are getting nearly half a GB of data.

[13:17] Interviewer: Per pipeline, half a GB of data? You just said 5 GB.

[13:21] Raj: Oh, I said "point", sorry, by mistake; it is 0.5 GB of data per pipeline.

[13:30] Interviewer: Okay, 0.5 GB of data per pipeline. Okay. Is that a big data size or a small data size, what do you feel?

[13:41] Raj: I feel it is not that big a size of data, but on the other hand we have so many pipelines, lots of pipelines, so that is why...

[13:53] Interviewer: Imagine a situation where, instead of your 0.5 GB, you get 50 GB in one pipeline, and you are doing that cleaning in Databricks using Spark. How would you do Spark performance optimization?
[14:13] Raj: Okay, so there are a number of ways of improving Spark performance. The first is that if there are two tables on which a join is happening, and one of the tables is the smaller one, we can go for a broadcast join. What a broadcast join does is take the smaller table, whichever one it is, and distribute it to all the partitions, so that during the join operation the shuffle is essentially avoided, because each partition of the larger table can join with that smaller table without any shuffling. So, as we say, data movement across partitions is avoided and the time can be reduced. That is one approach.
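A short PySpark sketch of the broadcast join Raj describes; the table names and paths are assumptions, not from the project:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

claims_df = spark.read.format("delta").load("/mnt/silver/claims/")   # large fact table
branch_df = spark.read.format("delta").load("/mnt/silver/branch/")   # small lookup table

# Broadcasting copies the small table to every executor, so each partition of the
# large table joins locally and the shuffle of the large table is avoided.
joined_df = claims_df.join(broadcast(branch_df), on="branch_id", how="left")
```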
[15:11] Raj: Another one we can use is bucketing, a bucketed join. In this, the data is stored in buckets based on the join key; say the join is happening on country, then whichever tables are there we bucket them by country, and if the other table's joining key, the foreign key, is also country, its rows will also fall into the corresponding buckets based on country. There is just one condition: the number of buckets for table one and table two should be the same, and only then can this bucketed join happen.
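Bucketing as Raj outlines it could look like this in PySpark; the bucket count, database, table, and column names are assumptions. Note that bucketBy requires saveAsTable, and, as he says, both sides need the same bucket count on the join key:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bucketed-join").enableHiveSupport().getOrCreate()

policies_df = spark.read.format("delta").load("/mnt/silver/policies/")
customers_df = spark.read.format("delta").load("/mnt/silver/customers/")

# Both tables are written as bucketed Parquet tables (Spark's built-in bucketing),
# bucketed on the join key with the same number of buckets on each side.
(policies_df.write.format("parquet").bucketBy(16, "country").sortBy("country")
    .mode("overwrite").saveAsTable("silver_db.policies_bucketed"))
(customers_df.write.format("parquet").bucketBy(16, "country").sortBy("country")
    .mode("overwrite").saveAsTable("silver_db.customers_bucketed"))

# A later join on "country" can then avoid shuffling either side.
joined_df = spark.table("silver_db.policies_bucketed").join(
    spark.table("silver_db.customers_bucketed"), on="country")
```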
[15:52] Raj: Other than that, there is something called repartitioning, which we do to reduce data skew. And apart from this there is Adaptive Query Execution, which is Spark's internal optimization that we do not need to take care of; Spark on its own can optimize the query performance.
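The last two items Raj mentions, repartitioning to tame skew and Adaptive Query Execution, are each only a line or two of PySpark; the column name and partition count are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-and-aqe").getOrCreate()

# Adaptive Query Execution: Spark re-optimizes the plan at runtime,
# including automatic handling of skewed joins (Spark 3.x settings).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

claims_df = spark.read.format("delta").load("/mnt/silver/claims/")

# Explicit repartitioning on the hot key spreads skewed data across more tasks.
repartitioned_df = claims_df.repartition(200, "country")  # partition count is illustrative
```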
[16:09] Interviewer: Okay, great. Thank you so much, Raj. I think let's get into the second half of this; thank you for coming in. In the second half we will analyze and understand where you did well and where a little improvement is needed. Okay, so let's get started with this assessment.

Starting off, your presentation is good and your communication skills are good, so there you are doing perfectly well. The second thing: at one point you said that we are pulling the data from data sources, one of which is "ODC". It is not ODC, it is ODS, an operational data store.
Next, I feel like you somehow wanted to talk about SQL Server, but you ended up talking about Azure SQL Database. You cannot make that mistake when you are saying that your source is on-prem and you are pulling the data from on-prem: when somebody asks what your on-prem DB is, the answer has to be clear, SQL Server. In our case you probably wanted to say SQL Server, but you ended up saying Azure SQL DB, which is not right, because Azure SQL Database, or Azure SQL Server, is a cloud service, not an on-prem one. So take care of that.

Then, your project explanation and the way you opened it are good, perfectly fine; you are ready with that.

The second question I asked was about the challenging situation you came up with. The only thing I would love to see is that you don't say how you solved it until I ask, so go slow there. You took whatever the problem was and set the stage exactly right; ideally leave it there rather than immediately explaining how you solved it, and wait for that question to come. Just explain the problem and pause, and let the interviewer ask. Okay, done.
Then the third thing we talked about is how you move the data from on-prem to the cloud. You talked about using a copy activity, and when I pressed on whether only the copy activity is enough, you fumbled a little bit on the integration runtime. You then said that we need a self-hosted integration runtime instead of the AutoResolve integration runtime, but you were not very clear on why we need the self-hosted one. The answer is that the AutoResolve integration runtime calls in from random IP addresses, so your private network probably cannot be reached from those random IP addresses, because they are not bound to anything you control. Your private network will not allow that connection to pass through because of your firewall and so on, and that is where we need a self-hosted integration runtime. The self-hosted integration runtime is a fixed machine that connects to your database server; your database server expects calls only from a limited set of IP addresses, your self-hosted machine is fixed, and its IP address gets whitelisted by the firewall of your database server so that the connection can pass through. That is something you need to work on. Remember we covered it in our sessions, so just go back to those sessions and cover why we need the self-hosted IR instead of the AutoResolve one.
Then we talked about your project: the data comes in, and then how are you going to take the data forward after it lands? At that piece, ideally, as per the process, you are copying into the bronze layer via Databricks, if you remember. The project you were talking about is, as you yourself mentioned, something you would be handling on the Databricks side: your data is getting saved into the bronze layer as a Delta table, but you are creating those Delta tables using Databricks. You create the bronze-layer database and then store those bronze-layer tables as Delta tables; under the hood the table gets saved as Parquet, and on top of that you can do all the DML operations. Generally, normal DML operations are not allowed on plain Spark tables, but when you make a Spark table a Delta table you can apply all of that. So the flow is: once you get the data from on-prem and it is there in ADLS, you go to Databricks, and from Databricks you fetch those files, using mounting and so on, and push them into the bronze tables. I think that piece got missed, so just cover that up.
21:23 up then um when we talked about uh about
21:29 the size of the data for your pipeline
21:32 you set it like 5 GB initially then you
21:34 get it to the. 5 GB you know so that is
21:38 something which you know that's an
21:40 expected question which you can always
21:42 you know uh have a clearcut in the mind
21:44 that what is your correct size okay so
21:46 don't get confused between the 0.5 5gb
21:49 500 MB per pipeline don't just say that
21:53 when I when I asked like how much data
21:54 you are processing and you said
21:57 5gb if it is was 1 terab it was okay but
22:01 5gb looks very small so when you giving
22:03 a very small thing you have to Crystal
22:06 Clear said it okay 5gb per pipeline okay
22:10 or per day basis per pipeline per day
22:12 something like that you getting my point
22:15 so you have to Crystal Clear uh Define
22:17 it so that eventually your number should
22:19 look bigger the whole idea is that
22:21 number should look bigger so rather than
22:24 talking about less number try to talk
22:25 about a bigger number okay and then from
22:28 that you you cut down to the lower
22:29 number rather than take out a lower
22:32 number and then you know extrapolating
22:34 it to the larger number so you might
22:36 started saying that okay we are
22:37 processing a one tab of data on a
22:38 monthly basis okay overall then he'll
22:41 say you're every pipeline no no we have
22:42 a 10 pipeline every pipeline is might
22:44 having a 100 GB or like 100 one 1 GB and
22:48 we're running 100 times a pipeline in a
22:50 month something like that overall this
22:52 has how it become a terabyte okay do
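The arithmetic behind that framing, using the reviewer's illustrative figures rather than real project numbers:

```python
# Illustrative figures from the feedback above, not actual project metrics.
pipelines = 10
gb_per_run = 1            # roughly 1 GB processed per pipeline run
runs_per_month = 100      # runs per pipeline per month

monthly_gb = pipelines * gb_per_run * runs_per_month
print(f"{monthly_gb} GB per month, i.e. about {monthly_gb / 1000:.0f} TB")  # ~1 TB
```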
[22:55] Interviewer: Then, lastly, we talked about Spark optimization. I understand you were trying to cover all those concepts, but I feel it could be taken further: you should create a kind of scenario and then talk about it, because otherwise it will be very generic. So maybe, when I ask, "Okay, you have, say, a terabyte of data or something like that; how will you optimize your Spark code?", you should come up with something like: "Let's assume that on this large dataset, when we only have narrow transformations the problem does not usually come up; Spark optimization is mostly needed when we get into wide transformations, so I assume we have a lot of wide transformations and that is probably why this problem is coming up." Got it? That is how you start, and then you talk it through: there could be a join involved, so we might do join optimization, and there are multiple ways to do join optimization, maybe repartitioning, using bucketing, using something like a sort-merge join, or other techniques. That is how you slowly progress through it. Another good thing is that you are talking about Delta tables, and Delta Lake has performance optimization of its own: you have the OPTIMIZE command, and you can also run a VACUUM command, which cuts down the extra data files that are not deleted automatically and are outside the retention period. Keep all of those things in mind.
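The two Delta maintenance commands mentioned, as they might be run from a Databricks notebook (where a SparkSession named `spark` is already available); the table name and Z-order column are assumptions:

```python
# Compact many small files into larger ones; ZORDER co-locates rows by a common filter column.
spark.sql("OPTIMIZE bronze.customer ZORDER BY (country)")

# Remove data files no longer referenced by the table and older than the retention window
# (168 hours is the default 7-day retention).
spark.sql("VACUUM bronze.customer RETAIN 168 HOURS")
```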
[24:41] Interviewer: But, to make your story look brighter when somebody asks a very vague question like "tell me how you can optimize it", you create a scenario around it, so that your answer looks more justified. For example, you talked about country, but on its own country is out of scope, so you first have to define your data set and then talk about the country. So let's say, for example, we have lots of data, our data is scattered across multiple countries, and now we are doing a join around it; then maybe we can partition the data based on country, and then it will make sense to optimize it that way. Are you getting my point?
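The country scenario the reviewer sketches might translate to partitioning the Delta table by country before the join; the paths and table are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-by-country").getOrCreate()

policies_df = spark.read.format("delta").load("/mnt/silver/policies/")

# Physically partition the table by country so that joins and filters on country
# only touch the relevant partitions.
(policies_df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("country")
    .save("/mnt/silver/policies_by_country/"))
```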
[25:29] Interviewer: Yeah, so that is how it goes. Although your answer was okay, it is not that it was bad, I just feel this is how we can make it look better, so those would be a few points. Otherwise, overall it looks okay. I was purposefully digging quite deep into the project, because a lot of people wanted to see how the project discussion goes. In a real interview you probably will not get dug into the project this much, but I was doing it on purpose. So I hope you get the point about where you have to work, and yes, we can connect another time after you have prepared all of this, and then we will have one more round of interview, okay? Great. I hope you enjoyed it, and that those who are watching this also enjoyed it and learned a lot. Thank you so much, Raj, for coming on this. See you soon. Thank you, bye.
