Azure Data Engineer Mock Interview - Project Special
Raj has five years of experience in data engineering and ETL consulting, primarily working with Azure Data Factory and databases. He is currently involved in an insurance project utilizing Delta Lakehouse architecture, managing data from various sources and implementing a three-layer structure for data processing. Raj has faced challenges in scheduling pipelines around holidays and has developed solutions to ensure data processing aligns with business days.
Interviewer: Welcome, Raj, to this interview for the Azure data engineering role. Can you please tell me about yourself?
Raj: Hi, my name is Raj. I completed my engineering in 2019 from Mumbai University, and since then I have five years of total experience. My initial two and a half years were as an ETL consultant, working mainly with databases and data warehousing, and my recent two and a half years have been as a data engineer, using Azure Data Factory and Databricks, with a good command of PySpark, Spark, and Python. I also have a strong SQL background; all five years have involved SQL. That is a crisp brief about myself.

Interviewer: Great. Can you tell me about your project, please?

Raj: My project is in the insurance domain. We follow a Delta Lakehouse architecture, a medallion architecture built on the Delta lakehouse, so we have three layers: bronze, silver, and gold. We get our data from multiple sources. One is an on-premises SQL DB, which is nothing but ODC; from on-premises we get our three basic feeds, which are customer data, branch data, and agent data. Other data comes directly from upstream systems in the form of JSON and CSV, and because an insurance company also needs a good understanding of the weather, we get weather information from a REST API. Once this information is available, we pull the data with the help of ADF and load it into our ADLS location. Once the data is in the ADLS location, we perform different cleansing and optimization activities, apply the business logic, and finally load the data into our gold layer, from where the Power BI people and analysts can take the data.
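As a rough illustration of the medallion flow Raj describes, here is a minimal PySpark sketch that promotes data from a landing container through bronze, silver, and gold Delta tables. The storage paths, column names, and cleansing rule are invented for illustration; in the actual project the ingestion is done with ADF Copy activities and the business logic is far richer.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

    # Hypothetical ADLS paths; the real container and folder names will differ.
    landing = "abfss://landing@storageacct.dfs.core.windows.net/customer/"
    bronze  = "abfss://bronze@storageacct.dfs.core.windows.net/customer/"
    silver  = "abfss://silver@storageacct.dfs.core.windows.net/customer/"
    gold    = "abfss://gold@storageacct.dfs.core.windows.net/customer_summary/"

    # Bronze: land the raw CSV as-is, adding only an audit column.
    raw = (spark.read.option("header", "true").csv(landing)
           .withColumn("ingest_ts", F.current_timestamp()))
    raw.write.format("delta").mode("append").save(bronze)

    # Silver: basic cleansing -- deduplicate and drop rows missing the key.
    clean = (spark.read.format("delta").load(bronze)
             .dropDuplicates(["customer_id"])
             .filter(F.col("customer_id").isNotNull()))
    clean.write.format("delta").mode("overwrite").save(silver)

    # Gold: an aggregated, report-ready table for Power BI.
    summary = (spark.read.format("delta").load(silver)
               .groupBy("branch_id").agg(F.count("*").alias("customer_count")))
    summary.write.format("delta").mode("overwrite").save(gold)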
Interviewer: So what is your team size?

Raj: We are seven people. The testers are separate, but as data engineers we are seven, including our team lead.

Interviewer: Okay, and how many pipelines do you have in your project?

Raj: In our project we have 30 to 35 pipelines.

Interviewer: So it looks like a recently started project, then? Seven people and just 30 to 35 pipelines.

Raj: Oh no, I mean I have built around 30 pipelines myself. In total there are more than 150 to 200 pipelines.

Interviewer: Awesome, that looks like a big project. So what is the most challenging situation you have faced? You have created a couple of pipelines, so you must have encountered some problems. What do you feel was one of the most challenging situations you have come across?

Raj: Recently we got a requirement from the client side that a certain pipeline should run only on business days, only when there are no holidays for the company. In the schedule trigger we can set the window to Monday to Friday, but the problem we faced was a holiday like 15th August; how do you handle that? That was challenging, and I took the initiative myself. I prepared an Excel file in which we put all the holidays from the company's calendar. Just before running that pipeline we have a Lookup activity that goes into that file and checks whether the current date is a holiday, and then an If condition so that the pipeline runs only if that particular date is not in the file. In the schedule trigger itself we just set Monday to Friday. In this manner we made it possible.

Interviewer: Got it, great.
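Raj builds this check with a Lookup activity and an If condition inside ADF. A rough Python sketch of the equivalent decision logic, with a hypothetical file name and column, would be:

    from datetime import date
    import csv

    HOLIDAY_FILE = "company_holidays.csv"  # hypothetical export of the company holiday calendar

    def is_business_day(run_date: date) -> bool:
        """True only for Mon-Fri dates that are not in the holiday calendar."""
        if run_date.weekday() >= 5:          # the schedule trigger already skips weekends
            return False
        with open(HOLIDAY_FILE, newline="") as f:
            holidays = {row["holiday_date"] for row in csv.DictReader(f)}
        return run_date.isoformat() not in holidays

    if is_business_day(date.today()):
        print("Run the downstream pipeline")   # in ADF this is the 'true' branch of the If condition
    else:
        print("Skip - weekend or company holiday")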
Interviewer: That's great. So you said you have an on-premises ODS system from which you are pulling data. Can you tell me exactly what that on-prem data source is, what kind of database it is, and how exactly you pull the data from it?

Raj: It is a SQL database, an Azure SQL database, and we are pulling the data with the help of a Copy—

Interviewer: When you say Azure SQL database, is that on-prem or is it a cloud one?

Raj: Oh, my bad, I was thinking of another data source. No, the on-prem one was an Oracle database. It was an on-premises Oracle database.

Interviewer: Okay, an on-prem Oracle database. Now, how do you pull the data from Oracle on-prem to the cloud?

Raj: With the help of a Copy activity.

Interviewer: Is a Copy activity alone sufficient to do this? Are you sure?

Raj: Yes. Basically we have to set up a pipeline to pull the data from on-prem to the ADLS location, and in this pipeline we use a runtime. The Auto-Resolve integration runtime will not work in this case because we are pulling the data from on-prem, so to pull the data—

Interviewer: Hold it. You say the Auto-Resolve integration runtime will normally not work; can you tell me why it is not going to work?

Raj: Because it is the kind of integration runtime that pulls data only from the web, only on the cloud, something like that. We have to use the self-hosted integration runtime, which sits on the on-premises VM and provides the computation power from on-premises, and only then can we pull the data.

Interviewer: The question remains the same: why do you need this self-hosted integration runtime? Why will the normal Auto-Resolve integration runtime not work when you are pulling from on-prem?

Raj: (no answer)

Interviewer: Okay, no worries. Now, that's fine. Assume you created your self-hosted integration runtime and pulled the data. What next? Your table data comes in from your SQL Server, or Oracle server as you say. Where are you pushing it, where are you keeping it, and what is the next step in your project?

Raj: As soon as the data comes in, we dump all of it into our landing layer, and once the data is in our landing layer, from there we—

Interviewer: What is this landing layer? Where do you keep it?

Raj: It is basically in our ADLS account, Azure Data Lake Storage. There we have a container named "landing", and all the data comes directly into the landing container.

Interviewer: Okay, then?

Raj: Once the data is in the landing container, we again use the Copy activity and push this data into our bronze layer.

Interviewer: Okay, so you use a Copy activity to push the data into the bronze layer. Where is this bronze layer? Is it within ADLS, a database, a REST API? What is your bronze layer?

Raj: The bronze layer is also in ADLS only.

Interviewer: So the bronze layer is also in ADLS. You copy the data from the landing zone to the bronze layer. Done, copied. Then?

Raj: Once our data is in the bronze layer, we add a flag to it so that when it goes to the next layer we can identify it. The flag is just added to distinguish which data we have pushed further and which is remaining. That is how the flag works.
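A minimal sketch of the "processed flag" idea, assuming the flag is a boolean column on the bronze Delta table; the path, column name, and condition are invented for illustration:

    from pyspark.sql import SparkSession, functions as F
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    bronze_path = "abfss://bronze@storageacct.dfs.core.windows.net/policies/"  # hypothetical

    # Pick up only the rows that have not yet been pushed to the next layer.
    pending = (spark.read.format("delta").load(bronze_path)
               .filter(F.col("is_processed") == F.lit(False)))

    # ... transform `pending` and write it to the silver layer here ...

    # Mark the rows as processed so the next run skips them.
    (DeltaTable.forPath(spark, bronze_path)
     .update(condition="is_processed = false",
             set={"is_processed": "true"}))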
Interviewer: Okay. So if my understanding is correct, you first pull the data from on-prem to ADLS. Is that in the form of a file or a table?

Raj: It is a table.

Interviewer: So how do you keep a table in ADLS?

Raj: We save it as a table; in ADLS it gets saved underneath in the form of a Parquet file.

Interviewer: So is it a table or is it a Parquet file?

Raj: It is a Parquet file in the ADLS environment.

Interviewer: Okay, so it is a file in ADLS. Then is your bronze layer also just files?

Raj: Yes, it is also a Parquet file.

Interviewer: It is a Parquet file, but is it a table or a file?

Raj: It is a file, just that we can access the file in the form of a table because of the data lakehouse architecture we are using. We implement Delta Lake on top of it; Delta Lake is like a framework—

Interviewer: Sorry to interrupt, but let's go step by step, I don't want to confuse you. You said you move the data from on-prem to the cloud using a Copy activity and keep it in ADLS, then you again use a Copy activity to move the data from landing to the bronze layer. That is also just a Copy activity. Now where does Delta Lake come in here?

Raj: You asked about the format. We copy it, using the Copy activity, as a file, a Parquet file.

Interviewer: So how does it become a data lakehouse architecture?

Raj: We save this file in Parquet format in our ADLS location, but Parquet is not directly workable; we cannot do DML operations on a Parquet file as it is. That is why we use Delta Lake, in which we can do DML operations and ACID properties are there. The file behaves like a table, and we can run all the commands we used to run on a normal warehouse table.

Interviewer: Sorry, but it seems like there is some confusion. You might be talking about Databricks; you have started talking about Databricks, not only ADF, because you cannot do these DML operations on tables you would create as plain Spark tables. Are you getting my point? I'll come back to that in a few minutes.
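To make the Parquet-versus-Delta distinction concrete, here is a rough PySpark sketch, with an invented path and column names, of converting a plain Parquet folder to Delta and then running an UPDATE against it, which is the kind of DML that raw Parquet files do not support on their own:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    path = "abfss://bronze@storageacct.dfs.core.windows.net/agents/"  # hypothetical Parquet folder

    # One-time, in-place conversion of the Parquet files into a Delta table.
    DeltaTable.convertToDelta(spark, f"parquet.`{path}`")

    # Delta now allows DML with ACID guarantees; with plain Parquet you would
    # have to rewrite the affected files yourself to get the same effect.
    spark.sql(f"""
        UPDATE delta.`{path}`
        SET agent_status = 'INACTIVE'
        WHERE last_policy_date < '2020-01-01'
    """)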
Interviewer: Okay, now assume all of that is done and working fine. I assume you are using Databricks as well, since you said you use it for cleansing. What size of data are you getting in?

Raj: We are getting roughly 5 GB of data.

Interviewer: Is that per pipeline, or overall?

Raj: Per pipeline we are getting nearly half a GB of data.

Interviewer: Half a GB per pipeline? You just said 5 GB.

Raj: Oh, I missed the point, by mistake. It is 0.5 GB of data per pipeline.

Interviewer: Okay, 0.5 GB of data per pipeline. Is that a big data size or a small one, what do you feel?

Raj: I feel it is not that big a size of data, but we have so many pipelines, lots of pipelines, so that is why—

Interviewer: Imagine a situation where, instead of 0.5 GB, you got 50 GB in one pipeline, and you are doing that cleansing in Databricks using Spark. How would you do the Spark performance optimization?

Raj: There are a number of ways to optimize Spark performance. The first is that if there are two tables being joined and one of them is the smaller one, we can go for a broadcast join. What the broadcast join does is distribute the smaller table to all the partitions, so that during the join operation shuffling is essentially avoided; each partition of the larger table can join with the smaller table without any shuffle, data movement across partitions is stopped, and the time is reduced. Another one, besides the broadcast join, is bucketing, a bucket join. There the data is stored based on the join key; say the join is happening on country, then both tables are bucketed by country, so matching rows fall into corresponding buckets. There is one condition, that the number of buckets for table one and table two should be the same, and only then can the bucket join happen. Other than that, there is repartitioning, which we do to reduce data skewness, and apart from this there is Adaptive Query Execution, Spark's internal optimization that we don't need to take care of; Spark can optimize the query performance on its own.

Interviewer: Okay, great. Thank you so much, Raj. Let's get into the second half of this; thank you for coming in. In the second half we'll analyze and understand where you did well and where a little improvement is needed. So let's get started with the assessment.
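A short PySpark sketch of two of the techniques Raj lists, a broadcast join hint for a small dimension table plus enabling Adaptive Query Execution, using made-up table names and paths:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Adaptive Query Execution lets Spark re-optimize shuffles and skewed joins at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    claims = spark.read.format("delta").load("/mnt/silver/claims")      # large fact table (hypothetical)
    branches = spark.read.format("delta").load("/mnt/silver/branches")  # small dimension table (hypothetical)

    # Broadcasting the small table ships a copy to every executor,
    # so the large table joins locally without a shuffle.
    joined = claims.join(F.broadcast(branches), on="branch_id", how="left")

    # Repartitioning on the join/skew key can also even out partition sizes.
    balanced = claims.repartition(200, "branch_id")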
Interviewer: Starting with it: your presentation is good, your communication skills are good, so there you are doing perfectly well. Second thing: at one point you said you are pulling data from sources, one of which is ODC. It is not ODC, it is ODS, operational data store. Next, I feel you somehow wanted to talk about SQL Server but ended up saying Azure SQL database. You cannot make that mistake when you are saying your source is on-prem; when somebody asks what your on-prem DB is, the answer has to be clear. In our case you probably wanted to say SQL Server, but you ended up saying Azure SQL DB, which is not right, because Azure SQL database, or Azure SQL Server, is a cloud offering, not an on-prem one. So take care of that. Then, your project explanation start was good, perfectly fine; you are ready with it.

The second question I asked was about the challenging situation you came up with. The only thing I would love to see is that you don't say how you did it until I ask. Go slow there: state whatever the problem was, and you set the stage very rightly, but then ideally leave it rather than going straight into how you solved it. Wait for that question to come. Just explain the problem and pause, and let the interviewer ask.

The third thing we talked about is how you do on-prem to cloud. You said you use a Copy activity, and when I asked whether only the Copy activity is enough, you got a little fumbled about the integration runtime. You said we need a self-hosted integration runtime instead of the Auto-Resolve integration runtime, but you were not very clear on why we need the self-hosted one. The answer is that the Auto-Resolve integration runtime picks up and calls from random IP addresses, so your private network probably cannot be reached from those random addresses, because nothing is bound to them, and your private network's firewall will not allow that connection to pass through. That is where we need a self-hosted integration runtime: the self-hosted machine is fixed and connects to your database server, which expects calls only from a limited set of IP addresses, and that fixed IP address gets whitelisted by the firewall of your database server so the connection passes through. That is something you need to work on; remember we covered it in our sessions, so go back to those sessions and cover why we need the self-hosted IR instead of the Auto-Resolve one.

Then we talked about your project. The data comes in, and then how do you take it forward once it lands? At that point, ideally, as per the process, you are copying it into the bronze layer via Databricks. If you remember, the project you were talking about is something you would be handling on the Databricks side: your data is getting saved into the bronze layer as Delta tables, but you create those Delta tables using Databricks. You create the bronze-layer database and store the bronze-layer tables as Delta tables; under the hood each table gets saved as Parquet, and you can go and do all the DML operations. Normal DML is generally not allowed on plain Spark tables, but when you make a Spark table a Delta table you can apply all of that. So the flow is: once you get the data from on-prem and it is in ADLS, you go to Databricks, and from Databricks you fetch those files and push them into the tables, using mounting and so on. I think that piece got missed, so just cover that up.
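A rough sketch of the flow the interviewer describes: Databricks reading the landed files through a mount point and saving them as a bronze Delta table. The storage and table names are hypothetical, and the mount call is left commented out because the credential configuration (a service principal or access key held in a secret scope) depends on the environment:

    # Runs in a Databricks notebook, where `spark` and `dbutils` are predefined.

    # Mount the ADLS container once (credentials/configs omitted and hypothetical).
    # dbutils.fs.mount(
    #     source="abfss://landing@storageacct.dfs.core.windows.net/",
    #     mount_point="/mnt/landing",
    #     extra_configs=adls_oauth_configs)

    spark.sql("CREATE DATABASE IF NOT EXISTS bronze")

    # Read the landed Parquet files and register them as a managed bronze Delta table.
    (spark.read.parquet("/mnt/landing/customer/")
          .write.format("delta")
          .mode("overwrite")
          .saveAsTable("bronze.customer"))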
Interviewer: Then, when we talked about the size of the data for your pipeline, you said 5 GB at first and then brought it down to 0.5 GB. That is an expected question, so always have a clear-cut answer in mind about your correct size. Don't get confused between 0.5 GB, 5 GB, and 500 MB per pipeline. When I asked how much data you are processing and you said 5 GB: if it were a terabyte that would be fine, but 5 GB looks very small, so when you give a small number you have to state it crystal clear, 0.5 GB per pipeline, or per pipeline per day, something like that. Define it crystal clearly so that eventually your number looks bigger; the whole idea is that the number should look bigger. Rather than leading with a small number and extrapolating it up to a larger one, start from the bigger number and then cut down to the smaller one. You might start by saying we process about a terabyte of data a month overall; then if the interviewer asks about each pipeline, you say no, we have ten pipelines, each handling around 1 GB, and the pipelines run about a hundred times a month, something like that, and overall that is how it becomes a terabyte. Do that.

Lastly, we talked about the Spark optimization. I understand you were trying to cover all those concepts, but I feel it could be better: you should create a scenario and then talk about it, because otherwise it stays very generic. When I ask something like "you have a terabyte of data or so, how will you optimize your Spark code?", come up with something like: on this large dataset, when we only have narrow transformations the problem mostly does not arise; Spark optimization is usually needed when we get into wide transformations, so I assume we have a lot of wide transformations and that is probably why this problem is coming up. That is how you start, and then you talk it through: there could be a join, so we might do join optimization, and there are multiple ways to do that, maybe repartitioning, bucketing, something like a sort-merge join, or other techniques. That is how you progress it step by step. Another good thing is that you are talking about Delta tables, and Delta Lake has performance optimizations of its own: you have the OPTIMIZE command, and you can also run a VACUUM command, which cleans up extra data files outside the retention period that are not deleted automatically. Keep all of those in mind.
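For the Delta maintenance commands mentioned above, a minimal sketch; the table name is hypothetical, and 168 hours is simply Delta's default seven-day retention:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Compact many small files into fewer, larger ones for faster reads.
    spark.sql("OPTIMIZE bronze.customer")

    # Remove data files no longer referenced by the table and older than the retention period.
    spark.sql("VACUUM bronze.customer RETAIN 168 HOURS")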
But to make your story look brighter when somebody asks a very vague question, something like "tell me how you can optimize it", create a scenario around it so that your answer looks more justified. For example, you talked about country, but country is out of scope until you define your dataset, so first define the data and then talk about the country. Say, for example, we have lots of data scattered across multiple countries and we are joining on country, so maybe we can partition the data by country, and then it makes sense to optimize it that way. You get my point? Your answers were okay, not bad, but I feel this is how we can make them look better. Those would be a few points; otherwise, overall it looks okay. I was purposefully digging deep into the project because a lot of people wanted to see how the project discussion goes. In a real interview you may not get dug into the project this much, but I was doing it on purpose, so I hope you get the point about where you have to work. We can connect another time after you prepare all of this, and then we will have one more round of interview.

Raj: Yes, great.

Interviewer: I hope you enjoyed it, and those who are watching also enjoyed it and learned a lot. Thank you so much, Raj, for coming on this. See you soon. Bye, thank you.