Multiple Response Tasks

The document contains multiple response tasks that cover various topics related to data processing, big data tools, and system architecture. It includes questions on schema approaches, YARN, NoSQL databases, data lineage, and ETL pipelines, among others. Additionally, it describes a task to construct a ski slope using SQL queries and an ETL job for analyzing electric vehicle charge point data using PySpark.

Task 1

Multiple response task

This is a set of multiple response questions.

Please choose as many answers as you think are correct.
It is possible for all the answers or none to be correct.
Also, incorrect answers will negatively impact your score.


1.

Which of the following are the advantages of the ‘schema on read’ approach over ‘schema on write’?

A: Support for unstructured data

B: Faster loads to the storage layer

C: The flexibility of how data is consumed

D: Faster reads from the storage layer

2.

When it comes to big data tools, what does the acronym YARN stand for?

A: Yet Another Resource Network

B: Yet Another Release Note

C: Yet Another Routing Network

D: Yet Another Resource Negotiator

3.

You are trying to decide whether to use a single machine or cluster computing tools in your next
project. Which of the following is the premise for using single machine architecture?

A: Your load might increase drastically over time.

B: You are only expecting to be loading a small amount of data.

C: You are expecting your tasks to be very memory-intensive.

4.

Which of the following are useful Python packages for data processing and analysis projects?

A: Antigravity

B: Pandas

C: Seaborn

D: Pyglet
5.

Your system is getting more traction and starts to require more computing power. Which of the
following are reasons for scaling your system horizontally as opposed to vertically?

A: You are looking for more computing flexibility.

B: You are concerned about downtime when upgrading your machine.

C: You are unable to split your app into smaller logical blocks.

D: You want stable costs.

6.

Which MapReduce phase is usually the one we would like to get rid of, but might also be the most
memory-intensive?

A: Map

B: Shuffle

C: Reduce

7.

Which of the following statements about ELT are true?

A: An ELT model enables faster loading times than ETL.

B: An ELT model is an alternative to ETL.

C: With an ELT model, users can run transformations directly on the raw data.

D: An ELT model increases the time data spends in transit.

8.

Which of the following are equivalent to AWS S3?

A: Google BigQuery

B: Azure Blob Storage

C: Google Cloud Storage

D: Azure Data Factory

9.

In terms of a Hadoop cluster, what is the heartbeat?

A: It is a signal sent from a name node to data nodes informing them about cluster health.

B: It is a signal sent from a name node to external applications informing them about cluster health.

C: It is a signal sent from external applications to a name node asking about system health.

D: It is a signal sent from data nodes to a name node informing it about node health.
10.

Match the following technologies with their application:

1. Spark, 2. Cassandra, 3. Zookeeper, 4. Kafka, 5. Keras, 6. Superset

A. Database, B. Visualization, C. Orchestration, D. Analytics, E. Machine Learning, F. Streaming

A: 1F, 2A, 3C, 4A, 5B, 6E

B: 1D, 2A, 3C, 4F, 5E, 6B

C: 1D, 2F, 3E, 4A, 5C, 6B

D: 1B, 2A, 3D, 4F, 5E, 6C

Please review your answers. After submitting you won't be able to change them.


Task 2

Multiple response task

This is a set of multiple response questions.

Please choose as many answers as you think are correct.
It is possible for all the answers or none to be correct.
Also, incorrect answers will negatively impact your score.


1.

Which of the following sentences about eventual consistency are true?

A: Eventually consistent writes are often faster than strongly consistent ones.

B: Eventually consistent systems implement BASE properties.

C: In eventually consistent systems, reads are often faster than in strongly consistent ones.

D: Eventually consistent systems might not always return the same result.

2.

Which of the following are the characteristics of a distributed system?

A: Shared global clock

B: Concurrency of components

C: Independent failure of components

D: Scalability

3.
What are the main Lambda architecture layers?

A: Process Layer, Serving Layer

B: Batch Layer, Speed Layer, Data Layer

C: Stream Layer, Data Layer, Storage Layer

D: Batch Layer, Speed Layer, Serving Layer

4.

In which of the following situations would you recommend using a NoSQL database?

A: The input data structure is expected to change often.

B: The database is expected to serve complex queries on structured tables.

C: ACID support is required.

D: The database is expected to be able to serve changing workloads.

5.

Which of the following statements describe data lineage?

A: Data lineage is a process of discovering patterns in large data sets.

B: Data lineage gives visibility while greatly simplifying the ability to trace errors.

C: Data lineage provides a way of tracking data from its origin to its destination.

D: Data lineage is a process of extracting data from output coming from another program.

6.

You are looking for a distributed data store that will continue to work if one of the nodes fails, and
will deliver the same most recent result to all clients. Which two guarantees of the CAP theorem
should your system fulfill?

A: CA

B: CP

C: AP

7.

Which of the following statements about big data file formats are true?

A: AVRO offers better schema evolution than the other two formats.

B: Parquet and ORC are row-based whereas AVRO is a column-based format.

C: All three are machine-readable binary formats.

D: Parquet is better optimized for use with Apache Spark, whereas ORC works better with Hive.

8.
Which of the following would be reasons for building a data lake as opposed to a data warehouse for
your next project?

A: The input data might have different formats.

B: You want to store your data in a transformed and structured way.

C: You want to store as much data as possible and decide how to use it later.

D: You have a set of predefined queries.

9.

You work for a financial institution and are trying to decide whether to build a serverless data
platform using tools offered by one of the cloud providers. Which of the following would be
drawbacks of the serverless architecture?

A: No access to virtual machines

B: Security concerns over multitenancy problems

C: More server maintenance required

D: High upfront costs

10.

What are some of the challenges when using real-time data processing?

A: Dealing with repeated data

B: Dealing with structured data

C: Dealing with out-of-order events

D: Dealing with small numbers of events

Please review your answers. After submitting you won't be able to change them.



Task 3

Task description

A ski resort company is planning to construct a new ski slope using a pre-existing network of
mountain huts and trails between them. A new slope has to begin at one of the mountain huts, have
a middle station at another hut connected with the first one by a direct trail, and end at the third
mountain hut which is also connected by a direct trail to the second hut. The altitude of the three
huts chosen for constructing the ski slope has to be strictly decreasing.

You are given two SQL tables, mountain_huts and trails, with the following structure:

create table mountain_huts (
  id integer not null,
  name varchar(40) not null,
  altitude integer not null,
  unique(name),
  unique(id)
);

create table trails (
  hut1 integer not null,
  hut2 integer not null
);

Each entry in the table trails represents a direct connection between huts with IDs hut1 and hut2.
Note that all trails are bidirectional.

Create a query that finds all triplets (startpt, middlept, endpt) representing the mountain huts that
may be used for construction of a ski slope. Output returned by the query can be ordered in any way.

Examples:

1. Given the tables:

mountain_huts:

+----+----------+----------+
| id | name     | altitude |
+----+----------+----------+
|  1 | Dakonat  |     1900 |
|  2 | Natisa   |     2100 |
|  3 | Gajantut |     1600 |
|  4 | Rifat    |      782 |
|  5 | Tupur    |     1370 |
+----+----------+----------+

trails:

+------+------+
| hut1 | hut2 |
+------+------+
|    1 |    3 |
|    3 |    2 |
|    3 |    5 |
|    4 |    5 |
|    1 |    5 |
+------+------+

your query should return:

+----------+----------+-------+
| startpt  | middlept | endpt |
+----------+----------+-------+
| Dakonat  | Gajantut | Tupur |
| Dakonat  | Tupur    | Rifat |
| Gajantut | Tupur    | Rifat |
| Natisa   | Gajantut | Tupur |
+----------+----------+-------+
2. Given the tables:

mountain_huts:

+----+-----------+----------+
| id | name      | altitude |
+----+-----------+----------+
|  1 | Adam      |     2100 |
|  2 | Emily     |     1800 |
|  3 | Diana     |     1800 |
|  4 | Bobs Inn  |     1400 |
|  5 | Carls Inn |     1350 |
|  6 | Hannah    |     2300 |
+----+-----------+----------+

trails:

+------+------+
| hut1 | hut2 |
+------+------+
|    2 |    1 |
|    2 |    3 |
|    2 |    4 |
|    2 |    5 |
|    3 |    1 |
|    3 |    4 |
|    3 |    5 |
|    3 |    6 |
+------+------+

your query should return:

+---------+----------+-----------+
| startpt | middlept | endpt     |
+---------+----------+-----------+
| Adam    | Diana    | Bobs Inn  |
| Adam    | Diana    | Carls Inn |
| Adam    | Emily    | Bobs Inn  |
| Adam    | Emily    | Carls Inn |
| Hannah  | Diana    | Bobs Inn  |
| Hannah  | Diana    | Carls Inn |
+---------+----------+-----------+

Assume that:

- there is no trail going from a hut back to itself;

- for every two huts there is at most one direct trail connecting them;

- each hut from table trails occurs in table mountain_huts.
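
One way to express this query is to normalize each bidirectional trail into both directions and then chain two hops with strictly decreasing altitudes. The following is a sketch of a possible solution, not the only valid query; the CTE name t is illustrative:

-- Make each trail usable in both directions, then join twice to form
-- start -> middle -> end chains with strictly decreasing altitude.
with t as (
  select hut1, hut2 from trails
  union all
  select hut2, hut1 from trails
)
select h1.name as startpt,
       h2.name as middlept,
       h3.name as endpt
from t t1
join t t2 on t2.hut1 = t1.hut2
join mountain_huts h1 on h1.id = t1.hut1
join mountain_huts h2 on h2.id = t1.hut2
join mountain_huts h3 on h3.id = t2.hut2
where h1.altitude > h2.altitude
  and h2.altitude > h3.altitude;

Against the first example this yields the four expected rows; for instance, Dakonat (1900) -> Gajantut (1600) -> Tupur (1370) uses trails 1-3 and 3-5 with strictly decreasing altitudes.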



Task 4

Task description

To access the CSV data sets, download the zipped files.

ETL Pipeline

The objective of this task is to create an ETL job which will read data from a file, transform it into the
desired state and save it to an output location.

NOTE: This task runs against Spark version 3.1.1.

The input file electric-chargepoints-2017.csv (available in input_path inside the ChargePointsETLJob
class) contains a sample of data published by the UK Department for Transport and presents
information about the usage of electric vehicle charge points in 2017.

Here are five random rows from this file:

+---------+---------------+------------+-----------+------------+----------+--------+--------------------+
| CPID    | ChargingEvent | StartDate  | StartTime | EndDate    | EndTime  | Energy | PluginDuration     |
+---------+---------------+------------+-----------+------------+----------+--------+--------------------+
| AN07263 | 15554472      | 2017-10-29 | 13:30:00  | 2017-10-29 | 17:08:00 | 5.3    | 3.6333333333333333 |
| AN15092 | 15329256      | 2017-10-14 | 17:37:00  | 2017-10-15 | 05:26:00 | 19.2   | 11.816666666666666 |
| AN22594 | 2344473       | 2017-06-02 | 16:10:19  | 2017-06-03 | 13:03:21 | 11.5   | 20.88388888888889  |
| AN10218 | 12184545      | 2017-03-20 | 21:43:37  | 2017-03-21 | 20:18:29 | 12.1   | 22.58111111111111  |
| AN02137 | 11984777      | 2017-03-07 | 10:21:17  | 2017-03-08 | 18:10:15 | 7.8    | 31.816111111111113 |
+---------+---------------+------------+-----------+------------+----------+--------+--------------------+

The PluginDuration column stores the plugin duration in hours.

For each charge point, identified by its unique ID (CPID), we would like to know the duration (in
hours) of the longest plugin and the duration (in hours) of the average plugin.
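
As a sanity check on the sample data: in the first row the session runs from 13:30:00 to 17:08:00 on the same day, i.e. 3 hours 38 minutes, and 3 + 38/60 ≈ 3.63 hours matches that row's PluginDuration value.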
Requirements

To achieve this, please use PySpark and complete the ETL pipeline containing the following three
methods:

extract – this method should return a Spark dataframe which contains the raw data from the input
file in input_path

transform – this method should get the raw dataframe as an input parameter and return a dataframe
containing the following three columns: chargepoint_id, max_duration, avg_duration

load – this method should take the transformed dataframe as an input parameter and save the data
in parquet format to the output path in output_path

Example output

Here's an example row from the transformed dataframe returned by the transform method and saved
to the output parquet file:

+----------------+--------------+--------------+
| chargepoint_id | max_duration | avg_duration |
+----------------+--------------+--------------+
| AN06056        | 11.98        | 4.76         |
+----------------+--------------+--------------+

Hints

Please make sure that the output file contains one row for each charge point.

Please make sure that the columns are named correctly.

Please round numbers to two decimal places.


Solution

A possible implementation of the provided scaffold is shown below. The method bodies are a sketch based on the requirements above, assuming the input CSV has a header row:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


class ChargePointsETLJob:
    input_path = 'data/input/electric-chargepoints-2017.csv'
    output_path = 'data/output/chargepoints-2017-analysis'

    def __init__(self):
        self.spark_session = (SparkSession.builder
                              .master("local[*]")
                              .appName("ElectricChargePointsETLJob")
                              .getOrCreate())

    def extract(self):
        # Read the raw CSV, keeping the header row as column names and
        # letting Spark infer column types (PluginDuration becomes a double).
        return (self.spark_session.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(self.input_path))

    def transform(self, df):
        # For each charge point, compute the longest and the average plugin
        # duration in hours, rounded to two decimal places as required.
        return (df.groupBy("CPID")
                  .agg(F.round(F.max("PluginDuration"), 2).alias("max_duration"),
                       F.round(F.avg("PluginDuration"), 2).alias("avg_duration"))
                  .withColumnRenamed("CPID", "chargepoint_id"))

    def load(self, df):
        # Save the transformed dataframe in parquet format to the output path.
        df.write.mode("overwrite").parquet(self.output_path)

    def run(self):
        self.load(self.transform(self.extract()))
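
To run the job end to end, a hypothetical entry point (not part of the original scaffold) could look like this:

if __name__ == "__main__":
    # Runs extract -> transform -> load against the paths defined on the class.
    ChargePointsETLJob().run()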
