Multiple Response Tasks
1.
Which of the following are the advantages of the ‘schema on read’ approach over ‘schema on write’?
2.
When it comes to big data tools, what does the acronym YARN stand for?
3.
You are trying to decide whether to use a single machine or cluster computing tools in your next
project. Which of the following is the premise for using single-machine architecture?
4.
Which of the following are useful Python packages for data processing and analysis projects?
A: Antigravity
B: Pandas
C: Seaborn
D: Pyglet
5.
Your system is getting more traction and starts to require more computing power. Which of the
following are reasons for scaling your system horizontally as opposed to vertically?
C: You are unable to split your app into smaller logical blocks.
6.
Which phase is usually the one we would like to get rid of, but might also be the most memory-
intensive?
A: Map
B: Shuffle
C: Reduce
7.
C: With an ELT model, users can run transformations directly on the raw data.
8.
9.
A: It is a signal sent from a name node to data nodes informing them about cluster health.
B: It is a signal sent from a name node to external applications informing them about cluster health.
C: It is a signal sent from external applications to a name node asking about system health.
D: It is a signal sent from data nodes to a name node informing it about node health.
10.
A. Database, B. Visualization, C. Orchestration, D. Analytics, E. Machine Learning, F. Streaming
Task 2
1.
C: In eventually consistent systems, reads are often faster than in strongly consistent ones.
D: Eventually consistent systems might not always return the same result.
2.
B: Concurrency of components
D: Scalability
3.
What are the main Lambda architecture layers?
4.
In which of the following situations would you recommend using a NoSQL database?
5.
B: Data lineage gives visibility while greatly simplifying the ability to trace errors.
C: Data lineage provides a way of tracking data from its origin to its destination.
D: Data lineage is a process of extracting data from output coming from another program.
6.
You are looking for a distributed data store that will continue to work if one of the nodes fails, and
will deliver the same, most recent result to all clients. Which two guarantees of the CAP theorem should
your system fulfill?
A: CA
B: CP
C: AP
7.
Which of the following statements about big data file formats are true?
D: Parquet is better optimized for use with Apache Spark, whereas ORC works better with Hive.
8.
Which of the following would be reasons for building a data lake as opposed to a data warehouse for
your next project?
C: You want to store as much data as possible and decide how to use it later.
9.
You work for a financial institution and are trying to decide whether to build a serverless data
platform using tools offered by one of the cloud providers. Which of the following would be
drawbacks of the serverless architecture?
10.
What are some of the challenges when using real-time data processing?
Copyright 2009–2024 by Codility Limited. All Rights Reserved. Unauthorized copying, publication or
disclosure prohibited.
Task 3
Task description
A ski resort company is planning to construct a new ski slope using a pre-existing network of
mountain huts and trails between them. A new slope has to begin at one of the mountain huts, have
a middle station at another hut connected with the first one by a direct trail, and end at a third
mountain hut which is also connected by a direct trail to the second hut. The altitude of the three
huts chosen for constructing the ski slope has to be strictly decreasing.
You are given two SQL tables, mountain_huts and trails, with the following structure:
create table mountain_huts (
    id integer not null,
    name varchar(40) not null,
    altitude integer not null,
    unique(name),
    unique(id)
);

create table trails (
    hut1 integer not null,
    hut2 integer not null
);
Each entry in the table trails represents a direct connection between huts with IDs hut1 and hut2.
Note that all trails are bidirectional.
Create a query that finds all triplets (startpt, middlept, endpt) representing the mountain huts that
may be used for construction of a ski slope. The output returned by the query can be ordered in any way.
Examples:
Assume that:
for every two huts there is at most one direct trail connecting them;
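One possible approach, sketched against an in-memory SQLite database so it can be run directly. The hut names, altitudes, and trails below are invented sample data, and the CTE-based query is just one way to express the triplet search, not the official solution:

```python
import sqlite3

# Invented sample data mirroring the task's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table mountain_huts (
    id integer not null,
    name varchar(40) not null,
    altitude integer not null,
    unique(name),
    unique(id)
);
create table trails (
    hut1 integer not null,
    hut2 integer not null
);
insert into mountain_huts values
    (1, 'Dakonat', 1900), (2, 'Natisa', 2100),
    (3, 'Gajantut', 1600), (4, 'Rifat', 782);
insert into trails values (1, 3), (2, 3), (3, 4);
""")

# Trails are bidirectional, so first expand each trail into both directions,
# then chain two hops (start -> middle -> end) with strictly decreasing altitude.
query = """
with edges as (
    select hut1 as a, hut2 as b from trails
    union all
    select hut2 as a, hut1 as b from trails
)
select h1.name as startpt, h2.name as middlept, h3.name as endpt
from edges e1
join edges e2 on e1.b = e2.a
join mountain_huts h1 on h1.id = e1.a
join mountain_huts h2 on h2.id = e1.b
join mountain_huts h3 on h3.id = e2.b
where h1.altitude > h2.altitude and h2.altitude > h3.altitude
"""
for row in conn.execute(query):
    print(row)
# e.g. ('Dakonat', 'Gajantut', 'Rifat') and ('Natisa', 'Gajantut', 'Rifat')
```

Expanding the edge list up front keeps the descent check to a simple pair of inequalities in the final WHERE clause.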
Task 4
Task description
ETL Pipeline
The objective of this task is to create an ETL job which will read data from a file, transform it into the
desired state and save it to an output location.
The input file contains a sample of data published by the UK Department for Transport and presents
information about the usage of electric vehicle charge points in 2017.
Using the plugin duration column (PluginDuration), we would like to know the duration (in hours) of the longest plugin and the duration (in hours) of
the average plugin.
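The intended aggregation can be illustrated in plain Python before reaching for PySpark. The record layout and field names below are invented for illustration and do not reflect the real dataset's schema:

```python
from collections import defaultdict

# Hypothetical plugin durations (in hours) for two charge points.
records = [
    {"chargepoint_id": "CP001", "plugin_duration": 2.5},
    {"chargepoint_id": "CP001", "plugin_duration": 4.0},
    {"chargepoint_id": "CP002", "plugin_duration": 1.0},
    {"chargepoint_id": "CP002", "plugin_duration": 3.0},
]

# Group the durations by charge point.
durations = defaultdict(list)
for rec in records:
    durations[rec["chargepoint_id"]].append(rec["plugin_duration"])

# For each charge point: the longest plugin and the average plugin.
summary = {
    cp: {"max_duration": max(ds), "avg_duration": round(sum(ds) / len(ds), 2)}
    for cp, ds in durations.items()
}
print(summary)
# e.g. CP001 -> max 4.0, avg 3.25
```

The PySpark pipeline below performs the same group-by-and-aggregate, just distributed over a dataframe.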
Requirements
To achieve this, please use PySpark and complete the ETL pipeline containing the following three
methods:
extract – this method should return a Spark dataframe which contains the raw data from the input file at
input_path
transform – this method should take a raw dataframe as an input parameter and return a dataframe containing
the following three columns:
load – this method should take the transformed dataframe as an input parameter and save the data in
Parquet format to the output location given by output_path
Example output
Here's an example row from the transformed dataframe returned by the transform method:
Hints
Please make sure that the output file contains one row for each charge point.
Solution
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


class ChargePointsETLJob:
    input_path = 'data/input/electric-chargepoints-2017.csv'
    output_path = 'data/output/chargepoints-2017-analysis'

    def __init__(self):
        self.spark_session = (SparkSession.builder
                              .master("local[*]")
                              .appName("ElectricChargePointsETLJob")
                              .getOrCreate())

    def extract(self):
        # Read the raw CSV with a header row, letting Spark infer column types.
        return self.spark_session.read.csv(self.input_path, header=True, inferSchema=True)

    def transform(self, df):
        # "CPID" and "PluginDuration" are assumed column names from the DfT dataset;
        # adjust if the input schema differs.
        return df.groupBy(F.col("CPID").alias("chargepoint_id")).agg(
            F.round(F.max("PluginDuration"), 2).alias("max_duration"),
            F.round(F.avg("PluginDuration"), 2).alias("avg_duration"))

    def load(self, df):
        # Write one row per charge point in Parquet format.
        df.write.mode("overwrite").parquet(self.output_path)

    def run(self):
        self.load(self.transform(self.extract()))