Big Data Analytics
ST2BIG
2020 - 2021
December 16, 2020
Lecturer: Issam Falih
Time Limit: 1 hour 45 minutes
  Firstname:
  Lastname:
  Promotion:
  Group:
  Rules (please read carefully)
 - You have 1 hour and 45 minutes for the exam.
   - Before you start, write down your name and student number on this page.
   - The use of material (book, slides, laptop, etc.) during the exam is not allowed.
   - Every multiple-choice question has just one correct answer. To select an answer, circle
     the letter.
 - Any answers or marks on pages other than the answer sheet, except for the last two
   questions, will be completely ignored even if correct.
 - If you picked an answer and then would like to change it, make it very clear with an
   additional circle around the newly chosen answer. Any ambiguity will result in no
   points being granted.
 - Only use a black or a blue pen. DO NOT use a pencil. DO NOT use a red pen.
 - The total number of available points is 40.
Good luck!
1        Multiple-choice questions [15 pts]
 1. Which type of data can Hadoop deal with?

     (a) structured
     (b) semi-structured
     (c) unstructured
     (d) All of the above

 2. What is a Resilient Distributed Dataset?

     (a) An immutable distributed collection of elements
     (b) A mutable distributed collection of elements
     (c) A write-enabled distributed collection of elements
     (d) A spilled distributed collection of elements

 3. In Hadoop, the optimal input split size is the same as the

     (a) block size
     (b) average file size in the cluster
     (c) minimum hard disk size in the cluster
     (d) number of DataNodes

 4. How does the DataNode protocol work?

     (a) The NameNode always initiates the connection, and the DataNodes only answer
     (b) The DataNode always initiates the connection, and the NameNode only answers
     (c) The client initiates the connection to the DataNode
     (d) Both the NameNode and the DataNode may initiate the connection

 5. Which among the following is the Resource Management Layer?

     (a) MapReduce
     (b) YARN
     (c) HDFS
     (d) HIVE

 6. Which one of these technologies deals with graph data?

     (a) Google BigTable
     (b) MongoDB
     (c) Apache HBase
     (d) Neo4j

 7. In MapReduce, the shuffling phase takes place...

     (a) Before and right after mapping
     (b) Before mapping
     (c) After mapping and before reducing
     (d) After reducing

 8. Which one of these input formats cannot be processed by MapReduce at all?

     (a) Key-value pairs
     (b) Unstructured lines of text
     (c) Tables
     (d) None of them: all three formats are in fact supported by MapReduce

 9. For a 129 MB file, how many blocks will be created?

     (a) 3
     (b) 1
     (c) 2
     (d) 4

10. Which languages are most appropriate for querying graph databases?

     (a) Cypher
     (b) Java
     (c) XQuery, JSONiq
     (d) SQL

11. The default replication factor for the HDFS file system in Hadoop is which of the
    following?

     (a) 1
     (b) 4
     (c) 2
     (d) 3

12. When the primary NameNode crashes, the secondary NameNode takes over. NameNodes
    do not persistently store (i.e. write to disk) the location of blocks. How does the
    secondary NameNode learn about the blocks' locations in the cluster?

     (a) DataNodes send regular heartbeat messages, which include information about the
         blocks they maintain
     (b) The secondary NameNode sends a special message to all DataNodes, asking for
         their block information
     (c) Before a crash, the primary NameNode always copies its memory content to the
         secondary NameNode
     (d) The secondary NameNode replays the edit log, which contains the blocks'
         locations

13. To start a Hadoop cluster, it is necessary to start both of which two clusters?

     (a) HDFS and YARN
     (b) SPARK and CLOUDERA
     (c) YARN and SPARK
     (d) NoSQL and HDFS

14. Let's assume we have a Hadoop cluster with 12 Petabytes of disk space and replication
    factor 4. What can you say about the maximum possible file size?

     (a) The maximum size of a file is restricted to the disk size of the largest DataNode.
     (b) The maximum size of a file cannot exceed 3 Petabytes.
     (c) The maximum size of a file is restricted by the physical disk space available on
         the NameNode.
     (d) Files of any size can be processed in the cluster.

15. Bob has a Hadoop cluster of 20 machines with the following Hadoop setup: replication
    factor 2, 128MB input split size. Each machine has 500GB of HDFS disk space. The
    cluster is currently empty (no job, no data). Bob intends to upload 4 Terabytes of
    plain text (in 4 files of approximately 1 Terabyte each), followed by running Hadoop's
    standard WordCount job. What is going to happen?

     (a) The data upload fails at the first file: it is too large to fit onto a DataNode.
     (b) The data upload fails at a later stage: the disks are full.
     (c) WordCount fails: too many input splits to process.
     (d) WordCount runs successfully.

16. In Hadoop, the optimal input split size is the same as the

     (a) average file size in the cluster.
     (b) block size.
     (c) minimum hard disk size in the cluster.
     (d) number of DataNodes.

17. The time it takes for a Hadoop job's Map task to finish mostly depends on:

     (a) the placement of the NameNode in the cluster.
     (b) the placement of the blocks required for the Map task.
     (c) the duration of the job's shuffle & sort phase.
     (d) the duration of the job's Reduce task.

18. HDFS is inspired by which of the following Google projects?

     (a) BigTable
     (b) GFS
     (c) MapReduce
     (d) MongoDB

19. What happens if the number of reducers is set to 0?

     (a) A reduce-only job takes place
     (b) A map-only job takes place
     (c) The reducer output will be the final output

20. Which of the following is the correct sequence of the MapReduce flow?

     (a) Map > Combine > Reduce
     (b) Combine > Reduce > Map
     (c) Map > Reduce > Combine
     (d) Reduce > Combine > Map

21. Which of the following phases occur simultaneously?

     (a) Shuffle and Map
     (b) Reduce and Sort
     (c) Shuffle and Sort

22. Which statement is true about the passive NameNode in Hadoop?

     (a) It is a standby NameNode
     (b) It simply acts as a slave
     (c) It provides a fast failover
     (d) All of these

23. Which concept is not part of the "3 V's of Big Data"?

     (a) Velocity
     (b) Variety
     (c) Valorisation
     (d) Volume

24. Which command is used to check the status of all daemons running in HDFS?

     (a) jps
     (b) fsck
     (c) distcp
     (d) None of the above

25. Which of the following statements is NOT true? It is possible to run a Hadoop job
    which consists only of a (two correct answers)

     (a) Reducer
     (b) Combiner
     (c) Mapper
     (d) None of the above

26. Which statement is true about NameNode High Availability?

     (a) It solves the single point of failure
     (b) It provides high scalability
     (c) It reduces storage overhead to 50%
     (d) All of the above

27. Which of the following statements is true?

     (a) The input to the Mapper is the output of the Reducer.
     (b) The input to the Combiner is the output of the Reducer.
     (c) The input to the Combiner is the input of the Mapper.
     (d) The input to the Reducer is the output of the Mapper.

28. Where in the Hadoop framework is the mapping from files to blocks stored?

     (a) DataNode.
     (b) BlockNode.
     (c) NameNode.
     (d) FileNode.

29. Which among the following is the ultimate authority that arbitrates resources among
    all the applications in the system?

     (a) NodeManager
     (b) ResourceManager
     (c) ApplicationMaster
     (d) All of the above
2 Multiple Answer Questions [5 pts]
    1. Which of the following operations require the client to communicate with the NameN-
       ode?
       (a) A client deleting a file from HDFS.
       (b) A client writing to a new file on HDFS.
       (c) A client appending data to the end of an existing file on HDFS.
       (d) A client reading a file from HDFS.
    2. Bob has a Hadoop cluster of 20 machines with the following Hadoop setup: replication
       factor 2, 128MB input split size. Each machine has 500GB of HDFS disk space. The
       cluster is currently empty (no job, no data). Bob intends to upload 4 Terabytes of plain
       text (in 4 files of approximately 1 Terabyte each), followed by running Hadoop’s standard
       WordCount job. What is going to happen?
        (a) The data upload fails at the first file: it is too large to fit onto a DataNode
       (b) The data upload fails at a later stage: the disks are full
        (c) WordCount fails: too many input splits to process
       (d) WordCount runs successfully
    3. The distributed file systems GFS and HDFS were devised with a number of use cases
       (data scenarios) in mind. Consider the following data storage scenarios:
       (S1) A global company dealing with the data of its one hundred million employees (salary,
       bonuses, age, performance, etc.)
       (S2) A Web search engine’s query log (each search request by a user is logged)
       (S3) A hospital’s medical imaging data generated during an MRI scan
       (S4) Data sent by the Hubble telescope to the Space Telescope Science Institute
       For which of these scenarios are GFS or HDFS a good choice?
        (a) Scenarios (S1) and (S4)
       (b) Scenarios (S2) and (S3)
        (c) Scenarios (S2) and (S4)
       (d) Scenarios (S1) and (S3)
3 Free Form Questions [20 pts]
    1. In the shuffle & sort phase, a job with m mappers and r reducers may involve up to m × r
       distinct copy operations. In which scenario are exactly m × r copy operations necessary?
    2. How do you recover a NameNode when it is down?
3. Why is Hadoop used for Big Data Analytics?
4. A large cluster runs HDFS on 100 nodes. Each node in the cluster, including the NameN-
   ode, has 16 Terabytes of hard disk storage and 64 Gigabytes of main memory available.
   The cluster uses a block-size of 64 Megabytes and a replication factor of 3. The master
   maintains 64 bytes of metadata for each 64MB block.
   (a) What is the cluster’s disk storage capacity? Explain your answer.
   (b) A client downloads a 1 Gigabyte file from the cluster: explain precisely how data
       flows between the client, NameNode and the DataNodes.
5. Name three different purposes of Heartbeat messages in a Hadoop cluster.
6. Write MapReduce pseudocode for the following problems. You will be graded on how
   appropriate your solution is to the MapReduce framework and on the quality of your
   descriptions. Note that for some of these problems you may have to write more than
   one pair of map/reduce functions.
Problem 1: Anagrams
           Given a file containing text, the program should output key/value pairs where the
           value is a comma-separated list of lines in the file that are anagrams, i.e. they use
           exactly the same letters (ignoring spaces). The output key is up to you and will
           depend on your implementation.
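           For reference, one possible model answer is sketched below in Python (all names
           are illustrative, and the small driver merely stands in for the shuffle & sort
           that the MapReduce framework would normally perform between map and reduce):

from collections import defaultdict

def map_anagram(line):
    # Key: the line's letters, lowercased, with spaces removed, in sorted
    # order; lines that are anagrams of each other share the same key.
    key = "".join(sorted(line.replace(" ", "").lower()))
    yield key, line

def reduce_anagram(key, lines):
    # Value: comma-separated list of all input lines sharing this key.
    yield key, ",".join(lines)

def simulate_mapreduce(records, mapper, reducer):
    # Stand-in for the framework's shuffle & sort phase.
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    for k in sorted(groups):
        yield from reducer(k, groups[k])

if __name__ == "__main__":
    lines = ["listen", "silent", "the eyes", "they see"]
    for key, value in simulate_mapreduce(lines, map_anagram, reduce_anagram):
        print(key, value)

           Choosing a canonical form of the letters (sorted, case-folded, spaces dropped) as
           the key lets the framework's grouping collect all anagrams together on its own.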
Problem 2: Feature normalization
             Given a file containing a list of examples of the form:
             <label> <feature_1> <feature_2> <feature_3> ... <feature_m>
             we want to generate a version of this file where the feature values have been
             mean-centered. The output examples should include the label and should also appear
             in the same order as in the original file.
             You may assume that each example has all of the features defined.
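             For reference, one possible model answer uses two passes, sketched below in the
             same illustrative Python style: a first map/reduce pair computes the per-feature
             means, and a second, map-only pass subtracts them. In real Hadoop the means would
             be shipped to the mappers as side data (e.g. via the distributed cache); keying
             records by their original line number is one way to preserve the input order.

from collections import defaultdict

def map_sum(line_no, line):
    # parts[0] is the label; the remaining fields are feature values.
    parts = line.split()
    for idx, value in enumerate(parts[1:]):
        yield idx, float(value)

def reduce_mean(idx, values):
    # One mean per feature index.
    yield idx, sum(values) / len(values)

def map_center(line_no, line, means):
    # Map-only second pass: subtract each feature's mean, keep the label,
    # and key by line number so the original order can be restored.
    parts = line.split()
    centered = [f"{float(v) - means[i]:g}" for i, v in enumerate(parts[1:])]
    yield line_no, " ".join([parts[0]] + centered)

def run(lines):
    groups = defaultdict(list)  # stand-in for shuffle & sort
    for no, line in enumerate(lines):
        for k, v in map_sum(no, line):
            groups[k].append(v)
    means = {k: m for k, vs in groups.items() for _, m in reduce_mean(k, vs)}
    for no, line in enumerate(lines):
        yield from map_center(no, line, means)

if __name__ == "__main__":
    examples = ["+1 2.0 10.0", "-1 4.0 30.0", "+1 6.0 20.0"]
    for _, example in sorted(run(examples)):
        print(example)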
THE END.