UNIVERSITI MALAYA
EXAMINATION FOR THE DEGREE OF MASTER OF DATA SCIENCE
ACADEMIC SESSION 2020/2021 : SEMESTER I
WQD7005: Data Mining
14th January 2020 from 8.00 am to 15th January 2020 5.00 pm
INSTRUCTIONS TO CANDIDATES:
Answer ALL questions (50 marks).
              (This exam contains 4 pages including the first title page)
                                                                                        WQD7005
PART A                                                                  (30 marks)
1) Define "Data Mining" in terms of Business Intelligence (keeping in mind the data
   transformation from Online Transaction Processing (OLTP) to Online Analytical
   Processing (OLAP)).
                                                                          (5 marks)
2) Suppose that the data for analysis includes the attribute age. The age values for the data
   tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
   33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.                                         (5 marks)
   a)   What is the mean of the data?                                           (1 mark)
   b)   What is the median?                                                     (1 mark)
   c)   What is the mode of the data?                                           (1 mark)
   d)   Use smoothing by bin means to smooth the above data, using a bin depth of 3.
        Illustrate your steps.                                                  (2 marks)
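
The calculations asked for in Question 2 can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a model answer. The helper name `smooth_by_bin_means` is an assumption introduced here, and the bins are assumed to be consecutive equal-depth partitions of the sorted values.

```python
from statistics import mean, median, multimode

# Age values exactly as listed in Question 2 (27 values, already sorted).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def smooth_by_bin_means(values, depth):
    """Replace every value by the mean of its equal-depth bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


age_mean = mean(ages)        # 809 / 27, about 29.96
age_median = median(ages)    # 14th of 27 sorted values: 25
age_modes = multimode(ages)  # [25, 35], so the data is bimodal
smoothed = smooth_by_bin_means(ages, depth=3)

# One representative mean per bin (9 bins of 3 values each).
bin_means = [round(m, 2) for m in smoothed[::3]]
# bin_means -> [14.67, 18.33, 21.0, 24.0, 26.67, 33.67, 35.0, 40.33, 56.0]
```

Each entry of `smoothed` is the mean of the bin that the corresponding sorted value falls into, so the first three ages all become 14.67, the next three 18.33, and so on.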
3) Suppose you have the following four dimension tables: Time, Customer, Employee and
   Product. Construct a snowflake schema by developing a "Sales" fact table. The linkage
   attribute in a dimension table can be used to split that table to form the snowflake
   schema. The aggregate variable of the fact table can be the "quantity" of products.
     Time                                            Customer
     OrderID (primary key)                           CustID (primary key)
     Order Date                                      Name
     Year                                            Address
     Quarter                                         CityID (linkage attribute)
     Month                                           City Name
                                                     Zip Code
                                                     State
                                                     Country
     Employee                                        Product
     EmpID (primary key)                             ProductID
     Employee Name                                   Product Name
     DepartmentID (linkage attribute)                Product Category
     Region                                          Product Description
     Territory
                                                                                        (5 marks)
4) Construct an FP (frequent pattern) tree from the following transactional database.
                                                                                  (5 marks)
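
As an illustration of the construction asked for in Question 4, the sketch below builds an FP-tree in Python. The transactions and the minimum support of 2 are hypothetical values invented for this example; they are not the database from the question. The names `FPNode` and `build_fp_tree` are also assumptions, and the header table with node links is omitted for brevity.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Pass 1: count the support of every item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken alphabetically).
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1  # shared prefixes accumulate counts
            node = child
    return root, frequent


# Hypothetical transactions, for illustration only.
transactions = [["a", "b"], ["b", "c", "d"], ["a", "b", "d"],
                ["a", "b", "c"], ["e", "b"]]
root, frequent = build_fp_tree(transactions, min_support=2)
```

With these toy transactions, item `e` is dropped as infrequent and every path starts at `b`, which occurs in all five transactions.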
5) Consider the dataset below of sales related to computer systems (e.g. hardware and
   software). We are required to learn a decision tree that predicts whether the profit goes
   up or down based on three features: condition, upgradable and type.
                                                                                       (5 marks)
                         Condition     Upgradable     Type     Profit
                         Old           Yes            S/W      Down
                         Old           No             S/W      Down
                         Old           No             H/W      Down
                         Mid           Yes            S/W      Down
                         Mid           Yes            H/W      Down
                         Mid           No             H/W      Up
                         Mid           No             S/W      Up
                         New           Yes            S/W      Up
                         New           No             H/W      Up
                         New           No             S/W      Up
Calculate the Information Gain of the feature "Condition" based on:
Entropy (Profit)
Entropy (Old)
Entropy (Mid)
Entropy (New)
Entropy (Condition)
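
The quantities listed above can be checked with a short Python sketch. It assumes that "Entropy (Condition)" means the weighted average entropy of Profit after splitting on Condition, and that entropy is measured in bits (log base 2). The function names are introduced here for illustration only.

```python
from collections import Counter
from math import log2

# (Condition, Profit) pairs taken from the table in Question 5.
rows = [("Old", "Down"), ("Old", "Down"), ("Old", "Down"),
        ("Mid", "Down"), ("Mid", "Down"), ("Mid", "Up"), ("Mid", "Up"),
        ("New", "Up"), ("New", "Up"), ("New", "Up")]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())


profit = [p for _, p in rows]
entropy_profit = entropy(profit)  # 5 Up vs 5 Down

# Entropy of Profit within each value of Condition.
by_condition = {}
for condition, p in rows:
    by_condition.setdefault(condition, []).append(p)
entropy_by_value = {c: entropy(ps) for c, ps in by_condition.items()}

# Weighted average entropy after the split, then the information gain.
entropy_condition = sum(len(ps) / len(rows) * entropy(ps)
                        for ps in by_condition.values())
information_gain = entropy_profit - entropy_condition
```

Under these assumptions the split leaves only the "Mid" branch impure, so the weighted entropy is 0.4 bits and the information gain is 1.0 - 0.4 = 0.6 bits.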
6) Write down the steps of the DBSCAN algorithm.
                                                                                      (5 marks)
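
For reference, the usual steps of DBSCAN can be sketched in plain Python as below. This is a minimal, unoptimised illustration with Euclidean distance, not a definitive implementation. The parameter names `eps` and `min_pts`, the helper `region_query`, and the toy points are assumptions made for this example.

```python
from math import dist

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        # Step 1: find the eps-neighbourhood of an unvisited point.
        neighbours = region_query(points, i, eps)
        # Step 2: too few neighbours, so mark it as noise for now.
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        # Step 3: a core point starts a new cluster.
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbours)
        k = 0
        # Step 4: grow the cluster through density-reachable points.
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is also a core point
                seeds.extend(j_neighbours)
    # Step 5: points still labelled NOISE belong to no cluster.
    return labels


# Hypothetical 2-D points: two dense groups and one outlier.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels -> [0, 0, 0, 1, 1, 1, -1]
```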
PART B                                                                             (20 marks)
Instructions: Answer the following questions using any data mining tool. Explain how you
perform each step (include screenshots). Download “Data(Exam).csv” from Spectrum (a
description of this data is available at https://archive.ics.uci.edu/ml/datasets/Zoo).
1) Select the best non-target features using one of the statistical methods "correlation",
   "Chi-square", or "ANOVA". Your solution should describe the relevant statistical findings.
                                                                                    (5 marks)
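
If the Chi-square option is chosen, the statistic for one categorical feature against the target can be computed as in the sketch below. The function name `chi_square_statistic` and the toy feature and class values are hypothetical; they are not taken from “Data(Exam).csv”. Converting the statistic to a p-value would need a chi-square distribution table or a statistics package, which is left out here.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic and degrees of freedom for two
    categorical columns of equal length."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    chi2 = 0.0
    for f, f_total in feature_totals.items():
        for t, t_total in target_totals.items():
            # Expected count if feature and target were independent.
            expected = f_total * t_total / n
            chi2 += (observed[(f, t)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(target_totals) - 1)
    return chi2, dof


# Hypothetical toy columns: 8 rows, binary features, two classes.
target = ["A", "A", "A", "A", "B", "B", "B", "B"]
informative = [1, 1, 1, 1, 0, 0, 0, 0]    # perfectly tracks the class
uninformative = [1, 1, 0, 0, 1, 1, 0, 0]  # independent of the class

chi2_informative, dof = chi_square_statistic(informative, target)
chi2_uninformative, _ = chi_square_statistic(uninformative, target)
```

A larger statistic at the same degrees of freedom indicates a stronger association with the target, so features can be ranked by it; in this toy case the first feature scores 8.0 and the second 0.0.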
2) Experiment with the classification algorithms Naive Bayes, Random Forest and Support
   Vector Machine, and identify the best of the three using 10-fold cross validation.
   Justify your choice of algorithm in terms of classification accuracy and false positive
   rate.
                                                                                   (10 marks)
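
The two evaluation measures named in this question, and the way 10-fold cross validation partitions the data, can be sketched in plain Python as below. The function names and the toy labels are assumptions for illustration. For a multi-class target the false positive rate is computed one class at a time (one-vs-rest). Real tools usually shuffle or stratify the folds, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def false_positive_rate(y_true, y_pred, positive):
    """FP / (FP + TN), treating `positive` as the positive class."""
    fp = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred)
             if t != positive and p != positive)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Hypothetical true and predicted labels for one test fold.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

fold_accuracy = accuracy(y_true, y_pred)             # 4 of 6 correct
fpr_b = false_positive_rate(y_true, y_pred, "b")     # 1 of 4 negatives
```

In a full experiment, each classifier would be trained on the `train` indices and scored on the `test` indices of every fold, and the per-fold accuracy and false positive rate would then be averaged before comparing the three algorithms.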
3) Discuss the performance of all three algorithms in terms of the Receiver Operating
   Characteristic (ROC) curve.
                                                                              (5 marks)
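
As a reminder of how a ROC curve is obtained, the sketch below computes the curve's points and the area under it (AUC) for one binary, one-vs-rest problem. The function names, labels and scores are hypothetical and serve only to illustrate the calculation. For a multi-class target one such curve would be drawn per class.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the decision threshold.

    y_true holds 1 for the positive class and 0 otherwise; scores are
    the classifier's confidence that each sample is positive.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples that share a score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical labels and scores for six test samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]

points = roc_points(y_true, scores)
area = auc(points)  # 8/9, about 0.889
```

A curve closer to the top-left corner, and hence an AUC closer to 1, indicates a classifier that achieves a high true positive rate at a low false positive rate, while an AUC near 0.5 is no better than random guessing.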
                                           END