Data Warehousing and Data Mining
Data Preprocessing
 Content
 Why Data Preprocessing?
 Descriptive data summarization
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Why Data Preprocessing?
 Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of
   interest, or containing only aggregate data
    E.g., Occupation=“ ” (missing data)
 noisy: containing errors or outliers that deviate from what is expected.
    E.g., Salary=“−10” (an error)
  inconsistent: containing discrepancies in codes or names, e.g.
    Age=“42”, Birthday=“03/07/2010”
    Was rating “1, 2, 3”, now rating “A, B, C”
    discrepancy between duplicate records
 No quality data, no quality mining results!
  Quality decisions must be based on quality data
  Data warehouse needs consistent integration of quality data
 Data Quality Measures
 Well-accepted multidimensional data quality measures are the following:
  Accuracy (no errors, no outliers)
      Reasons for inaccurate data: faulty devices, human error during
       entry, users submitting incorrect data (e.g., Jan 1 for birthday),
       etc.
    Completeness (no missing values)
    Consistency (no inconsistent values and attributes)
    Timeliness (appropriateness)
    Believability (acceptability)
    Interpretability (easy to understand)
 Descriptive data summarization
 A descriptive summary of the data can be generated with the help of
  measures of central tendency and measures of dispersion of the data
 Measures of central tendency include
    Mean
    Median
    Mode
    Mid-Range
 Measures of dispersion include
    range
    The five number summary (based on Quartiles)
    Interquartile range (IQR)
    Standard deviation
 Mean
 The mean is the sum of the values, divided by the
  total number of values.
 Appropriate for data distributed normally
 Mean is the most important quantity for describing a
  dataset, but it is sensitive to extreme values of an
  attribute (e.g., outliers)
 E.g. Find the mean: 20, 26, 40, 36, 23, 42, 35, 24, 30
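A minimal Python sketch of this example (values taken from the slide):

    values = [20, 26, 40, 36, 23, 42, 35, 24, 30]
    mean = sum(values) / len(values)   # 276 / 9 ≈ 30.7
    print(mean)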
 Median
 The median is the halfway point in a data set. Before
  you can find this point, the data must be arranged in
  order. When the data set is ordered, it is called a data
  array.
 E.g. The number of rooms in the seven hotels in
  downtown Pittsburgh is 713, 300, 618, 595, 311, 401,
  and 292. Find the median.
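A minimal Python sketch, using the standard statistics module:

    import statistics
    rooms = [713, 300, 618, 595, 311, 401, 292]
    # the ordered data array is 292, 300, 311, 401, 595, 618, 713; the middle value is 401
    print(statistics.median(rooms))    # 401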
 Mode
 The mode is the value that occurs most often in the
  data set. It is sometimes said to be the most typical
  case.
 A data set that has only one value that occurs with the
  greatest frequency is said to be unimodal.
 E.g. Find the mode of the signing bonuses of eight NFL
  players for a specific year. The bonuses in millions of
  dollars are 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10
 Since $10 million occurred 3 times—a frequency larger
  than any other number—the mode is $10 million.
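A minimal Python sketch of this example:

    import statistics
    bonuses = [18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10]
    print(statistics.mode(bonuses))    # 10 occurs three times, so the mode is 10 (million)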
 Mid-Range
 The midrange is a rough estimate of the middle. It is
  found by adding the lowest and highest values in the
  data set and dividing by 2. It is a very rough estimate
  of the average and can be affected by one extremely
  high or low value.
 E.g. In the last two winter seasons, the city of
  Brownsville, Minnesota, reported these numbers of
  water-line breaks per month. Find the midrange: 2, 3,
  6, 8, 4, 1
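A minimal Python sketch of this example:

    breaks = [2, 3, 6, 8, 4, 1]
    midrange = (min(breaks) + max(breaks)) / 2    # (1 + 8) / 2 = 4.5
    print(midrange)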
 Range
 The range is the highest value minus the lowest
  value. The symbol R is used for the range.
  R = highest value - lowest value
 Standard deviation
 The variance is the average of the squares of the distances of each
  value from the mean. The formula for the population variance is

      σ² = Σ(X − μ)² / N

 where
   X  an individual value
   μ  the population mean
   N  the population size
 The standard deviation σ is the square root of the variance.
   Example
 Find the variance and standard deviation for brand B. The
  months were: 35, 45, 30, 35, 40, 25
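A minimal Python sketch of the population variance and standard deviation for these values:

    months = [35, 45, 30, 35, 40, 25]
    n = len(months)
    mu = sum(months) / n                                # population mean = 35
    variance = sum((x - mu) ** 2 for x in months) / n   # 250 / 6 ≈ 41.7
    std_dev = variance ** 0.5                           # ≈ 6.5
    print(variance, std_dev)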
 Quartiles
 Quartiles divide the distribution into four groups,
  separated by Q1, Q2, Q3.
 Finding Data Values Corresponding to Q1, Q2, and
  Q3
  Step 1 Arrange the data in order from lowest to highest.
  Step 2 Find the median of the data values. This is the value
   for Q2.
  Step 3 Find the median of the data values that fall below Q2.
   This is the value for Q1.
  Step 4 Find the median of the data values that fall above Q2.
   This is the value for Q3.
        Quartiles: Example
 Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18.
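A minimal Python sketch that follows the median-split procedure above:

    def median(vals):
        vals = sorted(vals)
        n, mid = len(vals), len(vals) // 2
        return vals[mid] if n % 2 else (vals[mid - 1] + vals[mid]) / 2

    data = sorted([15, 13, 6, 5, 12, 50, 22, 18])   # 5, 6, 12, 13, 15, 18, 22, 50
    q2 = median(data)                               # 14
    q1 = median(data[:len(data) // 2])              # median of the values below Q2 -> 9
    q3 = median(data[(len(data) + 1) // 2:])        # median of the values above Q2 -> 20
    print(q1, q2, q3)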
 IQR and Outliers
 Interquartile range (IQR) is defined as the difference
  between Q3 and Q1, i.e., IQR = Q3 − Q1
   It is used to identify outliers: extremely high or extremely low
    data values when compared with the rest of the data values.
 Check the data set for any data value that is greater
  than Q3 + 1.5·IQR or below Q1 − 1.5·IQR
 For the previous example data
     Q3 + 1.5·IQR = 36.5 and
     Q1 − 1.5·IQR = −7.5
 50 is outside this interval; hence, it can be considered
  an outlier.
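A minimal Python sketch of this outlier check, reusing the quartiles computed above:

    data = [5, 6, 12, 13, 15, 18, 22, 50]
    q1, q3 = 9, 20
    iqr = q3 - q1                                  # 11
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # -7.5 and 36.5
    print([x for x in data if x < low or x > high])   # [50] is flagged as an outlier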
 Major Tasks in Data Preprocessing
 Data cleaning
   Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
 Data integration
   Integration of multiple databases, data cubes, or files
 Data reduction
   Dimensionality reduction
   Numerosity reduction
   Data compression
 Data transformation and data discretization
   Normalization
   Concept hierarchy generation
 Data Cleaning: Missing Data
 Causes for missing data
   equipment malfunction
   inconsistent with other recorded data and thus deleted
   data not entered due to lack of understanding
   certain data may not be considered important at the time of entry and
    hence left blank
   history or changes of the data were not registered
 Missing data may need to be inferred.
       Missing Data Example
Name          SSN           Address                 Phone #        Date         Acct Total
John Doe      111-22-3333   1 Main St, Bedford, Ma  111-222-3333   2/12/1999    2200.12
John W. Doe                 Bedford, Ma                            7/15/2000    12000.54
John Doe      111-22-3333                                          8/22/2001    2000.33
James Smith   222-33-4444   2 Oak St, Boston, Ma    222-333-4444   12/22/2002   15333.22
Jim Smith     222-33-4444   2 Oak St, Boston, Ma    222-333-4444                12333.66
Jim Smith     222-33-4444   2 Oak St, Boston, Ma    222-333-4444
  How to Handle Missing Data?
 Ignore the tuple: not effective when the percentage of
  missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
   a global constant : e.g., “unknown”, a new class?!
   the attribute mean
   the attribute mean for all samples belonging to the same class:
    smarter
   the most probable value: inference-based such as Bayesian
    formula or decision tree
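A minimal sketch of the mean-based fill-in options, assuming pandas is available and using a small hypothetical table (column names are illustrative):

    import pandas as pd

    df = pd.DataFrame({"cls": ["A", "A", "B", "B"],
                       "salary": [2200.0, None, 15333.0, 12333.0]})

    mean_fill = df["salary"].fillna(df["salary"].mean())          # attribute mean
    class_fill = df.groupby("cls")["salary"].transform(
        lambda s: s.fillna(s.mean()))                             # mean within the same class
    print(mean_fill.tolist(), class_fill.tolist())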
 Data Cleaning: Noisy Data
 Noise: random error or variance in a measured
  variable
 Incorrect attribute values may be due to
    faulty data collection instruments (e.g. OCR)
    data entry problems
    data transmission problems
    technology limitation
    inconsistency in naming convention
Data Cleaning: How to catch Noisy Data
  Manually check all data : tedious + infeasible?
  Sort data by frequency
   ‘green’ is more frequent than ‘rgeen’
   Works well for categorical data
  Use, say, numerical constraints to catch corrupt data
   Weight can’t be negative
   People can’t have more than 2 parents
   Salary can’t be less than Birr 300
  Use statistical techniques to catch corrupt data
   Check for outliers (the case of the 8 meters man)
   Check for correlated outliers using n-gram (“pregnant male”)
       People can be male
       People can be pregnant
       People can’t be male AND pregnant
 How to Handle Noisy Data?
 Binning
  first sort data and partition into bins
  Choose the number of bins (N) and do binning
    The bins can be equal-depth or equal-width
  then one can smooth by bin means, smooth by bin median,
   smooth by bin boundaries, etc.
 Regression
  smooth by fitting the data into regression functions
 Clustering
  detect and remove outliers
 Combined computer and human inspection
  detect suspicious values and check by human
 Binning
 Equal-width (distance) partitioning:
  It divides the range into N intervals of equal size: uniform grid
  if A and B are the lowest and highest values of the attribute,
   the width of intervals will be: W = (B-A)/N.
  The most straightforward
  But outliers may dominate presentation
  Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
  It divides the range into N intervals, each containing
   approximately the same number of samples
  Good data scaling
  Managing categorical attributes can be tricky.
 Equal-width Example
 Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21,
  29, 15, 9, 25)
  Determine the number of bins N (say 3)
  Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  Determine the range R = Max – Min = 30
  Divide the range into N intervals of equal width, where the i-th bin is
   [Xi-1, Xi), X0 = Min, XN = Max, and Xi = Xi-1 + R/N (here R/N = 10)
  Hence X0 = 4, X1 = 14, X2 = 24, and X3 = 34
  Therefore:
      Bin 1 = 4,8,9
      Bin 2 = 15, 21, 21
      Bin3 = 24, 25, 26, 28, 29, 34
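A minimal Python sketch of this equal-width procedure:

    data = sorted([24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25])
    N = 3
    lo, hi = min(data), max(data)
    width = (hi - lo) / N                        # (34 - 4) / 3 = 10
    bins = [[] for _ in range(N)]
    for x in data:
        i = min(int((x - lo) // width), N - 1)   # the last bin is closed on the right
        bins[i].append(x)
    print(bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]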
Equal-depth partitioning Example
 It divides the range into N intervals, each containing
  approximately the same number of samples
 Given the data set (say 24, 21, 28, 8, 4, 26, 34, 21,
  29, 15, 9, 25)
    Determine the number of bins : N (say 3)
    Determine the number of data elements F(F=12)
    Sort the data as 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    Place F/N (12/3 = 4) element in order into the different bins
    Therefore:
      Bin 1 = 4,8,9 ,15
      Bin 2 = 21, 21,24, 25
      Bin3 = 26, 28, 29, 34
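A minimal Python sketch of equal-depth partitioning:

    data = sorted([24, 21, 28, 8, 4, 26, 34, 21, 29, 15, 9, 25])
    N = 3
    depth = len(data) // N                                    # 12 / 3 = 4 elements per bin
    bins = [data[i * depth:(i + 1) * depth] for i in range(N)]
    print(bins)   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]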
 Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
 Partition into (equal-frequency) bins:
   - Bin 1: 4, 8, 15
    - Bin 2: 21, 21, 24
    - Bin 3: 25, 28, 34
 Smoothing by bin means:
    - Bin 1: 9, 9, 9
    - Bin 2: 22, 22, 22
    - Bin 3: 29, 29, 29
 Smoothing by bin boundaries(Each bin value is replaced by the closest
  boundary value) :
    - Bin 1: 4, 4, 15
    - Bin 2: 21, 21, 24
    - Bin 3: 25, 25, 34
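A minimal Python sketch of smoothing by bin means and by bin boundaries for these bins:

    bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]

    # every value becomes its bin's mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
    # every value snaps to the nearer of its bin's minimum or maximum
    by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

    print(by_means)    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]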
 Activity
 Suppose a group of 12 sales price records has been
  sorted as follows: 5, 10, 11, 13, 15, 35, 50, 55, 72, 92,
  204, 215
 Partition them into three bins by each of the following
  methods:
  (a) equal-frequency (equidepth) partitioning
  (b) equal-width partitioning
 Handling Noisy Data by Regression
 Smooth by fitting the data into regression functions
 Finding a fitting function for a variable using its relation with
  another variable (or variables):
 In this way, the missing value of the first variable can be predicted
  from the fitting function
   [Figure: regression line y = F(x) = x + 1; the dependent variable y is predicted from x (e.g., y1 from x1)]
Example: Clustering
  Data Cleaning as a Process
 Data discrepancy detection
   Use metadata (e.g., domain, range, dependency, distribution)
   Check field overloading
   Check uniqueness rule, consecutive rule and null rule
   Use commercial tools
     Data scrubbing: use simple domain knowledge (e.g., postal code,
      spell-check) to detect errors and make corrections
      Data auditing: analyzing data to discover rules and relationships in order to
       detect violators (e.g., correlation and clustering to find outliers)
 Data migration and integration
   Data migration tools: allow transformations to be specified
   ETL (Extraction/Transformation/Loading) tools: allow users to
    specify transformations through a graphical user interface
 Integration of the two processes
    Iterative and interactive (e.g., Potter’s Wheel)
 Data Integration
 Data integration:
  Combines data from multiple sources into a coherent store
 Because of the use of different sources, data that is fine on
  its own may become problematic when we want to integrate
  it.
 Some of the issues are:
  Different formats and structures
  Data at different levels
  Conflicting and redundant data
 Careful integration of the data from multiple sources may
  help reduce/avoid redundancies and inconsistencies and
  improve mining speed and quality
 Data Integration: Formats
 Not everyone uses the same format.
   Schema integration: e.g., A.cust-id ≡ B.cust-#
      Integrate metadata from different sources
 Dates are especially problematic:
    12/19/97
    19/12/97
    19/12/1997
    19-12-97
    Dec 19, 1997
    19 December 1997
    19th Dec. 1997
 Are you frequently writing money as:
  Birr 200, Br. 200, 200 Birr, …
     Data Integration: different structure
Source 1:   ID      Name                         City          State
            1234    Ministry of Transportation   Addis Ababa   AA

Source 2:   ID      Name                         City          State
            GCR34   Ministry of Finance          Addis Ababa   AA

Source 3:   Name                        ID       City          State
            Office of Foreign Affairs   GCR34    Addis Ababa   AA
 Data Integration: Data that Moves
 Be careful of taking snapshots of a moving target
 Example: Let’s say you want to store the price of a
  shoe in France, and the price of a shoe in Italy. Can
  we use same currency (say, US$) or country’s
  currency?
  You can’t store it all in the same currency (say, US$) because
   the exchange rate changes
  Price in foreign currency stays the same
  Must keep the data in foreign currency and use the current
   exchange rate to convert
 The same needs to be done for ‘Age’
  It is better to store ‘Date of Birth’ than ‘Age’
Data at different level of detail than needed
    If it is at a finer level of detail, you can sometimes bin it
      Example
         I need age ranges of 20-30, 30-40, 40-50, etc.
         Imported data contains birth date
         No problem! Divide data into appropriate categories
    Sometimes you cannot bin it
      Example
         I need age ranges 20-30, 30-40, 40-50 etc.
         Data is of age ranges 25-35, 35-45, etc.
         What to do?
           Ignore age ranges because you aren’t sure
           Make educated guess based on imported data (e.g., assume that # people
            of age 25-35 are average # of people of age 20-30 & 30-40)
Data Integration: Conflicting Data
 Detecting and resolving data value conflicts
   For the same real world entity, attribute values from different sources are
    different
   Possible reasons: different representations, different scales, e.g., metric
    vs. British units
       weight measurement: kg or pound
       height measurement: meter or inch
 Information source #1 says that Hussen lives in Dire Dawa
   Information source #2 says that Hussen lives in Harar
 What to do?
     Use both (He lives in both places)
     Use the most recently updated piece of information
     Use the “most trusted” information
     Flag row to be investigated further by hand
     Use neither (We’d rather be incomplete than wrong)
 Data Integration: Avoiding the Redundancy Issue
 Redundant data occur often during integration of
  multiple databases
  The same attribute may have different names in different
   databases
  One attribute may be a “derived” attribute in another table,
   e.g., annual revenue from monthly revenue
 Redundant data may be detected by correlation analysis and
  covariance analysis
Correlation Analysis (Numeric Data)
   Correlation coefficient:

      r(A,B) = Σ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σ aᵢbᵢ − n·Ā·B̄) / ((n − 1) σA σB)

   where n is the number of tuples, Ā and B̄ are the respective means of A and B,
    σA and σB are the respective standard deviations of A and B, and Σ aᵢbᵢ is the
    sum of the AB cross-product.
   If r(A,B) > 0, A and B are positively correlated (A’s values increase as B’s do).
    The higher the value, the stronger the correlation.
   r(A,B) = 0: independent; r(A,B) < 0: negatively correlated
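A minimal Python sketch of the simplified formula above, applied to two illustrative attribute lists:

    def correlation(a, b):
        n = len(a)
        mean_a, mean_b = sum(a) / n, sum(b) / n
        sd_a = (sum((x - mean_a) ** 2 for x in a) / (n - 1)) ** 0.5   # sample standard deviation
        sd_b = (sum((y - mean_b) ** 2 for y in b) / (n - 1)) ** 0.5
        cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
        return cross / ((n - 1) * sd_a * sd_b)

    print(correlation([2, 3, 5, 4, 6], [5, 8, 10, 11, 14]))   # ≈ 0.94: strongly positive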
  Covariance
 Covariance
   Covariance is similar to correlation:

      Cov(p, q) = Σ (pᵢ − p̄)(qᵢ − q̄) / n

   where n is the number of tuples, and p̄ and q̄ are the respective means of p and q
   It can be simplified in computation as:

      Cov(p, q) = (Σ pᵢqᵢ) / n − p̄·q̄

 Positive covariance: If Cov(p,q) > 0, then p and q both tend to be directly related.
 Negative covariance: If Cov(p,q) < 0, then p and q are inversely related.
 Independence: Cov(p,q) = 0
 Example
 Suppose two stocks A and B have the following values
 in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
 Question:     If the stocks are affected by the same
 industry trends, will their prices rise or fall together?
  E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
  E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
  Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
 Thus, A and B rise together since Cov(A, B) > 0.
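A minimal Python sketch of this worked example:

    A = [2, 3, 5, 4, 6]
    B = [5, 8, 10, 11, 14]
    n = len(A)
    mean_a, mean_b = sum(A) / n, sum(B) / n                        # 4 and 9.6
    cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # 212/5 - 38.4 ≈ 4
    print(cov)   # positive, so the two stocks tend to rise together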
   Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that is
  much smaller in volume but yet produces the same (or almost the same)
  analytical results
 Why data reduction? — A database/data warehouse may store terabytes
  of data. Complex data analysis may take a very long time to run on the
  complete data set.
 Data reduction strategies
  Dimensionality reduction, e.g., remove unimportant attributes
      Wavelet transforms
      Principal Components Analysis (PCA)
      Attribute subset Selection
  Numerosity reduction (some simply call it: Data Reduction)
      Regression and Log-Linear Models
      Histograms, clustering, sampling
      Data cube aggregation
  Data compression
Data Reduction: Dimensionality Reduction
     Curse of dimensionality
      When dimensionality increases, data becomes increasingly
       sparse
     Dimensionality reduction
      Help eliminate irrelevant features and reduce noise
      Reduce time and space required in data mining
      Allow easier visualization
  Attribute Subset Selection
 Redundant attributes
   Duplicate much or all of the information contained in one or more other
    attributes
   E.g., purchase price of a product and the amount of sales tax paid
 Irrelevant attributes
   Contain no information that is useful for the data mining task at hand
   E.g., students' ID is often irrelevant to the task of predicting students'
    GPA
   Problem of irrelevant attributes: causing confusion for the mining
    algorithm employed
     Consequence: poor quality patterns, can slow down the mining process, etc.
 The “best” (and “worst”) attributes are typically determined using
  tests of statistical significance
Heuristic Search in Attribute Selection
     There are 2^d possible attribute combinations of d attributes
    Typical heuristic attribute selection methods:
       step-wise forward selection
       step-wise backward elimination
       combining forward selection and backward elimination
       decision-tree induction algorithm
    Step-wise forward selection
       Start with empty set
       The best single-feature is picked first
       Then next best feature will be selected conditioned by the first, ...
        Stop when the selected feature set closely represents the entire
         feature set
   Heuristic Search in Attribute Selection (cont’d)
 Step-wise backward elimination
  Start with all the feature set elements
  The feature which is most irrelevant will be discarded first
  Then next most irrelevant feature will be discarded and repeated, ...
   Stop when removing the next candidate attribute would affect the
    patterns significantly
 Combining forward selection and backward elimination
   At each step, the procedure selects the best feature and removes the
    most irrelevant one
 Decision-tree induction algorithm
   This algorithm generates a decision tree using some of the attributes
   The attributes used in building the decision tree are taken as the
    attributes that most closely represent the entire attribute set
Heuristic Search in Attribute Selection (Example)
      Numerosity reduction: Histogram Analysis
 Divide data into buckets and store the average (or sum) for each bucket
 Partitioning rules:
   Equal-width: equal bucket range (e.g., a width of $10)
   Equal-frequency (or equal-depth)
 [Figure: equal-width histogram of price values, with buckets from 10,000 to 100,000 and counts up to 40]
Numerosity reduction: Clustering
 Partition data set into clusters based on similarity, and
  store cluster representation (e.g., centroid and
  diameter) only
 Can be very effective if data is clustered but not if
  data is “smeared”
 There are many choices of clustering definitions and
  clustering algorithms
 Numerosity reduction: Sampling
 Obtaining a small sample s to represent the whole
  data set N
 Allow a mining algorithm to run in complexity that is
  potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the
  data using suitable sampling technique
      Types of Sampling
 Simple random sampling
  There is an equal probability of selecting any particular item
 Sampling without replacement
  Once an object is selected, it is removed from the population
 Sampling with replacement
  A selected object is not removed from the population
 Stratified sampling:
  Partition the data set, and draw samples from each partition
   (proportionally, i.e., approximately the same percentage of the data)
  Used in conjunction with skewed data
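A minimal Python sketch of these schemes, using the standard random module on a toy population (the strata are hypothetical):

    import random

    data = list(range(1, 101))                        # a toy population of 100 records
    srswor = random.sample(data, 10)                  # simple random sample WITHOUT replacement
    srswr = [random.choice(data) for _ in range(10)]  # simple random sample WITH replacement

    strata = {"young": list(range(1, 81)), "senior": list(range(81, 101))}
    stratified = [x for part in strata.values()
                  for x in random.sample(part, max(1, len(part) // 10))]   # ~10% per stratum
    print(srswor, srswr, stratified)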
Sampling: With or without Replacement
   [Figure: raw data reduced by simple random sampling without replacement (SRSWOR) and with replacement (SRSWR)]
Sampling: Cluster or Stratified Sampling
   [Figure: raw data partitioned into a cluster/stratified sample]
 Data reduction strategies: Data Cube Aggregation
 Aggregating data into a data cube and using it for the data mining task reduces
  the data set size significantly
 For example, one can aggregate sales amounts recorded per quarter into the total
  sales amount per year
 Multiple levels of aggregation in data cubes further reduce the size of
  data to deal with
 One should select appropriate levels of aggregation
 Use the most reduced representation which is sufficient to solve the
  task
  Data Transformation
 A function that maps the entire set of values of a given attribute to a
  new set of replacement values such that each old value can be
  identified with one of the new values
 Methods
  Smoothing: Remove noise from data
  Attribute/feature construction
     New attributes constructed from the given ones
  Aggregation: Summarization, data cube construction
  Normalization: Scaled to fall within a smaller, specified range
     min-max normalization
     z-score normalization
     normalization by decimal scaling
  Discretization: Concept hierarchy climbing
           Normalization
 The measurement unit used can affect the data analysis; e.g., changing
  from kg to pound may lead to very different results.
   expressing an attribute in smaller units will lead to a larger range for that
    attribute, and thus tend to give such an attribute greater effect or “weight”
 Normalization avoids dependence on the choice of measurement
  units, data to fall within common ranges such as [-1, 1] or [0.0, 1.0]
 Min-max normalization: to [new_minA, new_maxA]

      v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA
   Min-max normalization preserves the relationships among the original data
    values. It will encounter an “out-of-bounds” error if a future input case for
    normalization falls outside of the original data range for A.
   E.g. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
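A minimal Python sketch of min-max normalization applied to the income example:

    def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
        return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

    print(min_max(73600, 12000, 98000))   # ≈ 0.716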
   Normalization(cont’d)
 Z-score normalization (μ: mean, σ: standard deviation):

      v' = (v − μ) / σ

   useful when the actual minimum and maximum of attribute A are unknown
   E.g. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
    (73,600 − 54,000) / 16,000 = 1.225
 Normalization by decimal scaling:

      v' = v / 10^j   where j is the smallest integer such that Max(|v'|) < 1

   E.g. for recorded values ranging from −673 to 672, divide each value by 1,000 so
    that −673 normalizes to −0.673 and 672 to 0.672
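A minimal Python sketch of both formulas, reproducing the two examples above:

    def z_score(v, mu, sigma):
        return (v - mu) / sigma

    def decimal_scaling(values):
        j = 0
        while max(abs(v) for v in values) / (10 ** j) >= 1:   # smallest j with max |v'| < 1
            j += 1
        return [v / 10 ** j for v in values]

    print(z_score(73600, 54000, 16000))   # 1.225
    print(decimal_scaling([-673, 672]))   # [-0.673, 0.672]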
 Discretization
 Data discretization refers to transforming a data set, which is usually
  continuous, into discrete interval values
 Three types of attributes:
  Nominal — finite number of possible values, no ordering among
   values. E.g. Marital status( Single, married, widowed, and
   divorced)
  Ordinal — values from an ordered set. E.g. Size(big, med, small)
  Continuous — real numbers
 Discretization:
    divide the range of a continuous attribute into intervals
    Some classification algorithms only accept categorical attributes.
    Reduce data size by discretization
    Prepare for further analysis
 Data Discretization Methods
 Typical methods: All the methods can be applied
  recursively
  Binning
    Top-down split, unsupervised
  Histogram analysis
    Top-down split, unsupervised
  Clustering analysis (unsupervised, top-down split or bottom-up
   merge)
  Decision-tree analysis (supervised, top-down split)
   Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
            Concept Hierarchy Generation
 Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
  and is usually associated with each dimension in a data warehouse
    Concept hierarchy formation: Recursively reduce the data by collecting and
     replacing low-level concepts (such as numeric values for age) by higher-level
     concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts and/or data
  warehouse designers
 [Figure: location hierarchy: country, region or state, city, sub-city, kebele]
• Concept hierarchy can be automatically formed by the analysis of the number of
  distinct values, e.g., for a set of attributes: {Kebele, city, state, country}
    For numeric data, use discretization methods.
                Assignment(Due Date: June 12)
Review 5+ literature sources (books and articles) & write a report (overview, significance, steps
involved, applications, review of 2+ related local and international research works, and
concluding remarks) and present it in class.
   1-7: meaning, why, its tasks & functions, steps followed, comparison, pros and cons, applications
   8-12: problem statement, methodology, results, findings, recommendation
1.    Data Warehouses, Data Mining and Business Intelligence
2.    Predictive Modeling
3.    Data Mining Models (like CRISP, Hybrid, & other models)
4.    Text Mining
5.    Web Mining
6.    Sentiment/opinion mining
7.    Log Mining
8.    Knowledge Mining
9.    Multimedia Data Mining
10.   Spatial Mining
11.   Review studies related to ‘Application of data mining in Finance’
12.   Review studies related to ‘Application of data mining in Insurance’
13.   Review studies related to ‘Application of data mining in Health’
14.    Review studies related to ‘Application of data mining in Agriculture’