Data Mining Techniques for Managers (DMTM)
By Kushal Anjaria
Session-1
Nowadays, we are witnessing enormous growth of data, from
terabytes to petabytes. This poses a major challenge for
companies that must manage, analyse, and visualize data. In this
note, we will discuss the top big data trends that are shaping
the future. For data, we have multiple data collection tools
and sources. The devices include various types of disks,
servers, and processing units. The sources can be
classified into three categories:
1. Business: transactions, stocks, the Web, and e-commerce
2. Science: remote sensing, bioinformatics, and simulations
3. Society: news, digital cameras, social networking sites, and so on
“We are drowning in data but starving for knowledge.” – Prof.
Pabitra Mitra. Data is growing exponentially, and the volume of
data is expected to double every two years. As a result,
there is a tremendous increase in the demand for data
scientists who can manage this data and make sense of it. In
this situation, data mining comes into the picture.
Definition of data mining: the extraction of interesting,
nontrivial, implicit, previously unknown, and potentially
useful patterns or knowledge from a vast amount of data is
known as data mining.

An alternative name for data mining is Knowledge Discovery
from Data (KDD). KDD can be defined as “the process of
discovering new patterns in large data sets to gain insight into
the problem at hand.” Data mining is a subset of KDD that
involves the use of specific algorithms and approaches to
analyse large datasets. While doing data mining, one should
be clear about what data mining is not: for example, a simple
search in a search engine or a query in a database is not a
data mining procedure.

A data mining process is a set of tasks to analyse data,
uncover patterns, and make predictions. It can be used for
many purposes, such as fraud detection, marketing analysis,
and business intelligence. Data mining uses various techniques,
such as machine learning and statistical pattern recognition,
to identify hidden patterns in large datasets, and it focuses on
understanding the relationships between variables or items in a
dataset. Data analytics, by contrast, is a process that uses
statistical and mathematical techniques to extract information
from data in order to reveal patterns and relationships.

A normal data analytics procedure will not be able to handle
the following:
• Data streams (from sensors), time-series data, temporal data, sequential data
• Graphs, graphical data, multi-linked data, social network data
• Heterogeneous databases and legacy databases
• Multimedia, large text, and web data
• Simulation and forecasting data

Procedure for knowledge discovery from data
• In this diagram, the entire KDD process is described.
• From the vast amount of data, it is crucial to search for the attributes that fulfil our requirements. This process is known as the selection process.
• Once the data is selected, we check whether any data points are missing. This task of scanning the data is known as data pre-processing. In the pre-processing stage, we may also fill in missing data points using statistical functions.
• In the transformation phase, we combine our data into meaningful repositories. We create a data warehouse where relational databases may be given formal meanings and interpretations.
• In the data mining phase, we apply mathematical models and data mining algorithms to the transformed data. This stage helps us identify the underlying patterns in the data, using various statistical analysis and learning algorithms.
• The final stage of the KDD process is interpretation and evaluation. In this phase, we convert the patterns obtained in the data mining phase into a human-understandable form. Only proper data interpretation and visualization leads to knowledge generation.
• The entire KDD process is iterative: once knowledge is generated, one can go back to the data selection and pre-processing stages.

We will start our discussion on data mining by understanding
the meaning of data.

In this course, we consider data in tabular form. Suppose a
bank has provided us with historical data. From the patterns
available in the data, we intend to evaluate new loan
applications: we aim to identify whether a new applicant is
fraudulent or legitimate.
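As a toy illustration of the KDD stages described above applied to tabular loan data, the pipeline can be sketched in a few lines of Python. The attribute names, values, and the one-rule "model" are all invented for this sketch; a real data mining phase would use an actual learning algorithm.

```python
from statistics import mean

# Toy loan-application records (attributes are illustrative only).
raw = [
    {"id": 1001, "age": 35, "income": 52000, "defaulted": False},
    {"id": 1002, "age": 51, "income": None,  "defaulted": True},
    {"id": 1003, "age": 28, "income": 31000, "defaulted": False},
]

# 1. Selection: keep only the attributes relevant to the question at hand.
selected = [{"age": r["age"], "income": r["income"], "defaulted": r["defaulted"]}
            for r in raw]

# 2. Pre-processing: fill the missing income using a statistical function (the mean).
known = [r["income"] for r in selected if r["income"] is not None]
for r in selected:
    if r["income"] is None:
        r["income"] = mean(known)

# 3. Transformation: rescale income so attribute values are comparable.
max_income = max(r["income"] for r in selected)
for r in selected:
    r["income"] /= max_income

# 4. Data mining: a trivial one-rule "model" standing in for a real algorithm.
pattern = [r for r in selected if r["income"] < 0.9]

# 5. Interpretation/evaluation: present the pattern in human-readable form.
print(f"{len(pattern)} applicants flagged for review out of {len(selected)}")
# prints: 2 applicants flagged for review out of 3
```

Because the process is iterative, in practice one would inspect the flagged applicants and then return to the selection and pre-processing stages with better attributes.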
The data will have specific attributes and objects, and it can
be structured or unstructured. The data mining process is a
complex process that involves many factors; it includes data
cleaning, data integration, data pre-processing, data mining
algorithms, and data visualization.

In the above example, the table columns are the attributes, and
the rows of the table are the records. In the data mining
literature, attributes or columns are also known as features,
variables, or inputs.

One important thing to note about this representation is that
each row can be thought of as a vector whose components are the
individual attribute values. These vectors are sometimes known
as object vectors or feature vectors. Mathematically, each
vector has a dimension associated with it: the number of
attributes determines the dimension of the vectors. In the
present example, we have five attributes, so each record is a
five-dimensional vector. In data mining, each vector is
considered a point in a coordinate system; for the present
example, the ten objects can be represented as data points in
the five-dimensional coordinate system.

Furthermore, a bank may have a collection of one lakh loan
applications from the past year. These loans can be thought of
as one lakh points in a five-dimensional coordinate system,
and this plotting exercise helps you visualize the nature of
the data.

There are four types of attributes used in the data mining
process:

1. Nominal attributes: e.g., ID numbers, eye colour, zip codes
Nominal attributes are just symbols; in other words, they are
arbitrary. For a coin, attributes such as its colour and size
are nominal: they describe it without measuring anything. For
example, the ID number of a bank account is only a number; it
has no other meaning. Similarly, eye colour (black, brown, or
blue) and the zip or PIN code of a place are just values or
symbols. These attributes act as identifiers. Why are they
only symbols? Suppose one person has bank account number 1001
and another has account number 1002; you cannot say that the
person with 1002 is greater than the person with 1001. You
cannot compare these values; they are purely symbolic.
Consider another example: one person's name is Kushal Anjaria
and another's is, say, Ram Kumar. A name tells us nothing
beyond identity.

2. Ordinal attributes: e.g., rankings, grades, weights, measurements
Ordinal values can be compared and ordered. For example, you
may rate a movie or a packet of potato chips on a scale of
1 to 10 according to how good or bad it is.

3. Interval attributes: e.g., dates, temperature ranges
The values of interval attributes represent positions in some
interval space. For example, a calendar date tells you whether
another date, say the date of a loan application, falls within
some time interval or not. From interval attribute values, you
can say that one value belongs to a given interval and another
does not.

4. Ratio attributes: e.g., time, or temperature on a scale
where you can change the unit and meaningful ratios can be
obtained.

Properties of the attributes:
Which of the four types an attribute belongs to depends on
which of the following properties it possesses:
1. Distinctness
2. Order
3. Addition or subtraction
4. Multiplication or division

Nominal attribute: distinctness
Ordinal attribute: distinctness and order
Interval attribute: distinctness, order, and addition
Ratio attribute: all four properties above

Each data mining algorithm is defined with respect to which
attributes we use and which properties those attributes
possess. The attribute types and the operations they support
are summarized above. In data mining, whatever operations you
perform have to be compatible with the attribute type.

An attribute can also be represented as a discrete or a
continuous attribute.

Discrete attribute:
• Has a finite or countably infinite set of values
• Examples: zip codes, the set of words in a document, counts of any entity, the number of accounts in a bank, the number of products in a warehouse
• Often represented as integer values
• Note that binary attributes are a special case of discrete attributes

Continuous attributes:
• Have real numbers as attribute values
• Examples: temperature, height, weight
• In practice, real numbers are represented using a finite number of digits
• Continuous variables are represented as floating-point variables

Before going on to data transformation, it is crucial to check
the quality of the data. Data can be considered of bad quality
if:
• Some attribute values are missing
• The data domain is not satisfied
• Incorrect data has been inserted
• Duplicate or redundant data exists
• The data contains noise or distortion
• The data contains outliers (e.g., a percentage attribute with values above 100%, or a decimal value with the point missing)

Data pre-processing increases the value of the data. Moreover,
it also decreases the computational load. Data pre-processing
tasks can be completed in the following ways:
1. Aggregation: Aggregation means that you sometimes consider a group of data together, and then the cumulative information from all of it is used.
2. Sampling: In the sampling technique, only a few representative data points are kept, and the rest are discarded. The idea is that the sample alone is enough for processing.
3. Dimensionality reduction: We pick only the required characteristics of the data. For example, if you go to a doctor with many symptoms and many measurements, the doctor will not look at all of them; the doctor will select a few of them and complete the diagnosis.
4. Discretization and binarization: Sometimes we have to convert continuous data into a discrete or binary form.

Now, we will focus on data transformation and pattern mining.
The first pattern we will consider is known as association
rules. The association pattern originated in one of the
earliest uses of data mining, in retail. Say, for example, you
go to a supermarket or a mall and buy some items. The bill can
be recorded after a person buys something, so for each
transaction or purchase by a customer there is a row for that
basket of items, and over time there is a massive number of
rows. In the table below, the rows describe different
transactions: TID 1 is transaction ID 1, the first customer's
transaction; the next row is the subsequent customer's
transaction, and so on. Along with each transaction, the items
purchased by that customer are recorded. You can see in the
table that customer one has bought bread and milk; customer two
has bought bread, diaper, beer, and eggs; customer three has
bought milk, diaper, beer, and coke; and so on.

These types of transactions are called market basket
transactions. Each transaction consists of two parts: the first
is the ID of the transaction of a particular customer, and the
second is the list of items purchased by that customer. Suppose
every day thousands of people come to the supermarket and make
this kind of transaction; over, say, one or two years, there
will be an enormous amount of data. IBM was the first company
to analyse this type of data and come up with the association
rule generation and mining technique.

Let's observe the following table and find out what IBM
discovered from the data.

TID  ITEMS
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

The table shows that people who buy bread and milk are likely
to also buy diapers, and people who buy diapers are likely to
buy beer. This kind of pattern has commercial significance:
for example, a store could offer customers who buy diapers a
discount on beer, or arrange the placement of items in the
store accordingly. Now the question is: from a vast amount of
data, how do we calculate association rules?

For association rules, the following terminology is useful:
1. Itemset: a collection of one or more items, e.g., {Bread, Milk, Diaper, Coke}. A k-itemset is an itemset that contains k items.
2. Support count (𝜎): the frequency of occurrence of an itemset, e.g., 𝜎({Bread, Milk, Diaper}) = 2.
3. Support (s): the fraction of transactions that contain an itemset, e.g., s({Bread, Milk, Diaper}) = 2/5.
4. Frequent itemset: an itemset whose support is greater than or equal to some minimum support threshold.
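These definitions can be checked directly against the five transactions in the table. The sketch below (plain Python, no libraries) computes support count and support, and enumerates frequent itemsets by brute force; a real miner would use an algorithm such as Apriori or FP-Growth instead of trying every combination.

```python
from itertools import combinations

# The five market-basket transactions from the table (TID -> items).
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Beer", "Eggs"},
    3: {"Milk", "Diaper", "Beer", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Beer"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

def support_count(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset) / len(transactions)

def frequent_itemsets(min_support):
    """Brute-force enumeration of all itemsets with s(X) >= min_support.
    Feasible here because there are only six distinct items."""
    all_items = sorted(set().union(*transactions.values()))
    frequent = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            s = support(set(combo))
            if s >= min_support:
                frequent[combo] = s
    return frequent

print(support_count({"Bread", "Milk", "Diaper"}))  # -> 2
print(support({"Bread", "Milk", "Diaper"}))        # -> 0.4
print(frequent_itemsets(0.6))
```

With a minimum support threshold of 0.6 (at least three of the five transactions), the frequent itemsets are the four popular single items and the pairs {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, and {Diaper, Beer} — exactly the co-purchase patterns discussed above.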