Data Mining Notes
Machine learning is concerned with designing machines that can learn from a given set of data and produce a desired result without being explicitly programmed. Hence machine learning implies 'a machine which learns on its own'. The term was coined in 1959 by Arthur Samuel, an American pioneer in the fields of computer gaming and artificial intelligence, who described it as giving "computers the ability to learn without being explicitly programmed."
Machine learning builds algorithms that process large amounts of data and deliver outcomes to its users. It relies on programs that learn through experience and make predictions.
Data mining and machine learning can be compared along several aspects:

Origin: Data mining originates from traditional databases with unstructured data, while machine learning starts from an existing algorithm and data.
Meaning: Data mining extracts information from a huge amount of data, while machine learning introduces new information from data as well as previous experience.
History: Data mining was known as knowledge discovery in databases (KDD) as early as 1930, while the first machine learning program, Samuel's checker-playing program, was established in 1950.
Responsibility: Data mining is used to obtain rules from existing data, while machine learning teaches the computer how to learn and comprehend those rules.
Abstraction: Data mining abstracts patterns from the data warehouse, while machine learning builds its abstraction from the data it is trained on.
Applications: Compared to machine learning, data mining can produce outcomes on smaller volumes of data; it is also used in cluster analysis. Machine learning needs a large amount of data to obtain accurate results and has various applications such as web search, spam filtering, credit scoring, and computer design.
Techniques involved: Data mining is more of a research activity that uses techniques such as machine learning, while machine learning is a self-learning, trained system that performs its task precisely.
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is also defined as the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from a huge amount of data. It is a rapidly growing field concerned with developing techniques that help managers and decision-makers make intelligent use of huge data repositories.
Data mining is also known by several alternative names:
1. Knowledge discovery in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archaeology
5. Data dredging
6. Information harvesting
7. Business intelligence
What is OLAP?
Often used interchangeably, the terms Online Analytical Processing (OLAP) and data warehousing both refer to decision support and business intelligence systems. OLAP systems help analyze the data in a data warehouse effectively. Dimensional modeling in data warehousing primarily supports OLAP, which belongs to a broader category of business intelligence that also includes relational databases, data mining, and report writing.
Typical OLAP applications include sales reporting, marketing, business process management (BPM), forecasting, budgeting, financial reporting, and others. Each OLAP cube is described by measures and dimensions. Measures are the numeric values categorized by dimensions. In the diagrams below, the dimensions are time, item type, and countries/cities, and the values inside the cube (605, 825, 14, 400) are measures.
The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three
basic operations in OLAP are:
Roll-up (Consolidation)
Drill-down
Slicing and dicing
Roll-up or consolidation refers to aggregating and computing data along one or more dimensions; it is performed on an OLAP cube. For instance, the cube with cities is rolled up to countries to depict the data with respect to time (in quarters) and item (type).
Conversely, the drill-down operation helps users navigate into the data details. In the above example, drilling down enables users to analyze the data for the three months of the first quarter separately, divided with respect to cities, months (time), and item (type).
Slicing is an OLAP feature that takes out a portion of the OLAP cube to view specific data. For instance, in the above diagram, the cube is sliced to a two-dimensional view showing item (type) with respect to quarter (time); the location dimension is skipped. In dicing, users analyze data from different viewpoints: in the above diagram, the users create a sub-cube and choose to view data for two item types and two locations in two quarters.
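To make these operations concrete, here is a minimal sketch using pandas; the table, column names, and numbers are hypothetical stand-ins for the cube described above:

import pandas as pd

# Hypothetical sales records: one row per (quarter, month, country, city, item).
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "country": ["Canada", "Canada", "USA", "USA", "Canada", "USA"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York", "Toronto", "Chicago"],
    "item":    ["phone", "laptop", "phone", "laptop", "phone", "laptop"],
    "amount":  [605, 825, 14, 400, 300, 250],
})

# Roll-up: aggregate cities up to the country level, by quarter and item.
rollup = sales.groupby(["quarter", "item", "country"])["amount"].sum()

# Drill-down: break the first quarter down into its individual months.
drilldown = sales[sales["quarter"] == "Q1"].groupby(["month", "city", "item"])["amount"].sum()

# Slice: fix one dimension (a single quarter) to get a 2-D item-by-country view.
slice_q1 = sales[sales["quarter"] == "Q1"].pivot_table(
    index="item", columns="country", values="amount", aggfunc="sum")

# Dice: select a sub-cube - two item types, two cities, two quarters.
dice = sales[sales["item"].isin(["phone", "laptop"])
             & sales["city"].isin(["Toronto", "Chicago"])
             & sales["quarter"].isin(["Q1", "Q2"])]

print(rollup, drilldown, slice_q1, dice, sep="\n\n")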
What is MOLAP?
Multidimensional OLAP (MOLAP) is the classical form of OLAP that facilitates data analysis by using a multidimensional data cube. Data is pre-computed, pre-summarized, and stored in the MOLAP cube (a major difference from ROLAP). Using MOLAP, a user can view multidimensional data from different facets.
Multidimensional data analysis is also possible with a relational database, but that would require querying data from multiple tables. MOLAP, by contrast, has all possible combinations of data already stored in a multidimensional array and can access this data directly. Hence, MOLAP is faster than Relational Online Analytical Processing (ROLAP).
MOLAP Architecture
MOLAP Architecture includes the following components:
Database Server
MOLAP Server
Front-end tool
(Figure: MOLAP architecture.)
Considering the MOLAP architecture shown above:
MOLAP mainly reads precomputed data. The architecture has limited capability to dynamically create aggregations or to calculate results that have not been pre-calculated and stored.
For example, an accounting head can run a report showing the corporate P/L account or the P/L account for a specific subsidiary. The multidimensional database (MDDB) would retrieve the precomputed profit-and-loss figures and display the result to the user.
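The precomputation idea behind this example can be sketched in a few lines of Python; this is a toy illustration with made-up subsidiaries and figures, not an actual MOLAP engine:

# Toy fact table: (subsidiary, quarter) -> profit. All names and numbers are made up.
facts = {
    ("SubA", "Q1"): 100, ("SubA", "Q2"): 120,
    ("SubB", "Q1"): -30, ("SubB", "Q2"): 60,
}

# "Processing" step: precompute every aggregate once, like building a MOLAP cube.
# A "*" in a key position means that dimension is rolled up over all its values.
cube = {}
for (sub, qtr), profit in facts.items():
    for key in [(sub, qtr), (sub, "*"), ("*", qtr), ("*", "*")]:
        cube[key] = cube.get(key, 0) + profit

# Query step: answering is a direct lookup; the source data is never re-scanned.
print(cube[("*", "*")])     # corporate P/L across everything -> 250
print(cube[("SubB", "*")])  # P/L for one subsidiary -> 30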
ROLAP (Relational Online Analytical Processing):
The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database specified in the partition's data source. ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders; instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. Query response is frequently slower with ROLAP storage than with the MOLAP or HOLAP storage modes, and processing time is also frequently slower.

MOLAP (Multidimensional Online Analytical Processing):
The MOLAP storage mode causes the aggregations of the partition and a copy of its source data to be stored in a multidimensional structure in Analysis Services when the partition is processed. This structure is highly optimized to maximize query performance; the storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. Query response times can be reduced substantially by using aggregations, but the data in the partition's MOLAP structure is only as current as the most recent processing of the partition.

HOLAP (Hybrid Online Analytical Processing):
The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in a SQL Server Analysis Services instance, but it does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.
Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data
points within each group are more similar to each other than to data points in other groups. This
process is often used for exploratory data analysis and can help identify patterns or relationships within
the data that may not be immediately obvious. There are many different algorithms used for cluster
analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice of
algorithm will depend on the specific requirements of the analysis and the nature of the data being
analyzed.
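As a quick illustration of one of these algorithms, here is a minimal k-means sketch using scikit-learn; the blob data, the choice of k=3, and all parameters are arbitrary assumptions for the example:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: three blobs of points around different centers.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal((0, 0), 0.5, (50, 2)),
    rng.normal((5, 5), 0.5, (50, 2)),
    rng.normal((0, 5), 0.5, (50, 2)),
])

# Partition the points into k=3 clusters by minimizing within-cluster variance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # one center per cluster
print(kmeans.labels_[:10])      # cluster assignments of the first 10 points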
Clustering is an unsupervised machine-learning algorithm that groups data points into clusters so that the objects in the same group are similar to each other.
Clustering helps split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters.
For example, suppose we are a marketing manager with a tempting new product to sell. We are sure the product would bring enormous profit, as long as it is sold to the right people. So how can we tell who is best suited for the product from our company's huge customer base? Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Clustering is one of the problems that unsupervised machine learning algorithms solve: it uses only the input data to determine patterns, anomalies, or similarities in that data. A good clustering has two properties:
o High intra-cluster similarity: the data points inside a cluster are similar to one another.
o Low inter-cluster similarity: each cluster holds data that is dissimilar to the data in other clusters.
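These two properties are exactly what the silhouette score summarizes; here is a small hedged sketch (synthetic data and parameters are assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs: high intra-cluster, low inter-cluster similarity.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal((0, 0), 0.3, (40, 2)),
                  rng.normal((6, 6), 0.3, (40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)

# The silhouette score ranges from -1 to 1; values near 1 mean tight,
# well-separated clusters, i.e., both properties above hold.
print(silhouette_score(data, labels))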
What is a Cluster?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called
clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.
The main requirements for clustering algorithms are:
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should scale approximately with the complexity order of the algorithm. For example, K-means clustering is O(n), where n is the number of objects in the data, so if we raise the number of data objects tenfold, the time taken to cluster them should also increase roughly ten times; there should be a linear relationship. If that is not the case, there is some error in our implementation. If an algorithm does not scale, we cannot get appropriate results on large data. (A small timing sketch follows this list of requirements.)
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape; it should not be limited to distance measures that tend to discover only small spherical clusters.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality:
Clustering tools should be able to handle not only high-dimensional data but also low-dimensional data.
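As promised under the scalability requirement, here is a rough timing sketch; it uses synthetic data and arbitrary sizes (both assumptions), and absolute times are machine-dependent:

import time
import numpy as np
from sklearn.cluster import KMeans

def time_kmeans(n_points):
    # Return the seconds taken to cluster n_points random 2-D points with k=5.
    rng = np.random.default_rng(0)
    data = rng.random((n_points, 2))
    start = time.perf_counter()
    KMeans(n_clusters=5, n_init=10, random_state=0).fit(data)
    return time.perf_counter() - start

t_small = time_kmeans(10_000)
t_large = time_kmeans(100_000)

# With 10x more points we expect a time ratio of roughly 10 (linear scaling).
print(f"10k: {t_small:.2f}s  100k: {t_large:.2f}s  ratio: {t_large / t_small:.1f}")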
Association rule learning is one of the most important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by big retailers to discover associations between items. We can understand it through the example of a supermarket: products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
Association rules take the form of if-then statements. The 'if' element is called the antecedent, and the 'then' statement is called the consequent. A relationship in which we find an association between two items is known as single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Support
Support is the frequency of an itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

    Support(X) = Freq(X) / |T|

where Freq(X) is the number of transactions containing X and |T| is the total number of transactions.
Confidence
Confidence indicates how often the rule has been found to be true: how often items X and Y occur together in the dataset, given that X already occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

    Confidence(X → Y) = Freq(X ∪ Y) / Freq(X)
Lift
Lift is the ratio of the observed support to the support expected if X and Y were independent of each other:

    Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible values:
o Lift = 1: the probability of occurrence of the antecedent and the consequent is independent of each other.
o Lift > 1: it indicates the degree to which the two itemsets are dependent on each other.
o Lift < 1: it tells us that one item is a substitute for the other, meaning one item has a negative effect on the other.
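These three metrics can be computed directly from a list of transactions; here is a minimal sketch (the transactions are made up for illustration):

# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}

supp_xy = support(X | Y)                       # Support(X ∪ Y)
conf_xy = supp_xy / support(X)                 # Confidence(X -> Y)
lift_xy = supp_xy / (support(X) * support(Y))  # Lift(X -> Y)

print(supp_xy, conf_xy, lift_xy)  # 0.6, 0.75, 1.25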
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.
It is mainly used for market basket analysis and helps identify products that can be bought together. It can also be used in the healthcare field, for example to find drug reactions for patients.
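Here is a compact sketch of the level-wise Apriori idea over the same toy transactions; it is a simplified illustration (no hash tree, and the candidate-generation step is naive), not an optimized implementation:

from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
min_support = 0.4  # arbitrary threshold for this example

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = [frozenset([i]) for i in items]  # level-1 candidates
    k = 1
    while current:
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        kept = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Apriori property: a (k+1)-itemset can only be frequent if its subsets
        # are, so next-level candidates are unions of frequent k-itemsets.
        current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

for itemset, supp in frequent_itemsets(transactions, min_support).items():
    print(sorted(itemset), supp)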
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search to find frequent itemsets in a transaction database, and it typically executes faster than the Apriori algorithm.
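Eclat's core idea is a vertical layout: each itemset carries a "tid-list" of the transaction IDs containing it, and support is computed by intersecting tid-lists during a depth-first search. A minimal sketch on the same toy data (thresholds and data are assumptions for illustration):

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
min_count = 2  # absolute support threshold, arbitrary for this example

# Vertical layout: item -> set of IDs of the transactions containing it.
tidlists = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidlists.setdefault(item, set()).add(tid)

def eclat(prefix, items, results):
    # Depth-first search; support of an itemset = size of its intersected tid-list.
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:
            results[frozenset(prefix | {item})] = len(tids)
            # Extend the current itemset by intersecting with remaining tid-lists.
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(prefix | {item}, suffix, results)
    return results

results = eclat(set(), sorted(tidlists.items()), {})
for itemset, count in results.items():
    print(sorted(itemset), count)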
F-P Growth Algorithm
The FP-growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree), and the purpose of this tree is to extract the most frequent patterns.
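In practice these miners are rarely hand-written. As one option (an assumption, not something these notes prescribe), the third-party mlxtend package provides ready-made apriori and fpgrowth implementations:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "eggs"],
                ["milk", "eggs"],
                ["bread", "butter", "eggs"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Mine frequent itemsets with FP-growth, then derive rules with confidence >= 0.7.
itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])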
Applications of association rule learning include:
o Market Basket Analysis: one of the best-known applications of association rule mining; big retailers commonly use this technique to determine the associations between items.
o Medical Diagnosis: association rules can support treatment decisions by helping to identify the probability of illness for a particular disease.
o Protein Sequence: association rules help in determining the synthesis of artificial proteins.
o Association rules are also used in catalog design, loss-leader analysis, and many other applications.