Data Mining Notes

What is Machine learning?

Machine learning is concerned with the design and development of machines that can learn from a given set of data and produce a desired result without being explicitly programmed. Hence, machine learning implies 'a machine which learns on its own'. The term Machine Learning was coined in 1959 by Arthur Samuel, an American pioneer in the fields of computer gaming and artificial intelligence, who said that it "gives computers the ability to learn without being explicitly programmed."

Machine learning builds algorithms that process large amounts of data and deliver outcomes to its users. It relies on programs that can learn through experience and make predictions.

Data Mining Vs Machine Learning

Origin: Data mining works on traditional databases with unstructured data. Machine learning works with an existing algorithm and data.

Meaning: Data mining extracts information from a huge amount of data. Machine learning introduces new information from data as well as from previous experience.

History: Data mining grew out of knowledge discovery in databases (KDD). The first machine learning program, Samuel's checker-playing program, dates to the early 1950s.

Responsibility: Data mining is used to obtain rules from the existing data. Machine learning teaches the computer how to learn and comprehend those rules.

Abstraction: Data mining abstracts patterns from the data warehouse. Machine learning learns directly from the data fed to the machine.

Applications: Compared to machine learning, data mining can produce outcomes on a smaller volume of data; it is also used in cluster analysis. Machine learning needs a large amount of data to obtain accurate results and has many applications, such as web search, spam filtering, credit scoring, and computer design.

Nature: Data mining involves more human interference and is largely manual. Machine learning is automated; once designed and implemented, there is no need for human effort.

Techniques involved: Data mining is more of a research activity that uses techniques such as machine learning. Machine learning is a self-learned, trained system that performs the task precisely.

Scope: Data mining is applied in limited fields. Machine learning can be used in a vast area.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is also defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from a huge amount of data. Data mining is a rapidly growing field concerned with developing techniques that assist managers and decision-makers in making intelligent use of huge data repositories.

Alternative names for Data Mining :


1. Knowledge discovery (mining) in databases (KDD)

2. Knowledge extraction

3. Data/pattern analysis

4. Data archaeology

5. Data dredging

6. Information harvesting

7. Business intelligence


Key properties of Data Mining:


1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
4. Focus on large datasets and databases

Data Mining: Confluence of Multiple Disciplines – data mining draws on database technology, statistics, machine learning, information science, and visualization.


Data Mining Process: Data mining is a process of discovering various models, summaries, and
derived values from a given collection of data. The general experimental procedure adapted to data-mining
problems involves the following steps:
1. State the problem and formulate the hypothesis – In this step, a modeler usually specifies a group of variables for the unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. Several hypotheses may be formulated for a single problem at this stage. This first step requires the combined expertise of the application domain and of data mining. In practice, it usually means close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop at the initial phase; it continues throughout the entire data-mining process.
2. Collect the data – This step concerns how the data are generated and collected. In general, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this approach is known as a designed experiment. The second possibility is when the expert cannot influence the data-generation process; this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is partially and implicitly given by the data-collection procedure. It is important, however, to understand how the data collection affects its theoretical distribution, since such prior knowledge can be very useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same unknown sampling distribution. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
3. Data preprocessing – In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks (a brief sketch of both appears after this list of steps):
 (i) Outlier detection (and removal): Outliers are unusual data values that are not consistent with most of the observations. Commonly, outliers result from measurement errors and coding or recording errors, and sometimes they are natural, abnormal values. Such non-representative samples can seriously affect the model produced later. There are two strategies for handling outliers: detect and eventually remove the outliers as part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
 (ii) Scaling, encoding, and selecting features: Data preprocessing includes several steps such as variable scaling and different types of encoding. For instance, one feature with range [0, 1] and another with range [100, 1000] will not have the same weight in the applied technique, and they will also influence the final data-mining results differently. It is therefore recommended to scale them and bring both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling.
4. Estimate the model – The selection and implementation of an appropriate data-mining technique is the main task in this phase. The process is not straightforward; in practice, the implementation is usually based on several models, and selecting the best one is an additional task.
5. Interpret the model and draw conclusions – In most cases, data-mining models should support decision-making. Hence, such models need to be interpretable in order to be useful, because humans are not likely to base their decisions on complex "black-box" models. Note that the goals of model accuracy and interpretation accuracy are somewhat contradictory: usually, simple models are more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, which is also important, is considered a separate task, with specific techniques to validate the results.
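The two preprocessing tasks from step 3 can be illustrated with a short Python sketch. This is only a minimal example and not part of the original notes; the toy values and the cutoff on the outlier score are assumptions chosen for illustration.

import numpy as np

# Toy data: one feature in [0, 1], another in [100, 1000], plus one obvious recording error.
data = np.array([
    [0.20, 150.0],
    [0.35, 420.0],
    [0.50, 610.0],
    [0.80, 980.0],
    [0.40, 9500.0],   # recording error -> outlier in the second feature
])

# (i) Outlier detection and removal with a robust, median-based rule
# (the 3.5 cutoff on the modified z-score is an assumed convention).
median = np.median(data, axis=0)
mad = np.median(np.abs(data - median), axis=0)
modified_z = 0.6745 * np.abs(data - median) / mad
clean = data[(modified_z < 3.5).all(axis=1)]

# (ii) Min-max scaling so both features carry the same weight in [0, 1].
mins, maxs = clean.min(axis=0), clean.max(axis=0)
scaled = (clean - mins) / (maxs - mins)
print(scaled)

After scaling, both features lie in [0, 1], so neither dominates a distance-based technique applied later.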

Classification of Data Mining Systems:


1. Database Technology
2. Statistics
3. Machine Learning
4. Information Science
5. Visualization

What is OLAP?
Although the terms are often used interchangeably, Online Analytical Processing (OLAP) and data warehousing both apply to decision support and business intelligence systems. OLAP systems help data warehouses analyze data effectively. Dimensional modeling in data warehousing primarily supports OLAP, which belongs to the broader category of business intelligence that also encompasses relational databases, report writing, and data mining.

Many OLAP applications include sales reporting, marketing, business process management (BPM), forecasting, budgeting, financial reporting, and others. Each OLAP cube is presented through measures and dimensions. Measures are the numeric values categorized by dimensions. For example, the dimensions could be time, item type, and countries/cities, and the numeric values inside the cube cells (such as 605, 825, 14, 400) are the measures.
The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are:

 Roll-up (Consolidation)
 Drill-down
 Slicing and dicing

Roll-up or consolidation refers to data aggregation and computation along one or more dimensions. It is performed on an OLAP cube. For instance, a cube holding data for cities can be rolled up to the country level to depict the data with respect to time (in quarters) and item (type).

Drill-down, on the contrary, helps users navigate to finer levels of detail. Continuing the example above, drilling down lets users analyze the data for the three months of the first quarter separately, with the data broken down by city, month (time), and item (type).

Slicing is an OLAP feature that takes out a portion of the OLAP cube to view specific data. For instance, the cube can be sliced to a two-dimensional view showing item (type) with respect to quarter (time), with the location dimension dropped. In dicing, users analyze data from several viewpoints by creating a sub-cube, for example viewing data for two item types and two locations across two quarters. A small pandas sketch of these operations follows below.
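The three operations can be sketched with pandas on a made-up fact table. The column names and figures below are assumptions for illustration only (the measures 605, 825, 14 and 400 mentioned above are reused as sample amounts).

import pandas as pd

# Hypothetical fact table: sales amounts by country, city, quarter, month and item type.
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA", "USA", "Canada"],
    "city":    ["Toronto", "Vancouver", "Chicago", "New York", "Chicago", "Toronto"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "item":    ["phone", "security", "phone", "computer", "phone", "security"],
    "amount":  [605, 825, 14, 400, 300, 120],
})

# Roll-up: aggregate cities up to the country level, by quarter and item.
rollup = sales.groupby(["country", "quarter", "item"])["amount"].sum()

# Drill-down: move from quarters to the individual months.
drilldown = sales.groupby(["city", "month", "item"])["amount"].sum()

# Slice: fix one dimension (quarter = Q1), viewing item against quarter only.
slice_q1 = sales[sales["quarter"] == "Q1"].pivot_table(
    index="item", columns="quarter", values="amount", aggfunc="sum")

# Dice: a sub-cube restricted to two item types, two cities and two quarters.
dice = sales[sales["item"].isin(["phone", "security"])
             & sales["city"].isin(["Toronto", "Chicago"])
             & sales["quarter"].isin(["Q1", "Q2"])]

print(rollup, drilldown, slice_q1, dice, sep="\n\n")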
What is MOLAP?
Multidimensional OLAP (MOLAP) is the classical form of OLAP that facilitates data analysis by using a multidimensional data cube. Data is pre-computed, pre-summarized, and stored in the MOLAP cube (a major difference from ROLAP). Using MOLAP, a user can view multidimensional data from different facets.

Multidimensional data analysis is also possible with a relational database, but that would require querying data from multiple tables. MOLAP, by contrast, has all possible combinations of data already stored in a multidimensional array and can access this data directly. Hence, MOLAP is faster than Relational Online Analytical Processing (ROLAP).

 Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by using a multidimensional data cube.
 MOLAP tools process information with roughly the same response time irrespective of the level of summarization.
 A MOLAP server implements two levels of storage to manage dense and sparse data sets.
 MOLAP can manage, analyze, and store considerable amounts of multidimensional data.
 It helps to automate the computation of higher-level aggregate data.
 It is less scalable than ROLAP because it handles only a limited amount of data.

MOLAP Architecture
MOLAP Architecture includes the following components:

 Database Server
 MOLAP Server
 Front-end tool

MOLAP Architecture
Considering the MOLAP architecture above:

1. The user requests reports through the front-end interface.
2. The application logic layer of the MDDB (multidimensional database) retrieves the stored data from the database.
3. The application logic layer forwards the result to the client/user.

The MOLAP architecture mainly reads pre-compiled data. It has limited capabilities to dynamically create aggregations or to calculate results that have not been pre-calculated and stored.

For example, an accounting head can run a report showing the corporate P/L account or the P/L account for a specific subsidiary. The MDDB would retrieve the pre-compiled Profit & Loss figures and display the result to the user.

Difference between ROLAP, MOLAP, and HOLAP


ROLAP (Relational Online Analytical Processing)

The ROLAP storage mode causes the aggregations of the partition to be stored in indexed views in the relational database that was specified in the partition's data source. ROLAP does not cause a copy of the source data to be stored in the Analysis Services data folders; instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. Query response is generally slower with ROLAP storage than with the MOLAP or HOLAP storage modes, and processing time is also typically slower.

MOLAP (Multidimensional Online Analytical Processing)

The MOLAP storage mode causes the aggregations of the partition, and a copy of its source data, to be stored in a multidimensional structure in Analysis Services when the partition is processed. This MOLAP structure is highly optimized to maximize query performance. The storage can be on the computer where the partition is defined or on another computer running Analysis Services. Because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data, and query response times can be reduced substantially by using aggregations. The data in a partition's MOLAP structure is only as current as the most recent processing of that partition.

HOLAP (Hybrid Online Analytical Processing)

The HOLAP storage mode combines attributes of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be stored in a multidimensional structure in an SQL Server Analysis Services instance, but HOLAP does not cause a copy of the source data to be stored. For queries that access only summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP. Queries that access source data (for example, drilling down to an atomic cube cell for which there is no aggregation data) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.

Data Mining – Cluster Analysis

Cluster analysis, also known as clustering, is a method of data mining that groups similar data points
together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such that the data
points within each group are more similar to each other than to data points in other groups. This
process is often used for exploratory data analysis and can help identify patterns or relationships within
the data that may not be immediately obvious. There are many different algorithms used for cluster
analysis, such as k-means, hierarchical clustering, and density-based clustering. The choice of
algorithm will depend on the specific requirements of the analysis and the nature of the data being
analyzed.
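As a quick illustration of the k-means algorithm mentioned above, the following sketch runs scikit-learn's KMeans on synthetic data. It is only a sketch under assumed inputs: the blob data and the choice of k = 3 are made up for the example.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit k-means with k = 3 (chosen because we generated three blobs).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster sizes:", [int((labels == c).sum()) for c in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_)

In practice the number of clusters is rarely known in advance; it is usually chosen by comparing several values of k, as in the silhouette example further below.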

Clustering is an unsupervised machine learning technique that groups data points into clusters so that the objects in the same group are similar to one another.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. For example, suppose we are a marketing manager and we have a new, tempting product to sell. We are sure the product would bring enormous profit, as long as it is sold to the right people. So how can we tell who is best suited for the product from our company's huge customer base? Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.

Clustering falls under the category of unsupervised machine learning and is one of the problems that machine learning algorithms solve.

Clustering uses only the input data to determine patterns, anomalies, or similarities within it.

A good clustering algorithm aims to obtain clusters in which:

o The intra-cluster similarity is high, meaning the data points inside a cluster are similar to one another.
o The inter-cluster similarity is low, meaning each cluster holds data that is not similar to the data in other clusters.
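One common way to check both properties at once is the silhouette coefficient, which compares how close each point is to its own cluster versus the nearest other cluster; values near 1 indicate high intra-cluster similarity and low inter-cluster similarity. Below is a minimal sketch (the synthetic data and the candidate values of k are assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Higher silhouette means tighter, better-separated clusters.
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))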
What is a Cluster?

o A cluster is a subset of similar objects


o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.

What is clustering in Data Mining?

o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to gain insight into the data distribution or as a pre-processing step for other algorithms.

Why is clustering used in data mining?


1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow approximately in proportion to the complexity order of the algorithm. For example, K-means clustering is O(n) in the number of objects n in the data, so if we raise the number of data objects tenfold, the time taken to cluster them should also increase roughly ten times; there should be a linear relationship. If that is not the case, there is some error in our implementation. The algorithm should scale to large data; if it does not, the results obtained on large data sets can be misleading. (A small timing sketch appears after this list of requirements.)

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find arbitrarily shaped clusters and should not be limited to distance measures that tend to discover only small spherical clusters.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on intervals (numeric), binary data, and
categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.

6. High dimensionality:

The clustering tools should be able to handle not only high-dimensional data but also low-dimensional data.
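The near-linear scaling described in requirement 1 can be checked empirically by timing k-means on n and 10n points. This is a rough sketch, not part of the original notes; the data sizes are arbitrary and the measured ratio will vary with hardware and with how many iterations the algorithm happens to need.

import time
import numpy as np
from sklearn.cluster import KMeans

def time_kmeans(n_points, k=5):
    X = np.random.RandomState(0).rand(n_points, 10)
    start = time.perf_counter()
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return time.perf_counter() - start

t_small = time_kmeans(10_000)
t_large = time_kmeans(100_000)
# For an algorithm that is linear in n, the ratio should be roughly 10.
print(f"10k points: {t_small:.2f}s, 100k points: {t_large:.2f}s, ratio = {t_large / t_small:.1f}")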

Association Rule Learning


Association rule learning is an unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the relationship can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using various rules to discover those relations between variables in the database.

Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, and so on. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together.

For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby.
Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

We will understand these algorithms in later chapters.

How does Association Rule Learning work?


Association rule learning works on the concept of if/then rules, such as "if A, then B".

Here the "if" element is called the antecedent, and the "then" element is called the consequent. A relationship in which we find an association between two single items is known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support

Support is the frequency of an item or itemset, i.e., how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For a set of transactions T, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions)
Confidence

Confidence indicates how often the rule has been found to be true, i.e., how often items X and Y occur together in the dataset given that X has already occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

Lift measures the strength of a rule and can be defined by the following formula:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the expected support if X and Y were independent of each other. It has three possible ranges of values:

o If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.
o Lift > 1: the two itemsets are positively dependent on each other; the higher the lift, the stronger the dependence.
o Lift < 1: one item is a substitute for the other, meaning one item has a negative effect on the other.
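The three metrics can be computed directly from a list of transactions. Below is a minimal sketch with made-up basket data; the items and the rule X → Y are assumptions chosen for illustration.

# Toy transactions: each set is one market basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
supp_xy = support(X | Y)
conf = supp_xy / support(X)                 # Confidence(X -> Y) = Support(X U Y) / Support(X)
lift = supp_xy / (support(X) * support(Y))  # Lift(X -> Y) = Support(X U Y) / (Support(X) * Support(Y))

print(f"support={supp_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")

With these five baskets the rule {bread} → {butter} has support 0.60, confidence 0.75, and lift just under 1, so bread and butter occur together slightly less often than independence would suggest.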

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. The algorithm uses a breadth-first search and a hash tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps to understand which products can be bought together. It can also be used in the healthcare field to find drug reactions for patients.
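For a complete Apriori pipeline (frequent itemset generation followed by rule extraction), the third-party mlxtend library is often used. The sketch below is an assumption-laden example, not part of the original notes: it presumes mlxtend is installed and uses a tiny one-hot encoded basket table with made-up items and thresholds.

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded basket data: rows are transactions, columns are items.
baskets = pd.DataFrame(
    [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 1]],
    columns=["bread", "butter", "milk", "eggs"],
).astype(bool)

# Breadth-first generation of frequent itemsets, then rule extraction.
frequent = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])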

Eclat Algorithm

The Eclat algorithm stands for Equivalence Class Transformation. It uses a depth-first search technique to find frequent itemsets in a transaction database and generally executes faster than the Apriori algorithm.
F-P Growth Algorithm

The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.

Applications of Association Rule Learning


It has various applications in machine learning and data mining. Below are some popular applications of association
rule learning:

o Market Basket Analysis: This is one of the most popular examples and applications of association rule mining. The technique is commonly used by big retailers to determine associations between items.
o Medical Diagnosis: Association rules can help identify the probability of illness for a particular disease, supporting diagnosis and treatment.
o Protein Sequencing: Association rules help in determining the synthesis of artificial proteins.
o It is also used for catalog design, loss-leader analysis, and many other applications.
