DMW Notes by Me
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
Characterization
Discrimination
Association
Classification
Prediction
Correlation Analysis
Outlier Analysis
Evolution Analysis
Finance
Telecommunications
DNA
Stock Markets
E-mail
Data mining has a wide range of applications and use cases across many
industries and domains. Some of the most common use cases of data mining
include:
1. Market Basket Analysis: Market basket analysis is a common use case of
data mining in the retail and e-commerce industries. It involves analyzing
data on customer purchases to identify items that are frequently purchased
together, and using this information to make recommendations or
suggestions to customers.
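As a minimal sketch of the idea, frequently co-purchased pairs can be counted with the standard library; the basket contents below are made up for illustration.

```python
# A sketch of pair counting for market basket analysis; the basket
# contents are hypothetical.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items bought together
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# pair_counts.most_common() now ranks pairs by how often they co-occur,
# which is the raw input for "customers also bought" recommendations.
```

A real system would also normalize these counts into support and confidence before making recommendations.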
Overall, data mining has a wide range of applications and use cases across
many industries and domains. It is a powerful tool for uncovering insights and
information hidden in data sets and is widely used to solve a variety of business
and technical challenges.
Data Mining Algorithms: Data mining algorithms are the algorithms and
models that are used to perform data mining. These algorithms can include
supervised and unsupervised learning algorithms, such as regression,
classification, and clustering, as well as more specialized algorithms for
specific tasks, such as association rule mining and anomaly detection. Data
mining algorithms are applied to the data to extract useful insights and
information from it.
There are many different types of data mining, but they can generally be
grouped into three broad categories: descriptive, predictive, and prescriptive.
Descriptive data mining involves summarizing and describing the
characteristics of a data set. This type of data mining is often used to explore
and understand the data, identify patterns and trends, and summarize the
data in a meaningful way.
Predictive data mining involves using data to build models that can make
predictions or forecasts about future events or outcomes. This type of data
mining is often used to identify and model relationships between different
variables, and to make predictions about future events or outcomes based on
those relationships.
Data mining is the process of extracting useful information and insights from
large data sets. It typically involves several steps, including defining the
problem, preparing the data, exploring the data, modeling the data,
validating the model, implementing the model, and evaluating the results.
Let’s understand the process of Data Mining in the following phases:
The process of data mining typically begins with defining the problem or
question that you want to answer with your data. This involves
understanding the business context and goals and identifying the data that is
relevant to the problem.
Next, the data is prepared for analysis. This involves cleaning the data,
transforming it into a usable format, and checking for errors or
inconsistencies.
Once the data is prepared, you can begin exploring it to gain insights and
understand its characteristics. This typically involves using visualization
and summary statistics to understand the distribution, patterns, and trends in
the data.
The next step is to build models that can be used to make predictions or
forecasts based on the data. This involves choosing an appropriate
modeling technique, fitting the model to the data, and evaluating its
performance.
The final step in the data mining process is to evaluate the results of the
model and determine its effectiveness in solving the problem or
achieving the goals. This involves measuring the model’s performance,
comparing it to other models or approaches, and making any necessary
changes or improvements.
Overall, data mining is a powerful and flexible tool for extracting useful
information and insights from large data sets. By following these steps, data
miners and other practitioners can uncover valuable insights and information
hidden in their data, and use it to make better decisions and improve their
businesses.
Data Warehousing and Mining Software
Data mining tools – Data mining tools are software tools that are used to
extract information and insights from large data sets. These tools typically
include algorithms and methods for exploring, modeling, and analyzing data,
and they are commonly used in the field of data mining.
Data visualization tools – Data visualization tools are software tools that are
used to visualize and display data in a graphical or pictorial format. These
tools are commonly used in data mining to explore and understand the data,
and to communicate the results of the analysis.
Overall, data warehousing and mining software is a powerful and essential tool
for storing, managing and analyzing large data sets. This software is widely
used in the field of data warehousing and data mining, and it plays a crucial role
in the data-driven decision-making process.
Open-Source Software for Data Mining
There are many open-source software applications and platforms that are
available for data mining. These open-source tools provide a range of
algorithms, techniques, and functions that can be used to extract useful insights
and information from data, and are typically available at no cost. Some
examples of popular open-source software for data mining include:
RapidMiner – RapidMiner is an open-source data mining platform that
provides a range of tools and functions for data preparation, analysis, and
machine learning. It has a user-friendly interface and is suitable for users of
all skill levels, from beginners to experts. RapidMiner is available under the
AGPL license and is widely used in industries such as finance, healthcare,
and retail.
Data mining, data analytics, and data warehousing are closely related fields that
are often used together to extract useful information and insights from large
data sets. However, there are some key differences between these fields:
Data mining is the process of extracting useful information and insights from
large data sets. It involves applying algorithms and techniques to uncover
hidden patterns and relationships in the data and to generate predictions and
forecasts.
Data warehousing is the process of storing and managing large data sets. It
involves designing and implementing a database or data repository that can
efficiently store and manage data, and that can be queried and accessed by
data mining and analytics tools.
In summary, data mining, data analytics, and data warehousing are closely
related fields that are often used together to extract useful information and
insights from large data sets. Data mining focuses on applying algorithms and
techniques to uncover hidden patterns and relationships in the data, data
analytics focuses on applying statistical and mathematical methods to data
sets, and data warehousing focuses on storing and managing large data sets.
Data mining and data analysis are closely related, but they are not the same
thing. Data mining is a process of extracting useful insights and information
from data, using techniques and algorithms from fields such as statistics,
machine learning, and database management. Data analysis, on the other
hand, is the process of examining and interpreting data, typically to uncover
trends, patterns, and relationships.
Data mining and data analysis are often used together in a data-driven
approach to decision-making and problem-solving. Data mining involves
applying algorithms and techniques to data to extract useful insights and
information, while data analysis involves examining and interpreting these
insights and information to understand their significance and implications.
Overall, the main difference between data mining and data analysis is the focus
of each process. Data mining focuses on extracting useful insights and
information from data, while data analysis focuses on examining and
interpreting these insights and information to understand their meaning and
implications. Both data mining and data analysis are important and valuable
tools for making sense of data and making better decisions and predictions.
Data Mining vs. Data Science
Data mining and data science are closely related, but they are not the same
thing. Data mining is a process of extracting useful insights and information
from data, using techniques and algorithms from fields such as statistics,
machine learning, and database management. Data science, on the other hand,
is a broader field that involves using data and analytical methods to extract
knowledge and insights from data.
Data mining is a key component of data science, but it is not the only
component. Data science also involves other aspects of working with data, such
as data collection, cleaning, and preparation, as well as data visualization,
communication, and collaboration. Data science is therefore a broader and
more comprehensive field than data mining and involves a wider range of skills,
techniques, and tools.
Overall, the main difference between data mining and data science is the scope
and focus of each field. Data mining focuses on extracting useful insights and
information from data, using techniques and algorithms from fields such as
statistics and machine learning. Data science, on the other hand, is a broader
field that involves using data and analytical methods to extract knowledge and
insights from data, and to support decision-making and problem-solving. Both
data mining and data science are important and valuable fields that are driving
innovation and progress in many different industries and applications.
Data mining is the process of extracting useful information and insights from
large data sets. It is a powerful and flexible tool that has many benefits,
including:
1. Improved decision-making – One of the main benefits of data mining is that
it can help organizations make better decisions. By analyzing data and
uncovering hidden patterns and trends, data mining can provide valuable
insights and information that can be used to inform and improve decision-
making.
2. Reduced costs – Data mining can also help organizations reduce their
costs. By identifying and addressing inefficiencies and waste, data mining
can help organizations save money and improve their bottom line.
3. Improved risk management – Data mining can also be used to improve risk
management. By analyzing data on potential risks and vulnerabilities, data
mining can help organizations identify and mitigate potential risks, and make
more informed and strategic decisions.
Overall, data mining is a powerful tool that has many benefits for organizations.
By extracting valuable information and insights from data, data mining can help
organizations make better decisions, increase their efficiency and productivity,
reduce their costs, improve customer satisfaction, and manage risks more
effectively.
2. Model bias – Another limitation of data mining is the potential for bias in the
models that are built from the data. If the data is not representative of the
population, or if there is bias in the way the data is collected or analyzed, the
models that are built from the data may be biased, and may not accurately
reflect the underlying relationships in the data.
3. Explore the data – Once the data is prepared, you can begin exploring it to
gain insights and understand its characteristics. This step typically involves
using visualization and summary statistics to understand the distribution,
patterns, and trends in the data.
4. Model the data – The next step is to build models that can be used to make
predictions or forecasts based on the data. This step involves choosing an
appropriate modeling technique, fitting the model to the data, and evaluating
its performance.
5. Validate the model – After the model is built, it is important to validate its
performance to ensure that it is accurate and reliable. This step typically
involves using a separate data set (called a validation set) to evaluate the
model’s performance and make any necessary adjustments.
6. Implement the model – Once the model has been validated, it can be
implemented in a production environment to make predictions or
recommendations. This step involves deploying the model and integrating it
into the organization’s existing systems and processes.
7. Evaluate the results – The final step in the data mining process is to
evaluate the results of the model and determine its effectiveness in solving
the problem or achieving the goals. This step involves measuring the model’s
performance, comparing it to other models or approaches, and making any
necessary changes or improvements.
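The modeling, validation, and evaluation steps above can be sketched in a few lines on made-up numbers; the least-squares line fitted here is an illustrative choice, not a prescribed method.

```python
# A sketch of steps 4-7 on hypothetical data: fit a one-variable linear
# model on training data, then validate it on a held-out set.

train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
valid_x = [5.0, 6.0]             # held-out validation set
valid_y = [10.0, 12.1]

# Model the data: ordinary least squares for slope and intercept
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
sxx = sum((x - mean_x) ** 2 for x in train_x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Validate and evaluate: mean squared error on the validation set
preds = [slope * x + intercept for x in valid_x]
mse = sum((p - y) ** 2 for p, y in zip(preds, valid_y)) / len(valid_y)
```

A low validation error here is the signal, in step 5, that the model can move on to implementation.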
Overall, these seven steps form the core of the data mining process and are
used to explore, model, and make decisions based on data. By following these
steps, data miners and other practitioners can uncover valuable insights and
information hidden in their data.
Data mining techniques are algorithms and methods used to extract information
and insights from data sets. These techniques are commonly used in the field of
data mining and machine learning, and they include a variety of methods for
exploring, modeling, and analyzing data.
Some of the most common data mining techniques include:
1. Regression
Regression is a data mining technique that is used to model the relationship
between a dependent variable and one or more independent variables. In
regression analysis, the goal is to fit a mathematical model to the data that can
be used to make predictions or forecasts about the dependent variable based
on the values of the independent variables.
There are many different types of regression models, including linear
regression, logistic regression, and non-linear regression. These models differ
in the way that they model the relationship between the dependent and
independent variables, and in the assumptions that they make about the data.
In general, regression models are used to answer questions such as:
What is the relationship between the dependent and independent variables?
How well does the model fit the data?
How accurate are the predictions or forecasts made by the model?
Overall, regression is a powerful and widely used data mining technique that is
used to model and predict the relationship between variables in a data set. It is
a crucial tool for many applications in the field of data mining and is commonly
used in areas such as finance, marketing, and healthcare.
2. Classification
Classification is a data mining technique that is used to predict the class or
category of an item or instance based on its characteristics or attributes. In
classification analysis, the goal is to build a model that can accurately predict
the class of an item based on its attributes and to evaluate the performance of
the model.
There are many different types of classification models, including decision
trees, k-nearest neighbors, and support vector machines. These models differ
in the way that they model the relationship between the classes and the
attributes, and in the assumptions that they make about the data.
In general, classification models are used to answer questions such as:
What is the relationship between the classes and the attributes?
How well does the model fit the data?
How accurate are the predictions made by the model?
Overall, classification is a powerful and widely used data mining technique that
is used to predict the class or category of an item based on its characteristics. It
is a crucial tool for many applications in the field of data mining and is
commonly used in areas such as marketing, finance, and healthcare.
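A minimal sketch of one such model, a 1-nearest-neighbour classifier, is shown below; the coordinates and class labels are hypothetical.

```python
# A k-nearest-neighbours sketch with k = 1: predict an item's class from
# the class of its closest labelled neighbour. Data is made up.
import math

labelled = [
    ((1.0, 1.0), "cheap"),
    ((1.2, 0.8), "cheap"),
    ((8.0, 9.0), "premium"),
    ((9.0, 8.5), "premium"),
]

def classify(point):
    # Return the label of the nearest training instance (Euclidean distance)
    return min(labelled, key=lambda item: math.dist(point, item[0]))[1]

prediction = classify((8.5, 9.2))
```

Larger k values, distance weighting, and attribute scaling are the usual refinements on top of this core idea.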
3. Clustering
Clustering is a data mining technique that is used to group the items or
instances in a data set into clusters, so that items in the same cluster are
more similar to each other than to items in other clusters. Unlike
classification, clustering does not use predefined class labels. Common
clustering methods include k-means, hierarchical clustering, and
density-based clustering.
4. Association Rule Mining
Association rule mining is a data mining technique that is used to identify and
explore relationships between items or attributes in a data set. In association
rule mining, the goal is to identify patterns and rules that describe the co-
occurrence or occurrence of items or attributes in the data set and to evaluate
the strength and significance of these patterns and rules.
There are many different algorithms and methods for association rule mining,
including the Apriori algorithm and the FP-growth algorithm. These algorithms
differ in the way that they generate and evaluate association rules, and in the
assumptions that they make about the data.
In general, association rule mining is used to answer questions such as:
What are the main patterns and rules in the data?
How strong and significant are these patterns and rules?
What are the implications of these patterns and rules for the data set and the
domain?
Overall, association rule mining is a powerful and widely used data mining
technique that is used to identify and explore relationships between items or
attributes in a data set. It is a crucial tool for many applications in the field of
data mining and is commonly used in areas such as market basket analysis,
recommendation systems, and fraud detection.
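As a sketch, the support and confidence of a single hypothetical rule, {bread} → {butter}, can be computed directly from a handful of made-up transactions.

```python
# Evaluating one association rule, {bread} -> {butter}, on hypothetical
# baskets: support measures how often the items co-occur, confidence
# measures how often butter appears given that bread does.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
antecedent = sum(1 for t in transactions if "bread" in t)

support = both / n              # fraction of all baskets with both items
confidence = both / antecedent  # fraction of bread baskets that add butter
```

Algorithms such as Apriori and FP-growth automate exactly this evaluation over all candidate rules.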
5. Dimensionality Reduction
In summary, data mining and machine learning are closely related fields, but
they have some key differences. Data mining focuses on extracting useful
insights from structured data, while machine learning focuses on using
algorithms and models to learn from data and make predictions. Both data
mining and machine learning are powerful and widely used tools for extracting
insights from data and are often used together in many applications and
domains.
One of the major current advancements in data mining is the increasing use of
big data technologies. These technologies, such as Hadoop and Spark, enable
data mining on large and complex data sets and provide scalable and efficient
ways to process and analyze data. As the amount of data generated by
businesses and organizations continues to grow, big data technologies are
becoming increasingly important for data mining.
2. Machine Learning
3. Graph Mining
Graph mining is a relatively new field that involves applying data mining
techniques to graphs and networks. Graphs and networks are used to
represent complex and interrelated data and can be mined to uncover hidden
patterns and relationships in the data. By applying graph mining techniques to
data mining, it is possible to extract valuable insights and information from
complex and interrelated data.
4. Cloud Computing
The growth of big data and the increasing availability of cloud computing
technologies are likely to continue to drive the development of data mining. As
more and more data is generated and collected, data mining will become
increasingly important for managing, analyzing, and extracting insights from
this data. Cloud computing will also make it easier for organizations to access
and use data mining tools and technologies and will enable them to perform
large-scale and complex data mining analyses.
2. Machine Learning and Artificial Intelligence
3. Data Privacy and Security
As data mining becomes more widely used, concerns about data privacy and
security are likely to become more important. Organizations will need to ensure
that they comply with data protection laws and regulations and that they protect
the privacy and security of their data and the individuals who are represented
in it. This will require the development of new technologies and practices for
data mining, such as privacy-preserving data mining algorithms and secure
data management systems.
4. Data Ethics and Governance
As data mining becomes more powerful and widely used, there will be a
growing need for ethical and governance frameworks to guide its use and
ensure that it is used responsibly and for the benefit of society. This will require
the development of ethical principles and guidelines for data mining, and the
creation of governance structures and mechanisms to ensure that data mining
is used in an ethical and responsible manner. This will involve a range of
stakeholders, including data scientists, policymakers, and ethicists, who will
need to work together to develop and implement these frameworks.
Overall, the future of data mining is likely to be shaped by the continued growth
of data, the development of new technologies and tools, and the increasing
importance of data privacy and ethics. Data mining will continue to be a
powerful and widely used tool for extracting useful insights and information
from data and will play a critical role in many applications and domains.
Data Mining and Social Media
Data mining is the process of extracting useful information and insights from
large data sets, and social media is a rich source of data that can be mined for
insights and information. By analyzing data from social media platforms,
organizations can gain valuable insights into consumer behavior, preferences,
and opinions, and use this information to inform and improve their marketing
and advertising efforts.
Some common examples of data mining in social media include:
1. Sentiment analysis – Sentiment analysis is a common application of data
mining in social media. By analyzing the text of social media posts and
comments, organizations can determine the overall sentiment of users
towards their products, services, or brand, and use this information to
improve their marketing and customer service efforts.
2. Trend analysis – Data mining can also be used to analyze trends on social
media. By analyzing data on user behavior and interactions, organizations
can identify emerging trends and topics of interest, and use this information
to tailor their content and messaging to be more relevant and engaging.
Overall, data mining is a powerful tool for extracting useful information and
insights from social media data. By analyzing data from social media platforms,
organizations can gain valuable insights into consumer behavior, preferences,
and opinions, and use this information to inform and improve their marketing
and advertising efforts.
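A minimal lexicon-based sketch of sentiment analysis is shown below; the word lists and posts are made up, and production systems typically use trained models rather than fixed word lists.

```python
# A lexicon-based sentiment sketch: score a post by counting positive
# and negative words. Word lists and posts are hypothetical.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "awful", "slow"}

def sentiment(post):
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

labels = [sentiment(p) for p in [
    "I love this product, great support",
    "awful experience, slow delivery",
    "arrived on tuesday",
]]
```

Aggregating such labels over thousands of posts is what yields the overall brand sentiment described above.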
Best Tools/Programming Languages for Data Mining
There are many different tools and platforms available for data mining, and the
best tool for you will depend on your specific needs and requirements. Some of
the most popular and widely used tools for data mining include:
1. R – R is a powerful programming language for data analysis and statistical
computing. It has a rich ecosystem of packages and tools for data mining and
is widely used by data miners and other practitioners.
2. IBM SPSS – IBM SPSS is a commercial software suite for data analysis and
predictive modeling. It has a range of tools and features for data mining and
is widely used in the social sciences and other fields.
Data Mining in R
R is a popular programming language for data analysis and statistical
computing. It has a rich ecosystem of packages and tools for data mining,
including tools for pre-processing, visualization, and modeling. Data miners and
other practitioners can use R to quickly and easily explore and analyze their
data, build and evaluate predictive models, and visualize the results of their
analysis.
To get started with data mining in R, you will need to install R and some of the
commonly used packages for data mining, such as caret, arules, cluster, and
ggplot2. Once you have these tools installed, you can load your data and start
exploring it, using R’s powerful data manipulation and visualization capabilities.
You can then use the tools and functions provided by these packages to pre-
process your data, build predictive models, and evaluate and visualize the
results of your analysis.
Overall, R is a powerful and flexible language for data mining, and the rich
ecosystem of packages and tools available for R makes it an attractive choice
for data miners and other practitioners who need to quickly and easily explore,
analyze, and model their data.
The Benefits of Data Mining in R
2. R is not as fast or scalable as some other languages and tools, which can
make it difficult to handle large datasets or perform complex data mining
tasks.
3. Data Selection
Data selection is defined as the process where data relevant to the
analysis is decided on and retrieved from the data collection. For this,
methods such as neural networks, decision trees, naive Bayes,
clustering, and regression can be used.
4. Data Transformation
Data transformation is defined as the process of transforming data into
the appropriate form required by the mining procedure. Data
transformation is a two-step process:
Data mapping: assigning elements from the source base to the
destination to capture transformations.
Code generation: creation of the actual transformation program.
5. Data Mining
Data mining is defined as the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns, and decides the purpose of the model, using classification or
characterization.
6. Pattern Evaluation
Pattern evaluation is defined as identifying truly interesting patterns
representing knowledge, based on given interestingness measures. It
finds the interestingness score of each pattern, and
uses summarization and visualization to make the data understandable
to the user.
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
Advantages of KDD
Disadvantages of KDD
Data Warehousing
Features :
Advantages:
Disadvantages:
Subject-Oriented
A data warehouse targets the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a
concise and straightforward view around a particular subject, such
as customer, product, or sales, instead of the global organization's
ongoing operations. This is done by excluding data that are not
useful concerning the subject and including all data needed by the
users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources
like RDBMS, flat files, and online transaction records. It requires
performing data cleaning and integration during data warehousing
to ensure consistency in naming conventions, attribute types, etc.,
among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one
can retrieve data from 3 months, 6 months, 12 months, or even
further back from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is
transformed from the source operational RDBMS. The operational
updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and
access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that once
data has entered the warehouse, it should not change.
Apriori Algorithm
The Apriori algorithm is used to calculate association rules between
objects, i.e., how two or more objects are related to one another. In
other words, the Apriori algorithm is an association rule learning
method that analyzes, for example, whether people who bought
product A also bought product B.
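A sketch of the first two Apriori levels on made-up baskets is shown below: keep only itemsets that meet a minimum support count, then extend the survivors one item at a time.

```python
# A minimal Apriori sketch on hypothetical baskets: frequent 1-itemsets
# first, then candidate 2-itemsets built only from frequent items
# (the Apriori property: subsets of a frequent itemset must be frequent).
from itertools import combinations

transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
min_support = 3  # minimum number of transactions containing the itemset

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent1 = {i for i in items if count({i}) >= min_support}

# Candidate 2-itemsets, pruned by the same support threshold
frequent2 = {frozenset(p) for p in combinations(sorted(frequent1), 2)
             if count(set(p)) >= min_support}
```

The full algorithm repeats this generate-and-prune loop for triple itemsets and beyond until no candidates survive.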
Transaction Reduction
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that
stores quantitative information about frequent patterns in a
database. Each transaction is read and then mapped onto a path in
the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain
compact because their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the
database. The purpose of the FP tree is to mine the most frequent
pattern. Each node of the FP tree represents an item of the item set.
The root node represents null, while the lower nodes represent the
item sets. The associations of the nodes with the lower nodes, that
is, the item sets with the other item sets, are maintained while
forming the tree.
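A minimal sketch of this structure using nested dictionaries is shown below; a real FP-tree also maintains header links between nodes for mining, so this only illustrates the shared-prefix compression.

```python
# An FP-tree sketch: each transaction (items in a fixed order) is mapped
# onto a path from the root; transactions with a common prefix share nodes,
# which keeps the tree compact.

def insert(tree, items):
    node = tree
    for item in items:
        # Reuse the child node if the prefix already exists, else create it
        child = node.setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1
        node = child["children"]

root = {}  # the root represents null
for transaction in [["A", "B"], ["A", "B", "C"], ["A", "D"]]:
    insert(root, transaction)
```

After these three inserts, the shared prefix "A" is stored once with a count of 3, and "A, B" once with a count of 2, showing how overlapping paths keep the tree compact.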
Apriori vs. FP Growth
Apriori generates frequent patterns by making itemsets using pairings
such as single itemsets, double itemsets, and triple itemsets; FP Growth
generates an FP-tree for making frequent patterns.
Apriori uses candidate generation, where frequent subsets are extended
one item at a time; FP Growth generates a conditional FP-tree for every
item in the data.
Since Apriori scans the database in each step, it becomes
time-consuming when the number of items is large; the FP-tree requires
only one database scan in its beginning steps, so it consumes less time.
Apriori saves a converted version of the database in memory; FP Growth
saves a set of conditional FP-trees for every item in memory.
OLAP stands for Online Analytical Processing. OLAP systems have the
capability to analyze database information of multiple systems at the
current time. The primary goal of OLAP Service is data analysis and
not data processing.
OLTP stands for Online Transaction Processing. OLTP has the work to
administer day-to-day transactions in any organization. The main goal
of OLTP is data processing not data analysis.
OLAP Examples
OLTP Examples
OLAP stands for Online Analytical Processing; OLTP stands for Online
Transaction Processing.
OLAP includes software tools that help in analyzing data, mainly for
business decisions; OLTP helps in managing online database modification.
OLAP holds old data from various databases; OLTP holds current
operational data.
In OLAP, the tables are not normalized; in OLTP, the tables are
normalized.
OLAP allows only read and hardly any write operations; OLTP allows both
read and write operations.
In OLAP, complex queries are involved; in OLTP, the queries are simple.
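The analytical side can be sketched as a small roll-up over made-up sales records, aggregating at a fine grain and then dropping a dimension, using only the standard library.

```python
# An OLAP-style aggregation sketch on hypothetical sales records:
# summarize at the (region, product) level, then roll up to region.
from collections import defaultdict

sales = [
    ("north", "tv", 100),
    ("north", "tv", 150),
    ("north", "radio", 50),
    ("south", "tv", 200),
]

by_region_product = defaultdict(int)
by_region = defaultdict(int)
for region, product, amount in sales:
    by_region_product[(region, product)] += amount
    by_region[region] += amount  # roll-up: drop the product dimension
```

OLAP servers pre-compute many such aggregates across all dimension combinations, which is what makes slice, dice, and roll-up queries fast.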
Advantages of ROLAP –
ROLAP is used to handle large amounts of data.
ROLAP tools don’t use pre-calculated data cubes.
Data can be stored efficiently.
ROLAP can leverage functionalities inherent in the relational
database.
Disadvantages of ROLAP –
Performance of ROLAP can be slow.
In ROLAP, it is difficult to maintain aggregate tables.
Limited by SQL functionalities.
Multidimensional Online Analytical Processing
(MOLAP) :
Advantages of MOLAP –
MOLAP is basically used for complex calculations.
MOLAP is optimal for operations such as slice and dice.
MOLAP allows the fastest indexing into the pre-computed
summarized data.
Disadvantages of MOLAP –
MOLAP can’t handle large amounts of data.
MOLAP requires additional investment.
Without re-aggregation, it is difficult to change dimensions.
Advantages of HOLAP –
HOLAP provides the functionalities of both MOLAP and
ROLAP.
HOLAP provides fast access at all levels of aggregation.
Disadvantages of HOLAP –
HOLAP architecture is very complex to understand because it
supports both MOLAP and ROLAP.
Difference between ROLAP, MOLAP and HOLAP :
Storage location for summary aggregation: ROLAP uses a relational
database; MOLAP uses a multidimensional database; HOLAP uses a
multidimensional database.
Storage space requirement: large in ROLAP as compared to MOLAP and
HOLAP; medium in MOLAP as compared to ROLAP and HOLAP; small in
HOLAP as compared to MOLAP and ROLAP.
Latency: low in ROLAP as compared to MOLAP and HOLAP; high in
MOLAP as compared to ROLAP and HOLAP; medium in HOLAP as
compared to MOLAP and ROLAP.
Query response time: slow in ROLAP as compared to MOLAP and
HOLAP; fast in MOLAP as compared to ROLAP and HOLAP; medium in
HOLAP as compared to MOLAP and ROLAP.