DMW Notes by Me
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
Characterization
Discrimination
Association
Classification
Prediction
Correlation Analysis
Outlier Analysis
Evolution Analysis
Finance
Telecommunications
DNA
Stock Markets
E-mail
Data mining has a wide range of applications and use cases across many
industries and domains. Some of the most common use cases of data mining
include:
1. Market Basket Analysis: Market basket analysis is a common use case of
data mining in the retail and e-commerce industries. It involves analyzing
data on customer purchases to identify items that are frequently purchased
together, and using this information to make recommendations or
suggestions to customers.
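As a minimal sketch of the idea, frequently co-purchased pairs can be counted with the standard library; the basket contents below are made up for illustration.

```python
# A sketch of pair counting for market basket analysis; the basket
# contents are hypothetical.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in transactions:
    # Count every unordered pair of items bought together
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# pair_counts.most_common() now ranks pairs by how often they co-occur,
# which is the raw input for "customers also bought" recommendations.
```

A real system would also normalize these counts into support and confidence before making recommendations.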
Overall, data mining has a wide range of applications and use cases across
many industries and domains. It is a powerful tool for uncovering insights and
information hidden in data sets and is widely used to solve a variety of business
and technical challenges.
Data Mining Algorithms: Data mining algorithms are the algorithms and
models that are used to perform data mining. These algorithms can include
supervised and unsupervised learning algorithms, such as regression,
classification, and clustering, as well as more specialized algorithms for
specific tasks, such as association rule mining and anomaly detection. Data
mining algorithms are applied to the data to extract useful insights and
information from it.
There are many different types of data mining, but they can generally be
grouped into three broad categories: descriptive, predictive, and prescriptive.
Descriptive data mining involves summarizing and describing the
characteristics of a data set. This type of data mining is often used to explore
and understand the data, identify patterns and trends, and summarize the
data in a meaningful way.
Predictive data mining involves using data to build models that can make
predictions or forecasts about future events or outcomes. This type of data
mining is often used to identify and model relationships between different
variables, and to make predictions about future events or outcomes based on
those relationships.
Data mining is the process of extracting useful information and insights from
large data sets. It typically involves several steps, including defining the
problem, preparing the data, exploring the data, modeling the data,
validating the model, implementing the model, and evaluating the results.
Let’s understand the process of Data Mining in the following phases:
The process of data mining typically begins with defining the problem or
question that you want to answer with your data. This involves
understanding the business context and goals and identifying the data that is
relevant to the problem.
Next, the data is prepared for analysis. This involves cleaning the data,
transforming it into a usable format, and checking for errors or
inconsistencies.
Once the data is prepared, you can begin exploring it to gain insights and
understand its characteristics. This typically involves using visualization
and summary statistics to understand the distribution, patterns, and trends in
the data.
The next step is to build models that can be used to make predictions or
forecasts based on the data. This involves choosing an appropriate
modeling technique, fitting the model to the data, and evaluating its
performance.
The final step in the data mining process is to evaluate the results of the
model and determine its effectiveness in solving the problem or
achieving the goals. This involves measuring the model’s performance,
comparing it to other models or approaches, and making any necessary
changes or improvements.
Overall, data mining is a powerful and flexible tool for extracting useful
information and insights from large data sets. By following these steps, data
miners and other practitioners can uncover valuable insights and information
hidden in their data, and use it to make better decisions and improve their
businesses.
Data Warehousing and Mining Software
Data mining tools – Data mining tools are software tools that are used to
extract information and insights from large data sets. These tools typically
include algorithms and methods for exploring, modeling, and analyzing data,
and they are commonly used in the field of data mining.
Data visualization tools – Data visualization tools are software tools that are
used to visualize and display data in a graphical or pictorial format. These
tools are commonly used in data mining to explore and understand the data,
and to communicate the results of the analysis.
Overall, data warehousing and mining software is a powerful and essential tool
for storing, managing and analyzing large data sets. This software is widely
used in the field of data warehousing and data mining, and it plays a crucial role
in the data-driven decision-making process.
Open-Source Software for Data Mining
There are many open-source software applications and platforms that are
available for data mining. These open-source tools provide a range of
algorithms, techniques, and functions that can be used to extract useful insights
and information from data, and are typically available at no cost. Some
examples of popular open-source software for data mining include:
RapidMiner – RapidMiner is an open-source data mining platform that
provides a range of tools and functions for data preparation, analysis, and
machine learning. It has a user-friendly interface and is suitable for users of
all skill levels, from beginners to experts. RapidMiner is available under the
AGPL license and is widely used in industries such as finance, healthcare,
and retail.
Data mining, data analytics, and data warehousing are closely related fields that
are often used together to extract useful information and insights from large
data sets. However, there are some key differences between these fields:
Data mining is the process of extracting useful information and insights from
large data sets. It involves applying algorithms and techniques to uncover
hidden patterns and relationships in the data and to generate predictions and
forecasts.
Data warehousing is the process of storing and managing large data sets. It
involves designing and implementing a database or data repository that can
efficiently store and manage data, and that can be queried and accessed by
data mining and analytics tools.
In summary, data mining, data analytics, and data warehousing are closely
related fields that are often used together to extract useful information and
insights from large data sets. Data mining focuses on applying algorithms and
techniques to uncover hidden patterns and relationships in the data, data
analytics focuses on applying statistical and mathematical methods to data
sets, and data warehousing focuses on storing and managing large data sets.
Data mining and data analysis are closely related, but they are not the same
thing. Data mining is a process of extracting useful insights and information
from data, using techniques and algorithms from fields such as statistics,
machine learning, and database management. Data analysis, on the other
hand, is the process of examining and interpreting data, typically to uncover
trends, patterns, and relationships.
Data mining and data analysis are often used together in a data-driven
approach to decision-making and problem-solving. Data mining involves
applying algorithms and techniques to data to extract useful insights and
information, while data analysis involves examining and interpreting these
insights and information to understand their significance and implications.
Overall, the main difference between data mining and data analysis is the focus
of each process. Data mining focuses on extracting useful insights and
information from data, while data analysis focuses on examining and
interpreting these insights and information to understand their meaning and
implications. Both data mining and data analysis are important and valuable
tools for making sense of data and making better decisions and predictions.
Data Mining vs. Data Science
Data mining and data science are closely related, but they are not the same
thing. Data mining is a process of extracting useful insights and information
from data, using techniques and algorithms from fields such as statistics,
machine learning, and database management. Data science, on the other hand,
is a broader field that involves using data and analytical methods to extract
knowledge and insights from data.
Data mining is a key component of data science, but it is not the only
component. Data science also involves other aspects of working with data, such
as data collection, cleaning, and preparation, as well as data visualization,
communication, and collaboration. Data science is therefore a broader and
more comprehensive field than data mining and involves a wider range of skills,
techniques, and tools.
Overall, the main difference between data mining and data science is the scope
and focus of each field. Data mining focuses on extracting useful insights and
information from data, using techniques and algorithms from fields such as
statistics and machine learning. Data science, on the other hand, is a broader
field that involves using data and analytical methods to extract knowledge and
insights from data, and to support decision-making and problem-solving. Both
data mining and data science are important and valuable fields that are driving
innovation and progress in many different industries and applications.
Data mining is the process of extracting useful information and insights from
large data sets. It is a powerful and flexible tool that has many benefits,
including:
1. Improved decision-making – One of the main benefits of data mining is that
it can help organizations make better decisions. By analyzing data and
uncovering hidden patterns and trends, data mining can provide valuable
insights and information that can be used to inform and improve decision-
making.
2. Reduced costs – Data mining can also help organizations reduce their
costs. By identifying and addressing inefficiencies and waste, data mining
can help organizations save money and improve their bottom line.
3. Improved risk management – Data mining can also be used to improve risk
management. By analyzing data on potential risks and vulnerabilities, data
mining can help organizations identify and mitigate potential risks, and make
more informed and strategic decisions.
Overall, data mining is a powerful tool that has many benefits for organizations.
By extracting valuable information and insights from data, data mining can help
organizations make better decisions, increase their efficiency and productivity,
reduce their costs, improve customer satisfaction, and manage risks more
effectively.
2. Model bias – Another limitation of data mining is the potential for bias in the
models that are built from the data. If the data is not representative of the
population, or if there is bias in the way the data is collected or analyzed, the
models that are built from the data may be biased, and may not accurately
reflect the underlying relationships in the data.
3. Explore the data – Once the data is prepared, you can begin exploring it to
gain insights and understand its characteristics. This step typically involves
using visualization and summary statistics to understand the distribution,
patterns, and trends in the data.
4. Model the data – The next step is to build models that can be used to make
predictions or forecasts based on the data. This step involves choosing an
appropriate modeling technique, fitting the model to the data, and evaluating
its performance.
5. Validate the model – After the model is built, it is important to validate its
performance to ensure that it is accurate and reliable. This step typically
involves using a separate data set (called a validation set) to evaluate the
model’s performance and make any necessary adjustments.
6. Implement the model – Once the model has been validated, it can be
implemented in a production environment to make predictions or
recommendations. This step involves deploying the model and integrating it
into the organization’s existing systems and processes.
7. Evaluate the results – The final step in the data mining process is to
evaluate the results of the model and determine its effectiveness in solving
the problem or achieving the goals. This step involves measuring the model’s
performance, comparing it to other models or approaches, and making any
necessary changes or improvements.
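The modeling, validation, and evaluation steps above can be sketched in a few lines on made-up numbers; the least-squares line fitted here is an illustrative choice, not a prescribed method.

```python
# A sketch of steps 4-7 on hypothetical data: fit a one-variable linear
# model on training data, then validate it on a held-out set.

train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
valid_x = [5.0, 6.0]             # held-out validation set
valid_y = [10.0, 12.1]

# Model the data: ordinary least squares for slope and intercept
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
sxx = sum((x - mean_x) ** 2 for x in train_x)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Validate and evaluate: mean squared error on the validation set
preds = [slope * x + intercept for x in valid_x]
mse = sum((p - y) ** 2 for p, y in zip(preds, valid_y)) / len(valid_y)
```

A low validation error here is the signal, in step 5, that the model can move on to implementation.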
Overall, these seven steps form the core of the data mining process and are
used to explore, model, and make decisions based on data. By following these
steps, data miners and other practitioners can uncover valuable insights and
information hidden in their data.
Data mining techniques are algorithms and methods used to extract information
and insights from data sets. These techniques are commonly used in the field of
data mining and machine learning, and they include a variety of methods for
exploring, modeling, and analyzing data.
Some of the most common data mining techniques include:
1. Regression
Regression is a data mining technique that is used to model the relationship
between a dependent variable and one or more independent variables. In
regression analysis, the goal is to fit a mathematical model to the data that can
be used to make predictions or forecasts about the dependent variable based
on the values of the independent variables.
There are many different types of regression models, including linear
regression, logistic regression, and non-linear regression. These models differ
in the way that they model the relationship between the dependent and
independent variables, and in the assumptions that they make about the data.
In general, regression models are used to answer questions such as:
What is the relationship between the dependent and independent variables?
How well does the model fit the data?
How accurate are the predictions or forecasts made by the model?
Overall, regression is a powerful and widely used data mining technique that is
used to model and predict the relationship between variables in a data set. It is
a crucial tool for many applications in the field of data mining and is commonly
used in areas such as finance, marketing, and healthcare.
2. Classification
Classification is a data mining technique that is used to predict the class or
category of an item or instance based on its characteristics or attributes. In
classification analysis, the goal is to build a model that can accurately predict
the class of an item based on its attributes and to evaluate the performance of
the model.
There are many different types of classification models, including decision
trees, k-nearest neighbors, and support vector machines. These models differ
in the way that they model the relationship between the classes and the
attributes, and in the assumptions that they make about the data.
In general, classification models are used to answer questions such as:
What is the relationship between the classes and the attributes?
How well does the model fit the data?
How accurate are the predictions made by the model?
Overall, classification is a powerful and widely used data mining technique that
is used to predict the class or category of an item based on its characteristics. It
is a crucial tool for many applications in the field of data mining and is
commonly used in areas such as marketing, finance, and healthcare.
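A minimal sketch of one such model, a 1-nearest-neighbour classifier, is shown below; the coordinates and class labels are hypothetical.

```python
# A k-nearest-neighbours sketch with k = 1: predict an item's class from
# the class of its closest labelled neighbour. Data is made up.
import math

labelled = [
    ((1.0, 1.0), "cheap"),
    ((1.2, 0.8), "cheap"),
    ((8.0, 9.0), "premium"),
    ((9.0, 8.5), "premium"),
]

def classify(point):
    # Return the label of the nearest training instance (Euclidean distance)
    return min(labelled, key=lambda item: math.dist(point, item[0]))[1]

prediction = classify((8.5, 9.2))
```

Larger k values, distance weighting, and attribute scaling are the usual refinements on top of this core idea.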
3. Clustering
Clustering is a data mining technique that is used to group the items or
instances in a data set into clusters, so that items in the same cluster are
more similar to each other than to items in other clusters. Unlike
classification, clustering does not use predefined class labels. Common
clustering methods include k-means, hierarchical clustering, and
density-based clustering.
4. Association Rule Mining
Association rule mining is a data mining technique that is used to identify and
explore relationships between items or attributes in a data set. In association
rule mining, the goal is to identify patterns and rules that describe the co-
occurrence or occurrence of items or attributes in the data set and to evaluate
the strength and significance of these patterns and rules.
There are many different algorithms and methods for association rule mining,
including the Apriori algorithm and the FP-growth algorithm. These algorithms
differ in the way that they generate and evaluate association rules, and in the
assumptions that they make about the data.
In general, association rule mining is used to answer questions such as:
What are the main patterns and rules in the data?
How strong and significant are these patterns and rules?
What are the implications of these patterns and rules for the data set and the
domain?
Overall, association rule mining is a powerful and widely used data mining
technique that is used to identify and explore relationships between items or
attributes in a data set. It is a crucial tool for many applications in the field of
data mining and is commonly used in areas such as market basket analysis,
recommendation systems, and fraud detection.
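As a sketch, the support and confidence of a single hypothetical rule, {bread} → {butter}, can be computed directly from a handful of made-up transactions.

```python
# Evaluating one association rule, {bread} -> {butter}, on hypothetical
# baskets: support measures how often the items co-occur, confidence
# measures how often butter appears given that bread does.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
antecedent = sum(1 for t in transactions if "bread" in t)

support = both / n              # fraction of all baskets with both items
confidence = both / antecedent  # fraction of bread baskets that add butter
```

Algorithms such as Apriori and FP-growth automate exactly this evaluation over all candidate rules.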
5. Dimensionality Reduction
In summary, data mining and machine learning are closely related fields, but
they have some key differences. Data mining focuses on extracting useful
insights from structured data, while machine learning focuses on using
algorithms and models to learn from data and make predictions. Both data
mining and machine learning are powerful and widely used tools for extracting
insights from data and are often used together in many applications and
domains.
One of the major current advancements in data mining is the increasing use of
big data technologies. These technologies, such as Hadoop and Spark, enable
data mining on large and complex data sets and provide scalable and efficient
ways to process and analyze data. As the amount of data generated by
businesses and organizations continues to grow, big data technologies are
becoming increasingly important for data mining.
2. Machine Learning
3. Graph Mining
Graph mining is a relatively new field that involves applying data mining
techniques to graphs and networks. Graphs and networks are used to
represent complex and interrelated data and can be mined to uncover hidden
patterns and relationships in the data. By applying graph mining techniques to
data mining, it is possible to extract valuable insights and information from
complex and interrelated data.
4. Cloud Computing
The growth of big data and the increasing availability of cloud computing
technologies are likely to continue to drive the development of data mining. As
more and more data is generated and collected, data mining will become
increasingly important for managing, analyzing, and extracting insights from
this data. Cloud computing will also make it easier for organizations to access
and use data mining tools and technologies and will enable them to perform
large-scale and complex data mining analyses.
2. Machine Learning and Artificial Intelligence
3. Data Privacy and Security
As data mining becomes more widely used, concerns about data privacy and
security are likely to become more important. Organizations will need to ensure
that they comply with data protection laws and regulations and that they protect
the privacy and security of their data and the individuals who are represented
in it. This will require the development of new technologies and practices for
data mining, such as privacy-preserving data mining algorithms and secure
data management systems.
4. Data Ethics and Governance
As data mining becomes more powerful and widely used, there will be a
growing need for ethical and governance frameworks to guide its use and
ensure that it is used responsibly and for the benefit of society. This will require
the development of ethical principles and guidelines for data mining, and the
creation of governance structures and mechanisms to ensure that data mining
is used in an ethical and responsible manner. This will involve a range of
stakeholders, including data scientists, policymakers, and ethicists, who will
need to work together to develop and implement these frameworks.
Overall, the future of data mining is likely to be shaped by the continued growth
of data, the development of new technologies and tools, and the increasing
importance of data privacy and ethics. Data mining will continue to be a
powerful and widely used tool for extracting useful insights and information
from data and will play a critical role in many applications and domains.
Data Mining and Social Media
Data mining is the process of extracting useful information and insights from
large data sets, and social media is a rich source of data that can be mined for
insights and information. By analyzing data from social media platforms,
organizations can gain valuable insights into consumer behavior, preferences,
and opinions, and use this information to inform and improve their marketing
and advertising efforts.
Some common examples of data mining in social media include:
1. Sentiment analysis – Sentiment analysis is a common application of data
mining in social media. By analyzing the text of social media posts and
comments, organizations can determine the overall sentiment of users
towards their products, services, or brand, and use this information to
improve their marketing and customer service efforts.
2. Trend analysis – Data mining can also be used to analyze trends on social
media. By analyzing data on user behavior and interactions, organizations
can identify emerging trends and topics of interest, and use this information
to tailor their content and messaging to be more relevant and engaging.
Overall, data mining is a powerful tool for extracting useful information and
insights from social media data. By analyzing data from social media platforms,
organizations can gain valuable insights into consumer behavior, preferences,
and opinions, and use this information to inform and improve their marketing
and advertising efforts.
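A minimal lexicon-based sketch of sentiment analysis is shown below; the word lists and posts are made up, and production systems typically use trained models rather than fixed word lists.

```python
# A lexicon-based sentiment sketch: score a post by counting positive
# and negative words. Word lists and posts are hypothetical.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "awful", "slow"}

def sentiment(post):
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

labels = [sentiment(p) for p in [
    "I love this product, great support",
    "awful experience, slow delivery",
    "arrived on tuesday",
]]
```

Aggregating such labels over thousands of posts is what yields the overall brand sentiment described above.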
Best Tools/Programming Languages for Data Mining
There are many different tools and platforms available for data mining, and the
best tool for you will depend on your specific needs and requirements. Some of
the most popular and widely used tools for data mining include:
1. R – R is a powerful programming language for data analysis and statistical
computing. It has a rich ecosystem of packages and tools for data mining and
is widely used by data miners and other practitioners.
2. IBM SPSS – IBM SPSS is a commercial software suite for data analysis and
predictive modeling. It has a range of tools and features for data mining and
is widely used in the social sciences and other fields.
Data Mining in R
R is a popular programming language for data analysis and statistical
computing. It has a rich ecosystem of packages and tools for data mining,
including tools for pre-processing, visualization, and modeling. Data miners and
other practitioners can use R to quickly and easily explore and analyze their
data, build and evaluate predictive models, and visualize the results of their
analysis.
To get started with data mining in R, you will need to install R and some of the
commonly used packages for data mining, such as caret, arules, cluster, and
ggplot2. Once you have these tools installed, you can load your data and start
exploring it, using R’s powerful data manipulation and visualization capabilities.
You can then use the tools and functions provided by these packages to pre-
process your data, build predictive models, and evaluate and visualize the
results of your analysis.
Overall, R is a powerful and flexible language for data mining, and the rich
ecosystem of packages and tools available for R makes it an attractive choice
for data miners and other practitioners who need to quickly and easily explore,
analyze, and model their data.
The Benefits of Data Mining in R
2. R is not as fast or scalable as some other languages and tools, which can
make it difficult to handle large datasets or perform complex data mining
tasks.
3. Data Selection
Data selection is defined as the process where data relevant to the
analysis is decided on and retrieved from the data collection. For this,
methods such as neural networks, decision trees, naive Bayes,
clustering, and regression can be used.
4. Data Transformation
Data transformation is defined as the process of transforming data into
the appropriate form required by the mining procedure. Data
transformation is a two-step process:
Data mapping: assigning elements from the source base to the
destination to capture transformations.
Code generation: creation of the actual transformation program.
5. Data Mining
Data mining is defined as the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns, and decides the purpose of the model, using classification or
characterization.
6. Pattern Evaluation
Pattern evaluation is defined as identifying truly interesting patterns
representing knowledge, based on given interestingness measures. It
finds the interestingness score of each pattern, and
uses summarization and visualization to make the data understandable
to the user.
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and can be
used to make decisions.
Advantages of KDD
Disadvantages of KDD
Data Warehousing
Features :
Advantages:
Disadvantages:
Subject-Oriented
A data warehouse targets the modeling and analysis of data for
decision-makers. Therefore, data warehouses typically provide a
concise and straightforward view around a particular subject, such
as customer, product, or sales, instead of the global organization's
ongoing operations. This is done by excluding data that are not
useful concerning the subject and including all data needed by the
users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources
like RDBMS, flat files, and online transaction records. It requires
performing data cleaning and integration during data warehousing
to ensure consistency in naming conventions, attribute types, etc.,
among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one
can retrieve data from 3 months, 6 months, 12 months, or even
further back from a data warehouse. This contrasts with a
transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is
transformed from the source operational RDBMS. The operational
updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and
access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that once
data has entered the warehouse, it should not change.
Apriori Algorithm
The Apriori algorithm is used to calculate association rules between
objects, i.e., how two or more objects are related to one another. In
other words, the Apriori algorithm is an association rule learning
method that analyzes, for example, whether people who bought
product A also bought product B.
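A sketch of the first two Apriori levels on made-up baskets is shown below: keep only itemsets that meet a minimum support count, then extend the survivors one item at a time.

```python
# A minimal Apriori sketch on hypothetical baskets: frequent 1-itemsets
# first, then candidate 2-itemsets built only from frequent items
# (the Apriori property: subsets of a frequent itemset must be frequent).
from itertools import combinations

transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B"}]
min_support = 3  # minimum number of transactions containing the itemset

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent1 = {i for i in items if count({i}) >= min_support}

# Candidate 2-itemsets, pruned by the same support threshold
frequent2 = {frozenset(p) for p in combinations(sorted(frequent1), 2)
             if count(set(p)) >= min_support}
```

The full algorithm repeats this generate-and-prune loop for triple itemsets and beyond until no candidates survive.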
Transaction Reduction
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that
stores quantitative information about frequent patterns in a
database. Each transaction is read and then mapped onto a path in
the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain
compact because their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the
database. The purpose of the FP tree is to mine the most frequent
pattern. Each node of the FP tree represents an item of the item set.
The root node represents null, while the lower nodes represent the
item sets. The associations of the nodes with the lower nodes, that
is, the item sets with the other item sets, are maintained while
forming the tree.
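A minimal sketch of this structure using nested dictionaries is shown below; a real FP-tree also maintains header links between nodes for mining, so this only illustrates the shared-prefix compression.

```python
# An FP-tree sketch: each transaction (items in a fixed order) is mapped
# onto a path from the root; transactions with a common prefix share nodes,
# which keeps the tree compact.

def insert(tree, items):
    node = tree
    for item in items:
        # Reuse the child node if the prefix already exists, else create it
        child = node.setdefault(item, {"count": 0, "children": {}})
        child["count"] += 1
        node = child["children"]

root = {}  # the root represents null
for transaction in [["A", "B"], ["A", "B", "C"], ["A", "D"]]:
    insert(root, transaction)
```

After these three inserts, the shared prefix "A" is stored once with a count of 3, and "A, B" once with a count of 2, showing how overlapping paths keep the tree compact.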
Apriori vs. FP Growth
Apriori generates frequent patterns by making itemsets using pairings
such as single itemsets, double itemsets, and triple itemsets; FP Growth
generates an FP-tree for making frequent patterns.
Apriori uses candidate generation, where frequent subsets are extended
one item at a time; FP Growth generates a conditional FP-tree for every
item in the data.
Since Apriori scans the database in each step, it becomes
time-consuming when the number of items is large; the FP-tree requires
only one database scan in its beginning steps, so it consumes less time.
Apriori saves a converted version of the database in memory; FP Growth
saves a set of conditional FP-trees for every item in memory.
OLAP stands for Online Analytical Processing. OLAP systems have the
capability to analyze database information of multiple systems at the
current time. The primary goal of OLAP Service is data analysis and
not data processing.
OLTP stands for Online Transaction Processing. OLTP has the work to
administer day-to-day transactions in any organization. The main goal
of OLTP is data processing not data analysis.
OLAP Examples
OLTP Examples
OLAP stands for Online Analytical Processing; OLTP stands for Online
Transaction Processing.
OLAP includes software tools that help in analyzing data, mainly for
business decisions; OLTP helps in managing online database modification.
OLAP holds old data from various databases; OLTP holds current
operational data.
In OLAP, the tables are not normalized; in OLTP, the tables are
normalized.
OLAP allows only read and hardly any write operations; OLTP allows both
read and write operations.
In OLAP, complex queries are involved; in OLTP, the queries are simple.
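The analytical side can be sketched as a small roll-up over made-up sales records, aggregating at a fine grain and then dropping a dimension, using only the standard library.

```python
# An OLAP-style aggregation sketch on hypothetical sales records:
# summarize at the (region, product) level, then roll up to region.
from collections import defaultdict

sales = [
    ("north", "tv", 100),
    ("north", "tv", 150),
    ("north", "radio", 50),
    ("south", "tv", 200),
]

by_region_product = defaultdict(int)
by_region = defaultdict(int)
for region, product, amount in sales:
    by_region_product[(region, product)] += amount
    by_region[region] += amount  # roll-up: drop the product dimension
```

OLAP servers pre-compute many such aggregates across all dimension combinations, which is what makes slice, dice, and roll-up queries fast.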
Advantages of ROLAP –
ROLAP is used to handle large amounts of data.
ROLAP tools don’t use pre-calculated data cubes.
Data can be stored efficiently.
ROLAP can leverage functionalities inherent in the relational
database.
Disadvantages of ROLAP –
Performance of ROLAP can be slow.
In ROLAP, it is difficult to maintain aggregate tables.
Limited by SQL functionalities.
Multidimensional Online Analytical Processing
(MOLAP) :
Advantages of MOLAP –
MOLAP is basically used for complex calculations.
MOLAP is optimal for operations such as slice and dice.
MOLAP allows the fastest indexing into the pre-computed
summarized data.
Disadvantages of MOLAP –
MOLAP can’t handle large amounts of data.
MOLAP requires additional investment.
Without re-aggregation, it is difficult to change dimensions.
Advantages of HOLAP –
HOLAP provides the functionalities of both MOLAP and
ROLAP.
HOLAP provides fast access at all levels of aggregation.
Disadvantages of HOLAP –
HOLAP architecture is very complex to understand because it
supports both MOLAP and ROLAP.
Difference between ROLAP, MOLAP and HOLAP :
Storage location for summary aggregation: ROLAP uses a relational
database; MOLAP uses a multidimensional database; HOLAP uses a
multidimensional database.
Storage space requirement: large in ROLAP as compared to MOLAP and
HOLAP; medium in MOLAP as compared to ROLAP and HOLAP; small in
HOLAP as compared to MOLAP and ROLAP.
Latency: low in ROLAP as compared to MOLAP and HOLAP; high in
MOLAP as compared to ROLAP and HOLAP; medium in HOLAP as
compared to MOLAP and ROLAP.
Query response time: slow in ROLAP as compared to MOLAP and
HOLAP; fast in MOLAP as compared to ROLAP and HOLAP; medium in
HOLAP as compared to MOLAP and ROLAP.