Big Data & Cloud Computing CM-502
Chapter 1
Overview of Data Mining
1.1. Define Data Mining
1.2. List types of Data Mining
1.3. List Advantages of Data Mining
1.4. List Disadvantages of Data Mining
1.5. List Applications of Data Mining
1.6. List Challenges of Implementation in Data mining
1.7. Evolution of Data Mining
1.8. List and explain Data Mining Techniques
1.9. Explain Data Mining Implementation Process
1.10. Explain Data Mining Architecture
1.11. Explain KDD- Knowledge Discovery in Databases of Data Mining
1.12. List and explain Data Mining tools
1.13. List Major Difference between Data mining and Machine learning
1.14. State the importance of Data Analytics
1.15. List and explain phases of Data Analytics
1.16. Differentiate between Data Mining and Data Analytics
1.17. List and explain types of Data mining techniques
1.18. Explain Text data mining
1.19. Differentiate between classification and clustering in data mining
1.1 Define Data Mining
Data mining is the process of analyzing large datasets to find hidden patterns, relationships, and useful information using techniques from statistics, machine learning, and database systems. It is the automatic or semi-automatic examination of large data sets to uncover meaningful patterns, trends, or insights that can help in decision-making.
Purpose: To extract meaningful insights from raw data.
Techniques Used: Machine learning, statistics, database systems.
Example: A retailer may use data mining to analyze customer purchase histories and predict future buying behavior, helping it target marketing more effectively. Data mining turns raw data into actionable knowledge.
1.2 Types of Data Mining
Data mining involves various techniques to extract useful patterns and knowledge from
large datasets. These techniques can be categorized into predictive and descriptive
mining. Predictive mining aims to predict future outcomes, while descriptive mining
focuses on understanding past patterns.
1 Classification
• Sorting data into different categories or groups.
• Example: Classifying emails as "spam" or "not spam".
2 Clustering
• Grouping similar data items together without pre-defined labels.
• Example: Grouping customers based on buying behavior.
3 Association Rule Mining
• Finding relationships between items in a dataset.
• Example: People who buy tea also often buy sugar.
4 Regression
• Predicting a value based on past data.
• Example: Predicting house prices based on size and location.
5 Anomaly Detection (Outlier Detection)
• Finding data that is different or unusual.
• Example: Detecting fraudulent transactions in a bank.
6 Prediction
• Using patterns in historical data to forecast future outcomes.
• Example: Predicting which customers are likely to cancel a subscription.
1.3 Advantages of Data Mining
Customer Insights: Identifies customer behavior and preferences for better marketing.
Cost Reduction: Optimizes operations by identifying inefficiencies and unnecessary costs.
Fraud Detection: Detects unusual patterns to prevent fraud in banking, insurance, etc.
Forecasting and Prediction: Predicts future trends, such as sales, market movements, or weather.
Improved Decision-Making: Helps businesses make data-driven decisions based on patterns and trends.
Personalization: Enables customized recommendations (e.g., in e-commerce or streaming platforms).
Market Analysis: Finds patterns in market data, helping businesses identify target customers and improve strategies.
Healthcare Improvements: Analyzes patient records to detect diseases early, improve treatments, and manage resources.
Handling Big Data: Data mining tools can work with huge amounts of data that humans can’t easily analyze.
Education Analytics: Used to monitor and improve student performance, attendance, and learning methods.
1.4 Disadvantages of Data Mining
Complexity – Data mining techniques can be complex and require specialized knowledge to implement and interpret.
Data Quality Issues – Inaccurate, incomplete, or outdated data can lead to misleading results.
High Cost – Tools, software, hardware, and skilled professionals required for data mining can be expensive.
Privacy Concerns – Collecting and analyzing personal data can lead to privacy violations if not handled responsibly.
Security Issues – Sensitive data can be exposed or misused if proper security measures are not in place.
Wrong or Misleading Results – If the data is incorrect or incomplete, the results can be wrong, leading to bad decisions.
Overfitting or Underfitting – If the mining model is not well designed, it may give patterns that are too specific or too general, and therefore not useful in real life.
Legal and Ethical Issues – Some countries have laws that restrict how data can be collected and used; violating them can lead to legal trouble.
Not Always Useful – Data mining may find patterns, but not all of them are meaningful or helpful for business decisions.
While data mining is powerful and useful, it must be used carefully, with proper data security, legal awareness, and good-quality data; otherwise, it can lead to serious problems.
1.5 Applications of Data Mining
Agriculture – Monitors crop health, predicts yields, and improves decision-making in precision farming.
Business and Marketing – Identifies customer behavior and buying patterns, and helps in customer segmentation, targeted advertising, and market basket analysis.
Banking and Finance – Detects fraud, assesses credit risk, manages customer accounts, and analyzes market trends.
Education – Tracks student performance, identifies at-risk students, and improves curriculum planning.
E-commerce – Powers recommendation systems (like “people also bought”), dynamic pricing, and customer behavior analysis.
Government – Supports crime analysis, resource planning, tax fraud detection, and policy-making.
Healthcare – Helps in disease prediction, patient diagnosis, treatment effectiveness analysis, and personalized medicine.
Manufacturing – Analyzes production data to identify defects, optimize processes, and predict equipment maintenance.
Telecommunications – Predicts which customers might leave so companies can retain them.
Real Estate – Predicts house prices and identifies good places to buy or sell.
Sports – Analyzes players’ performance to help coaches make good decisions.
Insurance – Checks whether claims are genuine and helps decide whom to insure.
Entertainment – Suggests movies, music, or shows you might like.
1.6 Challenges of Implementation in Data mining
Data Quality Issues – Poor, incomplete, or noisy data can lead to inaccurate results.
Data Integration – Combining data from multiple sources (structured, semi-structured, and
unstructured) can be technically difficult.
Data Preprocessing Requirements – Cleaning, transforming, and preparing data for mining
can be time-consuming.
Privacy and Security – Protecting sensitive information and complying with data protection
laws (e.g., GDPR) is a major concern.
Scalability – Processing and analyzing vast volumes of data requires high-performance
computing and efficient algorithms.
Selection of the Right Algorithm – Choosing the appropriate data mining technique depends
on the data and business goals, which can be challenging.
Real-Time Processing – Extracting insights in real-time from streaming data is complex and
resource-demanding.
Overfitting and Underfitting – Poorly tuned models may not generalize well to unseen data.
1.7 Evolution of Data Mining
Data Collection (1960s–1970s)
• Focused on the development of data storage technologies.
• Data was collected manually or through early computer systems.
Data Access (1980s)
• Introduction of Relational Database Management Systems (RDBMS).
• Query languages like SQL allowed easier access to data.
• Focus was on data retrieval, not analysis.
Data Warehousing and OLAP (1990s)
• Emergence of Data Warehousing to integrate data from multiple sources.
• Online Analytical Processing (OLAP) enabled multi-dimensional data analysis.
Early Data Mining (Mid-1990s)
• The term "Data Mining" became popular.
• Use of machine learning, statistics, and pattern recognition began.
• Tools like decision trees, clustering, and association rules were used.
Advanced Data Mining (2000s)
• Growth of more sophisticated algorithms, including neural networks and support
vector machines.
• Use in industries like finance, healthcare, marketing, and telecom.
Big Data Era (2010s)
• Explosion of data from the internet, mobile, and IoT devices.
• Emergence of Hadoop, Spark, and NoSQL databases for large-scale processing.
AI and Deep Learning Integration (Late 2010s–2020s)
• Data mining merged with AI, deep learning, and natural language processing.
• Real-time data mining and predictive analytics became mainstream.
• Applications expanded to self-driving cars, voice assistants, and recommendation
engines.
Modern Data Mining (Present & Ongoing)
• Use of cloud-based platforms (e.g., AWS, Azure, Google Cloud).
• Automated Machine Learning (AutoML) tools for ease of use.
• Focus on ethical AI, data privacy, and explainable AI (XAI).
• Continued growth in fields like personalized medicine, smart cities, and climate
modeling.
1.8 Data Mining Techniques
Classification
Classification is a supervised learning technique used to assign items in a dataset to predefined
categories or classes. It works based on a training dataset with known labels.
Example: Classifying whether a loan applicant is “high risk” or “low risk” based on features
like income, age, and credit score.
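As a hedged sketch of the idea (the toy data and the `fit_threshold`/`classify` helpers are invented for illustration; real systems use models such as decision trees over many features), a classifier can be learned from labeled examples in a few lines of Python:

```python
# Toy classification: learn one credit-score threshold from labeled
# applicants, then classify a new applicant (illustrative only).
train = [(720, "low risk"), (680, "low risk"), (550, "high risk"), (600, "high risk")]

def fit_threshold(data):
    # Try each training score as a cut-off; keep the one with fewest errors.
    best_cut, best_err = None, len(data) + 1
    for cut, _ in data:
        err = sum((score >= cut) != (label == "low risk") for score, label in data)
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut

def classify(score, cut):
    return "low risk" if score >= cut else "high risk"

cut = fit_threshold(train)
print(classify(640, cut))   # a new applicant with score 640 -> "high risk"
```

The essential point is the supervised setup: the labels in `train` drive the learning; only the model that separates them differs between algorithms.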
Clustering
Clustering is an unsupervised learning technique used to group similar data items together.
Unlike classification, there are no predefined labels. The algorithm finds natural groupings
within the data.
Example: Grouping customers into segments based on their shopping behavior for targeted
marketing.
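A minimal k-means sketch in pure Python (one feature, two clusters; the spend figures are made up for illustration, and a robust implementation would guard against a cluster becoming empty) shows how natural groups emerge without any labels:

```python
# Toy k-means (k=2) on one feature: customers' monthly spend.
def kmeans(points, c1, c2, iters=10):
    for _ in range(iters):
        # Assign each point to its nearer centre, then move both centres.
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return c1, c2

spend = [20, 25, 30, 200, 220, 210]
print(kmeans(spend, 20, 200))   # two cluster centres: (25.0, 210.0)
```

The two centres that come back correspond to a "low spender" and a "high spender" segment, discovered purely from the data.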
Association Rule Mining
This technique is used to discover interesting relationships or associations among items in large
datasets. It finds rules like "If item A is bought, item B is likely to be bought."
Example: Market Basket Analysis where buying milk is often followed by buying bread.
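The two standard measures behind such rules, support and confidence, can be computed directly; the baskets below are invented for this example:

```python
# Support and confidence of the candidate rule "tea -> sugar".
baskets = [{"tea", "sugar"}, {"tea", "sugar", "milk"}, {"tea"}, {"bread", "sugar"}]

def rule_stats(antecedent, consequent, baskets):
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    support = both / len(baskets)   # fraction of baskets with both items
    confidence = both / ante        # how often the rule holds when tea is bought
    return support, confidence

print(rule_stats("tea", "sugar", baskets))   # (0.5, 0.666...)
```

Algorithms such as Apriori are essentially efficient ways of finding all rules whose support and confidence exceed chosen thresholds.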
Regression
Regression is a technique used to predict continuous numeric values based on the relationship
between variables. It is a form of supervised learning.
Example: Predicting house prices based on size, location, and number of rooms.
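The idea can be sketched with ordinary least squares on made-up numbers (one feature only; real models use libraries such as scikit-learn and many features):

```python
# Fit price = slope * size + intercept by least squares (toy data).
sizes = [50, 70, 90, 110]        # square metres
prices = [100, 140, 180, 220]    # price in thousands

n = len(sizes)
mx, my = sum(sizes) / n, sum(prices) / n
slope = sum((x - mx) * (y - my) for x, y in zip(sizes, prices)) / sum(
    (x - mx) ** 2 for x in sizes
)
intercept = my - slope * mx
print(slope * 100 + intercept)   # predicted price for a 100 m^2 house: 200.0
```

Because the toy data lie exactly on a line (price = 2 × size), the fit recovers slope 2 and intercept 0; real data scatter around the fitted line.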
Prediction
Prediction is used to forecast future outcomes using historical data. It can involve both
classification (categorical prediction) and regression (numerical prediction).
Example: Predicting next month's product demand based on past sales data.
Outlier Detection (Anomaly Detection)
This technique identifies rare or abnormal data points that do not follow the general pattern in
the dataset. These unusual patterns may indicate fraud, errors, or significant events.
Example: Detecting fraudulent transactions in a credit card system.
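A common baseline is the z-score rule: flag points that lie more than a few standard deviations from the mean. The amounts below are invented; real fraud systems combine many such signals:

```python
# Flag transactions whose z-score exceeds 2 (toy data).
import statistics

amounts = [40, 55, 60, 45, 50, 900]          # one suspicious amount
mu = statistics.mean(amounts)
sd = statistics.pstdev(amounts)              # population standard deviation
outliers = [a for a in amounts if abs(a - mu) / sd > 2]
print(outliers)   # [900]
```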
Summarization
Summarization involves creating a compact representation of the data set, such as through
descriptive statistics or data visualization. It helps in understanding the overall structure of the
data.
Example: Generating reports showing average sales per region or most popular product
categories.
Sequential Pattern Mining
This technique finds regular sequences or patterns in ordered data over time. It is useful in
identifying trends or sequences of behavior.
Example: In retail, discovering that customers who buy a laptop often return later to buy a
mouse, then a printer.
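A minimal sketch of the idea (toy purchase histories, invented for illustration): count how often one item is later followed by another, which is the raw frequency that sequential pattern algorithms generalize.

```python
# Count customers whose history contains "laptop" followed later by "mouse".
histories = [
    ["laptop", "mouse", "printer"],
    ["mouse", "laptop"],
    ["laptop", "bag", "mouse"],
]

def contains_sequence(history, first, second):
    # True only if `second` appears somewhere AFTER `first` in the ordering.
    if first in history:
        return second in history[history.index(first) + 1:]
    return False

count = sum(contains_sequence(h, "laptop", "mouse") for h in histories)
print(count)   # 2 of 3 customers
```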
Decision Tree
A tree-like model used to make decisions or predictions by splitting data based on features.
Example: Loan approval systems.
Neural Networks
Computational models inspired by the human brain; useful for complex pattern recognition.
Example: Image and speech recognition.
K-Nearest Neighbors (K-NN)
Classifies new data based on the majority label of its nearest neighbors.
Example: Recommender systems.
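The K-NN rule is simple enough to sketch directly; here k = 1 on toy 2-D points (real implementations, e.g. scikit-learn's `KNeighborsClassifier`, handle larger k and datasets):

```python
# 1-NN: label a query point by its single nearest labelled neighbour.
points = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]

def nearest_label(query, points):
    def dist2(item):
        # Squared Euclidean distance from the query to a labelled point.
        (x, y), _ = item
        return (x - query[0]) ** 2 + (y - query[1]) ** 2
    return min(points, key=dist2)[1]

print(nearest_label((2, 2), points))   # "A"
```

With k > 1 the only change is taking a majority vote over the k closest points instead of the single nearest one.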
1.9 Data Mining Implementation Process
A structured overview of the data mining implementation process, as typically followed in real-world applications:
Business Understanding
• Define the project goals from a business perspective.
• Understand the problem you want to solve with data mining.
• Set objectives, success criteria, and scope.
Data Understanding
• Collect initial data from various sources.
• Explore the data to identify quality issues, patterns, and relationships.
• Assess data relevance and usefulness for the task.
Data Preparation
• Clean the data: handle missing values, remove duplicates, correct errors.
• Transform data: normalize, encode, or aggregate as needed.
• Select relevant attributes/features for analysis.
• Format the data into a structure suitable for mining.
Data Mining
• Choose appropriate data mining techniques (e.g., classification, clustering, regression).
• Apply algorithms to extract patterns, trends, or models from the data.
• Fine-tune parameters for optimal performance.
Evaluation
• Assess the model’s performance using metrics like accuracy, precision, recall, etc.
• Validate results against business objectives.
• Determine if the findings are valid and actionable.
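The metrics named above can be computed from confusion counts; the predicted and actual labels below are hypothetical (1 marks the positive class):

```python
# Accuracy, precision, and recall from toy predictions.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))   # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))   # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))   # false negatives

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp)   # of the flagged positives, how many were right
recall = tp / (tp + fn)      # of the real positives, how many were found
print(accuracy, precision, recall)   # 0.75 0.75 0.75
```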
Deployment
• Implement the data mining model into a real-world environment (e.g., software,
dashboard, report).
• Integrate insights into business processes or decision-making tools.
• Provide training or documentation for end-users if needed.
Monitoring and Maintenance
• Continuously track the model’s performance.
• Update or retrain the model as data and business needs evolve.
• Ensure data quality and relevance are maintained over time.
1.10. Data Mining Architecture
Data mining architecture refers to the structure or design that supports the processes and
components involved in data mining. It outlines how data is collected, processed, and mined to
extract valuable patterns and knowledge.
The major components of a data mining system architecture are as follows:
Data Mining Architecture
Database, Data Warehouse or Other Information Repository: This is a single or set of
databases, data warehouses, spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.
Database or Data Warehouse Server: Fetches the relevant data from the database or data warehouse according to the user's data mining request.
Knowledge Base: The domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. It is typically stored as a set of rules.
Data Mining Engine: Performs the data mining tasks such as characterization, association, classification, prediction, cluster analysis, etc.
Pattern Evaluation Module: Responsible for finding interesting patterns in the data, usually using a threshold value. It interacts with the data mining engine to focus the search on interesting patterns.
Graphical User Interface: Enables communication between the user and the data mining system, allowing users to browse database or data warehouse schemas and to specify a data mining query or task.
1.11. KDD- Knowledge Discovery in Databases of Data Mining
KDD (Knowledge Discovery in Databases) is the overall process of discovering useful
knowledge or patterns from large volumes of data. It is a broader term than data mining, as
data mining is just one step within the KDD process.
KDD is a comprehensive, multi-step process for turning raw data into meaningful knowledge.
Data mining is at the heart of this process but cannot function effectively without the steps that
precede and follow it.
KDD is the process of identifying valid, novel, potentially useful, and understandable patterns
in data.
KDD (Knowledge Discovery in Databases)
Data Selection
• Purpose: Identify and retrieve relevant data from multiple sources.
• Example: Selecting customer transaction data from a retail database.
Data Preprocessing (Cleaning)
• Purpose: Remove noise and handle missing values.
• Example: Filling missing values, removing duplicate records, correcting data entry
errors.
Data Transformation
• Purpose: Convert data into suitable format for mining.
• Processes: Normalization, aggregation, feature selection.
• Example: Scaling numerical data to a uniform range.
Data Mining
• Core Step: Apply algorithms to discover patterns and relationships.
• Techniques: Classification, clustering, association rules, regression.
• Example: Using decision trees to classify customer behavior.
Pattern Evaluation and Knowledge Presentation
• Purpose: Evaluate the mined patterns and present only the most useful ones.
• Methods: Use measures like support, confidence, lift.
• Example: Visualizing association rules with graphs or dashboards.
1.12. Data Mining tools
Data mining tools are software applications that help in discovering patterns, correlations, and
insights from large datasets. These tools use techniques like machine learning, statistical
analysis, and database systems to extract useful knowledge.
A list of commonly used data mining tools, along with brief explanations:
RapidMiner
• Type: Open-source (with commercial version)
• Features:
o GUI-based interface
o Supports data preprocessing, visualization, modeling, and evaluation
o Integrates with R and Python
• Use Case: Predictive analytics, sentiment analysis, fraud detection
Weka (Waikato Environment for Knowledge Analysis)
• Type: Open-source
• Features:
o Collection of machine learning algorithms
o GUI and command-line interface
o Good for educational and research purposes
• Use Case: Classification, regression, clustering, association rule mining
KNIME (Konstanz Information Miner)
• Type: Open-source
• Features:
o Drag-and-drop interface
o Integration with R, Python, SQL, and big data platforms
o Visual workflow builder
• Use Case: Data preparation, machine learning, business intelligence
Orange
• Type: Open-source
• Features:
o Visual programming for data analysis
o Interactive data visualization
o Add-ons for text mining, bioinformatics, etc.
• Use Case: Educational purposes, visual data exploration
SAS Enterprise Miner
• Type: Commercial
• Features:
o Powerful analytics and data mining platform
o Integrates with SAS programming
o High-performance predictive modeling
• Use Case: Enterprise-level data mining, predictive modeling
Apache Mahout
• Type: Open-source (part of Apache Software Foundation)
• Features:
o Scalable machine learning algorithms
o Built to work with Hadoop ecosystem
o Good for big data analytics
• Use Case: Recommender systems, classification, clustering on large datasets
R and Python (with Libraries)
• Type: Open-source programming languages
• Features:
o Extensive libraries for data mining (e.g., scikit-learn, caret, mlr)
o Customizable and flexible
o Strong community support
• Use Case: Custom data mining tasks, research, machine learning pipelines
IBM SPSS Modeler
• Type: Commercial
• Features:
o User-friendly GUI
o Designed for statistical analysis and predictive modeling
o Supports text analytics and geospatial analytics
• Use Case: Market research, healthcare analytics, academic research
The choice of a data mining tool depends on your needs—whether you prioritize ease of use,
flexibility, big data support, or integration with other systems. For beginners, tools like Weka
or Orange are ideal. For large-scale or professional use, RapidMiner, KNIME, and SAS are
preferred.
1.13. List Major Difference between Data mining and Machine
learning
Although Data Mining and Machine Learning are closely related and often used together,
they have distinct goals, methods, and applications. Here's a clear comparison:
| Aspect | Data Mining | Machine Learning |
|---|---|---|
| Definition | Process of discovering patterns and knowledge from large datasets. | Field of study that gives computers the ability to learn from data without being explicitly programmed. |
| Goal | Extract useful information or patterns from data. | Enable systems to make predictions or decisions. |
| Approach | Uses predefined algorithms to analyze data and find patterns. | Builds models that learn from data to improve over time. |
| Focus | Knowledge discovery and pattern extraction. | Prediction, classification, and optimization. |
| Dependency | Relies more on human interpretation of results. | Can operate autonomously with minimal human input. |
| Techniques Used | Association rules, clustering, decision trees. | Supervised learning, unsupervised learning, reinforcement learning. |
| Output | Human-interpretable patterns, rules, and trends. | Predictive models that can be used in real time. |
| Examples | Market basket analysis, fraud detection, customer segmentation. | Spam filtering, speech recognition, image classification. |
| Automation | Less automated; involves significant data preparation and interpretation. | Highly automated; systems improve as more data is used. |
| Used In | Data analysis and business intelligence. | AI systems, robotics, recommendation engines. |
1.14. State the importance of Data Analytics
Data Analytics is the science of analyzing raw data to make informed decisions. It plays a vital
role in modern organizations by turning data into actionable insights.
Informed Decision-Making
• Benefit: Helps businesses make data-driven decisions rather than relying on intuition.
• Example: Analyzing customer behavior to optimize marketing strategies.
Improved Operational Efficiency
• Benefit: Identifies bottlenecks, waste, and areas for improvement in business
processes.
• Example: Logistics companies using analytics to optimize delivery routes.
Better Customer Insights
• Benefit: Understand customer preferences, habits, and feedback.
• Example: E-commerce platforms using analytics to personalize product
recommendations.
Competitive Advantage
• Benefit: Organizations using analytics can outperform competitors by responding
faster to trends.
• Example: Retailers tracking market trends to stock trending items before competitors.
Risk Management
• Benefit: Helps in identifying potential risks and fraud.
• Example: Banks analyzing transaction patterns to detect fraudulent activities.
Cost Reduction
• Benefit: Pinpoints areas where cost savings can be made.
• Example: Manufacturing units using data to minimize downtime and reduce
maintenance costs.
Innovation and Product Development
• Benefit: Enables companies to create new products or improve existing ones based on
data.
• Example: Tech companies analyzing user data to enhance app features.
Monitoring Business Performance
• Benefit: Tracks KPIs (Key Performance Indicators) and business health in real-time.
• Example: Dashboards showing real-time sales, inventory, and customer service
metrics.
Data Analytics is essential in today’s data-driven world. It helps organizations across all industries to
make smarter decisions, reduce costs, improve customer satisfaction, and stay ahead of the competition.
1.15. List and explain phases of Data Analytics
Data Analytics is a structured process that involves multiple phases to turn raw data into
actionable insights. Each phase has its own objectives, tasks, and tools.
Phases of Data Analytics
Data Requirement Gathering
• Purpose: Understand what kind of data is needed and why.
• Activities:
o Define business objectives
o Identify key metrics and KPIs
o Determine data sources (internal or external)
• Example: A retail business wants to know why sales dropped last quarter.
Data Collection
• Purpose: Gather the necessary data from various sources.
• Sources May Include:
o Databases
o Web logs
o Social media
o Surveys and sensors
• Tools: SQL, APIs, data scraping tools
• Example: Collecting transaction data, customer feedback, and web traffic logs.
Data Cleaning (Data Preprocessing)
• Purpose: Ensure data is accurate, consistent, and usable.
• Activities:
o Remove duplicates and outliers
o Handle missing values
o Correct data types and formats
• Tools: Python (Pandas), Excel, OpenRefine
• Example: Fixing missing customer ages or formatting inconsistent date entries.
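The cleaning steps above can be sketched in plain Python (the toy records are invented here; in practice Pandas' `drop_duplicates` and `fillna` do the same job):

```python
# Deduplicate by id, then fill a missing age with the mean of known ages.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 1, "age": 34},     # duplicate record
]

seen, clean = set(), []
for r in rows:
    if r["id"] not in seen:
        seen.add(r["id"])
        clean.append(dict(r))

known = [r["age"] for r in clean if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in clean:
    if r["age"] is None:
        r["age"] = mean_age
print(clean)   # [{'id': 1, 'age': 34}, {'id': 2, 'age': 34.0}]
```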
Data Exploration and Analysis
• Purpose: Understand the structure and content of the data.
• Activities:
o Exploratory Data Analysis (EDA)
o Descriptive statistics
o Data visualization
• Tools: Python (Matplotlib, Seaborn), R, Tableau, Power BI
• Example: Finding which products have the highest return rates.
Data Modeling
• Purpose: Build models that explain or predict patterns in the data.
• Techniques:
o Regression
o Classification
o Clustering
• Tools: Python (scikit-learn), R, SAS, RapidMiner
• Example: Predicting customer churn using logistic regression.
Data Interpretation and Communication
• Purpose: Translate analytical findings into insights and actions.
• Activities:
o Interpret model results
o Create dashboards and reports
o Present findings to stakeholders
• Tools: Tableau, Power BI, Excel, presentations
• Example: Presenting why customer engagement increased after a new campaign.
Decision Making and Action
• Purpose: Use insights to make strategic business decisions.
• Activities:
o Implement data-driven changes
o Monitor the impact of decisions
• Example: Changing a marketing strategy based on low conversion data.
The data analytics process is a systematic journey from identifying the problem to making
informed decisions. Each phase builds upon the previous one to ensure insights are both
accurate and actionable.
1.16. Differentiate between Data Mining and Data Analytics
Although Data Mining and Data Analytics are closely related and often overlap, they serve different purposes in the field of data science.
• Data Mining = "Find the hidden patterns"
• Data Analytics = "Understand and explain the data to make decisions"
| Aspect | Data Mining | Data Analytics |
|---|---|---|
| Definition | The process of discovering hidden patterns and relationships in large datasets. | The process of examining, organizing, and interpreting data to support decision-making. |
| Purpose | To uncover unknown patterns, associations, and trends. | To derive meaningful insights and support business decisions. |
| Focus | Pattern discovery and knowledge extraction. | Problem-solving, forecasting, and performance evaluation. |
| Process Type | Typically a sub-process within the KDD (Knowledge Discovery in Databases) process. | A broader process that includes data mining, visualization, and reporting. |
| Techniques Used | Clustering, association rules, classification, anomaly detection. | Descriptive, diagnostic, predictive, and prescriptive analytics. |
| Output | Patterns, rules, relationships (often not directly interpretable). | Actionable insights, summaries, dashboards, and reports. |
| Tools | Weka, RapidMiner, SAS Enterprise Miner. | Excel, Tableau, Power BI, R, Python (Pandas, NumPy). |
| Example Use Case | Discovering that people who buy bread also often buy butter. | Analyzing monthly sales data to determine declining revenue trends. |
Both are essential in the data science pipeline, but they address different types of questions and
problems.
1.17. List and explain types of Data mining techniques
Types of Data Mining Techniques
Data mining techniques are methods used to analyze large datasets and extract useful patterns,
relationships, or insights. These techniques can be broadly classified into different types based
on their purpose and the kind of knowledge they discover.
Classification
• Purpose: Assign data items to predefined categories or classes.
• How it Works: Uses labeled data to build a model that can predict the class of new
data.
• Example Algorithms: Decision Trees, Naive Bayes, Support Vector Machines (SVM).
• Use Case: Email spam detection, credit risk assessment
Clustering
• Purpose: Group similar data items into clusters without predefined classes.
• How it Works: Identifies natural groupings based on feature similarity.
• Example Algorithms: K-Means, Hierarchical Clustering, DBSCAN.
• Use Case: Customer segmentation, image segmentation.
Association Rule Mining
• Purpose: Discover interesting relationships or associations between variables.
• How it Works: Finds frequent itemsets and generates rules that explain how items are
associated.
• Example Algorithm: Apriori, FP-Growth.
• Use Case: Market basket analysis (e.g., customers buying bread also buy butter).
Regression
• Purpose: Predict a continuous numeric value based on input variables.
• How it Works: Models the relationship between dependent and independent variables.
• Example Algorithms: Linear Regression, Polynomial Regression.
• Use Case: Predicting sales, stock prices, or temperature.
Anomaly Detection (Outlier Detection)
• Purpose: Identify rare or unusual data points that do not conform to expected patterns.
• How it Works: Models normal behavior and flags deviations.
• Use Case: Fraud detection, network security, fault diagnosis.
Summarization
• Purpose: Provide a compact representation of the data.
• How it Works: Generates a summary or abstraction (e.g., statistical measures).
• Use Case: Generating reports, data overview.
Sequential Pattern Mining
• Purpose: Discover sequential patterns or trends over time.
• How it Works: Finds frequent sequences in ordered data.
• Use Case: Customer purchase sequences, web clickstream analysis.
Choosing the right data mining technique depends on the nature of the problem, the type of
data, and the desired outcome. Often, multiple techniques are combined for better insights.
1.18. Explain Text data mining
What is Text Data Mining?
Text Data Mining (also called Text Mining or Text Analytics) is the process of extracting
meaningful information and patterns from unstructured text data. Unlike structured data (like
numbers or tables), text data is typically in natural language form (documents, emails, social
media posts, etc.), which requires special techniques to analyze.
Why Text Data Mining?
• Most of the data generated today is unstructured text (emails, articles, social media,
customer reviews).
• Extracting useful knowledge from this vast amount of text helps organizations gain
insights that traditional data mining on structured data can’t provide.
Key Steps in Text Data Mining:
1. Text Preprocessing
o Cleaning text by removing stopwords (common words like “the,” “and”)
o Tokenization (splitting text into words or phrases)
o Stemming and Lemmatization (reducing words to their root forms)
o Removing punctuation, numbers, and irrelevant characters
2. Text Representation
o Converting text into structured format (e.g., vectors)
o Techniques: Bag of Words, TF-IDF (Term Frequency-Inverse Document
Frequency), Word Embeddings (Word2Vec, GloVe)
3. Pattern Discovery
o Applying mining techniques like classification, clustering, and association rule
mining on text data
o Examples: Topic modeling, sentiment analysis, named entity recognition
4. Evaluation and Interpretation
o Assessing the quality and relevance of extracted patterns
o Presenting insights in understandable formats (reports, dashboards)
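The preprocessing and representation steps above can be sketched in plain Python (two toy "documents" and a tiny stopword set, invented for illustration; libraries such as NLTK and scikit-learn provide production versions):

```python
# Tokenize, drop stopwords, then weight terms by TF-IDF.
import math

docs = ["the cat sat on the mat", "the dog barked at the cat"]
stopwords = {"the", "on", "at"}
tokens = [[w for w in d.split() if w not in stopwords] for d in docs]

def tf_idf(term, doc_tokens, all_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency in this doc
    df = sum(1 for t in all_tokens if term in t)    # documents containing the term
    idf = math.log(len(all_tokens) / df)            # rarity across the collection
    return tf * idf

print(tf_idf("mat", tokens[0], tokens))   # distinctive term -> positive weight
print(tf_idf("cat", tokens[0], tokens))   # appears in every doc -> 0.0
```

Terms that appear in every document get an IDF of zero, which is exactly why TF-IDF highlights words that distinguish one document from the rest.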
Common Applications of Text Data Mining:
• Sentiment Analysis: Determining the sentiment (positive, negative, neutral) from
customer reviews or social media.
• Topic Modeling: Discovering the main themes or topics from large document
collections.
• Spam Detection: Classifying emails or messages as spam or not spam.
• Information Extraction: Identifying specific facts or entities (like names, dates) from
text.
• Customer Feedback Analysis: Analyzing surveys, support tickets, and feedback for
improving products/services.
Challenges in Text Data Mining:
• Ambiguity and Polysemy: Words can have multiple meanings depending on context.
• Unstructured Nature: Text data lacks a fixed schema, making analysis complex.
• Language Variability: Different languages, slang, abbreviations.
• High Dimensionality: Text converted to vectors can have very large dimensions.
Summary:
Text Data Mining transforms unstructured text into structured knowledge by cleaning,
representing, and analyzing text. It unlocks valuable insights from vast textual information,
supporting decision-making in business, healthcare, social media, and more.
1.19. Differentiate between classification and clustering in data
mining
| Aspect | Classification | Clustering |
|---|---|---|
| Definition | Assigns data items to predefined classes or categories. | Groups similar data items into clusters without predefined labels. |
| Type of Learning | Supervised learning (requires labeled data). | Unsupervised learning (no labeled data). |
| Goal | Predict the class label of new data points. | Discover natural groupings or structures in the data. |
| Input Data | Labeled dataset with known categories. | Unlabeled dataset without class information. |
| Output | Class labels for each data instance. | Clusters (groups) of similar instances. |
| Examples of Algorithms | Decision Trees, Naive Bayes, Support Vector Machines (SVM). | K-Means, Hierarchical Clustering, DBSCAN. |
| Application Examples | Spam email detection, loan approval, medical diagnosis. | Customer segmentation, image segmentation, market basket analysis. |
| Evaluation | Accuracy is measured by comparing predicted and actual labels. | Evaluated by measures like cohesion, separation, silhouette score. |
| Nature of Results | Provides explicit classification rules or models. | Provides groups with shared characteristics but no explicit labels. |
• Classification predicts known categories using labelled data (supervised).
• Clustering discovers unknown groups in unlabelled data (unsupervised).
Both techniques help in understanding and organizing data but serve different purposes
depending on the availability of labelled data and the problem being solved.