Big Data & Cloud Computing CM-502
Chapter 1
Overview of Data Mining
1.1. Define Data Mining
1.2. List types of Data Mining
1.3. List Advantages of Data Mining
1.4. List Disadvantages of Data Mining
1.5. List Applications of Data Mining
1.6. List Challenges of Implementation in Data mining
1.7. Evolution of Data Mining
1.8. List and explain Data Mining Techniques
1.9. Explain Data Mining Implementation Process
1.10. Explain Data Mining Architecture
1.11. Explain KDD- Knowledge Discovery in Databases of Data Mining
1.12. List and explain Data Mining tools
1.13. List Major Difference between Data mining and Machine learning
1.14. State the importance of Data Analytics
1.15. List and explain phases of Data Analytics
1.16. Differentiate between Data Mining and Data Analytics
1.17. List and explain types of Data mining techniques
1.18. Explain Text data mining
1.19. Differentiate between classification and clustering in data mining
1.1 Define Data Mining
Data mining is the process of analyzing large datasets to find hidden patterns, relationships, and useful information using techniques from statistics, machine learning, and database systems. It is the automatic or semi-automatic examination of large data sets to uncover meaningful patterns, trends, or insights that can help in decision-making.
Purpose: To extract meaningful insights from raw data.
Techniques Used: Machine learning, statistics, database systems.
Example: A retailer may use data mining to analyze customer purchase histories and predict future buying behavior, helping it target marketing more effectively. Data mining turns raw data into actionable knowledge.
1.2 Types of Data Mining
Data mining involves various techniques to extract useful patterns and knowledge from
large datasets. These techniques can be categorized into predictive and descriptive
mining. Predictive mining aims to predict future outcomes, while descriptive mining
focuses on understanding past patterns.
1 Classification
• Sorting data into different categories or groups.
• Example: Classifying emails as "spam" or "not spam".
2 Clustering
• Grouping similar data items together without pre-defined labels.
• Example: Grouping customers based on buying behavior.
3 Association Rule Mining
• Finding relationships between items in a dataset.
• Example: People who buy tea also often buy sugar.
4 Regression
• Predicting a value based on past data.
• Example: Predicting house prices based on size and location.
5 Anomaly Detection (Outlier Detection)
• Finding data that is different or unusual.
• Example: Detecting fraudulent transactions in a bank.
6 Prediction
• Using patterns in historical data to forecast future outcomes.
• Example: Predicting which customers are likely to cancel a subscription.
1.3 Advantages of Data Mining
Customer Insights: Identifies customer behavior and preferences for better marketing.
Cost Reduction: Optimizes operations by identifying inefficiencies and unnecessary costs.
Fraud Detection: Detects unusual patterns to prevent fraud in banking, insurance, etc.
Forecasting and Prediction: Predicts future trends, such as sales, market movements, or weather.
Improved Decision-Making: Helps businesses make data-driven decisions based on patterns and trends.
Personalization: Enables customized recommendations (e.g., in e-commerce or streaming platforms).
Market Analysis: Finds patterns in market data, helping businesses identify target customers and improve strategies.
Healthcare Improvements: Analyzes patient records to detect diseases early, improve treatments, and manage resources.
Handling Big Data: Data mining tools can work with huge amounts of data that humans can’t easily analyze.
Education Analytics: Used to monitor and improve student performance, attendance, and learning methods.
1.4 Disadvantages of Data Mining
Complexity – Data mining techniques can be complex and require specialized knowledge to implement and interpret.
Data Quality Issues – Inaccurate, incomplete, or outdated data can lead to misleading results.
High Cost – Tools, software, hardware, and skilled professionals required for data mining can be expensive.
Privacy Concerns – Collecting and analyzing personal data can lead to privacy violations if not handled responsibly.
Security Issues – Sensitive data can be exposed or misused if proper security measures are not in place.
Wrong or Misleading Results – If the data is incorrect or incomplete, the results can be wrong, leading to bad decisions.
Overfitting or Underfitting – If the mining model is not well designed, it may give patterns that are too specific or too general, and therefore not useful in real life.
Legal and Ethical Issues – Some countries have laws that restrict how data can be collected and used; violating them can lead to legal trouble.
Not Always Useful – Data mining may find patterns, but not all of them are meaningful or helpful for business decisions.
While data mining is powerful and useful, it must be used carefully, with proper data security, legal awareness, and good-quality data; otherwise, it can lead to serious problems.
1.5 Applications of Data Mining
Agriculture – Monitors crop health, predicts yields, and improves decision-making in precision farming.
Business and Marketing – Identifies customer behavior and buying patterns, and helps in customer segmentation, targeted advertising, and market basket analysis.
Banking and Finance – Detects fraud, assesses credit risk, manages customer accounts, and analyzes market trends.
Education – Tracks student performance, identifies at-risk students, and improves curriculum planning.
E-commerce – Powers recommendation systems (like “people also bought”), dynamic pricing, and customer behavior analysis.
Government – Supports crime analysis, resource planning, tax fraud detection, and policy-making.
Healthcare – Helps in disease prediction, patient diagnosis, treatment effectiveness analysis, and personalized medicine.
Manufacturing – Analyzes production data to identify defects, optimize processes, and predict equipment maintenance.
Telecommunications – Predicts which customers might leave so companies can retain them.
Real Estate – Predicts house prices and identifies good places to buy or sell.
Sports – Analyzes players’ performance to help coaches make good decisions.
Insurance – Checks whether claims are genuine and helps decide whom to insure.
Entertainment – Suggests movies, music, or shows you might like.
1.6 Challenges of Implementation in Data mining
Data Quality Issues – Poor, incomplete, or noisy data can lead to inaccurate results.
Data Integration – Combining data from multiple sources (structured, semi-structured, and
unstructured) can be technically difficult.
Data Preprocessing Requirements – Cleaning, transforming, and preparing data for mining
can be time-consuming.
Privacy and Security – Protecting sensitive information and complying with data protection
laws (e.g., GDPR) is a major concern.
Scalability – Processing and analyzing vast volumes of data requires high-performance
computing and efficient algorithms.
Selection of the Right Algorithm – Choosing the appropriate data mining technique depends
on the data and business goals, which can be challenging.
Real-Time Processing – Extracting insights in real-time from streaming data is complex and
resource-demanding.
Overfitting and Underfitting – Poorly tuned models may not generalize well to unseen data.
1.7 Evolution of Data Mining
Data Collection (1960s–1970s)
• Focused on the development of data storage technologies.
• Data was collected manually or through early computer systems.
Data Access (1980s)
• Introduction of Relational Database Management Systems (RDBMS).
• Query languages like SQL allowed easier access to data.
• Focus was on data retrieval, not analysis.
Data Warehousing and OLAP (1990s)
• Emergence of Data Warehousing to integrate data from multiple sources.
• Online Analytical Processing (OLAP) enabled multi-dimensional data analysis.
Early Data Mining (Mid-1990s)
• The term "Data Mining" became popular.
• Use of machine learning, statistics, and pattern recognition began.
• Tools like decision trees, clustering, and association rules were used.
Advanced Data Mining (2000s)
• Growth of more sophisticated algorithms, including neural networks and support
vector machines.
• Use in industries like finance, healthcare, marketing, and telecom.
Big Data Era (2010s)
• Explosion of data from the internet, mobile, and IoT devices.
• Emergence of Hadoop, Spark, and NoSQL databases for large-scale processing.
AI and Deep Learning Integration (Late 2010s–2020s)
• Data mining merged with AI, deep learning, and natural language processing.
• Real-time data mining and predictive analytics became mainstream.
• Applications expanded to self-driving cars, voice assistants, and recommendation
engines.
Modern Data Mining (Present & Ongoing)
• Use of cloud-based platforms (e.g., AWS, Azure, Google Cloud).
• Automated Machine Learning (AutoML) tools for ease of use.
• Focus on ethical AI, data privacy, and explainable AI (XAI).
• Continued growth in fields like personalized medicine, smart cities, and climate
modeling.
1.8 Data Mining Techniques
Classification
Classification is a supervised learning technique used to assign items in a dataset to predefined
categories or classes. It works based on a training dataset with known labels.
Example: Classifying whether a loan applicant is “high risk” or “low risk” based on features
like income, age, and credit score.
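As a hedged sketch of the idea (the toy data and the `fit_threshold`/`classify` helpers are invented for illustration; real systems use models such as decision trees over many features), a classifier can be learned from labeled examples in a few lines of Python:

```python
# Toy classification: learn one credit-score threshold from labeled
# applicants, then classify a new applicant (illustrative only).
train = [(720, "low risk"), (680, "low risk"), (550, "high risk"), (600, "high risk")]

def fit_threshold(data):
    # Try each training score as a cut-off; keep the one with fewest errors.
    best_cut, best_err = None, len(data) + 1
    for cut, _ in data:
        err = sum((score >= cut) != (label == "low risk") for score, label in data)
        if err < best_err:
            best_cut, best_err = cut, err
    return best_cut

def classify(score, cut):
    return "low risk" if score >= cut else "high risk"

cut = fit_threshold(train)
print(classify(640, cut))   # a new applicant with score 640 -> "high risk"
```

The essential point is the supervised setup: the labels in `train` drive the learning; only the model that separates them differs between algorithms.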
Clustering
Clustering is an unsupervised learning technique used to group similar data items together.
Unlike classification, there are no predefined labels. The algorithm finds natural groupings
within the data.
Example: Grouping customers into segments based on their shopping behavior for targeted
marketing.
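A minimal k-means sketch in pure Python (one feature, two clusters; the spend figures are made up for illustration, and a robust implementation would guard against a cluster becoming empty) shows how natural groups emerge without any labels:

```python
# Toy k-means (k=2) on one feature: customers' monthly spend.
def kmeans(points, c1, c2, iters=10):
    for _ in range(iters):
        # Assign each point to its nearer centre, then move both centres.
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return c1, c2

spend = [20, 25, 30, 200, 220, 210]
print(kmeans(spend, 20, 200))   # two cluster centres: (25.0, 210.0)
```

The two centres that come back correspond to a "low spender" and a "high spender" segment, discovered purely from the data.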
Association Rule Mining
This technique is used to discover interesting relationships or associations among items in large
datasets. It finds rules like "If item A is bought, item B is likely to be bought."
Example: Market Basket Analysis where buying milk is often followed by buying bread.
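The two standard measures behind such rules, support and confidence, can be computed directly; the baskets below are invented for this example:

```python
# Support and confidence of the candidate rule "tea -> sugar".
baskets = [{"tea", "sugar"}, {"tea", "sugar", "milk"}, {"tea"}, {"bread", "sugar"}]

def rule_stats(antecedent, consequent, baskets):
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    support = both / len(baskets)   # fraction of baskets with both items
    confidence = both / ante        # how often the rule holds when tea is bought
    return support, confidence

print(rule_stats("tea", "sugar", baskets))   # (0.5, 0.666...)
```

Algorithms such as Apriori are essentially efficient ways of finding all rules whose support and confidence exceed chosen thresholds.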
Regression
Regression is a technique used to predict continuous numeric values based on the relationship
between variables. It is a form of supervised learning.
Example: Predicting house prices based on size, location, and number of rooms.
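The idea can be sketched with ordinary least squares on made-up numbers (one feature only; real models use libraries such as scikit-learn and many features):

```python
# Fit price = slope * size + intercept by least squares (toy data).
sizes = [50, 70, 90, 110]        # square metres
prices = [100, 140, 180, 220]    # price in thousands

n = len(sizes)
mx, my = sum(sizes) / n, sum(prices) / n
slope = sum((x - mx) * (y - my) for x, y in zip(sizes, prices)) / sum(
    (x - mx) ** 2 for x in sizes
)
intercept = my - slope * mx
print(slope * 100 + intercept)   # predicted price for a 100 m^2 house: 200.0
```

Because the toy data lie exactly on a line (price = 2 × size), the fit recovers slope 2 and intercept 0; real data scatter around the fitted line.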
Prediction
Prediction is used to forecast future outcomes using historical data. It can involve both
classification (categorical prediction) and regression (numerical prediction).
Example: Predicting next month's product demand based on past sales data.
Outlier Detection (Anomaly Detection)
This technique identifies rare or abnormal data points that do not follow the general pattern in
the dataset. These unusual patterns may indicate fraud, errors, or significant events.
Example: Detecting fraudulent transactions in a credit card system.
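A common baseline is the z-score rule: flag points that lie more than a few standard deviations from the mean. The amounts below are invented; real fraud systems combine many such signals:

```python
# Flag transactions whose z-score exceeds 2 (toy data).
import statistics

amounts = [40, 55, 60, 45, 50, 900]          # one suspicious amount
mu = statistics.mean(amounts)
sd = statistics.pstdev(amounts)              # population standard deviation
outliers = [a for a in amounts if abs(a - mu) / sd > 2]
print(outliers)   # [900]
```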
Summarization
Summarization involves creating a compact representation of the data set, such as through
descriptive statistics or data visualization. It helps in understanding the overall structure of the
data.
Example: Generating reports showing average sales per region or most popular product
categories.
Sequential Pattern Mining
This technique finds regular sequences or patterns in ordered data over time. It is useful in
identifying trends or sequences of behavior.
Example: In retail, discovering that customers who buy a laptop often return later to buy a
mouse, then a printer.
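A minimal sketch of the idea (toy purchase histories, invented for illustration): count how often one item is later followed by another, which is the raw frequency that sequential pattern algorithms generalize.

```python
# Count customers whose history contains "laptop" followed later by "mouse".
histories = [
    ["laptop", "mouse", "printer"],
    ["mouse", "laptop"],
    ["laptop", "bag", "mouse"],
]

def contains_sequence(history, first, second):
    # True only if `second` appears somewhere AFTER `first` in the ordering.
    if first in history:
        return second in history[history.index(first) + 1:]
    return False

count = sum(contains_sequence(h, "laptop", "mouse") for h in histories)
print(count)   # 2 of 3 customers
```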
Decision Tree
A tree-like model used to make decisions or predictions by splitting data based on features.
Example: Loan approval systems.
Neural Networks
Computational models inspired by the human brain; useful for complex pattern recognition.
Example: Image and speech recognition.
K-Nearest Neighbors (K-NN)
Classifies new data based on the majority label of its nearest neighbors.
Example: Recommender systems.
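The K-NN rule is simple enough to sketch directly; here k = 1 on toy 2-D points (real implementations, e.g. scikit-learn's `KNeighborsClassifier`, handle larger k and datasets):

```python
# 1-NN: label a query point by its single nearest labelled neighbour.
points = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]

def nearest_label(query, points):
    def dist2(item):
        # Squared Euclidean distance from the query to a labelled point.
        (x, y), _ = item
        return (x - query[0]) ** 2 + (y - query[1]) ** 2
    return min(points, key=dist2)[1]

print(nearest_label((2, 2), points))   # "A"
```

With k > 1 the only change is taking a majority vote over the k closest points instead of the single nearest one.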
1.9 Data Mining Implementation Process
A structured overview of the data mining implementation process, as typically followed in real-world applications:
Business Understanding
• Define the project goals from a business perspective.
• Understand the problem you want to solve with data mining.
• Set objectives, success criteria, and scope.
Data Understanding
• Collect initial data from various sources.
• Explore the data to identify quality issues, patterns, and relationships.
• Assess data relevance and usefulness for the task.
Data Preparation
• Clean the data: handle missing values, remove duplicates, correct errors.
• Transform data: normalize, encode, or aggregate as needed.
• Select relevant attributes/features for analysis.
• Format the data into a structure suitable for mining.
Data Mining
• Choose appropriate data mining techniques (e.g., classification, clustering, regression).
• Apply algorithms to extract patterns, trends, or models from the data.
• Fine-tune parameters for optimal performance.
Evaluation
• Assess the model’s performance using metrics like accuracy, precision, recall, etc.
• Validate results against business objectives.
• Determine if the findings are valid and actionable.
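The metrics named above can be computed from confusion counts; the predicted and actual labels below are hypothetical (1 marks the positive class):

```python
# Accuracy, precision, and recall from toy predictions.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))   # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))   # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))   # false negatives

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp)   # of the flagged positives, how many were right
recall = tp / (tp + fn)      # of the real positives, how many were found
print(accuracy, precision, recall)   # 0.75 0.75 0.75
```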
Deployment
• Implement the data mining model into a real-world environment (e.g., software,
dashboard, report).
• Integrate insights into business processes or decision-making tools.
• Provide training or documentation for end-users if needed.
Monitoring and Maintenance
• Continuously track the model’s performance.
• Update or retrain the model as data and business needs evolve.
• Ensure data quality and relevance are maintained over time.
1.10. Data Mining Architecture
Data mining architecture refers to the structure or design that supports the processes and
components involved in data mining. It outlines how data is collected, processed, and mined to
extract valuable patterns and knowledge.
The major components of a data mining system architecture are as follows:
Data Mining Architecture
Database, Data Warehouse or Other Information Repository: This is a single or set of
databases, data warehouses, spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.
Database or Data Warehouse Server: Fetches the relevant data from the database or data warehouse according to the user's data mining request.
Knowledge Base: The domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. It is typically stored as a set of rules.
Data Mining Engine: Performs the data mining tasks such as characterization, association, classification, prediction, cluster analysis, etc.
Pattern Evaluation Module: Responsible for finding interesting patterns in the data, usually using a threshold value. It interacts with the data mining engine to focus the search on interesting patterns.
Graphical User Interface: Enables communication between the user and the data mining system, allowing users to browse database or data warehouse schemas and to specify a data mining query or task.
1.11. KDD- Knowledge Discovery in Databases of Data Mining
KDD (Knowledge Discovery in Databases) is the overall process of discovering useful
knowledge or patterns from large volumes of data. It is a broader term than data mining, as
data mining is just one step within the KDD process.
KDD is a comprehensive, multi-step process for turning raw data into meaningful knowledge.
Data mining is at the heart of this process but cannot function effectively without the steps that
precede and follow it.
KDD is the process of identifying valid, novel, potentially useful, and understandable patterns
in data.
KDD (Knowledge Discovery in Databases)
Data Selection
• Purpose: Identify and retrieve relevant data from multiple sources.
• Example: Selecting customer transaction data from a retail database.
Data Preprocessing (Cleaning)
• Purpose: Remove noise and handle missing values.
• Example: Filling missing values, removing duplicate records, correcting data entry
errors.
Data Transformation
• Purpose: Convert data into suitable format for mining.
• Processes: Normalization, aggregation, feature selection.
• Example: Scaling numerical data to a uniform range.
Data Mining
• Core Step: Apply algorithms to discover patterns and relationships.
• Techniques: Classification, clustering, association rules, regression.
• Example: Using decision trees to classify customer behavior.
Pattern Evaluation and Knowledge Presentation
• Purpose: Evaluate the mined patterns and present only the most useful ones.
• Methods: Use measures like support, confidence, lift.
• Example: Visualizing association rules with graphs or dashboards.
1.12. Data Mining tools
Data mining tools are software applications that help in discovering patterns, correlations, and
insights from large datasets. These tools use techniques like machine learning, statistical
analysis, and database systems to extract useful knowledge.
A list of commonly used data mining tools, along with brief explanations:
RapidMiner
• Type: Open-source (with commercial version)
• Features:
o GUI-based interface
o Supports data preprocessing, visualization, modeling, and evaluation
o Integrates with R and Python
• Use Case: Predictive analytics, sentiment analysis, fraud detection
Weka (Waikato Environment for Knowledge Analysis)
• Type: Open-source
• Features:
o Collection of machine learning algorithms
o GUI and command-line interface
o Good for educational and research purposes
• Use Case: Classification, regression, clustering, association rule mining
KNIME (Konstanz Information Miner)
• Type: Open-source
• Features:
o Drag-and-drop interface
o Integration with R, Python, SQL, and big data platforms
o Visual workflow builder
• Use Case: Data preparation, machine learning, business intelligence
Orange
• Type: Open-source
• Features:
o Visual programming for data analysis
o Interactive data visualization
o Add-ons for text mining, bioinformatics, etc.
• Use Case: Educational purposes, visual data exploration
SAS Enterprise Miner
• Type: Commercial
• Features:
o Powerful analytics and data mining platform
o Integrates with SAS programming
o High-performance predictive modeling
• Use Case: Enterprise-level data mining, predictive modeling
Apache Mahout
• Type: Open-source (part of Apache Software Foundation)
• Features:
o Scalable machine learning algorithms
o Built to work with Hadoop ecosystem
o Good for big data analytics
• Use Case: Recommender systems, classification, clustering on large datasets
R and Python (with Libraries)
• Type: Open-source programming languages
• Features:
o Extensive libraries for data mining (e.g., scikit-learn, caret, mlr)
o Customizable and flexible
o Strong community support
• Use Case: Custom data mining tasks, research, machine learning pipelines
IBM SPSS Modeler
• Type: Commercial
• Features:
o User-friendly GUI
o Designed for statistical analysis and predictive modeling
o Supports text analytics and geospatial analytics
• Use Case: Market research, healthcare analytics, academic research
The choice of a data mining tool depends on your needs—whether you prioritize ease of use,
flexibility, big data support, or integration with other systems. For beginners, tools like Weka
or Orange are ideal. For large-scale or professional use, RapidMiner, KNIME, and SAS are
preferred.
1.13. List Major Difference between Data mining and Machine
learning
Although Data Mining and Machine Learning are closely related and often used together,
they have distinct goals, methods, and applications. Here's a clear comparison:
| Aspect | Data Mining | Machine Learning |
|---|---|---|
| Definition | Process of discovering patterns and knowledge from large datasets. | Field of study that gives computers the ability to learn from data without being explicitly programmed. |
| Goal | Extract useful information or patterns from data. | Enable systems to make predictions or decisions. |
| Approach | Uses predefined algorithms to analyze data and find patterns. | Builds models that learn from data to improve over time. |
| Focus | Knowledge discovery and pattern extraction. | Prediction, classification, and optimization. |
| Dependency | Relies more on human interpretation of results. | Can operate autonomously with minimal human input. |
| Techniques Used | Association rules, clustering, decision trees. | Supervised learning, unsupervised learning, reinforcement learning. |
| Output | Human-interpretable patterns, rules, and trends. | Predictive models that can be used in real time. |
| Examples | Market basket analysis, fraud detection, customer segmentation. | Spam filtering, speech recognition, image classification. |
| Automation | Less automated; involves significant data preparation and interpretation. | Highly automated; systems improve as more data is used. |
| Used In | Data analysis and business intelligence. | AI systems, robotics, recommendation engines. |
1.14. State the importance of Data Analytics
Data Analytics is the science of analyzing raw data to make informed decisions. It plays a vital
role in modern organizations by turning data into actionable insights.
Informed Decision-Making
• Benefit: Helps businesses make data-driven decisions rather than relying on intuition.
• Example: Analyzing customer behavior to optimize marketing strategies.
Improved Operational Efficiency
• Benefit: Identifies bottlenecks, waste, and areas for improvement in business
processes.
• Example: Logistics companies using analytics to optimize delivery routes.
Better Customer Insights
• Benefit: Understand customer preferences, habits, and feedback.
• Example: E-commerce platforms using analytics to personalize product
recommendations.
Competitive Advantage
• Benefit: Organizations using analytics can outperform competitors by responding
faster to trends.
• Example: Retailers tracking market trends to stock trending items before competitors.
Risk Management
• Benefit: Helps in identifying potential risks and fraud.
• Example: Banks analyzing transaction patterns to detect fraudulent activities.
Cost Reduction
• Benefit: Pinpoints areas where cost savings can be made.
• Example: Manufacturing units using data to minimize downtime and reduce
maintenance costs.
Innovation and Product Development
• Benefit: Enables companies to create new products or improve existing ones based on
data.
• Example: Tech companies analyzing user data to enhance app features.
Monitoring Business Performance
• Benefit: Tracks KPIs (Key Performance Indicators) and business health in real-time.
• Example: Dashboards showing real-time sales, inventory, and customer service
metrics.
Data Analytics is essential in today’s data-driven world. It helps organizations across all industries to
make smarter decisions, reduce costs, improve customer satisfaction, and stay ahead of the competition.
1.15. List and explain phases of Data Analytics
Data Analytics is a structured process that involves multiple phases to turn raw data into
actionable insights. Each phase has its own objectives, tasks, and tools.
Phases of Data Analytics
Data Requirement Gathering
• Purpose: Understand what kind of data is needed and why.
• Activities:
o Define business objectives
o Identify key metrics and KPIs
o Determine data sources (internal or external)
• Example: A retail business wants to know why sales dropped last quarter.
Data Collection
• Purpose: Gather the necessary data from various sources.
• Sources May Include:
o Databases
o Web logs
o Social media
o Surveys and sensors
• Tools: SQL, APIs, data scraping tools
• Example: Collecting transaction data, customer feedback, and web traffic logs.
Data Cleaning (Data Preprocessing)
• Purpose: Ensure data is accurate, consistent, and usable.
• Activities:
o Remove duplicates and outliers
o Handle missing values
o Correct data types and formats
• Tools: Python (Pandas), Excel, OpenRefine
• Example: Fixing missing customer ages or formatting inconsistent date entries.
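The cleaning steps above can be sketched in plain Python (the toy records are invented here; in practice Pandas' `drop_duplicates` and `fillna` do the same job):

```python
# Deduplicate by id, then fill a missing age with the mean of known ages.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 1, "age": 34},     # duplicate record
]

seen, clean = set(), []
for r in rows:
    if r["id"] not in seen:
        seen.add(r["id"])
        clean.append(dict(r))

known = [r["age"] for r in clean if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in clean:
    if r["age"] is None:
        r["age"] = mean_age
print(clean)   # [{'id': 1, 'age': 34}, {'id': 2, 'age': 34.0}]
```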
Data Exploration and Analysis
• Purpose: Understand the structure and content of the data.
• Activities:
o Exploratory Data Analysis (EDA)
o Descriptive statistics
o Data visualization
• Tools: Python (Matplotlib, Seaborn), R, Tableau, Power BI
• Example: Finding which products have the highest return rates.
Data Modeling
• Purpose: Build models that explain or predict patterns in the data.
• Techniques:
o Regression
o Classification
o Clustering
• Tools: Python (scikit-learn), R, SAS, RapidMiner
• Example: Predicting customer churn using logistic regression.
Data Interpretation and Communication
• Purpose: Translate analytical findings into insights and actions.
• Activities:
o Interpret model results
o Create dashboards and reports
o Present findings to stakeholders
• Tools: Tableau, Power BI, Excel, presentations
• Example: Presenting why customer engagement increased after a new campaign.
Decision Making and Action
• Purpose: Use insights to make strategic business decisions.
• Activities:
o Implement data-driven changes
o Monitor the impact of decisions
• Example: Changing a marketing strategy based on low conversion data.
The data analytics process is a systematic journey from identifying the problem to making
informed decisions. Each phase builds upon the previous one to ensure insights are both
accurate and actionable.
1.16. Differentiate between Data Mining and Data Analytics
Although Data Mining and Data Analytics are closely related and often overlap, they serve different purposes in the field of data science.
• Data Mining = "Find the hidden patterns"
• Data Analytics = "Understand and explain the data to make decisions"
| Aspect | Data Mining | Data Analytics |
|---|---|---|
| Definition | The process of discovering hidden patterns and relationships in large datasets. | The process of examining, organizing, and interpreting data to support decision-making. |
| Purpose | To uncover unknown patterns, associations, and trends. | To derive meaningful insights and support business decisions. |
| Focus | Pattern discovery and knowledge extraction. | Problem-solving, forecasting, and performance evaluation. |
| Process Type | Typically a sub-process within the KDD (Knowledge Discovery in Databases) process. | A broader process that includes data mining, visualization, and reporting. |
| Techniques Used | Clustering, association rules, classification, anomaly detection. | Descriptive, diagnostic, predictive, and prescriptive analytics. |
| Output | Patterns, rules, relationships (often not directly interpretable). | Actionable insights, summaries, dashboards, and reports. |
| Tools | Weka, RapidMiner, SAS Enterprise Miner. | Excel, Tableau, Power BI, R, Python (Pandas, NumPy). |
| Example Use Case | Discovering that people who buy bread also often buy butter. | Analyzing monthly sales data to determine declining revenue trends. |
Both are essential in the data science pipeline, but they address different types of questions and
problems.
1.17. List and explain types of Data mining techniques
Types of Data Mining Techniques
Data mining techniques are methods used to analyze large datasets and extract useful patterns,
relationships, or insights. These techniques can be broadly classified into different types based
on their purpose and the kind of knowledge they discover.
Classification
• Purpose: Assign data items to predefined categories or classes.
• How it Works: Uses labeled data to build a model that can predict the class of new
data.
• Example Algorithms: Decision Trees, Naive Bayes, Support Vector Machines (SVM).
• Use Case: Email spam detection, credit risk assessment
Clustering
• Purpose: Group similar data items into clusters without predefined classes.
• How it Works: Identifies natural groupings based on feature similarity.
• Example Algorithms: K-Means, Hierarchical Clustering, DBSCAN.
• Use Case: Customer segmentation, image segmentation.
Association Rule Mining
• Purpose: Discover interesting relationships or associations between variables.
• How it Works: Finds frequent itemsets and generates rules that explain how items are
associated.
• Example Algorithm: Apriori, FP-Growth.
• Use Case: Market basket analysis (e.g., customers buying bread also buy butter).
Regression
• Purpose: Predict a continuous numeric value based on input variables.
• How it Works: Models the relationship between dependent and independent variables.
• Example Algorithms: Linear Regression, Polynomial Regression.
• Use Case: Predicting sales, stock prices, or temperature.
Anomaly Detection (Outlier Detection)
• Purpose: Identify rare or unusual data points that do not conform to expected patterns.
• How it Works: Models normal behavior and flags deviations.
• Use Case: Fraud detection, network security, fault diagnosis.
Summarization
• Purpose: Provide a compact representation of the data.
• How it Works: Generates a summary or abstraction (e.g., statistical measures).
• Use Case: Generating reports, data overview.
Sequential Pattern Mining
• Purpose: Discover sequential patterns or trends over time.
• How it Works: Finds frequent sequences in ordered data.
• Use Case: Customer purchase sequences, web clickstream analysis.
Choosing the right data mining technique depends on the nature of the problem, the type of
data, and the desired outcome. Often, multiple techniques are combined for better insights.
1.18. Explain Text data mining
What is Text Data Mining?
Text Data Mining (also called Text Mining or Text Analytics) is the process of extracting
meaningful information and patterns from unstructured text data. Unlike structured data (like
numbers or tables), text data is typically in natural language form (documents, emails, social
media posts, etc.), which requires special techniques to analyze.
Why Text Data Mining?
• Most of the data generated today is unstructured text (emails, articles, social media,
customer reviews).
• Extracting useful knowledge from this vast amount of text helps organizations gain
insights that traditional data mining on structured data can’t provide.
Key Steps in Text Data Mining:
1. Text Preprocessing
o Cleaning text by removing stopwords (common words like “the,” “and”)
o Tokenization (splitting text into words or phrases)
o Stemming and Lemmatization (reducing words to their root forms)
o Removing punctuation, numbers, and irrelevant characters
2. Text Representation
o Converting text into structured format (e.g., vectors)
o Techniques: Bag of Words, TF-IDF (Term Frequency-Inverse Document
Frequency), Word Embeddings (Word2Vec, GloVe)
3. Pattern Discovery
o Applying mining techniques like classification, clustering, and association rule
mining on text data
o Examples: Topic modeling, sentiment analysis, named entity recognition
4. Evaluation and Interpretation
o Assessing the quality and relevance of extracted patterns
o Presenting insights in understandable formats (reports, dashboards)
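The preprocessing and representation steps above can be sketched in plain Python (two toy "documents" and a tiny stopword set, invented for illustration; libraries such as NLTK and scikit-learn provide production versions):

```python
# Tokenize, drop stopwords, then weight terms by TF-IDF.
import math

docs = ["the cat sat on the mat", "the dog barked at the cat"]
stopwords = {"the", "on", "at"}
tokens = [[w for w in d.split() if w not in stopwords] for d in docs]

def tf_idf(term, doc_tokens, all_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)   # term frequency in this doc
    df = sum(1 for t in all_tokens if term in t)    # documents containing the term
    idf = math.log(len(all_tokens) / df)            # rarity across the collection
    return tf * idf

print(tf_idf("mat", tokens[0], tokens))   # distinctive term -> positive weight
print(tf_idf("cat", tokens[0], tokens))   # appears in every doc -> 0.0
```

Terms that appear in every document get an IDF of zero, which is exactly why TF-IDF highlights words that distinguish one document from the rest.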
Common Applications of Text Data Mining:
• Sentiment Analysis: Determining the sentiment (positive, negative, neutral) from
customer reviews or social media.
• Topic Modeling: Discovering the main themes or topics from large document
collections.
• Spam Detection: Classifying emails or messages as spam or not spam.
• Information Extraction: Identifying specific facts or entities (like names, dates) from
text.
• Customer Feedback Analysis: Analyzing surveys, support tickets, and feedback for
improving products/services.
Challenges in Text Data Mining:
• Ambiguity and Polysemy: Words can have multiple meanings depending on context.
• Unstructured Nature: Text data lacks a fixed schema, making analysis complex.
• Language Variability: Different languages, slang, abbreviations.
• High Dimensionality: Text converted to vectors can have very large dimensions.
Summary:
Text Data Mining transforms unstructured text into structured knowledge by cleaning,
representing, and analyzing text. It unlocks valuable insights from vast textual information,
supporting decision-making in business, healthcare, social media, and more.
1.19. Differentiate between classification and clustering in data
mining
| Aspect | Classification | Clustering |
|---|---|---|
| Definition | Assigns data items to predefined classes or categories. | Groups similar data items into clusters without predefined labels. |
| Type of Learning | Supervised learning (requires labeled data). | Unsupervised learning (no labeled data). |
| Goal | Predict the class label of new data points. | Discover natural groupings or structures in the data. |
| Input Data | Labeled dataset with known categories. | Unlabeled dataset without class information. |
| Output | Class labels for each data instance. | Clusters (groups) of similar instances. |
| Examples of Algorithms | Decision Trees, Naive Bayes, Support Vector Machines (SVM). | K-Means, Hierarchical Clustering, DBSCAN. |
| Application Examples | Spam email detection, loan approval, medical diagnosis. | Customer segmentation, image segmentation, market basket analysis. |
| Evaluation | Accuracy is measured by comparing predicted and actual labels. | Evaluated by measures like cohesion, separation, silhouette score. |
| Nature of Results | Provides explicit classification rules or models. | Provides groups with shared characteristics but no explicit labels. |
• Classification predicts known categories using labelled data (supervised).
• Clustering discovers unknown groups in unlabelled data (unsupervised).
Both techniques help in understanding and organizing data but serve different purposes
depending on the availability of labelled data and the problem being solved.