Data Science with Visualization
By Prof. Madhusmita Behera
Department of Computer Science & Engineering
Module 3:
Text mining, Probability, Scraping the Web
Text mining and text analytics:
Text mining in the real world, Text mining techniques.
Scraping the Web:
HTML and the Parsing Thereof, Using APIs, JSON and XML, Using an Unauthenticated
API, Finding APIs.
Probability:
Dependence and Independence, Conditional Probability, Bayes' Theorem, Random Variables, Continuous Distributions, The Normal Distribution, The Central Limit Theorem.
Case study 1: Classifying Reddit posts. Case study 2: Using the Twitter APIs.
Introduction
• Text mining (or text analytics) combines language science, computer science, statistics, and machine learning to analyze and structure unorganized text, enabling insights.
• For example, analyzing police reports can reveal people, places, and crime types, helping study crime trends.
• Text mining can also apply to non-natural languages, such as machine logs or Morse code.
8.1 Text mining in the real world
• Text mining and NLP are used in everyday applications.
• Examples include autocomplete and spell check in emails or
messages.
• Social media platforms (e.g., Facebook) use these techniques to
suggest names.
• A key method is named entity recognition (NER).
• NER not only detects nouns but also identifies their type (e.g., a
person) and even which specific person.
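As a rough illustration, here is a minimal NER sketch using NLTK (the library the later slides use for stop word lists); the sentence is made up, and the pretrained chunker only labels coarse entity types such as PERSON and GPE (geopolitical entity):

```python
import nltk

# First-run model downloads (uncomment once):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Barack Obama visited Springfield, Illinois."  # made-up example

# Tokenize, tag parts of speech, then chunk named entities
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Subtrees carry an entity type (PERSON, GPE, ...); plain tuples are
# ordinary, non-entity words
for node in tree:
    if hasattr(node, "label"):
        print(node.label(), "->", " ".join(w for w, t in node.leaves()))
```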
To provide the most relevant answer to a search query, Google must do (among other things) all of the following:
• Preprocess all the documents
it collects for named entities
• Perform language
identification
• Detect what type of entity
you’re referring to
• Match a query to a result
• Detect the type of content to
return (PDF, adult-sensitive)
• Text mining has many applications, including, but not limited to, the
following:
• Entity identification
• Plagiarism detection
• Topic identification
• Text clustering
• Translation
• Automatic text summarization
• Fraud detection
• Spam filtering
• Sentiment analysis
• Text mining is difficult, despite impressive examples like Wolfram Alpha and
IBM Watson.
• Ambiguity is a major issue (e.g., multiple places named “Springfield”).
• Spelling problems: computers struggle with misspellings and variations (“NY,”
“Neww York,” “New York”).
• Synonyms and pronouns create challenges (e.g., resolving “she” in a sentence).
• Computers need algorithms to link variations and references that humans
interpret naturally.
• Algorithms usually perform well only in specific, well-defined tasks.
• General algorithms that work across all cases are much harder to develop.
• Example: a model trained to detect US account numbers won’t generalize to
international account numbers.
• Context matters: models trained on one domain (e.g., Twitter) don’t work well
in another (e.g., legal texts).
• No one-size-fits-all solution exists in text mining.
8.2 Text mining techniques
• Text classification: automatically classifying uncategorized texts into
specific categories.
• Text mining techniques need background knowledge to be applied
effectively.
• Techniques:
• Bag of words
• Stemming and lemmatization
• Decision tree classifier
8.2.1 Bag of words
• Bag of Words (BoW) is a simple method for structuring text
data.
• Each document is converted into a word vector.
• If a word appears in a document → labeled True, otherwise
False.
• Example: documents about Game of Thrones and Data Science.
• Together, these vectors form a document-term matrix (DTM).
• The DTM has columns = terms and rows = documents.
• In this case, values are binary (True/False for the presence of a
term).
• The example is a simplified version of text structuring.
• In reality, text preprocessing involves steps like filtering words
and stemming.
• Large corpora may contain thousands of unique words, leading
to huge datasets.
• A binary Bag of Words is just one method of structuring text.
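A minimal sketch of building such a binary document-term matrix, assuming scikit-learn's CountVectorizer (a tool choice not prescribed by the slides); the two toy documents echo the Game of Thrones / data science example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents echoing the Game of Thrones / data science example
docs = [
    "game of thrones is a television series",
    "data science is the analysis of data",
]

# binary=True records presence/absence of a term instead of its count
vectorizer = CountVectorizer(binary=True)
dtm = vectorizer.fit_transform(docs)

# Columns = terms, rows = documents: the document-term matrix
print(vectorizer.get_feature_names_out())
print(dtm.toarray())  # 1 = term present (True), 0 = absent (False)
```

Each row is one document's word vector; on a real corpus with thousands of unique terms, this matrix becomes very wide, which is why the preprocessing steps below matter.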
• Before creating a Bag of Words, several preprocessing steps are needed:
1. Tokenization:
• Splits text into tokens/terms (basic units of analysis).
• Usually words (unigrams), but can also be bigrams (2 words) or trigrams (3 words) to capture
more meaning.
• Including bigrams/trigrams improves performance but increases vector size as well as cost.
2. Stop word filtering:
• Removes common words (like the, and, is) that add little value.
• Libraries like NLTK provide stop word lists.
3. Lowercasing:
• Converts all words to lowercase to avoid treating words like Data and data as different.
4. Stemming: reducing words to their root form (covered in section 8.2.2; all four steps appear in the sketch below).
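The four steps might look as follows with NLTK (a sketch: the sentence is illustrative, and first-run downloads are noted in comments):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords")  # first-run setup

text = "Data scientists are analyzing the texts of police reports."

# 1. Tokenization: split the text into word tokens
tokens = nltk.word_tokenize(text)

# 2. Lowercasing: "Data" and "data" become the same term
tokens = [t.lower() for t in tokens]

# 3. Stop word filtering: drop low-value words such as "the" and "are"
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 4. Stemming: cut word endings back to a root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)  # e.g. ['data', 'scientist', 'analyz', 'text', 'polic', 'report']

# Bigrams capture more context at the cost of a larger vector
print(list(nltk.bigrams(stems)))
```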
8.2.2 Stemming and lemmatization
• Stemming
  • Brings words back to their root by cutting off endings.
  • Example: planes → plane.
  • Useful to reduce variance in data.
• Lemmatization
  • Similar goal as stemming but more grammar-aware.
  • Can convert plural words (cars → car) and verb forms (are → be).
  • Relies on grammar knowledge for accuracy.
• POS Tagging (Part of Speech Tagging)
  • Assigns grammatical roles (noun, verb, etc.) to each word in a sentence.
  • Example (tagging "game of thrones is a television series"):
    [("game", "NN"), ("of", "IN"), ("thrones", "NNS"), ("is", "VBZ"), ("a", "DT"), ("television", "NN"), ("series", "NN")]
  • Works on sentences, not just single words.
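With NLTK, the tagging above can be reproduced along these lines (assuming the standard pretrained English tagger):

```python
import nltk

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("game of thrones is a television series")
print(nltk.pos_tag(tokens))
# [('game', 'NN'), ('of', 'IN'), ('thrones', 'NNS'), ('is', 'VBZ'),
#  ('a', 'DT'), ('television', 'NN'), ('series', 'NN')]
```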
• Stemming vs. Lemmatization
  • Stemming = faster, simpler, but less accurate.
  • Lemmatization = slower, but gives cleaner data when combined with POS tagging.
• Practical use
  • For simplicity, stemming is chosen in the case study.
  • However, combining POS tagging + lemmatization usually produces better results; see the sketch below.
• Next step in text analytics
  • Along with text preprocessing, a decision tree classifier will be used for analysis.
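A small sketch contrasting the two with NLTK's PorterStemmer and WordNetLemmatizer (assumed tools, matching the examples above); note how the POS tag decides whether "are" maps to "be":

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("wordnet")  # first-run setup

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming just cuts endings; lemmatization applies grammar knowledge
print(stemmer.stem("planes"))         # plane
print(lemmatizer.lemmatize("cars"))   # car

# The POS tag matters: "are" only becomes "be" when treated as a verb
print(lemmatizer.lemmatize("are"))            # are (noun assumed by default)
print(lemmatizer.lemmatize("are", pos="v"))   # be
```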
8.2.3 Decision tree classifier
• Decision Tree Classifier
• Does not assume independence between variables.
• Creates interaction variables: combines words/features to capture relationships
(e.g., data + science together = stronger predictor).
• Creates buckets: splits one variable into multiple categories for better analysis
(useful for numerical features).
• Naïve Bayes Classifier
• Assumes all input variables are independent (the “naïve” assumption).
• In text mining, this often loses context because words are related.
• Example: “data science” → becomes two separate tokens (data, science) if using
unigrams.
• Context can be partly restored by using bigrams (data science, data analysis) or
trigrams (game of thrones).
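A minimal sketch with scikit-learn (an assumed tool choice, on a made-up four-document corpus): adding bigrams keeps "data science" together as one feature the tree can split on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up labeled corpus: 1 = about data science, 0 = about Game of Thrones
docs = [
    "data science uses machine learning",
    "statistics and data science analysis",
    "game of thrones television series",
    "the thrones of westeros on television",
]
labels = [1, 1, 0, 0]

# Unigrams plus bigrams, so "data science" survives as a single feature
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# The tree picks the split that most reduces entropy at each step
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, labels)

print(tree.predict(vectorizer.transform(["a series about data science"])))
```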
Figure 8.8 Fictitious decision tree model. A decision tree automatically creates buckets and supposes interactions between input variables.
• How a decision tree works:
• Decision trees split data into branches based on criteria of importance.
• Variables closer to the root are more important predictors.
• Splitting criteria:
• Entropy: a measure of unpredictability or chaos.
• Information gain: the reduction in entropy achieved by a split.
• Information gain can be illustrated with the example of predicting a baby's gender.
• At first, there's a 50% uncertainty (male or female).
• An ultrasound, while not 100% accurate, reduces that uncertainty: for example, from
50% down to 10% at 12 weeks.
• This reduction in unpredictability is called information gain.
• Decision trees use the same principle: they choose splits that most reduce uncertainty
(entropy), just like an ultrasound provides clearer information about the baby’s gender.
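Concretely, entropy here is Shannon entropy, H = -Σ p·log2(p), and the ultrasound numbers above can be plugged in directly:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: a measure of unpredictability."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

before = entropy([0.5, 0.5])  # 50/50 male or female -> 1.0 bit
after = entropy([0.9, 0.1])   # after the ultrasound -> about 0.47 bits

# The reduction in unpredictability is the information gain
print("information gain:", before - after)  # about 0.53 bits
```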
• Tree Structure
• Root = strongest predictor.
• Branches = weaker predictors.
• Splitting continues until no variables/observations remain.
• Disadvantage:
• Overfitting: at the leaf level, too few observations → the model captures randomness
instead of real patterns.
• Solution:
• Remove meaningless branches (pruning); this keeps the tree simpler and more robust, as the sketch below illustrates.
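In scikit-learn terms (continuing the earlier assumed tooling), pruning maps to parameters like min_samples_leaf and ccp_alpha; the values here are illustrative, not tuned:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning settings (hypothetical values, not tuned):
pruned_tree = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_leaf=20,  # refuse leaves with too few observations
    ccp_alpha=0.01,       # cost-complexity pruning removes weak branches
)
# pruned_tree.fit(X, labels)  # fit exactly as in the earlier sketch
```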