Data Science with Visualization
By Prof. Madhusmita Behera
Department of Computer Science & Engineering
Module 3:
Text mining, Probability, Scraping the Web
Text mining and text analytics:
Text mining in the real world, Text mining techniques.
Scraping the Web:
HTML and the Parsing Thereof, Using APIs, JSON and XML, Using an Unauthenticated
API, Finding APIs.
Probability:
Dependence and Independence, Conditional Probability, Bayes' Theorem, Random Variables, Continuous Distributions, The Normal Distribution, The Central Limit Theorem.
Case study 1: Classifying Reddit posts. Case study 2: Using the Twitter APIs.
Introduction
• Text mining (or text analytics) combines language science, computer science, statistics, and machine learning to analyze and structure unorganized text, enabling insights.
• For example, analyzing police reports can reveal people, places, and crime types, helping study crime trends.
• Text mining can also apply to non-natural languages, such as machine logs or Morse code.
8.1 Text mining in the real world
• Text mining and NLP are used in everyday applications.
• Examples include autocomplete and spell check in emails or
messages.
• Social media platforms (e.g., Facebook) use these techniques to
suggest names.
• A key method is named entity recognition (NER).
• NER not only detects nouns but also identifies their type (e.g., a
person) and even which specific person.
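As a rough illustration, here is a minimal NER sketch using NLTK (the library the later slides use for stop word lists); the sentence is made up, and the pretrained chunker only labels coarse entity types such as PERSON and GPE (geopolitical entity):

```python
import nltk

# First-run model downloads (uncomment once):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

sentence = "Barack Obama visited Springfield, Illinois."  # made-up example

# Tokenize, tag parts of speech, then chunk named entities
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Subtrees carry an entity type (PERSON, GPE, ...); plain tuples are
# ordinary, non-entity words
for node in tree:
    if hasattr(node, "label"):
        print(node.label(), "->", " ".join(w for w, t in node.leaves()))
```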
To provide the most relevant answer to a search query, Google must do (among other things) all of the following:
• Preprocess all the documents
it collects for named entities
• Perform language
identification
• Detect what type of entity
you’re referring to
• Match a query to a result
• Detect the type of content to
return (PDF, adult-sensitive)
• Text mining has many applications, including, but not limited to, the
following:
• Entity identification
• Plagiarism detection
• Topic identification
• Text clustering
• Translation
• Automatic text summarization
• Fraud detection
• Spam filtering
• Sentiment analysis
• Text mining is difficult, despite impressive examples like Wolfram Alpha and
IBM Watson.
• Ambiguity is a major issue (e.g., multiple places named “Springfield”).
• Spelling problems: computers struggle with misspellings and variations (“NY,”
“Neww York,” “New York”).
• Synonyms and pronouns create challenges (e.g., resolving “she” in a sentence).
• Computers need algorithms to link variations and references that humans
interpret naturally.
• Algorithms usually perform well only in specific, well-defined tasks.
• General algorithms that work across all cases are much harder to develop.
• Example: a model trained to detect US account numbers won’t generalize to
international account numbers.
• Context matters: models trained on one domain (e.g., Twitter) don’t work well
in another (e.g., legal texts).
• No one-size-fits-all solution exists in text mining.
8.2 Text mining techniques
• Text classification: automatically classifying uncategorized texts into
specific categories.
• Text mining techniques need background knowledge to be applied
effectively.
• Techniques:
• Bag of words
• Stemming and lemmatization
• Decision tree classifier
8.2.1 Bag of words
• Bag of Words (BoW) is a simple method for structuring text
data.
• Each document is converted into a word vector.
• If a word appears in a document → labeled True, otherwise
False.
• Example: documents about Game of Thrones and Data Science.
• Together, these vectors form a document-term matrix (DTM).
• The DTM has columns = terms and rows = documents.
• In this case, values are binary (True/False for the presence of a
term).
• The example is a simplified version of text structuring.
• In reality, text preprocessing involves steps like filtering words
and stemming.
• Large corpora may contain thousands of unique words, leading
to huge datasets.
• A binary Bag of Words is just one method of structuring text.
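A minimal sketch of building such a binary document-term matrix, assuming scikit-learn's CountVectorizer (a tool choice not prescribed by the slides); the two toy documents echo the Game of Thrones / data science example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents echoing the Game of Thrones / data science example
docs = [
    "game of thrones is a television series",
    "data science is the analysis of data",
]

# binary=True records presence/absence of a term instead of its count
vectorizer = CountVectorizer(binary=True)
dtm = vectorizer.fit_transform(docs)

# Columns = terms, rows = documents: the document-term matrix
print(vectorizer.get_feature_names_out())
print(dtm.toarray())  # 1 = term present (True), 0 = absent (False)
```

Each row is one document's word vector; on a real corpus with thousands of unique terms, this matrix becomes very wide, which is why the preprocessing steps below matter.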
• Before creating a Bag of Words, several preprocessing steps are needed:
1. Tokenization:
• Splits text into tokens/terms (basic units of analysis).
• Usually words (unigrams), but can also be bigrams (2 words) or trigrams (3 words) to capture
more meaning.
• Including bigrams/trigrams improves performance but increases vector size as well as cost.
2. Stop word filtering:
• Removes common words (like the, and, is) that add little value.
• Libraries like NLTK provide stop word lists.
3. Lowercasing:
• Converts all words to lowercase to avoid treating words like Data and data as different.
4. Stemming: reducing words to their root form (covered in section 8.2.2; all four steps appear in the sketch below).
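The four steps might look as follows with NLTK (a sketch: the sentence is illustrative, and first-run downloads are noted in comments):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download("punkt"); nltk.download("stopwords")  # first-run setup

text = "Data scientists are analyzing the texts of police reports."

# 1. Tokenization: split the text into word tokens
tokens = nltk.word_tokenize(text)

# 2. Lowercasing: "Data" and "data" become the same term
tokens = [t.lower() for t in tokens]

# 3. Stop word filtering: drop low-value words such as "the" and "are"
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 4. Stemming: cut word endings back to a root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)  # e.g. ['data', 'scientist', 'analyz', 'text', 'polic', 'report']

# Bigrams capture more context at the cost of a larger vector
print(list(nltk.bigrams(stems)))
```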
8.2.2 Stemming and lemmatization
• Stemming
  • Brings words back to their root by cutting off endings.
  • Example: planes → plane.
  • Useful to reduce variance in data.
• Lemmatization
  • Similar goal as stemming but more grammar-aware.
  • Can convert plural words (cars → car) and verb forms (are → be).
  • Relies on grammar knowledge for accuracy.
• POS Tagging (Part of Speech Tagging)
  • Assigns grammatical roles (noun, verb, etc.) to each word in a sentence.
  • Example (tagging "game of thrones is a television series"):
    [("game", "NN"), ("of", "IN"), ("thrones", "NNS"), ("is", "VBZ"), ("a", "DT"), ("television", "NN"), ("series", "NN")]
  • Works on sentences, not just single words.
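With NLTK, the tagging above can be reproduced along these lines (assuming the standard pretrained English tagger):

```python
import nltk

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("game of thrones is a television series")
print(nltk.pos_tag(tokens))
# [('game', 'NN'), ('of', 'IN'), ('thrones', 'NNS'), ('is', 'VBZ'),
#  ('a', 'DT'), ('television', 'NN'), ('series', 'NN')]
```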
• Stemming vs. Lemmatization
  • Stemming = faster, simpler, but less accurate.
  • Lemmatization = slower, but gives cleaner data when combined with POS tagging.
• Practical use
  • For simplicity, stemming is chosen in the case study.
  • However, combining POS tagging + lemmatization usually produces better results; see the sketch below.
• Next step in text analytics
  • Along with text preprocessing, a decision tree classifier will be used for analysis.
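A small sketch contrasting the two with NLTK's PorterStemmer and WordNetLemmatizer (assumed tools, matching the examples above); note how the POS tag decides whether "are" maps to "be":

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("wordnet")  # first-run setup

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming just cuts endings; lemmatization applies grammar knowledge
print(stemmer.stem("planes"))         # plane
print(lemmatizer.lemmatize("cars"))   # car

# The POS tag matters: "are" only becomes "be" when treated as a verb
print(lemmatizer.lemmatize("are"))            # are (noun assumed by default)
print(lemmatizer.lemmatize("are", pos="v"))   # be
```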
8.2.3 Decision tree classifier
• Decision Tree Classifier
• Does not assume independence between variables.
• Creates interaction variables: combines words/features to capture relationships
(e.g., data + science together = stronger predictor).
• Creates buckets: splits one variable into multiple categories for better analysis
(useful for numerical features).
• Naïve Bayes Classifier
• Assumes all input variables are independent (the “naïve” assumption).
• In text mining, this often loses context because words are related.
• Example: “data science” → becomes two separate tokens (data, science) if using
unigrams.
• Context can be partly restored by using bigrams (data science, data analysis) or
trigrams (game of thrones).
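A minimal sketch with scikit-learn (an assumed tool choice, on a made-up four-document corpus): adding bigrams keeps "data science" together as one feature the tree can split on:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up labeled corpus: 1 = about data science, 0 = about Game of Thrones
docs = [
    "data science uses machine learning",
    "statistics and data science analysis",
    "game of thrones television series",
    "the thrones of westeros on television",
]
labels = [1, 1, 0, 0]

# Unigrams plus bigrams, so "data science" survives as a single feature
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# The tree picks the split that most reduces entropy at each step
tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, labels)

print(tree.predict(vectorizer.transform(["a series about data science"])))
```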
Figure 8.8 Fictitious decision tree model. A decision tree automatically creates buckets and supposes interactions between input variables.
• How a decision tree works:
• Decision trees split data into branches based on criteria of importance.
• Variables closer to the root are more important predictors.
• Splitting criteria:
• Entropy: a measure of unpredictability or chaos.
• Information gain: the reduction in entropy achieved by a split.
• Information gain can be illustrated with the example of predicting a baby's gender.
• At first, there's a 50% uncertainty (male or female).
• An ultrasound, while not 100% accurate, reduces that uncertainty: for example, from
50% down to 10% at 12 weeks.
• This reduction in unpredictability is called information gain.
• Decision trees use the same principle: they choose splits that most reduce uncertainty
(entropy), just like an ultrasound provides clearer information about the baby’s gender.
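Concretely, entropy here is Shannon entropy, H = -Σ p·log2(p), and the ultrasound numbers above can be plugged in directly:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: a measure of unpredictability."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

before = entropy([0.5, 0.5])  # 50/50 male or female -> 1.0 bit
after = entropy([0.9, 0.1])   # after the ultrasound -> about 0.47 bits

# The reduction in unpredictability is the information gain
print("information gain:", before - after)  # about 0.53 bits
```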
• Tree Structure
• Root = strongest predictor.
• Branches = weaker predictors.
• Splitting continues until no variables/observations remain.
• Disadvantage:
• Overfitting: at the leaf level, too few observations → the model captures randomness
instead of real patterns.
• Solution:
• Remove meaningless branches (pruning); this keeps the tree simpler and more robust, as the sketch below illustrates.
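In scikit-learn terms (continuing the earlier assumed tooling), pruning maps to parameters like min_samples_leaf and ccp_alpha; the values here are illustrative, not tuned:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning settings (hypothetical values, not tuned):
pruned_tree = DecisionTreeClassifier(
    criterion="entropy",
    min_samples_leaf=20,  # refuse leaves with too few observations
    ccp_alpha=0.01,       # cost-complexity pruning removes weak branches
)
# pruned_tree.fit(X, labels)  # fit exactly as in the earlier sketch
```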