
Data Science with Visualization
By Prof. Madhusmita Behera
Department of Computer Science & Engineering
Module 3:
Text mining, Probability, Scraping the Web
Text mining and text analytics:
Text mining in the real world, Text mining techniques.
Scraping the Web:
HTML and the Parsing Thereof, Using APIs, JSON and XML, Using an Unauthenticated
API, Finding APIs.
Probability:
Dependence and Independence, Conditional Probability, Bayes' Theorem, Random
Variables, Continuous Distributions, The Normal Distribution, The Central Limit
Theorem.
Case study 1: Classifying Reddit posts. Case study 2: Using the Twitter APIs.



Introduction
• Text mining (or text analytics) combines language science, computer science, statistics, and machine learning to analyze and structure unorganized text, enabling insights.
• For example, analyzing police reports can reveal people, places, and crime types, helping study crime trends.
• Text mining can also be applied to non-natural languages, such as machine logs or Morse code.



8.1 Text mining in the real world
• Text mining and NLP are used in everyday applications.
• Examples include autocomplete and spell check in emails or
messages.
• Social media platforms (e.g., Facebook) use these techniques to
suggest names.
• A key method is named entity recognition (NER).
• NER not only detects nouns but also identifies their type (e.g., a
person) and even which specific person.
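
A minimal NER sketch using NLTK — a sketch only, not the authoritative method: the sentence is an illustrative placeholder, and it assumes the required NLTK resources (e.g., punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) have been downloaded via nltk.download():

    import nltk

    sentence = "Barack Obama visited Springfield, Illinois."
    tokens = nltk.word_tokenize(sentence)   # split the sentence into words
    tagged = nltk.pos_tag(tokens)           # assign part-of-speech tags
    tree = nltk.ne_chunk(tagged)            # chunk tagged words into named entities

    # Named entities appear as subtrees labelled PERSON, GPE (location), etc.,
    # showing that NER identifies not just nouns but what kind of entity they are.
    for subtree in tree.subtrees():
        if subtree.label() != "S":
            print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))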



8.1 Text mining in the real world

To provide the most relevant answer, Google must do (among other things) all of the following:
• Preprocess all the documents it collects for named entities
• Perform language identification
• Detect what type of entity you’re referring to
• Match a query to a result
• Detect the type of content to return (PDF, adult-sensitive)



8.1 Text mining in the real world
• Text mining has many applications, including, but not limited to, the
following:
• Entity identification
• Plagiarism detection
• Topic identification
• Text clustering
• Translation
• Automatic text summarization
• Fraud detection
• Spam filtering
• Sentiment analysis



8.1 Text mining in the real world
• Text mining is difficult, despite impressive examples like Wolfram Alpha and
IBM Watson.
• Ambiguity is a major issue (e.g., multiple places named “Springfield”).
• Spelling problems: computers struggle with misspellings and variations (“NY,”
“Neww York,” “New York”).
• Synonyms and pronouns create challenges (e.g., resolving “she” in a sentence).
• Computers need algorithms to link variations and references that humans
interpret naturally.
• Algorithms usually perform well only in specific, well-defined tasks.
• General algorithms that work across all cases are much harder to develop.
• Example: a model trained to detect US account numbers won’t generalize to
international account numbers.
• Context matters: models trained on one domain (e.g., Twitter) don’t work well
in another (e.g., legal texts).
• No one-size-fits-all solution exists in text mining.



8.2 Text mining techniques

• Text classification: automatically classifying uncategorized texts into specific categories.
• Text mining techniques need background knowledge to be applied effectively.
• Techniques:
• Bag of words
• Stemming and lemmatization
• Decision tree classifier



8.2.1 Bag of words
• Bag of Words (BoW) is a simple method for structuring text
data.
• Each document is converted into a word vector.
• If a word appears in a document → labeled True, otherwise
False.
• Example: documents about Game of Thrones and Data Science.
• Together, these vectors form a document-term matrix (DTM).
• The DTM has columns = terms and rows = documents.
• In this case, values are binary (True/False for the presence of a
term).
• The example is a simplified version of text structuring.
• In reality, text preprocessing involves steps like filtering words
and stemming.
• Large corpora may contain thousands of unique words, leading
to huge datasets.
• A binary Bag of Words is just one method of structuring text.
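
A minimal sketch of a binary bag of words and its document-term matrix, using scikit-learn's CountVectorizer — the two example documents are illustrative placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    import pandas as pd

    docs = [
        "Game of Thrones is a television series",
        "Data science combines statistics and machine learning",
    ]

    # binary=True records only the presence/absence of each term, not its count
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(docs)

    # Document-term matrix: rows = documents, columns = terms, values = True/False
    dtm = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
    print(dtm.astype(bool))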
8.2.1 Bag of words
• Before creating a Bag of Words, several preprocessing steps are needed:
1. Tokenization:
• Splits text into tokens/terms (basic units of analysis).
• Usually words (unigrams), but can also be bigrams (2 words) or trigrams (3 words) to capture
more meaning.
• Including bigrams/trigrams improves performance but increases vector size as well as cost.
2. Stop word filtering:
• Removes common words (like the, and, is) that add little value.
• Libraries like NLTK provide stop word lists.
3. Lowercasing:
• Converts all words to lowercase to avoid treating words like Data and data as different.
4. Stemming.
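
A minimal sketch of these four steps with NLTK — the sentence is illustrative, and it assumes the punkt and stopwords resources have been downloaded via nltk.download():

    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    text = "Game of Thrones is a television series about the Starks"

    tokens = word_tokenize(text)                      # 1. tokenization (unigrams)
    stop_words = set(stopwords.words("english"))      # 2. stop word filtering
    filtered = [t.lower() for t in tokens             # 3. lowercasing
                if t.lower() not in stop_words]
    stemmer = PorterStemmer()                         # 4. stemming
    stems = [stemmer.stem(t) for t in filtered]
    print(stems)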



8.2.2 Stemming and lemmatization
• Stemming
• Brings words back to their root by cutting off endings.
• Example: planes → plane.
• Useful to reduce variance in data.
• Lemmatization
• Similar goal as stemming but more grammar-aware.
• Can convert plural words (cars → car) and verb forms (are → be).
• Relies on grammar knowledge for accuracy.
• POS Tagging (Part of Speech Tagging)
• Assigns grammatical roles (noun, verb, etc.) to each word in a sentence.
• Example: ({"game":"NN"}, {"of":"IN"}, {"thrones":"NNS"}, {"is":"VBZ"}, {"a":"DT"}, {"television":"NN"}, {"series":"NN"})
• Works on sentences, not just single words.
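
A minimal sketch contrasting the three techniques with NLTK — it assumes the wordnet and averaged_perceptron_tagger resources have been downloaded via nltk.download():

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("planes"))                # plane: the ending is cut off
    print(lemmatizer.lemmatize("cars"))          # car: plural brought back to singular
    print(lemmatizer.lemmatize("are", pos="v"))  # be: verb form resolved via grammar

    # POS tagging assigns a grammatical role to each word of a sentence
    print(nltk.pos_tag("game of thrones is a television series".split()))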



8.2.2 Stemming and lemmatization

• Stemming vs. Lemmatization
• Stemming = faster, simpler, but less accurate.
• Lemmatization = slower, but gives cleaner data when combined with POS tagging.
• Practical use
• For simplicity, stemming is chosen in the case study.
• However, combining POS tagging + lemmatization usually produces better results.
• Next step in text analytics
• Along with text preprocessing, a decision tree classifier will be used for analysis.



8.2.3 Decision tree classifier
• Decision Tree Classifier
• Does not assume independence between variables.
• Creates interaction variables: combines words/features to capture relationships
(e.g., data + science together = stronger predictor).
• Creates buckets: splits one variable into multiple categories for better analysis
(useful for numerical features).
• Naïve Bayes Classifier
• Assumes all input variables are independent (the “naïve” assumption).
• In text mining, this often loses context because words are related.
• Example: “data science” → becomes two separate tokens (data, science) if using
unigrams.
• Context can be partly restored by using bigrams (data science, data analysis) or
trigrams (game of thrones).
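
A minimal sketch of how bigrams keep related words together, using scikit-learn's CountVectorizer — the sentence is an illustrative placeholder:

    from sklearn.feature_extraction.text import CountVectorizer

    doc = ["data science is more than data analysis"]

    # Unigrams split "data science" into two unrelated tokens
    print(CountVectorizer(ngram_range=(1, 1)).fit(doc).get_feature_names_out())

    # Bigrams keep "data science" and "data analysis" as single features,
    # partly restoring the lost context
    print(CountVectorizer(ngram_range=(2, 2)).fit(doc).get_feature_names_out())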



Figure 8.8 Fictitious decision tree model. A decision tree
automatically creates buckets and supposes interactions
between input variables.
• Working of decision tree:
• Decision trees split data into
branches based on criteria of
importance.
• Variables closer to the root
are more important
predictors.
• Splitting criteria:
• Entropy: a measure of unpredictability or chaos.
• Information gain: the reduction in entropy achieved by a split (see the sketch below).
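
A minimal sketch of the entropy measure (Shannon entropy over class proportions) — the label lists are illustrative placeholders:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # H = -sum(p * log2(p)) over the proportion p of each class
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    print(entropy(["yes", "no"]))                # 1.0: a 50/50 split, maximum chaos
    print(entropy(["yes", "yes", "yes", "no"]))  # ~0.81: more predictable
    print(entropy(["yes", "yes", "yes"]))        # -0.0: fully predictable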



8.2.3 Decision tree classifier
• Information gain can be illustrated with the example of predicting a baby’s gender.
• At first, there’s a 50% uncertainty (male or female).
• An ultrasound, while not 100% accurate, reduces that uncertainty—for example, from
50% down to 10% at 12 weeks.
• This reduction in unpredictability is called information gain.
• Decision trees use the same principle: they choose splits that most reduce uncertainty
(entropy), just like an ultrasound provides clearer information about the baby’s gender.
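
A minimal sketch of information gain, reusing the entropy function from the previous sketch — the gender labels and the candidate split are illustrative placeholders:

    def information_gain(parent, children):
        # Gain = entropy before the split minus the weighted entropy after it
        n = len(parent)
        after = sum(len(c) / n * entropy(c) for c in children)
        return entropy(parent) - after

    parent = ["M", "M", "F", "F"]              # 50% uncertainty: entropy = 1.0
    children = [["M", "M", "F"], ["F"]]        # after a candidate split
    print(information_gain(parent, children))  # ~0.31: the split reduced chaos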



8.2.3 Decision tree classifier

• Tree Structure
• Root = strongest predictor.
• Branches = weaker predictors.
• Splitting continues until no variables/observations remain.
• Disadvantage:
• Overfitting: at leaf level, too few observations → the model captures randomness instead of real patterns.
• Solution:
• Remove meaningless branches (pruning); this keeps the tree simpler and more robust.
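
A minimal end-to-end sketch: a decision tree trained on a binary bag of words, with max_depth used as a simple way to cut off meaningless branches — the tiny corpus and labels are illustrative placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.tree import DecisionTreeClassifier

    docs = [
        "data science with statistics",
        "machine learning and data analysis",
        "game of thrones television series",
        "thrones is a fantasy television series",
    ]
    labels = ["data_science", "data_science", "tv", "tv"]

    X = CountVectorizer(binary=True).fit_transform(docs)

    # Splits are chosen by entropy / information gain; max_depth stops the tree
    # from fitting the randomness of tiny leaf buckets (overfitting)
    model = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
    model.fit(X, labels)
    print(model.predict(X))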
