Email Classification Terminologies Explanations

The document provides an overview of several Python libraries and concepts for data analysis and natural language processing, including Pandas for data manipulation, regular expressions for string pattern matching, and NLTK for NLP tasks. It discusses the importance of stop words, the differences between stemming and lemmatization, and introduces the Bag of Words model for converting text into numerical features. Additionally, it explains the concepts of precision and recall in the context of evaluating predictive models.


1. Pandas
Pandas is an open-source Python library used for data analysis and data manipulation. It
provides fast, flexible, and expressive data structures like:

• Series: One-dimensional labeled array.

• DataFrame: Two-dimensional labeled data structure (like a spreadsheet or SQL table).

Accessing Different Types of Datasets with Pandas

Format        Function
CSV           pd.read_csv('filename.csv')
Excel         pd.read_excel('filename.xlsx')
JSON          pd.read_json('filename.json')
SQL           pd.read_sql(query, connection)
Parquet       pd.read_parquet('filename.parquet')
HTML Tables   pd.read_html('url_or_file.html')
Clipboard     pd.read_clipboard()
Pickle        pd.read_pickle('filename.pkl')
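As a minimal sketch of the most common case, the snippet below writes a tiny CSV file and reads it back with pd.read_csv (the filename "emails.csv" and its columns are made up for illustration):

```python
import pandas as pd

# Create a small CSV so the example is self-contained
# ("emails.csv" and its columns are illustrative, not from a real dataset).
with open("emails.csv", "w") as f:
    f.write("subject,label\nWin a prize now,spam\nMeeting at 10am,ham\n")

df = pd.read_csv("emails.csv")   # returns a DataFrame
print(df.shape)                  # (2, 2)
print(df["label"].tolist())      # ['spam', 'ham']
```

The other readers in the table work the same way: each returns a DataFrame (or a list of DataFrames, in the case of pd.read_html).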

2. Regular Expression Library (re)


Common Functions in re

Function      Description
re.search()   Searches for a pattern anywhere in a string
re.match()    Checks if the pattern matches at the beginning of the string
re.findall()  Returns all non-overlapping matches as a list
re.sub()      Replaces matches with a new string
re.split()    Splits a string by the matches of the pattern
re.compile()  Compiles a pattern for reuse (efficient in loops)
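Each function in the table can be demonstrated on one small string (the example text is made up):

```python
import re

text = "Order #123 shipped. Order #456 pending."

print(re.search(r"#\d+", text).group())   # '#123' — found anywhere in the string
print(re.match(r"Order", text) is None)   # False — pattern matches at the start
print(re.findall(r"#\d+", text))          # ['#123', '#456'] — all matches
print(re.sub(r"\d+", "N", text))          # 'Order #N shipped. Order #N pending.'
print(re.split(r"\.\s*", text))           # ['Order #123 shipped', 'Order #456 pending', '']

pattern = re.compile(r"#\d+")             # compile once, reuse in loops
print(pattern.findall(text))              # ['#123', '#456']
```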

3. NLTK Library
NLTK is a powerful Python package used for Natural Language Processing (NLP) tasks,
like:

• Tokenization

• Stemming

• Stopword removal

• Text classification

• Sentiment analysis
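A small taste of NLTK, using the Porter stemmer (which works without downloading any extra data; NLTK's tokenizers and stopword list additionally require nltk.download("punkt") and nltk.download("stopwords")):

```python
from nltk.stem import PorterStemmer

# PorterStemmer needs no corpus downloads, unlike word_tokenize or stopwords.
stemmer = PorterStemmer()
words = ["playing", "played", "plays", "studies"]
print([stemmer.stem(w) for w in words])  # ['play', 'play', 'play', 'studi']
```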

4. What are Stop words in NLP?


Stop words are the most common words in a language that usually do not carry
important meaning, especially for tasks like text classification, search engines, or NLP
models.

Why Remove Stop words?

Removing stop words helps to:

• Reduce noise in text data

• Decrease dimensionality of data

• Improve model performance in many NLP tasks
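Stop word removal is simple to sketch in pure Python. The stop list below is a tiny hand-picked set for illustration; in practice you would use a full list such as nltk.corpus.stopwords.words("english"):

```python
# Tiny illustrative stop list — real NLP pipelines use a much longer one.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "i", "do", "not"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The prize is waiting in your inbox"))
# ['prize', 'waiting', 'your', 'inbox']
```

Note that for some tasks (e.g., sentiment analysis) words like "not" carry meaning, so the stop list should be chosen per task.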

5. Stemming vs Lemmatization
Stemming reduces a word to its stem by chopping off prefixes and suffixes. The result is not always a valid word (for example, "studies" becomes "studi"). Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP).

Lemmatization is similar to stemming, but its output, called a lemma, is the dictionary base form of the word rather than a truncated stem. After lemmatization we always get a valid word with the same meaning (for example, "studies" becomes "study").
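The contrast can be sketched with NLTK's Porter stemmer next to a toy lemma lookup (the LEMMAS dict is a hand-written stand-in for a real lemmatizer such as WordNetLemmatizer, which requires nltk.download("wordnet")):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Toy lemma table standing in for a real lemmatizer — illustration only.
LEMMAS = {"studies": "study", "running": "run", "better": "good"}

for word in ["studies", "running"]:
    stem = stemmer.stem(word)       # may not be a valid word
    lemma = LEMMAS.get(word, word)  # always a dictionary word
    print(word, "->", stem, "vs", lemma)
# studies -> studi vs study
# running -> run vs run
```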

6. Bag of Words
What is Bag of Words (BoW)?

The Bag of Words model is a technique used to convert text into numerical features so
that it can be used by machine learning models.
• It counts the frequency of each word in a document.

• It ignores grammar and word order, and focuses only on word occurrences.

How BoW Works — Step-by-Step

Suppose you have these 3 sentences:

1. "I love playing football"

2. "Football is a great game"

3. "I do not like football"

Step 1: Build Vocabulary (Unique Words)

From all the sentences, list all unique words:

['i', 'love', 'playing', 'football', 'is', 'a', 'great', 'game', 'do', 'not', 'like']

(11 unique words)

Step 2: Vectorize Sentences

Now convert each sentence into a vector using word counts from the vocabulary.

Sentence BoW Vector (frequency of each word)

I love playing football [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Football is a great game [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

I do not like football [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1]
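The two steps above can be sketched in pure Python on the same three sentences (in practice a library vectorizer such as scikit-learn's CountVectorizer does this, though its default tokenizer drops one-letter words like "i" and "a"):

```python
sentences = [
    "I love playing football",
    "Football is a great game",
    "I do not like football",
]

# Step 1: build the vocabulary of unique lowercase words, in first-seen order.
vocab = []
for s in sentences:
    for w in s.lower().split():
        if w not in vocab:
            vocab.append(w)

# Step 2: count each vocabulary word in each sentence.
vectors = [[s.lower().split().count(w) for w in vocab] for s in sentences]

print(vocab)
# ['i', 'love', 'playing', 'football', 'is', 'a', 'great', 'game', 'do', 'not', 'like']
print(vectors[0])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```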

Summary

Feature    BoW
Use        Convert text to numbers
Based on   Word frequency
Ignores    Grammar, word order
Output     Vectors of word counts

7. Precision vs Recall
Precision vs Recall (Intuition)

• Precision answers: "Of all the messages predicted as spam, how many were actually spam?" It matters when false positives are costly (e.g., mislabeling an important email as spam).

• Recall answers: "Of all the actual spam messages, how many did we correctly identify?" It matters when missing positives is risky (e.g., missing a spam email that contains a phishing link).
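Both metrics follow directly from the confusion-matrix counts. The numbers below are invented for illustration:

```python
# Toy confusion-matrix counts for a spam classifier (illustrative numbers).
tp = 40  # spam correctly flagged as spam
fp = 10  # ham wrongly flagged as spam
fn = 20  # spam that slipped through as ham

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam?
recall = tp / (tp + fn)     # of all real spam, how much did we catch?

print(precision)         # 0.8
print(round(recall, 3))  # 0.667
```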
