1. Pandas
Pandas is an open-source Python library used for data analysis and data manipulation. It
provides fast, flexible, and expressive data structures:
• Series: a one-dimensional labeled array.
• DataFrame: a two-dimensional labeled data structure (like a spreadsheet or SQL table).
Accessing Different Types of Datasets with Pandas

Format        Function
CSV           pd.read_csv('filename.csv')
Excel         pd.read_excel('filename.xlsx')
JSON          pd.read_json('filename.json')
SQL           pd.read_sql(query, connection)
Parquet       pd.read_parquet('filename.parquet')
HTML Tables   pd.read_html('url_or_file.html')
Clipboard     pd.read_clipboard()
Pickle        pd.read_pickle('filename.pkl')
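As a minimal sketch of the CSV reader (the file names above are placeholders), the example below builds the "file" in memory with io.StringIO so it is self-contained:

```python
import io

import pandas as pd

# In practice this would be pd.read_csv('filename.csv');
# here an in-memory buffer stands in for the file.
csv_data = io.StringIO("name,age\nAsha,30\nRavi,25\n")
df = pd.read_csv(csv_data)

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'age']
```

The other read_* functions in the table return DataFrames the same way, differing only in the input format they parse.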
2. Regular Expression Library (re)
Python's built-in re module provides regular-expression support for pattern matching in strings.
Common Functions in re

Function       Description
re.search()    Searches for a pattern anywhere in a string
re.match()     Checks if the pattern matches at the beginning of the string
re.findall()   Returns all non-overlapping matches as a list
re.sub()       Replaces matches with a new string
re.split()     Splits a string by the matches of the pattern
re.compile()   Compiles a pattern for reuse (efficient in loops)
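A short sketch of the functions in the table (the sample text and patterns are illustrative):

```python
import re

text = "Order 42 shipped on 2024-05-01, order 7 pending."

# re.search: first match anywhere in the string
m = re.search(r"\d{4}-\d{2}-\d{2}", text)
print(m.group())  # 2024-05-01

# re.findall: all non-overlapping matches as a list
print(re.findall(r"order \d+", text, flags=re.IGNORECASE))
# ['Order 42', 'order 7']

# re.sub: replace every match with a new string
print(re.sub(r"\d+", "#", "a1b22c"))  # a#b#c

# re.compile: precompiled pattern for reuse (e.g., inside loops)
digits = re.compile(r"\d+")
print(digits.split("a1b22c"))  # ['a', 'b', 'c']
```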
3. NLTK Library.
NLTK (Natural Language Toolkit) is a powerful Python package used for Natural Language
Processing (NLP) tasks, such as:
• Tokenization
• Stemming
• Stopword removal
• Text classification
• Sentiment analysis
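A small sketch of two of these tasks using NLTK; RegexpTokenizer and PorterStemmer are chosen here because they work without downloading extra corpora (tools like word_tokenize and the stop-word lists need a one-time nltk.download first):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Tokenization: split text into word tokens
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize("NLTK makes tokenizing sentences easy!")
print(tokens)  # ['NLTK', 'makes', 'tokenizing', 'sentences', 'easy']

# Stemming: reduce each token to its stem
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])
```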
4. What are Stop words in NLP?
Stop words are the most common words in a language that usually do not carry
important meaning, especially for tasks like text classification, search engines, or NLP
models.
Why Remove Stop words?
Removing stop words helps to:
• Reduce noise in text data
• Decrease dimensionality of data
• Improve model performance in many NLP tasks
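A minimal sketch of stop-word removal; the stop-word set below is a small hand-picked illustration (in practice you would use a full list such as NLTK's nltk.corpus.stopwords, which requires a one-time download):

```python
# Illustrative (not exhaustive) stop-word set
stop_words = {"the", "is", "a", "an", "in", "of", "to", "and"}

sentence = "the cat is sitting in the garden"
tokens = sentence.split()

# Keep only the words that carry meaning
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['cat', 'sitting', 'garden']
```

Note how the filtered list is both shorter (lower dimensionality) and less noisy than the original token list.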
5. Stemming vs Lemmatization.
Stemming is the process of reducing a word to its stem by chopping off affixes (suffixes
and prefixes) using simple heuristic rules. The result may not be a valid dictionary word
(e.g., 'studies' → 'studi'). Stemming is important in natural language understanding
(NLU) and natural language processing (NLP).
The lemmatization technique is similar in goal to stemming. The output we get after
lemmatization is called a 'lemma', which is the dictionary root form of the word rather
than a chopped stem. After lemmatization, we always get a valid word that means the
same thing (e.g., 'studies' → 'study').
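The contrast can be sketched with NLTK; PorterStemmer needs no extra data, while WordNetLemmatizer requires the WordNet corpus (nltk.download('wordnet')), so only stemming is actually run here:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming chops affixes; the result may not be a dictionary word
print(stemmer.stem("studies"))  # studi
print(stemmer.stem("running"))  # run

# Lemmatization (requires the WordNet corpus) returns a valid word:
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize("studies")  -> 'study'
```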
6. Bag of Words (BoW).
What is Bag of Words (BoW)?
The Bag of Words model is a technique used to convert text into numerical features so
that it can be used by machine learning models.
• It counts the frequency of each word in a document.
• It ignores grammar and word order, and focuses only on word occurrences.
How BoW Works — Step-by-Step
Suppose you have these 3 sentences:
1. "I love playing football"
2. "Football is a great game"
3. "I do not like football"
Step 1: Build Vocabulary (Unique Words)
From all the sentences, list all unique words:
['i', 'love', 'playing', 'football', 'is', 'a', 'great', 'game', 'do', 'not', 'like']
(11 unique words)
Step 2: Vectorize Sentences
Now convert each sentence into a vector using word counts from the vocabulary.
Sentence                    BoW Vector (frequency of each word)
"I love playing football"   [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
"Football is a great game"  [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
"I do not like football"    [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1]
Summary

Feature    BoW
Use        Convert text to numbers
Based on   Word frequency
Ignores    Grammar, word order
Output     Vectors of word counts
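The step-by-step example above can be reproduced in plain Python (no ML library assumed; in practice a tool like scikit-learn's CountVectorizer does the same job):

```python
sentences = [
    "I love playing football",
    "Football is a great game",
    "I do not like football",
]

# Step 1: build the vocabulary (unique words, first-occurrence order)
vocab = []
for s in sentences:
    for word in s.lower().split():
        if word not in vocab:
            vocab.append(word)
print(vocab)
# ['i', 'love', 'playing', 'football', 'is', 'a', 'great', 'game', 'do', 'not', 'like']

# Step 2: vectorize each sentence by counting vocabulary words
vectors = []
for s in sentences:
    words = s.lower().split()
    vectors.append([words.count(v) for v in vocab])

for v in vectors:
    print(v)
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
# [1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1]
```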
7. Precision vs Recall.
Precision vs Recall (Intuition)

Term       What it Answers                             Importance in Context
Precision  "Of all the messages predicted as spam,     Important when false positives are costly
           how many were actually spam?"               (e.g., mislabeling an important email as spam)
Recall     "Of all the actual spam messages, how       Important when missing positives is risky
           many did we correctly identify?"            (e.g., missing a spam email that contains a phishing link)
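In formula terms, precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch with made-up spam-classifier counts:

```python
# Hypothetical confusion-matrix counts for a spam classifier
TP = 40  # spam correctly flagged as spam
FP = 10  # ham wrongly flagged as spam (false alarm)
FN = 20  # spam the classifier missed

precision = TP / (TP + FP)  # of all flagged messages, how many were spam?
recall = TP / (TP + FN)     # of all actual spam, how many did we catch?

print(precision)  # 0.8
print(recall)     # 40/60 ≈ 0.667
```

A classifier can trade one for the other: flagging fewer messages raises precision but tends to lower recall, and vice versa.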