EXPERIMENT 8
NLP
Rajarajeshwari
March 30, 2025
1 Introduction
Word embeddings are essential for various NLP tasks, providing dense vector representations of words that capture semantic and syntactic relationships. This study evaluates different word embedding models for the Maithili language.
2.4 Transformers
Transformers use self-attention mechanisms to process entire sequences in parallel, leading
to state-of-the-art performance in contextual embeddings.
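The core operation is scaled dot-product attention: for query, key, and value matrices Q, K, and V with key dimension d_k,
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \]
so every output position is a weighted combination of all value vectors, which is what lets the model contextualize each token against the entire sequence in parallel.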
2.6 Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF represents words based on their importance within a document, calculated as
the product of term frequency and inverse document frequency.
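For a term t in a document d, with N documents in the corpus and df(t) the number of documents containing t, the standard weighting is
\[ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{N}{\mathrm{df}(t)}, \]
where tf(t, d) is the (possibly normalized) frequency of t in d. Library implementations such as scikit-learn, used later in this report, apply smoothed variants of the idf term.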
2.11 FastText
FastText extends Skip-gram by representing words as character n-grams, improving embeddings for rare and out-of-vocabulary words.
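A minimal gensim sketch (with toy English tokens; the experiments below train on the tokenized Maithili corpus) illustrates how a word never seen during training still receives a vector assembled from its character n-grams:
from gensim.models import FastText
toy_sentences = [["language", "model"], ["language", "embedding"]]
toy_ft = FastText(sentences=toy_sentences, vector_size=50, window=3, min_count=1)
oov_vector = toy_ft.wv["languages"]  # "languages" was never seen, but shares n-grams with "language"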
2.15 Bidirectional Encoder Representations from Transformers
(BERT)
BERT uses masked language modeling and next sentence prediction to generate context-aware word embeddings.
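As an illustration of the masked-language-modelling objective, the Hugging Face fill-mask pipeline can be used to predict a masked token; the multilingual checkpoint named here is an assumption, not necessarily the one used in the experiments:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")  # assumed checkpoint
print(unmasker("Maithili is an Indo-Aryan [MASK]."))  # top candidate fillers with scores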
2.18 NV-Embed-v2
NV-Embed-v2 is a generalist text embedding model that adapts a decoder-only large language model with latent-attention pooling and contrastive instruction tuning to produce high-quality text embeddings.
2.19 Doc2Vec
Doc2Vec extends Word2Vec to generate embeddings for entire documents rather than
individual words.
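A minimal gensim sketch (toy documents; the experiments below use the tokenized Maithili articles) shows how whole documents are tagged, trained on, and embedded:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
toy_docs = ["word vectors for documents", "documents get their own vectors"]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(toy_docs)]
toy_d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)
doc_vector = toy_d2v.infer_vector("an unseen document".split())  # embedding for a whole document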
2.20 InferSent
InferSent is a sentence embedding model trained on natural language inference data to capture semantic relationships between sentence pairs.
3 Experimental Setup
We evaluate the above models on Maithili datasets for short and long text embedding.
We apply both intrinsic and extrinsic evaluation metrics.
4 Intrinsic Evaluation
We assess word similarity, word analogy, concept categorization, outlier detection, QVEC, embedding latency, retrieval quality, geodesic correlation, triplet loss, Minimum Reconstruction Error (MRE) score, and t-SNE and PCA visualizations.
5 Extrinsic Evaluation
Extrinsic evaluation includes POS tagging, chunking, named-entity recognition (NER),
sentiment analysis, paraphrase identification, and neural machine translation (NMT).
6 Comparative Analysis
We compare the performance of the different embeddings across the intrinsic and extrinsic evaluations described above.
7 Conclusion
Our study provides insights into the effectiveness of different word embedding models for
Maithili NLP, guiding future research directions.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Collect links to individual articles from the landing page
# (we assume that article links are contained in <a> tags with specific classes)
def get_article_urls(base_url):
    response = requests.get(base_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    article_links = []
    for article in soup.find_all('a', href=True):  # <a> tags with href attributes
        link = article['href']
        # Filter links for those that lead to articles (refine the condition based on the actual page structure)
        if link.startswith('https://www.mithilanews.com/') and 'category' not in link:
            article_links.append(link)
    return article_links

# Helper (name reconstructed) to extract the article text
# (assuming article text is in a <div> with class 'entry-content' or similar)
def get_article_text(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    article_text = ""
    content_div = soup.find('div', class_='entry-content')  # Adjust based on the actual site structure
    if content_div:
        article_text = content_div.get_text(separator=' ', strip=True)  # Get text from article
    return article_text

# Function to scrape multiple articles and save them into a CSV file
def scrape_mithila_news(base_url, output_file):
    article_urls = get_article_urls(base_url)
    articles = []
    for url in article_urls:  # (loop reconstructed) fetch each article's text
        articles.append({'url': url, 'text': get_article_text(url)})
    # Save the scraped articles into a pandas DataFrame and then to CSV
    df = pd.DataFrame(articles)
    df.to_csv(output_file, index=False, encoding='utf-8')
    print(f"Scraping complete! Data saved to {output_file}")
Running the scraper may emit BeautifulSoup's XMLParsedAsHTMLWarning (the pages are served as XML but parsed with an HTML parser); since the extracted text is unaffected, the warning can be filtered:
import warnings
from bs4 import XMLParsedAsHTMLWarning
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
data=pd.read_csv("maithili_articles.csv")
data['text'].head()
import nltk
nltk.download('punkt_tab')
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec, FastText
# Sample text column (Update this based on your dataset's text column name)
text_column = "text" # Change to the actual column name if different
data = data.dropna(subset=[text_column]) # Drop missing values
# Text Preprocessing
def preprocess_text(text):
    text = re.sub(r'[^ऀ-ॿ ]', '', text)  # Keep only Maithili (Devanagari) characters and spaces
    tokens = word_tokenize(text)
    return tokens if tokens else ["empty"]  # Prevent empty lists
nltk.download('punkt')
data['tokens'] = data[text_column].apply(preprocess_text)
# 1. TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(analyzer='word', tokenizer=lambda x: x, preprocessor=lambda x: x)
X_tfidf = tfidf_vectorizer.fit_transform(data['tokens'])
# 2. Train Word2Vec (CBOW) (cell not shown in the extract; settings assumed to mirror the FastText call below)
word2vec_cbow = Word2Vec(sentences=data['tokens'], vector_size=100, window=5, min_count=2, workers=4, sg=0)  # CBOW Word2Vec
# 3. Train FastText
fasttext_model = FastText(sentences=data['tokens'], vector_size=100, window=5, min_count=2, workers=4, sg=1)  # Skip-gram FastText
# Example Usage:
print("TF-IDF Shape:", X_tfidf.shape)
print("Word2Vec CBOW Example:", word2vec_cbow.wv.most_similar("मैथिली", topn=5)) # Replace with a valid Maithili word
print("FastText Example:", fasttext_model.wv.most_similar("मैथिली", topn=5)) # Replace with a valid Maithili word
# Intrinsic Evaluation
## Word Similarity
def word_similarity(model, word):
    try:
        return model.wv.most_similar(word, topn=5)
    except KeyError:
        return "Word not in vocabulary"
Word Similarity (CBOW): [('आऽ', 0.9916399717330933), ('हिनक', 0.9912693500518799), ('ई', 0.9899354577064514), ('आ', 0.98990559577
Word Similarity (FastText): [('२मैथिली', 0.9999423623085022), ('१मैथिली', 0.9999391436576843), ('१३मैथिली', 0.9998805522918701), ('मैथिलीक
## Word Analogy
def word_analogy(model, positive_words, negative_words):
    try:
        return model.wv.most_similar(positive=positive_words, negative=negative_words, topn=5)
    except KeyError:
        return "Words not in vocabulary"
## t-SNE Visualization
def plot_embeddings(model, title):
    words = list(model.wv.index_to_key)[:100]  # 100 most frequent words
    vectors = np.array([model.wv[word] for word in words])
    tsne = TSNE(n_components=2, random_state=42)
    reduced_vectors = tsne.fit_transform(vectors)
    plt.figure(figsize=(10, 6))
    plt.scatter(reduced_vectors[:, 0], reduced_vectors[:, 1])
    for word, (x, y) in zip(words, reduced_vectors):
        plt.text(x, y, word, fontsize=8)
    plt.title(title)
    plt.show()
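The function can then be applied to each trained model (titles are illustrative; the calls assume numpy, matplotlib, and scikit-learn's TSNE, imported in the block further below, are available):
plot_embeddings(word2vec_cbow, "t-SNE of Word2Vec (CBOW) embeddings")
plot_embeddings(fasttext_model, "t-SNE of FastText embeddings")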
import nltk
nltk.download('averaged_perceptron_tagger_eng')
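A minimal sketch of how the tagged output below may have been produced, applying NLTK's English perceptron tagger directly to Maithili tokens (the choice of sample row is an assumption):
from nltk import pos_tag
sample_tokens = word_tokenize(data[text_column].iloc[0])  # tokens from the first article (assumed sample)
print("POS Tagging Example:", pos_tag(sample_tokens))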
POS Tagging Example: [('मैथिली', 'JJ'), ('भाषा', 'NNP'), ('भारत', 'NNP'), ('में', 'NNP'), ('प्राचीन', 'NNP'), ('भाषाओं', 'NNP'), ('में', 'NNP
# Comparative Investigation
## Similarity Comparisons
word_a = "भाषा"
word_b = "मैथिली"
word_c = "प्रभात"
similarity_ab = word2vec_cbow.wv.similarity(word_a, word_b)
similarity_ac = word2vec_cbow.wv.similarity(word_a, word_c)
print(f"Similarity {word_a}-{word_b}: {similarity_ab}, {word_a}-{word_c}: {similarity_ac}")
print("Is A more similar to B than C?", similarity_ab > similarity_ac)
import pandas as pd
import re
import nltk
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec, FastText, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.decomposition import LatentDirichletAllocation
from transformers import BertModel, BertTokenizer
import torch
# Load dataset
data = pd.read_csv("maithili_articles.csv")
# Sample text column (Update this based on your dataset's text column name)
text_column = "text" # Change to the actual column name if different
data = data.dropna(subset=[text_column]) # Drop missing values
# Text Preprocessing
def preprocess_text(text):
    text = re.sub(r'[^ऀ-ॿ ]', '', text)  # Keep only Maithili (Devanagari) characters and spaces
    sentences = sent_tokenize(text)  # Sentence tokenization
    tokenized_sentences = [word_tokenize(sent) for sent in sentences]
    return tokenized_sentences
nltk.download('punkt')
data['tokenized_sentences'] = data[text_column].apply(preprocess_text)
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import warnings
# Ignore warnings
warnings.filterwarnings('ignore')
# Load a multilingual BERT checkpoint (the checkpoint name is assumed; substitute the one actually used)
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).numpy()  # mean-pool the token embeddings
data['bert_embeddings'] = data[text_column].apply(get_bert_embedding)
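The example prints below also reference LSA, LDA, and Doc2Vec models whose training cells are not shown above; a minimal sketch of those steps, with illustrative rather than the report's exact parameters, is:
flat_tokens = data['tokenized_sentences'].apply(lambda sents: [w for s in sents for w in s])
count_vectorizer = CountVectorizer(analyzer='word', tokenizer=lambda x: x, preprocessor=lambda x: x)
X_counts = count_vectorizer.fit_transform(flat_tokens)
X_lsa = TruncatedSVD(n_components=100, random_state=42).fit_transform(X_tfidf)  # LSA over the TF-IDF matrix
X_lda = LatentDirichletAllocation(n_components=10, random_state=42).fit_transform(X_counts)  # per-document topic mixtures
tagged_docs = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(flat_tokens)]
doc2vec_model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2, workers=4)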
# Example Usage
print("TF-IDF Shape:", X_tfidf.shape)
print("LSA Shape:", X_lsa.shape)
print("LDA Shape:", X_lda.shape)
print("Word2Vec CBOW Example:", word2vec_cbow.wv.most_similar("मैथिली", topn=5)) # Replace with a valid Maithili word
print("FastText Example:", fasttext_model.wv.most_similar("मैथिली", topn=5)) # Replace with a valid Maithili word
print("Doc2Vec Example:", doc2vec_model.dv.most_similar([doc2vec_model.dv[0]], topn=5)) # Corrected index handling
print("BERT Embedding Shape:", data['bert_embeddings'][0].shape)