IR Journal (Printable)
BSc CS
Practical No: 1
Aim: Document Indexing and Retrieval
● Implement an inverted index construction algorithm.
● Build a simple document retrieval system using the constructed index.
Practical:
Input:
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK
inverted_index[term] = documents
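Most of the construction code is elided above; the following is a minimal self-contained sketch of inverted index construction and lookup (the document texts and the search_index helper are illustrative assumptions, not the journal's actual corpus):

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))

# Illustrative documents (assumed)
documents = {
    1: "Information retrieval is the activity of finding documents",
    2: "An inverted index maps terms to the documents that contain them",
    3: "Retrieval systems rank documents for a query",
}

# Build the inverted index: term -> set of document IDs
inverted_index = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        if term not in stop_words:
            inverted_index.setdefault(term, set()).add(doc_id)

def search_index(query):
    # Return the IDs of documents containing every query term
    terms = [t for t in query.lower().split() if t not in stop_words]
    if not terms:
        return set()
    result = inverted_index.get(terms[0], set())
    for term in terms[1:]:
        result = result & inverted_index.get(term, set())
    return result

print(search_index("inverted index"))  # expected: {2}
print(search_index("documents"))       # expected: {1, 2, 3}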
Practical No: 2
Aim: Retrieval Models
● Implement the Boolean retrieval model and process queries.
● Implement the vector space model with TF-IDF weighting and cosine similarity.
Practical:
A) Implement the Boolean retrieval model and process queries:
Input:
documents = {
1: "apple banana orange",
2: "apple banana",
3: "banana orange",
4: "apple"
}
result = index.get(operands[0], set())  # Get the set of document IDs for the first operand
for term in operands[1:]:  # Iterate through the rest of the operands
    result = result.intersection(index.get(term, set()))  # Compute intersection with sets of document IDs
return list(result)  # Return the resulting list of document IDs
# Example queries
query1 = ["apple", "banana"]  # Query for documents containing both "apple" and "banana"
query2 = ["apple", "orange"]  # Query for documents containing "apple" or "orange"
# Printing results
print("Documents containing 'apple' and 'banana':", result1)
print("Documents containing 'apple' or 'orange':", result2)
print("Documents not containing 'orange':", result3)
print("Performed by 740_Pallavi & 743_Deepak")
Output:
B) Implement the vector space model with TF-IDF weighting and cosine similarity:
Input:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Import necessary libraries
import nltk # Import NLTK to download stopwords
from nltk.corpus import stopwords # Import stopwords from NLTK
import numpy as np # Import NumPy library
from numpy.linalg import norm # Import norm function from NumPy's linear algebra module
# Define the training and test sets of text documents
# Fit the transformer to the test set and transform it to TF-IDF representation
transformer.fit(testVectorizerArray)
print()
tfidf = transformer.transform(testVectorizerArray)
print(tfidf.todense())
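The fragment above stops at the TF-IDF transform; a minimal sketch of the cosine-similarity step between the training documents and a test document follows (the trainSet and testSet strings are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import numpy as np
from numpy.linalg import norm

trainSet = ["The sky is blue", "The sun is bright"]  # assumed training documents
testSet = ["The sun in the sky is bright"]           # assumed test document

vectorizer = CountVectorizer(stop_words='english')
transformer = TfidfTransformer()

train_counts = vectorizer.fit_transform(trainSet)
test_counts = vectorizer.transform(testSet)

# Weight both sets with TF-IDF learned from the training counts
train_tfidf = transformer.fit_transform(train_counts).toarray()
test_tfidf = transformer.transform(test_counts).toarray()

# Cosine similarity between each training document and the test document
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (norm(a) * norm(b))) if norm(a) and norm(b) else 0.0

for i, train_vec in enumerate(train_tfidf):
    print(f"Similarity of train doc {i} to test doc:", cosine_similarity(train_vec, test_tfidf[0]))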
Output:
Practical No: 3
Output:
Practical No: 4
Input:
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F-measure: {f_measure}")
Output:
Input:
Practical No: 5
Input:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
# Load the CSV file
df = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Dataset.csv")
data = df["covid"] + " " + df["fever"]
X = data.astype(str) # Feature data (text)
y = df['flu'] # Labels
# Splitting the data into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Converting data into bag-of-words format to train the model
vectorizer = CountVectorizer()
# initializing the converter
X_train_counts = vectorizer.fit_transform(X_train)
# converting the training data
X_test_counts = vectorizer.transform(X_test)
# converting the test data
# using and training the multinomial model of naive bayes algorithm
classifier = MultinomialNB() # initializing the classifier
classifier.fit(X_train_counts, y_train) # training the classifier
# loading another dataset to test if the model is working properly
data1 = pd.read_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test.csv")
new_data = data1["covid"] + " " + data1["fever"]
new_data_counts = vectorizer.transform(new_data.astype(str)) # converting the new data
# making the model to predict the results for new dataset
predictions = classifier.predict(new_data_counts)
# Output the results
print(predictions)
# retrieving the accuracy and classification report
accuracy = accuracy_score(y_test, classifier.predict(X_test_counts))
print(f"\nAccuracy: {accuracy:.2f}")
print("Classification Report: ")
print(classification_report(y_test, classifier.predict(X_test_counts)))
# Convert the predictions to a DataFrame
predictions_df = pd.DataFrame(predictions, columns = ['flu_prediction'])
# concatenate the original DataFrame with the predictions DataFrame
data1 = pd.concat([data1, predictions_df], axis = 1)
# write the DataFrame back to CSV
data1.to_csv(r"C:\Users\Administrator\Documents\Sem 6\IR\Test1.csv", index=False)
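For reference, the code above assumes that Dataset.csv (and Test.csv) contain at least the columns covid, fever, and flu; a minimal illustrative layout (the values shown are assumptions, not the actual dataset):

covid,fever,flu
yes,yes,yes
yes,no,no
no,yes,no
no,no,no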
Output:
Practical No: 6
Practical
Input:
Practical No: 7
Practical
Input:
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.RequestException as err:
        print(f"Request Error: {err}")
    return None
def save_robots_txt(url):
    try:
        robots_url = urljoin(url, '/robots.txt')
        robots_content = get_html(robots_url)
        if robots_content:
            # Save a local copy of robots.txt (the rest of this body is elided in the journal)
            with open('robots.txt', 'w', encoding='utf-8') as file:
                file.write(robots_content)
    except Exception as err:
        print(f"Error saving robots.txt: {err}")
def load_robots_txt():
    try:
        with open('robots.txt', 'rb') as file:
            return file.read().decode('utf-8-sig')
    except FileNotFoundError:
        return None
time.sleep(delay)
html = get_html(url)
if html:
    print(f"Crawling {url}")
    links = extract_links(html, url)
    for link in links:
        recursive_crawl(link, depth + 1, robots_content)
save_robots_txt(start_url)
robots_content = load_robots_txt()
if not robots_content:
    print("Unable to retrieve robots.txt. Crawling without restrictions.")
recursive_crawl(start_url, 1, robots_content)
# Example usage:
print("Performed by 740_Pallavi & 743_Deepak") crawl('https://
wikipedia.com', max_depth=2, delay=2)
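Several helpers (extract_links, the crawl wrapper with its nested recursive_crawl, and the robots.txt check) are elided across the page breaks above; the sketch below shows one way they could fit together, building on the imports and functions already shown (the can_crawl helper, the visited set, and the parameter defaults are assumptions):

def extract_links(html, base_url):
    # Collect absolute links found on the page, staying on the same host
    soup = BeautifulSoup(html, 'html.parser')
    links = set()
    for anchor in soup.find_all('a', href=True):
        link = urljoin(base_url, anchor['href'])
        if urlparse(link).netloc == urlparse(base_url).netloc:
            links.add(link)
    return links

def can_crawl(url, robots_content):
    # Respect robots.txt rules if they could be loaded
    if not robots_content:
        return True
    parser = RobotFileParser()
    parser.parse(robots_content.splitlines())
    return parser.can_fetch('*', url)

def crawl(start_url, max_depth=2, delay=1):
    visited = set()

    def recursive_crawl(url, depth, robots_content):
        if depth > max_depth or url in visited or not can_crawl(url, robots_content):
            return
        visited.add(url)
        time.sleep(delay)
        html = get_html(url)
        if html:
            print(f"Crawling {url}")
            for link in extract_links(html, url):
                recursive_crawl(link, depth + 1, robots_content)

    save_robots_txt(start_url)
    robots_content = load_robots_txt()
    if not robots_content:
        print("Unable to retrieve robots.txt. Crawling without restrictions.")
    recursive_crawl(start_url, 1, robots_content)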
Output:
robots.txt file:
Practical No: 8
Practical
Input:
import numpy as np
return page_ranks
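Only the final return of page_rank survives above; a minimal power-iteration sketch of a complete implementation, consistent with the usage shown below and using the numpy import above, follows (the damping factor, iteration count, and tolerance are assumptions):

def page_rank(graph, damping=0.85, max_iterations=100, tol=1.0e-6):
    n = len(graph)
    page_ranks = np.ones(n) / n  # start from a uniform distribution
    for _ in range(max_iterations):
        new_ranks = np.full(n, (1 - damping) / n)
        for node, out_links in enumerate(graph):
            if out_links:
                share = damping * page_ranks[node] / len(out_links)
                for target in out_links:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank over all nodes
                new_ranks += damping * page_ranks[node] / n
        if np.abs(new_ranks - page_ranks).sum() < tol:
            page_ranks = new_ranks
            break
        page_ranks = new_ranks
    return page_ranks

With the example graph below, nodes 1 and 2 (each linked from three other nodes) end up with the highest ranks, and node 3 (no incoming links) with the lowest.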
# Example usage
if __name__ == "__main__":
# Define a simple directed graph as an adjacency list
# Each index represents a node, and the list at that index contains nodes to which it has outgoing links
web_graph = [
    [1, 2],  # Node 0 has links to Node 1 and Node 2
    [0, 2],  # Node 1 has links to Node 0 and Node 2
    [0, 1],  # Node 2 has links to Node 0 and Node 1
    [1, 2],  # Node 3 has links to Node 1 and Node 2
]
# Calculate PageRank
result = page_rank(web_graph)