Web Mining

(CIE-431P)
B.Tech. Programme
(IT)

LAB MANUAL

Maharaja Surajmal Institute of Technology


(NAAC and NBA Accredited)
Affiliated to GGSIP University
C-4, Janak Puri, New Delhi - 110058
Web Mining CIE-431P

Department of Information Technology

VISION

• Foster innovation and research in students to solve future challenges using computing

MISSION

• M1: Create educational pathways for students' career success.
• M2: Encourage curiosity and innovation with effective technology use.
• M3: Develop management skills, integrity, and values through diverse activities.

Program Educational Objectives (PEOs)

• PEO 1: Graduates are ready for IT roles, applying new ideas and knowledge.
• PEO 2: Graduates will excel as team members and grow into leadership roles.
• PEO 3: Graduates apply computing principles to complete software projects successfully.
• PEO 4: Graduates can pursue higher education and contribute to research and IT development.

Name of the Subject: Web Mining Lab
Subject Code: CIE-431P
Semester/Year: 7th / 4th Year
Internal Marks: 40
External Marks: 60

Course Objective

To understand and apply web mining techniques such as data pre-processing, analyzing link
structures, and mining web usage and content. Additionally, students will gain proficiency in web
analysis tools and technologies used for tasks like sentiment analysis, opinion mining, and
recommendation systems.

Course Outcomes

1. Able to understand and implement the key concepts of web page ranking algorithms.
2. Able to design and implement methods for analyzing the link structure of the web.
3. Able to perform text and webpage pre-processing for web mining tasks.
4. Able to analyze social networks using web mining techniques.
5. Able to perform opinion mining and sentiment analysis from web data.
6. Able to implement methods for privatizing web content.
7. Able to understand and implement web usage mining techniques.
8. Able to design and implement recommender systems.
9. Able to mine web structures for better content organization and analysis.

CONTENTS

1. Introduction

2. Hardware and Software requirements

3. Marking scheme for the Practical lab Examination

4. List of Experiments

5. Detail of Experiments

6. Expected Viva Voce Questions

7. References

1. INTRODUCTION

Web mining is a sophisticated process that involves extracting useful information and knowledge from
web data. It combines techniques from data mining, machine learning, and statistics to analyze and
interpret vast amounts of unstructured data available on the internet. Understanding web mining is
crucial for developing applications that leverage online information for decision-making, trend
analysis, and user engagement.

The Web Mining Lab provides students with a comprehensive understanding of key concepts,
techniques, and tools related to web mining. This includes the study of web content mining, web
structure mining, and web usage mining. Through practical experiments, students will learn how to
preprocess text data, analyze web structures, and implement algorithms for tasks such as opinion
mining and sentiment analysis.

The objective of the Web Mining Laboratory is to equip students with both theoretical foundations and
practical skills necessary to conduct web mining projects effectively.

Course Outcomes

1. Able to understand and implement the fundamental concepts of web content mining.
2. Able to analyze and interpret the link structure of the web using web structure mining techniques.
3. Able to preprocess text data for effective web mining.
4. Able to apply social network analysis methods to derive insights from web data.
5. Able to perform opinion mining and sentiment analysis on online content.
6. Able to implement web usage mining techniques to analyze user behavior.
7. Able to design and develop recommender systems based on user data.

2. Lab Requirements

Software Requirements:

• Python (with libraries such as Pandas, NumPy, Scikit-learn, and NLTK)
• R (for statistical analysis and data mining)
• Web scraping tools (e.g., Beautiful Soup, Scrapy)
• Database management system (e.g., MySQL or MongoDB)
• Data visualization tools (e.g., Tableau or Matplotlib)

Operating System:

• Windows 10 or Linux (Ubuntu)

Hardware Requirements:

• Processor: Intel Core i3 or higher
• RAM: Minimum 8 GB
• Storage: 256 GB HDD or SSD
• Network Card: LAN Card (10/100 Mbps)

3. Marking Scheme for the Practical Lab Exam

There will be two practical exams in each semester.

• Internal Practical Exam
• External Practical Exam

Internal Practical Exam:

It is taken by the concerned Faculty member of the batch.

Marking Scheme:

Total Marks: 40

Division of 40 marks is as follows:

• Punctuality/Attendance: 10
• File (Organization and File checking status): 10
• Viva Voce: 20

NOTE: In every lab session, each experiment performed is marked out of 40. At the end of the
semester, the internal mark is the average of these marks, also out of 40.

External Practical Exam:

It is conducted jointly by the concerned faculty member of the batch and an external examiner. In
this exam, the student performs the experiment allotted at the time of the examination and writes
the details asked by the examiner on the answer sheet provided. Finally, the external examiner
conducts the viva.

Marking Scheme:

Total Marks: 60

Division of 60 marks is as follows:

a. Evaluation of the answer sheet: 20
b. Viva Voce: 15
c. Experiment performance: 15
d. File submitted: 10

NOTE:

• Internal marks (40) + External marks (60) = Total marks given to the student (100)
• Experiments given to perform can be from any section of the lab.



4. List of Experiments

Subject: Web Mining Lab Subject Code: CIE-431P

1. Implement the Page Rank Algorithm to rank web pages based on their importance.
2. Analyze web link structures using the Page Rank algorithm to understand connectivity and
authority.
3. Perform text and webpage pre-processing to prepare data for analysis.
4. Conduct social network analysis to examine relationships and structures within social networks.
5. Implement opinion mining techniques to extract opinions from online content.
6. Perform sentiment analysis on textual data to determine positive, negative, or neutral
sentiments.
7. Develop a program for web content privatization to control access to sensitive information.
8. Conduct web usage mining to analyze user behavior and interactions with websites.
9. Design a recommender system to provide personalized content recommendations based on user
preferences.
10. Perform web structure mining to extract information about the underlying structure of the web.


5. Detail of Experiments

EXPERIMENT-1

Aim: Implement the PageRank algorithm to rank web pages based on their importance.

PageRank algorithm

The PageRank algorithm ranks web pages based on their importance by analyzing the links
between them. It assigns a score to each page, reflecting the likelihood that a user will land on that
page through random navigation.

Algorithm
1. Initialize ranks for all pages.
2. Create a transition matrix based on link structure.
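3. Iteratively update PageRank scores using the formula:

$$PR(P_i) = \frac{1 - d}{N} + d \sum_{P_j \in M(i)} \frac{PR(P_j)}{L(P_j)}$$

where d is the damping factor (usually 0.85), N is the number of pages, M(i) is the set of pages
linking to page i, and L(P_j) is the number of outbound links from page j.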

Code

import numpy as np


def page_rank(links, damping_factor=0.85, num_iterations=100):
    # links[i][j] is 1 if page i links to page j, otherwise 0.
    N = len(links)
    ranks = np.ones(N) / N  # start from a uniform distribution
    transition_matrix = np.zeros((N, N))

    # Row-normalize: each page splits its rank equally among its outgoing links.
    for i in range(N):
        outgoing_links = np.sum(links[i])
        if outgoing_links > 0:
            transition_matrix[i] = links[i] / outgoing_links

    # Apply the PageRank update a fixed number of times.
    for _ in range(num_iterations):
        new_ranks = (1 - damping_factor) / N + damping_factor * (transition_matrix.T @ ranks)
        ranks = new_ranks

    return ranks


if __name__ == "__main__":
    links = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 1, 0, 0]])

    ranks = page_rank(links)
    print("PageRank Scores:", ranks)

Output

EXPERIMENT-2

Aim: Analyze web link structures using the Page Rank algorithm to understand connectivity and
authority.

Introduction:

The PageRank algorithm helps rank web pages by evaluating the number and quality of links between
them. It assigns a score (PageRank) to each page, indicating its importance. Pages with many links
from important pages receive higher ranks, showing authority and connectivity in the web structure.

Theory:

• Connectivity: Measures how well pages are linked. Pages with many incoming links from
other pages show strong connectivity.
• Authority: Pages that receive links from many important pages have high authority. The more
authoritative a page, the higher its PageRank score.

Steps of the PageRank Algorithm:

1. Initialize Ranks: Assign an initial equal rank to all web pages.
2. Build the Transition Matrix: Create a matrix representing the probability of navigating from
one page to another.
3. Update Ranks Iteratively: Use the formula to calculate new ranks over multiple iterations:

$$PR(P_i) = \frac{1 - d}{N} + d \sum_{P_j \in M(i)} \frac{PR(P_j)}{L(P_j)}$$

where:

• PR(Pi): Rank of page i
• N: Total number of pages
• d: Damping factor (probability of continuing to another page, usually 0.85)
• M(i): Pages linking to page i
• L(Pj): Number of outbound links from page j

4. Convergence: Repeat the process until the ranks stabilize or reach a set number of iterations.

Algorithm (Step-by-Step):

1. Initialize: Start with equal ranks for all pages (1/N, where N is the number of pages).
2. Build the Matrix: Create a matrix based on the links between the pages (outgoing links
normalized to 1).
3. Iterate: Update the ranks using the transition matrix and the PageRank formula.
4. Stop Condition: Stop when the ranks converge (don’t change significantly) or after a fixed
number of iterations.
5. Result: Output the final PageRank scores for each page, showing their relative importance.
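
The stop condition in step 4 can be sketched as follows. This is a minimal illustration in plain Python rather than the manual's own listing: the function name `page_rank_until_converged`, the tolerance `tol`, and the cap `max_iterations` are assumptions chosen for the example, and the change between iterations is measured as the sum of absolute differences.

```python
def page_rank_until_converged(links, damping_factor=0.85, tol=1e-8, max_iterations=1000):
    """PageRank that stops once the scores stop changing noticeably.

    links[i][j] is 1 if page i links to page j, else 0.
    tol and max_iterations are illustrative defaults, not values from the manual.
    """
    n = len(links)
    out_degree = [sum(row) for row in links]
    ranks = [1.0 / n] * n

    for iteration in range(1, max_iterations + 1):
        new_ranks = []
        for i in range(n):
            # Sum the rank shares flowing into page i from every page j that links to it.
            incoming = sum(ranks[j] / out_degree[j] for j in range(n) if links[j][i])
            new_ranks.append((1 - damping_factor) / n + damping_factor * incoming)

        # Total absolute change between successive rank vectors.
        change = sum(abs(a - b) for a, b in zip(new_ranks, ranks))
        ranks = new_ranks
        if change < tol:
            break

    return ranks, iteration


example_links = [[0, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 0, 0, 1],
                 [0, 1, 0, 0]]
scores, iterations_used = page_rank_until_converged(example_links)
print("PageRank Scores:", scores)
print("Iterations used:", iterations_used)
```

For this four-page graph the loop stops well before the iteration cap, and Page 2 receives the highest score because it is linked from both Page 0 and Page 1.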

Example:

Consider 4 web pages (Page 0, 1, 2, 3) with the following link structure:

• Page 0 links to Page 1 and Page 2.


• Page 1 links to Page 2.
• Page 2 links to Page 0 and Page 3.
• Page 3 links to Page 1.

The PageRank algorithm will compute ranks for each page, showing which one is more connected and
authoritative.

Python Code:

import numpy as np


def page_rank(links, damping_factor=0.85, num_iterations=100):
    # links[i][j] is 1 if page i links to page j, otherwise 0.
    N = len(links)
    ranks = np.ones(N) / N  # start from a uniform distribution
    transition_matrix = np.zeros((N, N))

    # Row-normalize: each page splits its rank equally among its outgoing links.
    for i in range(N):
        outgoing_links = np.sum(links[i])
        if outgoing_links > 0:
            transition_matrix[i] = links[i] / outgoing_links

    # Apply the PageRank update a fixed number of times.
    for _ in range(num_iterations):
        new_ranks = (1 - damping_factor) / N + damping_factor * (transition_matrix.T @ ranks)
        ranks = new_ranks

    return ranks


if __name__ == "__main__":
    # Link structure from the example above: row i lists the pages that Page i links to.
    links = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 1, 0, 0]])

    ranks = page_rank(links)
    print("PageRank Scores:", ranks)

Output

EXPERIMENT-3

Aim: Perform text and webpage pre-processing to prepare data for analysis.

Pre-processing Steps:

Webpages contain a lot of unstructured data like HTML tags, JavaScript, CSS, etc. Pre-processing is
essential to extract relevant information (links between web pages) and prepare the data for PageRank
analysis.

Steps for Pre-processing:

Data Collection:

• Download or extract the HTML content of webpages.
• For real web data, you can use libraries such as Requests to download pages and BeautifulSoup
to parse them.

Remove Unnecessary Data:

• Strip out HTML tags, JavaScript, CSS, and other non-relevant parts of the webpage content.
• Focus on extracting the links (URLs) between pages, as these are needed for building the link
structure.

Extract Hyperlinks:

• Use an HTML parser to extract all hyperlinks (<a> tags).
• Identify the URLs present in the href attributes, as these indicate connections between
different web pages.
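
As a rough illustration of hyperlink extraction, the sketch below uses only Python's standard library (`html.parser` and `urllib.parse`) instead of BeautifulSoup, so it runs without extra installs. The class name `LinkExtractor` and the sample HTML are made up for the example. Relative links are resolved against the page URL so that they can later be matched against the list of known pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative paths such as "/page2" against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content used only for illustration.
sample_html = """
<html><body>
  <a href="/page2">Page 2</a>
  <a href="http://example.com/page3">Page 3</a>
  <a name="no-link">Not a hyperlink</a>
</body></html>
"""

extractor = LinkExtractor("http://example.com/page1")
extractor.feed(sample_html)
print(extractor.links)
# ['http://example.com/page2', 'http://example.com/page3']
```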

Create an Adjacency Matrix:

• Convert the extracted hyperlinks into a matrix where rows and columns represent pages, and
the entries indicate whether a page links to another.
• For example, if Page A links to Page B, the matrix will have a 1 at the corresponding position.

Normalize Links:

• Normalize the number of links by dividing each link by the total number of outgoing links from
the source page, so all links from a page sum to 1.
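
The normalization step can be sketched in plain Python as below. The helper name `normalize_rows` and the three-page matrix are assumptions for illustration. Pages without outgoing links are left as all-zero rows here, which is one possible choice rather than the only one.

```python
def normalize_rows(adjacency):
    """Divide each row by its sum so a page's outgoing link weights add up to 1."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        if total == 0:
            # A page with no outgoing links keeps an all-zero row in this sketch.
            normalized.append([0.0] * len(row))
        else:
            normalized.append([value / total for value in row])
    return normalized


# Hypothetical adjacency matrix: page 0 links to pages 1 and 2, page 1 links to page 2.
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]
transition = normalize_rows(adjacency)
print(transition)
# [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```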

Tools/Libraries for Pre-processing:

• BeautifulSoup (for extracting text and links from webpages)
• Requests (for accessing webpage content)
• Regular Expressions (for cleaning text data)

Python Code:

import requests
from bs4 import BeautifulSoup
import numpy as np


def extract_links(url):
    # Download the page and collect the href value of every <a> tag.
    page = requests.get(url, timeout=10)
    soup = BeautifulSoup(page.content, "html.parser")
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    return links


def build_adjacency_matrix(links, all_pages):
    # links[i] is the list of URLs that all_pages[i] links to.
    N = len(all_pages)
    matrix = np.zeros((N, N))
    for i, page in enumerate(links):
        for link in page:
            if link in all_pages:
                j = all_pages.index(link)
                matrix[i][j] = 1
    return matrix


if __name__ == "__main__":
    url = "http://example.com"
    webpage_links = extract_links(url)
    print("Extracted Links:", webpage_links)

    pages = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
    links = [["http://example.com/page2"], ["http://example.com/page3"], ["http://example.com/page1"]]
    adj_matrix = build_adjacency_matrix(links, pages)
    print("Adjacency Matrix:\n", adj_matrix)

Output :

EXPERIMENT-4

Aim: Conduct social network analysis to examine relationships and structures within social networks.

Definition:

Social Network Analysis (SNA) is a method used to study the relationships and interactions between
individuals, groups, or entities within a network. It focuses on understanding how the structure of these
connections influences the behavior, importance, and position of each node (e.g., individual, webpage)
in the network.

Purpose of Social Network Analysis:

The aim is to analyze how relationships form, which nodes are most influential, and how information
or influence spreads across the network. This is especially useful in identifying key players,
influencers, or important connections within the network.

Steps for Social Network Analysis:

Representation of Social Network:

• Represent the social network as a graph where each node represents an individual (e.g., user,
webpage), and the edges (lines between nodes) represent relationships (e.g., friendship,
hyperlinks).

Building Adjacency Matrix:

• Use an adjacency matrix to capture the structure of the network. In this matrix, rows and
columns represent nodes, and a value of 1 indicates a relationship (edge) between two nodes,
while 0 indicates no connection.

Applying the PageRank Algorithm:

• PageRank is used to rank the nodes based on their importance. Nodes that have more incoming
connections or links from other influential nodes receive higher scores. In a social network, this
helps identify central or influential individuals.

Centrality Analysis:

• Centrality measures, such as degree centrality (number of direct connections), betweenness
centrality (how often a node appears on the shortest path between two other nodes), and
PageRank, help determine the most important or influential nodes in the network.

Network Visualization:

• Visualizing the network graph allows for a clearer understanding of the relationships and
structure within the network. Key clusters, influencers, and the general structure (e.g.,
hierarchical, decentralized) can be observed.

Interpretation of Results:

• Use the PageRank scores and centrality measures to determine which nodes are the most
influential or critical to the network. You can also analyze the density of connections (how
closely knit the network is), identify subgroups or communities, and understand the overall
structure of the network.
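
Two of the measures mentioned above, degree centrality and network density, can be computed directly from an adjacency matrix. The sketch below is a plain-Python illustration that assumes an undirected network (a symmetric matrix with no self-loops). The function names are chosen for the example, and betweenness centrality is left out because it needs a shortest-path computation.

```python
def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    return [sum(row) / (n - 1) for row in adjacency]


def density(adjacency):
    """Number of edges present divided by the number of possible edges."""
    n = len(adjacency)
    edges = sum(sum(row) for row in adjacency) / 2  # each undirected edge is counted twice
    possible_edges = n * (n - 1) / 2
    return edges / possible_edges


# Symmetric adjacency matrix of a small four-node network.
adj_matrix = [[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]]

centrality = degree_centrality(adj_matrix)
print("Degree centrality:", centrality)
print("Density:", density(adj_matrix))
```

Nodes 1 and 2 are connected to all three other nodes, so their degree centrality is 1.0. The network has 5 of its 6 possible edges, giving a density of about 0.83.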

Python Code:

import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

adj_matrix = np.array([[0, 1, 1, 0],
                       [1, 0, 1, 1],
                       [1, 1, 0, 1],
                       [0, 1, 1, 0]])

# from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array is its replacement.
G = nx.from_numpy_array(adj_matrix, create_using=nx.DiGraph)

pagerank_scores = nx.pagerank(G)
print("PageRank Scores:", pagerank_scores)

plt.figure(figsize=(8, 6))
nx.draw(G, with_labels=True, node_color='lightblue', node_size=2000, font_size=10,
        font_color='black')
plt.show()

Output

EXPERIMENT-5

Aim: Implement opinion mining techniques to extract opinions from online content.

Definition:

Opinion Mining, also known as Sentiment Analysis, is a technique used to extract subjective
information from online content. It focuses on identifying and classifying opinions expressed in text,
such as positive, negative, or neutral sentiments, to understand public attitudes and feelings towards a
particular topic.

Steps to Implement Opinion Mining:

Data Collection:

• Gather online content such as product reviews, social media posts, blog comments, or news
articles. You can use web scraping techniques or APIs to collect text data.

Text Preprocessing:

• Clean the text by removing unwanted elements such as HTML tags, punctuation, numbers, and
stop words (commonly used words like "the", "and", "is").
• Convert the text to lowercase and apply techniques like stemming or lemmatization to reduce
words to their root forms (e.g., "running" → "run").
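
A minimal sketch of the cleaning steps is shown below, using only the standard library. The stop-word set is a tiny illustrative list, not a complete one, and stemming or lemmatization is left out because it normally relies on a library such as NLTK.

```python
import re

# A tiny illustrative stop-word list; a real project would use a fuller list.
STOP_WORDS = {"the", "and", "is", "a", "an", "it", "this", "was", "to", "of"}


def preprocess(text):
    """Lowercase, strip HTML tags, punctuation and digits, then drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("<p>The service was GREAT, and it is worth 5 stars!</p>"))
# ['service', 'great', 'worth', 'stars']
```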

Feature Extraction:

• Convert the processed text into numerical representations using techniques like Bag of Words
(BoW), Term Frequency-Inverse Document Frequency (TF-IDF), or word embeddings like
Word2Vec or GloVe.
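
To make Bag of Words and TF-IDF concrete, the sketch below computes both by hand for three tiny, made-up token lists. It uses the basic unsmoothed weighting, term frequency multiplied by log(N / document frequency). Libraries often apply smoothed variants, so their numbers may differ.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Map each term to the number of times it occurs in one document."""
    return Counter(tokens)


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = bag_of_words(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents, already tokenized.
docs = [["great", "product"], ["terrible", "product"], ["great", "service"]]
tfidf_scores = tf_idf(docs)
print(tfidf_scores[1])
```

In the second document, "terrible" gets a higher weight than "product" because it appears in only one of the three documents.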

Sentiment Classification:

• Use machine learning or deep learning algorithms to classify the sentiments of the text. Popular
approaches include:
• Naive Bayes for traditional machine learning.
• LSTM (Long Short-Term Memory) or BERT (Bidirectional Encoder Representations from
Transformers) for deep learning.

Model Training:

• Train a sentiment analysis model using labeled data (text tagged as positive, negative, or
neutral). Split the data into training and testing sets to evaluate the model's performance.

Opinion Extraction:

• Once the model is trained, apply it to unseen text data to extract opinions, classifying the
sentiment as positive, negative, or neutral.
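
The training and extraction steps can be illustrated with a very small multinomial Naive Bayes classifier written from scratch. This is only a sketch: the training examples are made up, the function names are hypothetical, add-one (Laplace) smoothing is an assumed choice, and the train/test split described above is omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        # Log prior plus the log likelihood of each token, with add-one smoothing.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training data, for illustration only.
training = [
    (["love", "amazing", "product"], "positive"),
    (["great", "service", "love"], "positive"),
    (["terrible", "service"], "negative"),
    (["poor", "quality", "terrible"], "negative"),
]
model = train_naive_bayes(training)
print(predict(["love", "amazing"], model))      # positive
print(predict(["terrible", "quality"], model))  # negative
```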

Python Code:

from textblob import TextBlob

reviews = [
    "I absolutely love this product! It's amazing.",
    "The service was terrible and I won't be returning.",
    "It was okay, not too bad but could be better.",
]

for review in reviews:
    blob = TextBlob(review)
    sentiment = blob.sentiment.polarity  # polarity ranges from -1.0 to 1.0

    if sentiment > 0:
        opinion = "Positive"
    elif sentiment < 0:
        opinion = "Negative"
    else:
        opinion = "Neutral"

    print(f"Review: {review}\nSentiment: {opinion}\n")

Output

EXPERIMENT-6

Aim: Perform sentiment analysis on textual data to determine positive, negative, or neutral
sentiments.

Definition:

Sentiment Analysis is the process of using natural language processing (NLP) techniques to determine
the emotional tone behind a body of text. The goal is to classify the text into categories such as
positive, negative, or neutral sentiments. This helps in understanding public opinion or customer
feedback from large datasets.

Steps to Perform Sentiment Analysis:

Data Collection:

• Collect textual data from sources such as social media, product reviews, or news articles.

Text Preprocessing:

• Clean the text data by removing unwanted characters like special symbols, HTML tags, and
punctuation.
• Convert the text to lowercase and perform tokenization (splitting text into individual words).
• Remove stop words (common words that don't carry sentiment, like "the", "and").

Sentiment Classification:

• Use an NLP library or a machine learning model to classify the sentiment of the processed text.
• Libraries like TextBlob, VADER, or NLTK can easily classify sentiment.

Interpretation of Sentiments:

• The tool or model assigns a polarity score to each piece of text:
• Positive sentiment: Polarity > 0.
• Negative sentiment: Polarity < 0.
• Neutral sentiment: Polarity = 0.

Output and Analysis:

• Once sentiment is classified, the results can be used to gauge overall mood or opinion in large
datasets.
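
The thresholds above, and the idea of gauging overall mood, can be sketched as follows. The polarity values are hypothetical stand-ins for the scores a sentiment library would return, and `label_from_polarity` is a name chosen for this example.

```python
from collections import Counter


def label_from_polarity(polarity):
    """Map a polarity score to a sentiment label using the thresholds above."""
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


# Hypothetical polarity scores, standing in for the output of a sentiment library.
polarities = [0.8, -0.4, 0.0, 0.35, -0.1]

labels = [label_from_polarity(p) for p in polarities]
summary = Counter(labels)
average_polarity = sum(polarities) / len(polarities)

print("Label counts:", dict(summary))
print("Average polarity:", average_polarity)
```

Counting the labels shows how opinion is distributed, while the average polarity gives a single figure for the overall mood of the dataset.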

Python Code:

from textblob import TextBlob

texts = [
    "This movie was fantastic! I enjoyed every moment.",
    "The product quality is very poor. It broke after one use.",
    "I'm not sure how I feel about the service, it's just okay.",
]

for text in texts:
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity

    if sentiment > 0:
        opinion = "Positive"
    elif sentiment < 0:
        opinion = "Negative"
    else:
        opinion = "Neutral"

    print(f"Text: {text}\nSentiment: {opinion}\n")

Output

EXPERIMENT-7

Aim: Conduct web usage mining to analyze user behavior and interactions with websites.

Definition:

Web Usage Mining is the process of extracting useful information from server logs, web traffic data,
and user interactions to understand how users navigate websites. It helps identify usage patterns,
improve user experience, and personalize web services.

Key Techniques of Web Usage Mining:

Data Collection:

• Web usage data is collected from web server logs, browser cookies, or databases. This data
contains information like user IP addresses, visited URLs, time spent on pages, and clickstream
data.

Data Preprocessing:

• Clean and format the raw web usage data by removing irrelevant or redundant information
(e.g., failed requests, bots). The data is then converted into a usable format for analysis.
• Session identification is performed, where user activities are grouped into sessions,
representing a user's actions during a single visit to the website.
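
Session identification is often done with an inactivity timeout. The sketch below assumes a 30-minute gap as the cut-off, a common heuristic but not a value given in this manual, and uses made-up log entries with timestamps in seconds.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed value)


def identify_sessions(log_entries, timeout=SESSION_TIMEOUT):
    """log_entries: iterable of (user, timestamp_seconds, page).

    Returns {user: [[page, ...], ...]} with one inner list per session.
    """
    sessions = defaultdict(list)
    last_seen = {}
    for user, timestamp, page in sorted(log_entries, key=lambda entry: (entry[0], entry[1])):
        # Start a new session on a user's first hit or after a long gap.
        if user not in last_seen or timestamp - last_seen[user] > timeout:
            sessions[user].append([])
        sessions[user][-1].append(page)
        last_seen[user] = timestamp
    return dict(sessions)


# Hypothetical log entries: (user, seconds since a reference time, page).
entries = [
    ("user1", 0, "home"), ("user1", 120, "product"), ("user1", 4000, "home"),
    ("user2", 50, "home"), ("user2", 90, "checkout"),
]
user_sessions = identify_sessions(entries)
print(user_sessions)
# {'user1': [['home', 'product'], ['home']], 'user2': [['home', 'checkout']]}
```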

Pattern Discovery:

• Use data mining algorithms to find patterns in user behavior. Techniques include:
• Clustering: Group users with similar browsing behaviors.
• Association Rule Mining: Identify frequently visited pages together (e.g., “People who visit
Page A also visit Page B”).
• Sequential Pattern Mining: Detect common navigation sequences (e.g., "Users often visit Page
A, then Page C").
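
As one way to illustrate association rule mining, the sketch below counts how often pairs of pages occur in the same session and reports support and confidence for each rule. It is limited to page pairs, and the `min_support` threshold of 0.5 and the function name are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def page_pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) rules for page pairs."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of sessions containing both pages
        if support < min_support:
            continue
        # Confidence: of the sessions with the antecedent, the share that also has the consequent.
        rules.append((a, b, support, count / page_counts[a]))
        rules.append((b, a, support, count / page_counts[b]))
    return rules


sessions = [["home", "product", "cart"],
            ["home", "product", "checkout"],
            ["home", "product", "cart"]]
rules = page_pair_rules(sessions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

With these three sessions, rules involving "checkout" are dropped because the page appears in only one session, while "home -> product" has support and confidence of 1.0.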

Pattern Analysis:

• Analyze the discovered patterns to generate insights about user preferences, common paths, or
frequent behaviors. This helps in optimizing website design, improving recommendation
systems, and enhancing the overall user experience.

Python Code:

from collections import defaultdict

web_logs = [
    ('user1', 'home'), ('user1', 'product'), ('user1', 'cart'),
    ('user2', 'home'), ('user2', 'product'), ('user2', 'checkout'),
    ('user3', 'home'), ('user3', 'product'), ('user3', 'cart'),
]

# Group page visits by user, preserving the order of visits.
user_sessions = defaultdict(list)
for log in web_logs:
    user_sessions[log[0]].append(log[1])

# Count consecutive page-to-page transitions across all sessions.
patterns = defaultdict(int)
for session in user_sessions.values():
    for i in range(len(session) - 1):
        patterns[(session[i], session[i+1])] += 1

for pattern, count in patterns.items():
    print(f"Pattern: {pattern}, Count: {count}")

Output
