
Introduction to Web Crawling

Web crawling, closely related to web scraping, is the process by which automated programs called
web crawlers or spiders systematically browse the internet to discover and index content.

How Web Crawling Works:

1. Starting Point: The crawler begins with a list of seed URLs (initial websites to visit).
2. Fetching Content: It downloads the webpage's content, including text, images, and
links.
3. Following Links: The crawler identifies and follows hyperlinks on the page to
discover new pages.
4. Indexing: The extracted data is stored in a database, making it searchable for search
engines like Google.
5. Revisiting Pages: Crawlers revisit pages periodically to capture content changes.
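
Taken together, these steps form a simple loop: pull a URL from a frontier, fetch it, extract its links, and queue them for later visits. The sketch below is a minimal illustration of that loop, assuming the requests and beautifulsoup4 libraries introduced later in this chapter; the max_pages cap and the seed URL are arbitrary demonstration values.

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def simple_crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)      # Step 1: start from the seed URLs
    visited = set()
    index = {}                       # Step 4: a toy "index" mapping URL -> page title

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 200:
            continue

        # Step 2: parse the fetched content
        soup = BeautifulSoup(response.text, "html.parser")
        index[url] = soup.title.string if soup.title else url

        # Step 3: queue the hyperlinks found on this page
        for link in soup.find_all("a", href=True):
            frontier.append(urljoin(url, link["href"]))

    # Step 5: revisiting simply means running this loop again later
    return index

# Example usage (hypothetical seed URL)
print(simple_crawl(["https://example.com"]))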

Why is Web Crawling Important?

• Search Engines (Google, Bing, etc.): Crawlers help index websites, making them searchable.
• SEO Optimization: Websites must be crawlable for good search rankings.
• Data Scraping: Companies use crawlers to collect market research, news, or competitor data.
• Website Monitoring: Businesses monitor competitors or changes on specific web pages.

Building a Simple Web Crawler in Python


You can build a simple web crawler using Python with the requests and BeautifulSoup
libraries.

Step 1: Install Dependencies


pip install requests beautifulsoup4

Step 2: Create a Simple Web Crawler

This script fetches and extracts all links from a webpage.

import requests
from bs4 import BeautifulSoup

def crawl(url):
    headers = {"User-Agent": "Mozilla/5.0"}  # Helps avoid getting blocked
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        print(f"Title: {soup.title.string}\n")  # Print page title
        print("Links found:")

        for link in soup.find_all("a", href=True):  # Find all links
            print(link["href"])
    else:
        print(f"Failed to retrieve {url}, Status Code: {response.status_code}")

# Example usage
crawl("https://example.com")

Building an Advanced Web Crawler with Scrapy


For large-scale web scraping, Scrapy is a more powerful framework.

Step 1: Install Scrapy


pip install scrapy

Step 2: Create a Scrapy Project

Run the following command in your terminal:

scrapy startproject mycrawler
cd mycrawler
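
The startproject command generates a standard layout; the folders relevant to the next step look roughly like this (names below are the defaults Scrapy creates):

mycrawler/
    scrapy.cfg
    mycrawler/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py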

Step 3: Create a Spider

Inside the spiders folder, create a new Python file (my_spider.py) and add:

import scrapy

class MySpider(scrapy.Spider):
    name = "mycrawler"
    start_urls = ["https://example.com"]  # Replace with your target website

    def parse(self, response):
        # Extract page title
        title = response.xpath("//title/text()").get()
        print(f"Title: {title}")

        # Extract all links
        for link in response.xpath("//a/@href").getall():
            yield {"link": response.urljoin(link)}  # Normalize relative URLs

        # Follow links and crawl deeper
        for link in response.xpath("//a/@href").getall():
            yield response.follow(link, callback=self.parse)

Step 4: Run the Spider


Execute the spider from the terminal:

scrapy crawl mycrawler -o output.json

This will save the crawled data into output.json.
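
Because the spider yields one item per link, output.json will contain a JSON array of those items. The entries below are purely illustrative; the actual URLs depend on the pages crawled:

[
    {"link": "https://example.com/about"},
    {"link": "https://example.com/contact"}
]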

Enhancements for a More Powerful Crawler:

• Follow links with restrictions (only crawl certain domains), as sketched below.
• Store data in a database (MongoDB, SQLite, or PostgreSQL).
• Use middlewares to rotate user agents and avoid getting blocked.
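
The following is a minimal sketch of the first and third enhancements, assuming a Scrapy project like the one above. The allowed_domains attribute, custom_settings, and the process_request middleware hook are standard Scrapy features, but the user-agent strings, the middleware module path, and the settings values are illustrative assumptions.

import random
import scrapy

# Hypothetical list of user agents to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

class RotateUserAgentMiddleware:
    # Downloader middleware hook: pick a random user agent for each request
    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)

class RestrictedSpider(scrapy.Spider):
    name = "restricted"
    allowed_domains = ["example.com"]      # Only follow links within this domain
    start_urls = ["https://example.com"]

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Dotted path assumes the middleware above lives in mycrawler/middlewares.py
            "mycrawler.middlewares.RotateUserAgentMiddleware": 543,
        },
    }

    def parse(self, response):
        for link in response.xpath("//a/@href").getall():
            yield response.follow(link, callback=self.parse)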

With these steps, you can build a functional web crawler for scraping and indexing websites
efficiently!
