Introduction To Web Crawling Chapter - 13
Web crawling is the process in which automated programs, known as web crawlers or spiders, systematically browse the internet to discover and index content; the closely related practice of web scraping focuses on extracting data from the pages a crawler visits. A typical crawler works in five steps (a minimal code sketch follows the list):
1. Starting Point: The crawler begins with a list of seed URLs (initial websites to visit).
2. Fetching Content: It downloads the webpage's content, including text, images, and
links.
3. Following Links: The crawler identifies and follows hyperlinks on the page to
discover new pages.
4. Indexing: The extracted data is stored in a database, making it searchable for search
engines like Google.
5. Revisiting Pages: Crawlers revisit pages periodically to pick up content changes.
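The steps above can be captured in a short breadth-first crawl loop in Python. The sketch below is illustrative rather than production-ready: the names simple_crawl and fetch_links are invented for this example, it uses the requests and BeautifulSoup libraries introduced later in the chapter, and a real crawler would also respect robots.txt, pause between requests, and handle many more error cases.

from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Download the page and return the absolute URLs of the links it contains.
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def simple_crawl(seed_urls, max_pages=20):
    frontier = deque(seed_urls)        # 1. start from the seed URLs
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            links = fetch_links(url)   # 2. fetch the page and extract its links
        except requests.RequestException:
            continue                   # skip pages that fail to download
        visited.add(url)
        for link in links:             # 3. follow links to discover new pages
            if link not in visited:
                frontier.append(link)
        # 4. indexing would happen here (store the page content in a database)
    return visited                     # 5. revisiting: re-run periodically to pick up changes

# Example usage
print(simple_crawl(["https://example.com"]))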
Web crawling has several common applications:
Search Engines (Google, Bing, etc.): Crawlers help index websites, making them searchable.
SEO Optimization: Websites must be crawlable to rank well in search results; whether a page may be crawled is governed in part by its robots.txt file, as sketched after this list.
Data Scraping: Companies use crawlers to collect market research, news, or competitor data.
Website Monitoring: Businesses track competitors or changes to specific web pages.
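Crawlability ties in with the robots.txt convention: sites publish a robots.txt file describing which paths crawlers may visit, and well-behaved crawlers check it before fetching. As a small standalone illustration (not part of the crawler built later in this chapter), Python's standard-library urllib.robotparser can perform that check; the user agent string "MyCrawler/1.0" is just a placeholder.

from urllib import robotparser

# Parse the site's robots.txt and ask whether a URL may be crawled.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"))  # True or False

A crawler would typically run this check before adding a URL to its queue of pages to visit.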
A simple single-page crawler can be written in Python using the requests and BeautifulSoup libraries (installed with pip install requests beautifulsoup4):

import requests
from bs4 import BeautifulSoup

def crawl(url):
    headers = {"User-Agent": "Mozilla/5.0"}  # Helps avoid getting blocked
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        print(f"Title: {soup.title.string}\n")  # Print page title
        print("Links found:")
        for link in soup.find_all("a", href=True):  # List every hyperlink on the page
            print(link["href"])
    else:
        print(f"Request failed with status code {response.status_code}")

# Example usage
crawl("https://example.com")
For larger crawls, the Scrapy framework is commonly used: after installing it (pip install scrapy) and generating a project with the scrapy startproject command, a spiders folder is created for you. Inside the spiders folder, create a new Python file (my_spider.py) and add:
import scrapy

class MySpider(scrapy.Spider):
    name = "mycrawler"
    start_urls = ["https://example.com"]  # Replace with your target website

    def parse(self, response):
        # Called for each downloaded page; extract the title as an example
        yield {"title": response.css("title::text").get()}
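As a further sketch, and not the only way to structure a Scrapy spider, the parse method can also follow the links it discovers so that Scrapy crawls beyond the start URLs. The FollowingSpider class below is a hypothetical variant of MySpider that uses response.follow to schedule every linked page with the same callback.

import scrapy

class FollowingSpider(scrapy.Spider):
    name = "followingcrawler"            # Illustrative name, not used elsewhere in this chapter
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Index this page's URL and title, mirroring the earlier example
        yield {"url": response.url, "title": response.css("title::text").get()}

        # Follow every hyperlink on the page; Scrapy filters out duplicate requests by default
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)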
Run the spider from the project's root directory with scrapy crawl mycrawler (the name defined in MySpider); adding -o results.json writes the scraped items to a JSON file. With these steps, you can build a functional web crawler for scraping and indexing websites efficiently!