Beautiful Soup & Selenium Web Scraping Guide

Beautiful Soup vs Selenium

Beautiful Soup is a Python library that turns HTML and XML documents into a
tree of Python objects. It’s widely used in web scraping to extract information from
web pages.

Installation: pip install beautifulsoup4 (the library is imported as bs4)
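As a quick sanity check, here is a minimal sketch (the markup below is made up for illustration) showing how Beautiful Soup turns HTML into a tree of Python objects:

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p class='intro'>Welcome!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())   # Hello
print(soup.p["class"])      # ['intro']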

Selenium is primarily a tool for automating web browsers, most often used for testing web applications.


Besides Python, Selenium is also available for .NET/C#, Ruby, Java, and JavaScript. It requires a
web driver to run.

Installation: pip install selenium
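A minimal sketch to confirm the setup works (this assumes Chrome; recent Selenium releases can download a matching driver for you, otherwise install chromedriver manually):

from selenium import webdriver

driver = webdriver.Chrome()               # launches Chrome via the web driver
driver.get("https://books.toscrape.com")
print(driver.title)                       # prints the page title if everything is wired up
driver.quit()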

Which one should you use?


If you are scraping a website with static content, use Beautiful Soup. If you are
scraping a website with dynamically loaded content (like infinite scrolling), use
Selenium.
Resources
Web Drivers

●​ Chrome: https://developer.chrome.com/docs/chromedriver
●​ Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver
●​ Firefox: https://github.com/mozilla/geckodriver
●​ Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10

Sites that allow scraping

●​ Static: https://books.toscrape.com/
●​ Dynamic: https://webscraper.io/test-sites/e-commerce/scroll/computers/laptops

Scraping Static Web Pages with Beautiful Soup


1. Import the Beautiful Soup and requests libraries:

from bs4 import BeautifulSoup
import requests

2. Fetch the page HTML with requests:

url = "https://books.toscrape.com"
response = requests.get(url)

3. Making the Soup:

soup = BeautifulSoup(response.text, 'html.parser')

The Beautiful Soup constructor takes two arguments: the HTML you want to parse
(here response.text, the raw HTML string returned by requests) and the parser you
want to use ('html.parser' is Python's built-in parser). A parser takes raw text
(like HTML or JSON) and breaks it down into a structured format that your program
can understand and work with.
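For reference, other parsers can be plugged into the same constructor. This is a sketch assuming the optional lxml and html5lib packages are installed (they are not used elsewhere in this guide):

# 'html.parser' ships with Python; 'lxml' and 'html5lib' need: pip install lxml html5lib
soup_builtin = BeautifulSoup(response.text, "html.parser")
soup_lxml = BeautifulSoup(response.text, "lxml")          # generally faster, lenient with messy HTML
soup_html5 = BeautifulSoup(response.text, "html5lib")     # parses the same way a browser does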
Commonly Used Beautiful Soup Methods

Method | What it does | Example
find(tag, attrs) | Finds the first matching element | soup.find("h1")
find_all(tag, attrs) | Finds all matching elements (list) | soup.find_all("p")
select(css_selector) | Finds elements using CSS selectors | soup.select("div.quote span.text")
select_one(css_selector) | Finds the first element matching a CSS selector | soup.select_one("p.intro")
.get_text() | Gets the inner text of an element | soup.select_one("p.intro").get_text()
.attrs | Returns all attributes of an element (dict) | soup.find("a").attrs
tag['attr'] | Gets a specific attribute (like href, src) | soup.find("a")["href"]
prettify() | Returns nicely formatted HTML | print(soup.prettify())

Example

Find the first book title
first_title = soup.find("h3").a["title"]

Find all prices
prices = [p.get_text() for p in soup.find_all("p", class_="price_color")]

Using CSS selectors
title_links = soup.select("h3 a")

Get all visible text (shortened)
all_text = soup.get_text("\n")[:200]
Handling multiple pages
The catalogue URLs follow a simple pattern, so build the URL for each page from a template and loop over the page numbers.

base_url = "https://books.toscrape.com/catalogue/page-{}.html"

all_titles = []

# Loop through first 5 pages
for page in range(1, 6):
    url = base_url.format(page)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    # Collect all book titles on this page
    for book in soup.find_all("h3"):
        all_titles.append(book.a["title"])
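If you don't know how many pages there are, a variant of the loop (a sketch, not part of the original example) can follow the site's "next" pagination link until it disappears; this assumes the li.next markup that books.toscrape.com currently uses:

from urllib.parse import urljoin

url = "https://books.toscrape.com/catalogue/page-1.html"
all_titles = []

while url:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for book in soup.find_all("h3"):
        all_titles.append(book.a["title"])

    # The pagination link sits inside <li class="next"> on this site
    next_link = soup.select_one("li.next a")
    url = urljoin(url, next_link["href"]) if next_link else None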

Scraping dynamic page content with Selenium


This Selenium script loads an infinite-scroll product page, waits for .product-wrapper cards to
appear, then repeatedly scrolls to the bottom and waits until the number of cards increases. When
the count stops growing (no more items load), it exits the loop, collects each product’s title and price,
prints a summary, and quits the browser.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

URL = "https://webscraper.io/test-sites/e-commerce/scroll/computers/laptops"

driver = webdriver.Chrome()
driver.get(URL)
wait = WebDriverWait(driver, 12)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".product-wrapper")))

last = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait until more product cards exist than before
        wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, ".product-wrapper")) > last)
        count = len(driver.find_elements(By.CSS_SELECTOR, ".product-wrapper"))
        if count == last:
            break
        last = count
    except Exception:
        break  # no more items loaded within timeout

# collect product titles & prices
items = driver.find_elements(By.CSS_SELECTOR, ".product-wrapper")
data = [{
    "title": i.find_element(By.CSS_SELECTOR, "a.title").get_attribute("title"),
    "price": i.find_element(By.CSS_SELECTOR, "h4.price").text
} for i in items]

print(f"Collected {len(data)} products")
print(data[:5])
driver.quit()
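Optionally, the browser can run headless so no window opens while scraping. A small sketch (not part of the original script; the flag assumes a recent Chrome/Selenium version):

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")    # run Chrome without a visible window
driver = webdriver.Chrome(options=options)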

Export results as CSV


import csv

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price"])
    w.writeheader()
    w.writerows(data)
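Alternatively, if pandas is installed (an assumption; it is not used elsewhere in this guide), the same export is a one-liner:

import pandas as pd

pd.DataFrame(data).to_csv("products.csv", index=False, encoding="utf-8")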
