Beautiful Soup vs Selenium
Beautiful Soup is a Python library that turns HTML and XML documents into a
tree of Python objects. It’s widely used in web scraping to extract information from
web pages.
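As a small taste of what that tree looks like, the sketch below parses an HTML string (made up here for illustration) and reads a heading and an attribute off it:

from bs4 import BeautifulSoup

html = "<html><body><h1>Hello</h1><p class='intro'>Welcome!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.get_text())      # Hello
print(soup.p["class"])         # ['intro']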
Installation: pip install beautifulsoup4 (imported in code as bs4)
Selenium is primarily a tool for automating web applications and for testing purposes.
Besides Python, Selenium is also available for .NET/C#, Ruby, Java, and JavaScript. It
requires a web driver to control the browser.
Installation: pip install selenium
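A minimal smoke test that Selenium is working (this assumes Chrome is installed; recent Selenium releases can fetch a matching driver automatically, otherwise use one of the drivers listed under Resources below):

from selenium import webdriver

driver = webdriver.Chrome()            # starts a Chrome browser session
driver.get("https://example.com")
print(driver.title)                    # Example Domain
driver.quit()                          # always close the browser when done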
Which one should you use?
If you are scraping a website with static content, use Beautiful Soup. If you are
scraping a website with dynamically loaded content (like infinite scrolling), use
Selenium.
Resources
Web Drivers
● Chrome: https://developer.chrome.com/docs/chromedriver
● Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver
● Firefox: https://github.com/mozilla/geckodriver
● Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10
Sites that allow scraping
● Static: https://books.toscrape.com/
● Dynamic: https://webscraper.io/test-sites/e-commerce/scroll/computers/laptops
Scraping Static Web Pages with Beautiful Soup
1. Import Beautiful Soup and the requests library:
from bs4 import BeautifulSoup
import requests
2. Fetch the page HTML with requests:
url = "https://books.toscrape.com"
response = requests.get(url)
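Before parsing, it is worth confirming that the request actually succeeded; one quick way (raise_for_status raises an exception for 4xx/5xx responses):

# Stop early if the page could not be fetched
response.raise_for_status()
print(response.status_code)   # 200 means OK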
3. Making the Soup:
soup = BeautifulSoup(response.text, "html.parser")
The BeautifulSoup constructor takes two arguments: the HTML you want to parse (here,
response.text) and the parser you want to use ('html.parser' is Python's built-in parser).
A parser takes raw text (like HTML or JSON) and breaks it down into a structured format
that your program can understand and work with.
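You can swap in other parsers; for example, lxml is a popular third-party backend that is usually faster (this assumes you have installed it separately with pip install lxml):

# Same document, different parser backend
soup = BeautifulSoup(response.text, "lxml")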
Commonly Used Beautiful Soup Methods
● find(tag, attrs): finds the first matching element. Example: soup.find("h1")
● find_all(tag, attrs): finds all matching elements and returns them as a list. Example: soup.find_all("p")
● select(css_selector): finds elements using CSS selectors. Example: soup.select("div.quote span.text")
● select_one(css_selector): finds the first element matching a CSS selector. Example: soup.select_one("p.intro")
● .get_text(): gets the inner text of an element. Example: soup.select_one("p.intro").get_text()
● .attrs: returns all attributes of an element as a dict. Example: soup.find("a").attrs
● tag['attr']: gets a specific attribute (like href or src). Example: soup.find("a")["href"]
● prettify(): returns nicely formatted HTML. Example: print(soup.prettify())
Example
Find the first book title
first_title = soup.find("h3").a["title"]
Find all prices
prices = [p.get_text() for p in soup.find_all("p", class_="price_color")]
Using CSS selectors
title_links = soup.select("h3 a")
Get all visible text (shortened)
all_text = soup.get_text("\n")[:200]
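The .attrs and tag['attr'] patterns from the table are useful for pulling out links and image URLs as well; a sketch assuming the article.product_pod markup used on books.toscrape:

first_book = soup.find("article", class_="product_pod")
print(first_book.h3.a["href"])        # relative URL of the book's detail page
print(first_book.img.attrs["src"])    # relative URL of the cover image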
Handling multiple pages
The catalogue is split across numbered pages, so build each page's URL from a template and loop through the page numbers.
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
all_titles = []

# Loop through the first 5 pages
for page in range(1, 6):
    url = base_url.format(page)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    # Collect all book titles on this page
    for book in soup.find_all("h3"):
        all_titles.append(book.a["title"])
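An alternative to hard-coding page numbers is to follow the site's own "next" link until it disappears. A sketch of that approach, assuming the li.next pager markup that books.toscrape uses:

from urllib.parse import urljoin

url = "https://books.toscrape.com/catalogue/page-1.html"
all_titles = []

while url:
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for book in soup.find_all("h3"):
        all_titles.append(book.a["title"])
    # Follow the "next" link if there is one, otherwise stop
    next_link = soup.select_one("li.next a")
    url = urljoin(url, next_link["href"]) if next_link else None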
Scraping dynamic page content with Selenium
This Selenium script loads an infinite-scroll product page, waits for .product-wrapper cards to
appear, then repeatedly scrolls to the bottom and waits until the number of cards increases. When
the count stops growing (no more items load), it exits the loop, collects each product’s title and price,
prints a summary, and quits the browser.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
URL = "https://webscraper.io/test-sites/e-commerce/scroll/computers/laptops"
driver = webdriver.Chrome()
driver.get(URL)
wait = WebDriverWait(driver, 12)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".product-wrapper")))
last = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    try:
        # wait until more product cards exist than before
        wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, ".product-wrapper")) > last)
        count = len(driver.find_elements(By.CSS_SELECTOR, ".product-wrapper"))
        if count == last:
            break
        last = count
    except Exception:
        break  # no more items loaded within the timeout
# collect product titles & prices
items = driver.find_elements(By.CSS_SELECTOR, ".product-wrapper")
data = [{
    "title": i.find_element(By.CSS_SELECTOR, "a.title").get_attribute("title"),
    "price": i.find_element(By.CSS_SELECTOR, "h4.price").text,
} for i in items]
print(f"Collected {len(data)} products")
print(data[:5])
driver.quit()
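If you do not need to watch the browser while it scrolls, Chrome can be run headless; a sketch of the option setup (the --headless=new flag applies to recent Chrome versions):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")   # run Chrome without opening a window
driver = webdriver.Chrome(options=options)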
Export results as CSV
import csv
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.DictWriter(f, fieldnames=["title", "price"])
    w.writeheader()
    w.writerows(data)
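If you already use pandas, the same export is a one-liner (this assumes pandas is installed separately):

import pandas as pd

pd.DataFrame(data).to_csv("products.csv", index=False)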