Web Scraping Fundamentals
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. It
involves fetching the content of a web page and parsing it to collect specific
information.
A brief history:
1993: The World Wide Web Wanderer, one of the first bots, crawls the web to index website links.
2004: Beautiful Soup, the first major Python library for web scraping, is released.
Common Use Cases
Data Mining: Collecting data for analysis, research, or machine learning.
Price Monitoring: Tracking prices and availability of products across different
e-commerce sites.
Market Research: Gathering insights about competitors, trends, and customer
opinions from forums and reviews.
Content Aggregation: Compiling information from multiple sources into a
single platform, such as news articles or job listings.
Steps to Scrape a Web Page
Data Extraction:
The primary goal is gathering data from web pages, including text, images,
links, and other elements.
Automated Tools:
Web scraping is typically performed using automated tools or scripts,
which can navigate websites, simulate user behavior, and extract data
without manual intervention.
Parsing:
After fetching the HTML content of a page, the next step is to parse it to
identify and extract the desired information. This often involves using
libraries or frameworks that can navigate the HTML structure.
Storage:
Once the data is extracted, it can be stored in various formats, such as
CSV files, databases, or spreadsheets, for further analysis or processing.
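A minimal sketch of these steps in Node.js (18+, where fetch is built in). The URL is a placeholder, and the regex stands in for a real HTML parser, for illustration only:

const fs = require('fs');

async function scrapeTitle() {
  // Fetch: download the raw HTML (example.com is a placeholder URL).
  const response = await fetch('https://example.com');
  const html = await response.text();

  // Parse: a naive regex stands in for a proper HTML parser here.
  const match = html.match(/<title>(.*?)<\/title>/i);
  const title = match ? match[1] : 'No title found';

  // Store: write the extracted data to a CSV file.
  fs.writeFileSync('output.csv', `url,title\nhttps://example.com,"${title}"\n`);
}

scrapeTitle();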
Legal and Ethical Considerations
Respect Robots.txt: Many websites have a robots.txt file that specifies rules
about what can be scraped. Always check and comply with these rules.
Terms of Service: Scraping may violate a website's terms of service. Be sure
to review and adhere to them.
Rate Limiting: To avoid overloading a server, it’s important to implement rate
limiting and avoid making too many requests in a short period.
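One simple way to rate-limit, sketched below, is to sleep between requests. The one-second delay is an arbitrary illustration; an appropriate limit depends on the target site:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchPolitely(urls) {
  for (const url of urls) {
    const response = await fetch(url);
    console.log(url, response.status);
    await sleep(1000); // Wait 1 second between requests to avoid overloading the server.
  }
}

// Placeholder URLs for illustration.
fetchPolitely(['https://example.com/a', 'https://example.com/b']);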
Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API
to control headless Chrome or Chromium browsers. It's widely used for web
scraping because it can render JavaScript-heavy pages and automate full
browser interactions.
Key Features of Puppeteer for Web Scraping
1. Headless Browsing:
Puppeteer can run Chrome in headless mode, meaning it can perform web
scraping without opening a visible browser window, which makes automated
tasks faster and less resource-intensive.
2. Full Browser Control:
Puppeteer allows you to control nearly all aspects of the browser,
including navigation, clicking elements, filling forms, and taking
screenshots. This makes it suitable for scraping complex web applications.
3. JavaScript Rendering:
Many modern websites rely on JavaScript to render content. Puppeteer
can execute JavaScript on pages, which allows you to scrape dynamic
content that might not be available in the initial HTML.
4. Easy Navigation:
Puppeteer provides straightforward methods for navigating to pages,
waiting for elements to load, and handling timeouts, which simplifies the
scraping process.
5. Data Extraction:
You can easily extract data from the DOM using methods to query
elements, retrieve text content, and get attribute values.
6. Screenshots and PDFs:
Puppeteer can take screenshots of pages or generate PDFs, which can be
useful for visual verification of scraped content.
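A minimal sketch that exercises several of these features: a headless launch, navigation, extraction of JavaScript-rendered content, and a screenshot. The URL and the h1 selector are placeholders:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // Headless by default.
  const page = await browser.newPage();

  // Wait until network activity settles so JS-rendered content is present.
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Extract data from the rendered DOM ('h1' is a placeholder selector).
  const heading = await page.$eval('h1', (el) => el.textContent);
  console.log(heading);

  await page.screenshot({ path: 'page.png' }); // Visual verification.
  await browser.close();
})();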
Installation
Link to the Puppeteer library on npm: https://www.npmjs.com/package/puppeteer
npm i puppeteer       # Downloads a compatible version of Chrome during installation.
npm i puppeteer-core  # Installs only the library, without downloading Chrome.
When you install puppeteer-core, you need to specify an executable path for
Chrome or Chromium.
Windows: Typically located at:
Chrome: C:\Program Files\Google\Chrome\Application\chrome.exe
Chromium: C:\Users\<YourUsername>\AppData\Local\Chromium\Application\chrome.exe
macOS: Typically located at:
Chrome: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Chromium: ~/Applications/Chromium.app/Contents/MacOS/Chromium
Linux: Usually installed via package managers, often at:
Chrome: /usr/bin/google-chrome
Chromium: /usr/bin/chromium-browser
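With puppeteer-core, the path is passed to launch. A sketch assuming Chrome is installed at the Linux location above; adjust the path for your OS:

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/google-chrome', // See the per-OS paths above.
  });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // Placeholder URL.
  await browser.close();
})();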
Classes inside Puppeteer Library
Browser - This instance represents a browser session and allows you to
perform various operations, such as opening new pages, closing the browser,
and managing browser contexts.
Page - The Page object represents a single tab or page in the browser. When
you create a new page using the newPage() method on a Browser instance, you
receive a Page instance. This object allows you to interact with the content of
the page, perform actions, and extract data.
Navigation:
Methods like goto(URL) enable you to navigate to a specific URL.
Content Interaction:
You can perform actions like clicking buttons, filling out forms, and
navigating through links using methods such as click(selector),
type(selector, text), and evaluate().
Data Extraction:
The Page object allows you to extract content from the DOM. You can
use evaluate() to run JavaScript in the context of the page and return
data.
Event Handling:
You can listen to various events on the page, such as load,
domcontentloaded, and more.
Screenshots and PDFs:
You can take screenshots of the page or generate PDFs using
screenshot() and pdf() methods.
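A sketch combining several of these Page capabilities; every URL and selector here is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Event handling: log when the page finishes loading.
  page.on('load', () => console.log('Page loaded'));

  // Navigation.
  await page.goto('https://example.com/login');

  // Content interaction ('#user' and '#submit' are placeholder selectors).
  await page.type('#user', 'demo');
  await page.click('#submit');

  // Generate a PDF of the resulting page.
  await page.pdf({ path: 'page.pdf' });

  await browser.close();
})();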
Evaluate Method
The evaluate method of the Page object in Puppeteer takes a function as an
argument and executes it in the context of the page, i.e., inside the browser
rather than in your Node.js process. The function can therefore access the
page's DOM directly, and whatever it returns is serialized and passed back to
your script.
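A short sketch of evaluate: arguments after the function are passed into it, and the return value comes back to Node. The 'a' selector is a placeholder:

// Inside an async function, assuming `page` is an existing Page instance.
const links = await page.evaluate((selector) => {
  // This code runs in the browser, so `document` is available here.
  return Array.from(document.querySelectorAll(selector)).map((a) => a.href);
}, 'a');

console.log(links); // An array of href strings, serialized back from the page.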
Callbacks
Basic Callback
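A basic callback is simply a function passed to another function to be invoked later; a minimal sketch:

function greet(name, callback) {
  const message = `Hello, ${name}`;
  callback(message); // Invoke the callback with the computed result.
}

greet('Ada', (msg) => console.log(msg)); // Prints: Hello, Ada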
Asynchronous Callbacks
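An asynchronous callback runs after some operation completes rather than immediately; setTimeout is the classic illustration:

console.log('before');

setTimeout(() => {
  // Runs after the 1-second timer fires, long after the code below.
  console.log('inside the callback');
}, 1000);

console.log('after'); // Printed before the callback runs.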
Array Method Callbacks
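Array methods such as map, filter, and forEach accept callbacks that are applied to each element:

const prices = [10, 25, 40];

const doubled = prices.map((p) => p * 2);        // [20, 50, 80]
const affordable = prices.filter((p) => p < 30); // [10, 25]
prices.forEach((p) => console.log(p));           // Logs each price in turn.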