Web Scraping Fundamentals
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. It
involves fetching the content of a web page and parsing it to collect specific
information.
A brief history:
1993: The World Wide Web Wanderer, one of the first bots, crawls the web to index website links.
2004: Beautiful Soup, the first major Python library for web scraping, is released.
Common Use Cases
Data Mining: Collecting data for analysis, research, or machine learning.
Price Monitoring: Tracking prices and availability of products across different
e-commerce sites.
Market Research: Gathering insights about competitors, trends, and customer
opinions from forums and reviews.
Content Aggregation: Compiling information from multiple sources into a
single platform, such as news articles or job listings.
Steps to Scrape a Web Page
Data Extraction:
The primary goal is gathering data from web pages, including text, images,
links, and other elements.
Automated Tools:
Web scraping is typically performed using automated tools or scripts,
which can navigate websites, simulate user behavior, and extract data
without manual intervention.
Parsing:
After fetching the HTML content of a page, the next step is to parse it to
identify and extract the desired information. This often involves using
libraries or frameworks that can navigate the HTML structure.
Storage:
Once the data is extracted, it can be stored in various formats, such as
CSV files, databases, or spreadsheets, for further analysis or processing.
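A minimal sketch of these steps in Node.js (18+, where fetch is built in). The URL is a placeholder, and the regex stands in for a real HTML parser, for illustration only:

const fs = require('fs');

async function scrapeTitle() {
  // Fetch: download the raw HTML (example.com is a placeholder URL).
  const response = await fetch('https://example.com');
  const html = await response.text();

  // Parse: a naive regex stands in for a proper HTML parser here.
  const match = html.match(/<title>(.*?)<\/title>/i);
  const title = match ? match[1] : 'No title found';

  // Store: write the extracted data to a CSV file.
  fs.writeFileSync('output.csv', `url,title\nhttps://example.com,"${title}"\n`);
}

scrapeTitle();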
Legal and Ethical Considerations
Respect Robots.txt: Many websites have a robots.txt file that specifies rules
about what can be scraped. Always check and comply with these rules.
Terms of Service: Scraping may violate a website's terms of service. Be sure
to review and adhere to them.
Rate Limiting: To avoid overloading a server, it’s important to implement rate
limiting and avoid making too many requests in a short period.
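One simple way to rate-limit, sketched below, is to sleep between requests. The one-second delay is an arbitrary illustration; an appropriate limit depends on the target site:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchPolitely(urls) {
  for (const url of urls) {
    const response = await fetch(url);
    console.log(url, response.status);
    await sleep(1000); // Wait 1 second between requests to avoid overloading the server.
  }
}

// Placeholder URLs for illustration.
fetchPolitely(['https://example.com/a', 'https://example.com/b']);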
Puppeteer
Puppeteer is a Node.js library developed by Google that provides a high-level API
to control headless Chrome or Chromium browsers. It's widely used for web
scraping because it can render JavaScript-heavy pages and automate full
browser interactions.
Key Features of Puppeteer for Web Scraping
1. Headless Browsing:
Puppeteer can run Chrome in headless mode, meaning it can perform web
scraping without opening a visible browser window, which makes automated
tasks faster and less resource-intensive.
2. Full Browser Control:
Puppeteer allows you to control nearly all aspects of the browser,
including navigation, clicking elements, filling forms, and taking
screenshots. This makes it suitable for scraping complex web applications.
3. JavaScript Rendering:
Many modern websites rely on JavaScript to render content. Puppeteer
can execute JavaScript on pages, which allows you to scrape dynamic
content that might not be available in the initial HTML.
4. Easy Navigation:
Puppeteer provides straightforward methods for navigating to pages,
waiting for elements to load, and handling timeouts, which simplifies the
scraping process.
5. Data Extraction:
You can easily extract data from the DOM using methods to query
elements, retrieve text content, and get attribute values.
6. Screenshots and PDFs:
Puppeteer can take screenshots of pages or generate PDFs, which can be
useful for visual verification of scraped content.
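A minimal sketch that exercises several of these features: a headless launch, navigation, extraction of JavaScript-rendered content, and a screenshot. The URL and the h1 selector are placeholders:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // Headless by default.
  const page = await browser.newPage();

  // Wait until network activity settles so JS-rendered content is present.
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });

  // Extract data from the rendered DOM ('h1' is a placeholder selector).
  const heading = await page.$eval('h1', (el) => el.textContent);
  console.log(heading);

  await page.screenshot({ path: 'page.png' }); // Visual verification.
  await browser.close();
})();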
Installation
Link to the Puppeteer library on npm: https://www.npmjs.com/package/puppeteer
npm i puppeteer       # Downloads a compatible version of Chrome during installation.
npm i puppeteer-core  # Installs only the library, without downloading Chrome.
When you install puppeteer-core, you need to specify an executable path for
Chrome or Chromium.
Windows: Typically located at:
Chrome: C:\Program Files\Google\Chrome\Application\chrome.exe
Chromium: C:\Users\<YourUsername>\AppData\Local\Chromium\Application\chrome.exe
macOS: Typically located at:
Chrome: /Applications/Google Chrome.app/Contents/MacOS/Google Chrome
Chromium: ~/Applications/Chromium.app/Contents/MacOS/Chromium
Linux: Usually installed via package managers, often at:
Chrome: /usr/bin/google-chrome
Chromium: /usr/bin/chromium-browser
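With puppeteer-core, the path is passed to launch. A sketch assuming Chrome is installed at the Linux location above; adjust the path for your OS:

const puppeteer = require('puppeteer-core');

(async () => {
  const browser = await puppeteer.launch({
    executablePath: '/usr/bin/google-chrome', // See the per-OS paths above.
  });
  const page = await browser.newPage();
  await page.goto('https://example.com'); // Placeholder URL.
  await browser.close();
})();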
Classes inside Puppeteer Library
Browser - This instance represents a browser session and allows you to
perform various operations, such as opening new pages, closing the browser,
and managing browser contexts.
Page - The Page object represents a single tab or page in the browser. When
you create a new page using the newPage() method on a Browser instance, you
receive a Page instance. This object allows you to interact with the content of
the page, perform actions, and extract data.
Navigation:
Methods like goto(URL) enable you to navigate to a specific URL.
Content Interaction:
You can perform actions like clicking buttons, filling out forms, and
navigating through links using methods such as click(selector),
type(selector, text), and evaluate().
Data Extraction:
The Page object allows you to extract content from the DOM. You can
use evaluate() to run JavaScript in the context of the page and return
data.
Event Handling:
You can listen to various events on the page, such as load,
domcontentloaded, and more.
Screenshots and PDFs:
You can take screenshots of the page or generate PDFs using
screenshot() and pdf() methods.
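A sketch combining several of these Page capabilities; every URL and selector here is a placeholder:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Event handling: log when the page finishes loading.
  page.on('load', () => console.log('Page loaded'));

  // Navigation.
  await page.goto('https://example.com/login');

  // Content interaction ('#user' and '#submit' are placeholder selectors).
  await page.type('#user', 'demo');
  await page.click('#submit');

  // Generate a PDF of the resulting page.
  await page.pdf({ path: 'page.pdf' });

  await browser.close();
})();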
Evaluate Method
The evaluate method of the Page object in Puppeteer takes a function as an
argument and executes it in the context of the page, i.e., inside the browser
rather than in your Node.js process. The function can therefore access the
page's DOM directly, and whatever it returns is serialized and passed back to
your script.
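A short sketch of evaluate: arguments after the function are passed into it, and the return value comes back to Node. The 'a' selector is a placeholder:

// Inside an async function, assuming `page` is an existing Page instance.
const links = await page.evaluate((selector) => {
  // This code runs in the browser, so `document` is available here.
  return Array.from(document.querySelectorAll(selector)).map((a) => a.href);
}, 'a');

console.log(links); // An array of href strings, serialized back from the page.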
Callbacks
Basic Callback
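A basic callback is simply a function passed to another function to be invoked later; a minimal sketch:

function greet(name, callback) {
  const message = `Hello, ${name}`;
  callback(message); // Invoke the callback with the computed result.
}

greet('Ada', (msg) => console.log(msg)); // Prints: Hello, Ada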
Asynchronous Callbacks
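An asynchronous callback runs after some operation completes rather than immediately; setTimeout is the classic illustration:

console.log('before');

setTimeout(() => {
  // Runs after the 1-second timer fires, long after the code below.
  console.log('inside the callback');
}, 1000);

console.log('after'); // Printed before the callback runs.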
Array Method Callbacks
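Array methods such as map, filter, and forEach accept callbacks that are applied to each element:

const prices = [10, 25, 40];

const doubled = prices.map((p) => p * 2);        // [20, 50, 80]
const affordable = prices.filter((p) => p < 30); // [10, 25]
prices.forEach((p) => console.log(p));           // Logs each price in turn.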