ScrapeBlocks

ScrapeBlocks is a layer on top of Playwright to make scraping automation easier.

You can define actions to run before scraping starts, and you can choose which scraping strategy to use.

Start with predefined actions and strategies in a matter of minutes. You can also write your own or use ones from the community.

Who is this for? 🤔

I just want to start scraping right now with as little effort as possible
I have a complicated scraping workflow that I want to simplify while still getting the same results
I like to tinker with scraping and build my own custom workflows

With ScrapeBlocks, getting started with scraping is a matter of minutes.

You can use it with batteries included or as an extension to Playwright.

Whether you are a scraping hero or just want to monitor the price of a product without knowing much about scraping, ScrapeBlocks is here for you.

Features 🚀

Pre-scraping actions: perform actions before running a scraping strategy
- Example use-case: you need to click something before your target becomes visible
Plug-n-play: write your own scraping strategies or use those from the community
- Example use-cases: scrape for text of certain elements, get all the images, etc.
Fully customizable (or not): you can use it batteries included or use your own Playwright instances
Easy to start with: it's based on Playwright!

Actions included ⚡

Click on any element
Add cookie
Remove an element
Type anywhere you can type something
Press keyboard buttons (e.g. Enter, CTRL+C, etc.)
Scroll to bottom of the page
Wait a certain amount of time
Select any option from a <select> element
(to be continued...)

Strategies included 🧙🏼

Scrape text element: retrieve the text within any element
Screenshot to map: returns a screenshot of the page together with JSON containing the coordinates and XPath/CSS selector of the elements of your choice
(to be continued...)

Installation 🔧

Install ScrapeBlocks with npm

  npm install scrapeblocks

Usage 🧑🏼‍💻

Using built-in Playwright

Basic textContent strategy

import { Scraper, ScrapingStragegies } from "scrapeblocks";

const URL = "https://webscraper.io/test-sites/e-commerce/allinone";
const selector = "h4.price";

const strategy = new ScrapingStragegies.TextContentScraping(selector);
const result = await new Scraper(URL, strategy).run();

console.log(result);

Output:

['$233.99', '$603.99', '$295.99']
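
If you want to compare or monitor these prices, the strings need to be turned into numbers first. Below is a minimal sketch, assuming the result is an array of strings formatted like the output above. parsePrice is a hypothetical helper and not part of ScrapeBlocks, and the sample array is copied from the output rather than scraped live.

```typescript
// Hypothetical helper, not part of ScrapeBlocks: turns "$233.99" into 233.99.
function parsePrice(text: string): number {
	// Keep only digits, the decimal point, and a minus sign.
	const cleaned = text.replace(/[^0-9.-]/g, "");
	const value = Number.parseFloat(cleaned);
	if (Number.isNaN(value)) {
		throw new Error(`Cannot parse a price from "${text}"`);
	}
	return value;
}

// Sample data copied from the output shown above (assumed shape: string[]).
const scraped: string[] = ["$233.99", "$603.99", "$295.99"];
const prices: number[] = scraped.map(parsePrice);
const cheapest: number = Math.min(...prices);

console.log(prices);
console.log(cheapest);
```

The helper throws on text that contains no number, so a changed page layout shows up as an error instead of a silent NaN.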

With actions

import { Scraper, ScrapingStragegies, Select } from "scrapeblocks";

const URL = "https://webscraper.io/test-sites/e-commerce/more/product/488";
const selectElement = "div.dropdown > select";
const optionToSelect = "Gold";
const selector = "div.caption > h4:nth-child(2)";

const strategy = new ScrapingStragegies.TextContentScraping(selector);
const selectAction = new Select({
	element: selectElement,
	value: optionToSelect,
});
const result = await new Scraper(URL, strategy, [selectAction]).run();

console.log(result);

Output:

Samsung Galaxy Gold

You can chain multiple actions by passing them as an array, in the order you want them to be executed.

Example:

const actions = [scrollAction, clickAction, typeAction];
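
To illustrate what "in order" means here, the following sketch runs each action only after the previous one has finished. It is not ScrapeBlocks' implementation: PageLike, Action, namedAction and runInOrder are hypothetical names made up for this example, and the dummy actions only record their names instead of touching a real page.

```typescript
// Hypothetical shapes for illustration only; ScrapeBlocks' real types may differ.
interface PageLike {
	log: string[];
}

interface Action {
	execute(page: PageLike): Promise<void>;
}

// Builds a dummy action that records its name when it runs.
function namedAction(name: string): Action {
	return {
		async execute(page: PageLike): Promise<void> {
			page.log.push(name);
		},
	};
}

// Awaits each action before starting the next one,
// so the array order is the execution order.
async function runInOrder(actions: Action[], page: PageLike): Promise<void> {
	for (const action of actions) {
		await action.execute(page);
	}
}

const demoPage: PageLike = { log: [] };
const demoActions: Action[] = [
	namedAction("scroll"),
	namedAction("click"),
	namedAction("type"),
];

runInOrder(demoActions, demoPage).then(() => {
	console.log(demoPage.log.join(" -> "));
});
```

Running the actions one at a time matters for scraping: a click usually has to complete before the element it reveals can be typed into.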

Starting from version 0.1.0, you can also just execute actions without providing any strategy.

In that case, run() returns the Playwright Browser, BrowserContext, and Page instances.

Example:

const { browser, context, page } = await new Scraper<PlaywrightBlocks>(
	URL,
	undefined,
	[clickAction],
).run();
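
Once you hold the browser yourself, closing it is presumably up to you (an assumption; this README does not say who owns cleanup). Here is a minimal sketch of a wrapper that always closes the browser, even if your own code throws. Closable and withCleanup are hypothetical names, typed structurally so the sketch runs without Playwright installed. Playwright's Browser has a close() method, so it would fit this shape.

```typescript
// Anything with an async close() method; Playwright's Browser matches this shape.
interface Closable {
	close(): Promise<void>;
}

// Runs the given work and closes the browser afterwards,
// whether the work succeeded or threw.
async function withCleanup<T>(
	browser: Closable,
	work: () => Promise<T>,
): Promise<T> {
	try {
		return await work();
	} finally {
		await browser.close();
	}
}

// Stand-in for a real browser so the sketch is runnable on its own.
let demoClosed = false;
const fakeBrowser: Closable = {
	async close(): Promise<void> {
		demoClosed = true;
	},
};

withCleanup(fakeBrowser, async () => "work done").then((result) => {
	console.log(result, demoClosed);
});
```

With a real run, the work callback would be where you use the returned page, for example to read its title or take a screenshot.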

TODO ✅

Implement more strategies
Implement more actions
Increase test cases
Write more extensive documentation

Contributing 🤝🏼

Feel free to fork this repo and open a PR. I will review it and merge it if everything looks good. The TODO list above is a good place to start.

License 📝

MIT
