Crawl4ai Docs
"url": "https://crawl4ai.com/mkdocs/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/",
"loadedTime": "2025-03-05T23:16:06.324Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 0,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/",
"title": "Home - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:00 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"4d645fcdae703856ecb41430b7a3133a\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Home - Crawl4AI Documentation (v0.5.x)\n🚀🤖
Crawl4AI: Open-Source LLM-Friendly Web Crawler & Scraper
\nCrawl4AI is the #1 trending GitHub repository, actively
maintained by a vibrant community. It delivers blazing-fast,
AI-ready web crawling tailored for large language models, AI
agents, and data pipelines. Fully open source, flexible, and
built for real-time performance, Crawl4AI empowers developers
with unmatched speed, precision, and deployment ease.\nNote:
If you're looking for the old documentation, you can access it
here.\nQuick Start\nHere's a quick example to show you how
easy it is to use Crawl4AI with its asynchronous capabilities:
\nimport asyncio from crawl4ai import AsyncWebCrawler async
def main(): # Create an instance of AsyncWebCrawler async with
AsyncWebCrawler() as crawler: # Run the crawler on a URL
result = await crawler.arun(url=\"https://crawl4ai.com\") #
Print the extracted content print(result.markdown) # Run the
async main function asyncio.run(main()) \nWhat Does Crawl4AI
Do?\nCrawl4AI is a feature-rich crawler and scraper that aims
to:\n1. Generate Clean Markdown: Perfect for RAG pipelines or
direct ingestion into LLMs.\n2. Structured Extraction: Parse
repeated patterns with CSS, XPath, or LLM-based extraction.
\n3. Advanced Browser Control: Hooks, proxies, stealth modes,
session re-use—fine-grained control.\n4. High Performance:
Parallel crawling, chunk-based extraction, real-time use
cases.\n5. Open Source: No forced API keys, no paywalls—
everyone can access their data. \nCore Philosophies: -
Democratize Data: Free to use, transparent, and highly
configurable.\n- LLM Friendly: Minimally processed, well-
structured text, images, and metadata, so AI models can easily
consume it.\nDocumentation Structure\nTo help you get started,
we’ve organized our docs into clear sections:\nSetup &
Installation\nBasic instructions to install Crawl4AI via pip
or Docker. \nQuick Start\nA hands-on introduction showing how
to do your first crawl, generate Markdown, and do a simple
extraction. \nCore\nDeeper guides on single-page crawling,
advanced browser/crawler parameters, content filtering, and
caching. \nAdvanced\nExplore link & media handling, lazy
loading, hooking & authentication, proxies, session
management, and more. \nExtraction\nDetailed references for
no-LLM (CSS, XPath) vs. LLM-based strategies, chunking, and
clustering approaches. \nAPI Reference\nFind the technical
specifics of each class and method, including AsyncWebCrawler,
arun(), and CrawlResult.\nThroughout these sections, you’ll
find code samples you can copy-paste into your environment. If
something is missing or unclear, raise an issue or PR.\nHow
You Can Support\nStar & Fork: If you find Crawl4AI helpful,
star the repo on GitHub or fork it to add your own features.
\nFile Issues: Encounter a bug or missing feature? Let us know
by filing an issue, so we can improve. \nPull Requests:
Whether it’s a small fix, a big feature, or better docs—
contributions are always welcome. \nJoin Discord: Come chat
about web scraping, crawling tips, or AI workflows with the
community. \nSpread the Word: Mention Crawl4AI in your blog
posts, talks, or on social media. \nOur mission: to empower
everyone—students, researchers, entrepreneurs, data
scientists—to access, parse, and shape the world’s data
with speed, cost-efficiency, and creative freedom.\nQuick
Links\nGitHub Repo \nInstallation Guide \nQuick Start \nAPI
Reference \nChangelog \nThank you for joining me on this
journey. Let’s keep building an open, democratic approach to
data extraction and AI together.\nHappy Crawling!\n—
Unclecode, Founder & Maintainer of Crawl4AI",
"markdown": "# Home - Crawl4AI Documentation (v0.5.x)\n\n##
🚀🤖 Crawl4AI: Open-Source LLM-Friendly Web Crawler &
Scraper\n\nCrawl4AI is the #1 trending GitHub repository,
actively maintained by a vibrant community. It delivers
blazing-fast, AI-ready web crawling tailored for large
language models, AI agents, and data pipelines. Fully open
source, flexible, and built for real-time performance,
**Crawl4AI** empowers developers with unmatched speed,
precision, and deployment ease.\n\n> **Note**: If you're
looking for the old documentation, you can access it [here]
(https://old.docs.crawl4ai.com/).\n\n## Quick Start\n\nHere's
a quick example to show you how easy it is to use Crawl4AI
with its asynchronous capabilities:\n\n`import asyncio from
crawl4ai import AsyncWebCrawler async def main(): #
Create an instance of AsyncWebCrawler async with
AsyncWebCrawler() as crawler: # Run the crawler on a
URL result = await crawler.arun(url=
\"https://crawl4ai.com\") # Print the extracted
content print(result.markdown) # Run the async main
function asyncio.run(main())`\n\n* * *\n\n## What Does
Crawl4AI Do?\n\nCrawl4AI is a feature-rich crawler and scraper
that aims to:\n\n1. **Generate Clean Markdown**: Perfect for
RAG pipelines or direct ingestion into LLMs. \n2. **Structured Extraction**: Parse repeated patterns with CSS,
XPath, or LLM-based extraction. \n3. **Advanced Browser
Control**: Hooks, proxies, stealth modes, session re-use—
fine-grained control. \n4. **High Performance**: Parallel
crawling, chunk-based extraction, real-time use cases. \n5. **Open Source**: No forced API keys, no paywalls—everyone
can access their data.\n\n**Core Philosophies**: -
**Democratize Data**: Free to use, transparent, and highly
configurable. \n\\- **LLM Friendly**: Minimally processed,
well-structured text, images, and metadata, so AI models can
easily consume it.\n\n* * *\n\n## Documentation Structure\n
\nTo help you get started, we’ve organized our docs into
clear sections:\n\n* **Setup & Installation** \n Basic
instructions to install Crawl4AI via pip or Docker.\n*
**Quick Start** \n A hands-on introduction showing how to
do your first crawl, generate Markdown, and do a simple
extraction.\n* **Core** \n Deeper guides on single-page
crawling, advanced browser/crawler parameters, content
filtering, and caching.\n* **Advanced** \n Explore link
& media handling, lazy loading, hooking & authentication,
proxies, session management, and more.\n* **Extraction** \n
Detailed references for no-LLM (CSS, XPath) vs. LLM-based
strategies, chunking, and clustering approaches.\n* **API
Reference** \n Find the technical specifics of each class
and method, including `AsyncWebCrawler`, `arun()`, and
`CrawlResult`.\n\nThroughout these sections, you’ll find
code samples you can **copy-paste** into your environment. If
something is missing or unclear, raise an issue or PR.\n\n* *
*\n\n## How You Can Support\n\n* **Star & Fork**: If you
find Crawl4AI helpful, star the repo on GitHub or fork it to
add your own features.\n* **File Issues**: Encounter a bug
or missing feature? Let us know by filing an issue, so we can
improve.\n* **Pull Requests**: Whether it’s a small fix, a
big feature, or better docs—contributions are always
welcome.\n* **Join Discord**: Come chat about web scraping,
crawling tips, or AI workflows with the community.\n*
**Spread the Word**: Mention Crawl4AI in your blog posts,
talks, or on social media.\n\n**Our mission**: to empower
everyone—students, researchers, entrepreneurs, data
scientists—to access, parse, and shape the world’s data
with speed, cost-efficiency, and creative freedom.\n\n* * *\n
\n## Quick Links\n\n* **[GitHub Repo]
(https://github.com/unclecode/crawl4ai)**\n* **[Installation
Guide](https://crawl4ai.com/mkdocs/core/installation/)**\n*
**[Quick Start]
(https://crawl4ai.com/mkdocs/core/quickstart/)**\n* **[API
Reference](https://crawl4ai.com/mkdocs/api/async-
webcrawler/)**\n* **[Changelog]
(https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md)
**\n\nThank you for joining me on this journey. Let’s keep
building an **open, democratic** approach to data extraction
and AI together.\n\nHappy Crawling! \n— _Unclecode, Founder
& Maintainer of Crawl4AI_",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/quickstart/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/core/quickstart/",
"loadedTime": "2025-03-05T23:16:12.958Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/core/quickstart/",
"title": "Quick Start - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:11 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"b0a884d431fe29ee3d8d3710e68b5bf8\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Quick Start - Crawl4AI Documentation
(v0.5.x)\nGetting Started with Crawl4AI\nWelcome to Crawl4AI,
an open-source LLM-friendly Web Crawler & Scraper. In this
tutorial, you’ll:\nRun your first crawl using minimal
configuration. \nGenerate Markdown output (and learn how it’s influenced by content filters). \nExperiment with a simple
CSS-based extraction strategy. \nSee a glimpse of LLM-based
extraction (including open-source and closed-source model
options). \nCrawl a dynamic page that loads content via
JavaScript.\n1. Introduction\nCrawl4AI provides:\nAn
asynchronous crawler, AsyncWebCrawler. \nConfigurable browser
and run settings via BrowserConfig and CrawlerRunConfig.
\nAutomatic HTML-to-Markdown conversion via
DefaultMarkdownGenerator (supports optional filters).
\nMultiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).\nBy the end of this guide, you’ll have
performed a basic crawl, generated Markdown, tried out two
extraction strategies, and crawled a dynamic page that uses “Load More” buttons or JavaScript updates.\n2. Your First
Crawl\nHere’s a minimal Python script that creates an
AsyncWebCrawler, fetches a webpage, and prints the first 300
characters of its Markdown output:\nimport asyncio from
crawl4ai import AsyncWebCrawler async def main(): async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://example.com\")
print(result.markdown[:300]) # Print first 300 chars if
__name__ == \"__main__\": asyncio.run(main()) \nWhat’s
happening? - AsyncWebCrawler launches a headless browser
(Chromium by default). - It fetches https://example.com. -
Crawl4AI automatically converts the HTML into Markdown.\nYou
now have a simple, working crawl!\n3. Basic Configuration
(Light Introduction)\nCrawl4AI’s crawler can be heavily
customized using two main classes:\n1. BrowserConfig: Controls
browser behavior (headless or full UI, user agent, JavaScript
toggles, etc.).\n2. CrawlerRunConfig: Controls how each crawl
runs (caching, extraction, timeouts, hooking, etc.).\nBelow is
an example with minimal usage:\nimport asyncio from crawl4ai
import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig,
CacheMode async def main(): browser_conf =
BrowserConfig(headless=True) # or False to see the browser
run_conf = CrawlerRunConfig( cache_mode=CacheMode.BYPASS )
async with AsyncWebCrawler(config=browser_conf) as crawler:
result = await crawler.arun( url=\"https://example.com\",
config=run_conf ) print(result.markdown) if __name__ ==
\"__main__\": asyncio.run(main()) \nIMPORTANT: By default
cache mode is set to CacheMode.ENABLED. So to have fresh
content, you need to set it to CacheMode.BYPASS\nWe’ll
explore more advanced config in later tutorials (like enabling
proxies, PDF output, multi-tab sessions, etc.). For now, just
note how you pass these objects to manage crawling.\n4.
Generating Markdown Output\nBy default, Crawl4AI automatically
generates Markdown from each crawled page. However, the exact
output depends on whether you specify a markdown generator or
content filter.\nresult.markdown:\nThe direct HTML-to-Markdown
conversion. \nresult.markdown.fit_markdown:\nThe same content
after applying any configured content filter (e.g.,
PruningContentFilter).\nExample: Using a Filter with
DefaultMarkdownGenerator\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator md_generator =
DefaultMarkdownGenerator( content_filter=PruningContentFilter(
threshold=0.4, threshold_type=\"fixed\") ) config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
markdown_generator=md_generator ) async with AsyncWebCrawler()
as crawler: result = await
crawler.arun(\"https://news.ycombinator.com\", config=config)
print(\"Raw Markdown length:\",
len(result.markdown.raw_markdown)) print(\"Fit Markdown
length:\", len(result.markdown.fit_markdown)) \nNote: If you
do not specify a content filter or markdown generator, you’ll typically see only the raw Markdown. PruningContentFilter may add around 50ms of processing time. We’ll dive deeper
into these strategies in a dedicated Markdown Generation
tutorial.\nCrawl4AI can also extract structured data (JSON)
using CSS or XPath selectors. Below is a minimal CSS-based
example:\nNew! Crawl4AI now provides a powerful utility to
automatically generate extraction schemas using LLM. This is a
one-time cost that gives you a reusable schema for fast, LLM-
free extractions:\nfrom crawl4ai.extraction_strategy import
JsonCssExtractionStrategy from crawl4ai.async_configs import
LlmConfig # Generate a schema (one-time cost) html = \"<div
class='product'><h2>Gaming Laptop</h2><span
class='price'>$999.99</span></div>\" # Using OpenAI (requires
API token) schema =
JsonCssExtractionStrategy.generate_schema( html, llmConfig =
LlmConfig(provider=\"openai/gpt-4o\",api_token=\"your-openai-
token\") # Required for OpenAI ) # Or using Ollama (open
source, no token needed) schema =
JsonCssExtractionStrategy.generate_schema( html, llmConfig =
LlmConfig(provider=\"ollama/llama3.3\", api_token=None) # Not
needed for Ollama ) # Use the schema for fast, repeated
extractions strategy = JsonCssExtractionStrategy(schema) \nFor
a complete guide on schema generation and advanced usage, see
No-LLM Extraction Strategies.\nHere's a basic extraction
example:\nimport asyncio import json from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main(): schema = { \"name\": \"Example Items\",
\"baseSelector\": \"div.item\", \"fields\": [ {\"name\":
\"title\", \"selector\": \"h2\", \"type\": \"text\"}, {\"name
\": \"link\", \"selector\": \"a\", \"type\": \"attribute\",
\"attribute\": \"href\"} ] } raw_html = \"<div class='item'>
<h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a>
</div>\" async with AsyncWebCrawler() as crawler: result =
await crawler.arun( url=\"raw://\" + raw_html,
config=CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema) ) ) #
The JSON output is stored in 'extracted_content' data =
json.loads(result.extracted_content) print(data) if __name__
== \"__main__\": asyncio.run(main()) \nWhy is this helpful? -
Great for repetitive page structures (e.g., item listings,
articles). - No AI usage or costs. - The crawler returns a
JSON string you can parse or store.\nTips: You can pass raw
HTML to the crawler instead of a URL. To do so, prefix the
HTML with raw://.\nFor more complex or irregular pages, a
language model can parse text intelligently into a structure
you define. Crawl4AI supports open-source or closed-source
providers:\nOpen-Source Models (e.g., ollama/llama3.3,
no_token) \nOpenAI Models (e.g., openai/gpt-4, requires
api_token) \nOr any provider supported by the underlying
library\nBelow is an example using open-source style (no
token) and closed-source:\nimport os import json import asyncio from typing import Dict from pydantic import BaseModel, Field from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LlmConfig from
crawl4ai.extraction_strategy import LLMExtractionStrategy
class OpenAIModelFee(BaseModel): model_name: str = Field(...,
description=\"Name of the OpenAI model.\") input_fee: str =
Field(..., description=\"Fee for input token for the OpenAI
model.\") output_fee: str = Field( ..., description=\"Fee for
output token for the OpenAI model.\" ) async def
extract_structured_data_using_llm( provider: str, api_token:
str = None, extra_headers: Dict[str, str] = None ): print(f
\"\\n--- Extracting Structured Data with {provider} ---\") if
api_token is None and provider != \"ollama\": print(f\"API
token is required for {provider}. Skipping this example.\")
return browser_config = BrowserConfig(headless=True)
extra_args = {\"temperature\": 0, \"top_p\": 0.9, \"max_tokens
\": 2000} if extra_headers: extra_args[\"extra_headers\"] =
extra_headers crawler_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
word_count_threshold=1, page_timeout=80000,
extraction_strategy=LLMExtractionStrategy( llmConfig =
LlmConfig(provider=provider,api_token=api_token),
schema=OpenAIModelFee.model_json_schema(), extraction_type=
\"schema\", instruction=\"\"\"From the crawled content,
extract all mentioned model names along with their fees for
input and output tokens. Do not miss any models in the entire
content.\"\"\", extra_args=extra_args, ), ) async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun( url=\"https://openai.com/api/pricing/\",
config=crawler_config ) print(result.extracted_content) if
__name__ == \"__main__\":
asyncio.run( extract_structured_data_using_llm( provider=
\"openai/gpt-4o\", api_token=os.getenv(\"OPENAI_API_KEY\") ) )
\nWhat’s happening? - We define a Pydantic schema
(OpenAIModelFee) describing the fields we want. - The LLM
extraction strategy uses that schema and your instructions to
transform raw text into structured JSON. - Depending on the
provider and api_token, you can use local models or a remote
API.\n7. Multi-URL Concurrency (Preview)\nIf you need to crawl
multiple URLs in parallel, you can use arun_many(). By
default, Crawl4AI employs a MemoryAdaptiveDispatcher,
automatically adjusting concurrency based on system resources.
Here’s a quick glimpse:\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def
quick_parallel_example(): urls = [ \"https://example.com/page1
\", \"https://example.com/page2\", \"https://example.com/page3
\" ] run_conf = CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
stream=True # Enable streaming mode ) async with
AsyncWebCrawler() as crawler: # Stream results as they
complete async for result in await crawler.arun_many(urls,
config=run_conf): if result.success: print(f\"[OK]
{result.url}, length: {len(result.markdown.raw_markdown)}\")
else: print(f\"[ERROR] {result.url} =>
{result.error_message}\") # Or get all results at once
(default behavior) run_conf = run_conf.clone(stream=False)
results = await crawler.arun_many(urls, config=run_conf) for
res in results: if res.success: print(f\"[OK] {res.url},
length: {len(res.markdown.raw_markdown)}\") else: print(f
\"[ERROR] {res.url} => {res.error_message}\") if __name__ ==
\"__main__\": asyncio.run(quick_parallel_example()) \nThe
example above shows two ways to handle multiple URLs: 1.
Streaming mode (stream=True): Process results as they become
available using async for 2. Batch mode (stream=False): Wait
for all results to complete\nFor more advanced concurrency
(e.g., a semaphore-based approach, adaptive memory usage
throttling, or customized rate limiting), see Advanced Multi-
URL Crawling.\n8. Dynamic Content Example\nSome sites require
multiple “page clicks” or dynamic JavaScript updates. Below is an example showing how to click a “Next Page”
button and wait for new commits to load on GitHub, using
BrowserConfig and CrawlerRunConfig:\nimport asyncio import json from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode from crawl4ai.extraction_strategy
import JsonCssExtractionStrategy async def
extract_structured_data_using_css_extractor(): print(\"\\n---
Using JsonCssExtractionStrategy for Fast Structured
Output ---\") schema = { \"name\": \"KidoCode Courses\",
\"baseSelector\": \"section.charge-methodology .w-tab-
content > div\", \"fields\": [ { \"name\": \"section_title\",
\"selector\": \"h3.heading-50\", \"type\": \"text\", },
{ \"name\": \"section_description\", \"selector\": \".charge-
content\", \"type\": \"text\", }, { \"name\": \"course_name\",
\"selector\": \".text-block-93\", \"type\": \"text\", },
{ \"name\": \"course_description\", \"selector\": \".course-
content-text\", \"type\": \"text\", }, { \"name\":
\"course_icon\", \"selector\": \".image-92\", \"type\":
\"attribute\", \"attribute\": \"src\", }, ], } browser_config
= BrowserConfig(headless=True, java_script_enabled=True)
js_click_tabs = \"\"\" (async () => { const tabs =
document.querySelectorAll(\"section.charge-methodology .tabs-
menu-3 > div\"); for(let tab of tabs) { tab.scrollIntoView();
tab.click(); await new Promise(r => setTimeout(r, 500)); } })
(); \"\"\" crawler_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=[js_click_tabs], ) async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun( url=
\"https://www.kidocode.com/degrees/technology\",
config=crawler_config ) companies =
json.loads(result.extracted_content) print(f\"Successfully
extracted {len(companies)} companies\")
print(json.dumps(companies[0], indent=2)) async def main():
await extract_structured_data_using_css_extractor() if
__name__ == \"__main__\": asyncio.run(main()) \nKey Points:
\nBrowserConfig(headless=False): We want to watch it click “Next Page.” \nCrawlerRunConfig(...): We specify the
extraction strategy, pass session_id to reuse the same page.
\njs_code and wait_for are used for subsequent pages (page >
0) to click the “Next” button and wait for new commits to
load. \njs_only=True indicates we’re not re-navigating but
continuing the existing session. \nFinally, we call
kill_session() to clean up the page and browser session.\n9.
Next Steps\nCongratulations! You have:\nPerformed a basic
crawl and printed Markdown. \nUsed content filters with a
markdown generator. \nExtracted JSON via CSS or LLM
strategies. \nHandled dynamic pages with JavaScript triggers.
\nIf you’re ready for more, check out:\nInstallation: A
deeper dive into advanced installs, Docker usage
(experimental), or optional dependencies. \nHooks & Auth:
Learn how to run custom JavaScript or handle logins with
cookies, local storage, etc. \nDeployment: Explore ephemeral
testing in Docker or plan for the upcoming stable Docker
release. \nBrowser Management: Delve into user simulation,
stealth modes, and concurrency best practices. \nCrawl4AI is a
powerful, flexible tool. Enjoy building out your scrapers,
data pipelines, or AI-driven extraction flows. Happy
crawling!",
"markdown": "# Quick Start - Crawl4AI Documentation
(v0.5.x)\n\n## Getting Started with Crawl4AI\n\nWelcome to
**Crawl4AI**, an open-source LLM-friendly Web Crawler &
Scraper. In this tutorial, you’ll:\n\n1. Run your **first
crawl** using minimal configuration.\n2. Generate
**Markdown** output (and learn how it’s influenced by
content filters).\n3. Experiment with a simple **CSS-based
extraction** strategy.\n4. See a glimpse of **LLM-based
extraction** (including open-source and closed-source model
options).\n5. Crawl a **dynamic** page that loads content via
JavaScript.\n\n* * *\n\n## 1\\. Introduction\n\nCrawl4AI
provides:\n\n* An asynchronous crawler,
**`AsyncWebCrawler`**.\n* Configurable browser and run
settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
\n* Automatic HTML-to-Markdown conversion via
**`DefaultMarkdownGenerator`** (supports optional filters).\n*
Multiple extraction strategies (LLM-based or “traditional”
CSS/XPath-based).\n\nBy the end of this guide, you’ll have
performed a basic crawl, generated Markdown, tried out two
extraction strategies, and crawled a dynamic page that uses “Load More” buttons or JavaScript updates.\n\n* * *\n\n## 2
\\. Your First Crawl\n\nHere’s a minimal Python script that
creates an **`AsyncWebCrawler`**, fetches a webpage, and
prints the first 300 characters of its Markdown output:\n
\n`import asyncio from crawl4ai import AsyncWebCrawler async
def main(): async with AsyncWebCrawler() as crawler:
result = await crawler.arun(\"https://example.com\")
print(result.markdown[:300]) # Print first 300 chars if
__name__ == \"__main__\": asyncio.run(main())`\n\n**Whatâ
€™s happening?** - **`AsyncWebCrawler`** launches a headless
browser (Chromium by default). - It fetches
`https://example.com`. - Crawl4AI automatically converts the
HTML into Markdown.\n\nYou now have a simple, working crawl!\n
\n* * *\n\n## 3\\. Basic Configuration (Light Introduction)\n
\nCrawl4AI’s crawler can be heavily customized using two
main classes:\n\n1. **`BrowserConfig`**: Controls browser
behavior (headless or full UI, user agent, JavaScript toggles,
etc.). \n2. **`CrawlerRunConfig`**: Controls how each crawl
runs (caching, extraction, timeouts, hooking, etc.).\n\nBelow
is an example with minimal usage:\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode async def main():
browser_conf = BrowserConfig(headless=True) # or False to see
the browser run_conf =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS )
async with AsyncWebCrawler(config=browser_conf) as crawler:
result = await crawler.arun( url=
\"https://example.com\", config=run_conf )
print(result.markdown) if __name__ == \"__main__\":
asyncio.run(main())`\n\n> IMPORTANT: By default cache mode is
set to `CacheMode.ENABLED`. So to have fresh content, you need
to set it to `CacheMode.BYPASS`\n\nWe’ll explore more
advanced config in later tutorials (like enabling proxies, PDF
output, multi-tab sessions, etc.). For now, just note how you
pass these objects to manage crawling.\n\n* * *\n\n## 4\\.
Generating Markdown Output\n\nBy default, Crawl4AI
automatically generates Markdown from each crawled page.
However, the exact output depends on whether you specify a
**markdown generator** or **content filter**.\n\n*
**`result.markdown`**: \n The direct HTML-to-Markdown
conversion.\n* **`result.markdown.fit_markdown`**: \n
The same content after applying any configured **content
filter** (e.g., `PruningContentFilter`).\n\n### Example: Using
a Filter with `DefaultMarkdownGenerator`\n\n`from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator md_generator =
DefaultMarkdownGenerator( content_filter=PruningContentFil
ter(threshold=0.4, threshold_type=\"fixed\") ) config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
markdown_generator=md_generator ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://news.ycombinator.com\", config=config)
print(\"Raw Markdown length:\",
len(result.markdown.raw_markdown)) print(\"Fit Markdown
length:\", len(result.markdown.fit_markdown))`\n\n**Note**: If
you do **not** specify a content filter or markdown generator,
you’ll typically see only the raw Markdown.
`PruningContentFilter` may add around `50ms` of processing
time. We’ll dive deeper into these strategies in a dedicated
**Markdown Generation** tutorial.\n\n* * *\n\nCrawl4AI can
also extract structured data (JSON) using CSS or XPath
selectors. Below is a minimal CSS-based example:\n\n> **New!**
Crawl4AI now provides a powerful utility to automatically
generate extraction schemas using LLM. This is a one-time cost
that gives you a reusable schema for fast, LLM-free
extractions:\n\n`from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy from crawl4ai.async_configs import
LlmConfig # Generate a schema (one-time cost) html = \"<div
class='product'><h2>Gaming Laptop</h2><span
class='price'>$999.99</span></div>\" # Using OpenAI (requires
API token) schema =
JsonCssExtractionStrategy.generate_schema( html,
llmConfig = LlmConfig(provider=\"openai/gpt-4o\",api_token=
\"your-openai-token\") # Required for OpenAI ) # Or using
Ollama (open source, no token needed) schema =
JsonCssExtractionStrategy.generate_schema( html,
llmConfig = LlmConfig(provider=\"ollama/llama3.3\",
api_token=None) # Not needed for Ollama ) # Use the schema
for fast, repeated extractions strategy =
JsonCssExtractionStrategy(schema)`\n\nFor a complete guide on
schema generation and advanced usage, see [No-LLM Extraction
Strategies](https://crawl4ai.com/mkdocs/extraction/no-llm-
strategies/).\n\nHere's a basic extraction example:\n\n`import
asyncio import json from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode from crawl4ai.extraction_strategy
import JsonCssExtractionStrategy async def main(): schema
= { \"name\": \"Example Items\",
\"baseSelector\": \"div.item\", \"fields\":
[ {\"name\": \"title\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"link\",
\"selector\": \"a\", \"type\": \"attribute\", \"attribute\":
\"href\"} ] } raw_html = \"<div class='item'>
<h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a>
</div>\" async with AsyncWebCrawler() as crawler:
result = await crawler.arun( url=\"raw://\" +
raw_html,
config=CrawlerRunConfig( cache_mode=CacheMode.
BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema)
) ) # The JSON output is stored in
'extracted_content' data =
json.loads(result.extracted_content) print(data) if
__name__ == \"__main__\": asyncio.run(main())`\n\n**Why is
this helpful?** - Great for repetitive page structures (e.g.,
item listings, articles). - No AI usage or costs. - The
crawler returns a JSON string you can parse or store.\n\n>
Tips: You can pass raw HTML to the crawler instead of a URL.
To do so, prefix the HTML with `raw://`.\n\n* * *\n\nFor more
complex or irregular pages, a language model can parse text
intelligently into a structure you define. Crawl4AI supports
**open-source** or **closed-source** providers:\n\n* **Open-
Source Models** (e.g., `ollama/llama3.3`, `no_token`)\n*
**OpenAI Models** (e.g., `openai/gpt-4`, requires
`api_token`)\n* Or any provider supported by the underlying
library\n\nBelow is an example using **open-source** style (no
token) and closed-source:\n\n`import os import json import asyncio from typing import Dict from pydantic import BaseModel, Field from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LlmConfig from
crawl4ai.extraction_strategy import LLMExtractionStrategy
class OpenAIModelFee(BaseModel): model_name: str =
Field(..., description=\"Name of the OpenAI model.\")
input_fee: str = Field(..., description=\"Fee for input token
for the OpenAI model.\") output_fee: str =
Field( ..., description=\"Fee for output token for the
OpenAI model.\" ) async def
extract_structured_data_using_llm( provider: str,
api_token: str = None, extra_headers: Dict[str, str] = None ):
print(f\"\\n--- Extracting Structured Data with
{provider} ---\") if api_token is None and provider !=
\"ollama\": print(f\"API token is required for
{provider}. Skipping this example.\") return
browser_config = BrowserConfig(headless=True) extra_args
= {\"temperature\": 0, \"top_p\": 0.9, \"max_tokens\": 2000}
if extra_headers: extra_args[\"extra_headers\"] =
extra_headers crawler_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
word_count_threshold=1, page_timeout=80000,
extraction_strategy=LLMExtractionStrategy( llmConf
ig = LlmConfig(provider=provider,api_token=api_token),
schema=OpenAIModelFee.model_json_schema(),
extraction_type=\"schema\", instruction=\"\"\"From
the crawled content, extract all mentioned model names along
with their fees for input and output tokens. Do
not miss any models in the entire content.\"\"\",
extra_args=extra_args, ), ) async with
AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun( url=
\"https://openai.com/api/pricing/\",
config=crawler_config )
print(result.extracted_content) if __name__ == \"__main__\":
asyncio.run( extract_structured_data_using_llm(
provider=\"openai/gpt-4o\",
api_token=os.getenv(\"OPENAI_API_KEY\") ) )`\n
\n**What’s happening?** - We define a Pydantic schema
(`OpenAIModelFee`) describing the fields we want. - The LLM
extraction strategy uses that schema and your instructions to
transform raw text into structured JSON. - Depending on the
**provider** and **api\\_token**, you can use local models or
a remote API.\n\n* * *\n\n## 7\\. Multi-URL Concurrency
(Preview)\n\nIf you need to crawl multiple URLs in
**parallel**, you can use `arun_many()`. By default, Crawl4AI
employs a **MemoryAdaptiveDispatcher**, automatically
adjusting concurrency based on system resources. Here’s a
quick glimpse:\n\n`import asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def
quick_parallel_example(): urls =
[ \"https://example.com/page1\",
\"https://example.com/page2\",
\"https://example.com/page3\" ] run_conf =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
stream=True # Enable streaming mode ) async with
AsyncWebCrawler() as crawler: # Stream results as they
complete async for result in await
crawler.arun_many(urls, config=run_conf): if
result.success: print(f\"[OK] {result.url},
length: {len(result.markdown.raw_markdown)}\")
else: print(f\"[ERROR] {result.url} =>
{result.error_message}\") # Or get all results at
once (default behavior) run_conf =
run_conf.clone(stream=False) results = await
crawler.arun_many(urls, config=run_conf) for res in
results: if res.success: print(f
\"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}\")
else: print(f\"[ERROR] {res.url} =>
{res.error_message}\") if __name__ == \"__main__\":
asyncio.run(quick_parallel_example())`\n\nThe example above
shows two ways to handle multiple URLs: 1. **Streaming mode**
(`stream=True`): Process results as they become available
using `async for` 2. **Batch mode** (`stream=False`): Wait for
all results to complete\n\nFor more advanced concurrency
(e.g., a **semaphore-based** approach, **adaptive memory usage
throttling**, or customized rate limiting), see [Advanced
Multi-URL Crawling]
(https://crawl4ai.com/mkdocs/advanced/multi-url-crawling/).\n
\n* * *\n\n## 8\\. Dynamic Content Example\n\nSome sites
require multiple “page clicks” or dynamic JavaScript updates. Below is an example showing how to **click** a “Next Page” button and wait for new commits to load on
GitHub, using **`BrowserConfig`** and **`CrawlerRunConfig`**:
\n\n`import asyncio import json from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def extract_structured_data_using_css_extractor():
print(\"\\n--- Using JsonCssExtractionStrategy for Fast
Structured Output ---\") schema = { \"name\":
\"KidoCode Courses\", \"baseSelector\":
\"section.charge-methodology .w-tab-content > div\",
\"fields\": [ { \"name\":
\"section_title\", \"selector\":
\"h3.heading-50\", \"type\": \"text
\", }, { \"name\":
\"section_description\", \"selector\":
\".charge-content\", \"type\": \"text
\", }, { \"name\":
\"course_name\", \"selector\": \".text-
block-93\", \"type\": \"text\", },
{ \"name\": \"course_description\",
\"selector\": \".course-content-text\", \"type
\": \"text\", },
{ \"name\": \"course_icon\",
\"selector\": \".image-92\", \"type\":
\"attribute\", \"attribute\": \"src
\", }, ], } browser_config =
BrowserConfig(headless=True, java_script_enabled=True)
js_click_tabs = \"\"\" (async () => { const tabs =
document.querySelectorAll(\"section.charge-methodology .tabs-
menu-3 > div\"); for(let tab of tabs)
{ tab.scrollIntoView(); tab.click();
await new Promise(r => setTimeout(r, 500)); } })
(); \"\"\" crawler_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=[js_click_tabs], ) async with
AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun( url=
\"https://www.kidocode.com/degrees/technology\",
config=crawler_config ) companies =
json.loads(result.extracted_content) print(f
\"Successfully extracted {len(companies)} companies\")
print(json.dumps(companies[0], indent=2)) async def main():
await extract_structured_data_using_css_extractor() if
__name__ == \"__main__\": asyncio.run(main())`\n\n**Key
Points**:\n\n* **`BrowserConfig(headless=False)`**: We want
to watch it click “Next Page.” \n*
**`CrawlerRunConfig(...)`**: We specify the extraction
strategy, pass `session_id` to reuse the same page.\n*
**`js_code`** and **`wait_for`** are used for subsequent pages
(`page > 0`) to click the “Next” button and wait for new
commits to load.\n* **`js_only=True`** indicates we’re not
re-navigating but continuing the existing session.\n*
Finally, we call `kill_session()` to clean up the page and
browser session.\n\n* * *\n\n## 9\\. Next Steps\n
\nCongratulations! You have:\n\n1. Performed a basic crawl
and printed Markdown.\n2. Used **content filters** with a
markdown generator.\n3. Extracted JSON via **CSS** or **LLM**
strategies.\n4. Handled **dynamic** pages with JavaScript
triggers.\n\nIf you’re ready for more, check out:\n\n*
**Installation**: A deeper dive into advanced installs, Docker
usage (experimental), or optional dependencies.\n* **Hooks &
Auth**: Learn how to run custom JavaScript or handle logins
with cookies, local storage, etc.\n* **Deployment**: Explore
ephemeral testing in Docker or plan for the upcoming stable
Docker release.\n* **Browser Management**: Delve into user
simulation, stealth modes, and concurrency best practices.\n
\nCrawl4AI is a powerful, flexible tool. Enjoy building out
your scrapers, data pipelines, or AI-driven extraction flows.
Happy crawling!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/installation/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/core/installation/",
"loadedTime": "2025-03-05T23:16:13.542Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/core/installation/",
"title": "Installation - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:12 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"470ef85684d5b07f2dde0de4e9919a59\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Installation - Crawl4AI Documentation
(v0.5.x)\nInstallation & Setup (2023 Edition)\n1. Basic
Installation\npip install crawl4ai \nThis installs the core Crawl4AI library along
with essential dependencies. No advanced features (like
transformers or PyTorch) are included yet.\n2. Initial Setup &
Diagnostics\n2.1 Run the Setup Command\nAfter installing,
call:\ncrawl4ai-setup \nWhat does it do? - Installs or updates required
Playwright browsers (Chromium, Firefox, etc.) - Performs OS-
level checks (e.g., missing libs on Linux) - Confirms your
environment is ready to crawl\n2.2 Diagnostics\nOptionally,
you can run diagnostics to confirm everything is functioning:
\nThis command attempts to: - Check Python version
compatibility - Verify Playwright installation - Inspect
environment variables or library conflicts\nIf any issues
arise, follow its suggestions (e.g., installing additional
system packages) and re-run crawl4ai-setup.\n3. Verifying
Installation: A Simple Crawl (Skip this step if you already
ran crawl4ai-doctor)\nBelow is a minimal Python script
demonstrating a basic crawl. It uses our new BrowserConfig and
CrawlerRunConfig for clarity, though no custom settings are
passed in this example:\nimport asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def
main(): async with AsyncWebCrawler() as crawler: result =
await crawler.arun( url=\"https://www.example.com\", )
print(result.markdown[:300]) # Show the first 300 characters
of extracted text if __name__ == \"__main__\":
asyncio.run(main()) \nExpected outcome: - A headless browser
session loads example.com - Crawl4AI returns ~300 characters
of markdown.\nIf errors occur, rerun crawl4ai-doctor or
manually ensure Playwright is installed correctly.\n4.
Advanced Installation (Optional)\nWarning: Only install these
if you truly need them. They bring in larger dependencies,
including big models, which can increase disk usage and memory
load significantly.\n4.1 Torch, Transformers, or All\nText
Clustering (Torch)\npip install crawl4ai[torch] crawl4ai-setup
\nInstalls PyTorch-based features (e.g., cosine similarity or
advanced semantic chunking). \nTransformers\npip install
crawl4ai[transformer] crawl4ai-setup \nAdds Hugging Face-based
summarization or generation strategies. \nAll Features\npip
install crawl4ai[all] crawl4ai-setup \n(Optional) Pre-Fetching
Models\nThis step caches large models locally (if needed).
Only do this if your workflow requires them. \n5. Docker
(Experimental)\nWe provide a temporary Docker approach for
testing. It’s not stable and may break with future releases.
We plan a major Docker revamp in a future stable version, 2025
Q1. If you still want to try:\ndocker pull
unclecode/crawl4ai:basic docker run -p 11235:11235
unclecode/crawl4ai:basic \nYou can then make POST requests to
http://localhost:11235/crawl to perform crawls. Production
usage is discouraged until our new Docker approach is ready
(planned in Jan or Feb 2025).\n6. Local Server Mode
(Legacy)\nSome older docs mention running Crawl4AI as a local
server. This approach has been partially replaced by the new
Docker-based prototype and upcoming stable server release. You
can experiment, but expect major changes. Official local
server instructions will arrive once the new Docker
architecture is finalized.\nSummary\n1. Install with pip
install crawl4ai and run crawl4ai-setup. 2. Diagnose with
crawl4ai-doctor if you see errors. 3. Verify by crawling
example.com with minimal BrowserConfig + CrawlerRunConfig. 4.
Advanced features (Torch, Transformers) are optional—avoid
them if you don’t need them (they significantly increase
resource usage). 5. Docker is experimental—use at your own
risk until the stable version is released. 6. Local server
references in older docs are largely deprecated; a new
solution is in progress.\nGot questions? Check GitHub issues
for updates or ask the community!",
"markdown": "# Installation - Crawl4AI Documentation
(v0.5.x)\n\n## Installation & Setup (2023 Edition)\n\n## 1\\.
Basic Installation\n\n`pip install crawl4ai`\n\nThis installs the **core** Crawl4AI
library along with essential dependencies. **No** advanced
features (like transformers or PyTorch) are included yet.\n
\n## 2\\. Initial Setup & Diagnostics\n\n### 2.1 Run the Setup
Command\n\nAfter installing, call:\n\n`crawl4ai-setup`\n\n**What does it do?** -
Installs or updates required Playwright browsers (Chromium,
Firefox, etc.) - Performs OS-level checks (e.g., missing libs
on Linux) - Confirms your environment is ready to crawl\n\n###
2.2 Diagnostics\n\nOptionally, you can run **diagnostics** to
confirm everything is functioning:\n\n`crawl4ai-doctor`\n\nThis command attempts
to: - Check Python version compatibility - Verify Playwright
installation - Inspect environment variables or library
conflicts\n\nIf any issues arise, follow its suggestions
(e.g., installing additional system packages) and re-run
`crawl4ai-setup`.\n\n* * *\n\n## 3\\. Verifying Installation:
A Simple Crawl (Skip this step if you already ran `crawl4ai-
doctor`)\n\nBelow is a minimal Python script demonstrating a
**basic** crawl. It uses our new **`BrowserConfig`** and
**`CrawlerRunConfig`** for clarity, though no custom settings
are passed in this example:\n\n`import asyncio from crawl4ai
import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async
def main(): async with AsyncWebCrawler() as crawler:
result = await crawler.arun( url=
\"https://www.example.com\", )
print(result.markdown[:300]) # Show the first 300 characters
of extracted text if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Expected** outcome: - A headless
browser session loads `example.com` - Crawl4AI returns ~300
characters of markdown. \nIf errors occur, rerun `crawl4ai-
doctor` or manually ensure Playwright is installed correctly.
\n\n* * *\n\n## 4\\. Advanced Installation (Optional)\n
\n**Warning**: Only install these **if you truly need them**.
They bring in larger dependencies, including big models, which
can increase disk usage and memory load significantly.\n\n###
4.1 Torch, Transformers, or All\n\n* **Text Clustering
(Torch)** \n \n `pip install crawl4ai[torch] crawl4ai-
setup`\n \n Installs PyTorch-based features (e.g.,
cosine similarity or advanced semantic chunking).\n*
**Transformers** \n \n `pip install
crawl4ai[transformer] crawl4ai-setup`\n \n Adds Hugging
Face-based summarization or generation strategies.\n* **All
Features** \n \n `pip install crawl4ai[all] crawl4ai-
setup`\n \n\n#### (Optional) Pre-Fetching Models\n\nThis
step caches large models locally (if needed). **Only do
this** if your workflow requires them.\n\n* * *\n\n## 5\\.
Docker (Experimental)\n\nWe provide a **temporary** Docker
approach for testing. **It’s not stable and may break**
with future releases. We plan a major Docker revamp in a
future stable version, 2025 Q1. If you still want to try:\n
\n`docker pull unclecode/crawl4ai:basic docker run -p
11235:11235 unclecode/crawl4ai:basic`\n\nYou can then make
POST requests to `http://localhost:11235/crawl` to perform
crawls. **Production usage** is discouraged until our new
Docker approach is ready (planned in Jan or Feb 2025).\n\n* *
*\n\n## 6\\. Local Server Mode (Legacy)\n\nSome older docs
mention running Crawl4AI as a local server. This approach has
been **partially replaced** by the new Docker-based prototype
and upcoming stable server release. You can experiment, but
expect major changes. Official local server instructions will
arrive once the new Docker architecture is finalized.\n\n* * *
\n\n## Summary\n\n1. **Install** with `pip install crawl4ai`
and run `crawl4ai-setup`. 2. **Diagnose** with `crawl4ai-
doctor` if you see errors. 3. **Verify** by crawling
`example.com` with minimal `BrowserConfig` +
`CrawlerRunConfig`. 4. **Advanced** features (Torch,
Transformers) are **optional**—avoid them if you don’t
need them (they significantly increase resource usage). 5. **Docker** is **experimental**—use at your own risk until
the stable version is released. 6. **Local server**
references in older docs are largely deprecated; a new
solution is in progress.\n\n**Got questions?** Check [GitHub
issues](https://github.com/unclecode/crawl4ai/issues) for
updates or ask the community!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/docker-
deployment/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/docker-
deployment/",
"loadedTime": "2025-03-05T23:16:14.565Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/docker-
deployment/",
"title": "Docker Deployment - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:12 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"097ac26341194f822975a57d00d0896d\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Docker Deployment - Crawl4AI Documentation
(v0.5.x)\nCrawl4AI provides official Docker images for easy
deployment and scalability. This guide covers installation,
configuration, and usage of Crawl4AI in Docker environments.
\nQuick Start 🚀\nPull and run the basic version:\n# Basic
run without security docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic # Run with
API security enabled docker run -p 11235:11235 -e
CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:basic
\nRunning with Docker Compose 🐳\nUse Docker Compose (From
Local Dockerfile or Docker Hub)\nCrawl4AI provides flexibility
to use Docker Compose for managing your containerized
services. You can either build the image locally from the
provided Dockerfile or use the pre-built image from Docker
Hub.\nOption 1: Using Docker Compose to Build Locally\nIf you
want to build the image locally, use the provided docker-
compose.local.yml file.\ndocker-compose -f docker-
compose.local.yml up -d \nThis will: 1. Build the Docker image
from the provided Dockerfile. 2. Start the container and
expose it on http://localhost:11235.\nOption 2: Using Docker
Compose with Pre-Built Image from Hub\nIf you prefer using the
pre-built image on Docker Hub, use the docker-compose.hub.yml
file.\ndocker-compose -f docker-compose.hub.yml up -d \nThis
will: 1. Pull the pre-built image unclecode/crawl4ai:basic (or
all, depending on your configuration). 2. Start the container
and expose it on http://localhost:11235.\nStopping the Running
Services\nTo stop the services started via Docker Compose, you
can use:\ndocker-compose -f docker-compose.local.yml down # OR
docker-compose -f docker-compose.hub.yml down \nIf the
containers don’t stop and the application is still running,
check the running containers:\ndocker ps \nFind the CONTAINER ID of the
running service and stop it forcefully:\ndocker stop
<CONTAINER_ID> \nDebugging with Docker Compose\nCheck Logs: To
view the container logs: \ndocker-compose -f docker-
compose.local.yml logs -f \nRemove Orphaned Containers: If the
service is still running unexpectedly: \ndocker-compose -f
docker-compose.local.yml down --remove-orphans \nManually
Remove Network: If the network is still in use: \ndocker
network ls docker network rm crawl4ai_default \nWhy Use Docker
Compose?\nDocker Compose is the recommended way to deploy
Crawl4AI because: 1. It simplifies multi-container setups. 2.
Allows you to define environment variables, resources, and
ports in a single file. 3. Makes it easier to switch between
local development and production-ready images.\nFor example,
your docker-compose.yml could include API keys, token
settings, and memory limits, making deployment quick and
consistent.\nAPI Security 🔒\nUnderstanding
CRAWL4AI_API_TOKEN\nThe CRAWL4AI_API_TOKEN provides optional
security for your Crawl4AI instance:\nIf CRAWL4AI_API_TOKEN is
set: All API endpoints (except /health) require authentication
\nIf CRAWL4AI_API_TOKEN is not set: The API is publicly
accessible\n# Secured Instance docker run -p 11235:11235 -e
CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:all #
Unsecured Instance docker run -p 11235:11235
unclecode/crawl4ai:all \nMaking API Calls\nFor secured
instances, include the token in all requests:\nimport requests
# Setup headers if token is being used api_token =
\"your_secret_token\" # Same token set in CRAWL4AI_API_TOKEN
headers = {\"Authorization\": f\"Bearer {api_token}\"} if
api_token else {} # Making authenticated requests response =
requests.post( \"http://localhost:11235/crawl\",
headers=headers, json={ \"urls\": \"https://example.com\",
\"priority\": 10 } ) # Checking task status task_id =
response.json()[\"task_id\"] status = requests.get( f
\"http://localhost:11235/task/{task_id}\", headers=headers )
\nUsing with Docker Compose\nIn your docker-compose.yml:
\nservices: crawl4ai: image: unclecode/crawl4ai:all
environment: - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} #
Optional # ... other configuration \nThen either: 1. Set
in .env file: \nCRAWL4AI_API_TOKEN=your_secret_token \nOr set
via command line: \nCRAWL4AI_API_TOKEN=your_secret_token
docker-compose up \nSecurity Note: If you enable the API
token, make sure to keep it secure and never commit it to
version control. The token will be required for all API
endpoints except the health check endpoint (/health).
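For example, a quick liveness probe works with or without a token (a minimal sketch, assuming a local instance on the default port 11235):
import requests

# /health is the one endpoint that never requires the Authorization header
print(requests.get("http://localhost:11235/health").json())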
\nConfiguration Options 🔧\nEnvironment Variables\nYou can
configure the service using environment variables:\n# Basic
configuration docker run -p 11235:11235 \\ -e
MAX_CONCURRENT_TASKS=5 \\ unclecode/crawl4ai:all # With
security and LLM support docker run -p 11235:11235 \\ -e
CRAWL4AI_API_TOKEN=your_secret_token \\ -e
OPENAI_API_KEY=sk-... \\ -e ANTHROPIC_API_KEY=sk-ant-... \\
unclecode/crawl4ai:all \nUsing Docker Compose (Recommended) 🐳\nCreate a docker-compose.yml:\nversion: '3.8' services:
crawl4ai: image: unclecode/crawl4ai:all ports: - \"11235:11235
\" environment: - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} #
Optional API security - MAX_CONCURRENT_TASKS=5 # LLM Provider
Keys - OPENAI_API_KEY=${OPENAI_API_KEY:-} - ANTHROPIC_API_KEY=
${ANTHROPIC_API_KEY:-} volumes: - /dev/shm:/dev/shm deploy:
resources: limits: memory: 4G reservations: memory: 1G \nYou
can run it in two ways:\nUsing environment variables directly:
\nCRAWL4AI_API_TOKEN=secret123 OPENAI_API_KEY=sk-... docker-
compose up \nUsing a .env file (recommended): Create a .env
file in the same directory: \n# API Security (optional)
CRAWL4AI_API_TOKEN=your_secret_token # LLM Provider Keys
OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-ant-... # Other
Configuration MAX_CONCURRENT_TASKS=5 \nThen simply run: \ndocker-compose up
\nTesting the Deployment 🧪\nimport requests # For unsecured
instances def test_unsecured(): # Health check health =
requests.get(\"http://localhost:11235/health\") print(\"Health
check:\", health.json()) # Basic crawl response =
requests.post( \"http://localhost:11235/crawl\", json={ \"urls
\": \"https://www.nbcnews.com/business\", \"priority\": 10 } )
task_id = response.json()[\"task_id\"] print(\"Task ID:\",
task_id) # For secured instances def test_secured(api_token):
headers = {\"Authorization\": f\"Bearer {api_token}\"} # Basic
crawl with authentication response =
requests.post( \"http://localhost:11235/crawl\",
headers=headers, json={ \"urls\":
\"https://www.nbcnews.com/business\", \"priority\": 10 } )
task_id = response.json()[\"task_id\"] print(\"Task ID:\",
task_id) \nWhen you've configured your LLM provider keys (via
environment variables or .env), you can use LLM extraction:
\nrequest = { \"urls\": \"https://example.com\",
\"extraction_config\": { \"type\": \"llm\", \"params\":
{ \"provider\": \"openai/gpt-4\", \"instruction\": \"Extract
main topics from the page\" } } } # Make the request (add
headers if using API security) response =
requests.post(\"http://localhost:11235/crawl\", json=request)
\nNote: Remember to add .env to your .gitignore to keep your
API keys secure!\nUsage Examples 📝\nBasic Crawling\nrequest
= { \"urls\": \"https://www.nbcnews.com/business\", \"priority
\": 10 } response =
requests.post(\"http://localhost:11235/crawl\", json=request)
task_id = response.json()[\"task_id\"] # Get results result =
requests.get(f\"http://localhost:11235/task/{task_id}\")
\nschema = { \"name\": \"Crypto Prices\", \"baseSelector\":
\".cds-tableRow-t45thuk\", \"fields\": [ { \"name\": \"crypto
\", \"selector\": \"td:nth-child(1) h2\", \"type\": \"text
\", }, { \"name\": \"price\", \"selector\": \"td:nth-
child(2)\", \"type\": \"text\", } ], } request = { \"urls\":
\"https://www.coinbase.com/explore\", \"extraction_config\":
{ \"type\": \"json_css\", \"params\": {\"schema\": schema} } }
\nDynamic Content Handling\nrequest = { \"urls\":
\"https://www.nbcnews.com/business\", \"js_code\": [ \"const
loadMoreButton =
Array.from(document.querySelectorAll('button')).find(button =>
button.textContent.includes('Load More')); loadMoreButton &&
loadMoreButton.click();\" ], \"wait_for\": \"article.tease-
card:nth-child(10)\" } \nrequest = { \"urls\":
\"https://www.nbcnews.com/business\", \"extraction_config\":
{ \"type\": \"cosine\", \"params\": { \"semantic_filter\":
\"business finance economy\", \"word_count_threshold\": 10,
\"max_dist\": 0.2, \"top_k\": 3 } } } \nPlatform-Specific
Instructions 💻\nmacOS\ndocker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic \nUbuntu\n#
Basic version docker pull unclecode/crawl4ai:basic docker
run -p 11235:11235 unclecode/crawl4ai:basic # With GPU support
docker pull unclecode/crawl4ai:gpu docker run --gpus all -p
11235:11235 unclecode/crawl4ai:gpu \nWindows
(PowerShell)\ndocker pull unclecode/crawl4ai:basic docker
run -p 11235:11235 unclecode/crawl4ai:basic \nTesting 🧪
\nSave this as test_docker.py:\nimport requests import json
import time import sys class Crawl4AiTester: def
__init__(self, base_url: str = \"http://localhost:11235\"):
self.base_url = base_url def submit_and_wait(self,
request_data: dict, timeout: int = 300) -> dict: # Submit
crawl job response = requests.post(f\"{self.base_url}/crawl\",
json=request_data) task_id = response.json()[\"task_id\"]
print(f\"Task ID: {task_id}\") # Poll for result start_time =
time.time() while True: if time.time() - start_time > timeout:
raise TimeoutError(f\"Task {task_id} timeout\") result =
requests.get(f\"{self.base_url}/task/{task_id}\") status =
result.json() if status[\"status\"] == \"completed\": return
status time.sleep(2) def test_deployment(): tester =
Crawl4AiTester() # Test basic crawl request = { \"urls\":
\"https://www.nbcnews.com/business\", \"priority\": 10 }
result = tester.submit_and_wait(request) print(\"Basic crawl
successful!\") print(f\"Content length: {len(result['result']
['markdown'])}\") if __name__ == \"__main__\":
test_deployment() \nAdvanced Configuration ⚙️ \nCrawler
Parameters\nThe crawler_params field allows you to configure
the browser instance and crawling behavior. Here are key
parameters you can use:\nrequest = { \"urls\":
\"https://example.com\", \"crawler_params\": { # Browser
Configuration \"headless\": True, # Run in headless mode
\"browser_type\": \"chromium\", # chromium/firefox/webkit
\"user_agent\": \"custom-agent\", # Custom user agent \"proxy
\": \"http://proxy:8080\", # Proxy configuration # Performance
& Behavior \"page_timeout\": 30000, # Page load timeout (ms)
\"verbose\": True, # Enable detailed logging \"semaphore_count
\": 5, # Concurrent request limit # Anti-Detection Features
\"simulate_user\": True, # Simulate human behavior \"magic\":
True, # Advanced anti-detection \"override_navigator\": True,
# Override navigator properties # Session Management
\"user_data_dir\": \"./browser-data\", # Browser profile
location \"use_managed_browser\": True, # Use persistent
browser } } \nThe extra field allows passing additional
parameters directly to the crawler's arun function:\nrequest =
{ \"urls\": \"https://example.com\", \"extra\":
{ \"word_count_threshold\": 10, # Min words per block
\"only_text\": True, # Extract only text \"bypass_cache\":
True, # Force fresh crawl \"process_iframes\": True, # Include
iframe content } } \nComplete Examples\n1. Advanced News
Crawling \nrequest = { \"urls\":
\"https://www.nbcnews.com/business\", \"crawler_params\":
{ \"headless\": True, \"page_timeout\": 30000,
\"remove_overlay_elements\": True # Remove popups }, \"extra
\": { \"word_count_threshold\": 50, # Longer content blocks
\"bypass_cache\": True # Fresh content }, \"css_selector\":
\".article-body\" } \n2. Anti-Detection Configuration
\nrequest = { \"urls\": \"https://example.com\",
\"crawler_params\": { \"simulate_user\": True, \"magic\":
True, \"override_navigator\": True, \"user_agent\":
\"Mozilla/5.0 ...\", \"headers\": { \"Accept-Language\": \"en-
US,en;q=0.9\" } } } \n3. LLM Extraction with Custom Parameters
\nrequest = { \"urls\": \"https://openai.com/pricing\",
\"extraction_config\": { \"type\": \"llm\", \"params\":
{ \"provider\": \"openai/gpt-4\", \"schema\":
pricing_schema } }, \"crawler_params\": { \"verbose\": True,
\"page_timeout\": 60000 }, \"extra\": { \"word_count_threshold
\": 1, \"only_text\": True } } \n4. Session-Based Dynamic
Content \nrequest = { \"urls\": \"https://example.com\",
\"crawler_params\": { \"session_id\": \"dynamic_session\",
\"headless\": False, \"page_timeout\": 60000 }, \"js_code\":
[\"window.scrollTo(0, document.body.scrollHeight);\"],
\"wait_for\": \"js:() =>
document.querySelectorAll('.item').length > 10\", \"extra\":
{ \"delay_before_return_html\": 2.0 } } \n5. Screenshot with
Custom Timing \nrequest = { \"urls\": \"https://example.com\",
\"screenshot\": True, \"crawler_params\": { \"headless\":
True, \"screenshot_wait_for\": \".main-content\" }, \"extra\":
{ \"delay_before_return_html\": 3.0 } } \nParameter Reference
Table\nCategory Parameter Type Description \nBrowser\theadless
\tbool\tRun browser in headless mode\t\nBrowser\tbrowser_type
\tstr\tBrowser engine selection\t\nBrowser\tuser_agent\tstr
\tCustom user agent string\t\nNetwork\tproxy\tstr\tProxy
server URL\t\nNetwork\theaders\tdict\tCustom HTTP headers\t
\nTiming\tpage_timeout\tint\tPage load timeout (ms)\t\nTiming
\tdelay_before_return_html\tfloat\tWait before capture\t
\nAnti-Detection\tsimulate_user\tbool\tHuman behavior
simulation\t\nAnti-Detection\tmagic\tbool\tAdvanced protection
\t\nSession\tsession_id\tstr\tBrowser session ID\t\nSession
\tuser_data_dir\tstr\tProfile directory\t\nContent
\tword_count_threshold\tint\tMinimum words per block\t
\nContent\tonly_text\tbool\tText-only extraction\t\nContent
\tprocess_iframes\tbool\tInclude iframe content\t\nDebug
\tverbose\tbool\tDetailed logging\t\nDebug\tlog_console\tbool
\tBrowser console logs\t\nTroubleshooting 🔠\nCommon Issues
\n1. Connection Refused \nError: Connection refused at
localhost:11235 \nSolution: Ensure the container is running
and ports are properly mapped. \n2. Resource Limits \nError:
No available slots \nSolution: Increase MAX_CONCURRENT_TASKS
or container resources. \n3. GPU Access \nSolution: Ensure
proper NVIDIA drivers and use --gpus all flag. \nDebug Mode
\nAccess container for debugging: \ndocker run -it --
entrypoint /bin/bash unclecode/crawl4ai:all \nView container
logs: \ndocker logs [container_id] \nBest Practices 🌟\n1.
Resource Management - Set appropriate memory and CPU limits -
Monitor resource usage via health endpoint - Use basic version
for simple crawling tasks\n2. Scaling - Use multiple
containers for high load - Implement proper load balancing -
Monitor performance metrics\n3. Security - Use environment
variables for sensitive data - Implement proper network
isolation - Regular security updates\nAPI Reference 📚
\nHealth Check\nSubmit Crawl Task\nPOST /crawl Content-Type:
application/json { \"urls\": \"string or array\",
\"extraction_config\": { \"type\": \"basic|llm|cosine|json_css
\", \"params\": {} }, \"priority\": 1-10, \"ttl\": 3600 }
\nGet Task Status\nFor more details, visit the official
documentation.",
"markdown": "# Docker Deployment - Crawl4AI Documentation
(v0.5.x)\n\nCrawl4AI provides official Docker images for easy
deployment and scalability. This guide covers installation,
configuration, and usage of Crawl4AI in Docker environments.\n
\n## Quick Start 🚀\n\nPull and run the basic version:\n\n`#
Basic run without security docker pull
unclecode/crawl4ai:basic docker run -p 11235:11235
unclecode/crawl4ai:basic # Run with API security enabled
docker run -p 11235:11235 -e
CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:basic`
\n\n## Running with Docker Compose 🐳\n\n### Use Docker
Compose (From Local Dockerfile or Docker Hub)\n\nCrawl4AI
provides flexibility to use Docker Compose for managing your
containerized services. You can either build the image locally
from the provided `Dockerfile` or use the pre-built image from
Docker Hub.\n\n### **Option 1: Using Docker Compose to Build
Locally**\n\nIf you want to build the image locally, use the
provided `docker-compose.local.yml` file.\n\n`docker-compose -
f docker-compose.local.yml up -d`\n\nThis will: 1. Build the
Docker image from the provided `Dockerfile`. 2. Start the
container and expose it on `http://localhost:11235`.\n\n* * *
\n\n### **Option 2: Using Docker Compose with Pre-Built Image
from Hub**\n\nIf you prefer using the pre-built image on
Docker Hub, use the `docker-compose.hub.yml` file.\n\n`docker-
compose -f docker-compose.hub.yml up -d`\n\nThis will: 1. Pull
the pre-built image `unclecode/crawl4ai:basic` (or `all`,
depending on your configuration). 2. Start the container and
expose it on `http://localhost:11235`.\n\n* * *\n\n###
**Stopping the Running Services**\n\nTo stop the services
started via Docker Compose, you can use:\n\n`docker-compose -f
docker-compose.local.yml down # OR docker-compose -f docker-
compose.hub.yml down`\n\nIf the containers don't stop and
the application is still running, check the running
containers:\n\nFind the `CONTAINER ID` of the running service
and stop it forcefully:\n\n`docker stop <CONTAINER_ID>`\n\n* *
*\n\n### **Debugging with Docker Compose**\n\n* **Check
Logs**: To view the container logs:\n \n `docker-
compose -f docker-compose.local.yml logs -f`\n \n*
**Remove Orphaned Containers**: If the service is still
running unexpectedly:\n \n `docker-compose -f docker-
compose.local.yml down --remove-orphans`\n \n* **Manually
Remove Network**: If the network is still in use:\n \n
`docker network ls docker network rm crawl4ai_default`\n \n
\n* * *\n\n### Why Use Docker Compose?\n\nDocker Compose is
the recommended way to deploy Crawl4AI because: 1. It
simplifies multi-container setups. 2. Allows you to define
environment variables, resources, and ports in a single file.
3. Makes it easier to switch between local development and
production-ready images.\n\nFor example, your `docker-
compose.yml` could include API keys, token settings, and
memory limits, making deployment quick and consistent.\n\n##
API Security 🔒\n\n### Understanding CRAWL4AI\\_API\\_TOKEN
\n\nThe `CRAWL4AI_API_TOKEN` provides optional security for
your Crawl4AI instance:\n\n* If `CRAWL4AI_API_TOKEN` is set:
All API endpoints (except `/health`) require authentication\n*
If `CRAWL4AI_API_TOKEN` is not set: The API is publicly
accessible\n\n`# Secured Instance docker run -p 11235:11235 -e
CRAWL4AI_API_TOKEN=your_secret_token unclecode/crawl4ai:all #
Unsecured Instance docker run -p 11235:11235
unclecode/crawl4ai:all`\n\n### Making API Calls\n\nFor secured
instances, include the token in all requests:\n\n`import
requests # Setup headers if token is being used api_token =
\"your_secret_token\" # Same token set in CRAWL4AI_API_TOKEN
headers = {\"Authorization\": f\"Bearer {api_token}\"} if
api_token else {} # Making authenticated requests response =
requests.post( \"http://localhost:11235/crawl\",
headers=headers, json={ \"urls\":
\"https://example.com\", \"priority\": 10 } ) #
Checking task status task_id = response.json()[\"task_id\"]
status = requests.get( f
\"http://localhost:11235/task/{task_id}\",
headers=headers )`\n\n### Using with Docker Compose\n\nIn your
`docker-compose.yml`:\n\n`services: crawl4ai: image:
unclecode/crawl4ai:all environment: -
CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-} # Optional
# ... other configuration`\n\nThen either: 1. Set in `.env`
file:\n\n`CRAWL4AI_API_TOKEN=your_secret_token`\n\n2. Or set
via command line:\n \n
`CRAWL4AI_API_TOKEN=your_secret_token docker-compose up`\n
\n\n> **Security Note**: If you enable the API token, make
sure to keep it secure and never commit it to version control.
The token will be required for all API endpoints except the
health check endpoint (`/health`).\n\n## Configuration Options
🔧\n\n### Environment Variables\n\nYou can configure the
service using environment variables:\n\n`# Basic configuration
docker run -p 11235:11235 \\ -e MAX_CONCURRENT_TASKS=5 \\
unclecode/crawl4ai:all # With security and LLM support docker
run -p 11235:11235 \\ -e
CRAWL4AI_API_TOKEN=your_secret_token \\ -e
OPENAI_API_KEY=sk-... \\ -e ANTHROPIC_API_KEY=sk-ant-...
\\ unclecode/crawl4ai:all`\n\n### Using Docker Compose
(Recommended) 🐳\n\nCreate a `docker-compose.yml`:\n
\n`version: '3.8' services: crawl4ai: image:
unclecode/crawl4ai:all ports: - \"11235:11235\"
environment: - CRAWL4AI_API_TOKEN=
${CRAWL4AI_API_TOKEN:-} # Optional API security -
MAX_CONCURRENT_TASKS=5 # LLM Provider Keys -
OPENAI_API_KEY=${OPENAI_API_KEY:-} - ANTHROPIC_API_KEY=
${ANTHROPIC_API_KEY:-} volumes: - /dev/shm:/dev/shm
deploy: resources: limits: memory: 4G
reservations: memory: 1G`\n\nYou can run it in two
ways:\n\n1. Using environment variables directly:\n \n
`CRAWL4AI_API_TOKEN=secret123 OPENAI_API_KEY=sk-... docker-
compose up`\n \n2. Using a `.env` file (recommended):
Create a `.env` file in the same directory:\n \n `# API
Security (optional) CRAWL4AI_API_TOKEN=your_secret_token #
LLM Provider Keys OPENAI_API_KEY=sk-... ANTHROPIC_API_KEY=sk-
ant-... # Other Configuration MAX_CONCURRENT_TASKS=5`\n \n
\nThen simply run:\n\n`docker-compose up`\n\n### Testing the Deployment 🧪\n
\n`import requests # For unsecured instances def
test_unsecured(): # Health check health =
requests.get(\"http://localhost:11235/health\")
print(\"Health check:\", health.json()) # Basic crawl
response =
requests.post( \"http://localhost:11235/crawl\",
json={ \"urls\":
\"https://www.nbcnews.com/business\", \"priority
\": 10 } ) task_id = response.json()[\"task_id
\"] print(\"Task ID:\", task_id) # For secured instances
def test_secured(api_token): headers = {\"Authorization\":
f\"Bearer {api_token}\"} # Basic crawl with
authentication response =
requests.post( \"http://localhost:11235/crawl\",
headers=headers, json={ \"urls\":
\"https://www.nbcnews.com/business\", \"priority
\": 10 } ) task_id = response.json()[\"task_id
\"] print(\"Task ID:\", task_id)`\n\nWhen you've
configured your LLM provider keys (via environment variables
or `.env`), you can use LLM extraction:\n\n`request =
{ \"urls\": \"https://example.com\",
\"extraction_config\": { \"type\": \"llm\",
\"params\": { \"provider\": \"openai/gpt-4\",
\"instruction\": \"Extract main topics from the page
\" } } } # Make the request (add headers if using
API security) response =
requests.post(\"http://localhost:11235/crawl\", json=request)`
\n\n> **Note**: Remember to add `.env` to your `.gitignore` to
keep your API keys secure!\n\n## Usage Examples 📠\n\n###
Basic Crawling\n\n`request = { \"urls\":
\"https://www.nbcnews.com/business\", \"priority\": 10 }
response = requests.post(\"http://localhost:11235/crawl\",
json=request) task_id = response.json()[\"task_id\"] # Get
results result = requests.get(f
\"http://localhost:11235/task/{task_id}\")`\n\n`schema =
{ \"name\": \"Crypto Prices\", \"baseSelector\":
\".cds-tableRow-t45thuk\", \"fields\":
[ { \"name\": \"crypto\",
\"selector\": \"td:nth-child(1) h2\", \"type\":
\"text\", }, { \"name\": \"price
\", \"selector\": \"td:nth-child(2)\",
\"type\": \"text\", } ], } request = { \"urls
\": \"https://www.coinbase.com/explore\",
\"extraction_config\": { \"type\": \"json_css\",
\"params\": {\"schema\": schema} } }`\n\n### Dynamic
Content Handling\n\n`request = { \"urls\":
\"https://www.nbcnews.com/business\", \"js_code\":
[ \"const loadMoreButton =
Array.from(document.querySelectorAll('button')).find(button =>
button.textContent.includes('Load More')); loadMoreButton &&
loadMoreButton.click();\" ], \"wait_for\":
\"article.tease-card:nth-child(10)\" }`\n\n`request =
{ \"urls\": \"https://www.nbcnews.com/business\",
\"extraction_config\": { \"type\": \"cosine\",
\"params\": { \"semantic_filter\": \"business
finance economy\", \"word_count_threshold\": 10,
\"max_dist\": 0.2, \"top_k\": 3 } } }`
\n\n## Platform-Specific Instructions 💻\n\n### macOS\n
\n`docker pull unclecode/crawl4ai:basic docker run -p
11235:11235 unclecode/crawl4ai:basic`\n\n### Ubuntu\n\n`#
Basic version docker pull unclecode/crawl4ai:basic docker
run -p 11235:11235 unclecode/crawl4ai:basic # With GPU
support docker pull unclecode/crawl4ai:gpu docker run --gpus
all -p 11235:11235 unclecode/crawl4ai:gpu`\n\n### Windows
(PowerShell)\n\n`docker pull unclecode/crawl4ai:basic docker
run -p 11235:11235 unclecode/crawl4ai:basic`\n\n## Testing
🧪\n\nSave this as `test_docker.py`:\n\n`import requests
import json import time import sys class Crawl4AiTester:
def __init__(self, base_url: str = \"http://localhost:11235
\"): self.base_url = base_url def
submit_and_wait(self, request_data: dict, timeout: int =
300) -> dict: # Submit crawl job response =
requests.post(f\"{self.base_url}/crawl\", json=request_data)
task_id = response.json()[\"task_id\"] print(f\"Task
ID: {task_id}\") # Poll for result start_time
= time.time() while True: if time.time() -
start_time > timeout: raise TimeoutError(f
\"Task {task_id} timeout\") result =
requests.get(f\"{self.base_url}/task/{task_id}\")
status = result.json() if status[\"status\"] ==
\"completed\": return status
time.sleep(2) def test_deployment(): tester =
Crawl4AiTester() # Test basic crawl request =
{ \"urls\": \"https://www.nbcnews.com/business\",
\"priority\": 10 } result =
tester.submit_and_wait(request) print(\"Basic crawl
successful!\") print(f\"Content length:
{len(result['result']['markdown'])}\") if __name__ ==
\"__main__\": test_deployment()`\n\n## Advanced
Configuration ⚙️ \n\n### Crawler Parameters\n\nThe
`crawler_params` field allows you to configure the browser
instance and crawling behavior. Here are key parameters you
can use:\n\n`request = { \"urls\": \"https://example.com
\", \"crawler_params\": { # Browser Configuration
\"headless\": True, # Run in headless mode
\"browser_type\": \"chromium\", #
chromium/firefox/webkit \"user_agent\": \"custom-agent
\", # Custom user agent \"proxy\":
\"http://proxy:8080\", # Proxy configuration #
Performance & Behavior \"page_timeout\": 30000,
# Page load timeout (ms) \"verbose\": True,
# Enable detailed logging \"semaphore_count\": 5,
# Concurrent request limit # Anti-Detection Features
\"simulate_user\": True, # Simulate human
behavior \"magic\": True, #
Advanced anti-detection \"override_navigator\": True,
# Override navigator properties # Session Management
\"user_data_dir\": \"./browser-data\", # Browser profile
location \"use_managed_browser\": True, # Use
persistent browser } }`\n\nThe `extra` field allows
passing additional parameters directly to the crawler's `arun`
function:\n\n`request = { \"urls\": \"https://example.com
\", \"extra\": { \"word_count_threshold\": 10,
# Min words per block \"only_text\": True,
# Extract only text \"bypass_cache\": True,
# Force fresh crawl \"process_iframes\": True,
# Include iframe content } }`\n\n### Complete Examples\n
\n1. **Advanced News Crawling**\n\n`request = { \"urls
\": \"https://www.nbcnews.com/business\", \"crawler_params
\": { \"headless\": True, \"page_timeout\":
30000, \"remove_overlay_elements\": True # Remove
popups }, \"extra\": { \"word_count_threshold
\": 50, # Longer content blocks
\"bypass_cache\": True # Fresh content },
\"css_selector\": \".article-body\" }`\n\n2. **Anti-
Detection Configuration**\n\n`request = { \"urls\":
\"https://example.com\", \"crawler_params\":
{ \"simulate_user\": True, \"magic\": True,
\"override_navigator\": True, \"user_agent\":
\"Mozilla/5.0 ...\", \"headers\":
{ \"Accept-Language\": \"en-US,en;q=0.9
\" } } }`\n\n3. **LLM Extraction with Custom
Parameters**\n\n`request = { \"urls\":
\"https://openai.com/pricing\", \"extraction_config\":
{ \"type\": \"llm\", \"params\":
{ \"provider\": \"openai/gpt-4\",
\"schema\": pricing_schema } },
\"crawler_params\": { \"verbose\": True,
\"page_timeout\": 60000 }, \"extra\":
{ \"word_count_threshold\": 1, \"only_text\":
True } }`\n\n4. **Session-Based Dynamic Content**\n
\n`request = { \"urls\": \"https://example.com\",
\"crawler_params\": { \"session_id\":
\"dynamic_session\", \"headless\": False,
\"page_timeout\": 60000 }, \"js_code\":
[\"window.scrollTo(0, document.body.scrollHeight);\"],
\"wait_for\": \"js:() =>
document.querySelectorAll('.item').length > 10\", \"extra
\": { \"delay_before_return_html\": 2.0 } }`\n
\n5. **Screenshot with Custom Timing**\n\n`request =
{ \"urls\": \"https://example.com\", \"screenshot\":
True, \"crawler_params\": { \"headless\": True,
\"screenshot_wait_for\": \".main-content\" }, \"extra
\": { \"delay_before_return_html\": 3.0 } }`\n
\n### Parameter Reference Table\n\n| Category | Parameter |
Type | Description |\n| --- | --- | --- | --- |\n| Browser |
headless | bool | Run browser in headless mode |\n| Browser |
browser\\_type | str | Browser engine selection |\n| Browser |
user\\_agent | str | Custom user agent string |\n| Network |
proxy | str | Proxy server URL |\n| Network | headers | dict |
Custom HTTP headers |\n| Timing | page\\_timeout | int | Page
load timeout (ms) |\n| Timing | delay\\_before\\_return\\_html
| float | Wait before capture |\n| Anti-Detection | simulate
\\_user | bool | Human behavior simulation |\n| Anti-Detection
| magic | bool | Advanced protection |\n| Session | session
\\_id | str | Browser session ID |\n| Session | user\\_data
\\_dir | str | Profile directory |\n| Content | word\\_count
\\_threshold | int | Minimum words per block |\n| Content |
only\\_text | bool | Text-only extraction |\n| Content |
process\\_iframes | bool | Include iframe content |\n| Debug |
verbose | bool | Detailed logging |\n| Debug | log\\_console |
bool | Browser console logs |\n\n## Troubleshooting 🔠\n
\n### Common Issues\n\n1. **Connection Refused**\n\n`Error:
Connection refused at localhost:11235`\n\nSolution: Ensure the
container is running and ports are properly mapped.\n\n2. **Resource Limits**\n\n`Error: No available slots`\n
\nSolution: Increase MAX\\_CONCURRENT\\_TASKS or container
resources.\n\n3. **GPU Access**\n\nSolution: Ensure proper
NVIDIA drivers and use `--gpus all` flag.\n\n### Debug Mode\n
\nAccess container for debugging:\n\n`docker run -it --
entrypoint /bin/bash unclecode/crawl4ai:all`\n\nView container
logs:\n\n`docker logs [container_id]`\n\n## Best Practices
🌟\n\n1. **Resource Management** - Set appropriate memory
and CPU limits - Monitor resource usage via health endpoint -
Use basic version for simple crawling tasks\n\n2. **Scaling** - Use multiple containers for high load -
Implement proper load balancing - Monitor performance metrics
\n\n3. **Security** - Use environment variables for
sensitive data - Implement proper network isolation - Regular
security updates\n\n## API Reference 📚\n\n### Health Check
\n\n### Submit Crawl Task\n\n`POST /crawl Content-Type:
application/json { \"urls\": \"string or array\",
\"extraction_config\": { \"type\":
\"basic|llm|cosine|json_css\", \"params\": {} },
\"priority\": 1-10, \"ttl\": 3600 }`\n\n### Get Task
Status\n\nFor more details, visit the [official documentation]
(https://docs.crawl4ai.com/).",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/blog/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/blog/",
"loadedTime": "2025-03-05T23:16:20.554Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/blog/",
"title": "Blog Home - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:18 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"e836cc010728ab5020f197c6e1b0fb69\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Blog Home - Crawl4AI Documentation
(v0.5.x)\nWelcome to the Crawl4AI blog! Here you'll find
detailed release notes, technical insights, and updates about
the project. Whether you're looking for the latest
improvements or want to dive deep into web crawling
techniques, this is the place.\nLatest Release\nCrawl4AI
v0.5.0: Deep Crawling, Scalability, and a New CLI!\nMy dear
friends and crawlers, there you go, this is the release of
Crawl4AI v0.5.0! This release brings a wealth of new features,
performance improvements, and a more streamlined developer
experience. Here's a breakdown of what's new:\nMajor New
Features:\nDeep Crawling: Explore entire websites with
configurable strategies (BFS, DFS, Best-First). Define custom
filters and URL scoring for targeted crawls.\nMemory-Adaptive
Dispatcher: Handle large-scale crawls with ease! Our new
dispatcher dynamically adjusts concurrency based on available
memory and includes built-in rate limiting.\nMultiple Crawler
Strategies: Choose between the full-featured Playwright
browser-based crawler or a new, much faster HTTP-only crawler
for simpler tasks.\nDocker Deployment: Deploy Crawl4AI as a
scalable, self-contained service with built-in API endpoints
and optional JWT authentication.\nCommand-Line Interface
(CLI): Interact with Crawl4AI directly from your terminal.
Crawl, configure, and extract data with simple commands.\nLLM
Configuration (LlmConfig): A new, unified way to configure LLM
providers (OpenAI, Anthropic, Ollama, etc.) for extraction,
filtering, and schema generation. Simplifies API key
management and switching between models.\nMinor Updates &
Improvements:\nLXML Scraping Mode: Faster HTML parsing with
LXMLWebScrapingStrategy.\nProxy Rotation: Added
ProxyRotationStrategy with a RoundRobinProxyStrategy
implementation.\nPDF Processing: Extract text, images, and
metadata from PDF files.\nURL Redirection Tracking:
Automatically follows and records redirects.\nRobots.txt
Compliance: Optionally respect website crawling rules.\nLLM-
Powered Schema Generation: Automatically create extraction
schemas using an LLM.\nLLMContentFilter: Generate high-
quality, focused markdown using an LLM.\nImproved Error
Handling & Stability: Numerous bug fixes and performance
enhancements.\nEnhanced Documentation: Updated guides and
examples.\nBreaking Changes & Migration:\nThis release
includes several breaking changes to improve the library's
structure and consistency. Here's what you need to know:
\narun_many() Behavior: Now uses the MemoryAdaptiveDispatcher
by default. The return type depends on the stream parameter in
CrawlerRunConfig. Adjust code that relied on unbounded
concurrency.\nmax_depth Location: Moved to CrawlerRunConfig
and now controls crawl depth.\nDeep Crawling Imports: Import
DeepCrawlStrategy and related classes from
crawl4ai.deep_crawling.\nBrowserContext API: Updated; the old
get_context method is deprecated.\nOptional Model Fields: Many
data model fields are now optional. Handle potential None
values.\nScrapingMode Enum: Replaced with strategy pattern
(WebScrapingStrategy, LXMLWebScrapingStrategy).
\ncontent_filter Parameter: Removed from CrawlerRunConfig. Use
extraction strategies or markdown generators with filters.
\nRemoved Functionality: The synchronous WebCrawler, the old
CLI, and docs management tools have been removed.\nDocker:
Significant changes to deployment. See the Docker
documentation.\nssl_certificate.json: This file has been
removed.\nConfig: FastFilterChain has been replaced with
FilterChain\nDeep-Crawl: DeepCrawlStrategy.arun now returns
Union[CrawlResultT, List[CrawlResultT],
AsyncGenerator[CrawlResultT, None]]\nProxy: Removed
synchronous WebCrawler support and related rate limiting
configurations\nLLM Parameters: Use the new LlmConfig object
instead of passing provider, api_token, base_url, and api_base
directly to LLMExtractionStrategy and LLMContentFilter.\nIn
short: Update imports, adjust arun_many() usage, check for
optional fields, and review the Docker deployment guide.
\nLicense Change\nCrawl4AI v0.5.0 updates the license to
Apache 2.0 with a required attribution clause. This means you
are free to use, modify, and distribute Crawl4AI (even
commercially), but you must clearly attribute the project in
any public use or distribution. See the updated LICENSE file
for the full legal text and specific requirements.\nGet
Started:\nInstallation: pip install \"crawl4ai[all]\" (or use
the Docker image)\nDocumentation: https://docs.crawl4ai.com
\nGitHub: https://github.com/unclecode/crawl4ai\nI'm very
excited to see what you build with Crawl4AI v0.5.0!\n0.4.2 -
Configurable Crawlers, Session Management, and Smarter
Screenshots\nDecember 12, 2024\nThe 0.4.2 update brings
massive improvements to configuration, making crawlers and
browsers easier to manage with dedicated objects. You can now
import/export local storage for seamless session management.
Plus, long-page screenshots are faster and cleaner, and full-
page PDF exports are now possible. Check out all the new
features to make your crawling experience even smoother.\nRead
full release notes →\n0.4.1 - Smarter Crawling with Lazy-
Load Handling, Text-Only Mode, and More\nDecember 8, 2024
\nThis release brings major improvements to handling lazy-
loaded images, a blazing-fast Text-Only Mode, full-page
scanning for infinite scrolls, dynamic viewport adjustments,
and session reuse for efficient crawling. If you're looking to
improve speed, reliability, or handle dynamic content with
ease, this update has you covered.\nRead full release notes
→\n0.4.0 - Major Content Filtering Update\nDecember 1, 2024
\nIntroduced significant improvements to content filtering,
multi-threaded environment handling, and user-agent
generation. This release features the new
PruningContentFilter, enhanced thread safety, and improved
test coverage.\nRead full release notes →\nProject History
\nCurious about how Crawl4AI has evolved? Check out our
complete changelog for a detailed history of all versions and
updates.\nStay Updated\nStar us on GitHub\nFollow @unclecode
on Twitter\nJoin our community discussions on GitHub",
"markdown": "# Blog Home - Crawl4AI Documentation (v0.5.x)\n
\nWelcome to the Crawl4AI blog! Here you'll find detailed
release notes, technical insights, and updates about the
project. Whether you're looking for the latest improvements or
want to dive deep into web crawling techniques, this is the
place.\n\n## Latest Release\n\n### [Crawl4AI v0.5.0: Deep
Crawling, Scalability, and a New CLI!]
(https://crawl4ai.com/mkdocs/blog/releases/0.5.0/)\n\nMy dear
friends and crawlers, there you go, this is the release of
Crawl4AI v0.5.0! This release brings a wealth of new features,
performance improvements, and a more streamlined developer
experience. Here's a breakdown of what's new:\n\n**Major New
Features:**\n\n* **Deep Crawling:** Explore entire websites
with configurable strategies (BFS, DFS, Best-First). Define
custom filters and URL scoring for targeted crawls.\n*
**Memory-Adaptive Dispatcher:** Handle large-scale crawls with
ease! Our new dispatcher dynamically adjusts concurrency based
on available memory and includes built-in rate limiting.\n*
**Multiple Crawler Strategies:** Choose between the full-
featured Playwright browser-based crawler or a new, _much_
faster HTTP-only crawler for simpler tasks.\n* **Docker
Deployment:** Deploy Crawl4AI as a scalable, self-contained
service with built-in API endpoints and optional JWT
authentication.\n* **Command-Line Interface (CLI):**
Interact with Crawl4AI directly from your terminal. Crawl,
configure, and extract data with simple commands.\n* **LLM
Configuration (`LlmConfig`):** A new, unified way to configure
LLM providers (OpenAI, Anthropic, Ollama, etc.) for
extraction, filtering, and schema generation. Simplifies API
key management and switching between models.\n\n**Minor
Updates & Improvements:**\n\n* **LXML Scraping Mode:**
Faster HTML parsing with `LXMLWebScrapingStrategy`.\n*
**Proxy Rotation:** Added `ProxyRotationStrategy` with a
`RoundRobinProxyStrategy` implementation.\n* **PDF
Processing:** Extract text, images, and metadata from PDF
files.\n* **URL Redirection Tracking:** Automatically
follows and records redirects.\n* **Robots.txt Compliance:**
Optionally respect website crawling rules.\n* **LLM-Powered
Schema Generation:** Automatically create extraction schemas
using an LLM.\n* **`LLMContentFilter`:** Generate high-
quality, focused markdown using an LLM.\n* **Improved Error
Handling & Stability:** Numerous bug fixes and performance
enhancements.\n* **Enhanced Documentation:** Updated guides
and examples.\n\n**Breaking Changes & Migration:**\n\nThis
release includes several breaking changes to improve the
library's structure and consistency. Here's what you need to
know:\n\n* **`arun_many()` Behavior:** Now uses the
`MemoryAdaptiveDispatcher` by default. The return type depends
on the `stream` parameter in `CrawlerRunConfig`. Adjust code
that relied on unbounded concurrency.\n* **`max_depth`
Location:** Moved to `CrawlerRunConfig` and now controls
_crawl depth_.\n* **Deep Crawling Imports:** Import
`DeepCrawlStrategy` and related classes from
`crawl4ai.deep_crawling`.\n* **`BrowserContext` API:**
Updated; the old `get_context` method is deprecated.\n*
**Optional Model Fields:** Many data model fields are now
optional. Handle potential `None` values.\n*
**`ScrapingMode` Enum:** Replaced with strategy pattern
(`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).\n*
**`content_filter` Parameter:** Removed from
`CrawlerRunConfig`. Use extraction strategies or markdown
generators with filters.\n* **Removed Functionality:** The
synchronous `WebCrawler`, the old CLI, and docs management
tools have been removed.\n* **Docker:** Significant changes
to deployment. See the [Docker documentation]
(https://crawl4ai.com/mkdocs/deploy/docker/README.md).\n*
**`ssl_certificate.json`:** This file has been removed.\n*
**Config**: FastFilterChain has been replaced with FilterChain
\n* **Deep-Crawl**: DeepCrawlStrategy.arun now returns Union
\\[CrawlResultT, List\\[CrawlResultT\\], AsyncGenerator
\\[CrawlResultT, None\\]\\]\n* **Proxy**: Removed
synchronous WebCrawler support and related rate limiting
configurations\n* **LLM Parameters:** Use the new
`LlmConfig` object instead of passing `provider`, `api_token`,
`base_url`, and `api_base` directly to `LLMExtractionStrategy`
and `LLMContentFilter`.\n\n**In short:** Update imports,
adjust `arun_many()` usage, check for optional fields, and
review the Docker deployment guide.\n\n## License Change\n
\nCrawl4AI v0.5.0 updates the license to Apache 2.0 _with a
required attribution clause_. This means you are free to use,
modify, and distribute Crawl4AI (even commercially), but you
_must_ clearly attribute the project in any public use or
distribution. See the updated `LICENSE` file for the full
legal text and specific requirements.\n\n**Get Started:**\n\n*
**Installation:** `pip install \"crawl4ai[all]\"` (or use the
Docker image)\n* **Documentation:**
[https://docs.crawl4ai.com](https://docs.crawl4ai.com/)\n*
**GitHub:** [https://github.com/unclecode/crawl4ai]
(https://github.com/unclecode/crawl4ai)\n\nI'm very excited to
see what you build with Crawl4AI v0.5.0!\n\n* * *\n\n###
[0.4.2 - Configurable Crawlers, Session Management, and
Smarter Screenshots]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.2/)\n
\n_December 12, 2024_\n\nThe 0.4.2 update brings massive
improvements to configuration, making crawlers and browsers
easier to manage with dedicated objects. You can now
import/export local storage for seamless session management.
Plus, long-page screenshots are faster and cleaner, and full-
page PDF exports are now possible. Check out all the new
features to make your crawling experience even smoother.\n
\n[Read full release notes →]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.2/)\n\n* * *\n
\n### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-
Only Mode, and More]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.1/)\n
\n_December 8, 2024_\n\nThis release brings major improvements
to handling lazy-loaded images, a blazing-fast Text-Only Mode,
full-page scanning for infinite scrolls, dynamic viewport
adjustments, and session reuse for efficient crawling. If
you're looking to improve speed, reliability, or handle
dynamic content with ease, this update has you covered.\n
\n[Read full release notes →]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.1/)\n\n* * *\n
\n### [0.4.0 - Major Content Filtering Update]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.0/)\n
\n_December 1, 2024_\n\nIntroduced significant improvements to
content filtering, multi-threaded environment handling, and
user-agent generation. This release features the new
PruningContentFilter, enhanced thread safety, and improved
test coverage.\n\n[Read full release notes →]
(https://crawl4ai.com/mkdocs/blog/releases/0.4.0/)\n\n##
Project History\n\nCurious about how Crawl4AI has evolved?
Check out our [complete changelog]
(https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md)
for a detailed history of all versions and updates.\n\n## Stay
Updated\n\n* Star us on [GitHub]
(https://github.com/unclecode/crawl4ai)\n* Follow
[@unclecode](https://twitter.com/unclecode) on Twitter\n*
Join our community discussions on GitHub",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/cli/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/cli/",
"loadedTime": "2025-03-05T23:16:21.149Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/cli/",
"title": "Command Line Interface - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:18 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"f31c3edee7ec1876ba130b2b01d53d37\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Command Line Interface - Crawl4AI Documentation
(v0.5.x)\nCrawl4AI CLI Guide\nTable of Contents\nInstallation
\nBasic Usage\nConfiguration\nBrowser Configuration\nCrawler
Configuration\nExtraction Configuration\nContent Filtering
\nAdvanced Features\nLLM Q&A\nStructured Data Extraction
\nContent Filtering\nOutput Formats\nExamples\nConfiguration
Reference\nBest Practices & Tips\nBasic Usage\nThe Crawl4AI
CLI (crwl) provides a simple interface to the Crawl4AI
library:\n# Basic crawling crwl https://example.com # Get
markdown output crwl https://example.com -o markdown # Verbose
JSON output with cache bypass crwl https://example.com -o
json -v --bypass-cache # See usage examples crwl --example
\nQuick Example of Advanced Usage\nIf you clone the repository
and run the following command, you will receive the content of
the page in JSON format according to a JSON-CSS schema:\ncrwl
\"https://www.infoq.com/ai-ml-data-eng/\" -e
docs/examples/cli/extract_css.yml -s
docs/examples/cli/css_schema.json -o json; \nConfiguration
\nBrowser Configuration\nBrowser settings can be configured
via YAML file or command line parameters:\n# browser.yml
headless: true viewport_width: 1280 user_agent_mode: \"random
\" verbose: true ignore_https_errors: true \n# Using config
file crwl https://example.com -B browser.yml # Using direct
parameters crwl https://example.com -b
\"headless=true,viewport_width=1280,user_agent_mode=random\"
\nCrawler Configuration\nControl crawling behavior:\n#
crawler.yml cache_mode: \"bypass\" wait_until: \"networkidle\"
page_timeout: 30000 delay_before_return_html: 0.5
word_count_threshold: 100 scan_full_page: true scroll_delay:
0.3 process_iframes: false remove_overlay_elements: true
magic: true verbose: true \n# Using config file crwl
https://example.com -C crawler.yml # Using direct parameters
crwl https://example.com -c
\"css_selector=#main,delay_before_return_html=
2,scan_full_page=true\" \nTwo types of extraction are
supported:\nCSS/XPath-based extraction: \n# extract_css.yml
type: \"json-css\" params: verbose: true \n// css_schema.json
{ \"name\": \"ArticleExtractor\", \"baseSelector\": \".article
\", \"fields\": [ { \"name\": \"title\", \"selector\":
\"h1.title\", \"type\": \"text\" }, { \"name\": \"link\",
\"selector\": \"a.read-more\", \"type\": \"attribute\",
\"attribute\": \"href\" } ] } \nLLM-based extraction: \n#
extract_llm.yml type: \"llm\" provider: \"openai/gpt-4\"
instruction: \"Extract all articles with their titles and
links\" api_token: \"your-token\" params: temperature: 0.3
max_tokens: 1000 \n// llm_schema.json { \"title\": \"Article
\", \"type\": \"object\", \"properties\": { \"title\":
{ \"type\": \"string\", \"description\": \"The title of the
article\" }, \"link\": { \"type\": \"string\", \"description
\": \"URL to the full article\" } } } \nAdvanced Features\nLLM
Q&A\nAsk questions about crawled content:\n# Simple question
crwl https://example.com -q \"What is the main topic
discussed?\" # View content then ask questions crwl
https://example.com -o markdown # See content first crwl
https://example.com -q \"Summarize the key points\" crwl
https://example.com -q \"What are the conclusions?\" #
Combined with advanced crawling crwl https://example.com \\ -B
browser.yml \\ -c \"css_selector=article,scan_full_page=true\"
\\ -q \"What are the pros and cons mentioned?\" \nFirst-time
setup: - Prompts for LLM provider and API token - Saves
configuration in ~/.crawl4ai/global.yml - Supports various
providers (openai/gpt-4, anthropic/claude-3-sonnet, etc.) -
For the case of ollama, you do not need to provide an API token. - See
LiteLLM Providers for full list\nExtract structured data using
CSS selectors:\ncrwl https://example.com \\ -e extract_css.yml
\\ -s css_schema.json \\ -o json \nOr using LLM-based
extraction:\ncrwl https://example.com \\ -e extract_llm.yml
\\ -s llm_schema.json \\ -o json \nContent Filtering\nFilter
content for relevance:\n# filter_bm25.yml type: \"bm25\"
query: \"target content\" threshold: 1.0 # filter_pruning.yml
type: \"pruning\" query: \"focus topic\" threshold: 0.48
\ncrwl https://example.com -f filter_bm25.yml -o markdown-fit
\nOutput Formats\nall - Full crawl result including metadata
\njson - Extracted structured data (when using
extraction)\nmarkdown / md - Raw markdown output\nmarkdown-fit
/ md-fit - Filtered markdown for better readability\nComplete
Examples\nBasic Extraction: \ncrwl https://example.com \\ -B
browser.yml \\ -C crawler.yml \\ -o json \nStructured Data
Extraction: \ncrwl https://example.com \\ -e extract_css.yml
\\ -s css_schema.json \\ -o json \\ -v \nLLM Extraction with
Filtering: \ncrwl https://example.com \\ -B browser.yml \\ -e
extract_llm.yml \\ -s llm_schema.json \\ -f filter_bm25.yml
\\ -o json \nInteractive Q&A: \n# First crawl and view crwl
https://example.com -o markdown # Then ask questions crwl
https://example.com -q \"What are the main points?\" crwl
https://example.com -q \"Summarize the conclusions\" \nBest
Practices & Tips\nConfiguration Management:\nKeep common
configurations in YAML files\nUse CLI parameters for quick
overrides\nStore sensitive data (API tokens) in
~/.crawl4ai/global.yml\nPerformance Optimization:\nUse --
bypass-cache for fresh content\nEnable scan_full_page for
infinite scroll pages\nAdjust delay_before_return_html for
dynamic content\nContent Extraction:\nUse CSS extraction for
structured content\nUse LLM extraction for unstructured
content\nCombine with filters for focused results\nQ&A
Workflow:\nView content first with -o markdown\nAsk specific
questions\nUse broader context with appropriate selectors
\nRecap\nThe Crawl4AI CLI provides: - Flexible configuration
via files and parameters - Multiple extraction strategies
(CSS, XPath, LLM) - Content filtering and optimization -
Interactive Q&A capabilities - Various output formats",
"markdown": "# Command Line Interface - Crawl4AI
Documentation (v0.5.x)\n\n## Crawl4AI CLI Guide\n\n## Table of
Contents\n\n* [Installation](#installation)\n* [Basic
Usage](#basic-usage)\n* [Configuration](#configuration)\n*
[Browser Configuration](#browser-configuration)\n* [Crawler
Configuration](#crawler-configuration)\n* [Extraction
Configuration](#extraction-configuration)\n* [Content
Filtering](#content-filtering)\n* [Advanced Features]
(#advanced-features)\n* [LLM Q&A](#llm-qa)\n* [Structured
Data Extraction](#structured-data-extraction)\n* [Content
Filtering](#content-filtering-1)\n* [Output Formats]
(#output-formats)\n* [Examples](#examples)\n*
[Configuration Reference](#configuration-reference)\n* [Best
Practices & Tips](#best-practices--tips)\n\n## Basic Usage\n
\nThe Crawl4AI CLI (`crwl`) provides a simple interface to the
Crawl4AI library:\n\n`# Basic crawling crwl
https://example.com # Get markdown output crwl
https://example.com -o markdown # Verbose JSON output with
cache bypass crwl https://example.com -o json -v --bypass-
cache # See usage examples crwl --example`\n\n## Quick
Example of Advanced Usage\n\nIf you clone the repository and
run the following command, you will receive the content of the
page in JSON format according to a JSON-CSS schema:\n\n`crwl
\"https://www.infoq.com/ai-ml-data-eng/\" -e
docs/examples/cli/extract_css.yml -s
docs/examples/cli/css_schema.json -o json;`\n\n##
Configuration\n\n### Browser Configuration\n\nBrowser settings
can be configured via YAML file or command line parameters:\n
\n`# browser.yml headless: true viewport_width: 1280
user_agent_mode: \"random\" verbose: true ignore_https_errors:
true`\n\n`# Using config file crwl https://example.com -B
browser.yml # Using direct parameters crwl
https://example.com -b \"headless=true,viewport_width=
1280,user_agent_mode=random\"`\n\n### Crawler Configuration\n
\nControl crawling behavior:\n\n`# crawler.yml cache_mode:
\"bypass\" wait_until: \"networkidle\" page_timeout: 30000
delay_before_return_html: 0.5 word_count_threshold: 100
scan_full_page: true scroll_delay: 0.3 process_iframes: false
remove_overlay_elements: true magic: true verbose: true`\n\n`#
Using config file crwl https://example.com -C crawler.yml #
Using direct parameters crwl https://example.com -c
\"css_selector=#main,delay_before_return_html=
2,scan_full_page=true\"`\n\nTwo types of extraction are
supported:\n\n1. CSS/XPath-based extraction:\n \n `#
extract_css.yml type: \"json-css\" params: verbose: true`\n
\n\n`// css_schema.json { \"name\": \"ArticleExtractor\",
\"baseSelector\": \".article\", \"fields\":
[ { \"name\": \"title\", \"selector\":
\"h1.title\", \"type\": \"text\" },
{ \"name\": \"link\", \"selector\": \"a.read-more
\", \"type\": \"attribute\", \"attribute\": \"href
\" } ] }`\n\n1. LLM-based extraction:\n \n `#
extract_llm.yml type: \"llm\" provider: \"openai/gpt-4\"
instruction: \"Extract all articles with their titles and
links\" api_token: \"your-token\" params: temperature: 0.3
max_tokens: 1000`\n \n\n`// llm_schema.json { \"title\":
\"Article\", \"type\": \"object\", \"properties\":
{ \"title\": { \"type\": \"string\",
\"description\": \"The title of the article\" },
\"link\": { \"type\": \"string\", \"description\":
\"URL to the full article\" } } }`\n\n## Advanced
Features\n\n### LLM Q&A\n\nAsk questions about crawled
content:\n\n`# Simple question crwl https://example.com -q
\"What is the main topic discussed?\" # View content then ask
questions crwl https://example.com -o markdown # See content
first crwl https://example.com -q \"Summarize the key points\"
crwl https://example.com -q \"What are the conclusions?\" #
Combined with advanced crawling crwl https://example.com
\\ -B browser.yml \\ -c
\"css_selector=article,scan_full_page=true\" \\ -q \"What
are the pros and cons mentioned?\"`\n\nFirst-time setup:\n\n* Prompts for LLM provider and API token\n* Saves configuration in `~/.crawl4ai/global.yml`\n* Supports various providers (openai/gpt-4, anthropic/claude-3-sonnet, etc.)\n* For the case of `ollama`, you do not need to provide an API token.\n* See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for the full list\n\nExtract structured data using CSS selectors:\n\n`crwl
https://example.com \\ -e extract_css.yml \\ -s
css_schema.json \\ -o json`\n\nOr using LLM-based
extraction:\n\n`crwl https://example.com \\ -e
extract_llm.yml \\ -s llm_schema.json \\ -o json`\n
\n### Content Filtering\n\nFilter content for relevance:\n\n`#
filter_bm25.yml type: \"bm25\" query: \"target content\"
threshold: 1.0 # filter_pruning.yml type: \"pruning\" query:
\"focus topic\" threshold: 0.48`\n\n`crwl
https://example.com -f filter_bm25.yml -o markdown-fit`\n\n##
Output Formats\n\n* `all` - Full crawl result including
metadata\n* `json` - Extracted structured data (when using
extraction)\n* `markdown` / `md` - Raw markdown output\n*
`markdown-fit` / `md-fit` - Filtered markdown for better
readability\n\n## Complete Examples\n\n1. Basic Extraction:\n
\n `crwl https://example.com \\ -B browser.yml \\ -
C crawler.yml \\ -o json`\n \n2. Structured Data
Extraction:\n \n `crwl https://example.com \\ -e
extract_css.yml \\ -s css_schema.json \\ -o json
\\ -v`\n \n3. LLM Extraction with Filtering:\n \n
`crwl https://example.com \\ -B browser.yml \\ -e
extract_llm.yml \\ -s llm_schema.json \\ -f
filter_bm25.yml \\ -o json`\n \n4. Interactive Q&A:\n
\n `# First crawl and view crwl https://example.com -o
markdown # Then ask questions crwl https://example.com -q
\"What are the main points?\" crwl https://example.com -q
\"Summarize the conclusions\"`\n \n\n## Best Practices &
Tips\n\n1. **Configuration Management**:\n * Keep common configurations in YAML files\n * Use CLI parameters for quick overrides\n * Store sensitive data (API tokens) in `~/.crawl4ai/global.yml`\n2. **Performance Optimization**:\n * Use `--bypass-cache` for fresh content\n * Enable `scan_full_page` for infinite scroll pages\n * Adjust `delay_before_return_html` for dynamic content\n3. **Content Extraction**:\n * Use CSS extraction for structured content\n * Use LLM extraction for unstructured content\n * Combine with filters for focused results\n4. **Q&A Workflow**:\n * View content first with `-o markdown`\n * Ask specific questions\n * Use broader context with appropriate selectors\n\n## Recap\n
\nThe Crawl4AI CLI provides: - Flexible configuration via
files and parameters - Multiple extraction strategies (CSS,
XPath, LLM) - Content filtering and optimization - Interactive
Q&A capabilities - Various output formats",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/simple-crawling/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/simple-
crawling/",
"loadedTime": "2025-03-05T23:16:21.838Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/simple-
crawling/",
"title": "Simple Crawling - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:19 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"98091617655d8841e38e650b346db5dd\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Simple Crawling - Crawl4AI Documentation
(v0.5.x)\nThis guide covers the basics of web crawling with
Crawl4AI. You'll learn how to set up a crawler, make your
first request, and understand the response.\nBasic Usage\nSet
up a simple crawl using BrowserConfig and CrawlerRunConfig:
\nimport asyncio from crawl4ai import AsyncWebCrawler from
crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async def main(): browser_config = BrowserConfig() # Default
browser configuration run_config = CrawlerRunConfig() #
Default crawl run configuration async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun( url=\"https://example.com\",
config=run_config ) print(result.markdown) # Print clean
markdown content if __name__ == \"__main__\":
asyncio.run(main()) \nUnderstanding the Response\nThe arun()
method returns a CrawlResult object with several useful
properties. Here's a quick overview (see CrawlResult for
complete details):\nresult = await crawler.arun( url=
\"https://example.com\",
config=CrawlerRunConfig(fit_markdown=True) ) # Different
content formats print(result.html) # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown.raw_markdown) # Raw markdown from
cleaned html print(result.markdown.fit_markdown) # Most
relevant content in markdown # Check success status
print(result.success) # True if crawl succeeded
print(result.status_code) # HTTP status code (e.g., 200, 404)
# Access extracted media and links print(result.media) #
Dictionary of found media (images, videos, audio)
print(result.links) # Dictionary of internal and external
links \nAdding Basic Options\nCustomize your crawl using
CrawlerRunConfig:\nrun_config =
CrawlerRunConfig( word_count_threshold=10, # Minimum words per
content block exclude_external_links=True, # Remove external
links remove_overlay_elements=True, # Remove popups/modals
process_iframes=True # Process iframe content ) result = await
crawler.arun( url=\"https://example.com\", config=run_config )
\nHandling Errors\nAlways check if the crawl was successful:
\nrun_config = CrawlerRunConfig() result = await
crawler.arun(url=\"https://example.com\", config=run_config)
if not result.success: print(f\"Crawl failed:
{result.error_message}\") print(f\"Status code:
{result.status_code}\") \nLogging and Debugging\nEnable
verbose logging in BrowserConfig:\nbrowser_config =
BrowserConfig(verbose=True) async with
AsyncWebCrawler(config=browser_config) as crawler: run_config
= CrawlerRunConfig() result = await crawler.arun(url=
\"https://example.com\", config=run_config) \nComplete Example
\nHere's a more comprehensive example demonstrating common
usage patterns:\nimport asyncio from crawl4ai import
AsyncWebCrawler from crawl4ai.async_configs import
BrowserConfig, CrawlerRunConfig, CacheMode async def main():
browser_config = BrowserConfig(verbose=True) run_config =
CrawlerRunConfig( # Content filtering word_count_threshold=10,
excluded_tags=['form', 'header'], exclude_external_links=True,
# Content processing process_iframes=True,
remove_overlay_elements=True, # Cache control
cache_mode=CacheMode.ENABLED # Use cache if available ) async
with AsyncWebCrawler(config=browser_config) as crawler: result
= await crawler.arun( url=\"https://example.com\",
config=run_config ) if result.success: # Print clean content
print(\"Content:\", result.markdown[:500]) # First 500 chars #
Process images for image in result.media[\"images\"]: print(f
\"Found image: {image['src']}\") # Process links for link in
result.links[\"internal\"]: print(f\"Internal link:
{link['href']}\") else: print(f\"Crawl failed:
{result.error_message}\") if __name__ == \"__main__\":
asyncio.run(main())",
"markdown": "# Simple Crawling - Crawl4AI Documentation
(v0.5.x)\n\nThis guide covers the basics of web crawling with
Crawl4AI. You'll learn how to set up a crawler, make your
first request, and understand the response.\n\n## Basic Usage
\n\nSet up a simple crawl using `BrowserConfig` and
`CrawlerRunConfig`:\n\n`import asyncio from crawl4ai import
AsyncWebCrawler from crawl4ai.async_configs import
BrowserConfig, CrawlerRunConfig async def main():
browser_config = BrowserConfig() # Default browser
configuration run_config = CrawlerRunConfig() # Default
crawl run configuration async with
AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun( url=
\"https://example.com\",
config=run_config ) print(result.markdown) #
Print clean markdown content if __name__ == \"__main__\":
asyncio.run(main())`\n\n## Understanding the Response\n\nThe
`arun()` method returns a `CrawlResult` object with several
useful properties. Here's a quick overview (see [CrawlResult]
(https://crawl4ai.com/mkdocs/api/crawl-result/) for complete
details):\n\n`result = await crawler.arun( url=
\"https://example.com\",
config=CrawlerRunConfig(fit_markdown=True) ) # Different
content formats print(result.html) # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown.raw_markdown) # Raw markdown from
cleaned html print(result.markdown.fit_markdown) # Most
relevant content in markdown # Check success status
print(result.success) # True if crawl succeeded
print(result.status_code) # HTTP status code (e.g., 200, 404)
# Access extracted media and links print(result.media)
# Dictionary of found media (images, videos, audio)
print(result.links) # Dictionary of internal and
external links`\n\n## Adding Basic Options\n\nCustomize your
crawl using `CrawlerRunConfig`:\n\n`run_config =
CrawlerRunConfig( word_count_threshold=10, #
Minimum words per content block
exclude_external_links=True, # Remove external links
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True # Process iframe content )
result = await crawler.arun( url=\"https://example.com\",
config=run_config )`\n\n## Handling Errors\n\nAlways check if
the crawl was successful:\n\n`run_config = CrawlerRunConfig()
result = await crawler.arun(url=\"https://example.com\",
config=run_config) if not result.success: print(f\"Crawl
failed: {result.error_message}\") print(f\"Status code:
{result.status_code}\")`\n\n## Logging and Debugging\n\nEnable
verbose logging in `BrowserConfig`:\n\n`browser_config =
BrowserConfig(verbose=True) async with
AsyncWebCrawler(config=browser_config) as crawler:
run_config = CrawlerRunConfig() result = await
crawler.arun(url=\"https://example.com\", config=run_config)`
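Note that CrawlerRunConfig also accepts a verbose flag (its signature appears later in these docs), so browser-level and per-run logging can be enabled together. A minimal sketch, reusing the example URL above:

import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    # Verbose at both the browser level and the per-run level.
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.success, result.status_code)

asyncio.run(main())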
\n\n## Complete Example\n\nHere's a more comprehensive example
demonstrating common usage patterns:\n\n`import asyncio from
crawl4ai import AsyncWebCrawler from crawl4ai.async_configs
import BrowserConfig, CrawlerRunConfig, CacheMode async def
main(): browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig( # Content filtering
word_count_threshold=10, excluded_tags=['form',
'header'], exclude_external_links=True, #
Content processing process_iframes=True,
remove_overlay_elements=True, # Cache control
cache_mode=CacheMode.ENABLED # Use cache if available )
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun( url=
\"https://example.com\",
config=run_config ) if result.success:
# Print clean content print(\"Content:\",
result.markdown[:500]) # First 500 chars #
Process images for image in result.media[\"images
\"]: print(f\"Found image: {image['src']}\")
# Process links for link in
result.links[\"internal\"]: print(f\"Internal
link: {link['href']}\") else: print(f
\"Crawl failed: {result.error_message}\") if __name__ ==
\"__main__\": asyncio.run(main())`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/crawler-result/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/crawler-
result/",
"loadedTime": "2025-03-05T23:16:27.557Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/crawler-
result/",
"title": "Crawler Result - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:25 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"5f0d407cc87ab9d249974a957b0bb1a3\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Crawler Result - Crawl4AI Documentation
(v0.5.x)\nCrawl Result and Output\nWhen you call arun() on a
page, Crawl4AI returns a CrawlResult object containing
everything you might need—raw HTML, a cleaned version,
optional screenshots or PDFs, structured extraction results,
and more. This document explains those fields and how they map
to different output types. \n1. The CrawlResult Model\nBelow
is the core schema. Each field captures a different aspect of
the crawl’s result:\nclass
MarkdownGenerationResult(BaseModel): raw_markdown: str
markdown_with_citations: str references_markdown: str
fit_markdown: Optional[str] = None fit_html: Optional[str] =
None class CrawlResult(BaseModel): url: str html: str success:
bool cleaned_html: Optional[str] = None media: Dict[str,
List[Dict]] = {} links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None screenshot:
Optional[str] = None pdf : Optional[bytes] = None markdown:
Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None metadata:
Optional[dict] = None error_message: Optional[str] = None
session_id: Optional[str] = None response_headers:
Optional[dict] = None status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None class Config:
arbitrary_types_allowed = True \nTable: Key Fields in
CrawlResult\nField (Name & Type) Description \nurl (str)\tThe
final or actual URL crawled (in case of redirects).\t\nhtml
(str)\tOriginal, unmodified page HTML. Good for debugging or
custom processing.\t\nsuccess (bool)\tTrue if the crawl
completed without major errors, else False.\t\ncleaned_html
(Optional[str])\tSanitized HTML with scripts/styles removed;
can exclude tags if configured via excluded_tags etc.\t\nmedia
(Dict[str, List[Dict]])\tExtracted media info (images, audio,
etc.), each with attributes like src, alt, score, etc.\t
\nlinks (Dict[str, List[Dict]])\tExtracted link data, split by
internal and external. Each link usually has href, text, etc.
\t\ndownloaded_files (Optional[List[str]])\tIf
accept_downloads=True in BrowserConfig, this lists the
filepaths of saved downloads.\t\nscreenshot
(Optional[str])\tScreenshot of the page (base64-encoded) if
screenshot=True.\t\npdf (Optional[bytes])\tPDF of the page if
pdf=True.\t\nmarkdown (Optional[str or
MarkdownGenerationResult])\tIt holds a
MarkdownGenerationResult; older markdown fields (e.g. markdown_v2) have been consolidated into it. The generator can provide raw markdown,
citations, references, and optionally fit_markdown.\t
\nextracted_content (Optional[str])\tThe output of a
structured extraction (CSS/LLM-based) stored as JSON string or
other text.\t\nmetadata (Optional[dict])\tAdditional info
about the crawl or extracted data.\t\nerror_message
(Optional[str])\tIf success=False, contains a short
description of what went wrong.\t\nsession_id
(Optional[str])\tThe ID of the session used for multi-page or
persistent crawling.\t\nresponse_headers
(Optional[dict])\tHTTP response headers, if captured.\t
\nstatus_code (Optional[int])\tHTTP status code (e.g., 200 for
OK).\t\nssl_certificate (Optional[SSLCertificate])\tSSL
certificate info if fetch_ssl_certificate=True.\t\n2. HTML
Variants\nhtml: Raw HTML\nCrawl4AI preserves the exact HTML as
result.html. Useful for:\nDebugging page issues or checking
the original content.\nPerforming your own specialized parse
if needed.\ncleaned_html: Sanitized\nIf you specify any
cleanup or exclusion parameters in CrawlerRunConfig (like
excluded_tags, remove_forms, etc.), you’ll see the result
here:\nconfig = CrawlerRunConfig( excluded_tags=[\"form\",
\"header\", \"footer\"], keep_data_attributes=False ) result =
await crawler.arun(\"https://example.com\", config=config)
print(result.cleaned_html) # Freed of forms, header, footer,
data-* attributes \n3. Markdown Generation\n3.1 markdown
\nmarkdown: The current location for detailed markdown output,
returning a MarkdownGenerationResult object. \nmarkdown_v2:
Deprecated since v0.5.\nMarkdownGenerationResult Fields:
\nField Description \nraw_markdown\tThe basic HTML→Markdown
conversion.\t\nmarkdown_with_citations\tMarkdown including
inline citations that reference links at the end.\t
\nreferences_markdown\tThe references/citations themselves (if
citations=True).\t\nfit_markdown\tThe filtered/“fit” markdown if a content filter was used.\t\nfit_html\tThe
filtered HTML that generated fit_markdown.\t\n3.2 Basic
Example with a Markdown Generator\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator config =
CrawlerRunConfig( markdown_generator=DefaultMarkdownGenerator(
options={\"citations\": True, \"body_width\": 80} # e.g. pass
html2text style options ) ) result = await crawler.arun(url=
\"https://example.com\", config=config) md_res =
result.markdown # a MarkdownGenerationResult object
print(md_res.raw_markdown[:500])
print(md_res.markdown_with_citations)
print(md_res.references_markdown) \nNote: If you use a filter
like PruningContentFilter, you’ll get fit_markdown and
fit_html as well.\n4. Structured Extraction: extracted_content
\nIf you run a JSON-based extraction strategy (CSS, XPath,
LLM, etc.), the structured data is not stored in markdown—
it’s placed in result.extracted_content as a JSON string (or
sometimes plain text).\nimport asyncio import json from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy async def main(): schema = { \"name
\": \"Example Items\", \"baseSelector\": \"div.item\",
\"fields\": [ {\"name\": \"title\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"link\", \"selector\": \"a\",
\"type\": \"attribute\", \"attribute\": \"href\"} ] } raw_html
= \"<div class='item'><h2>Item 1</h2><a
href='https://example.com/item1'>Link 1</a></div>\" async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"raw://\" + raw_html,
config=CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema) ) ) data
= json.loads(result.extracted_content) print(data) if __name__
== \"__main__\": asyncio.run(main()) \nHere: - url=\"raw://...
\" passes the HTML content directly, no network requests.\n-
The CSS extraction strategy populates result.extracted_content
with the JSON array [{\"title\": \"...\", \"link\": \"...\"}].
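Since extracted_content is Optional and, per the table above, may hold plain text rather than JSON for some strategies, here is a small hedged sketch of defensive parsing (the helper name is ours, not part of Crawl4AI):

import json

def parse_extraction(result):
    # None when no extraction strategy was configured for this run.
    if not result.extracted_content:
        return None
    try:
        return json.loads(result.extracted_content)
    except json.JSONDecodeError:
        # Some strategies store plain text rather than a JSON string.
        return result.extracted_content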
\n5.1 links\nA dictionary, typically with \"internal\" and
\"external\" lists. Each entry might have href, text, title,
etc. This is automatically captured if you haven’t disabled
link extraction.\nprint(result.links[\"internal\"][:3]) # Show
first 3 internal links \n5.2 media\nSimilarly, a dictionary
with \"images\", \"audio\", \"video\", etc. Each item could
include src, alt, score, and more, if your crawler is set to
gather them.\nimages = result.media.get(\"images\", []) for
img in images: print(\"Image URL:\", img[\"src\"], \"Alt:\",
img.get(\"alt\")) \n5.3 screenshot and pdf\nIf you set
screenshot=True or pdf=True in CrawlerRunConfig, then:
\nresult.screenshot contains a base64-encoded PNG string.
\nresult.pdf contains raw PDF bytes (you can write them to a
file).\nwith open(\"page.pdf\", \"wb\") as f:
f.write(result.pdf) \n5.4 ssl_certificate\nIf
fetch_ssl_certificate=True, result.ssl_certificate holds
details about the site’s SSL cert, such as issuer, validity
dates, etc.\n6. Accessing These Fields\nAfter you run:\nresult
= await crawler.arun(url=\"https://example.com\",
config=some_config) \nCheck any field:\nif result.success:
print(result.status_code, result.response_headers)
print(\"Links found:\", len(result.links.get(\"internal\",
[]))) if result.markdown: print(\"Markdown snippet:\",
result.markdown.raw_markdown[:200]) if
result.extracted_content: print(\"Structured JSON:\",
result.extracted_content) else: print(\"Error:\",
result.error_message) \nDeprecation: Since v0.5
result.markdown_v2, result.fit_html, result.fit_markdown are
deprecated. Use result.markdown instead! It holds
MarkdownGenerationResult, which includes fit_html and
fit_markdown as its properties.\n7. Next Steps\nMarkdown
Generation: Dive deeper into how to configure
DefaultMarkdownGenerator and various filters. \nContent
Filtering: Learn how to use BM25ContentFilter and
PruningContentFilter.\nSession & Hooks: If you want to
manipulate the page or preserve state across multiple arun()
calls, see the hooking or session docs. \nLLM Extraction: For
complex or unstructured content requiring AI-driven parsing,
check the LLM-based strategies doc.\nEnjoy exploring all that
CrawlResult offers—whether you need raw HTML, sanitized
output, markdown, or fully structured data, Crawl4AI has you
covered!",
"markdown": "# Crawler Result - Crawl4AI Documentation
(v0.5.x)\n\n## Crawl Result and Output\n\nWhen you call
`arun()` on a page, Crawl4AI returns a **`CrawlResult`**
object containing everything you might need—raw HTML, a
cleaned version, optional screenshots or PDFs, structured
extraction results, and more. This document explains those
fields and how they map to different output types.\n\n* * *\n
\n## 1\\. The `CrawlResult` Model\n\nBelow is the core schema.
Each field captures a different aspect of the crawl’s
result:\n\n`class MarkdownGenerationResult(BaseModel):
raw_markdown: str markdown_with_citations: str
references_markdown: str fit_markdown: Optional[str] =
None fit_html: Optional[str] = None class
CrawlResult(BaseModel): url: str html: str
success: bool cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {} links: Dict[str,
List[Dict]] = {} downloaded_files: Optional[List[str]] =
None screenshot: Optional[str] = None pdf :
Optional[bytes] = None markdown: Optional[Union[str,
MarkdownGenerationResult]] = None extracted_content:
Optional[str] = None metadata: Optional[dict] = None
error_message: Optional[str] = None session_id:
Optional[str] = None response_headers: Optional[dict] =
None status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None class
Config: arbitrary_types_allowed = True`\n\n### Table:
Key Fields in `CrawlResult`\n\n| Field (Name & Type) |
Description |\n| --- | --- |\n| **url (`str`)** | The final or
actual URL crawled (in case of redirects). |\n| **html
(`str`)** | Original, unmodified page HTML. Good for debugging
or custom processing. |\n| **success (`bool`)** | `True` if
the crawl completed without major errors, else `False`. |\n|
**cleaned\\_html (`Optional[str]`)** | Sanitized HTML with
scripts/styles removed; can exclude tags if configured via
`excluded_tags` etc. |\n| **media (`Dict[str, List[Dict]]`)**
| Extracted media info (images, audio, etc.), each with
attributes like `src`, `alt`, `score`, etc. |\n| **links
(`Dict[str, List[Dict]]`)** | Extracted link data, split by
`internal` and `external`. Each link usually has `href`,
`text`, etc. |\n| **downloaded\\_files
(`Optional[List[str]]`)** | If `accept_downloads=True` in
`BrowserConfig`, this lists the filepaths of saved downloads.
|\n| **screenshot (`Optional[str]`)** | Screenshot of the page
(base64-encoded) if `screenshot=True`. |\n| **pdf
(`Optional[bytes]`)** | PDF of the page if `pdf=True`. |\n|
**markdown (`Optional[str or MarkdownGenerationResult]`)** |
It holds a `MarkdownGenerationResult`; older markdown fields (e.g. `markdown_v2`) have been consolidated into it. The generator can provide raw
markdown, citations, references, and optionally
`fit_markdown`. |\n| **extracted\\_content (`Optional[str]`)**
| The output of a structured extraction (CSS/LLM-based) stored
as JSON string or other text. |\n| **metadata
(`Optional[dict]`)** | Additional info about the crawl or
extracted data. |\n| **error\\_message (`Optional[str]`)** |
If `success=False`, contains a short description of what went
wrong. |\n| **session\\_id (`Optional[str]`)** | The ID of the
session used for multi-page or persistent crawling. |\n|
**response\\_headers (`Optional[dict]`)** | HTTP response
headers, if captured. |\n| **status\\_code (`Optional[int]`)**
| HTTP status code (e.g., 200 for OK). |\n| **ssl
\\_certificate (`Optional[SSLCertificate]`)** | SSL
certificate info if `fetch_ssl_certificate=True`. |\n\n* * *\n
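Most of the fields in the table above are Optional; as a small hedged sketch (the helper below is ours, not part of Crawl4AI), you can report which of them a given result actually populated:

OPTIONAL_FIELDS = [
    "cleaned_html", "downloaded_files", "screenshot", "pdf",
    "markdown", "extracted_content", "metadata", "response_headers",
    "status_code", "ssl_certificate",
]

def summarize(result):
    # Report which optional CrawlResult fields were populated for this crawl.
    for name in OPTIONAL_FIELDS:
        value = getattr(result, name, None)
        print(f"{name}: {'set' if value is not None else 'not set'}")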
\n## 2\\. HTML Variants\n\n### `html`: Raw HTML\n\nCrawl4AI
preserves the exact HTML as `result.html`. Useful for:\n\n*
Debugging page issues or checking the original content.\n*
Performing your own specialized parse if needed.\n\n###
`cleaned_html`: Sanitized\n\nIf you specify any cleanup or
exclusion parameters in `CrawlerRunConfig` (like
`excluded_tags`, `remove_forms`, etc.), you’ll see the
result here:\n\n`config =
CrawlerRunConfig( excluded_tags=[\"form\", \"header\",
\"footer\"], keep_data_attributes=False ) result = await
crawler.arun(\"https://example.com\", config=config)
print(result.cleaned_html) # Freed of forms, header, footer,
data-* attributes`\n\n* * *\n\n## 3\\. Markdown Generation\n
\n### 3.1 `markdown`\n\n* **`markdown`**: The current
location for detailed markdown output, returning a
**`MarkdownGenerationResult`** object.\n* **`markdown_v2`**:
Deprecated since v0.5.\n\n**`MarkdownGenerationResult`**
Fields:\n\n| Field | Description |\n| --- | --- |\n| **raw
\\_markdown** | The basic HTML→Markdown conversion. |\n|
**markdown\\_with\\_citations** | Markdown including inline
citations that reference links at the end. |\n| **references
\\_markdown** | The references/citations themselves (if
`citations=True`). |\n| **fit\\_markdown** | The filtered/“fit” markdown if a content filter was used. |\n| **fit
\\_html** | The filtered HTML that generated `fit_markdown`. |
\n\n### 3.2 Basic Example with a Markdown Generator\n\n`from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator config =
CrawlerRunConfig( markdown_generator=DefaultMarkdownGenera
tor( options={\"citations\": True, \"body_width\": 80}
# e.g. pass html2text style options ) ) result = await
crawler.arun(url=\"https://example.com\", config=config)
md_res = result.markdown # a MarkdownGenerationResult object
print(md_res.raw_markdown[:500])
print(md_res.markdown_with_citations)
print(md_res.references_markdown)`\n\n**Note**: If you use a
filter like `PruningContentFilter`, you’ll get
`fit_markdown` and `fit_html` as well.\n\n* * *\n\n## 4\\.
Structured Extraction: `extracted_content`\n\nIf you run a
JSON-based extraction strategy (CSS, XPath, LLM, etc.), the
structured data is **not** stored in `markdown`—it’s
placed in **`result.extracted_content`** as a JSON string (or
sometimes plain text).\n\n`import asyncio import json from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy async def main(): schema =
{ \"name\": \"Example Items\", \"baseSelector
\": \"div.item\", \"fields\": [ {\"name\":
\"title\", \"selector\": \"h2\", \"type\": \"text\"},
{\"name\": \"link\", \"selector\": \"a\", \"type\":
\"attribute\", \"attribute\": \"href\"} ] }
raw_html = \"<div class='item'><h2>Item 1</h2><a
href='https://example.com/item1'>Link 1</a></div>\" async
with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"raw://\" + raw_html,
config=CrawlerRunConfig( cache_mode=CacheMode.
BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema)
) ) data =
json.loads(result.extracted_content) print(data) if
__name__ == \"__main__\": asyncio.run(main())`\n\nHere: -
`url=\"raw://...\"` passes the HTML content directly, no
network requests. \n\\- The **CSS** extraction strategy
populates `result.extracted_content` with the JSON array
`[{\"title\": \"...\", \"link\": \"...\"}]`.\n\n* * *\n\n###
5.1 `links`\n\nA dictionary, typically with `\"internal\"` and
`\"external\"` lists. Each entry might have `href`, `text`,
`title`, etc. This is automatically captured if you haven’t
disabled link extraction.\n\n`print(result.links[\"internal\"]
[:3]) # Show first 3 internal links`\n\n### 5.2 `media`\n
\nSimilarly, a dictionary with `\"images\"`, `\"audio\"`, `
\"video\"`, etc. Each item could include `src`, `alt`,
`score`, and more, if your crawler is set to gather them.\n
\n`images = result.media.get(\"images\", []) for img in
images: print(\"Image URL:\", img[\"src\"], \"Alt:\",
img.get(\"alt\"))`\n\n### 5.3 `screenshot` and `pdf`\n\nIf you
set `screenshot=True` or `pdf=True` in **`CrawlerRunConfig`**,
then:\n\n* `result.screenshot` contains a base64-encoded PNG
string.\n* `result.pdf` contains raw PDF bytes (you can
write them to a file).\n\n`with open(\"page.pdf\", \"wb\") as
f: f.write(result.pdf)`\n\n### 5.4 `ssl_certificate`\n\nIf
`fetch_ssl_certificate=True`, `result.ssl_certificate` holds
details about the site’s SSL cert, such as issuer, validity
dates, etc.\n\n* * *\n\n## 6\\. Accessing These Fields\n
\nAfter you run:\n\n`result = await crawler.arun(url=
\"https://example.com\", config=some_config)`\n\nCheck any
field:\n\n`if result.success: print(result.status_code,
result.response_headers) print(\"Links found:\",
len(result.links.get(\"internal\", []))) if
result.markdown: print(\"Markdown snippet:\",
result.markdown.raw_markdown[:200]) if
result.extracted_content: print(\"Structured JSON:\",
result.extracted_content) else: print(\"Error:\",
result.error_message)`\n\n**Deprecation**: Since v0.5
`result.markdown_v2`, `result.fit_html`, `result.fit_markdown`
are deprecated. Use `result.markdown` instead! It holds
`MarkdownGenerationResult`, which includes `fit_html` and
`fit_markdown` as its properties.\n\n* * *\n\n## 7\\. Next
Steps\n\n* **Markdown Generation**: Dive deeper into how to
configure `DefaultMarkdownGenerator` and various filters.\n*
**Content Filtering**: Learn how to use `BM25ContentFilter`
and `PruningContentFilter`.\n* **Session & Hooks**: If you
want to manipulate the page or preserve state across multiple
`arun()` calls, see the hooking or session docs.\n* **LLM
Extraction**: For complex or unstructured content requiring
AI-driven parsing, check the LLM-based strategies doc.\n
\n**Enjoy** exploring all that `CrawlResult` offers—whether
you need raw HTML, sanitized output, markdown, or fully
structured data, Crawl4AI has you covered!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/deep-crawling/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/deep-
crawling/",
"loadedTime": "2025-03-05T23:16:28.639Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/deep-
crawling/",
"title": "Deep Crawling - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:26 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"814c77859758b9138c7bb32384f14b4c\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Deep Crawling - Crawl4AI Documentation
(v0.5.x)\nOne of Crawl4AI's most powerful features is its
ability to perform configurable deep crawling that can explore
websites beyond a single page. With fine-tuned control over
crawl depth, domain boundaries, and content filtering,
Crawl4AI gives you the tools to extract precisely the content
you need.\nIn this tutorial, you'll learn:\nHow to set up a
Basic Deep Crawler with BFS strategy \nUnderstanding the
difference between streamed and non-streamed output
\nImplementing filters and scorers to target specific content
\nCreating advanced filtering chains for sophisticated crawls
\nUsing BestFirstCrawling for intelligent exploration
prioritization \nPrerequisites\n- You’ve completed or read
AsyncWebCrawler Basics to understand how to run a simple
crawl.\n- You know how to configure CrawlerRunConfig.\n1.
Quick Example\nHere's a minimal code snippet that implements a
basic deep crawl using the BFSDeepCrawlStrategy:\nimport
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy from
crawl4ai.content_scraping_strategy import
LXMLWebScrapingStrategy async def main(): # Configure a 2-
level deep crawl config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy( ma
x_depth=2, include_external=False ),
scraping_strategy=LXMLWebScrapingStrategy(), verbose=True )
async with AsyncWebCrawler() as crawler: results = await
crawler.arun(\"https://example.com\", config=config) print(f
\"Crawled {len(results)} pages in total\") # Access individual
results for result in results[:3]: # Show first 3 results
print(f\"URL: {result.url}\") print(f\"Depth:
{result.metadata.get('depth', 0)}\") if __name__ == \"__main__
\": asyncio.run(main()) \nWhat's happening?\n-
BFSDeepCrawlStrategy(max_depth=2, include_external=False)
instructs Crawl4AI to: - Crawl the starting page (depth 0)
plus 2 more levels - Stay within the same domain (don't follow
external links) - Each result contains metadata like the crawl
depth - Results are returned as a list after all crawling is
complete\n2. Understanding Deep Crawling Strategy Options\n2.1
BFSDeepCrawlStrategy (Breadth-First Search)\nThe
BFSDeepCrawlStrategy uses a breadth-first approach, exploring
all links at one depth before moving deeper:\nfrom
crawl4ai.deep_crawling import BFSDeepCrawlStrategy # Basic
configuration strategy = BFSDeepCrawlStrategy( max_depth=2, #
Crawl initial page + 2 levels deep include_external=False, #
Stay within the same domain max_pages=50, # Maximum number of
pages to crawl (optional) score_threshold=0.3, # Minimum score
for URLs to be crawled (optional) ) \nKey parameters: -
max_depth: Number of levels to crawl beyond the starting
page - include_external: Whether to follow links to other
domains - max_pages: Maximum number of pages to crawl
(default: infinite) - score_threshold: Minimum score for URLs
to be crawled (default: -inf) - filter_chain: FilterChain
instance for URL filtering - url_scorer: Scorer instance for
evaluating URLs\n2.2 DFSDeepCrawlStrategy (Depth-First
Search)\nThe DFSDeepCrawlStrategy uses a depth-first approach,
exploring as far down a branch as possible before backtracking.
\nfrom crawl4ai.deep_crawling import DFSDeepCrawlStrategy #
Basic configuration strategy =
DFSDeepCrawlStrategy( max_depth=2, # Crawl initial page + 2
levels deep include_external=False, # Stay within the same
domain max_pages=30, # Maximum number of pages to crawl
(optional) score_threshold=0.5, # Minimum score for URLs to be
crawled (optional) ) \nKey parameters: - max_depth: Number of
levels to crawl beyond the starting page - include_external:
Whether to follow links to other domains - max_pages: Maximum
number of pages to crawl (default: infinite) -
score_threshold: Minimum score for URLs to be crawled
(default: -inf) - filter_chain: FilterChain instance for URL
filtering - url_scorer: Scorer instance for evaluating URLs
\n2.3 BestFirstCrawlingStrategy (⚠️ - Recommended Deep
crawl strategy)\nFor more intelligent crawling, use
BestFirstCrawlingStrategy with scorers to prioritize the most
relevant pages:\nfrom crawl4ai.deep_crawling import
BestFirstCrawlingStrategy from crawl4ai.deep_crawling.scorers
import KeywordRelevanceScorer # Create a scorer scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 ) # Configure the
strategy strategy = BestFirstCrawlingStrategy( max_depth=2,
include_external=False, url_scorer=scorer, max_pages=25, #
Maximum number of pages to crawl (optional) ) \nThis crawling
approach: - Evaluates each discovered URL based on scorer
criteria - Visits higher-scoring pages first - Helps focus
crawl resources on the most relevant content - Can limit total
pages crawled with max_pages - Does not need score_threshold
as it naturally prioritizes by score\n3. Streaming vs. Non-
Streaming Results\nCrawl4AI can return results in two modes:
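Note: the non-streaming and streaming snippets below call process_result(), which is not defined in these docs; it stands in for whatever per-page handling you need. A minimal hypothetical placeholder:

def process_result(result):
    # Placeholder only: log the URL, its crawl depth, and whether it succeeded.
    depth = result.metadata.get("depth", 0)
    status = "ok" if result.success else f"failed: {result.error_message}"
    print(f"[depth {depth}] {result.url} -> {status}")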
\n3.1 Non-Streaming Mode (Default)\nconfig =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy(max
_depth=1), stream=False # Default behavior ) async with
AsyncWebCrawler() as crawler: # Wait for ALL results to be
collected before returning results = await
crawler.arun(\"https://example.com\", config=config) for
result in results: process_result(result) \nWhen to use non-
streaming mode: - You need the complete dataset before
processing - You're performing batch operations on all results
together - Crawl time isn't a critical factor\n3.2 Streaming
Mode\nconfig =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy(max
_depth=1), stream=True # Enable streaming ) async with
AsyncWebCrawler() as crawler: # Returns an async iterator
async for result in await crawler.arun(\"https://example.com
\", config=config): # Process each result as it becomes
available process_result(result) \nBenefits of streaming
mode: - Process results immediately as they're discovered -
Start working with early results while crawling continues -
Better for real-time applications or progressive display -
Reduces memory pressure when handling many pages\n4. Filtering
Content with Filter Chains\nFilters help you narrow down which
pages to crawl. Combine multiple filters using FilterChain for
powerful targeting.\n4.1 Basic URL Pattern Filter\nfrom
crawl4ai.deep_crawling.filters import FilterChain,
URLPatternFilter # Only follow URLs containing \"blog\" or
\"docs\" url_filter = URLPatternFilter(patterns=[\"*blog*\",
\"*docs*\"]) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy( ma
x_depth=1, filter_chain=FilterChain([url_filter]) ) ) \n4.2
Combining Multiple Filters\nfrom
crawl4ai.deep_crawling.filters import ( FilterChain,
URLPatternFilter, DomainFilter, ContentTypeFilter ) # Create a
chain of filters filter_chain = FilterChain([ # Only follow
URLs with specific patterns
URLPatternFilter(patterns=[\"*guide*\", \"*tutorial*\"]), #
Only crawl specific domains
DomainFilter( allowed_domains=[\"docs.example.com\"],
blocked_domains=[\"old.docs.example.com\"] ), # Only include
specific content types
ContentTypeFilter(allowed_types=[\"text/html\"]) ]) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy( ma
x_depth=2, filter_chain=filter_chain ) ) \n4.3 Available
Filter Types\nCrawl4AI includes several specialized filters:
\nURLPatternFilter: Matches URL patterns using wildcard syntax
\nDomainFilter: Controls which domains to include or exclude
\nContentTypeFilter: Filters based on HTTP Content-Type
\nContentRelevanceFilter: Uses similarity to a text query
\nSEOFilter: Evaluates SEO elements (meta tags, headers,
etc.)\n5. Using Scorers for Prioritized Crawling\nScorers
assign priority values to discovered URLs, helping the crawler
focus on the most relevant content first.\n5.1
KeywordRelevanceScorer\nfrom crawl4ai.deep_crawling.scorers
import KeywordRelevanceScorer from crawl4ai.deep_crawling
import BestFirstCrawlingStrategy # Create a keyword relevance
scorer keyword_scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 # Importance of this
scorer (0.0 to 1.0) ) config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlingStrateg
y( max_depth=2, url_scorer=keyword_scorer ), stream=True #
Recommended with BestFirstCrawling ) # Results will come in
order of relevance score async with AsyncWebCrawler() as
crawler: async for result in await
crawler.arun(\"https://example.com\", config=config): score =
result.metadata.get(\"score\", 0) print(f\"Score: {score:.2f}
| {result.url}\") \nHow scorers work: - Evaluate each
discovered URL before crawling - Calculate relevance based on
various signals - Help the crawler make intelligent choices
about traversal order\n6. Advanced Filtering Techniques\n6.1
SEO Filter for Quality Assessment\nThe SEOFilter helps you
identify pages with strong SEO characteristics:\nfrom
crawl4ai.deep_crawling.filters import FilterChain, SEOFilter #
Create an SEO filter that looks for specific keywords in page
metadata seo_filter = SEOFilter( threshold=0.5, # Minimum
score (0.0 to 1.0) keywords=[\"tutorial\", \"guide\",
\"documentation\"] ) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy( ma
x_depth=1, filter_chain=FilterChain([seo_filter]) ) ) \n6.2
Content Relevance Filter\nThe ContentRelevanceFilter analyzes
the actual content of pages:\nfrom
crawl4ai.deep_crawling.filters import FilterChain,
ContentRelevanceFilter # Create a content relevance filter
relevance_filter = ContentRelevanceFilter( query=\"Web
crawling and data extraction with Python\", threshold=0.7 #
Minimum similarity score (0.0 to 1.0) ) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy( ma
x_depth=1, filter_chain=FilterChain([relevance_filter]) ) )
\nThis filter: - Measures semantic similarity between query
and page content - It's a BM25-based relevance filter using
head section content\n7. Building a Complete Advanced Crawler
\nThis example combines multiple techniques for a
sophisticated crawl:\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.content_scraping_strategy import
LXMLWebScrapingStrategy from crawl4ai.deep_crawling import
BestFirstCrawlingStrategy from crawl4ai.deep_crawling.filters
import ( FilterChain, DomainFilter, URLPatternFilter,
ContentTypeFilter ) from crawl4ai.deep_crawling.scorers import
KeywordRelevanceScorer async def run_advanced_crawler(): #
Create a sophisticated filter chain filter_chain =
FilterChain([ # Domain boundaries
DomainFilter( allowed_domains=[\"docs.example.com\"],
blocked_domains=[\"old.docs.example.com\"] ), # URL patterns
to include URLPatternFilter(patterns=[\"*guide*\",
\"*tutorial*\", \"*blog*\"]), # Content type filtering
ContentTypeFilter(allowed_types=[\"text/html\"]) ]) # Create a
relevance scorer keyword_scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 ) # Set up the
configuration config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlingStrateg
y( max_depth=2, include_external=False,
filter_chain=filter_chain, url_scorer=keyword_scorer ),
scraping_strategy=LXMLWebScrapingStrategy(), stream=True,
verbose=True ) # Execute the crawl results = [] async with
AsyncWebCrawler() as crawler: async for result in await
crawler.arun(\"https://docs.example.com\", config=config):
results.append(result) score = result.metadata.get(\"score\",
0) depth = result.metadata.get(\"depth\", 0) print(f\"Depth:
{depth} | Score: {score:.2f} | {result.url}\") # Analyze the
results print(f\"Crawled {len(results)} high-value pages\")
print(f\"Average score: {sum(r.metadata.get('score', 0) for r
in results) / len(results):.2f}\") # Group by depth
depth_counts = {} for result in results: depth =
result.metadata.get(\"depth\", 0) depth_counts[depth] =
depth_counts.get(depth, 0) + 1 print(\"Pages crawled by depth:
\") for depth, count in sorted(depth_counts.items()): print(f
\" Depth {depth}: {count} pages\") if __name__ == \"__main__
\": asyncio.run(run_advanced_crawler()) \n8. Limiting and
Controlling Crawl Size\n8.1 Using max_pages\nYou can limit the
total number of pages crawled with the max_pages parameter:\n#
Limit to exactly 20 pages regardless of depth strategy =
BFSDeepCrawlStrategy( max_depth=3, max_pages=20 ) \nThis
feature is useful for: - Controlling API costs - Setting
predictable execution times - Focusing on the most important
content - Testing crawl configurations before full execution
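As a rough sketch that only combines options already shown in this tutorial, max_pages pairs well with streaming mode so you can watch the page budget being consumed (the URL is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def capped_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=3, max_pages=20),
        stream=True,  # process pages as they arrive instead of waiting for all of them
    )
    async with AsyncWebCrawler() as crawler:
        seen = 0
        async for result in await crawler.arun("https://docs.example.com", config=config):
            seen += 1
            print(f"{seen}/20: {result.url}")

asyncio.run(capped_crawl())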
\n8.2 Using score_threshold\nFor BFS and DFS strategies, you
can set a minimum score threshold to only crawl high-quality
pages:\n# Only follow links with scores above 0.4 strategy =
DFSDeepCrawlStrategy( max_depth=2,
url_scorer=KeywordRelevanceScorer(keywords=[\"api\", \"guide
\", \"reference\"]), score_threshold=0.4 # Skip URLs with
scores below this value ) \nNote that for
BestFirstCrawlingStrategy, score_threshold is not needed since
pages are already processed in order of highest score first.
\n9. Common Pitfalls & Tips\n1. Set realistic limits. Be
cautious with max_depth values > 3, which can exponentially
increase crawl size. Use max_pages to set hard limits.
\n2. Don't neglect the scoring component. BestFirstCrawling
works best with well-tuned scorers. Experiment with keyword
weights for optimal prioritization.\n3. Be a good web citizen. Respect robots.txt (note: robots.txt checking is disabled by default).\n4. Handle page errors gracefully. Not all pages will be accessible. Check result.success and result.status_code when processing results.\n5. Balance breadth vs.
depth. Choose your strategy wisely - BFS for comprehensive
coverage, DFS for deep exploration, BestFirst for focused
relevance-based crawling.\n10. Summary & Next Steps\nIn this
Deep Crawling with Crawl4AI tutorial, you learned to:
\nConfigure BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, and
BestFirstCrawlingStrategy\nProcess results in streaming or
non-streaming mode\nApply filters to target specific content
\nUse scorers to prioritize the most relevant pages\nLimit
crawls with max_pages and score_threshold parameters\nBuild a
complete advanced crawler with combined techniques\nWith these
tools, you can efficiently extract structured data from
websites at scale, focusing precisely on the content you need
for your specific use case.",
"markdown": "# Deep Crawling - Crawl4AI Documentation
(v0.5.x)\n\nOne of Crawl4AI's most powerful features is its
ability to perform **configurable deep crawling** that can
explore websites beyond a single page. With fine-tuned control
over crawl depth, domain boundaries, and content filtering,
Crawl4AI gives you the tools to extract precisely the content
you need.\n\nIn this tutorial, you'll learn:\n\n1. How to set
up a **Basic Deep Crawler** with BFS strategy\n2.
Understanding the difference between **streamed and non-
streamed** output\n3. Implementing **filters and scorers** to
target specific content\n4. Creating **advanced filtering
chains** for sophisticated crawls\n5. Using
**BestFirstCrawling** for intelligent exploration
prioritization\n\n> **Prerequisites** \n> \\- You’ve
completed or read [AsyncWebCrawler Basics]
(https://crawl4ai.com/mkdocs/core/simple-crawling/) to
understand how to run a simple crawl. \n> \\- You know how to
configure `CrawlerRunConfig`.\n\n* * *\n\n## 1\\. Quick
Example\n\nHere's a minimal code snippet that implements a
basic deep crawl using the **BFSDeepCrawlStrategy**:\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig from crawl4ai.deep_crawling import
BFSDeepCrawlStrategy from crawl4ai.content_scraping_strategy
import LXMLWebScrapingStrategy async def main(): #
Configure a 2-level deep crawl config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStra
tegy( max_depth=2,
include_external=False ),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True ) async with AsyncWebCrawler() as
crawler: results = await
crawler.arun(\"https://example.com\", config=config)
print(f\"Crawled {len(results)} pages in total\") #
Access individual results for result in results[:3]:
# Show first 3 results print(f\"URL:
{result.url}\") print(f\"Depth:
{result.metadata.get('depth', 0)}\") if __name__ ==
\"__main__\": asyncio.run(main())`\n\n**What's happening?
** \n\\- `BFSDeepCrawlStrategy(max_depth=2,
include_external=False)` instructs Crawl4AI to: - Crawl the
starting page (depth 0) plus 2 more levels - Stay within the
same domain (don't follow external links) - Each result
contains metadata like the crawl depth - Results are returned
as a list after all crawling is complete\n\n* * *\n\n## 2\\.
Understanding Deep Crawling Strategy Options\n\n### 2.1
BFSDeepCrawlStrategy (Breadth-First Search)\n\nThe
**BFSDeepCrawlStrategy** uses a breadth-first approach,
exploring all links at one depth before moving deeper:\n
\n`from crawl4ai.deep_crawling import BFSDeepCrawlStrategy #
Basic configuration strategy =
BFSDeepCrawlStrategy( max_depth=2, # Crawl
initial page + 2 levels deep include_external=False, #
Stay within the same domain max_pages=50, #
Maximum number of pages to crawl (optional)
score_threshold=0.3, # Minimum score for URLs to be
crawled (optional) )`\n\n**Key parameters:** -
**`max_depth`**: Number of levels to crawl beyond the starting
page - **`include_external`**: Whether to follow links to
other domains - **`max_pages`**: Maximum number of pages to
crawl (default: infinite) - **`score_threshold`**: Minimum
score for URLs to be crawled (default: -inf) -
**`filter_chain`**: FilterChain instance for URL filtering -
**`url_scorer`**: Scorer instance for evaluating URLs\n\n###
2.2 DFSDeepCrawlStrategy (Depth-First Search)\n\nThe
**DFSDeepCrawlStrategy** uses a depth-first approach, exploring
as far down a branch as possible before backtracking.\n\n`from
crawl4ai.deep_crawling import DFSDeepCrawlStrategy # Basic
configuration strategy = DFSDeepCrawlStrategy( max_depth=
2, # Crawl initial page + 2 levels deep
include_external=False, # Stay within the same domain
max_pages=30, # Maximum number of pages to crawl
(optional) score_threshold=0.5, # Minimum score for
URLs to be crawled (optional) )`\n\n**Key parameters:** -
**`max_depth`**: Number of levels to crawl beyond the starting
page - **`include_external`**: Whether to follow links to
other domains - **`max_pages`**: Maximum number of pages to
crawl (default: infinite) - **`score_threshold`**: Minimum
score for URLs to be crawled (default: -inf) -
**`filter_chain`**: FilterChain instance for URL filtering -
**`url_scorer`**: Scorer instance for evaluating URLs\n\n###
2.3 BestFirstCrawlingStrategy (⚠️ - Recommended Deep crawl
strategy)\n\nFor more intelligent crawling, use
**BestFirstCrawlingStrategy** with scorers to prioritize the
most relevant pages:\n\n`from crawl4ai.deep_crawling import
BestFirstCrawlingStrategy from crawl4ai.deep_crawling.scorers
import KeywordRelevanceScorer # Create a scorer scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 ) # Configure
the strategy strategy =
BestFirstCrawlingStrategy( max_depth=2,
include_external=False, url_scorer=scorer, max_pages=
25, # Maximum number of pages to crawl
(optional) )`\n\nThis crawling approach: - Evaluates each
discovered URL based on scorer criteria - Visits higher-
scoring pages first - Helps focus crawl resources on the most
relevant content - Can limit total pages crawled with
`max_pages` - Does not need `score_threshold` as it naturally
prioritizes by score\n\n* * *\n\n## 3\\. Streaming vs. Non-
Streaming Results\n\nCrawl4AI can return results in two modes:
\n\n### 3.1 Non-Streaming Mode (Default)\n\n`config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
(max_depth=1), stream=False # Default behavior ) async
with AsyncWebCrawler() as crawler: # Wait for ALL results
to be collected before returning results = await
crawler.arun(\"https://example.com\", config=config) for
result in results: process_result(result)`\n\n**When
to use non-streaming mode:** - You need the complete dataset
before processing - You're performing batch operations on all
results together - Crawl time isn't a critical factor\n\n###
3.2 Streaming Mode\n\n`config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
(max_depth=1), stream=True # Enable streaming ) async
with AsyncWebCrawler() as crawler: # Returns an async
iterator async for result in await
crawler.arun(\"https://example.com\", config=config):
# Process each result as it becomes available
process_result(result)`\n\n**Benefits of streaming mode:** -
Process results immediately as they're discovered - Start
working with early results while crawling continues - Better
for real-time applications or progressive display - Reduces
memory pressure when handling many pages\n\n* * *\n\n## 4\\.
Filtering Content with Filter Chains\n\nFilters help you
narrow down which pages to crawl. Combine multiple filters
using **FilterChain** for powerful targeting.\n\n### 4.1 Basic
URL Pattern Filter\n\n`from crawl4ai.deep_crawling.filters
import FilterChain, URLPatternFilter # Only follow URLs
containing \"blog\" or \"docs\" url_filter =
URLPatternFilter(patterns=[\"*blog*\", \"*docs*\"]) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
( max_depth=1,
filter_chain=FilterChain([url_filter]) ) )`\n\n### 4.2
Combining Multiple Filters\n\n`from
crawl4ai.deep_crawling.filters import ( FilterChain,
URLPatternFilter, DomainFilter, ContentTypeFilter ) #
Create a chain of filters filter_chain = FilterChain([ #
Only follow URLs with specific patterns
URLPatternFilter(patterns=[\"*guide*\", \"*tutorial*\"]),
# Only crawl specific domains
DomainFilter( allowed_domains=[\"docs.example.com\"],
blocked_domains=[\"old.docs.example.com\"] ), # Only
include specific content types
ContentTypeFilter(allowed_types=[\"text/html\"]) ]) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
( max_depth=2,
filter_chain=filter_chain ) )`\n\n### 4.3 Available Filter
Types\n\nCrawl4AI includes several specialized filters:\n\n*
**`URLPatternFilter`**: Matches URL patterns using wildcard
syntax\n* **`DomainFilter`**: Controls which domains to
include or exclude\n* **`ContentTypeFilter`**: Filters based
on HTTP Content-Type\n* **`ContentRelevanceFilter`**: Uses
similarity to a text query\n* **`SEOFilter`**: Evaluates SEO
elements (meta tags, headers, etc.)\n\n* * *\n\n## 5\\. Using
Scorers for Prioritized Crawling\n\nScorers assign priority
values to discovered URLs, helping the crawler focus on the
most relevant content first.\n\n### 5.1 KeywordRelevanceScorer
\n\n`from crawl4ai.deep_crawling.scorers import
KeywordRelevanceScorer from crawl4ai.deep_crawling import
BestFirstCrawlingStrategy # Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer( keywords=[\"crawl
\", \"example\", \"async\", \"configuration\"], weight=0.7
# Importance of this scorer (0.0 to 1.0) ) config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlingStr
ategy( max_depth=2,
url_scorer=keyword_scorer ), stream=True #
Recommended with BestFirstCrawling ) # Results will come in
order of relevance score async with AsyncWebCrawler() as
crawler: async for result in await
crawler.arun(\"https://example.com\", config=config):
score = result.metadata.get(\"score\", 0) print(f
\"Score: {score:.2f} | {result.url}\")`\n\n**How scorers work:
** - Evaluate each discovered URL before crawling - Calculate
relevance based on various signals - Help the crawler make
intelligent choices about traversal order\n\n* * *\n\n## 6\\.
Advanced Filtering Techniques\n\n### 6.1 SEO Filter for
Quality Assessment\n\nThe **SEOFilter** helps you identify
pages with strong SEO characteristics:\n\n`from
crawl4ai.deep_crawling.filters import FilterChain, SEOFilter
# Create an SEO filter that looks for specific keywords in
page metadata seo_filter = SEOFilter( threshold=0.5, #
Minimum score (0.0 to 1.0) keywords=[\"tutorial\", \"guide
\", \"documentation\"] ) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
( max_depth=1,
filter_chain=FilterChain([seo_filter]) ) )`\n\n### 6.2
Content Relevance Filter\n\nThe **ContentRelevanceFilter**
analyzes the actual content of pages:\n\n`from
crawl4ai.deep_crawling.filters import FilterChain,
ContentRelevanceFilter # Create a content relevance filter
relevance_filter = ContentRelevanceFilter( query=\"Web
crawling and data extraction with Python\", threshold=0.7
# Minimum similarity score (0.0 to 1.0) ) config =
CrawlerRunConfig( deep_crawl_strategy=BFSDeepCrawlStrategy
( max_depth=1,
filter_chain=FilterChain([relevance_filter]) ) )`\n\nThis
filter: - Measures semantic similarity between query and page
content - It's a BM25-based relevance filter using head
section content\n\n* * *\n\n## 7\\. Building a Complete
Advanced Crawler\n\nThis example combines multiple techniques
for a sophisticated crawl:\n\n`import asyncio from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.content_scraping_strategy import
LXMLWebScrapingStrategy from crawl4ai.deep_crawling import
BestFirstCrawlingStrategy from crawl4ai.deep_crawling.filters
import ( FilterChain, DomainFilter,
URLPatternFilter, ContentTypeFilter ) from
crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
async def run_advanced_crawler(): # Create a sophisticated
filter chain filter_chain = FilterChain([ # Domain
boundaries
DomainFilter( allowed_domains=[\"docs.example.com
\"], blocked_domains=[\"old.docs.example.com
\"] ), # URL patterns to include
URLPatternFilter(patterns=[\"*guide*\", \"*tutorial*\",
\"*blog*\"]), # Content type filtering
ContentTypeFilter(allowed_types=[\"text/html\"]) ]) #
Create a relevance scorer keyword_scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example
\", \"async\", \"configuration\"], weight=0.7 )
# Set up the configuration config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlin
gStrategy( max_depth=2,
include_external=False, filter_chain=filter_chain,
url_scorer=keyword_scorer ),
scraping_strategy=LXMLWebScrapingStrategy(),
stream=True, verbose=True ) # Execute the
crawl results = [] async with AsyncWebCrawler() as
crawler: async for result in await
crawler.arun(\"https://docs.example.com\", config=config):
results.append(result) score =
result.metadata.get(\"score\", 0) depth =
result.metadata.get(\"depth\", 0) print(f\"Depth:
{depth} | Score: {score:.2f} | {result.url}\") # Analyze
the results print(f\"Crawled {len(results)} high-value
pages\") print(f\"Average score:
{sum(r.metadata.get('score', 0) for r in results) /
len(results):.2f}\") # Group by depth depth_counts =
{} for result in results: depth =
result.metadata.get(\"depth\", 0) depth_counts[depth]
= depth_counts.get(depth, 0) + 1 print(\"Pages crawled by
depth:\") for depth, count in
sorted(depth_counts.items()): print(f\" Depth
{depth}: {count} pages\") if __name__ == \"__main__\":
asyncio.run(run_advanced_crawler())`\n\n* * *\n\n## 8\\.
Limiting and Controlling Crawl Size\n\n### 8.1 Using max
\\_pages\n\nYou can limit the total number of pages crawled
with the `max_pages` parameter:\n\n`# Limit to exactly 20
pages regardless of depth strategy =
BFSDeepCrawlStrategy( max_depth=3, max_pages=20 )`\n
\nThis feature is useful for: - Controlling API costs -
Setting predictable execution times - Focusing on the most
important content - Testing crawl configurations before full
execution\n\n### 8.2 Using score\\_threshold\n\nFor BFS and
DFS strategies, you can set a minimum score threshold to only
crawl high-quality pages:\n\n`# Only follow links with scores
above 0.4 strategy = DFSDeepCrawlStrategy( max_depth=2,
url_scorer=KeywordRelevanceScorer(keywords=[\"api\", \"guide
\", \"reference\"]), score_threshold=0.4 # Skip URLs with
scores below this value )`\n\nNote that for
BestFirstCrawlingStrategy, score\\_threshold is not needed
since pages are already processed in order of highest score
first.\n\n## 9\\. Common Pitfalls & Tips\n\n1. **Set realistic
limits.** Be cautious with `max_depth` values > 3, which can
exponentially increase crawl size. Use `max_pages` to set hard
limits.\n\n2. **Don't neglect the scoring component.**
BestFirstCrawling works best with well-tuned scorers.
Experiment with keyword weights for optimal prioritization.\n
\n3. **Be a good web citizen.** Respect robots.txt (note: robots.txt checking is disabled by default).\n\n4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.status_code` when processing results.\n\n5. **Balance breadth vs. depth.** Choose
your strategy wisely - BFS for comprehensive coverage, DFS for
deep exploration, BestFirst for focused relevance-based
crawling.\n\n* * *\n\n## 10\\. Summary & Next Steps\n\nIn this
**Deep Crawling with Crawl4AI** tutorial, you learned to:\n\n*
Configure **BFSDeepCrawlStrategy**, **DFSDeepCrawlStrategy**,
and **BestFirstCrawlingStrategy**\n* Process results in
streaming or non-streaming mode\n* Apply filters to target
specific content\n* Use scorers to prioritize the most
relevant pages\n* Limit crawls with `max_pages` and
`score_threshold` parameters\n* Build a complete advanced
crawler with combined techniques\n\nWith these tools, you can
efficiently extract structured data from websites at scale,
focusing precisely on the content you need for your specific
use case.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/browser-crawler-
config/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/browser-
crawler-config/",
"loadedTime": "2025-03-05T23:16:28.957Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/browser-
crawler-config/",
"title": "Browser, Crawler & LLM Config - Crawl4AI
Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:27 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"0f70be866eaf20f282356d4052c8ed7f\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Browser, Crawler & LLM Config\nBrowser, Crawler &
LLM Configuration (Quick Overview)\nCrawl4AI’s flexibility
stems from two key classes:\n1. BrowserConfig – Dictates how
the browser is launched and behaves (e.g., headless or
visible, proxy, user agent).\n2. CrawlerRunConfig – Dictates
how each crawl operates (e.g., caching, extraction, timeouts,
JavaScript code to run, etc.).\n3. LlmConfig - Dictates how
LLM providers are configured. (model, api token, base url,
temperature etc.)\nIn most examples, you create one
BrowserConfig for the entire crawler session, then pass a
fresh or re-used CrawlerRunConfig whenever you call arun().
This tutorial shows the most commonly used parameters. If you
need advanced or rarely used fields, see the Configuration
Parameters.\n1. BrowserConfig Essentials\nclass BrowserConfig:
def __init__( browser_type=\"chromium\", headless=True,
proxy_config=None, viewport_width=1080, viewport_height=600,
verbose=True, use_persistent_context=False,
user_data_dir=None, cookies=None, headers=None,
user_agent=None, text_mode=False, light_mode=False,
extra_args=None, # ... other advanced parameters omitted
here ): ... \nKey Fields to Note\n1. browser_type\n- Options:
\"chromium\", \"firefox\", or \"webkit\".\n- Defaults to
\"chromium\".\n- If you need a different engine, specify it
here.\n2. headless\n- True: Runs the browser in headless mode
(invisible browser).\n- False: Runs the browser in visible
mode, which helps with debugging.\n3. proxy_config\n- A
dictionary with fields like:\n{ \"server\":
\"http://proxy.example.com:8080\", \"username\": \"...\",
\"password\": \"...\" } \n- Leave as None if a proxy is not
required. \n4. viewport_width & viewport_height:\n- The
initial window size.\n- Some sites behave differently with
smaller or bigger viewports.\n5. verbose:\n- If True, prints
extra logs.\n- Handy for debugging.\n6.
use_persistent_context:\n- If True, uses a persistent browser
profile, storing cookies/local storage across runs.\n-
Typically also set user_data_dir to point to a folder.\n7.
cookies & headers:\n- If you want to start with specific
cookies or add universal HTTP headers, set them here.\n- E.g.
cookies=[{\"name\": \"session\", \"value\": \"abc123\",
\"domain\": \"example.com\"}].\n8. user_agent:\n- Custom User-
Agent string. If None, a default is used.\n- You can also set
user_agent_mode=\"random\" for randomization (if you want to
fight bot detection).\n9. text_mode & light_mode:\n-
text_mode=True disables images, possibly speeding up text-only
crawls.\n- light_mode=True turns off certain background
features for performance. \n10. extra_args:\n- Additional
flags for the underlying browser.\n- E.g. [\"--disable-
extensions\"].\nHelper Methods\nBoth configuration classes
provide a clone() method to create modified copies:\n# Create
a base browser config base_browser =
BrowserConfig( browser_type=\"chromium\", headless=True,
text_mode=True ) # Create a visible browser config for
debugging debug_browser = base_browser.clone( headless=False,
verbose=True ) \nMinimal Example:\nfrom crawl4ai import
AsyncWebCrawler, BrowserConfig browser_conf =
BrowserConfig( browser_type=\"firefox\", headless=False,
text_mode=True ) async with
AsyncWebCrawler(config=browser_conf) as crawler: result =
await crawler.arun(\"https://example.com\")
print(result.markdown[:300]) \n2. CrawlerRunConfig Essentials
\nclass CrawlerRunConfig: def __init__( word_count_threshold=
200, extraction_strategy=None, markdown_generator=None,
cache_mode=None, js_code=None, wait_for=None,
screenshot=False, pdf=False, enable_rate_limiting=False,
rate_limit_config=None, memory_threshold_percent=70.0,
check_interval=1.0, max_session_permit=20, display_mode=None,
verbose=True, stream=False, # Enable streaming for arun_many()
# ... other advanced parameters omitted ): ... \nKey Fields to
Note\n1. word_count_threshold:\n- The minimum word count
before a block is considered.\n- If your site has lots of
short paragraphs or items, you can lower it.\n2.
extraction_strategy:\n- Where you plug in JSON-based
extraction (CSS, LLM, etc.).\n- If None, no structured
extraction is done (only raw/cleaned HTML + markdown).\n3.
markdown_generator:\n- E.g., DefaultMarkdownGenerator(...),
controlling how HTML→Markdown conversion is done.\n- If
None, a default approach is used.\n4. cache_mode:\n- Controls
caching behavior (ENABLED, BYPASS, DISABLED, etc.).\n- If
None, defaults to some level of caching or you can specify
CacheMode.ENABLED.\n5. js_code:\n- A string or list of JS
strings to execute.\n- Great for “Load More” buttons or
user interactions. \n6. wait_for:\n- A CSS or JS expression to
wait for before extracting content.\n- Common usage: wait_for=
\"css:.main-loaded\" or wait_for=\"js:() => window.loaded ===
true\".\n7. screenshot & pdf:\n- If True, captures a
screenshot or PDF after the page is fully loaded.\n- The
results go to result.screenshot (base64) or result.pdf
(bytes).\n8. verbose:\n- Logs additional runtime details.\n-
Overlaps with the browser’s verbosity if also set to True in
BrowserConfig.\n9. enable_rate_limiting:\n- If True, enables
rate limiting for batch processing.\n- Requires
rate_limit_config to be set.\n10. memory_threshold_percent:\n-
The memory threshold (as a percentage) to monitor.\n- If
exceeded, the crawler will pause or slow down.\n11.
check_interval:\n- The interval (in seconds) to check system
resources.\n- Affects how often memory and CPU usage are
monitored.\n12. max_session_permit:\n- The maximum number of
concurrent crawl sessions.\n- Helps prevent overwhelming the
system.\n13. display_mode:\n- The display mode for progress
information (DETAILED, BRIEF, etc.).\n- Affects how much
information is printed during the crawl.\nHelper Methods\nThe
clone() method is particularly useful for creating variations
of your crawler configuration:\n# Create a base configuration
base_config = CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
word_count_threshold=200, wait_until=\"networkidle\" ) #
Create variations for different use cases stream_config =
base_config.clone( stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS ) debug_config =
base_config.clone( page_timeout=120000, # Longer timeout for
debugging verbose=True ) \nThe clone() method: - Creates a new
instance with all the same settings - Updates only the
specified parameters - Leaves the original configuration
unchanged - Perfect for creating variations without repeating
all parameters
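Several of the fields above combine naturally in one run config. The following is a small illustrative sketch, not a definitive recipe: the JS snippet and the .main-loaded selector are placeholders you would adapt to the target page.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Placeholder JS and wait condition, adjust for the page you crawl.
dynamic_conf = CrawlerRunConfig(
    js_code=\"document.querySelector('.load-more') && document.querySelector('.load-more').click();\",
    wait_for=\"css:.main-loaded\",
    screenshot=True,
)
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(\"https://example.com\", config=dynamic_conf)
        if result.success:
            print(result.markdown[:200])
            print(\"Screenshot captured:\", result.screenshot is not None)
asyncio.run(main())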
\n3. LlmConfig Essentials\nKey fields to note \n1. provider:\n- Which LLM provider to use. Possible values
are \"ollama/llama3\",\"groq/llama3-70b-8192\",
\"groq/llama3-8b-8192\", \"openai/gpt-4o-mini\" ,
\"openai/gpt-4o\",\"openai/o1-mini\",\"openai/o1-preview\",
\"openai/o3-mini\",\"openai/o3-mini-high\",
\"anthropic/claude-3-haiku-20240307\",\"anthropic/claude-3-
opus-20240229\",\"anthropic/claude-3-sonnet-20240229\",
\"anthropic/claude-3-5-sonnet-20240620\",\"gemini/gemini-pro
\",\"gemini/gemini-1.5-pro\",\"gemini/gemini-2.0-flash\",
\"gemini/gemini-2.0-flash-exp\",\"gemini/gemini-2.0-flash-
lite-preview-02-05\",\"deepseek/deepseek-chat\"\n(default:
\"openai/gpt-4o-mini\")\n2. api_token:\n- Optional. When not
provided explicitly, api_token will be read from environment
variables based on provider. For example: If a gemini model is
passed as provider then,\"GEMINI_API_KEY\" will be read from
environment variables\n- API token of LLM provider \neg:
api_token = \"gsk_
1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv\" -
Environment variable - use with prefix \"env:\" \neg:api_token
= \"env: GROQ_API_KEY\" \n3. base_url:\n- If your provider has
a custom endpoint\nllmConfig = LlmConfig(provider=
\"openai/gpt-4o-mini\", api_token=os.getenv(\"OPENAI_API_KEY
\")) \n4. Putting It All Together\nIn a typical scenario, you
define one BrowserConfig for your crawler session, then create
one or more CrawlerRunConfig & LlmConfig depending on each
call’s needs:\nimport asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode,
LlmConfig from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.content_filter_strategy import LLMContentFilter from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator async def main(): # 1) Browser
config: headless, bigger viewport, no proxy browser_conf =
BrowserConfig( headless=True, viewport_width=1280,
viewport_height=720 ) # 2) Example extraction strategy schema
= { \"name\": \"Articles\", \"baseSelector\": \"div.article\",
\"fields\": [ {\"name\": \"title\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"link\", \"selector\": \"a\",
\"type\": \"attribute\", \"attribute\": \"href\"} ] }
extraction = JsonCssExtractionStrategy(schema) # 3) Example
LLM content filtering gemini_config = LlmConfig( provider=
\"gemini/gemini-1.5-pro\" api_token = \"env:GEMINI_API_TOKEN
\" ) # Initialize LLM filter with specific instruction filter
= LLMContentFilter( llmConfig=gemini_config, # or your
preferred provider instruction=\"\"\" Focus on extracting the
core educational content. Include: - Key concepts and
explanations - Important code examples - Essential technical
details Exclude: - Navigation elements - Sidebars - Footer
content Format the output as clean markdown with proper code
blocks and headers. \"\"\", chunk_token_threshold=500, #
Adjust based on your needs verbose=True ) md_generator =
DefaultMarkdownGenerator( content_filter=filter,
options={\"ignore_links\": True} # 4) Crawler run config: skip
cache, use extraction run_conf =
CrawlerRunConfig( markdown_generator=md_generator,
extraction_strategy=extraction, cache_mode=CacheMode.BYPASS, )
async with AsyncWebCrawler(config=browser_conf) as crawler: #
5) Execute the crawl result = await crawler.arun(url=
\"https://example.com/news\", config=run_conf) if
result.success: print(\"Extracted content:\",
result.extracted_content) else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n5. Next Steps\nFor a detailed list of
available parameters (including advanced ones), see:
\nBrowserConfig, CrawlerRunConfig & LlmConfig Reference \nYou
can explore topics like:\nCustom Hooks & Auth (Inject
JavaScript or handle login forms). \nSession Management (Re-
use pages, preserve state across multiple calls). \nMagic Mode
or Identity-based Crawling (Fight bot detection by simulating
user behavior). \nAdvanced Caching (Fine-tune read/write cache
modes). \n6. Conclusion\nBrowserConfig, CrawlerRunConfig and
LlmConfig give you straightforward ways to define:\nWhich
browser to launch, how it should run, and any proxy or user
agent needs. \nHow each crawl should behave—caching,
timeouts, JavaScript code, extraction strategies, etc.\nWhich
LLM provider to use, api token, temperature and base url for
custom endpoints.\nUse them together for clear, maintainable
code, and when you need more specialized behavior, check out
the advanced parameters in the reference docs. Happy
crawling!",
"markdown": "# Browser, Crawler & LLM Config\n\n## Browser,
Crawler & LLM Configuration (Quick Overview)\n\nCrawl4AI’s
flexibility stems from two key classes:\n\n1.â
€€**`BrowserConfig`** – Dictates **how** the browser is
launched and behaves (e.g., headless or visible, proxy, user
agent). \n2. **`CrawlerRunConfig`** – Dictates **how**
each **crawl** operates (e.g., caching, extraction, timeouts,
JavaScript code to run, etc.). \n3\\. **`LlmConfig`** -
Dictates **how** LLM providers are configured. (model, api
token, base url, temperature etc.)\n\nIn most examples, you
create **one** `BrowserConfig` for the entire crawler session,
then pass a **fresh** or re-used `CrawlerRunConfig` whenever
you call `arun()`. This tutorial shows the most commonly used
parameters. If you need advanced or rarely used fields, see
the [Configuration Parameters]
(https://crawl4ai.com/mkdocs/api/parameters/).\n\n* * *\n\n##
1\\. BrowserConfig Essentials\n\n`class BrowserConfig: def
__init__( browser_type=\"chromium\",
headless=True, proxy_config=None,
viewport_width=1080, viewport_height=600,
verbose=True, use_persistent_context=False,
user_data_dir=None, cookies=None,
headers=None, user_agent=None,
text_mode=False, light_mode=False,
extra_args=None, # ... other advanced parameters
omitted here ): ...`\n\n### Key Fields to Note\n
\n1. **`browser_type`** \n\\- Options: `\"chromium\"`, `
\"firefox\"`, or `\"webkit\"`. \n\\- Defaults to `\"chromium
\"`. \n\\- If you need a different engine, specify it here.\n
\n2. **`headless`** \n\\- `True`: Runs the browser in
headless mode (invisible browser). \n\\- `False`: Runs the
browser in visible mode, which helps with debugging.\n\n3.â
€€**`proxy_config`** \n\\- A dictionary with fields like: \n
\n`{ \"server\": \"http://proxy.example.com:8080\",
\"username\": \"...\", \"password\": \"...\" }`\n\n\\-
Leave as `None` if a proxy is not required.\n\n4. **`viewport_width` & `viewport_height`**: \n\\- The initial
window size. \n\\- Some sites behave differently with smaller
or bigger viewports.\n\n5. **`verbose`**: \n\\- If `True`,
prints extra logs. \n\\- Handy for debugging.\n\n6. **`use_persistent_context`**: \n\\- If `True`, uses a
**persistent** browser profile, storing cookies/local storage
across runs. \n\\- Typically also set `user_data_dir` to
point to a folder.\n\n7. **`cookies`** & **`headers`**: \n
\\- If you want to start with specific cookies or add
universal HTTP headers, set them here. \n\\- E.g.
`cookies=[{\"name\": \"session\", \"value\": \"abc123\",
\"domain\": \"example.com\"}]`.\n\n8. **`user_agent`**: \n
\\- Custom User-Agent string. If `None`, a default is used.
\n\\- You can also set `user_agent_mode=\"random\"` for
randomization (if you want to fight bot detection).\n\n9. **`text_mode`** & **`light_mode`**: \n\\- `text_mode=True`
disables images, possibly speeding up text-only crawls. \n\\-
`light_mode=True` turns off certain background features for
performance.\n\n10. **`extra_args`**: \n\\- Additional
flags for the underlying browser. \n\\- E.g. `[\"--disable-
extensions\"]`.\n\n### Helper Methods\n\nBoth configuration
classes provide a `clone()` method to create modified copies:
\n\n`# Create a base browser config base_browser =
BrowserConfig( browser_type=\"chromium\",
headless=True, text_mode=True ) # Create a visible
browser config for debugging debug_browser =
base_browser.clone( headless=False, verbose=True )`\n
\n**Minimal Example**:\n\n`from crawl4ai import
AsyncWebCrawler, BrowserConfig browser_conf =
BrowserConfig( browser_type=\"firefox\",
headless=False, text_mode=True ) async with
AsyncWebCrawler(config=browser_conf) as crawler: result =
await crawler.arun(\"https://example.com\")
print(result.markdown[:300])`\n\n* * *\n\n## 2\\.
CrawlerRunConfig Essentials\n\n`class CrawlerRunConfig:
def __init__( word_count_threshold=200,
extraction_strategy=None, markdown_generator=None,
cache_mode=None, js_code=None, wait_for=None,
screenshot=False, pdf=False,
enable_rate_limiting=False, rate_limit_config=None,
memory_threshold_percent=70.0, check_interval=1.0,
max_session_permit=20, display_mode=None,
verbose=True, stream=False, # Enable streaming for
arun_many() # ... other advanced parameters
omitted ): ...`\n\n### Key Fields to Note\n\n1. **`word_count_threshold`**: \n\\- The minimum word count
before a block is considered. \n\\- If your site has lots of
short paragraphs or items, you can lower it.\n\n2. **`extraction_strategy`**: \n\\- Where you plug in JSON-
based extraction (CSS, LLM, etc.). \n\\- If `None`, no
structured extraction is done (only raw/cleaned HTML +
markdown).\n\n3. **`markdown_generator`**: \n\\- E.g.,
`DefaultMarkdownGenerator(...)`, controlling how
HTML→Markdown conversion is done. \n\\- If `None`, a
default approach is used.\n\n4. **`cache_mode`**: \n\\-
Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`,
etc.). \n\\- If `None`, defaults to some level of caching or
you can specify `CacheMode.ENABLED`.\n\n5. **`js_code`**:
\n\\- A string or list of JS strings to execute. \n\\- Great
for “Load More” buttons or user interactions.\n\n6. **`wait_for`**: \n\\- A CSS or JS expression to wait for
before extracting content. \n\\- Common usage: `wait_for=
\"css:.main-loaded\"` or `wait_for=\"js:() => window.loaded
=== true\"`.\n\n7. **`screenshot`** & **`pdf`**: \n\\- If
`True`, captures a screenshot or PDF after the page is fully
loaded. \n\\- The results go to `result.screenshot` (base64)
or `result.pdf` (bytes).\n\n8. **`verbose`**: \n\\- Logs
additional runtime details. \n\\- Overlaps with the browser’s verbosity if also set to `True` in `BrowserConfig`.\n\n9. **`enable_rate_limiting`**: \n\\- If `True`, enables rate
limiting for batch processing. \n\\- Requires
`rate_limit_config` to be set.\n\n10. **`memory_threshold_percent`**: \n\\- The memory threshold
(as a percentage) to monitor. \n\\- If exceeded, the crawler
will pause or slow down.\n\n11. **`check_interval`**: \n\\-
The interval (in seconds) to check system resources. \n\\-
Affects how often memory and CPU usage are monitored.\n\n12. **`max_session_permit`**: \n\\- The maximum number of
concurrent crawl sessions. \n\\- Helps prevent overwhelming
the system.\n\n13. **`display_mode`**: \n\\- The display
mode for progress information (`DETAILED`, `BRIEF`, etc.). \n
\\- Affects how much information is printed during the crawl.
\n\n### Helper Methods\n\nThe `clone()` method is particularly
useful for creating variations of your crawler configuration:
\n\n`# Create a base configuration base_config =
CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
word_count_threshold=200, wait_until=\"networkidle\" ) #
Create variations for different use cases stream_config =
base_config.clone( stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS ) debug_config =
base_config.clone( page_timeout=120000, # Longer timeout
for debugging verbose=True )`\n\nThe `clone()` method: -
Creates a new instance with all the same settings - Updates
only the specified parameters - Leaves the original
configuration unchanged - Perfect for creating variations
without repeating all parameters\n\n* * *\n\n## 3\\. LlmConfig
Essentials\n\n### Key fields to note\n\n1. **`provider`**:
\n\\- Which LLM provider to use. Possible values are `
\"ollama/llama3\",\"groq/llama3-70b-8192\",
\"groq/llama3-8b-8192\", \"openai/gpt-4o-mini\" ,
\"openai/gpt-4o\",\"openai/o1-mini\",\"openai/o1-preview\",
\"openai/o3-mini\",\"openai/o3-mini-high\",
\"anthropic/claude-3-haiku-20240307\",\"anthropic/claude-3-
opus-20240229\",\"anthropic/claude-3-sonnet-20240229\",
\"anthropic/claude-3-5-sonnet-20240620\",\"gemini/gemini-pro
\",\"gemini/gemini-1.5-pro\",\"gemini/gemini-2.0-flash\",
\"gemini/gemini-2.0-flash-exp\",\"gemini/gemini-2.0-flash-
lite-preview-02-05\",\"deepseek/deepseek-chat\"` \n_(default:
`\"openai/gpt-4o-mini\"`)_\n\n2. **`api_token`**: \n\\-
Optional. When not provided explicitly, api\\_token will be
read from environment variables based on provider. For
example: If a gemini model is passed as provider then,`
\"GEMINI_API_KEY\"` will be read from environment variables
\n\\- API token of LLM provider \neg: `api_token = \"gsk_
1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv\"` -
Environment variable - use with prefix \"env:\" \neg:
`api_token = \"env: GROQ_API_KEY\"`\n\n3. **`base_url`**:
\n\\- If your provider has a custom endpoint\n\n`llmConfig =
LlmConfig(provider=\"openai/gpt-4o-mini\",
api_token=os.getenv(\"OPENAI_API_KEY\"))`\n\n## 4\\. Putting
It All Together\n\nIn a typical scenario, you define **one**
`BrowserConfig` for your crawler session, then create **one or
more** `CrawlerRunConfig` & `LlmConfig` depending on each
call’s needs:\n\n`import asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode,
LlmConfig from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.content_filter_strategy import LLMContentFilter from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator async def main(): # 1) Browser
config: headless, bigger viewport, no proxy browser_conf =
BrowserConfig( headless=True, viewport_width=
1280, viewport_height=720 ) # 2) Example
extraction strategy schema = { \"name\":
\"Articles\", \"baseSelector\": \"div.article\",
\"fields\": [ {\"name\": \"title\", \"selector\":
\"h2\", \"type\": \"text\"}, {\"name\": \"link\",
\"selector\": \"a\", \"type\": \"attribute\", \"attribute\":
\"href\"} ] } extraction =
JsonCssExtractionStrategy(schema) # 3) Example LLM
content filtering gemini_config =
LlmConfig( provider=\"gemini/gemini-1.5-pro\",
api_token = \"env:GEMINI_API_TOKEN\" ) # Initialize
LLM filter with specific instruction filter =
LLMContentFilter( llmConfig=gemini_config, # or your
preferred provider instruction=\"\"\" Focus on
extracting the core educational content.
Include: - Key concepts and explanations -
Important code examples - Essential technical details
Exclude: - Navigation elements -
Sidebars - Footer content Format the output as
clean markdown with proper code blocks and headers.
\"\"\", chunk_token_threshold=500, # Adjust based on
your needs verbose=True ) md_generator =
DefaultMarkdownGenerator( content_filter=filter,
options={\"ignore_links\": True} # 4) Crawler run config:
skip cache, use extraction run_conf =
CrawlerRunConfig( markdown_generator=md_generator,
extraction_strategy=extraction,
cache_mode=CacheMode.BYPASS, ) async with
AsyncWebCrawler(config=browser_conf) as crawler: # 5)
Execute the crawl result = await crawler.arun(url=
\"https://example.com/news\", config=run_conf) if
result.success: print(\"Extracted content:\",
result.extracted_content) else:
print(\"Error:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n* * *\n\n## 5\\.
Next Steps\n\nFor a **detailed list** of available parameters
(including advanced ones), see:\n\n* [BrowserConfig,
CrawlerRunConfig & LlmConfig Reference]
(https://crawl4ai.com/mkdocs/api/parameters/)\n\nYou can
explore topics like:\n\n* **Custom Hooks & Auth** (Inject
JavaScript or handle login forms).\n* **Session Management**
(Re-use pages, preserve state across multiple calls).\n*
**Magic Mode** or **Identity-based Crawling** (Fight bot
detection by simulating user behavior).\n* **Advanced
Caching** (Fine-tune read/write cache modes).\n\n* * *\n\n## 6
\\. Conclusion\n\n**BrowserConfig**, **CrawlerRunConfig** and
**LlmConfig** give you straightforward ways to define:\n\n*
**Which** browser to launch, how it should run, and any proxy
or user agent needs.\n* **How** each crawl should behave—
caching, timeouts, JavaScript code, extraction strategies,
etc.\n* **Which** LLM provider to use, api token,
temperature and base url for custom endpoints.\n\nUse them
together for **clear, maintainable** code, and when you need
more specialized behavior, check out the advanced parameters
in the [reference docs]
(https://crawl4ai.com/mkdocs/api/parameters/). Happy
crawling!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/markdown-
generation/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/markdown-
generation/",
"loadedTime": "2025-03-05T23:16:36.663Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/markdown-
generation/",
"title": "Markdown Generation - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:34 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"04fa0b5b48532659d902572fb9f62167\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Markdown Generation - Crawl4AI Documentation
(v0.5.x)\nOne of Crawl4AI’s core features is generating
clean, structured markdown from web pages. Originally built to
solve the problem of extracting only the “actual” content and discarding boilerplate or noise, Crawl4AI’s markdown
system remains one of its biggest draws for AI workflows.\nIn
this tutorial, you’ll learn:\nHow to configure the Default
Markdown Generator \nHow content filters (BM25 or Pruning)
help you refine markdown and discard junk \nThe difference
between raw markdown (result.markdown) and filtered markdown
(fit_markdown) \nPrerequisites\n- You’ve completed or read
AsyncWebCrawler Basics to understand how to run a simple
crawl.\n- You know how to configure CrawlerRunConfig.\n1.
Quick Example\nHere’s a minimal code snippet that uses the
DefaultMarkdownGenerator with no additional filtering:\nimport
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator async def main(): config =
CrawlerRunConfig( markdown_generator=DefaultMarkdownGenerator(
) ) async with AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://example.com\", config=config) if
result.success: print(\"Raw Markdown Output:\\n\")
print(result.markdown) # The unfiltered markdown from the page
else: print(\"Crawl failed:\", result.error_message) if
__name__ == \"__main__\": asyncio.run(main()) \nWhat’s
happening?\n- CrawlerRunConfig( markdown_generator =
DefaultMarkdownGenerator() ) instructs Crawl4AI to convert the
final HTML into markdown at the end of each crawl.\n- The
resulting markdown is accessible via result.markdown.\n2. How
Markdown Generation Works\n2.1 HTML-to-Text Conversion (Forked
& Modified)\nUnder the hood, DefaultMarkdownGenerator uses a
specialized HTML-to-text approach that:\nPreserves headings,
code blocks, bullet points, etc. \nRemoves extraneous tags
(scripts, styles) that don’t add meaningful content. \nCan
optionally generate references for links or skip them
altogether.\nA set of options (passed as a dict) allows you to
customize precisely how HTML converts to markdown. These map
to standard html2text-like configuration plus your own
enhancements (e.g., ignoring internal links, preserving
certain tags verbatim, or adjusting line widths).\n2.2 Link
Citations & References\nBy default, the generator can convert
<a href=\"...\"> elements into [text][1] citations, then place
the actual links at the bottom of the document. This is handy
for research workflows that demand references in a structured
manner.\n2.3 Optional Content Filters\nBefore or after the
HTML-to-Markdown step, you can apply a content filter (like
BM25 or Pruning) to reduce noise and produce a “fit_markdown”—a heavily pruned version focusing on the page’s main text. We’ll cover these filters shortly.\n3.
Configuring the Default Markdown Generator\nYou can tweak the
output by passing an options dict to DefaultMarkdownGenerator.
For example:\nfrom crawl4ai.markdown_generation_strategy
import DefaultMarkdownGenerator from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig async def main(): # Example:
ignore all links, don't escape HTML, and wrap text at 80
characters md_generator =
DefaultMarkdownGenerator( options={ \"ignore_links\": True,
\"escape_html\": False, \"body_width\": 80 } ) config =
CrawlerRunConfig( markdown_generator=md_generator ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://example.com/docs\", config=config) if
result.success: print(\"Markdown:\\n\", result.markdown[:500])
# Just a snippet else: print(\"Crawl failed:\",
result.error_message) if __name__ == \"__main__\": import
asyncio asyncio.run(main()) \nSome commonly used options:
\nignore_links (bool): Whether to remove all hyperlinks in the
final markdown. \nignore_images (bool): Remove all ![image]()
references. \nescape_html (bool): Turn HTML entities into text
(default is often True). \nbody_width (int): Wrap text at N
characters. 0 or None means no wrapping. \nskip_internal_links
(bool): If True, omit #localAnchors or internal links
referencing the same page. \ninclude_sup_sub (bool): Attempt
to handle <sup> / <sub> in a more readable way.
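To make these options concrete, here is a small sketch; the chosen values are arbitrary examples rather than recommendations:
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
md_generator = DefaultMarkdownGenerator(
    options={
        \"ignore_images\": True,        # drop ![image]() references
        \"skip_internal_links\": True,  # drop #localAnchors
        \"body_width\": 0,              # no line wrapping
    }
)
config = CrawlerRunConfig(markdown_generator=md_generator)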
\n4. Content Filters\nContent filters selectively remove or rank sections
of text before turning them into Markdown. This is especially
helpful if your page has ads, nav bars, or other clutter you
don’t want.\n4.1 BM25ContentFilter\nIf you have a search
query, BM25 is a good choice:\nfrom
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator from crawl4ai.content_filter_strategy
import BM25ContentFilter from crawl4ai import CrawlerRunConfig
bm25_filter = BM25ContentFilter( user_query=\"machine learning
\", bm25_threshold=1.2, use_stemming=True ) md_generator =
DefaultMarkdownGenerator( content_filter=bm25_filter,
options={\"ignore_links\": True} ) config =
CrawlerRunConfig(markdown_generator=md_generator)
\nuser_query: The term you want to focus on. BM25 tries to
keep only content blocks relevant to that query. \nbm25
_threshold: Raise it to keep fewer blocks; lower it to keep
more. \nuse_stemming: If True, variations of words match
(e.g., “learn,” “learning,” “learnt”).\nNo query
provided? BM25 tries to glean a context from page metadata, or
you can simply treat it as a scorched-earth approach that
discards text with low generic score. Realistically, you want
to supply a query for best results.\n4.2 PruningContentFilter
\nIf you don’t have a specific query, or if you just want a robust “junk remover,” use PruningContentFilter. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically
prune extraneous or repetitive sections.\nfrom
crawl4ai.content_filter_strategy import PruningContentFilter
prune_filter = PruningContentFilter( threshold=0.5,
threshold_type=\"fixed\", # or \"dynamic\" min_word_threshold=
50 ) \nthreshold: Score boundary. Blocks below this score get
removed. \nthreshold_type: \n\"fixed\": Straight comparison
(score >= threshold keeps the block). \n\"dynamic\": The
filter adjusts threshold in a data-driven manner.
\nmin_word_threshold: Discard blocks under N words as likely
too short or unhelpful.\nWhen to Use PruningContentFilter\n-
You want a broad cleanup without a user query.\n- The page has
lots of repeated sidebars, footers, or disclaimers that hamper
text extraction.
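To make the pruning parameters concrete, here is a minimal sketch that wires PruningContentFilter into a crawl; the threshold values are illustrative starting points, and Section 5 below shows the same pattern in more detail.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type=\"dynamic\",  # let the filter adapt the score boundary
    min_word_threshold=50,       # drop very short blocks
)
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(content_filter=prune_filter)
)
async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(\"https://example.com\", config=config)
        if result.success:
            print(result.markdown.fit_markdown[:300])
asyncio.run(main())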
\n4.3 LLMContentFilter\nFor intelligent content filtering and high-quality markdown generation, you
can use the LLMContentFilter. This filter leverages LLMs to
generate relevant markdown while preserving the original
content's meaning and structure:\nfrom crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LlmConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
async def main(): # Initialize LLM filter with specific
instruction filter = LLMContentFilter( llmConfig =
LlmConfig(provider=\"openai/gpt-4o\",api_token=\"your-api-
token\"), #or use environment variable instruction=\"\"\"
Focus on extracting the core educational content. Include: -
Key concepts and explanations - Important code examples -
Essential technical details Exclude: - Navigation elements -
Sidebars - Footer content Format the output as clean markdown
with proper code blocks and headers. \"\"\",
chunk_token_threshold=4096, # Adjust based on your needs
verbose=True ) config =
CrawlerRunConfig( content_filter=filter ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://example.com\", config=config)
print(result.markdown.fit_markdown) # Filtered markdown
content \nKey Features: - Intelligent Filtering: Uses LLMs to
understand and extract relevant content while maintaining
context - Customizable Instructions: Tailor the filtering
process with specific instructions - Chunk Processing: Handles
large documents by processing them in chunks (controlled by
chunk_token_threshold) - Parallel Processing: For better
performance, use smaller chunk_token_threshold (e.g., 2048 or
4096) to enable parallel processing of content chunks\nTwo
Common Use Cases:\nExact Content Preservation: \nfilter =
LLMContentFilter( instruction=\"\"\" Extract the main
educational content while preserving its original wording and
substance completely. 1. Maintain the exact language and
terminology 2. Keep all technical explanations and examples
intact 3. Preserve the original flow and structure 4. Remove
only clearly irrelevant elements like navigation menus and ads
\"\"\", chunk_token_threshold=4096 ) \nFocused Content
Extraction: \nfilter = LLMContentFilter( instruction=\"\"\"
Focus on extracting specific types of content: - Technical
documentation - Code examples - API references Reformat the
content into clear, well-structured markdown \"\"\",
chunk_token_threshold=4096 ) \nPerformance Tip: Set a smaller
chunk_token_threshold (e.g., 2048 or 4096) to enable parallel
processing of content chunks. The default value is infinity,
which processes the entire content as a single chunk.\n5.
Using Fit Markdown\nWhen a content filter is active, the
library produces two forms of markdown inside result.markdown:
\n1. raw_markdown: The full unfiltered markdown.\n2.
fit_markdown: A “fit” version where the filter has removed
or trimmed noisy segments.\nimport asyncio from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator from crawl4ai.content_filter_strategy
import PruningContentFilter async def main(): config =
CrawlerRunConfig( markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={\"ignore_links\": True} ) ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://news.example.com/tech\", config=config)
if result.success: print(\"Raw markdown:\\n\",
result.markdown) # If a filter is used, we also
have .fit_markdown: md_object = result.markdown # or your
equivalent print(\"Filtered markdown:\\n\",
md_object.fit_markdown) else: print(\"Crawl failed:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n6. The MarkdownGenerationResult Object
\nIf your library stores detailed markdown output in an object
like MarkdownGenerationResult, you’ll see fields such as:
\nraw_markdown: The direct HTML-to-markdown transformation (no
filtering). \nmarkdown_with_citations: A version that moves
links to reference-style footnotes. \nreferences_markdown: A
separate string or section containing the gathered references.
\nfit_markdown: The filtered markdown if you used a content
filter. \nfit_html: The corresponding HTML snippet used to
generate fit_markdown (helpful for debugging or advanced
usage).\nExample:\nmd_obj = result.markdown # your library’s
naming may vary print(\"RAW:\\n\", md_obj.raw_markdown)
print(\"CITED:\\n\", md_obj.markdown_with_citations)
print(\"REFERENCES:\\n\", md_obj.references_markdown)
print(\"FIT:\\n\", md_obj.fit_markdown) \nWhy Does This
Matter?\n- You can supply raw_markdown to an LLM if you want
the entire text.\n- Or feed fit_markdown into a vector
database to reduce token usage.\n- references_markdown can
help you keep track of link provenance.
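For example, a hypothetical helper like the one sketched below (assuming result is a successful crawl result such as those in the earlier snippets) lets downstream code prefer the filtered text whenever a filter produced it:
def pick_markdown(result):
    # Prefer fit_markdown when a content filter produced one,
    # otherwise fall back to the full raw_markdown.
    md_obj = result.markdown
    fit = getattr(md_obj, \"fit_markdown\", \"\") or \"\"
    return fit if fit.strip() else md_obj.raw_markdown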
\nBelow is a revised section under “Combining Filters (BM25 + Pruning)” that
demonstrates how you can run two passes of content filtering
without re-crawling, by taking the HTML (or text) from a first
pass and feeding it into the second filter. It relies on BM25ContentFilter, which directly accepts HTML strings (and can also handle plain
text with minimal adaptation).\n7. Combining Filters (BM25 +
Pruning) in Two Passes\nYou might want to prune out noisy
boilerplate first (with PruningContentFilter), and then rank
what’s left against a user query (with BM25ContentFilter). You don’t have to crawl the page twice. Instead:\n1. First
pass: Apply PruningContentFilter directly to the raw HTML from
result.html (the crawler’s downloaded HTML).\n2. Second
pass: Take the pruned HTML (or text) from step 1, and feed it
into BM25ContentFilter, focusing on a user query.\nTwo-Pass
Example\nimport asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig from crawl4ai.content_filter_strategy import
PruningContentFilter, BM25ContentFilter from bs4 import
BeautifulSoup async def main(): # 1. Crawl with minimal or no
markdown generator, just get raw HTML config =
CrawlerRunConfig( # If you only want raw HTML, you can skip
passing a markdown_generator # or provide one but focus
on .html in this example ) async with AsyncWebCrawler() as
crawler: result = await
crawler.arun(\"https://example.com/tech-article\",
config=config) if not result.success or not result.html:
print(\"Crawl failed or no HTML content.\") return raw_html =
result.html # 2. First pass: PruningContentFilter on raw HTML
pruning_filter = PruningContentFilter(threshold=0.5,
min_word_threshold=50) # filter_content returns a list of
\"text chunks\" or cleaned HTML sections pruned_chunks =
pruning_filter.filter_content(raw_html) # This list is
basically pruned content blocks, presumably in HTML or text
form # For demonstration, let's combine these chunks back into
a single HTML-like string # or you could do further
processing. It's up to your pipeline design. pruned_html =
\"\\n\".join(pruned_chunks) # 3. Second pass:
BM25ContentFilter with a user query bm25_filter =
BM25ContentFilter( user_query=\"machine learning\", bm25
_threshold=1.2, language=\"english\" ) # returns a list of
text chunks bm25_chunks = bm25
_filter.filter_content(pruned_html) if not bm25_chunks:
print(\"Nothing matched the BM25 query after pruning.\")
return # 4. Combine or display final results final_text =
\"\\n---\\n\".join(bm25_chunks) print(\"==== PRUNED OUTPUT
(first pass) ====\") print(pruned_html[:500], \"...
(truncated)\") # preview print(\"\\n==== BM25 OUTPUT (second
pass) ====\") print(final_text[:500], \"... (truncated)\") if
__name__ == \"__main__\": asyncio.run(main()) \nWhat’s
Happening?\n1. Raw HTML: We crawl once and store the raw HTML
in result.html.\n2. PruningContentFilter: Takes HTML +
optional parameters. It extracts blocks of text or partial
HTML, removing headings/sections deemed “noise.” It
returns a list of text chunks.\n3. Combine or Transform: We
join these pruned chunks back into a single HTML-like string.
(Alternatively, you could store them in a list for further
logic—whatever suits your pipeline.)\n4. BM25ContentFilter:
We feed the pruned string into BM25ContentFilter with a user
query. This second pass further narrows the content to chunks
relevant to “machine learning.” \nNo Re-Crawling: We used raw_html from the first pass, so there’s no need to run arun() again—no second network request.\nTips & Variations
\nPlain Text vs. HTML: If your pruned output is mostly text,
BM25 can still handle it; just keep in mind it expects a valid
string input. If you supply partial HTML (like \"<p>some
text</p>\"), it will parse it as HTML. \nChaining in a Single
Pipeline: If your code supports it, you can chain multiple
filters automatically. Otherwise, manual two-pass filtering
(as shown) is straightforward. \nAdjust Thresholds: If you see
too much or too little text in step one, tweak threshold=0.5
or min_word_threshold=50. Similarly, bm25_threshold=1.2 can be
raised/lowered for more or fewer chunks in step two.\nOne-Pass
Combination?\nIf your codebase or pipeline design allows
applying multiple filters in one pass, you could do so. But
often it’s simpler—and more transparent—to run them
sequentially, analyzing each step’s result.\nBottom Line: By
manually chaining your filtering logic in two passes, you get
powerful incremental control over the final content. First,
remove “global” clutter with Pruning, then refine further with BM25-based query relevance—without incurring a second
network crawl.\n8. Common Pitfalls & Tips\n1. No Markdown
Output?\n- Make sure the crawler actually retrieved HTML. If
the site is heavily JS-based, you may need to enable dynamic
rendering or wait for elements.\n- Check if your content
filter is too aggressive. Lower thresholds or disable the
filter to see if content reappears.\n2. Performance
Considerations\n- Very large pages with multiple filters can
be slower. Consider cache_mode to avoid re-downloading.\n- If
your final use case is LLM ingestion, consider summarizing
further or chunking big texts.\n3. Take Advantage of
fit_markdown\n- Great for RAG pipelines, semantic search, or
any scenario where extraneous boilerplate is unwanted.\n-
Still verify the textual quality—some sites have crucial
data in footers or sidebars.\n4. Adjusting html2text Options
\n- If you see lots of raw HTML slipping into the text, turn
on escape_html.\n- If code blocks look messy, experiment with
mark_code or handle_code_in_pre.\n9. Summary & Next Steps\nIn
this Markdown Generation Basics tutorial, you learned to:
\nConfigure the DefaultMarkdownGenerator with HTML-to-text
options. \nUse BM25ContentFilter for query-specific extraction
or PruningContentFilter for general noise removal.
\nDistinguish between raw and filtered markdown
(fit_markdown). \nLeverage the MarkdownGenerationResult object
to handle different forms of output (citations, references,
etc.).\nNow you can produce high-quality Markdown from any
website, focusing on exactly the content you need—an
essential step for powering AI models, summarization
pipelines, or knowledge-base queries.\nLast Updated:
2025-01-01",
"markdown": "# Markdown Generation - Crawl4AI Documentation
(v0.5.x)\n\nOne of Crawl4AI’s core features is generating
**clean, structured markdown** from web pages. Originally
built to solve the problem of extracting only the “actual” content and discarding boilerplate or noise, Crawl4AI’s
markdown system remains one of its biggest draws for AI
workflows.\n\nIn this tutorial, you’ll learn:\n\n1. How to
configure the **Default Markdown Generator**\n2. How
**content filters** (BM25 or Pruning) help you refine markdown
and discard junk\n3. The difference between raw markdown
(`result.markdown`) and filtered markdown (`fit_markdown`)\n
\n> **Prerequisites** \n> \\- You’ve completed or read
[AsyncWebCrawler Basics]
(https://crawl4ai.com/mkdocs/core/simple-crawling/) to
understand how to run a simple crawl. \n> \\- You know how to
configure `CrawlerRunConfig`.\n\n* * *\n\n## 1\\. Quick
Example\n\nHere’s a minimal code snippet that uses the
**DefaultMarkdownGenerator** with no additional filtering:\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig from crawl4ai.markdown_generation_strategy
import DefaultMarkdownGenerator async def main(): config
=
CrawlerRunConfig( markdown_generator=DefaultMarkdownGe
nerator() ) async with AsyncWebCrawler() as crawler:
result = await crawler.arun(\"https://example.com\",
config=config) if result.success:
print(\"Raw Markdown Output:\\n\")
print(result.markdown) # The unfiltered markdown from the
page else: print(\"Crawl failed:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**What’s happening?** \n\\-
`CrawlerRunConfig( markdown_generator =
DefaultMarkdownGenerator() )` instructs Crawl4AI to convert
the final HTML into markdown at the end of each crawl. \n\\-
The resulting markdown is accessible via `result.markdown`.\n
\n* * *\n\n## 2\\. How Markdown Generation Works\n\n### 2.1
HTML-to-Text Conversion (Forked & Modified)\n\nUnder the hood,
**DefaultMarkdownGenerator** uses a specialized HTML-to-text
approach that:\n\n* Preserves headings, code blocks, bullet
points, etc.\n* Removes extraneous tags (scripts, styles)
that don’t add meaningful content.\n* Can optionally
generate references for links or skip them altogether.\n\nA
set of **options** (passed as a dict) allows you to customize
precisely how HTML converts to markdown. These map to standard
html2text-like configuration plus your own enhancements (e.g.,
ignoring internal links, preserving certain tags verbatim, or
adjusting line widths).\n\n### 2.2 Link Citations & References
\n\nBy default, the generator can convert `<a href=\"...\">`
elements into `[text][1]` citations, then place the actual
links at the bottom of the document. This is handy for
research workflows that demand references in a structured
manner.\n\n### 2.3 Optional Content Filters\n\nBefore or after
the HTML-to-Markdown step, you can apply a **content filter**
(like BM25 or Pruning) to reduce noise and produce a “fit\_markdown”—a heavily pruned version focusing on the page’s main text. We’ll cover these filters shortly.\n\n*
* *\n\n## 3\\. Configuring the Default Markdown Generator\n
\nYou can tweak the output by passing an `options` dict to
`DefaultMarkdownGenerator`. For example:\n\n`from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig async def main(): # Example: ignore all
links, don't escape HTML, and wrap text at 80 characters
md_generator =
DefaultMarkdownGenerator( options={ \"igno
re_links\": True, \"escape_html\": False,
\"body_width\": 80 } ) config =
CrawlerRunConfig( markdown_generator=md_generator
) async with AsyncWebCrawler() as crawler: result
= await crawler.arun(\"https://example.com/docs\",
config=config) if result.success:
print(\"Markdown:\\n\", result.markdown[:500]) # Just a
snippet else: print(\"Crawl failed:\",
result.error_message) if __name__ == \"__main__\": import
asyncio asyncio.run(main())`\n\nSome commonly used
`options`:\n\n* **`ignore_links`** (bool): Whether to remove
all hyperlinks in the final markdown.\n* **`ignore_images`**
(bool): Remove all `![image]()` references.\n*
**`escape_html`** (bool): Turn HTML entities into text
(default is often `True`).\n* **`body_width`** (int): Wrap
text at N characters. `0` or `None` means no wrapping.\n*
**`skip_internal_links`** (bool): If `True`, omit
`#localAnchors` or internal links referencing the same page.
\n* **`include_sup_sub`** (bool): Attempt to handle `<sup>`
/ `<sub>` in a more readable way.\n\n* * *\n\n## 4\\. Content
Filters\n\n**Content filters** selectively remove or rank
sections of text before turning them into Markdown. This is
especially helpful if your page has ads, nav bars, or other
clutter you don’t want.\n\n### 4.1 BM25ContentFilter\n\nIf
you have a **search query**, BM25 is a good choice:\n\n`from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator from crawl4ai.content_filter_strategy
import BM25ContentFilter from crawl4ai import CrawlerRunConfig
bm25_filter = BM25ContentFilter( user_query=\"machine
learning\", bm25_threshold=1.2, use_stemming=True )
md_generator =
DefaultMarkdownGenerator( content_filter=bm25_filter,
options={\"ignore_links\": True} ) config =
CrawlerRunConfig(markdown_generator=md_generator)`\n\n*
**`user_query`**: The term you want to focus on. BM25 tries to
keep only content blocks relevant to that query.\n* **`bm25
_threshold`**: Raise it to keep fewer blocks; lower it to keep
more.\n* **`use_stemming`**: If `True`, variations of words
match (e.g., “learn,” “learning,” “learnt”).\n
\n**No query provided?** BM25 tries to glean a context from
page metadata, or you can simply treat it as a scorched-earth
approach that discards text with low generic score.
Realistically, you want to supply a query for best results.\n
\n### 4.2 PruningContentFilter\n\nIf you **don’t** have a
specific query, or if you just want a robust “junk remover,â
€ use `PruningContentFilter`. It analyzes text density, link
density, HTML structure, and known patterns (like “nav,†â
€œfooter†) to systematically prune extraneous or repetitive
sections.\n\n`from crawl4ai.content_filter_strategy import
PruningContentFilter prune_filter =
PruningContentFilter( threshold=0.5, threshold_type=
\"fixed\", # or \"dynamic\" min_word_threshold=50 )`\n\n*
**`threshold`**: Score boundary. Blocks below this score get
removed.\n* **`threshold_type`**:\n * `\"fixed\"`:
Straight comparison (`score >= threshold` keeps the block).\n
* `\"dynamic\"`: The filter adjusts threshold in a data-
driven manner.\n* **`min_word_threshold`**: Discard blocks
under N words as likely too short or unhelpful.\n\n**When to
Use PruningContentFilter** \n\\- You want a broad cleanup
without a user query. \n\\- The page has lots of repeated
sidebars, footers, or disclaimers that hamper text extraction.
\n\n### 4.3 LLMContentFilter\n\nFor intelligent content
filtering and high-quality markdown generation, you can use
the **LLMContentFilter**. This filter leverages LLMs to
generate relevant markdown while preserving the original
content's meaning and structure:\n\n`from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LlmConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
async def main(): # Initialize LLM filter with specific
instruction filter = LLMContentFilter( llmConfig =
LlmConfig(provider=\"openai/gpt-4o\",api_token=\"your-api-
token\"), #or use environment variable instruction=
\"\"\" Focus on extracting the core educational
content. Include: - Key concepts and
explanations - Important code examples -
Essential technical details Exclude: -
Navigation elements - Sidebars - Footer
content Format the output as clean markdown with
proper code blocks and headers. \"\"\",
chunk_token_threshold=4096, # Adjust based on your needs
verbose=True ) config =
CrawlerRunConfig( content_filter=filter )
async with AsyncWebCrawler() as crawler: result =
await crawler.arun(\"https://example.com\", config=config)
print(result.markdown.fit_markdown) # Filtered markdown
content`\n\n**Key Features:** - **Intelligent Filtering**:
Uses LLMs to understand and extract relevant content while
maintaining context - **Customizable Instructions**: Tailor
the filtering process with specific instructions - **Chunk
Processing**: Handles large documents by processing them in
chunks (controlled by `chunk_token_threshold`) - **Parallel
Processing**: For better performance, use smaller
`chunk_token_threshold` (e.g., 2048 or 4096) to enable
parallel processing of content chunks\n\n**Two Common Use
Cases:**\n\n1. **Exact Content Preservation**:\n \n
`filter = LLMContentFilter( instruction=\"\"\" Extract
the main educational content while preserving its original
wording and substance completely. 1. Maintain the exact
language and terminology 2. Keep all technical
explanations and examples intact 3. Preserve the original
flow and structure 4. Remove only clearly irrelevant
elements like navigation menus and ads \"\"\",
chunk_token_threshold=4096 )`\n \n2. **Focused Content
Extraction**:\n \n `filter =
LLMContentFilter( instruction=\"\"\" Focus on
extracting specific types of content: - Technical
documentation - Code examples - API references
Reformat the content into clear, well-structured markdown
\"\"\", chunk_token_threshold=4096 )`\n \n\n>
**Performance Tip**: Set a smaller `chunk_token_threshold`
(e.g., 2048 or 4096) to enable parallel processing of content
chunks. The default value is infinity, which processes the
entire content as a single chunk.\n\n* * *\n\n## 5\\. Using
Fit Markdown\n\nWhen a content filter is active, the library
produces two forms of markdown inside `result.markdown`:\n
\n1. **`raw_markdown`**: The full unfiltered markdown.
\n2. **`fit_markdown`**: A “fit” version where the
filter has removed or trimmed noisy segments.\n\n`import
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator from crawl4ai.content_filter_strategy
import PruningContentFilter async def main(): config =
CrawlerRunConfig( markdown_generator=DefaultMarkdownGe
nerator( content_filter=PruningContentFilter(thres
hold=0.6), options={\"ignore_links\":
True} ) ) async with AsyncWebCrawler() as
crawler: result = await
crawler.arun(\"https://news.example.com/tech\", config=config)
if result.success: print(\"Raw markdown:\\n\",
result.markdown) # If a filter is used, we also
have .fit_markdown: md_object = result.markdown #
or your equivalent print(\"Filtered markdown:\\n
\", md_object.fit_markdown) else:
print(\"Crawl failed:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n* * *\n\n## 6\\. The
`MarkdownGenerationResult` Object\n\nIf your library stores
detailed markdown output in an object like
`MarkdownGenerationResult`, you’ll see fields such as:\n\n*
**`raw_markdown`**: The direct HTML-to-markdown transformation
(no filtering).\n* **`markdown_with_citations`**: A version
that moves links to reference-style footnotes.\n*
**`references_markdown`**: A separate string or section
containing the gathered references.\n* **`fit_markdown`**:
The filtered markdown if you used a content filter.\n*
**`fit_html`**: The corresponding HTML snippet used to
generate `fit_markdown` (helpful for debugging or advanced
usage).\n\n**Example**:\n\n`md_obj = result.markdown # your
library’s naming may vary print(\"RAW:\\n\",
md_obj.raw_markdown) print(\"CITED:\\n\",
md_obj.markdown_with_citations) print(\"REFERENCES:\\n\",
md_obj.references_markdown) print(\"FIT:\\n\",
md_obj.fit_markdown)`\n\n**Why Does This Matter?** \n\\- You
can supply `raw_markdown` to an LLM if you want the entire
text. \n\\- Or feed `fit_markdown` into a vector database to
reduce token usage. \n\\- `references_markdown` can help you
keep track of link provenance.\n\n* * *\n\nBelow is a
**revised section** under “Combining Filters (BM25 + Pruning)” that demonstrates how you can run **two** passes
of content filtering without re-crawling, by taking the HTML
(or text) from a first pass and feeding it into the second
filter. It relies on **BM25ContentFilter**, which directly accepts
**HTML** strings (and can also handle plain text with minimal
adaptation).\n\n* * *\n\n## 7\\. Combining Filters (BM25 +
Pruning) in Two Passes\n\nYou might want to **prune out**
noisy boilerplate first (with `PruningContentFilter`), and
then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page
twice. Instead:\n\n1. **First pass**: Apply
`PruningContentFilter` directly to the raw HTML from
`result.html` (the crawler’s downloaded HTML). \n2. **Second pass**: Take the pruned HTML (or text) from step 1,
and feed it into `BM25ContentFilter`, focusing on a user
query.\n\n### Two-Pass Example\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.content_filter_strategy import PruningContentFilter,
BM25ContentFilter from bs4 import BeautifulSoup async def
main(): # 1. Crawl with minimal or no markdown generator,
just get raw HTML config = CrawlerRunConfig( # If
you only want raw HTML, you can skip passing a
markdown_generator # or provide one but focus on .html
in this example ) async with AsyncWebCrawler() as
crawler: result = await
crawler.arun(\"https://example.com/tech-article\",
config=config) if not result.success or not
result.html: print(\"Crawl failed or no HTML
content.\") return raw_html = result.html
# 2. First pass: PruningContentFilter on raw HTML
pruning_filter = PruningContentFilter(threshold=0.5,
min_word_threshold=50) # filter_content returns a
list of \"text chunks\" or cleaned HTML sections
pruned_chunks = pruning_filter.filter_content(raw_html)
# This list is basically pruned content blocks, presumably in
HTML or text form # For demonstration, let's combine
these chunks back into a single HTML-like string # or
you could do further processing. It's up to your pipeline
design. pruned_html = \"\\n\".join(pruned_chunks)
# 3. Second pass: BM25ContentFilter with a user query
bm25_filter = BM25ContentFilter( user_query=
\"machine learning\", bm25_threshold=1.2,
language=\"english\" ) # returns a list of
text chunks bm25_chunks = bm25
_filter.filter_content(pruned_html) if not bm25
_chunks: print(\"Nothing matched the BM25 query
after pruning.\") return # 4. Combine or
display final results final_text = \"\\n---\\n
\".join(bm25_chunks) print(\"==== PRUNED OUTPUT
(first pass) ====\") print(pruned_html[:500], \"...
(truncated)\") # preview print(\"\\n==== BM25 OUTPUT
(second pass) ====\") print(final_text[:500], \"...
(truncated)\") if __name__ == \"__main__\":
asyncio.run(main())`\n\n### What’s Happening?\n\n1. **Raw
HTML**: We crawl once and store the raw HTML in `result.html`.
\n2. **PruningContentFilter**: Takes HTML + optional
parameters. It extracts blocks of text or partial HTML,
removing headings/sections deemed “noise.” It returns a
**list of text chunks**. \n3. **Combine or Transform**: We
join these pruned chunks back into a single HTML-like string.
(Alternatively, you could store them in a list for further
logic—whatever suits your pipeline.) \n4. **BM25ContentFilter**: We feed the pruned string into
`BM25ContentFilter` with a user query. This second pass
further narrows the content to chunks relevant to “machine learning.” \n\n**No Re-Crawling**: We used `raw_html` from the first pass, so there’s no need to run `arun()` again—**no second network request**.\n\n### Tips & Variations\n\n*
**Plain Text vs. HTML**: If your pruned output is mostly text,
BM25 can still handle it; just keep in mind it expects a valid
string input. If you supply partial HTML (like `\"<p>some
text</p>\"`), it will parse it as HTML.\n* **Chaining in a
Single Pipeline**: If your code supports it, you can chain
multiple filters automatically. Otherwise, manual two-pass
filtering (as shown) is straightforward.\n* **Adjust
Thresholds**: If you see too much or too little text in step
one, tweak `threshold=0.5` or `min_word_threshold=50`.
Similarly, `bm25_threshold=1.2` can be raised/lowered for more
or fewer chunks in step two.\n\n### One-Pass Combination?\n
\nIf your codebase or pipeline design allows applying multiple
filters in one pass, you could do so. But often it’s simpler—and more transparent—to run them sequentially, analyzing each step’s result.\n\n**Bottom Line**: By
**manually chaining** your filtering logic in two passes, you
get powerful incremental control over the final content.
First, remove “global” clutter with Pruning, then refine further with BM25-based query relevance—without incurring a
second network crawl.\n\n* * *\n\n## 8\\. Common Pitfalls &
Tips\n\n1. **No Markdown Output?** \n\\- Make sure the
crawler actually retrieved HTML. If the site is heavily JS-
based, you may need to enable dynamic rendering or wait for
elements. \n\\- Check if your content filter is too
aggressive. Lower thresholds or disable the filter to see if
content reappears.\n\n2. **Performance Considerations** \n
\\- Very large pages with multiple filters can be slower.
Consider `cache_mode` to avoid re-downloading. \n\\- If your
final use case is LLM ingestion, consider summarizing further
or chunking big texts.\n\n3. **Take Advantage of
`fit_markdown`** \n\\- Great for RAG pipelines, semantic
search, or any scenario where extraneous boilerplate is
unwanted. \n\\- Still verify the textual quality—some sites
have crucial data in footers or sidebars.\n\n4. **Adjusting
`html2text` Options** \n\\- If you see lots of raw HTML
slipping into the text, turn on `escape_html`. \n\\- If code
blocks look messy, experiment with `mark_code` or
`handle_code_in_pre`.\n\n* * *\n\n## 9\\. Summary & Next Steps
\n\nIn this **Markdown Generation Basics** tutorial, you
learned to:\n\n* Configure the **DefaultMarkdownGenerator**
with HTML-to-text options.\n* Use **BM25ContentFilter** for
query-specific extraction or **PruningContentFilter** for
general noise removal.\n* Distinguish between raw and
filtered markdown (`fit_markdown`).\n* Leverage the
`MarkdownGenerationResult` object to handle different forms of
output (citations, references, etc.).\n\nNow you can produce
high-quality Markdown from any website, focusing on exactly
the content you need—an essential step for powering AI
models, summarization pipelines, or knowledge-base queries.\n
\n**Last Updated**: 2025-01-01",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/cache-modes/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/cache-
modes/",
"loadedTime": "2025-03-05T23:16:37.647Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/cache-
modes/",
"title": "Cache Modes - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:35 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"67a05496b620356afa1554148ac5747e\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Cache Modes - Crawl4AI Documentation
(v0.5.x)\nCrawl4AI Cache System and Migration Guide\nOverview
\nStarting from version 0.5.0, Crawl4AI introduces a new
caching system that replaces the old boolean flags with a more
intuitive CacheMode enum. This change simplifies cache control
and makes the behavior more predictable.\nOld vs New Approach
\nOld Way (Deprecated)\nThe old system used multiple boolean
flags: - bypass_cache: Skip cache entirely - disable_cache:
Disable all caching - no_cache_read: Don't read from cache -
no_cache_write: Don't write to cache\nNew Way
(Recommended)\nThe new system uses a single CacheMode enum: -
CacheMode.ENABLED: Normal caching (read/write) -
CacheMode.DISABLED: No caching at all - CacheMode.READ_ONLY:
Only read from cache - CacheMode.WRITE_ONLY: Only write to
cache - CacheMode.BYPASS: Skip cache for this operation
\nMigration Example\nOld Code (Deprecated)\nimport asyncio
from crawl4ai import AsyncWebCrawler async def use_proxy():
async with AsyncWebCrawler(verbose=True) as crawler: result =
await crawler.arun( url=\"https://www.nbcnews.com/business\",
bypass_cache=True # Old way ) print(len(result.markdown))
async def main(): await use_proxy() if __name__ == \"__main__
\": asyncio.run(main()) \nNew Code (Recommended)\nimport
asyncio from crawl4ai import AsyncWebCrawler, CacheMode from
crawl4ai.async_configs import CrawlerRunConfig async def
use_proxy(): # Use CacheMode in CrawlerRunConfig config =
CrawlerRunConfig(cache_mode=CacheMode.BYPASS) async with
AsyncWebCrawler(verbose=True) as crawler: result = await
crawler.arun( url=\"https://www.nbcnews.com/business\",
config=config # Pass the configuration object )
print(len(result.markdown)) async def main(): await
use_proxy() if __name__ == \"__main__\": asyncio.run(main())
\nCommon Migration Patterns\nOld Flag New Mode
\nbypass_cache=True\tcache_mode=CacheMode.BYPASS\t
\ndisable_cache=True\tcache_mode=CacheMode.DISABLED\t
\nno_cache_read=True\tcache_mode=CacheMode.WRITE_ONLY\t
\nno_cache_write=True\tcache_mode=CacheMode.READ_ONLY",
"markdown": "# Cache Modes - Crawl4AI Documentation
(v0.5.x)\n\n## Crawl4AI Cache System and Migration Guide\n\n##
Overview\n\nStarting from version 0.5.0, Crawl4AI introduces a
new caching system that replaces the old boolean flags with a
more intuitive `CacheMode` enum. This change simplifies cache
control and makes the behavior more predictable.\n\n## Old vs
New Approach\n\n### Old Way (Deprecated)\n\nThe old system
used multiple boolean flags: - `bypass_cache`: Skip cache
entirely - `disable_cache`: Disable all caching -
`no_cache_read`: Don't read from cache - `no_cache_write`:
Don't write to cache\n\n### New Way (Recommended)\n\nThe new
system uses a single `CacheMode` enum: - `CacheMode.ENABLED`:
Normal caching (read/write) - `CacheMode.DISABLED`: No caching
at all - `CacheMode.READ_ONLY`: Only read from cache -
`CacheMode.WRITE_ONLY`: Only write to cache -
`CacheMode.BYPASS`: Skip cache for this operation\n\n##
Migration Example\n\n### Old Code (Deprecated)\n\n`import
asyncio from crawl4ai import AsyncWebCrawler async def
use_proxy(): async with AsyncWebCrawler(verbose=True) as
crawler: result = await crawler.arun( url=
\"https://www.nbcnews.com/business\",
bypass_cache=True # Old way )
print(len(result.markdown)) async def main(): await
use_proxy() if __name__ == \"__main__\":
asyncio.run(main())`\n\n### New Code (Recommended)\n\n`import
asyncio from crawl4ai import AsyncWebCrawler, CacheMode from
crawl4ai.async_configs import CrawlerRunConfig async def
use_proxy(): # Use CacheMode in CrawlerRunConfig
config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun( url=
\"https://www.nbcnews.com/business\",
config=config # Pass the configuration object )
print(len(result.markdown)) async def main(): await
use_proxy() if __name__ == \"__main__\":
asyncio.run(main())`\n\n## Common Migration Patterns\n\n| Old
Flag | New Mode |\n| --- | --- |\n| `bypass_cache=True` |
`cache_mode=CacheMode.BYPASS` |\n| `disable_cache=True` |
`cache_mode=CacheMode.DISABLED` |\n| `no_cache_read=True` |
`cache_mode=CacheMode.WRITE_ONLY` |\n| `no_cache_write=True` |
`cache_mode=CacheMode.READ_ONLY` |",
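To complement the migration table above, here is a short sketch (using the same imports as the example above) of the `no_cache_write=True` → `CacheMode.READ_ONLY` row in practice:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def read_only_example():
    # READ_ONLY replaces the old no_cache_write=True flag:
    # reuse cached results when present, but never write new cache entries.
    config = CrawlerRunConfig(cache_mode=CacheMode.READ_ONLY)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config,
        )
        print(len(result.markdown))

if __name__ == "__main__":
    asyncio.run(read_only_example())
```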
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/page-interaction/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/page-
interaction/",
"loadedTime": "2025-03-05T23:16:38.859Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/page-
interaction/",
"title": "Page Interaction - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:35 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"9bd4c78f67c941feff8b23b242e81fc9\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Page Interaction - Crawl4AI Documentation
(v0.5.x)\nCrawl4AI provides powerful features for interacting
with dynamic webpages, handling JavaScript execution, waiting
for conditions, and managing multi-step flows. By combining
js_code, wait_for, and certain CrawlerRunConfig parameters,
you can:\nClick “Load More” buttons \nFill forms and
submit them \nWait for elements or data to appear \nReuse
sessions across multiple steps \nBelow is a quick overview of
how to do it.\n1. JavaScript Execution\nBasic Execution
\njs_code in CrawlerRunConfig accepts either a single JS
string or a list of JS snippets.\nExample: We’ll scroll to
the bottom of the page, then optionally click a “Load More”
button.\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig async def main(): # Single
JS command config = CrawlerRunConfig( js_code=
\"window.scrollTo(0, document.body.scrollHeight);\" ) async
with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com\", # Example
site config=config ) print(\"Crawled length:\",
len(result.cleaned_html)) # Multiple commands js_commands =
[ \"window.scrollTo(0, document.body.scrollHeight);\", #
'More' link on Hacker News
\"document.querySelector('a.morelink')?.click();\", ] config =
CrawlerRunConfig(js_code=js_commands) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com\", # Another
pass config=config ) print(\"After scroll+click, length:\",
len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main()) \nRelevant CrawlerRunConfig params: -
js_code: A string or list of strings with JavaScript to run
after the page loads. - js_only: If set to True on subsequent
calls, indicates we’re continuing an existing session
without a new full navigation.\n- session_id: If you want to
keep the same page across multiple calls, specify an ID.\n2.
Wait Conditions\n2.1 CSS-Based Waiting\nSometimes, you just
want to wait for a specific element to appear. For example:
\nimport asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig async def main(): config =
CrawlerRunConfig( # Wait for at least 30 items on Hacker News
wait_for=\"css:.athing:nth-child(30)\" ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com\",
config=config ) print(\"We have at least 30 items loaded!\") #
Rough check print(\"Total items in HTML:\",
result.cleaned_html.count(\"athing\")) if __name__ ==
\"__main__\": asyncio.run(main()) \nKey param: - wait_for=
\"css:...\": Tells the crawler to wait until that CSS selector
is present.\n2.2 JavaScript-Based Waiting\nFor more complex
conditions (e.g., waiting for content length to exceed a
threshold), prefix js::\nwait_condition = \"\"\"() => { const
items = document.querySelectorAll('.athing'); return
items.length > 50; // Wait for at least 51 items }\"\"\"
config = CrawlerRunConfig(wait_for=f\"js:{wait_condition}\")
\nBehind the Scenes: Crawl4AI keeps polling the JS function
until it returns true or a timeout occurs.\n3. Handling
Dynamic Content\nMany modern sites require multiple steps:
scrolling, clicking “Load More,” or updating via
JavaScript. Below are typical patterns.\n3.1 Load More Example
(Hacker News “More” Link)\nimport asyncio from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig async def main(): #
Step 1: Load initial Hacker News page config =
CrawlerRunConfig( wait_for=\"css:.athing:nth-child(30)\" #
Wait for 30 items ) async with AsyncWebCrawler() as crawler:
result = await crawler.arun( url=
\"https://news.ycombinator.com\", config=config )
print(\"Initial items loaded.\") # Step 2: Let's scroll and
click the \"More\" link load_more_js = [ \"window.scrollTo(0,
document.body.scrollHeight);\", # The \"More\" link at page
bottom \"document.querySelector('a.morelink')?.click();\" ]
next_page_conf = CrawlerRunConfig( js_code=load_more_js,
wait_for=\"\"\"js:() => { return
document.querySelectorAll('.athing').length > 30; }\"\"\", #
Mark that we do not re-navigate, but run JS in the same
session: js_only=True, session_id=\"hn_session\" ) # Re-use
the same crawler session result2 = await crawler.arun( url=
\"https://news.ycombinator.com\", # same URL but continuing
session config=next_page_conf ) total_items =
result2.cleaned_html.count(\"athing\") print(\"Items after
load-more:\", total_items) if __name__ == \"__main__\":
asyncio.run(main()) \nKey params: - session_id=\"hn_session\":
Keep the same page across multiple calls to arun(). -
js_only=True: We’re not performing a full reload, just
applying JS in the existing page. - wait_for with js:: Wait
for item count to grow beyond 30.\n3.2 Form Interaction\nIf
the site has a search or login form, you can fill fields and
submit them with js_code. For instance, if GitHub had a local
search form:\njs_form_interaction = \"\"\"
document.querySelector('#your-search').value = 'TypeScript
commits'; document.querySelector('form').submit(); \"\"\"
config = CrawlerRunConfig( js_code=js_form_interaction,
wait_for=\"css:.commit\" ) result = await crawler.arun(url=
\"https://github.com/search\", config=config) \nIn reality:
Replace IDs or classes with the real site’s form selectors.
\n4. Timing Control\n1. page_timeout (ms): Overall page load
or script execution time limit.\n2. delay_before_return_html
(seconds): Wait an extra moment before capturing the final
HTML.\n3. mean_delay & max_range: If you call arun_many() with
multiple URLs, these add a random pause between each request.
\nExample:\nconfig = CrawlerRunConfig( page_timeout=60000, #
60s limit delay_before_return_html=2.5 ) \n5. Multi-Step
Interaction Example\nBelow is a simplified script that does
multiple “Load More” clicks on GitHub’s TypeScript
commits page. It re-uses the same session to accumulate new
commits each time. The code includes the relevant
CrawlerRunConfig parameters you’d rely on.\nimport asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode async def multi_page_commits():
browser_cfg = BrowserConfig( headless=False, # Visible for
demonstration verbose=True ) session_id = \"github_ts_commits
\" base_wait = \"\"\"js:() => { const commits =
document.querySelectorAll('li.Box-sc-g0xbh4-0 h4'); return
commits.length > 0; }\"\"\" # Step 1: Load initial commits
config1 = CrawlerRunConfig( wait_for=base_wait,
session_id=session_id, cache_mode=CacheMode.BYPASS, # Not
using js_only yet since it's our first load ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result = await
crawler.arun( url=
\"https://github.com/microsoft/TypeScript/commits/main\",
config=config1 ) print(\"Initial commits loaded. Count:\",
result.cleaned_html.count(\"commit\")) # Step 2: For
subsequent pages, we run JS to click 'Next Page' if it exists
js_next_page = \"\"\" const selector = 'a[data-testid=
\"pagination-next-button\"]'; const button =
document.querySelector(selector); if (button) button.click();
\"\"\" # Wait until new commits appear wait_for_more =
\"\"\"js:() => { const commits =
document.querySelectorAll('li.Box-sc-g0xbh4-0 h4'); if (!
window.firstCommit && commits.length>0) { window.firstCommit =
commits[0].textContent; return false; } // If top commit
changes, we have new commits const topNow =
commits[0]?.textContent.trim(); return topNow && topNow !==
window.firstCommit; }\"\"\" for page in range(2): # let's do 2
more \"Next\" pages config_next =
CrawlerRunConfig( session_id=session_id, js_code=js_next_page,
wait_for=wait_for_more, js_only=True, # We're continuing from
the open tab cache_mode=CacheMode.BYPASS ) result2 = await
crawler.arun( url=
\"https://github.com/microsoft/TypeScript/commits/main\",
config=config_next ) print(f\"Page {page+2} commits count:\",
result2.cleaned_html.count(\"commit\")) # Optionally kill
session await
crawler.crawler_strategy.kill_session(session_id) async def
main(): await multi_page_commits() if __name__ == \"__main__
\": asyncio.run(main()) \nKey Points:\nsession_id: Keep the
same page open. \njs_code + wait_for + js_only=True: We do
partial refreshes, waiting for new commits to appear.
\ncache_mode=CacheMode.BYPASS ensures we always see fresh data
each step.\nOnce dynamic content is loaded, you can attach an
extraction_strategy (like JsonCssExtractionStrategy or
LLMExtractionStrategy). For example:\nfrom
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = { \"name\": \"Commits\", \"baseSelector\": \"li.Box-
sc-g0xbh4-0\", \"fields\": [ {\"name\": \"title\", \"selector
\": \"h4.markdown-title\", \"type\": \"text\"} ] } config =
CrawlerRunConfig( session_id=\"ts_commits_session\",
js_code=js_next_page, wait_for=wait_for_more,
extraction_strategy=JsonCssExtractionStrategy(schema) ) \nWhen
done, check result.extracted_content for the JSON.\n7.
Relevant CrawlerRunConfig Parameters\nBelow are the key
interaction-related parameters in CrawlerRunConfig. For a full
list, see Configuration Parameters.\njs_code: JavaScript to
run after initial load. \njs_only: If True, no new page
navigation—only JS in the existing session. \nwait_for: CSS
(\"css:...\") or JS (\"js:...\") expression to wait for.
\nsession_id: Reuse the same page across calls. \ncache_mode:
Whether to read/write from the cache or bypass.
\nremove_overlay_elements: Remove certain popups
automatically. \nsimulate_user, override_navigator, magic:
Anti-bot or “human-like” interactions.\n8. Conclusion
\nCrawl4AI’s page interaction features let you:\n1. Execute
JavaScript for scrolling, clicks, or form filling.\n2. Wait
for CSS or custom JS conditions before capturing data.\n3.
Handle multi-step flows (like “Load More”) with partial
reloads or persistent sessions.\n4. Combine with structured
extraction for dynamic sites.\nWith these tools, you can
scrape modern, interactive webpages confidently. For advanced
hooking, user simulation, or in-depth config, check the API
reference or related advanced docs. Happy scripting!",
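The timing section above mentions `mean_delay` and `max_range` for `arun_many()` without showing them in code. Below is a hedged sketch, assuming both are accepted by `CrawlerRunConfig` (as the list above implies) and that `arun_many()` takes a list of URLs plus a config and returns a list of results:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_many_politely():
    # Hedged sketch: timing parameters named in section 4; values are illustrative.
    config = CrawlerRunConfig(
        page_timeout=60000,  # 60s overall limit per page
        mean_delay=1.0,      # average pause (seconds) between requests
        max_range=0.5,       # random jitter added on top of mean_delay
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],
            config=config,
        )
        for r in results:
            print(r.url, len(r.cleaned_html))

if __name__ == "__main__":
    asyncio.run(crawl_many_politely())
```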
"markdown": "# Page Interaction - Crawl4AI Documentation
(v0.5.x)\n\nCrawl4AI provides powerful features for
interacting with **dynamic** webpages, handling JavaScript
execution, waiting for conditions, and managing multi-step
flows. By combining **js\\_code**, **wait\\_for**, and certain
**CrawlerRunConfig** parameters, you can:\n\n1. Click “Load
More” buttons\n2. Fill forms and submit them\n3. Wait for
elements or data to appear\n4. Reuse sessions across multiple
steps\n\nBelow is a quick overview of how to do it.\n\n* * *\n
\n## 1\\. JavaScript Execution\n\n### Basic Execution\n
\n**`js_code`** in **`CrawlerRunConfig`** accepts either a
single JS string or a list of JS snippets. \n**Example**: We’ll
scroll to the bottom of the page, then optionally click a
“Load More” button.\n\n`import asyncio from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig async def main():
# Single JS command config =
CrawlerRunConfig( js_code=\"window.scrollTo(0,
document.body.scrollHeight);\" ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com
\", # Example site config=config )
print(\"Crawled length:\", len(result.cleaned_html)) #
Multiple commands js_commands =
[ \"window.scrollTo(0, document.body.scrollHeight);\",
# 'More' link on Hacker News
\"document.querySelector('a.morelink')?.click();\", ]
config = CrawlerRunConfig(js_code=js_commands) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com
\", # Another pass config=config )
print(\"After scroll+click, length:\",
len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Relevant `CrawlerRunConfig`
params**: - **`js_code`**: A string or list of strings with
JavaScript to run after the page loads. - **`js_only`**: If
set to `True` on subsequent calls, indicates we’re
continuing an existing session without a new full navigation.
\n\\- **`session_id`**: If you want to keep the same page
across multiple calls, specify an ID.\n\n* * *\n\n## 2\\. Wait
Conditions\n\n### 2.1 CSS-Based Waiting\n\nSometimes, you just
want to wait for a specific element to appear. For example:\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig async def main(): config =
CrawlerRunConfig( # Wait for at least 30 items on
Hacker News wait_for=\"css:.athing:nth-
child(30)\" ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun( url=
\"https://news.ycombinator.com\",
config=config ) print(\"We have at least 30
items loaded!\") # Rough check print(\"Total
items in HTML:\", result.cleaned_html.count(\"athing\")) if
__name__ == \"__main__\": asyncio.run(main())`\n\n**Key
param**: - **`wait_for=\"css:...\"`**: Tells the crawler to
wait until that CSS selector is present.\n\n### 2.2
JavaScript-Based Waiting\n\nFor more complex conditions (e.g.,
waiting for content length to exceed a threshold), prefix `js:
`:\n\n`wait_condition = \"\"\"() => { const items =
document.querySelectorAll('.athing'); return
items.length > 50; // Wait for at least 51 items }\"\"\"
config = CrawlerRunConfig(wait_for=f\"js:{wait_condition}\")`
\n\n**Behind the Scenes**: Crawl4AI keeps polling the JS
function until it returns `true` or a timeout occurs.\n\n* * *
\n\n## 3\\. Handling Dynamic Content\n\nMany modern sites
require **multiple steps**: scrolling, clicking “Load More,”
or updating via JavaScript. Below are typical patterns.\n
\n### 3.1 Load More Example (Hacker News “More” Link)\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig async def main(): # Step 1: Load initial
Hacker News page config =
CrawlerRunConfig( wait_for=\"css:.athing:nth-
child(30)\" # Wait for 30 items ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com
\", config=config )
print(\"Initial items loaded.\") # Step 2: Let's
scroll and click the \"More\" link load_more_js =
[ \"window.scrollTo(0,
document.body.scrollHeight);\", # The \"More\"
link at page bottom
\"document.querySelector('a.morelink')?.click();\" ]
next_page_conf =
CrawlerRunConfig( js_code=load_more_js,
wait_for=\"\"\"js:() => { return
document.querySelectorAll('.athing').length >
30; }\"\"\", # Mark that we do not re-
navigate, but run JS in the same session:
js_only=True, session_id=\"hn_session\" )
# Re-use the same crawler session result2 = await
crawler.arun( url=\"https://news.ycombinator.com
\", # same URL but continuing session
config=next_page_conf ) total_items =
result2.cleaned_html.count(\"athing\") print(\"Items
after load-more:\", total_items) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Key params**: - **`session_id=
\"hn_session\"`**: Keep the same page across multiple calls to
`arun()`. - **`js_only=True`**: We’re not performing a full
reload, just applying JS in the existing page. -
**`wait_for`** with `js:`: Wait for item count to grow beyond
30.\n\n* * *\n\n### 3.2 Form Interaction\n\nIf the site has a
search or login form, you can fill fields and submit them with
**`js_code`**. For instance, if GitHub had a local search
form:\n\n`js_form_interaction = \"\"\"
document.querySelector('#your-search').value = 'TypeScript
commits'; document.querySelector('form').submit(); \"\"\"
config = CrawlerRunConfig( js_code=js_form_interaction,
wait_for=\"css:.commit\" ) result = await crawler.arun(url=
\"https://github.com/search\", config=config)`\n\n**In
reality**: Replace IDs or classes with the real site’s form
selectors.\n\n* * *\n\n## 4\\. Timing Control\n\n1.
**`page_timeout`** (ms): Overall page load or script
execution time limit. \n2. **`delay_before_return_html`**
(seconds): Wait an extra moment before capturing the final
HTML. \n3. **`mean_delay`** & **`max_range`**: If you call
`arun_many()` with multiple URLs, these add a random pause
between each request.\n\n**Example**:\n\n`config =
CrawlerRunConfig( page_timeout=60000, # 60s limit
delay_before_return_html=2.5 )`\n\n* * *\n\n## 5\\. Multi-Step
Interaction Example\n\nBelow is a simplified script that does
multiple “Load More” clicks on GitHub’s TypeScript
commits page. It **re-uses** the same session to accumulate
new commits each time. The code includes the relevant
**`CrawlerRunConfig`** parameters you’d rely on.\n\n`import
asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode async def multi_page_commits():
browser_cfg = BrowserConfig( headless=False, #
Visible for demonstration verbose=True )
session_id = \"github_ts_commits\" base_wait =
\"\"\"js:() => { const commits =
document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
return commits.length > 0; }\"\"\" # Step 1: Load
initial commits config1 =
CrawlerRunConfig( wait_for=base_wait,
session_id=session_id, cache_mode=CacheMode.BYPASS,
# Not using js_only yet since it's our first load )
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun( url=
\"https://github.com/microsoft/TypeScript/commits/main\",
config=config1 ) print(\"Initial commits
loaded. Count:\", result.cleaned_html.count(\"commit\"))
# Step 2: For subsequent pages, we run JS to click 'Next Page'
if it exists js_next_page = \"\"\" const
selector = 'a[data-testid=\"pagination-next-button\"]';
const button = document.querySelector(selector); if
(button) button.click(); \"\"\" # Wait until
new commits appear wait_for_more = \"\"\"js:() =>
{ const commits =
document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
if (!window.firstCommit && commits.length>0)
{ window.firstCommit = commits[0].textContent;
return false; } // If top commit
changes, we have new commits const topNow =
commits[0]?.textContent.trim(); return topNow &&
topNow !== window.firstCommit; }\"\"\" for
page in range(2): # let's do 2 more \"Next\" pages
config_next =
CrawlerRunConfig( session_id=session_id,
js_code=js_next_page, wait_for=wait_for_more,
js_only=True, # We're continuing from the open tab
cache_mode=CacheMode.BYPASS ) result2
= await crawler.arun( url=
\"https://github.com/microsoft/TypeScript/commits/main\",
config=config_next ) print(f\"Page
{page+2} commits count:\", result2.cleaned_html.count(\"commit
\")) # Optionally kill session await
crawler.crawler_strategy.kill_session(session_id) async def
main(): await multi_page_commits() if __name__ ==
\"__main__\": asyncio.run(main())`\n\n**Key Points**:\n\n*
**`session_id`**: Keep the same page open.\n* **`js_code`**
+ **`wait_for`** + **`js_only=True`**: We do partial
refreshes, waiting for new commits to appear.\n*
**`cache_mode=CacheMode.BYPASS`** ensures we always see fresh
data each step.\n\n* * *\n\nOnce dynamic content is loaded,
you can attach an **`extraction_strategy`** (like
`JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For
example:\n\n`from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy schema = { \"name\": \"Commits
\", \"baseSelector\": \"li.Box-sc-g0xbh4-0\", \"fields
\": [ {\"name\": \"title\", \"selector\":
\"h4.markdown-title\", \"type\": \"text\"} ] } config =
CrawlerRunConfig( session_id=\"ts_commits_session\",
js_code=js_next_page, wait_for=wait_for_more,
extraction_strategy=JsonCssExtractionStrategy(schema) )`\n
\nWhen done, check `result.extracted_content` for the JSON.\n
\n* * *\n\n## 7\\. Relevant `CrawlerRunConfig` Parameters\n
\nBelow are the key interaction-related parameters in
`CrawlerRunConfig`. For a full list, see [Configuration
Parameters](https://crawl4ai.com/mkdocs/api/parameters/).\n\n*
**`js_code`**: JavaScript to run after initial load.\n*
**`js_only`**: If `True`, no new page navigation—only JS in
the existing session.\n* **`wait_for`**: CSS (`\"css:...\"`)
or JS (`\"js:...\"`) expression to wait for.\n*
**`session_id`**: Reuse the same page across calls.\n*
**`cache_mode`**: Whether to read/write from the cache or
bypass.\n* **`remove_overlay_elements`**: Remove certain
popups automatically.\n* **`simulate_user`,
`override_navigator`, `magic`**: Anti-bot or “human-like”
interactions.\n\n* * *\n\n## 8\\. Conclusion\n\nCrawl4AI’s
**page interaction** features let you:\n\n1. **Execute
JavaScript** for scrolling, clicks, or form filling. \n2.
**Wait** for CSS or custom JS conditions before capturing
data. \n3. **Handle** multi-step flows (like “Load More”)
with partial reloads or persistent sessions. \n4\\.
Combine with **structured extraction** for dynamic sites.\n
\nWith these tools, you can scrape modern, interactive
webpages confidently. For advanced hooking, user simulation,
or in-depth config, check the [API reference]
(https://crawl4ai.com/mkdocs/api/parameters/) or related
advanced docs. Happy scripting!",
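Section 7 above lists the anti-bot parameters without an example. Here is a minimal, hedged sketch; the parameter names come from that list, and treating them all as boolean flags on `CrawlerRunConfig` is an assumption:

```python
from crawl4ai import CrawlerRunConfig

# Hedged sketch: flags named in section 7, assumed to be booleans on CrawlerRunConfig.
stealth_config = CrawlerRunConfig(
    remove_overlay_elements=True,  # drop popups/overlays before capturing HTML
    simulate_user=True,            # emulate human-like interaction patterns
    override_navigator=True,       # patch navigator properties to look less bot-like
    magic=True,                    # enable the combined "human-like" heuristics
)
```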
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/content-
selection/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/content-
selection/",
"loadedTime": "2025-03-05T23:16:40.640Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/content-
selection/",
"title": "Content Selection - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:37 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"e7a4be92f4f3b87d0b9a84f769d451af\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Content Selection - Crawl4AI Documentation
(v0.5.x)\nCrawl4AI provides multiple ways to select, filter,
and refine the content from your crawls. Whether you need to
target a specific CSS region, exclude entire tags, filter out
external links, or remove certain domains and images,
CrawlerRunConfig offers a wide range of parameters.\nBelow, we
show how to configure these parameters and combine them for
precise control.\n1. CSS-Based Selection\nA straightforward
way to limit your crawl results to a certain region of the
page is css_selector in CrawlerRunConfig:\nimport asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig async def
main(): config = CrawlerRunConfig( # e.g., first 30 items from
Hacker News css_selector=\".athing:nth-child(-n+30)\" ) async
with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com/newest\",
config=config ) print(\"Partial HTML length:\",
len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main()) \nResult: Only elements matching that
selector remain in result.cleaned_html.\n2. Content Filtering
& Exclusions\n2.1 Basic Overview\nconfig = CrawlerRunConfig( #
Content thresholds word_count_threshold=10, # Minimum words
per block # Tag exclusions excluded_tags=['form', 'header',
'footer', 'nav'], # Link filtering
exclude_external_links=True, exclude_social_media_links=True,
# Block entire domains exclude_domains=[\"adtrackers.com\",
\"spammynews.org\"],
exclude_social_media_domains=[\"facebook.com\", \"twitter.com
\"], # Media filtering exclude_external_images=True )
\nExplanation:\nword_count_threshold: Ignores text blocks
under X words. Helps skip trivial blocks like short nav or
disclaimers. \nexcluded_tags: Removes entire tags (<form>,
<header>, <footer>, etc.). \nLink Filtering:
\nexclude_external_links: Strips out external links and may
remove them from result.links. \nexclude_social_media_links:
Removes links pointing to known social media domains.
\nexclude_domains: A custom list of domains to block if
discovered in links. \nexclude_social_media_domains: A curated
list (override or add to it) for social media sites. \nMedia
Filtering: \nexclude_external_images: Discards images not
hosted on the same domain as the main page (or its
subdomains).\nBy default, if you set
exclude_social_media_links=True, the following social media
domains are excluded: \n[ 'facebook.com', 'twitter.com',
'x.com', 'linkedin.com', 'instagram.com', 'pinterest.com',
'tiktok.com', 'snapchat.com', 'reddit.com', ] \n2.2 Example
Usage\nimport asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode async def main(): config =
CrawlerRunConfig( css_selector=\"main.content\",
word_count_threshold=10, excluded_tags=[\"nav\", \"footer\"],
exclude_external_links=True, exclude_social_media_links=True,
exclude_domains=[\"ads.com\", \"spammytrackers.net\"],
exclude_external_images=True, cache_mode=CacheMode.BYPASS )
async with AsyncWebCrawler() as crawler: result = await
crawler.arun(url=\"https://news.ycombinator.com\",
config=config) print(\"Cleaned HTML length:\",
len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main()) \nNote: If these parameters remove too
much, reduce or disable them accordingly.\n3. Handling Iframes
\nSome sites embed content in <iframe> tags. If you want that
inline: \nconfig = CrawlerRunConfig( # Merge iframe content
into the final output process_iframes=True,
remove_overlay_elements=True ) \nUsage: \nimport asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig async def
main(): config = CrawlerRunConfig( process_iframes=True,
remove_overlay_elements=True ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun( url=
\"https://example.org/iframe-demo\", config=config )
print(\"Iframe-merged length:\", len(result.cleaned_html)) if
__name__ == \"__main__\": asyncio.run(main()) \nYou can
combine content selection with a more advanced extraction
strategy. For instance, a CSS-based or LLM-based extraction
strategy can run on the filtered HTML.\nimport asyncio import
json from crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
CacheMode from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy async def main(): # Minimal schema
for repeated items schema = { \"name\": \"News Items\",
\"baseSelector\": \"tr.athing\", \"fields\": [ {\"name\":
\"title\", \"selector\": \"span.titleline a\", \"type\":
\"text\"}, { \"name\": \"link\", \"selector\":
\"span.titleline a\", \"type\": \"attribute\", \"attribute\":
\"href\" } ] } config = CrawlerRunConfig( # Content filtering
excluded_tags=[\"form\", \"header\"],
exclude_domains=[\"adsite.com\"], # CSS selection or entire
page css_selector=\"table.itemlist\", # No caching for
demonstration cache_mode=CacheMode.BYPASS, # Extraction
strategy
extraction_strategy=JsonCssExtractionStrategy(schema) ) async
with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com/newest\",
config=config ) data = json.loads(result.extracted_content)
print(\"Sample extracted item:\", data[:1]) # Show first item
if __name__ == \"__main__\": asyncio.run(main()) \nimport
asyncio import json from pydantic import BaseModel, Field from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class ArticleData(BaseModel): headline: str summary: str async
def main(): llm_strategy = LLMExtractionStrategy( llmConfig =
LlmConfig(provider=\"openai/gpt-4\",api_token=\"sk-
YOUR_API_KEY\") schema=ArticleData.schema(), extraction_type=
\"schema\", instruction=\"Extract 'headline' and a short
'summary' from the content.\" ) config =
CrawlerRunConfig( exclude_external_links=True,
word_count_threshold=20, extraction_strategy=llm_strategy )
async with AsyncWebCrawler() as crawler: result = await
crawler.arun(url=\"https://news.ycombinator.com\",
config=config) article = json.loads(result.extracted_content)
print(article) if __name__ == \"__main__\":
asyncio.run(main()) \nHere, the crawler:\nFilters out external
links (exclude_external_links=True). \nIgnores very short text
blocks (word_count_threshold=20). \nPasses the final HTML to
your LLM strategy for an AI-driven parse.\n5. Comprehensive
Example\nBelow is a short function that unifies CSS selection,
exclusion logic, and a pattern-based extraction, demonstrating
how you can fine-tune your final data:\nimport asyncio import
json from crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
CacheMode from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy async def extract_main_articles(url:
str): schema = { \"name\": \"ArticleBlock\", \"baseSelector\":
\"div.article-block\", \"fields\": [ {\"name\": \"headline\",
\"selector\": \"h2\", \"type\": \"text\"}, {\"name\":
\"summary\", \"selector\": \".summary\", \"type\": \"text\"},
{ \"name\": \"metadata\", \"type\": \"nested\", \"fields\":
[ {\"name\": \"author\", \"selector\": \".author\", \"type\":
\"text\"}, {\"name\": \"date\", \"selector\": \".date\",
\"type\": \"text\"} ] } ] } config = CrawlerRunConfig( # Keep
only #main-content css_selector=\"#main-content\", # Filtering
word_count_threshold=10, excluded_tags=[\"nav\", \"footer\"],
exclude_external_links=True,
exclude_domains=[\"somebadsite.com\"],
exclude_external_images=True, # Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
cache_mode=CacheMode.BYPASS ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun(url=url, config=config)
if not result.success: print(f\"Error:
{result.error_message}\") return None return
json.loads(result.extracted_content) async def main():
articles = await
extract_main_articles(\"https://news.ycombinator.com/newest\")
if articles: print(\"Extracted Articles:\", articles[:2]) #
Show first 2 if __name__ == \"__main__\": asyncio.run(main())
\nWhy This Works: - CSS scoping with #main-content.\n-
Multiple exclude_ parameters to remove domains, external
images, etc.\n- A JsonCssExtractionStrategy to parse repeated
article blocks.\n6. Scraping Modes\nCrawl4AI provides two
different scraping strategies for HTML content processing:
WebScrapingStrategy (BeautifulSoup-based, default) and
LXMLWebScrapingStrategy (LXML-based). The LXML strategy offers
significantly better performance, especially for large HTML
documents.\nfrom crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, LXMLWebScrapingStrategy async def main():
config =
CrawlerRunConfig( scraping_strategy=LXMLWebScrapingStrategy()
# Faster alternative to default BeautifulSoup ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://example.com\", config=config )
\nYou can also create your own custom scraping strategy by
inheriting from ContentScrapingStrategy. The strategy must
return a ScrapingResult object with the following structure:
\nfrom crawl4ai import ContentScrapingStrategy,
ScrapingResult, MediaItem, Media, Link, Links class
CustomScrapingStrategy(ContentScrapingStrategy): def
scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# Implement your custom scraping logic here return
ScrapingResult( cleaned_html=\"<html>...</html>\", # Cleaned
HTML content success=True, # Whether scraping was successful
media=Media( images=[ # List of images found MediaItem( src=
\"https://example.com/image.jpg\", alt=\"Image description\",
desc=\"Surrounding text\", score=1, type=\"image\", group_id=
1, format=\"jpg\", width=800 ) ], videos=[], # List of videos
(same structure as images) audios=[] # List of audio files
(same structure as images) ), links=Links( internal=[ # List
of internal links Link( href=\"https://example.com/page\",
text=\"Link text\", title=\"Link title\", base_domain=
\"example.com\" ) ], external=[] # List of external links
(same structure) ), metadata={ # Additional metadata \"title
\": \"Page Title\", \"description\": \"Page description\" } )
async def ascrap(self, url: str, html: str, **kwargs) ->
ScrapingResult: # For simple cases, you can use the sync
version return await asyncio.to_thread(self.scrap, url, html,
**kwargs) \nPerformance Considerations\nThe LXML strategy can
be up to 10-20x faster than the BeautifulSoup strategy,
particularly when processing large HTML documents. However,
please note:\nLXML strategy is currently experimental\nIn some
edge cases, the parsing results might differ slightly from
BeautifulSoup\nIf you encounter any inconsistencies between
LXML and BeautifulSoup results, please raise an issue with a
reproducible example\nChoose LXML strategy when: - Processing
large HTML documents (recommended for >100KB) - Performance is
critical - Working with well-formed HTML\nStick to
BeautifulSoup strategy (default) when: - Maximum compatibility
is needed - Working with malformed HTML - Exact parsing
behavior is critical\n7. Conclusion\nBy mixing css_selector
scoping, content filtering parameters, and advanced extraction
strategies, you can precisely choose which data to keep. Key
parameters in CrawlerRunConfig for content selection include:
\n1. css_selector – Basic scoping to an element or region.
\n2. word_count_threshold – Skip short blocks.\n3.
excluded_tags – Remove entire HTML tags.\n4.
exclude_external_links, exclude_social_media_links,
exclude_domains – Filter out unwanted links or domains.\n5.
exclude_external_images – Remove images from external
sources.\n6. process_iframes – Merge iframe content if
needed. \nCombine these with structured extraction (CSS, LLM-
based, or others) to build powerful crawls that yield exactly
the content you want, from raw or cleaned HTML up to
sophisticated JSON structures. For more detail, see
Configuration Reference. Enjoy curating your data to the
max!",
"markdown": "# Content Selection - Crawl4AI Documentation
(v0.5.x)\n\nCrawl4AI provides multiple ways to **select**,
**filter**, and **refine** the content from your crawls.
Whether you need to target a specific CSS region, exclude
entire tags, filter out external links, or remove certain
domains and images, **`CrawlerRunConfig`** offers a wide range
of parameters.\n\nBelow, we show how to configure these
parameters and combine them for precise control.\n\n* * *\n
\n## 1\\. CSS-Based Selection\n\nA straightforward way to
**limit** your crawl results to a certain region of the page
is **`css_selector`** in **`CrawlerRunConfig`**:\n\n`import
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main(): config = CrawlerRunConfig( #
e.g., first 30 items from Hacker News css_selector=
\".athing:nth-child(-n+30)\" ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=
\"https://news.ycombinator.com/newest\",
config=config ) print(\"Partial HTML length:
\", len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Result**: Only elements matching
that selector remain in `result.cleaned_html`.\n\n* * *\n\n##
2\\. Content Filtering & Exclusions\n\n### 2.1 Basic Overview
\n\n`config = CrawlerRunConfig( # Content thresholds
word_count_threshold=10, # Minimum words per block
# Tag exclusions excluded_tags=['form', 'header',
'footer', 'nav'], # Link filtering
exclude_external_links=True,
exclude_social_media_links=True, # Block entire domains
exclude_domains=[\"adtrackers.com\", \"spammynews.org\"],
exclude_social_media_domains=[\"facebook.com\", \"twitter.com
\"], # Media filtering
exclude_external_images=True )`\n\n**Explanation**:\n\n*
**`word_count_threshold`**: Ignores text blocks under X words.
Helps skip trivial blocks like short nav or disclaimers.\n*
**`excluded_tags`**: Removes entire tags (`<form>`,
`<header>`, `<footer>`, etc.).\n* **Link Filtering**:\n*
`exclude_external_links`: Strips out external links and may
remove them from `result.links`.\n*
`exclude_social_media_links`: Removes links pointing to known
social media domains.\n* `exclude_domains`: A custom list of
domains to block if discovered in links.\n*
`exclude_social_media_domains`: A curated list (override or
add to it) for social media sites.\n* **Media Filtering**:
\n* `exclude_external_images`: Discards images not hosted on
the same domain as the main page (or its subdomains).\n\nBy
default, if you set `exclude_social_media_links=True`, the
following social media domains are excluded:\n
\n`[ 'facebook.com', 'twitter.com', 'x.com',
'linkedin.com', 'instagram.com', 'pinterest.com',
'tiktok.com', 'snapchat.com', 'reddit.com', ]`\n\n###
2.2 Example Usage\n\n`import asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def
main(): config = CrawlerRunConfig( css_selector=
\"main.content\", word_count_threshold=10,
excluded_tags=[\"nav\", \"footer\"],
exclude_external_links=True,
exclude_social_media_links=True,
exclude_domains=[\"ads.com\", \"spammytrackers.net\"],
exclude_external_images=True,
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=\"https://news.ycombinator.com\",
config=config) print(\"Cleaned HTML length:\",
len(result.cleaned_html)) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Note**: If these parameters remove
too much, reduce or disable them accordingly.\n\n* * *\n\n## 3
\\. Handling Iframes\n\nSome sites embed content in `<iframe>`
tags. If you want that inline:\n\n`config =
CrawlerRunConfig( # Merge iframe content into the final
output process_iframes=True,
remove_overlay_elements=True )`\n\n**Usage**:\n\n`import
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async def main(): config =
CrawlerRunConfig( process_iframes=True,
remove_overlay_elements=True ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://example.org/iframe-
demo\", config=config )
print(\"Iframe-merged length:\", len(result.cleaned_html)) if
__name__ == \"__main__\": asyncio.run(main())`\n\n* * *\n
\nYou can combine content selection with a more advanced
extraction strategy. For instance, a **CSS-based** or **LLM-
based** extraction strategy can run on the filtered HTML.\n
\n`import asyncio import json from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main(): # Minimal schema for repeated items
schema = { \"name\": \"News Items\",
\"baseSelector\": \"tr.athing\", \"fields\":
[ {\"name\": \"title\", \"selector\":
\"span.titleline a\", \"type\": \"text\"},
{ \"name\": \"link\",
\"selector\": \"span.titleline a\", \"type\":
\"attribute\", \"attribute\": \"href
\" } ] } config =
CrawlerRunConfig( # Content filtering
excluded_tags=[\"form\", \"header\"],
exclude_domains=[\"adsite.com\"], # CSS selection or
entire page css_selector=\"table.itemlist\",
# No caching for demonstration
cache_mode=CacheMode.BYPASS, # Extraction strategy
extraction_strategy=JsonCssExtractionStrategy(schema) )
async with AsyncWebCrawler() as crawler: result =
await crawler.arun( url=
\"https://news.ycombinator.com/newest\",
config=config ) data =
json.loads(result.extracted_content) print(\"Sample
extracted item:\", data[:1]) # Show first item if __name__
== \"__main__\": asyncio.run(main())`\n\n`import asyncio
import json from pydantic import BaseModel, Field from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class ArticleData(BaseModel): headline: str summary:
str async def main(): llm_strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"openai/gpt-4\",api_token=\"sk-YOUR_API_KEY\")
schema=ArticleData.schema(), extraction_type=\"schema
\", instruction=\"Extract 'headline' and a short
'summary' from the content.\" ) config =
CrawlerRunConfig( exclude_external_links=True,
word_count_threshold=20,
extraction_strategy=llm_strategy ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=\"https://news.ycombinator.com\",
config=config) article =
json.loads(result.extracted_content) print(article)
if __name__ == \"__main__\": asyncio.run(main())`\n\nHere,
the crawler:\n\n* Filters out external links
(`exclude_external_links=True`).\n* Ignores very short text
blocks (`word_count_threshold=20`).\n* Passes the final HTML
to your LLM strategy for an AI-driven parse.\n\n* * *\n\n## 5
\\. Comprehensive Example\n\nBelow is a short function that
unifies **CSS selection**, **exclusion** logic, and a pattern-
based extraction, demonstrating how you can fine-tune your
final data:\n\n`import asyncio import json from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def extract_main_articles(url: str): schema =
{ \"name\": \"ArticleBlock\", \"baseSelector
\": \"div.article-block\", \"fields\":
[ {\"name\": \"headline\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"summary\",
\"selector\": \".summary\", \"type\": \"text\"},
{ \"name\": \"metadata\",
\"type\": \"nested\", \"fields\":
[ {\"name\": \"author\", \"selector\":
\".author\", \"type\": \"text\"}, {\"name
\": \"date\", \"selector\": \".date\", \"type\": \"text
\"} ] } ] }
config = CrawlerRunConfig( # Keep only #main-content
css_selector=\"#main-content\", # Filtering
word_count_threshold=10, excluded_tags=[\"nav\",
\"footer\"], exclude_external_links=True,
exclude_domains=[\"somebadsite.com\"],
exclude_external_images=True, # Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=url, config=config) if not
result.success: print(f\"Error:
{result.error_message}\") return None
return json.loads(result.extracted_content) async def main():
articles = await
extract_main_articles(\"https://news.ycombinator.com/newest\")
if articles: print(\"Extracted Articles:\",
articles[:2]) # Show first 2 if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Why This Works**: - **CSS** scoping
with `#main-content`. \n\\- Multiple **exclude\\_**
parameters to remove domains, external images, etc. \n\\- A
**JsonCssExtractionStrategy** to parse repeated article
blocks.\n\n* * *\n\n## 6\\. Scraping Modes\n\nCrawl4AI
provides two different scraping strategies for HTML content
processing: `WebScrapingStrategy` (BeautifulSoup-based,
default) and `LXMLWebScrapingStrategy` (LXML-based). The LXML
strategy offers significantly better performance, especially
for large HTML documents.\n\n`from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy
async def main(): config =
CrawlerRunConfig( scraping_strategy=LXMLWebScrapingStr
ategy() # Faster alternative to default BeautifulSoup )
async with AsyncWebCrawler() as crawler: result =
await crawler.arun( url=\"https://example.com\",
config=config )`\n\nYou can also create your own
custom scraping strategy by inheriting from
`ContentScrapingStrategy`. The strategy must return a
`ScrapingResult` object with the following structure:\n\n`from
crawl4ai import ContentScrapingStrategy, ScrapingResult,
MediaItem, Media, Link, Links class
CustomScrapingStrategy(ContentScrapingStrategy): def
scrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# Implement your custom scraping logic here return
ScrapingResult( cleaned_html=\"<html>...</html>\",
# Cleaned HTML content success=True,
# Whether scraping was successful
media=Media( images=[ #
List of images found
MediaItem( src=
\"https://example.com/image.jpg\",
alt=\"Image description\", desc=
\"Surrounding text\", score=1,
type=\"image\", group_id=1,
format=\"jpg\", width=
800 ) ],
videos=[], # List of videos (same structure
as images) audios=[] #
List of audio files (same structure as images) ),
links=Links( internal=[ #
List of internal links
Link( href=\"https://example.com/page
\", text=\"Link text\",
title=\"Link title\", base_domain=
\"example.com\" ) ],
external=[] # List of external links (same
structure) ),
metadata={ # Additional metadata
\"title\": \"Page Title\", \"description\":
\"Page description\" } ) async def
ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult:
# For simple cases, you can use the sync version
return await asyncio.to_thread(self.scrap, url, html,
**kwargs)`\n\n### Performance Considerations\n\nThe LXML
strategy can be up to 10-20x faster than the BeautifulSoup
strategy, particularly when processing large HTML documents.
However, please note:\n\n1. LXML strategy is currently
experimental\n2. In some edge cases, the parsing results
might differ slightly from BeautifulSoup\n3. If you encounter
any inconsistencies between LXML and BeautifulSoup results,
please [raise an issue]
(https://github.com/codeium/crawl4ai/issues) with a
reproducible example\n\nChoose LXML strategy when: -
Processing large HTML documents (recommended for >100KB) -
Performance is critical - Working with well-formed HTML\n
\nStick to BeautifulSoup strategy (default) when: - Maximum
compatibility is needed - Working with malformed HTML - Exact
parsing behavior is critical\n\n* * *\n\n## 7\\. Conclusion\n
\nBy mixing **css\\_selector** scoping, **content filtering**
parameters, and advanced **extraction strategies**, you can
precisely **choose** which data to keep. Key parameters in
**`CrawlerRunConfig`** for content selection include:\n\n1.
**`css_selector`** – Basic scoping to an element or
region. \n2. **`word_count_threshold`** – Skip short
blocks. \n3. **`excluded_tags`** – Remove entire HTML
tags. \n4. **`exclude_external_links`**,
**`exclude_social_media_links`**, **`exclude_domains`** –
Filter out unwanted links or domains. \n5.
**`exclude_external_images`** – Remove images from
external sources. \n6. **`process_iframes`** – Merge
iframe content if needed.\n\nCombine these with structured
extraction (CSS, LLM-based, or others) to build powerful
crawls that yield exactly the content you want, from raw or
cleaned HTML up to sophisticated JSON structures. For more
detail, see [Configuration Reference]
(https://crawl4ai.com/mkdocs/api/parameters/). Enjoy curating
your data to the max!",
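As a small companion to section 2.1, here is a hedged sketch of extending the default social-media block list (assuming `exclude_social_media_domains` accepts a plain list of domain strings; `mastodon.social` is a hypothetical addition, not part of the default list):

```python
from crawl4ai import CrawlerRunConfig

# Hedged sketch: add to (or override) the default social-media domain list shown above.
config = CrawlerRunConfig(
    exclude_social_media_links=True,
    exclude_social_media_domains=[
        "facebook.com", "twitter.com", "x.com",  # entries from the default list above
        "mastodon.social",                        # hypothetical extra domain
    ],
)
```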
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/fit-markdown/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/fit-
markdown/",
"loadedTime": "2025-03-05T23:16:41.143Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/fit-
markdown/",
"title": "Fit Markdown - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:38 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"c51065420eec3395c90aa5cb5a57bd96\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Fit Markdown - Crawl4AI Documentation (v0.5.x)\nFit
Markdown with Pruning & BM25\nFit Markdown is a specialized
filtered version of your page’s markdown, focusing on the
most relevant content. By default, Crawl4AI converts the
entire HTML into a broad raw_markdown. With fit markdown, we
apply a content filter algorithm (e.g., Pruning or BM25) to
remove or rank low-value sections—such as repetitive
sidebars, shallow text blocks, or irrelevancies—leaving a
concise textual “core.” \n1. How “Fit Markdown” Works
\n1.1 The content_filter\nIn CrawlerRunConfig, you can specify
a content_filter to shape how content is pruned or ranked
before final markdown generation. A filter’s logic is
applied before or during the HTML→Markdown process,
producing:\nresult.markdown.raw_markdown
(unfiltered)\nresult.markdown.fit_markdown (filtered or
“fit” version)\nresult.markdown.fit_html (the corresponding
HTML snippet that produced fit_markdown)\n1.2 Common Filters
\n1. PruningContentFilter – Scores each node by text
density, link density, and tag importance, discarding those
below a threshold.\n2. BM25ContentFilter – Focuses on
textual relevance using BM25 ranking, especially useful if you
have a specific user query (e.g., “machine learning” or
“food nutrition”).\n2. PruningContentFilter\nPruning
discards less relevant nodes based on text density, link
density, and tag importance. It’s a heuristic-based
approach—if certain sections appear too “thin” or too
“spammy,” they’re pruned.\n2.1 Usage Example\nimport
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import
PruningContentFilter from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator async def main(): # Step 1: Create a
pruning filter prune_filter = PruningContentFilter( # Lower
→ more content retained, higher → more content pruned
threshold=0.45, # \"fixed\" or \"dynamic\" threshold_type=
\"dynamic\", # Ignore nodes with <5 words min_word_threshold=
5 ) # Step 2: Insert it into a Markdown Generator md_generator
= DefaultMarkdownGenerator(content_filter=prune_filter) # Step
3: Pass it to CrawlerRunConfig config =
CrawlerRunConfig( markdown_generator=md_generator ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com\",
config=config ) if result.success: # 'fit_markdown' is your
pruned content, focusing on \"denser\" text print(\"Raw
Markdown length:\", len(result.markdown.raw_markdown))
print(\"Fit Markdown length:\",
len(result.markdown.fit_markdown)) else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n2.2 Key Parameters\nmin_word_threshold
(int): If a block has fewer words than this, it’s pruned.
\nthreshold_type (str):\n\"fixed\" → each node must exceed
threshold (0–1). \n\"dynamic\" → node scoring adjusts
according to tag type, text/link density, etc. \nthreshold
(float, default ~0.48): The base or “anchor” cutoff.
\nAlgorithmic Factors:\nText density – Encourages blocks
that have a higher ratio of text to overall content. \nLink
density – Penalizes sections that are mostly links. \nTag
importance – e.g., an <article> or <p> might be more
important than a <div>. \nStructural context – If a node is
deeply nested or in a suspected sidebar, it might be
deprioritized.\n3. BM25ContentFilter\nBM25 is a classical text
ranking algorithm often used in search engines. If you have a
user query or rely on page metadata to derive a query, BM25
can identify which text chunks best match that query.\n3.1
Usage Example\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.content_filter_strategy import BM25ContentFilter from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator async def main(): # 1) A BM25 filter
with a user query bm25_filter = BM25ContentFilter( user_query=
\"startup fundraising tips\", # Adjust for stricter or looser
results bm25_threshold=1.2 ) # 2) Insert into a Markdown
Generator md_generator =
DefaultMarkdownGenerator(content_filter=bm25_filter) # 3) Pass
to crawler config config =
CrawlerRunConfig( markdown_generator=md_generator ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://news.ycombinator.com\",
config=config ) if result.success: print(\"Fit Markdown (BM25
query-based):\") print(result.markdown.fit_markdown) else:
print(\"Error:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \n3.2 Parameters\nuser_query
(str, optional): E.g. \"machine learning\". If blank, the
filter tries to glean a query from page metadata. \nbm25
_threshold (float, default 1.0): \nHigher → fewer chunks but
more relevant. \nLower → more inclusive. \nIn more advanced
scenarios, you might see parameters like use_stemming,
case_sensitive, or priority_tags to refine how text is
tokenized or weighted.\n4. Accessing the “Fit” Output
\nAfter the crawl, your “fit” content is found in
result.markdown.fit_markdown. \nfit_md =
result.markdown.fit_markdown fit_html =
result.markdown.fit_html \nIf the content filter is BM25, you
might see additional logic or references in fit_markdown that
highlight relevant segments. If it’s Pruning, the text is
typically well-cleaned but not necessarily matched to a query.
\n5. Code Patterns Recap\n5.1 Pruning\nprune_filter =
PruningContentFilter( threshold=0.5, threshold_type=\"fixed\",
min_word_threshold=10 ) md_generator =
DefaultMarkdownGenerator(content_filter=prune_filter) config =
CrawlerRunConfig(markdown_generator=md_generator) \n5.2 BM25
\nbm25_filter = BM25ContentFilter( user_query=\"health
benefits fruit\", bm25_threshold=1.2 ) md_generator =
DefaultMarkdownGenerator(content_filter=bm25_filter) config =
CrawlerRunConfig(markdown_generator=md_generator) \n6.
Combining with “word_count_threshold” & Exclusions
\nRemember you can also specify:\nconfig =
CrawlerRunConfig( word_count_threshold=10,
excluded_tags=[\"nav\", \"footer\", \"header\"],
exclude_external_links=True,
markdown_generator=DefaultMarkdownGenerator( content_filter=Pr
uningContentFilter(threshold=0.5) ) ) \nThus, multi-level
filtering occurs:\nThe crawler’s excluded_tags are removed
from the HTML first. \nThe content filter (Pruning, BM25, or
custom) prunes or ranks the remaining text blocks. \nThe final
“fit” content is generated in
result.markdown.fit_markdown.\n7. Custom Filters\nIf you need
a different approach (like a specialized ML model or site-
specific heuristics), you can create a new class inheriting
from RelevantContentFilter and implement filter_content(html).
Then inject it into your markdown generator:\nfrom
crawl4ai.content_filter_strategy import RelevantContentFilter
class MyCustomFilter(RelevantContentFilter): def
filter_content(self, html, min_word_threshold=None): # parse
HTML, implement custom logic return [block for block in ...
if ... some condition...] \nSteps:\nSubclass
RelevantContentFilter. \nImplement filter_content(...). \nUse
it in your
DefaultMarkdownGenerator(content_filter=MyCustomFilter(...)).
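For instance, a minimal custom filter might keep only paragraphs that mention certain keywords. This is a hedged sketch, not library code: KeywordFilter and its regex-based block extraction are hypothetical, and the RelevantContentFilter constructor signature may differ between versions:

import re
from crawl4ai.content_filter_strategy import RelevantContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

class KeywordFilter(RelevantContentFilter):
    """Hypothetical filter: keep only <p> blocks that mention a keyword."""

    def __init__(self, keywords):
        super().__init__()  # adjust if your version's base class needs arguments
        self.keywords = [k.lower() for k in keywords]

    def filter_content(self, html, min_word_threshold=None):
        # Naive paragraph extraction; a real filter would walk the DOM properly
        blocks = re.findall(r"<p[^>]*>.*?</p>", html or "", flags=re.S | re.I)
        return [b for b in blocks if any(k in b.lower() for k in self.keywords)]

md_generator = DefaultMarkdownGenerator(content_filter=KeywordFilter(["pricing", "features"]))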
\n8. Final Thoughts\nFit Markdown is a crucial feature for:
\nSummaries: Quickly get the important text from a cluttered
page. \nSearch: Combine with BM25 to produce content relevant
to a query. \nAI Pipelines: Filter out boilerplate so LLM-
based extraction or summarization runs on denser text.\nKey
Points: - PruningContentFilter: Great if you just want the
“meatiest” text without a user query.\n- BM25ContentFilter:
Perfect for query-based extraction or searching.\n- Combine
with excluded_tags, exclude_external_links,
word_count_threshold to refine your final “fit” text.\n-
Fit markdown ends up in result.markdown.fit_markdown.
\nWith these tools, you can zero in on the text that truly
matters, ignoring spammy or boilerplate content, and produce a
concise, relevant “fit markdown” for your AI or data
pipelines. Happy pruning and searching!\nLast Updated:
2025-01-01",
"markdown": "# Fit Markdown - Crawl4AI Documentation
(v0.5.x)\n\n## Fit Markdown with Pruning & BM25\n\n**Fit
Markdown** is a specialized **filtered** version of your page’s
markdown, focusing on the most relevant content. By
default, Crawl4AI converts the entire HTML into a broad **raw
\\_markdown**. With fit markdown, we apply a **content
filter** algorithm (e.g., **Pruning** or **BM25**) to remove
or rank low-value sections—such as repetitive sidebars,
shallow text blocks, or irrelevancies—leaving a concise
textual “core.” \n\n* * *\n\n## 1\\. How “Fit Markdown”
Works\n\n### 1.1 The `content_filter`\n\nIn
**`CrawlerRunConfig`**, you can specify a **`content_filter`**
to shape how content is pruned or ranked before final markdown
generation. A filter’s logic is applied **before** or
**during** the HTML→Markdown process, producing:\n\n*
**`result.markdown.raw_markdown`** (unfiltered)\n*
**`result.markdown.fit_markdown`** (filtered or “fit”
version)\n* **`result.markdown.fit_html`** (the
corresponding HTML snippet that produced `fit_markdown`)\n
\n### 1.2 Common Filters\n\n1. **PruningContentFilter**
– Scores each node by text density, link density, and tag
importance, discarding those below a threshold. \n2.
**BM25ContentFilter** – Focuses on textual relevance using
BM25 ranking, especially useful if you have a specific user
query (e.g., “machine learning” or “food nutrition”).
\n\n* * *\n\n## 2\\. PruningContentFilter\n\n**Pruning**
discards less relevant nodes based on **text density, link
density, and tag importance**. It’s a heuristic-based
approach—if certain sections appear too “thin” or too
“spammy,” they’re pruned.\n\n### 2.1 Usage Example\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig from crawl4ai.content_filter_strategy import
PruningContentFilter from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator async def main(): # Step 1:
Create a pruning filter prune_filter =
PruningContentFilter( # Lower → more content
retained, higher → more content pruned threshold=
0.45, # \"fixed\" or \"dynamic\"
threshold_type=\"dynamic\", # Ignore nodes with <5
words min_word_threshold=5 ) # Step 2:
Insert it into a Markdown Generator md_generator =
DefaultMarkdownGenerator(content_filter=prune_filter) #
Step 3: Pass it to CrawlerRunConfig config =
CrawlerRunConfig( markdown_generator=md_generator
) async with AsyncWebCrawler() as crawler: result
= await crawler.arun( url=
\"https://news.ycombinator.com\",
config=config ) if result.success:
# 'fit_markdown' is your pruned content, focusing on \"denser
\" text print(\"Raw Markdown length:\",
len(result.markdown.raw_markdown)) print(\"Fit
Markdown length:\", len(result.markdown.fit_markdown))
else: print(\"Error:\", result.error_message) if
__name__ == \"__main__\": asyncio.run(main())`\n\n### 2.2
Key Parameters\n\n* **`min_word_threshold`** (int): If a
block has fewer words than this, it’s pruned.\n*
**`threshold_type`** (str):\n* `\"fixed\"` → each node
must exceed `threshold` (0–1).\n* `\"dynamic\"` → node
scoring adjusts according to tag type, text/link density, etc.
\n* **`threshold`** (float, default ~0.48): The base or
“anchor” cutoff.\n\n**Algorithmic Factors**:\n\n* **Text
density** – Encourages blocks that have a higher ratio of
text to overall content.\n* **Link density** – Penalizes
sections that are mostly links.\n* **Tag importance**
– e.g., an `<article>` or `<p>` might be more important than
a `<div>`.\n* **Structural context** – If a node is deeply
nested or in a suspected sidebar, it might be deprioritized.\n
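As an informal illustration (not an official recipe), you could crawl the same page once per threshold mode and compare how much text survives; the URL and numbers below are arbitrary:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def compare_threshold_modes(url):
    async with AsyncWebCrawler() as crawler:
        for mode in ("fixed", "dynamic"):
            config = CrawlerRunConfig(
                markdown_generator=DefaultMarkdownGenerator(
                    content_filter=PruningContentFilter(
                        threshold=0.48,           # same cutoff for both runs
                        threshold_type=mode,
                        min_word_threshold=10,
                    )
                )
            )
            result = await crawler.arun(url=url, config=config)
            if result.success:
                print(f"{mode}: {len(result.markdown.fit_markdown)} chars kept")

if __name__ == "__main__":
    asyncio.run(compare_threshold_modes("https://news.ycombinator.com"))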
\n* * *\n\n## 3\\. BM25ContentFilter\n\n**BM25** is a
classical text ranking algorithm often used in search engines.
If you have a **user query** or rely on page metadata to
derive a query, BM25 can identify which text chunks best match
that query.\n\n### 3.1 Usage Example\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.content_filter_strategy import BM25ContentFilter from
crawl4ai.markdown_generation_strategy import
DefaultMarkdownGenerator async def main(): # 1) A BM25
filter with a user query bm25_filter =
BM25ContentFilter( user_query=\"startup fundraising
tips\", # Adjust for stricter or looser results
bm25_threshold=1.2 ) # 2) Insert into a Markdown
Generator md_generator =
DefaultMarkdownGenerator(content_filter=bm25_filter) # 3)
Pass to crawler config config =
CrawlerRunConfig( markdown_generator=md_generator
) async with AsyncWebCrawler() as crawler: result
= await crawler.arun( url=
\"https://news.ycombinator.com\",
config=config ) if result.success:
print(\"Fit Markdown (BM25 query-based):\")
print(result.markdown.fit_markdown) else:
print(\"Error:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n### 3.2 Parameters\n
\n* **`user_query`** (str, optional): E.g. `\"machine
learning\"`. If blank, the filter tries to glean a query from
page metadata.\n* **`bm25_threshold`** (float, default 1.0):
\n* Higher → fewer chunks but more relevant.\n* Lower
→ more inclusive.\n\n> In more advanced scenarios, you might
see parameters like `use_stemming`, `case_sensitive`, or
`priority_tags` to refine how text is tokenized or weighted.\n
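If you are unsure which `bm25_threshold` suits your pages, a quick sweep like the following sketch can help you pick one (the trial values and URL are arbitrary; only the classes documented above are reused):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def sweep_bm25(url, query):
    async with AsyncWebCrawler() as crawler:
        for threshold in (0.8, 1.0, 1.5):          # arbitrary trial values
            config = CrawlerRunConfig(
                markdown_generator=DefaultMarkdownGenerator(
                    content_filter=BM25ContentFilter(user_query=query, bm25_threshold=threshold)
                )
            )
            result = await crawler.arun(url=url, config=config)
            if result.success:
                print(f"bm25_threshold={threshold}: {len(result.markdown.fit_markdown)} chars kept")

if __name__ == "__main__":
    asyncio.run(sweep_bm25("https://news.ycombinator.com", "startup fundraising tips"))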
\n* * *\n\n## 4\\. Accessing the “Fit” Output\n\nAfter the
crawl, your “fit” content is found in
**`result.markdown.fit_markdown`**.\n\n`fit_md =
result.markdown.fit_markdown fit_html =
result.markdown.fit_html`\n\nIf the content filter is **BM25
**, you might see additional logic or references in
`fit_markdown` that highlight relevant segments. If it’s
**Pruning**, the text is typically well-cleaned but not
necessarily matched to a query.\n\n* * *\n\n## 5\\. Code
Patterns Recap\n\n### 5.1 Pruning\n\n`prune_filter =
PruningContentFilter( threshold=0.5, threshold_type=
\"fixed\", min_word_threshold=10 ) md_generator =
DefaultMarkdownGenerator(content_filter=prune_filter) config =
CrawlerRunConfig(markdown_generator=md_generator)`\n\n### 5.2
BM25\n\n`bm25_filter = BM25ContentFilter( user_query=
\"health benefits fruit\", bm25_threshold=1.2 )
md_generator = DefaultMarkdownGenerator(content_filter=bm25
_filter) config =
CrawlerRunConfig(markdown_generator=md_generator)`\n\n* * *\n
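A small helper (a sketch, not part of the library; `make_fit_config` is a hypothetical name) can tie the two recipes together by choosing BM25 when a query is available and falling back to pruning otherwise:

from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

def make_fit_config(user_query=None):
    # BM25 when there is a query to rank against; generic pruning otherwise
    if user_query:
        content_filter = BM25ContentFilter(user_query=user_query, bm25_threshold=1.2)
    else:
        content_filter = PruningContentFilter(
            threshold=0.5, threshold_type="fixed", min_word_threshold=10
        )
    return CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=content_filter)
    )

# config = make_fit_config("health benefits fruit")   # or make_fit_config() for pruning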
\n## 6\\. Combining with “word\\_count\\_threshold†&
Exclusions\n\nRemember you can also specify:\n\n`config =
CrawlerRunConfig( word_count_threshold=10,
excluded_tags=[\"nav\", \"footer\", \"header\"],
exclude_external_links=True,
markdown_generator=DefaultMarkdownGenerator( content_f
ilter=PruningContentFilter(threshold=0.5) ) )`\n\nThus,
**multi-level** filtering occurs:\n\n1. The crawler’s
`excluded_tags` are removed from the HTML first.\n2. The
content filter (Pruning, BM25, or custom) prunes or ranks the
remaining text blocks.\n3. The final “fit” content is
generated in `result.markdown.fit_markdown`.\n\n* * *\n\n## 7
\\. Custom Filters\n\nIf you need a different approach (like a
specialized ML model or site-specific heuristics), you can
create a new class inheriting from `RelevantContentFilter` and
implement `filter_content(html)`. Then inject it into your
**markdown generator**:\n\n`from
crawl4ai.content_filter_strategy import RelevantContentFilter
class MyCustomFilter(RelevantContentFilter): def
filter_content(self, html, min_word_threshold=None): #
parse HTML, implement custom logic return [block for
block in ... if ... some condition...]`\n\n**Steps**:\n\n1.
Subclass `RelevantContentFilter`.\n2. Implement
`filter_content(...)`.\n3. Use it in your
`DefaultMarkdownGenerator(content_filter=MyCustomFilter(...))`
.\n\n* * *\n\n## 8\\. Final Thoughts\n\n**Fit Markdown** is a
crucial feature for:\n\n* **Summaries**: Quickly get the
important text from a cluttered page.\n* **Search**: Combine
with **BM25** to produce content relevant to a query.\n*
**AI Pipelines**: Filter out boilerplate so LLM-based
extraction or summarization runs on denser text.\n\n**Key
Points**: - **PruningContentFilter**: Great if you just want
the “meatiest” text without a user query. \n\\-
**BM25ContentFilter**: Perfect for query-based extraction or
searching. \n\\- Combine with **`excluded_tags`,
`exclude_external_links`, `word_count_threshold`** to refine
your final “fit” text. \n\\- Fit markdown ends up in
**`result.markdown.fit_markdown`**.\n\nWith
these tools, you can **zero in** on the text that truly
matters, ignoring spammy or boilerplate content, and produce a
concise, relevant “fit markdown” for your AI or data
pipelines. Happy pruning and searching!\n\n* Last Updated:
2025-01-01",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/local-files/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/local-
files/",
"loadedTime": "2025-03-05T23:16:48.570Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/local-
files/",
"title": "Local Files & Raw HTML - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:46 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"2de24187c5996f894a1af63b3522a806\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Local Files & Raw HTML\nPrefix-Based Input Handling
in Crawl4AI\nThis guide will walk you through using the
Crawl4AI library to crawl web pages, local HTML files, and raw
HTML strings. We'll demonstrate these capabilities using a
Wikipedia page as an example.\nCrawling a Web URL\nTo crawl a
live web page, provide the URL starting with http:// or
https://, using a CrawlerRunConfig object:\nimport asyncio
from crawl4ai import AsyncWebCrawler from
crawl4ai.async_configs import CrawlerRunConfig async def
crawl_web(): config = CrawlerRunConfig(bypass_cache=True)
async with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://en.wikipedia.org/wiki/apple\",
config=config ) if result.success: print(\"Markdown Content:
\") print(result.markdown) else: print(f\"Failed to crawl:
{result.error_message}\") asyncio.run(crawl_web()) \nCrawling
a Local HTML File\nTo crawl a local HTML file, prefix the file
path with file://.\nimport asyncio from crawl4ai import
AsyncWebCrawler from crawl4ai.async_configs import
CrawlerRunConfig async def crawl_local_file(): local_file_path
= \"/path/to/apple.html\" # Replace with your file path
file_url = f\"file://{local_file_path}\" config =
CrawlerRunConfig(bypass_cache=True) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=file_url, config=config) if result.success:
print(\"Markdown Content from Local File:\")
print(result.markdown) else: print(f\"Failed to crawl local
file: {result.error_message}\")
asyncio.run(crawl_local_file()) \nCrawling Raw HTML Content
\nTo crawl raw HTML content, prefix the HTML string with raw:.
\nimport asyncio from crawl4ai import AsyncWebCrawler from
crawl4ai.async_configs import CrawlerRunConfig async def
crawl_raw_html(): raw_html = \"<html><body><h1>Hello,
World!</h1></body></html>\" raw_html_url = f\"raw:{raw_html}\"
config = CrawlerRunConfig(bypass_cache=True) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=raw_html_url, config=config) if
result.success: print(\"Markdown Content from Raw HTML:\")
print(result.markdown) else: print(f\"Failed to crawl raw
HTML: {result.error_message}\") asyncio.run(crawl_raw_html())
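If your input can be any of the three forms, a tiny helper (a sketch, not part of the library; `to_crawler_url` is a hypothetical name) can normalize it into a URL the crawler accepts:

from pathlib import Path

def to_crawler_url(source: str) -> str:
    """Normalize a web URL, a local file path, or an HTML string into a prefixed url."""
    if source.startswith(("http://", "https://", "file://", "raw:")):
        return source                                   # already in an accepted form
    if Path(source).exists():
        return f"file://{Path(source).resolve()}"       # local HTML file
    return f"raw:{source}"                              # treat anything else as raw HTML

# result = await crawler.arun(url=to_crawler_url("/path/to/apple.html"), config=config)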
\nComplete Example\nBelow is a comprehensive script that:
\nCrawls the Wikipedia page for \"Apple.\"\nSaves the HTML
content to a local file (apple.html).\nCrawls the local HTML
file and verifies the markdown length matches the original
crawl.\nCrawls the raw HTML content from the saved file and
verifies consistency.\nimport os import sys import asyncio
from pathlib import Path from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig async def
main(): wikipedia_url = \"https://en.wikipedia.org/wiki/apple
\" script_dir = Path(__file__).parent html_file_path =
script_dir / \"apple.html\" async with AsyncWebCrawler() as
crawler: # Step 1: Crawl the Web URL print(\"\\n=== Step 1:
Crawling the Wikipedia URL ===\") web_config =
CrawlerRunConfig(bypass_cache=True) result = await
crawler.arun(url=wikipedia_url, config=web_config) if not
result.success: print(f\"Failed to crawl {wikipedia_url}:
{result.error_message}\") return with open(html_file_path,
'w', encoding='utf-8') as f: f.write(result.html)
web_crawl_length = len(result.markdown) print(f\"Length of
markdown from web crawl: {web_crawl_length}\\n\") # Step 2:
Crawl from the Local HTML File print(\"=== Step 2: Crawling
from the Local HTML File ===\") file_url = f
\"file://{html_file_path.resolve()}\" file_config =
CrawlerRunConfig(bypass_cache=True) local_result = await
crawler.arun(url=file_url, config=file_config) if not
local_result.success: print(f\"Failed to crawl local file
{file_url}: {local_result.error_message}\") return
local_crawl_length = len(local_result.markdown) assert
web_crawl_length == local_crawl_length, \"Markdown length
mismatch\" print(\"✅ Markdown length matches between web and
local file crawl.\\n\") # Step 3: Crawl Using Raw HTML Content
print(\"=== Step 3: Crawling Using Raw HTML Content ===\")
with open(html_file_path, 'r', encoding='utf-8') as f:
raw_html_content = f.read() raw_html_url = f
\"raw:{raw_html_content}\" raw_config =
CrawlerRunConfig(bypass_cache=True) raw_result = await
crawler.arun(url=raw_html_url, config=raw_config) if not
raw_result.success: print(f\"Failed to crawl raw HTML content:
{raw_result.error_message}\") return raw_crawl_length =
len(raw_result.markdown) assert web_crawl_length ==
raw_crawl_length, \"Markdown length mismatch\" print(\"✅
Markdown length matches between web and raw HTML crawl.\\n\")
print(\"All tests passed successfully!\") if
html_file_path.exists(): os.remove(html_file_path) if __name__
== \"__main__\": asyncio.run(main()) \nConclusion\nWith the
unified url parameter and prefix-based handling in Crawl4AI,
you can seamlessly handle web URLs, local HTML files, and raw
HTML content. Use CrawlerRunConfig for flexible and consistent
configuration in all scenarios.",
"markdown": "# Local Files & Raw HTML\n\n## Prefix-Based
Input Handling in Crawl4AI\n\nThis guide will walk you through
using the Crawl4AI library to crawl web pages, local HTML
files, and raw HTML strings. We'll demonstrate these
capabilities using a Wikipedia page as an example.\n\n##
Crawling a Web URL\n\nTo crawl a live web page, provide the
URL starting with `http://` or `https://`, using a
`CrawlerRunConfig` object:\n\n`import asyncio from crawl4ai
import AsyncWebCrawler from crawl4ai.async_configs import
CrawlerRunConfig async def crawl_web(): config =
CrawlerRunConfig(bypass_cache=True) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=
\"https://en.wikipedia.org/wiki/apple\",
config=config ) if result.success:
print(\"Markdown Content:\")
print(result.markdown) else: print(f
\"Failed to crawl: {result.error_message}\")
asyncio.run(crawl_web())`\n\n## Crawling a Local HTML File\n
\nTo crawl a local HTML file, prefix the file path with
`file://`.\n\n`import asyncio from crawl4ai import
AsyncWebCrawler from crawl4ai.async_configs import
CrawlerRunConfig async def crawl_local_file():
local_file_path = \"/path/to/apple.html\" # Replace with your
file path file_url = f\"file://{local_file_path}\"
config = CrawlerRunConfig(bypass_cache=True) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=file_url, config=config) if
result.success: print(\"Markdown Content from
Local File:\") print(result.markdown)
else: print(f\"Failed to crawl local file:
{result.error_message}\") asyncio.run(crawl_local_file())`\n
\n## Crawling Raw HTML Content\n\nTo crawl raw HTML content,
prefix the HTML string with `raw:`.\n\n`import asyncio from
crawl4ai import AsyncWebCrawler from crawl4ai.async_configs
import CrawlerRunConfig async def crawl_raw_html():
raw_html = \"<html><body><h1>Hello, World!</h1></body>
</html>\" raw_html_url = f\"raw:{raw_html}\" config =
CrawlerRunConfig(bypass_cache=True) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=raw_html_url, config=config) if
result.success: print(\"Markdown Content from Raw
HTML:\") print(result.markdown) else:
print(f\"Failed to crawl raw HTML: {result.error_message}\")
asyncio.run(crawl_raw_html())`\n\n* * *\n\n## Complete Example
\n\nBelow is a comprehensive script that:\n\n1. Crawls the
Wikipedia page for \"Apple.\"\n2. Saves the HTML content to a
local file (`apple.html`).\n3. Crawls the local HTML file and
verifies the markdown length matches the original crawl.\n4.
Crawls the raw HTML content from the saved file and verifies
consistency.\n\n`import os import sys import asyncio from
pathlib import Path from crawl4ai import AsyncWebCrawler from
crawl4ai.async_configs import CrawlerRunConfig async def
main(): wikipedia_url =
\"https://en.wikipedia.org/wiki/apple\" script_dir =
Path(__file__).parent html_file_path = script_dir /
\"apple.html\" async with AsyncWebCrawler() as crawler:
# Step 1: Crawl the Web URL print(\"\\n=== Step 1:
Crawling the Wikipedia URL ===\") web_config =
CrawlerRunConfig(bypass_cache=True) result = await
crawler.arun(url=wikipedia_url, config=web_config) if
not result.success: print(f\"Failed to crawl
{wikipedia_url}: {result.error_message}\") return
with open(html_file_path, 'w', encoding='utf-8') as f:
f.write(result.html) web_crawl_length =
len(result.markdown) print(f\"Length of markdown from
web crawl: {web_crawl_length}\\n\") # Step 2: Crawl
from the Local HTML File print(\"=== Step 2: Crawling
from the Local HTML File ===\") file_url = f
\"file://{html_file_path.resolve()}\" file_config =
CrawlerRunConfig(bypass_cache=True) local_result =
await crawler.arun(url=file_url, config=file_config)
if not local_result.success: print(f\"Failed to
crawl local file {file_url}: {local_result.error_message}\")
return local_crawl_length =
len(local_result.markdown) assert web_crawl_length ==
local_crawl_length, \"Markdown length mismatch\"
print(\"✅ Markdown length matches between web and local file
crawl.\\n\") # Step 3: Crawl Using Raw HTML Content
print(\"=== Step 3: Crawling Using Raw HTML Content ===\")
with open(html_file_path, 'r', encoding='utf-8') as f:
raw_html_content = f.read() raw_html_url = f
\"raw:{raw_html_content}\" raw_config =
CrawlerRunConfig(bypass_cache=True) raw_result = await
crawler.arun(url=raw_html_url, config=raw_config) if
not raw_result.success: print(f\"Failed to crawl
raw HTML content: {raw_result.error_message}\")
return raw_crawl_length = len(raw_result.markdown)
assert web_crawl_length == raw_crawl_length, \"Markdown length
mismatch\" print(\"✅ Markdown length matches between
web and raw HTML crawl.\\n\") print(\"All tests
passed successfully!\") if html_file_path.exists():
os.remove(html_file_path) if __name__ == \"__main__\":
asyncio.run(main())`\n\n* * *\n\n## Conclusion\n\nWith the
unified `url` parameter and prefix-based handling in
**Crawl4AI**, you can seamlessly handle web URLs, local HTML
files, and raw HTML content. Use `CrawlerRunConfig` for
flexible and consistent configuration in all scenarios.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/core/link-media/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/core/link-
media/",
"loadedTime": "2025-03-05T23:16:49.544Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/core/link-
media/",
"title": "Link & Media - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:46 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"8fa74750d3b67a5136325c4bbe025d96\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Link & Media - Crawl4AI Documentation (v0.5.x)\nIn
this tutorial, you’ll learn how to:\nExtract links
(internal, external) from crawled pages \nFilter or exclude
specific domains (e.g., social media or custom domains)
\nAccess and manage media data (especially images) in the
crawl result \nConfigure your crawler to exclude or prioritize
certain images\nPrerequisites\n- You have completed or are
familiar with the AsyncWebCrawler Basics tutorial.\n- You can
run Crawl4AI in your environment (Playwright, Python, etc.).
\nBelow is a revised version of the Link Extraction and Media
Extraction sections that includes example data structures
showing how links and media items are stored in CrawlResult.
Feel free to adjust any field names or descriptions to match
your actual output.\n1.1 result.links\nWhen you call arun() or
arun_many() on a URL, Crawl4AI automatically extracts links
and stores them in the links field of CrawlResult. By default,
the crawler tries to distinguish internal links (same domain)
from external links (different domains).\nBasic Example:\nfrom
crawl4ai import AsyncWebCrawler async with AsyncWebCrawler()
as crawler: result = await
crawler.arun(\"https://www.example.com\") if result.success:
internal_links = result.links.get(\"internal\", [])
external_links = result.links.get(\"external\", []) print(f
\"Found {len(internal_links)} internal links.\") print(f
\"Found {len(internal_links)} external links.\") print(f
\"Found {len(result.media)} media items.\") # Each link is
typically a dictionary with fields like: # { \"href\": \"...
\", \"text\": \"...\", \"title\": \"...\", \"base_domain\":
\"...\" } if internal_links: print(\"Sample Internal Link:\",
internal_links[0]) else: print(\"Crawl failed:\",
result.error_message) \nStructure Example:\nresult.links =
{ \"internal\": [ { \"href\": \"https://kidocode.com/\",
\"text\": \"\", \"title\": \"\", \"base_domain\":
\"kidocode.com\" }, { \"href\":
\"https://kidocode.com/degrees/technology\", \"text\":
\"Technology Degree\", \"title\": \"KidoCode Tech Program\",
\"base_domain\": \"kidocode.com\" }, # ... ], \"external\":
[ # possibly other links leading to third-party sites ] }
\nhref: The raw hyperlink URL. \ntext: The link text (if any)
within the <a> tag. \ntitle: The title attribute of the link
(if present). \nbase_domain: The domain extracted from href.
Helpful for filtering or grouping by domain.\n2. Domain
Filtering\nSome websites contain hundreds of third-party or
affiliate links. You can filter out certain domains at crawl
time by configuring the crawler. The most relevant parameters
in CrawlerRunConfig are:\nexclude_external_links: If True,
discard any link pointing outside the root domain.
\nexclude_social_media_domains: Provide a list of social media
platforms (e.g., [\"facebook.com\", \"twitter.com\"]) to
exclude from your crawl. \nexclude_social_media_links: If
True, automatically skip known social platforms.
\nexclude_domains: Provide a list of custom domains you want
to exclude (e.g., [\"spammyads.com\", \"tracker.net\"]).
\nimport asyncio from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig async def main(): crawler_cfg
= CrawlerRunConfig( exclude_external_links=True, # No links
outside primary domain exclude_social_media_links=True # Skip
recognized social media domains ) async with AsyncWebCrawler()
as crawler: result = await
crawler.arun( \"https://www.example.com\",
config=crawler_cfg ) if result.success: print(\"[OK] Crawled:
\", result.url) print(\"Internal links count:\",
len(result.links.get(\"internal\", []))) print(\"External
links count:\", len(result.links.get(\"external\", []))) #
Likely zero external links in this scenario else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \n2.2 Example: Excluding
Specific Domains\nIf you want to let external links in, but
specifically exclude a domain (e.g., suspiciousads.com), do
this:\ncrawler_cfg =
CrawlerRunConfig( exclude_domains=[\"suspiciousads.com\"] )
\nThis approach is handy when you still want external links
but need to block certain sites you consider spammy.\n3.1
Accessing result.media\nBy default, Crawl4AI collects images,
audio, and video URLs it finds on the page. These are stored
in result.media, a dictionary keyed by media type (e.g.,
images, videos, audio).\nBasic Example:\nif result.success:
images_info = result.media.get(\"images\", []) print(f\"Found
{len(images_info)} images in total.\") for i, img in
enumerate(images_info[:5]): # Inspect just the first 5 print(f
\"[Image {i}] URL: {img['src']}\") print(f\" Alt text:
{img.get('alt', '')}\") print(f\" Score: {img.get('score')}\")
print(f\" Description: {img.get('desc', '')}\\n\") \nStructure
Example:\nresult.media = { \"images\": [ { \"src\":
\"https://cdn.prod.website-files.com/.../Group%2089.svg\",
\"alt\": \"coding school for kids\", \"desc\": \"Trial Class
Degrees degrees All Degrees AI Degree Technology ...\",
\"score\": 3, \"type\": \"image\", \"group_id\": 0, \"format
\": None, \"width\": None, \"height\": None }, # ... ],
\"videos\": [ # Similar structure but with video-specific
fields ], \"audio\": [ # Similar structure but with audio-
specific fields ] } \nDepending on your Crawl4AI version or
scraping strategy, these dictionaries can include fields like:
\nsrc: The media URL (e.g., image source) \nalt: The alt text
for images (if present) \ndesc: A snippet of nearby text or a
short description (optional) \nscore: A heuristic relevance
score if you’re using content-scoring features \nwidth,
height: If the crawler detects dimensions for the image/video
\ntype: Usually \"image\", \"video\", or \"audio\" \ngroup_id:
If you’re grouping related media items, the crawler might
assign an ID \nWith these details, you can easily filter out
or focus on certain images (for instance, ignoring images with
very low scores or a different domain), or gather metadata for
analytics.\n3.2 Excluding External Images\nIf you’re dealing
with heavy pages or want to skip third-party images
(advertisements, for example), you can turn on:\ncrawler_cfg =
CrawlerRunConfig( exclude_external_images=True ) \nThis
setting attempts to discard images from outside the primary
domain, keeping only those from the site you’re crawling.
\n3.3 Additional Media Config\nscreenshot: Set to True if you
want a full-page screenshot stored as base64 in
result.screenshot. \npdf: Set to True if you want a PDF
version of the page in result.pdf. \nwait_for_images: If True,
attempts to wait until images are fully loaded before final
extraction.\nHere’s a combined example demonstrating how to
filter out external links, skip certain domains, and exclude
external images:\nimport asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def
main(): # Suppose we want to keep only internal links, remove
certain domains, # and discard external images from the final
crawl data. crawler_cfg =
CrawlerRunConfig( exclude_external_links=True,
exclude_domains=[\"spammyads.com\"],
exclude_social_media_links=True, # skip Twitter, Facebook,
etc. exclude_external_images=True, # keep only images from
main domain wait_for_images=True, # ensure images are loaded
verbose=True ) async with AsyncWebCrawler() as crawler: result
= await crawler.arun(\"https://www.example.com\",
config=crawler_cfg) if result.success: print(\"[OK] Crawled:
\", result.url) # 1. Links in_links =
result.links.get(\"internal\", []) ext_links =
result.links.get(\"external\", []) print(\"Internal link
count:\", len(in_links)) print(\"External link count:\",
len(ext_links)) # should be zero with
exclude_external_links=True # 2. Images images =
result.media.get(\"images\", []) print(\"Images found:\",
len(images)) # Let's see a snippet of these images for i, img
in enumerate(images[:3]): print(f\" - {img['src']}
(alt={img.get('alt','')}, score={img.get('score','N/A')})\")
else: print(\"[ERROR] Failed to crawl. Reason:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n5. Common Pitfalls & Tips\n1.
Conflicting Flags:\n- exclude_external_links=True but then
also specifying exclude_social_media_links=True is typically
fine, but understand that the first setting already discards
all external links. The second becomes somewhat redundant.\n-
exclude_external_images=True but want to keep some external
images? Currently no partial domain-based setting for images,
so you might need a custom approach or hook logic.\n2.
Relevancy Scores:\n- If your version of Crawl4AI or your
scraping strategy includes an img[\"score\"], it’s typically
a heuristic based on size, position, or content analysis.
Evaluate carefully if you rely on it.\n3. Performance:\n-
Excluding certain domains or external images can speed up your
crawl, especially for large, media-heavy pages.\n- If you want
a “full” link map, do not exclude them. Instead, you can
post-filter in your own code.\n4. Social Media Lists:\n-
exclude_social_media_links=True typically references an
internal list of known social domains like Facebook, Twitter,
LinkedIn, etc. If you need to add or remove from that list,
look for library settings or a local config file (depending on
your version).\nThat’s it for Link & Media Analysis! You’re
now equipped to filter out unwanted sites and zero in on
the images and videos that matter for your project.",
"markdown": "# Link & Media - Crawl4AI Documentation
(v0.5.x)\n\nIn this tutorial, you’ll learn how to:\n\n1.
Extract links (internal, external) from crawled pages\n2.
Filter or exclude specific domains (e.g., social media or
custom domains)\n3. Access and manage media data (especially
images) in the crawl result\n4. Configure your crawler to
exclude or prioritize certain images\n\n> **Prerequisites**
\n> \\- You have completed or are familiar with the
[AsyncWebCrawler Basics]
(https://crawl4ai.com/mkdocs/core/simple-crawling/) tutorial.
\n> \\- You can run Crawl4AI in your environment (Playwright,
Python, etc.).\n\n* * *\n\nBelow is a revised version of the
**Link Extraction** and **Media Extraction** sections that
includes example data structures showing how links and media
items are stored in `CrawlResult`. Feel free to adjust any
field names or descriptions to match your actual output.\n\n*
* *\n\n### 1.1 `result.links`\n\nWhen you call `arun()` or
`arun_many()` on a URL, Crawl4AI automatically extracts links
and stores them in the `links` field of `CrawlResult`. By
default, the crawler tries to distinguish **internal** links
(same domain) from **external** links (different domains).\n
\n**Basic Example**:\n\n`from crawl4ai import AsyncWebCrawler
async with AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://www.example.com\") if
result.success: internal_links =
result.links.get(\"internal\", []) external_links =
result.links.get(\"external\", []) print(f\"Found
{len(internal_links)} internal links.\") print(f
\"Found {len(internal_links)} external links.\")
print(f\"Found {len(result.media)} media items.\") #
Each link is typically a dictionary with fields like:
# { \"href\": \"...\", \"text\": \"...\", \"title\": \"...\",
\"base_domain\": \"...\" } if internal_links:
print(\"Sample Internal Link:\", internal_links[0]) else:
print(\"Crawl failed:\", result.error_message)`\n\n**Structure
Example**:\n\n`result.links = { \"internal\":
[ { \"href\": \"https://kidocode.com/\",
\"text\": \"\", \"title\": \"\", \"base_domain\":
\"kidocode.com\" }, { \"href\":
\"https://kidocode.com/degrees/technology\", \"text\":
\"Technology Degree\", \"title\": \"KidoCode Tech
Program\", \"base_domain\": \"kidocode.com\" },
# ... ], \"external\": [ # possibly other links
leading to third-party sites ] }`\n\n* **`href`**: The raw
hyperlink URL.\n* **`text`**: The link text (if any) within
the `<a>` tag.\n* **`title`**: The `title` attribute of the
link (if present).\n* **`base_domain`**: The domain
extracted from `href`. Helpful for filtering or grouping by
domain.\n\n* * *\n\n## 2\\. Domain Filtering\n\nSome websites
contain hundreds of third-party or affiliate links. You can
filter out certain domains at **crawl time** by configuring
the crawler. The most relevant parameters in
`CrawlerRunConfig` are:\n\n* **`exclude_external_links`**:
If `True`, discard any link pointing outside the root domain.
\n* **`exclude_social_media_domains`**: Provide a list of
social media platforms (e.g., `[\"facebook.com\",
\"twitter.com\"]`) to exclude from your crawl.\n*
**`exclude_social_media_links`**: If `True`, automatically
skip known social platforms.\n* **`exclude_domains`**:
Provide a list of custom domains you want to exclude (e.g.,
`[\"spammyads.com\", \"tracker.net\"]`).\n\n`import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig async def main(): crawler_cfg =
CrawlerRunConfig( exclude_external_links=True,
# No links outside primary domain
exclude_social_media_links=True # Skip recognized social
media domains ) async with AsyncWebCrawler() as
crawler: result = await
crawler.arun( \"https://www.example.com\",
config=crawler_cfg ) if result.success:
print(\"[OK] Crawled:\", result.url)
print(\"Internal links count:\",
len(result.links.get(\"internal\", [])))
print(\"External links count:\",
len(result.links.get(\"external\", []))) #
Likely zero external links in this scenario else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n### 2.2 Example:
Excluding Specific Domains\n\nIf you want to let external
links in, but specifically exclude a domain (e.g.,
`suspiciousads.com`), do this:\n\n`crawler_cfg =
CrawlerRunConfig( exclude_domains=[\"suspiciousads.com
\"] )`\n\nThis approach is handy when you still want external
links but need to block certain sites you consider spammy.\n
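If you prefer to keep all external links and filter afterwards, a short post-processing pass over `result.links` also works. The sketch below is illustrative only; `BLOCKLIST` and `clean_external_links` are hypothetical, and the field names follow the structure example above:

from collections import Counter

BLOCKLIST = {"suspiciousads.com", "tracker.net"}        # hypothetical domains to drop

def clean_external_links(result):
    external = result.links.get("external", [])
    kept = [link for link in external if link.get("base_domain") not in BLOCKLIST]
    by_domain = Counter(link.get("base_domain", "") for link in kept)
    return kept, by_domain

# kept, by_domain = clean_external_links(result)
# print(by_domain.most_common(5))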
\n* * *\n\n### 3.1 Accessing `result.media`\n\nBy default,
Crawl4AI collects images, audio, and video URLs it finds on
the page. These are stored in `result.media`, a dictionary
keyed by media type (e.g., `images`, `videos`, `audio`).\n
\n**Basic Example**:\n\n`if result.success: images_info =
result.media.get(\"images\", []) print(f\"Found
{len(images_info)} images in total.\") for i, img in
enumerate(images_info[:5]): # Inspect just the first 5
print(f\"[Image {i}] URL: {img['src']}\") print(f\"
Alt text: {img.get('alt', '')}\") print(f\"
Score: {img.get('score')}\") print(f\"
Description: {img.get('desc', '')}\\n\")`\n\n**Structure
Example**:\n\n`result.media = { \"images\":
[ { \"src\": \"https://cdn.prod.website-
files.com/.../Group%2089.svg\", \"alt\": \"coding school
for kids\", \"desc\": \"Trial Class Degrees degrees All
Degrees AI Degree Technology ...\", \"score\": 3,
\"type\": \"image\", \"group_id\": 0, \"format\":
None, \"width\": None, \"height\": None },
# ... ], \"videos\": [ # Similar structure but with
video-specific fields ], \"audio\": [ # Similar
structure but with audio-specific fields ] }`\n\nDepending
on your Crawl4AI version or scraping strategy, these
dictionaries can include fields like:\n\n* **`src`**: The
media URL (e.g., image source)\n* **`alt`**: The alt text
for images (if present)\n* **`desc`**: A snippet of nearby
text or a short description (optional)\n* **`score`**: A
heuristic relevance score if you’re using content-scoring
features\n* **`width`**, **`height`**: If the crawler
detects dimensions for the image/video\n* **`type`**:
Usually `\"image\"`, `\"video\"`, or `\"audio\"`\n*
**`group_id`**: If you’re grouping related media items, the
crawler might assign an ID\n\nWith these details, you can
easily filter out or focus on certain images (for instance,
ignoring images with very low scores or a different domain),
or gather metadata for analytics.\n\n### 3.2 Excluding
External Images\n\nIf you’re dealing with heavy pages or
want to skip third-party images (advertisements, for example),
you can turn on:\n\n`crawler_cfg =
CrawlerRunConfig( exclude_external_images=True )`\n\nThis
setting attempts to discard images from outside the primary
domain, keeping only those from the site you’re crawling.\n
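For finer control than this boolean flag, you can post-filter `result.media` yourself, for example keeping only images above a score cutoff. This is a sketch; `top_images` is a hypothetical helper and the `score` field may be absent in some versions:

def top_images(result, min_score=2, limit=10):
    images = result.media.get("images", [])
    scored = [img for img in images if (img.get("score") or 0) >= min_score]
    # Highest-scoring images first
    scored.sort(key=lambda img: img.get("score") or 0, reverse=True)
    return scored[:limit]

# for img in top_images(result):
#     print(img["src"], img.get("alt", ""))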
\n### 3.3 Additional Media Config\n\n* **`screenshot`**: Set
to `True` if you want a full-page screenshot stored as `base64
` in `result.screenshot`.\n* **`pdf`**: Set to `True` if you
want a PDF version of the page in `result.pdf`.\n*
**`wait_for_images`**: If `True`, attempts to wait until
images are fully loaded before final extraction.\n\n* * *\n
\nHere’s a combined example demonstrating how to filter out
external links, skip certain domains, and exclude external
images:\n\n`import asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def
main(): # Suppose we want to keep only internal links,
remove certain domains, # and discard external images
from the final crawl data. crawler_cfg =
CrawlerRunConfig( exclude_external_links=True,
exclude_domains=[\"spammyads.com\"],
exclude_social_media_links=True, # skip Twitter, Facebook,
etc. exclude_external_images=True, # keep only
images from main domain wait_for_images=True,
# ensure images are loaded verbose=True )
async with AsyncWebCrawler() as crawler: result =
await crawler.arun(\"https://www.example.com\",
config=crawler_cfg) if result.success:
print(\"[OK] Crawled:\", result.url) # 1. Links
in_links = result.links.get(\"internal\", [])
ext_links = result.links.get(\"external\", [])
print(\"Internal link count:\", len(in_links))
print(\"External link count:\", len(ext_links)) # should be
zero with exclude_external_links=True # 2. Images
images = result.media.get(\"images\", [])
print(\"Images found:\", len(images)) # Let's see
a snippet of these images for i, img in
enumerate(images[:3]): print(f\" -
{img['src']} (alt={img.get('alt','')},
score={img.get('score','N/A')})\") else:
print(\"[ERROR] Failed to crawl. Reason:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main())`\n\n* * *\n\n## 5\\. Common Pitfalls &
Tips\n\n1. **Conflicting Flags**: \n\\-
`exclude_external_links=True` but then also specifying
`exclude_social_media_links=True` is typically fine, but
understand that the first setting already discards _all_
external links. The second becomes somewhat redundant. \n\\-
`exclude_external_images=True` but want to keep some external
images? Currently no partial domain-based setting for images,
so you might need a custom approach or hook logic.\n\n2.
**Relevancy Scores**: \n\\- If your version of Crawl4AI or
your scraping strategy includes an `img[\"score\"]`, it’s
typically a heuristic based on size, position, or content
analysis. Evaluate carefully if you rely on it.\n\n3.
**Performance**: \n\\- Excluding certain domains or
external images can speed up your crawl, especially for large,
media-heavy pages. \n\\- If you want a “full” link map,
do _not_ exclude them. Instead, you can post-filter in your
own code.\n\n4. **Social Media Lists**: \n\\-
`exclude_social_media_links=True` typically references an
internal list of known social domains like Facebook, Twitter,
LinkedIn, etc. If you need to add or remove from that list,
look for library settings or a local config file (depending on
your version).\n\n* * *\n\n**That’s it for Link & Media
Analysis!** You’re now equipped to filter out unwanted sites
and zero in on the images and videos that matter for your
project.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/advanced-
features/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/advanced/advanced-features/",
"loadedTime": "2025-03-05T23:16:50.570Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/advanced/advanced-features/",
"title": "Overview - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:48 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"263ae84d962a9dd1df63d7edf861188e\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Overview - Crawl4AI Documentation
(v0.5.x)\nOverview of Some Important Advanced Features
\n(Proxy, PDF, Screenshot, SSL, Headers, & Storage
State)\nCrawl4AI offers multiple power-user features that go
beyond simple crawling. This tutorial covers:\n1. Proxy Usage
\n2. Capturing PDFs & Screenshots\n3. Handling SSL
Certificates\n4. Custom Headers\n5. Session Persistence &
Local Storage\n6. Robots.txt Compliance \nPrerequisites\n- You
have a basic grasp of AsyncWebCrawler Basics\n- You know how
to run or configure your Python environment with Playwright
installed\n1. Proxy Usage\nIf you need to route your crawl
traffic through a proxy—whether for IP rotation, geo-
testing, or privacy—Crawl4AI supports it via
BrowserConfig.proxy_config.\nimport asyncio from crawl4ai
import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async
def main(): browser_cfg =
BrowserConfig( proxy_config={ \"server\":
\"http://proxy.example.com:8080\", \"username\": \"myuser\",
\"password\": \"mypass\", }, headless=True ) crawler_cfg =
CrawlerRunConfig( verbose=True ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result = await
crawler.arun( url=\"https://www.whatismyip.com/\",
config=crawler_cfg ) if result.success: print(\"[OK] Page
fetched via proxy.\") print(\"Page HTML snippet:\",
result.html[:200]) else: print(\"[ERROR]\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \nKey Points\n- proxy_config expects a
dict with server and optional auth credentials.\n- Many
commercial proxies provide an HTTP/HTTPS “gateway” server
that you specify in server.\n- If your proxy doesn’t need
auth, omit username/password.\n2. Capturing PDFs & Screenshots
\nSometimes you need a visual record of a page or a PDF
“printout.” Crawl4AI can do both in one pass:\nimport os,
asyncio from base64 import b64decode from crawl4ai import
AsyncWebCrawler, CacheMode async def main(): async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=
\"https://en.wikipedia.org/wiki/List_of_common_misconceptions
\", cache_mode=CacheMode.BYPASS, pdf=True, screenshot=True )
if result.success: # Save screenshot if result.screenshot:
with open(\"wikipedia_screenshot.png\", \"wb\") as f:
f.write(b64decode(result.screenshot)) # Save PDF if
result.pdf: with open(\"wikipedia_page.pdf\", \"wb\") as f:
f.write(result.pdf) print(\"[OK] PDF & screenshot captured.\")
else: print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \nWhy PDF + Screenshot?\n-
Large or complex pages can be slow or error-prone with
“traditional” full-page screenshots.\n- Exporting a PDF is
more reliable for very long pages. Crawl4AI automatically
converts the first PDF page into an image if you request both.
\nRelevant Parameters\n- pdf=True: Exports the current page as
a PDF (base64-encoded in result.pdf).\n- screenshot=True:
Creates a screenshot (base64-encoded in result.screenshot).\n-
scan_full_page or advanced hooking can further refine how the
crawler captures content.\n3. Handling SSL Certificates\nIf
you need to verify or export a site’s SSL certificate—for
compliance, debugging, or data analysis—Crawl4AI can fetch
it during the crawl:\nimport asyncio, os from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def main():
tmp_dir = os.path.join(os.getcwd(), \"tmp\")
os.makedirs(tmp_dir, exist_ok=True) config =
CrawlerRunConfig( fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun(url=\"https://example.com
\", config=config) if result.success and
result.ssl_certificate: cert = result.ssl_certificate
print(\"\\nCertificate Information:\") print(f\"Issuer (CN):
{cert.issuer.get('CN', '')}\") print(f\"Valid until:
{cert.valid_until}\") print(f\"Fingerprint:
{cert.fingerprint}\") # Export in multiple formats:
cert.to_json(os.path.join(tmp_dir, \"certificate.json\"))
cert.to_pem(os.path.join(tmp_dir, \"certificate.pem\"))
cert.to_der(os.path.join(tmp_dir, \"certificate.der\"))
print(\"\\nCertificate exported to JSON/PEM/DER in 'tmp'
folder.\") else: print(\"[ERROR] No certificate or crawl
failed.\") if __name__ == \"__main__\": asyncio.run(main())
\nKey Points\n- fetch_ssl_certificate=True triggers
certificate retrieval.\n- result.ssl_certificate includes
methods (to_json, to_pem, to_der) for saving in various
formats (handy for server config, Java keystores, etc.).
\nSometimes you need to set custom headers (e.g., language
preferences, authentication tokens, or specialized user-agent
strings). You can do this in multiple ways:\nimport asyncio
from crawl4ai import AsyncWebCrawler async def main(): #
Option 1: Set headers at the crawler strategy level crawler1 =
AsyncWebCrawler( # The underlying strategy can accept headers
in its constructor crawler_strategy=None # We'll override
below for clarity )
crawler1.crawler_strategy.update_user_agent(\"MyCustomUA/1.0
\") crawler1.crawler_strategy.set_custom_headers({ \"Accept-
Language\": \"fr-FR,fr;q=0.9\" }) result1 = await
crawler1.arun(\"https://www.example.com\") print(\"Example 1
result success:\", result1.success) # Option 2: Pass headers
directly to `arun()` crawler2 = AsyncWebCrawler() result2 =
await crawler2.arun( url=\"https://www.example.com\",
headers={\"Accept-Language\": \"es-ES,es;q=0.9\"} )
print(\"Example 2 result success:\", result2.success) if
__name__ == \"__main__\": asyncio.run(main()) \nNotes\n- Some
sites may react differently to certain headers (e.g., Accept-
Language).\n- If you need advanced user-agent randomization or
client hints, see Identity-Based Crawling (Anti-Bot) or use
UserAgentGenerator.\n5. Session Persistence & Local Storage
\nCrawl4AI can preserve cookies and localStorage so you can
continue where you left off—ideal for logging into sites or
skipping repeated auth flows.\n5.1 storage_state\nimport
asyncio from crawl4ai import AsyncWebCrawler async def main():
storage_dict = { \"cookies\": [ { \"name\": \"session\",
\"value\": \"abcd1234\", \"domain\": \"example.com\", \"path
\": \"/\", \"expires\": 1699999999.0, \"httpOnly\": False,
\"secure\": False, \"sameSite\": \"None\" } ], \"origins\":
[ { \"origin\": \"https://example.com\", \"localStorage\":
[ {\"name\": \"token\", \"value\": \"my_auth_token\"} ] } ] }
# Provide the storage state as a dictionary to start \"already
logged in\" async with AsyncWebCrawler( headless=True,
storage_state=storage_dict ) as crawler: result = await
crawler.arun(\"https://example.com/protected\") if
result.success: print(\"Protected page content length:\",
len(result.html)) else: print(\"Failed to crawl protected page
\") if __name__ == \"__main__\": asyncio.run(main()) \n5.2
Exporting & Reusing State\nYou can sign in once, export the
browser context, and reuse it later—without re-entering
credentials.\nawait context.storage_state(path=
\"my_storage.json\"): Exports cookies, localStorage, etc. to a
file. \nProvide storage_state=\"my_storage.json\" on
subsequent runs to skip the login step.\nSee: Detailed session
management tutorial or Explanations → Browser Context &
Managed Browser for more advanced scenarios (like multi-step
logins, or capturing after interactive pages).\n6. Robots.txt
Compliance\nCrawl4AI supports respecting robots.txt rules with
efficient caching:\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig async def main(): # Enable
robots.txt checking in config config =
CrawlerRunConfig( check_robots_txt=True # Will check and
respect robots.txt rules ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun( \"https://example.com\",
config=config ) if not result.success and result.status_code
== 403: print(\"Access denied by robots.txt\") if __name__ ==
\"__main__\": asyncio.run(main()) \nKey Points - Robots.txt
files are cached locally for efficiency - Cache is stored in
~/.crawl4ai/robots/robots_cache.db - Cache has a default TTL
of 7 days - If robots.txt can't be fetched, crawling is
allowed - Returns 403 status code if URL is disallowed
\nPutting It All Together\nHere’s a snippet that combines
multiple “advanced” features (proxy, PDF, screenshot, SSL,
custom headers, and session reuse) into one run. Normally,
you’d tailor each setting to your project’s needs.\nimport
os, asyncio from base64 import b64decode from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main(): # 1. Browser config with proxy + headless
browser_cfg = BrowserConfig( proxy_config={ \"server\":
\"http://proxy.example.com:8080\", \"username\": \"myuser\",
\"password\": \"mypass\", }, headless=True, ) # 2. Crawler
config with PDF, screenshot, SSL, custom headers, and ignoring
caches crawler_cfg = CrawlerRunConfig( pdf=True,
screenshot=True, fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, headers={\"Accept-Language\":
\"en-US,en;q=0.8\"}, storage_state=\"my_storage.json\", #
Reuse session from a previous sign-in verbose=True, ) # 3.
Crawl async with AsyncWebCrawler(config=browser_cfg) as
crawler: result = await crawler.arun( url =
\"https://secure.example.com/protected\", config=crawler_cfg )
if result.success: print(\"[OK] Crawled the secure page. Links
found:\", len(result.links.get(\"internal\", []))) # Save PDF
& screenshot if result.pdf: with open(\"result.pdf\", \"wb\")
as f: f.write(b64decode(result.pdf)) if result.screenshot:
with open(\"result.png\", \"wb\") as f:
f.write(b64decode(result.screenshot)) # Check SSL cert if
result.ssl_certificate: print(\"SSL Issuer CN:\",
result.ssl_certificate.issuer.get(\"CN\", \"\")) else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \nConclusion & Next Steps
\nYou’ve now explored several advanced features:\nProxy
Usage \nPDF & Screenshot capturing for large or critical pages
\nSSL Certificate retrieval & exporting \nCustom Headers for
language or specialized requests \nSession Persistence via
storage state\nRobots.txt Compliance\nWith these power tools,
you can build robust scraping workflows that mimic real user
behavior, handle secure sites, capture detailed snapshots, and
manage sessions across multiple runs—streamlining your
entire data collection pipeline.\nLast Updated: 2025-01-01",
"markdown": "# Overview - Crawl4AI Documentation (v0.5.x)\n
\n## Overview of Some Important Advanced Features\n\n(Proxy,
PDF, Screenshot, SSL, Headers, & Storage State)\n\nCrawl4AI
offers multiple power-user features that go beyond simple
crawling. This tutorial covers:\n\n1. **Proxy Usage** \n2.â
€€**Capturing PDFs & Screenshots** \n3. **Handling SSL
Certificates** \n4. **Custom Headers** \n5. **Session
Persistence & Local Storage** \n6. **Robots.txt
Compliance**\n\n> **Prerequisites** \n> \\- You have a basic
grasp of [AsyncWebCrawler Basics]
(https://crawl4ai.com/mkdocs/core/simple-crawling/) \n> \\-
You know how to run or configure your Python environment with
Playwright installed\n\n* * *\n\n## 1\\. Proxy Usage\n\nIf you
need to route your crawl traffic through a proxy—whether for
IP rotation, geo-testing, or privacy—Crawl4AI supports it
via `BrowserConfig.proxy_config`.\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig async def main(): browser_cfg =
BrowserConfig( proxy_config={ \"server\":
\"http://proxy.example.com:8080\", \"username\":
\"myuser\", \"password\": \"mypass\", },
headless=True ) crawler_cfg =
CrawlerRunConfig( verbose=True ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result
= await crawler.arun( url=
\"https://www.whatismyip.com/\",
config=crawler_cfg ) if result.success:
print(\"[OK] Page fetched via proxy.\")
print(\"Page HTML snippet:\", result.html[:200]) else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n**Key Points** \n
\\- **`proxy_config`** expects a dict with `server` and
optional auth credentials. \n\\- Many commercial proxies
provide an HTTP/HTTPS “gateway” server that you specify in
`server`. \n\\- If your proxy doesn’t need auth, omit
`username`/`password`.\n\n* * *\n\n## 2\\. Capturing PDFs &
Screenshots\n\nSometimes you need a visual record of a page or
a PDF “printout.” Crawl4AI can do both in one pass:\n
\n`import os, asyncio from base64 import b64decode from
crawl4ai import AsyncWebCrawler, CacheMode async def main():
async with AsyncWebCrawler() as crawler: result =
await crawler.arun( url=
\"https://en.wikipedia.org/wiki/List_of_common_misconceptions
\", cache_mode=CacheMode.BYPASS,
pdf=True, screenshot=True ) if
result.success: # Save screenshot if
result.screenshot: with
open(\"wikipedia_screenshot.png\", \"wb\") as f:
f.write(b64decode(result.screenshot)) # Save PDF
if result.pdf: with open(\"wikipedia_page.pdf
\", \"wb\") as f: f.write(result.pdf)
print(\"[OK] PDF & screenshot captured.\") else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n**Why PDF +
Screenshot?** \n\\- Large or complex pages can be slow or
error-prone with “traditional” full-page screenshots. \n
\\- Exporting a PDF is more reliable for very long pages.
Crawl4AI automatically converts the first PDF page into an
image if you request both.\n\n**Relevant Parameters** \n\\-
**`pdf=True`**: Exports the current page as a PDF (base64-
encoded in `result.pdf`). \n\\- **`screenshot=True`**:
Creates a screenshot (base64-encoded in `result.screenshot`).
\n\\- **`scan_full_page`** or advanced hooking can further
refine how the crawler captures content.\n\n* * *\n\n## 3\\.
Handling SSL Certificates\n\nIf you need to verify or export a
site’s SSL certificate—for compliance, debugging, or data
analysis—Crawl4AI can fetch it during the crawl:\n\n`import
asyncio, os from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode async def main(): tmp_dir =
os.path.join(os.getcwd(), \"tmp\") os.makedirs(tmp_dir,
exist_ok=True) config =
CrawlerRunConfig( fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(url=\"https://example.com\", config=config)
if result.success and result.ssl_certificate: cert
= result.ssl_certificate print(\"\\nCertificate
Information:\") print(f\"Issuer (CN):
{cert.issuer.get('CN', '')}\") print(f\"Valid
until: {cert.valid_until}\") print(f\"Fingerprint:
{cert.fingerprint}\") # Export in multiple
formats: cert.to_json(os.path.join(tmp_dir,
\"certificate.json\"))
cert.to_pem(os.path.join(tmp_dir, \"certificate.pem\"))
cert.to_der(os.path.join(tmp_dir, \"certificate.der\"))
print(\"\\nCertificate exported to JSON/PEM/DER in 'tmp'
folder.\") else: print(\"[ERROR] No
certificate or crawl failed.\") if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Key Points** \n\\-
**`fetch_ssl_certificate=True`** triggers certificate
retrieval. \n\\- `result.ssl_certificate` includes methods
(`to_json`, `to_pem`, `to_der`) for saving in various formats
(handy for server config, Java keystores, etc.).\n\n* * *\n
\n## 4\. Custom Headers\n\nSometimes you need to set custom headers (e.g., language
preferences, authentication tokens, or specialized user-agent
strings). You can do this in multiple ways:\n\n``import
asyncio from crawl4ai import AsyncWebCrawler async def
main(): # Option 1: Set headers at the crawler strategy
level crawler1 = AsyncWebCrawler( # The underlying
strategy can accept headers in its constructor
crawler_strategy=None # We'll override below for
clarity )
crawler1.crawler_strategy.update_user_agent(\"MyCustomUA/1.0
\")
crawler1.crawler_strategy.set_custom_headers({ \"Accep
t-Language\": \"fr-FR,fr;q=0.9\" }) result1 = await
crawler1.arun(\"https://www.example.com\") print(\"Example
1 result success:\", result1.success) # Option 2: Pass
headers directly to `arun()` crawler2 = AsyncWebCrawler()
result2 = await crawler2.arun( url=
\"https://www.example.com\", headers={\"Accept-
Language\": \"es-ES,es;q=0.9\"} ) print(\"Example 2
result success:\", result2.success) if __name__ == \"__main__
\": asyncio.run(main())``\n\n**Notes** \n\\- Some sites
may react differently to certain headers (e.g., `Accept-
Language`). \n\\- If you need advanced user-agent
randomization or client hints, see [Identity-Based Crawling
(Anti-Bot)](https://crawl4ai.com/mkdocs/advanced/identity-
based-crawling/) or use `UserAgentGenerator`.\n\n* * *\n\n## 5
\\. Session Persistence & Local Storage\n\nCrawl4AI can
preserve cookies and localStorage so you can continue where
you left off, ideal for logging into sites or skipping
repeated auth flows.\n\n### 5.1 `storage_state`\n\n`import
asyncio from crawl4ai import AsyncWebCrawler async def
main(): storage_dict = { \"cookies\":
[ { \"name\": \"session\",
\"value\": \"abcd1234\", \"domain\":
\"example.com\", \"path\": \"/\",
\"expires\": 1699999999.0, \"httpOnly\":
False, \"secure\": False,
\"sameSite\": \"None\" } ],
\"origins\": [ { \"origin\":
\"https://example.com\", \"localStorage\":
[ {\"name\": \"token\", \"value\":
\"my_auth_token
\"} ] } ] } #
Provide the storage state as a dictionary to start \"already
logged in\" async with
AsyncWebCrawler( headless=True,
storage_state=storage_dict ) as crawler: result =
await crawler.arun(\"https://example.com/protected\")
if result.success: print(\"Protected page content
length:\", len(result.html)) else:
print(\"Failed to crawl protected page\") if __name__ ==
\"__main__\": asyncio.run(main())`\n\n### 5.2 Exporting &
Reusing State\n\nYou can sign in once, export the browser
context, and reuse it later without re-entering credentials.\n\n* **`await context.storage_state(path=\"my_storage.json\")`**: Exports cookies, localStorage, etc. to a file.\n* Provide `storage_state=\"my_storage.json\"` on subsequent runs to skip the login step (see the sketch below).\n\n
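Below is a hedged sketch of that two-step flow. The login URL, selectors, and credentials are placeholders, and the login runs in the `on_page_context_created` hook described in the Hooks & Auth docs; treat it as a starting point, not a drop-in recipe.

```python
# Sketch only: sign in once, export the browser's storage state, then reuse it.
# URLs, selectors, and credentials are placeholders.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def login_and_export(page, context, **kwargs):
    # Runs in on_page_context_created, before the main navigation.
    await page.goto("https://example.com/login")
    await page.fill("input[name='username']", "testuser")
    await page.fill("input[name='password']", "password123")
    await page.click("button[type='submit']")
    await page.wait_for_selector("#welcome")  # placeholder post-login element
    await context.storage_state(path="my_storage.json")  # export cookies + localStorage
    return page

async def main():
    # Run 1: log in and export the state
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    crawler.crawler_strategy.set_hook("on_page_context_created", login_and_export)
    await crawler.start()
    await crawler.arun("https://example.com/protected", config=CrawlerRunConfig())
    await crawler.close()

    # Run 2: reuse the saved state, no login step needed
    async with AsyncWebCrawler(headless=True, storage_state="my_storage.json") as crawler2:
        result = await crawler2.arun("https://example.com/protected")
        print("Reused session:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```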
\n\n**See**: [Detailed session management tutorial]
(https://crawl4ai.com/mkdocs/advanced/session-management/) or
[Explanations → Browser Context & Managed Browser]
(https://crawl4ai.com/mkdocs/advanced/identity-based-
crawling/) for more advanced scenarios (like multi-step
logins, or capturing after interactive pages).\n\n* * *\n\n##
6\\. Robots.txt Compliance\n\nCrawl4AI supports respecting
robots.txt rules with efficient caching:\n\n`import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig async
def main(): # Enable robots.txt checking in config
config = CrawlerRunConfig( check_robots_txt=True #
Will check and respect robots.txt rules ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( \"https://example.com\",
config=config ) if not result.success and
result.status_code == 403: print(\"Access denied
by robots.txt\") if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Key Points** \n\- Robots.txt files are cached locally for efficiency. \n\- The cache is stored in `~/.crawl4ai/robots/robots_cache.db`. \n\- The cache has a default TTL of 7 days. \n\- If robots.txt can't be fetched, crawling is allowed. \n\- A 403 status code is returned if the URL is disallowed.\n\n
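As a small, hedged sketch (the URLs are placeholders), you can combine `check_robots_txt` with the 403 behaviour above to skip disallowed pages in a simple loop:

```python
# Sketch only: crawl several URLs and report the ones robots.txt disallows,
# using only check_robots_txt and the documented 403 status code.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(check_robots_txt=True)
    urls = ["https://example.com/", "https://example.com/private"]  # placeholders
    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url, config=config)
            if not result.success and result.status_code == 403:
                print(f"[SKIPPED] {url} is disallowed by robots.txt")
            elif result.success:
                print(f"[OK] {url} ({len(result.html)} bytes of HTML)")

if __name__ == "__main__":
    asyncio.run(main())
```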
* * *\n\n## Putting It All Together\n\nHere's a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you'd tailor each setting to your project's
needs.\n\n`import os, asyncio from base64 import b64decode
from crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode async def main(): # 1.
Browser config with proxy + headless browser_cfg =
BrowserConfig( proxy_config={ \"server\":
\"http://proxy.example.com:8080\", \"username\":
\"myuser\", \"password\": \"mypass\", },
headless=True, ) # 2. Crawler config with PDF,
screenshot, SSL, custom headers, and ignoring caches
crawler_cfg = CrawlerRunConfig( pdf=True,
screenshot=True, fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS, headers={\"Accept-
Language\": \"en-US,en;q=0.8\"}, storage_state=
\"my_storage.json\", # Reuse session from a previous sign-in
verbose=True, ) # 3. Crawl async with
AsyncWebCrawler(config=browser_cfg) as crawler: result
= await crawler.arun( url =
\"https://secure.example.com/protected\",
config=crawler_cfg ) if result.success:
print(\"[OK] Crawled the secure page. Links found:\",
len(result.links.get(\"internal\", []))) # Save
PDF & screenshot if result.pdf:
with open(\"result.pdf\", \"wb\") as f:
f.write(b64decode(result.pdf)) if
result.screenshot: with open(\"result.png\",
\"wb\") as f:
f.write(b64decode(result.screenshot)) # Check SSL
cert if result.ssl_certificate:
print(\"SSL Issuer CN:\",
result.ssl_certificate.issuer.get(\"CN\", \"\")) else:
print(\"[ERROR]\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n* * *\n\n##
Conclusion & Next Steps\n\nYou've now explored several
**advanced** features:\n\n* **Proxy Usage**\n* **PDF &
Screenshot** capturing for large or critical pages\n* **SSL
Certificate** retrieval & exporting\n* **Custom Headers**
for language or specialized requests\n* **Session
Persistence** via storage state\n* **Robots.txt Compliance**
\n\nWith these power tools, you can build robust scraping
workflows that mimic real user behavior, handle secure sites,
capture detailed snapshots, and manage sessions across
multiple runs, streamlining your entire data collection
pipeline.\n\n**Last Updated**: 2025-01-01",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/lazy-loading/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/lazy-
loading/",
"loadedTime": "2025-03-05T23:16:55.836Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/lazy-
loading/",
"title": "Lazy Loading - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:54 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"87018d147bd59fa8d52465700eb6d990\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Lazy Loading - Crawl4AI Documentation
(v0.5.x)\nHandling Lazy-Loaded Images\nMany websites now load
images lazily as you scroll. If you need to ensure they appear
in your final crawl (and in result.media), consider:\n1.
wait_for_images=True – Wait for images to fully load.\n2. scan_full_page – Force the crawler to scroll the entire page, triggering lazy loads.\n3. scroll_delay – Add small
delays between scroll steps. \nNote: If the site requires
multiple “Load More” triggers or complex interactions, see
the Page Interaction docs.\nExample: Ensuring Lazy Images
Appear\nimport asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, BrowserConfig from crawl4ai.async_configs
import CacheMode async def main(): config =
CrawlerRunConfig( # Force the crawler to wait until images are
fully loaded wait_for_images=True, # Option 1: If you want to
automatically scroll the page to load images
scan_full_page=True, # Tells the crawler to try scrolling the
entire page scroll_delay=0.5, # Delay (seconds) between scroll
steps # Option 2: If the site uses a 'Load More' or JS
triggers for images, # you can also specify js_code or
wait_for logic here. cache_mode=CacheMode.BYPASS,
verbose=True ) async with
AsyncWebCrawler(config=BrowserConfig(headless=True)) as
crawler: result = await
crawler.arun(\"https://www.example.com/gallery\",
config=config) if result.success: images =
result.media.get(\"images\", []) print(\"Images found:\",
len(images)) for i, img in enumerate(images[:5]): print(f
\"[Image {i}] URL: {img['src']}, Score:
{img.get('score','N/A')}\") else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \nExplanation:\nwait_for_images=True\nThe
crawler tries to ensure images have finished loading before
finalizing the HTML. \nscan_full_page=True\nTells the crawler
to attempt scrolling from top to bottom. Each scroll step
helps trigger lazy loading. \nscroll_delay=0.5\nPause half a
second between each scroll step. Helps the site load images
before continuing.\nWhen to Use:\nLazy-Loading: If images
appear only when the user scrolls into view, scan_full_page +
scroll_delay helps the crawler see them. \nHeavier Pages: If a
page is extremely long, be mindful that scanning the entire
page can be slow. Adjust scroll_delay or the max scroll steps
as needed.\nYou can still combine lazy-load logic with the
usual exclude_external_images, exclude_domains, or link
filtration:\nconfig = CrawlerRunConfig( wait_for_images=True,
scan_full_page=True, scroll_delay=0.5, # Filter out external
images if you only want local ones
exclude_external_images=True, # Exclude certain domains for
links exclude_domains=[\"spammycdn.com\"], ) \nThis approach
ensures you see all images from the main domain while ignoring
external ones, and the crawler physically scrolls the entire
page so that lazy-loading triggers.\nTips & Troubleshooting
\n1. Long Pages\n- Setting scan_full_page=True on extremely
long or infinite-scroll pages can be resource-intensive.\n-
Consider using hooks or specialized logic to load specific
sections or “Load More” triggers repeatedly.\n2. Mixed
Image Behavior\n- Some sites load images in batches as you
scroll. If you're missing images, increase your scroll_delay
or call multiple partial scrolls in a loop with JS code or
hooks.\n3. Combining with Dynamic Wait\n- If the site has a
placeholder that only changes to a real image after a certain
event, you might do wait_for=\"css:img.loaded\" or a custom JS
wait_for.\n4. Caching\n- If cache_mode is enabled, repeated
crawls might skip some network fetches. If you suspect caching
is missing new images, set cache_mode=CacheMode.BYPASS for
fresh fetches.\nWith lazy-loading support, wait_for_images,
and scan_full_page settings, you can capture the entire
gallery or feed of images you expect, even if the site only
loads them as the user scrolls. Combine these with the
standard media filtering and domain exclusion for a complete
link & media handling strategy.",
"markdown": "# Lazy Loading - Crawl4AI Documentation
(v0.5.x)\n\n## Handling Lazy-Loaded Images\n\nMany websites
now load images **lazily** as you scroll. If you need to
ensure they appear in your final crawl (and in
`result.media`), consider:\n\n1. **`wait_for_images=True`** – Wait for images to fully load. \n2. **`scan_full_page`** – Force the crawler to scroll the entire page, triggering lazy loads. \n3. **`scroll_delay`** – Add small delays between scroll steps.\n\n**Note**: If the site requires multiple “Load More” triggers or complex
interactions, see the [Page Interaction docs]
(https://crawl4ai.com/mkdocs/core/page-interaction/).\n\n###
Example: Ensuring Lazy Images Appear\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
BrowserConfig from crawl4ai.async_configs import CacheMode
async def main(): config = CrawlerRunConfig( #
Force the crawler to wait until images are fully loaded
wait_for_images=True, # Option 1: If you want to
automatically scroll the page to load images
scan_full_page=True, # Tells the crawler to try scrolling the
entire page scroll_delay=0.5, # Delay (seconds)
between scroll steps # Option 2: If the site uses a
'Load More' or JS triggers for images, # you can also
specify js_code or wait_for logic here.
cache_mode=CacheMode.BYPASS, verbose=True )
async with
AsyncWebCrawler(config=BrowserConfig(headless=True)) as
crawler: result = await
crawler.arun(\"https://www.example.com/gallery\",
config=config) if result.success: images
= result.media.get(\"images\", []) print(\"Images
found:\", len(images)) for i, img in
enumerate(images[:5]): print(f\"[Image {i}]
URL: {img['src']}, Score: {img.get('score','N/A')}\")
else: print(\"Error:\", result.error_message) if
__name__ == \"__main__\": asyncio.run(main())`\n
\n**Explanation**:\n\n* **`wait_for_images=True`** \n
The crawler tries to ensure images have finished loading
before finalizing the HTML.\n* **`scan_full_page=True`** \n
Tells the crawler to attempt scrolling from top to bottom.
Each scroll step helps trigger lazy loading.\n*
**`scroll_delay=0.5`** \n Pause half a second between each
scroll step. Helps the site load images before continuing.\n
\n**When to Use**:\n\n* **Lazy-Loading**: If images appear
only when the user scrolls into view, `scan_full_page` +
`scroll_delay` helps the crawler see them.\n* **Heavier
Pages**: If a page is extremely long, be mindful that scanning
the entire page can be slow. Adjust `scroll_delay` or the max
scroll steps as needed.\n\n* * *\n\nYou can still combine
**lazy-load** logic with the usual **exclude\\_external
\\_images**, **exclude\\_domains**, or link filtration:\n
\n`config = CrawlerRunConfig( wait_for_images=True,
scan_full_page=True, scroll_delay=0.5, # Filter out
external images if you only want local ones
exclude_external_images=True, # Exclude certain domains
for links exclude_domains=[\"spammycdn.com\"], )`\n\nThis
approach ensures you see **all** images from the main domain
while ignoring external ones, and the crawler physically
scrolls the entire page so that lazy-loading triggers.\n\n* *
*\n\n## Tips & Troubleshooting\n\n1. **Long Pages** \n\\-
Setting `scan_full_page=True` on extremely long or infinite-
scroll pages can be resource-intensive. \n\\- Consider using
[hooks](https://crawl4ai.com/mkdocs/core/page-interaction/) or
specialized logic to load specific sections or “Load More”
triggers repeatedly.\n\n2. **Mixed Image Behavior** \n\\-
Some sites load images in batches as you scroll. If you're
missing images, increase your `scroll_delay` or call multiple
partial scrolls in a loop with JS code or hooks.\n\n3. **Combining with Dynamic Wait** \n\- If the site has a
placeholder that only changes to a real image after a certain
event, you might do `wait_for=\"css:img.loaded\"` or a custom
JS `wait_for` (see the sketch after these tips).\n\n4. **Caching** \n\- If `cache_mode` is
enabled, repeated crawls might skip some network fetches. If
you suspect caching is missing new images, set
`cache_mode=CacheMode.BYPASS` for fresh fetches.\n\n* * *\n
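Here is a small sketch for tip 3 above. It assumes the site marks fully loaded images with an `img.loaded` class (the selector and URL are placeholders); the remaining options are the ones shown earlier on this page.

```python
# Sketch only: wait for a "real" image to replace its placeholder before capturing.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import CacheMode

async def main():
    config = CrawlerRunConfig(
        wait_for="css:img.loaded",   # placeholder selector for a fully loaded image
        wait_for_images=True,
        scan_full_page=True,
        scroll_delay=0.5,
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com/gallery", config=config)
        if result.success:
            print("Images found:", len(result.media.get("images", [])))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```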
\nWith **lazy-loading** support, **wait\\_for\\_images**, and
**scan\\_full\\_page** settings, you can capture the entire
gallery or feed of images you expect, even if the site only
loads them as the user scrolls. Combine these with the
standard media filtering and domain exclusion for a complete
link & media handling strategy.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/file-
downloading/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/file-
downloading/",
"loadedTime": "2025-03-05T23:16:56.540Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/file-
downloading/",
"title": "File Downloading - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:16:54 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"5e47067c46ff1457e024bd3a4538b53e\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "File Downloading - Crawl4AI Documentation
(v0.5.x)\nDownload Handling in Crawl4AI\nThis guide explains
how to use Crawl4AI to handle file downloads during crawling.
You'll learn how to trigger downloads, specify download
locations, and access downloaded files.\nEnabling Downloads
\nTo enable downloads, set the accept_downloads parameter in
the BrowserConfig object and pass it to the crawler.\nimport asyncio from crawl4ai.async_configs import BrowserConfig, AsyncWebCrawler
async def main(): config =
BrowserConfig(accept_downloads=True) # Enable downloads
globally async with AsyncWebCrawler(config=config) as crawler:
# ... your crawling logic ... asyncio.run(main()) \nSpecifying
Download Location\nSpecify the download directory using the
downloads_path attribute in the BrowserConfig object. If not
provided, Crawl4AI defaults to creating a \"downloads\"
directory inside the .crawl4ai folder in your home directory.
\nfrom crawl4ai.async_configs import BrowserConfig import os
downloads_path = os.path.join(os.getcwd(), \"my_downloads\") #
Custom download path os.makedirs(downloads_path,
exist_ok=True) config = BrowserConfig(accept_downloads=True,
downloads_path=downloads_path) async def main(): async with
AsyncWebCrawler(config=config) as crawler: result = await
crawler.arun(url=\"https://example.com\") # ... \nTriggering
Downloads\nDownloads are typically triggered by user
interactions on a web page, such as clicking a download
button. Use js_code in CrawlerRunConfig to simulate these
actions and wait_for to allow sufficient time for downloads to
start.\nfrom crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig( js_code=\"\"\" const downloadLink =
document.querySelector('a[href$=\".exe\"]'); if (downloadLink)
{ downloadLink.click(); } \"\"\", wait_for=5 # Wait 5 seconds
for the download to start ) result = await crawler.arun(url=
\"https://www.python.org/downloads/\", config=config)
\nAccessing Downloaded Files\nThe downloaded_files attribute
of the CrawlResult object contains paths to downloaded files.
\nif result.downloaded_files: print(\"Downloaded files:\") for
file_path in result.downloaded_files: print(f\"-
{file_path}\") file_size = os.path.getsize(file_path) print(f
\"- File size: {file_size} bytes\") else: print(\"No files
downloaded.\") \nExample: Downloading Multiple Files\nfrom
crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import os from pathlib import Path async def
download_multiple_files(url: str, download_path: str): config
= BrowserConfig(accept_downloads=True,
downloads_path=download_path) async with
AsyncWebCrawler(config=config) as crawler: run_config =
CrawlerRunConfig( js_code=\"\"\" const downloadLinks =
document.querySelectorAll('a[download]'); for (const link of
downloadLinks) { link.click(); // Delay between clicks await
new Promise(r => setTimeout(r, 2000)); } \"\"\", wait_for=10 #
Wait for all downloads to start ) result = await
crawler.arun(url=url, config=run_config) if
result.downloaded_files: print(\"Downloaded files:\") for file
in result.downloaded_files: print(f\"- {file}\") else:
print(\"No files downloaded.\") # Usage download_path =
os.path.join(Path.home(), \".crawl4ai\", \"downloads\")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files(\"https://www.python.org/d
ownloads/windows/\", download_path)) \nImportant
Considerations\nBrowser Context: Downloads are managed within
the browser context. Ensure js_code correctly targets the
download triggers on the webpage.\nTiming: Use wait_for in
CrawlerRunConfig to manage download timing.\nError Handling:
Handle errors to manage failed downloads or incorrect paths
gracefully.\nSecurity: Scan downloaded files for potential
security threats before use.\nThis guide uses BrowserConfig and CrawlerRunConfig consistently for all download-related configuration.\",
"markdown": "# File Downloading - Crawl4AI Documentation
(v0.5.x)\n\n## Download Handling in Crawl4AI\n\nThis guide
explains how to use Crawl4AI to handle file downloads during
crawling. You'll learn how to trigger downloads, specify
download locations, and access downloaded files.\n\n##
Enabling Downloads\n\nTo enable downloads, set the
`accept_downloads` parameter in the `BrowserConfig` object and
pass it to the crawler.\n\n`import asyncio from crawl4ai.async_configs import BrowserConfig, AsyncWebCrawler async def main(): config =
BrowserConfig(accept_downloads=True) # Enable downloads
globally async with AsyncWebCrawler(config=config) as
crawler: # ... your crawling logic ...
asyncio.run(main())`\n\n## Specifying Download Location\n
\nSpecify the download directory using the `downloads_path`
attribute in the `BrowserConfig` object. If not provided,
Crawl4AI defaults to creating a \"downloads\" directory inside
the `.crawl4ai` folder in your home directory.\n\n`from
crawl4ai.async_configs import BrowserConfig import os
downloads_path = os.path.join(os.getcwd(), \"my_downloads\")
# Custom download path os.makedirs(downloads_path,
exist_ok=True) config = BrowserConfig(accept_downloads=True,
downloads_path=downloads_path) async def main(): async
with AsyncWebCrawler(config=config) as crawler: result
= await crawler.arun(url=\"https://example.com\")
# ...`\n\n## Triggering Downloads\n\nDownloads are typically
triggered by user interactions on a web page, such as clicking
a download button. Use `js_code` in `CrawlerRunConfig` to
simulate these actions and `wait_for` to allow sufficient time
for downloads to start.\n\n`from crawl4ai.async_configs import
CrawlerRunConfig config = CrawlerRunConfig( js_code=
\"\"\" const downloadLink =
document.querySelector('a[href$=\".exe\"]'); if
(downloadLink) { downloadLink.click(); }
\"\"\", wait_for=5 # Wait 5 seconds for the download to
start ) result = await crawler.arun(url=
\"https://www.python.org/downloads/\", config=config)`\n\n##
Accessing Downloaded Files\n\nThe `downloaded_files` attribute
of the `CrawlResult` object contains paths to downloaded
files.\n\n`if result.downloaded_files: print(\"Downloaded
files:\") for file_path in result.downloaded_files:
print(f\"- {file_path}\") file_size =
os.path.getsize(file_path) print(f\"- File size:
{file_size} bytes\") else: print(\"No files downloaded.
\")`\n\n## Example: Downloading Multiple Files\n\n`from
crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
import os from pathlib import Path async def
download_multiple_files(url: str, download_path: str):
config = BrowserConfig(accept_downloads=True,
downloads_path=download_path) async with
AsyncWebCrawler(config=config) as crawler: run_config
= CrawlerRunConfig( js_code=\"\"\"
const downloadLinks =
document.querySelectorAll('a[download]'); for
(const link of downloadLinks)
{ link.click(); //
Delay between clicks await new Promise(r
=> setTimeout(r, 2000)); }
\"\"\", wait_for=10 # Wait for all downloads to
start ) result = await crawler.arun(url=url,
config=run_config) if result.downloaded_files:
print(\"Downloaded files:\") for file in
result.downloaded_files: print(f\"- {file}\")
else: print(\"No files downloaded.\") # Usage
download_path = os.path.join(Path.home(), \".crawl4ai\",
\"downloads\") os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files(\"https://www.python.org/d
ownloads/windows/\", download_path))`\n\n## Important
Considerations\n\n* **Browser Context:** Downloads are
managed within the browser context. Ensure `js_code` correctly
targets the download triggers on the webpage.\n* **Timing:**
Use `wait_for` in `CrawlerRunConfig` to manage download
timing.\n* **Error Handling:** Handle errors to manage
failed downloads or incorrect paths gracefully.\n*
**Security:** Scan downloaded files for potential security
threats before use.\n\n
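For the error-handling point above, a small defensive helper can be useful. This sketch is not part of the original guide; it only relies on the documented `downloaded_files` list plus standard-library checks.

```python
# Sketch only: verify each reported download exists and is non-empty before using it.
import os

def validate_downloads(result):
    """Return the paths from result.downloaded_files that look usable."""
    usable = []
    for file_path in (result.downloaded_files or []):
        try:
            if os.path.getsize(file_path) > 0:
                usable.append(file_path)
            else:
                print(f"[WARN] Empty download: {file_path}")
        except OSError as exc:
            print(f"[WARN] Could not access {file_path}: {exc}")
    return usable
```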
\n\nThis guide uses `BrowserConfig` and `CrawlerRunConfig` consistently for all download-related configuration.\",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/hooks-auth/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/hooks-
auth/",
"loadedTime": "2025-03-05T23:17:02.161Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/hooks-
auth/",
"title": "Hooks & Auth - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:01 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"b1b6b412b0b5f3ad9308f4f05ef40bf2\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Hooks & Auth - Crawl4AI Documentation
(v0.5.x)\nHooks & Auth in AsyncWebCrawler\nCrawl4AI's hooks let you customize the crawler at specific points in the pipeline:\n1. on_browser_created – After browser creation.\n2. on_page_context_created – After a new context & page are created.\n3. before_goto – Just before navigating to a page.\n4. after_goto – Right after navigation completes.\n5. on_user_agent_updated – Whenever the user agent changes.\n6. on_execution_started – Once custom JavaScript execution begins.\n7. before_retrieve_html – Just before the crawler retrieves final HTML.\n8. before_return_html – Right before
returning the HTML content.\nImportant: Avoid heavy tasks in
on_browser_created since you don't yet have a page context.
If you need to log in, do so in on_page_context_created.\nnote
\"Important Hook Usage Warning\" Avoid Misusing Hooks: Do not
manipulate page objects in the wrong hook or at the wrong
time, as it can crash the pipeline or produce incorrect
results. A common mistake is attempting to handle
authentication prematurely, such as creating or closing pages
in on_browser_created. \nUse the Right Hook for Auth: If you
need to log in or set tokens, use on_page_context_created.
This ensures you have a valid page/context to work with,
without disrupting the main crawling flow.\nIdentity-Based
Crawling: For robust auth, consider identity-based crawling
(or passing a session ID) to preserve state. Run your initial
login steps in a separate, well-defined process, then feed
that session to your main crawl, rather than shoehorning
complex authentication into early hooks. Check out Identity-
Based Crawling for more details.\nBe Cautious: Overwriting or
removing elements in the wrong hook can compromise the final
crawl. Keep hooks focused on smaller tasks (like route
filters, custom headers), and let your main logic (crawling,
data extraction) proceed normally.\nBelow is an example
demonstration.\nExample: Using Hooks in AsyncWebCrawler
\nimport asyncio import json from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from playwright.async_api import Page, BrowserContext async
def main(): print(\"🔗 Hooks Example: Demonstrating
recommended usage\") # 1) Configure the browser browser_config
= BrowserConfig( headless=True, verbose=True ) # 2) Configure
the crawler run crawler_run_config =
CrawlerRunConfig( js_code=\"window.scrollTo(0,
document.body.scrollHeight);\", wait_for=\"body\",
cache_mode=CacheMode.BYPASS ) # 3) Create the crawler instance
crawler = AsyncWebCrawler(config=browser_config) # # Define
Hook Functions # async def on_browser_created(browser,
**kwargs): # Called once the browser instance is created (but
no pages or contexts yet) print(\"[HOOK] on_browser_created -
Browser created successfully!\") # Typically, do minimal setup
here if needed return browser async def
on_page_context_created(page: Page, context: BrowserContext,
**kwargs): # Called right after a new page + context are
created (ideal for auth or route config). print(\"[HOOK]
on_page_context_created - Setting up page & context.\") #
Example 1: Route filtering (e.g., block images) async def
route_filter(route): if route.request.resource_type == \"image
\": print(f\"[HOOK] Blocking image request:
{route.request.url}\") await route.abort() else: await
route.continue_() await context.route(\"**\", route_filter) #
Example 2: (Optional) Simulate a login scenario # (We do NOT
create or close pages here, just do quick steps if needed) #
e.g., await page.goto(\"https://example.com/login\") # e.g.,
await page.fill(\"input[name='username']\", \"testuser\") #
e.g., await page.fill(\"input[name='password']\",
\"password123\") # e.g., await
page.click(\"button[type='submit']\") # e.g., await
page.wait_for_selector(\"#welcome\") # e.g., await
context.add_cookies([...]) # Then continue # Example 3: Adjust
the viewport await page.set_viewport_size({\"width\": 1080,
\"height\": 600}) return page async def before_goto( page:
Page, context: BrowserContext, url: str, **kwargs ): # Called
before navigating to each URL. print(f\"[HOOK] before_goto -
About to navigate: {url}\") # e.g., inject custom headers
await page.set_extra_http_headers({ \"Custom-Header\": \"my-
value\" }) return page async def after_goto( page: Page,
context: BrowserContext, url: str, response, **kwargs ): #
Called after navigation completes. print(f\"[HOOK]
after_goto - Successfully loaded: {url}\") # e.g., wait for a
certain element if we want to verify try: await
page.wait_for_selector('.content', timeout=1000)
print(\"[HOOK] Found .content element!\") except:
print(\"[HOOK] .content not found, continuing anyway.\")
return page async def on_user_agent_updated( page: Page,
context: BrowserContext, user_agent: str, **kwargs ): # Called
whenever the user agent updates. print(f\"[HOOK]
on_user_agent_updated - New user agent: {user_agent}\") return
page async def on_execution_started(page: Page, context:
BrowserContext, **kwargs): # Called after custom JavaScript
execution begins. print(\"[HOOK] on_execution_started - JS
code is running!\") return page async def
before_retrieve_html(page: Page, context: BrowserContext,
**kwargs): # Called before final HTML retrieval.
print(\"[HOOK] before_retrieve_html - We can do final actions
\") # Example: Scroll again await
page.evaluate(\"window.scrollTo(0,
document.body.scrollHeight);\") return page async def
before_return_html( page: Page, context: BrowserContext, html:
str, **kwargs ): # Called just before returning the HTML in
the result. print(f\"[HOOK] before_return_html - HTML length:
{len(html)}\") return page # # Attach Hooks #
crawler.crawler_strategy.set_hook(\"on_browser_created\",
on_browser_created)
crawler.crawler_strategy.set_hook( \"on_page_context_created
\", on_page_context_created )
crawler.crawler_strategy.set_hook(\"before_goto\",
before_goto) crawler.crawler_strategy.set_hook(\"after_goto\",
after_goto)
crawler.crawler_strategy.set_hook( \"on_user_agent_updated\",
on_user_agent_updated )
crawler.crawler_strategy.set_hook( \"on_execution_started\",
on_execution_started )
crawler.crawler_strategy.set_hook( \"before_retrieve_html\",
before_retrieve_html )
crawler.crawler_strategy.set_hook( \"before_return_html\",
before_return_html ) await crawler.start() # 4) Run the
crawler on an example page url = \"https://example.com\"
result = await crawler.arun(url, config=crawler_run_config) if
result.success: print(\"\\nCrawled URL:\", result.url)
print(\"HTML length:\", len(result.html)) else: print(\"Error:
\", result.error_message) await crawler.close() if __name__ ==
\"__main__\": asyncio.run(main()) \nHook Lifecycle Summary\n1.
on_browser_created:\n- Browser is up, but no pages or contexts
yet.\n- Light setup only; don't try to open or close pages
here (that belongs in on_page_context_created).\n2.
on_page_context_created:\n- Perfect for advanced auth or route
blocking.\n- You have a page + context ready but haven't
navigated to the target URL yet.\n3. before_goto:\n- Right
before navigation. Typically used for setting custom headers
or logging the target URL.\n4. after_goto:\n- After page
navigation is done. Good place for verifying content or
waiting on essential elements. \n5. on_user_agent_updated:\n-
Whenever the user agent changes (for stealth or different UA
modes).\n6. on_execution_started:\n- If you set js_code or run
custom scripts, this runs once your JS is about to start.\n7.
before_retrieve_html:\n- Just before the final HTML snapshot
is taken. Often you do a final scroll or lazy-load triggers
here.\n8. before_return_html:\n- The last hook before
returning HTML to the CrawlResult. Good for logging HTML
length or minor modifications.\nWhen to Handle Authentication
\nRecommended: Use on_page_context_created if you need to:
\nNavigate to a login page or fill forms\nSet cookies or
localStorage tokens\nBlock resource routes to avoid ads\nThis
ensures the newly created context is under your control before
arun() navigates to the main URL.\nAdditional Considerations
\nSession Management: If you want multiple arun() calls to
reuse a single session, pass session_id= in your
CrawlerRunConfig. Hooks remain the same (see the sketch below). \nPerformance: Hooks
can slow down crawling if they do heavy tasks. Keep them
concise. \nError Handling: If a hook fails, the overall crawl
might fail. Catch exceptions or handle them gracefully.
\nConcurrency: If you run arun_many(), each URL triggers these
hooks in parallel. Ensure your hooks are thread/async-safe.
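A minimal sketch of that session reuse (the URLs are placeholders; it assumes only the `session_id` parameter of `CrawlerRunConfig` mentioned above):

```python
# Sketch only: reuse one browser session across arun() calls via session_id.
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(session_id="my_session")  # same id => same session reused
    async with AsyncWebCrawler() as crawler:
        # First call: hooks fire, cookies/localStorage are established
        await crawler.arun("https://example.com/login", config=config)
        # Second call: same session, so any auth state carries over
        result = await crawler.arun("https://example.com/dashboard", config=config)
        print("Second page success:", result.success)

if __name__ == "__main__":
    asyncio.run(main())
```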
\nConclusion\nHooks provide fine-grained control over:
\nBrowser creation (light tasks only)\nPage and context
creation (auth, route blocking)\nNavigation phases\nFinal HTML
retrieval\nFollow the recommended usage: - Login or advanced
tasks in on_page_context_created\n- Custom headers or logs in
before_goto / after_goto\n- Scrolling or final checks in
before_retrieve_html / before_return_html",
"markdown": "# Hooks & Auth - Crawl4AI Documentation
(v0.5.x)\n\n## Hooks & Auth in AsyncWebCrawler\n\nCrawl4AI's
**hooks** let you customize the crawler at specific points in
the pipeline:\n\n1. **`on_browser_created`** – After browser creation. \n2. **`on_page_context_created`** – After a new context & page are created. \n3. **`before_goto`** – Just before navigating to a page. \n4. **`after_goto`** – Right after navigation completes. \n5. **`on_user_agent_updated`** – Whenever the user agent changes. \n6. **`on_execution_started`** – Once custom JavaScript execution begins. \n7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML. \n8. **`before_return_html`** – Right before returning the HTML content.\n\n**Important**: Avoid heavy tasks in `on_browser_created` since you don't yet have a page context. If you need to _log in_, do so in
**`on_page_context_created`**.\n\n> note \"Important Hook
Usage Warning\" **Avoid Misusing Hooks**: Do not manipulate
page objects in the wrong hook or at the wrong time, as it can
crash the pipeline or produce incorrect results. A common
mistake is attempting to handle authentication prematurely, such as creating or closing pages in `on_browser_created`.\n>
\n> **Use the Right Hook for Auth**: If you need to log in or
set tokens, use `on_page_context_created`. This ensures you
have a valid page/context to work with, without disrupting the
main crawling flow.\n> \n> **Identity-Based Crawling**: For
robust auth, consider identity-based crawling (or passing a
session ID) to preserve state. Run your initial login steps in
a separate, well-defined process, then feed that session to
your main crawl, rather than shoehorning complex
authentication into early hooks. Check out [Identity-Based
Crawling](https://crawl4ai.com/mkdocs/advanced/identity-based-
crawling/) for more details.\n> \n> **Be Cautious**:
Overwriting or removing elements in the wrong hook can
compromise the final crawl. Keep hooks focused on smaller
tasks (like route filters, custom headers), and let your main
logic (crawling, data extraction) proceed normally.\n\nBelow
is an example demonstration.\n\n* * *\n\n## Example: Using
Hooks in AsyncWebCrawler\n\n`import asyncio import json from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode from playwright.async_api import
Page, BrowserContext async def main(): print(\"🔗 Hooks
Example: Demonstrating recommended usage\") # 1)
Configure the browser browser_config =
BrowserConfig( headless=True,
verbose=True ) # 2) Configure the crawler run
crawler_run_config = CrawlerRunConfig( js_code=
\"window.scrollTo(0, document.body.scrollHeight);\",
wait_for=\"body\", cache_mode=CacheMode.BYPASS )
# 3) Create the crawler instance crawler =
AsyncWebCrawler(config=browser_config) # # Define
Hook Functions # async def
on_browser_created(browser, **kwargs): # Called once
the browser instance is created (but no pages or contexts yet)
print(\"[HOOK] on_browser_created - Browser created
successfully!\") # Typically, do minimal setup here if
needed return browser async def
on_page_context_created(page: Page, context: BrowserContext,
**kwargs): # Called right after a new page + context
are created (ideal for auth or route config).
print(\"[HOOK] on_page_context_created - Setting up page &
context.\") # Example 1: Route filtering (e.g., block
images) async def route_filter(route): if
route.request.resource_type == \"image\":
print(f\"[HOOK] Blocking image request: {route.request.url}\")
await route.abort() else: await
route.continue_() await context.route(\"**\",
route_filter) # Example 2: (Optional) Simulate a
login scenario # (We do NOT create or close pages
here, just do quick steps if needed) # e.g., await
page.goto(\"https://example.com/login\") # e.g., await
page.fill(\"input[name='username']\", \"testuser\") #
e.g., await page.fill(\"input[name='password']\",
\"password123\") # e.g., await
page.click(\"button[type='submit']\") # e.g., await
page.wait_for_selector(\"#welcome\") # e.g., await
context.add_cookies([...]) # Then continue #
Example 3: Adjust the viewport await
page.set_viewport_size({\"width\": 1080, \"height\": 600})
return page async def before_goto( page: Page,
context: BrowserContext, url: str, **kwargs ): #
Called before navigating to each URL. print(f\"[HOOK]
before_goto - About to navigate: {url}\") # e.g.,
inject custom headers await
page.set_extra_http_headers({ \"Custom-Header\":
\"my-value\" }) return page async def
after_goto( page: Page, context: BrowserContext,
url: str, response, **kwargs ): # Called after
navigation completes. print(f\"[HOOK] after_goto -
Successfully loaded: {url}\") # e.g., wait for a
certain element if we want to verify try:
await page.wait_for_selector('.content', timeout=1000)
print(\"[HOOK] Found .content element!\") except:
print(\"[HOOK] .content not found, continuing anyway.\")
return page async def
on_user_agent_updated( page: Page, context:
BrowserContext, user_agent: str, **kwargs ):
# Called whenever the user agent updates. print(f
\"[HOOK] on_user_agent_updated - New user agent:
{user_agent}\") return page async def
on_execution_started(page: Page, context: BrowserContext,
**kwargs): # Called after custom JavaScript execution
begins. print(\"[HOOK] on_execution_started - JS code
is running!\") return page async def
before_retrieve_html(page: Page, context: BrowserContext,
**kwargs): # Called before final HTML retrieval.
print(\"[HOOK] before_retrieve_html - We can do final actions
\") # Example: Scroll again await
page.evaluate(\"window.scrollTo(0,
document.body.scrollHeight);\") return page async
def before_return_html( page: Page, context:
BrowserContext, html: str, **kwargs ): # Called
just before returning the HTML in the result. print(f
\"[HOOK] before_return_html - HTML length: {len(html)}\")
return page # # Attach Hooks #
crawler.crawler_strategy.set_hook(\"on_browser_created\",
on_browser_created)
crawler.crawler_strategy.set_hook( \"on_page_context_c
reated\", on_page_context_created )
crawler.crawler_strategy.set_hook(\"before_goto\",
before_goto)
crawler.crawler_strategy.set_hook(\"after_goto\", after_goto)
crawler.crawler_strategy.set_hook( \"on_user_agent_upd
ated\", on_user_agent_updated )
crawler.crawler_strategy.set_hook( \"on_execution_star
ted\", on_execution_started )
crawler.crawler_strategy.set_hook( \"before_retrieve_h
tml\", before_retrieve_html )
crawler.crawler_strategy.set_hook( \"before_return_htm
l\", before_return_html ) await crawler.start()
# 4) Run the crawler on an example page url =
\"https://example.com\" result = await crawler.arun(url,
config=crawler_run_config) if result.success:
print(\"\\nCrawled URL:\", result.url) print(\"HTML
length:\", len(result.html)) else: print(\"Error:
\", result.error_message) await crawler.close() if
__name__ == \"__main__\": asyncio.run(main())`\n\n* * *\n
\n## Hook Lifecycle Summary\n\n1. **`on_browser_created`**:
\n\\- Browser is up, but **no** pages or contexts yet. \n\\-
Light setup only; don't try to open or close pages here (that belongs in `on_page_context_created`).\n\n2. **`on_page_context_created`**: \n\- Perfect for advanced **auth** or route blocking. \n\- You have a **page** + **context** ready but haven't navigated to the target URL
yet.\n\n3. **`before_goto`**: \n\\- Right before
navigation. Typically used for setting **custom headers** or
logging the target URL.\n\n4. **`after_goto`**: \n\\- After
page navigation is done. Good place for verifying content or
waiting on essential elements.\n\n5. **`on_user_agent_updated`**: \n\- Whenever the user agent changes (for stealth or different UA modes).\n\n6. **`on_execution_started`**: \n\- If you set `js_code` or
run custom scripts, this runs once your JS is about to start.
\n\n7. **`before_retrieve_html`**: \n\\- Just before the
final HTML snapshot is taken. Often you do a final scroll or
lazy-load triggers here.\n\n8. **`before_return_html`**: \n
\\- The last hook before returning HTML to the `CrawlResult`.
Good for logging HTML length or minor modifications.\n\n* * *
\n\n## When to Handle Authentication\n\n**Recommended**: Use
**`on_page_context_created`** if you need to:\n\n* Navigate
to a login page or fill forms\n* Set cookies or localStorage
tokens\n* Block resource routes to avoid ads\n\nThis ensures
the newly created context is under your control **before**
`arun()` navigates to the main URL.\n\n
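As a concrete, hedged sketch of the list above (cookie values, token, and URLs are placeholders), you can seed auth state inside `on_page_context_created` before the main navigation:

```python
# Sketch only: seed an auth cookie and a localStorage token in the hook,
# so the context is "logged in" before arun() navigates to the target URL.
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from playwright.async_api import Page, BrowserContext

async def seed_auth(page: Page, context: BrowserContext, **kwargs):
    # Placeholder cookie/token values; replace with your real session data.
    await context.add_cookies([{
        "name": "session", "value": "abcd1234",
        "domain": "example.com", "path": "/",
    }])
    await page.goto("https://example.com")  # same-origin page so localStorage applies
    await page.evaluate("localStorage.setItem('token', 'my_auth_token')")
    return page

async def main():
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    crawler.crawler_strategy.set_hook("on_page_context_created", seed_auth)
    await crawler.start()
    result = await crawler.arun("https://example.com/protected", config=CrawlerRunConfig())
    print("Success:", result.success)
    await crawler.close()

if __name__ == "__main__":
    asyncio.run(main())
```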
\n\n* * *\n\n## Additional Considerations\n\n* **Session Management**: If you want
multiple `arun()` calls to reuse a single session, pass
`session_id=` in your `CrawlerRunConfig`. Hooks remain the
same.\n* **Performance**: Hooks can slow down crawling if
they do heavy tasks. Keep them concise.\n* **Error
Handling**: If a hook fails, the overall crawl might fail.
Catch exceptions or handle them gracefully.\n*
**Concurrency**: If you run `arun_many()`, each URL triggers
these hooks in parallel. Ensure your hooks are thread/async-
safe.\n\n* * *\n\n## Conclusion\n\nHooks provide **fine-
grained** control over:\n\n* **Browser** creation (light
tasks only)\n* **Page** and **context** creation (auth,
route blocking)\n* **Navigation** phases\n* **Final HTML**
retrieval\n\nFollow the recommended usage: - **Login** or
advanced tasks in `on_page_context_created` \n\\- **Custom
headers** or logs in `before_goto` / `after_goto` \n\\-
**Scrolling** or final checks in `before_retrieve_html` /
`before_return_html`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/proxy-
security/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/proxy-
security/",
"loadedTime": "2025-03-05T23:17:02.549Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/proxy-
security/",
"title": "Proxy & Security - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:01 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"3902d02bb675557ccd3cecff674c0313\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Proxy & Security - Crawl4AI Documentation
(v0.5.x)\nBasic Proxy Setup\nSimple proxy configuration with
BrowserConfig:\nfrom crawl4ai.async_configs import
BrowserConfig # Using proxy URL browser_config =
BrowserConfig(proxy=\"http://proxy.example.com:8080\") async
with AsyncWebCrawler(config=browser_config) as crawler: result
= await crawler.arun(url=\"https://example.com\") # Using
SOCKS proxy browser_config = BrowserConfig(proxy=
\"socks5://proxy.example.com:1080\") async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun(url=\"https://example.com\")
\nAuthenticated Proxy\nUse an authenticated proxy with
BrowserConfig:\nfrom crawl4ai.async_configs import
BrowserConfig proxy_config = { \"server\":
\"http://proxy.example.com:8080\", \"username\": \"user\",
\"password\": \"pass\" } browser_config =
BrowserConfig(proxy_config=proxy_config) async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun(url=\"https://example.com\") \nRotating Proxies\nExample using a
proxy rotation service dynamically:\nfrom crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def
get_next_proxy(): # Your proxy rotation logic here return
{\"server\": \"http://next.proxy.com:8080\"} async def main():
browser_config = BrowserConfig() run_config = CrawlerRunConfig() urls = [\"https://example.com\", \"https://example.org\"] async with
AsyncWebCrawler(config=browser_config) as crawler: # For each
URL, create a new run config with different proxy for url in
urls: proxy = await get_next_proxy() # Clone the config and
update proxy - this creates a new browser context
current_config = run_config.clone(proxy_config=proxy) result =
await crawler.arun(url=url, config=current_config) if __name__
== \"__main__\": import asyncio asyncio.run(main())",
"markdown": "# Proxy & Security - Crawl4AI Documentation
(v0.5.x)\n\n## Basic Proxy Setup\n\nSimple proxy configuration
with `BrowserConfig`:\n\n`from crawl4ai.async_configs import
BrowserConfig # Using proxy URL browser_config =
BrowserConfig(proxy=\"http://proxy.example.com:8080\") async
with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=\"https://example.com\") #
Using SOCKS proxy browser_config = BrowserConfig(proxy=
\"socks5://proxy.example.com:1080\") async with
AsyncWebCrawler(config=browser_config) as crawler: result
= await crawler.arun(url=\"https://example.com\")`\n\n##
Authenticated Proxy\n\nUse an authenticated proxy with
`BrowserConfig`:\n\n`from crawl4ai.async_configs import
BrowserConfig proxy_config = { \"server\":
\"http://proxy.example.com:8080\", \"username\": \"user\",
\"password\": \"pass\" } browser_config =
BrowserConfig(proxy_config=proxy_config) async with
AsyncWebCrawler(config=browser_config) as crawler: result
= await crawler.arun(url=\"https://example.com\")`\n\n## Rotating Proxies\n\nExample
using a proxy rotation service dynamically:\n\n`from crawl4ai
import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async
def get_next_proxy(): # Your proxy rotation logic here
return {\"server\": \"http://next.proxy.com:8080\"} async def
main(): browser_config = BrowserConfig() run_config = CrawlerRunConfig() urls = [\"https://example.com\", \"https://example.org\"] async with
AsyncWebCrawler(config=browser_config) as crawler: #
For each URL, create a new run config with different proxy
for url in urls: proxy = await get_next_proxy()
# Clone the config and update proxy - this creates a new
browser context current_config =
run_config.clone(proxy_config=proxy) result =
await crawler.arun(url=url, config=current_config) if
__name__ == \"__main__\": import asyncio
asyncio.run(main())`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/ssl-
certificate/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/ssl-
certificate/",
"loadedTime": "2025-03-05T23:17:08.059Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/ssl-
certificate/",
"title": "SSL Certificate - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:07 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"a0f4b7abae4f390590a6de8f51bed518\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "SSL Certificate - Crawl4AI Documentation
(v0.5.x)\nSSLCertificate Reference\nThe SSLCertificate class
encapsulates an SSL certificate's data and allows exporting it in various formats (PEM, DER, JSON, or text). It's used
within Crawl4AI whenever you set fetch_ssl_certificate=True in
your CrawlerRunConfig. \n1. Overview\nLocation:
crawl4ai/ssl_certificate.py\nclass SSLCertificate: \"\"\"
Represents an SSL certificate with methods to export in
various formats. Main Methods: - from_url(url, timeout=10) -
from_file(file_path) - from_binary(binary_data) -
to_json(filepath=None) - to_pem(filepath=None) -
to_der(filepath=None) ... Common Properties: - issuer -
subject - valid_from - valid_until - fingerprint \"\"\"
\nTypical Use Case\nYou enable certificate fetching in your
crawl by: \nCrawlerRunConfig(fetch_ssl_certificate=True, ...)
\nAfter arun(), if result.ssl_certificate is present, it's
an instance of SSLCertificate. \nYou can read basic properties
(issuer, subject, validity) or export them in multiple
formats.\n2. Construction & Fetching\n2.1 from_url(url,
timeout=10)\nManually load an SSL certificate from a given URL
(port 443). Typically used internally, but you can call it
directly if you want:\ncert =
SSLCertificate.from_url(\"https://example.com\") if cert:
print(\"Fingerprint:\", cert.fingerprint) \n2.2
from_file(file_path)\nLoad from a file containing certificate
data in ASN.1 or DER. Rarely needed unless you have local cert
files:\ncert = SSLCertificate.from_file(\"/path/to/cert.der\")
\n2.3 from_binary(binary_data)\nInitialize from raw binary.
E.g., if you captured it from a socket or another source:
\ncert = SSLCertificate.from_binary(raw_bytes) \n3. Common
Properties\nAfter obtaining a SSLCertificate instance (e.g.
result.ssl_certificate from a crawl), you can read:\n1. issuer
(dict)\n- E.g. {\"CN\": \"My Root CA\", \"O\": \"...\"} 2.
subject (dict)\n- E.g. {\"CN\": \"example.com\", \"O\":
\"ExampleOrg\"} 3. valid_from (str)\n- NotBefore date/time.
Often in ASN.1/UTC format. 4. valid_until (str)\n- NotAfter
date/time. 5. fingerprint (str)\n- The SHA-256 digest
(lowercase hex).\n- E.g. \"d14d2e...\"\n4. Export Methods
\nOnce you have a SSLCertificate object, you can export or
inspect it:\n4.1 to_json(filepath=None) →
Optional[str]\nReturns a JSON string containing the parsed
certificate fields. \nIf filepath is provided, saves it to
disk instead, returning None.\nUsage: \njson_data =
cert.to_json() # returns JSON string
cert.to_json(\"certificate.json\") # writes file, returns None
\n4.2 to_pem(filepath=None) → Optional[str]\nReturns a PEM-
encoded string (common for web servers). \nIf filepath is
provided, saves it to disk instead.\npem_str = cert.to_pem() #
in-memory PEM string cert.to_pem(\"/path/to/cert.pem\") #
saved to file \n4.3 to_der(filepath=None) →
Optional[bytes]\nReturns the original DER (binary ASN.1)
bytes. \nIf filepath is specified, writes the bytes there
instead.\nder_bytes = cert.to_der()
cert.to_der(\"certificate.der\") \n4.4 (Optional)
export_as_text()\nIf you see a method like export_as_text(),
it typically returns an OpenSSL-style textual representation.
\nNot always needed, but can help for debugging or manual
inspection.\n5. Example Usage in Crawl4AI\nBelow is a minimal
sample showing how the crawler obtains an SSL cert from a
site, then reads or exports it. The code snippet:\nimport
asyncio import os from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode async def main(): tmp_dir = \"tmp
\" os.makedirs(tmp_dir, exist_ok=True) config =
CrawlerRunConfig( fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun(\"https://example.com\",
config=config) if result.success and result.ssl_certificate:
cert = result.ssl_certificate # 1. Basic Info print(\"Issuer
CN:\", cert.issuer.get(\"CN\", \"\")) print(\"Valid until:\",
cert.valid_until) print(\"Fingerprint:\", cert.fingerprint) #
2. Export cert.to_json(os.path.join(tmp_dir,
\"certificate.json\")) cert.to_pem(os.path.join(tmp_dir,
\"certificate.pem\")) cert.to_der(os.path.join(tmp_dir,
\"certificate.der\")) if __name__ == \"__main__\":
asyncio.run(main()) \n6. Notes & Best Practices\n1. Timeout:
SSLCertificate.from_url internally uses a default 10s socket
connect and wraps SSL.\n2. Binary Form: The certificate is
loaded in ASN.1 (DER) form, then re-parsed by OpenSSL.crypto.
\n3. Validation: This does not validate the certificate chain
or trust store. It only fetches and parses.\n4. Integration:
Within Crawl4AI, you typically just set
fetch_ssl_certificate=True in CrawlerRunConfig; the final
result’s ssl_certificate is automatically built.\n5. Export:
If you need to store or analyze a cert, the to_json and to_pem
are quite universal.\nSummary\nSSLCertificate is a convenience
class for capturing and exporting the TLS certificate from
your crawled site(s). \nCommon usage is in the
CrawlResult.ssl_certificate field, accessible after setting
fetch_ssl_certificate=True. \nOffers quick access to essential
certificate details (issuer, subject, fingerprint) and is easy
to export (PEM, DER, JSON) for further analysis or server
usage.\nUse it whenever you need insight into a site's
certificate or require some form of cryptographic or
compliance check.",
"markdown": "# SSL Certificate - Crawl4AI Documentation
(v0.5.x)\n\n## `SSLCertificate` Reference\n\nThe
**`SSLCertificate`** class encapsulates an SSL certificate’s
data and allows exporting it in various formats (PEM, DER,
JSON, or text). It’s used within **Crawl4AI** whenever you
set **`fetch_ssl_certificate=True`** in your
**`CrawlerRunConfig`**.\n\n## 1\\. Overview\n\n**Location**:
`crawl4ai/ssl_certificate.py`\n\n`class SSLCertificate:
\"\"\" Represents an SSL certificate with methods to
export in various formats. Main Methods: -
from_url(url, timeout=10) - from_file(file_path) -
from_binary(binary_data) - to_json(filepath=None) -
to_pem(filepath=None) - to_der(filepath=None) ...
Common Properties: - issuer - subject -
valid_from - valid_until - fingerprint \"\"\"`\n
\n### Typical Use Case\n\n1. You **enable** certificate
fetching in your crawl by:\n \n
`CrawlerRunConfig(fetch_ssl_certificate=True, ...)`\n \n2.
After `arun()`, if `result.ssl_certificate` is present, it's
an instance of **`SSLCertificate`**.\n3. You can **read**
basic properties (issuer, subject, validity) or **export**
them in multiple formats.\n\n* * *\n\n## 2\\. Construction &
Fetching\n\n### 2.1 **`from_url(url, timeout=10)`**\n
\nManually load an SSL certificate from a given URL (port
443). Typically used internally, but you can call it directly
if you want:\n\n`cert =
SSLCertificate.from_url(\"https://example.com\") if cert:
print(\"Fingerprint:\", cert.fingerprint)`\n\n### 2.2
**`from_file(file_path)`**\n\nLoad from a file containing
certificate data in ASN.1 or DER. Rarely needed unless you
have local cert files:\n\n`cert =
SSLCertificate.from_file(\"/path/to/cert.der\")`\n\n### 2.3
**`from_binary(binary_data)`**\n\nInitialize from raw binary.
E.g., if you captured it from a socket or another source:\n
\n`cert = SSLCertificate.from_binary(raw_bytes)`\n\n* * *\n
\n## 3\\. Common Properties\n\nAfter obtaining a
**`SSLCertificate`** instance (e.g. `result.ssl_certificate`
from a crawl), you can read:\n\n1. **`issuer`** _(dict)_ \n
\\- E.g. `{\"CN\": \"My Root CA\", \"O\": \"...\"}` 2.â
€€**`subject`** _(dict)_ \n\\- E.g. `{\"CN\": \"example.com
\", \"O\": \"ExampleOrg\"}` 3. **`valid_from`** _(str)_ \n
\\- NotBefore date/time. Often in ASN.1/UTC format. 4.â
€€**`valid_until`** _(str)_ \n\\- NotAfter date/time. 5.â
€€**`fingerprint`** _(str)_ \n\\- The SHA-256 digest
(lowercase hex). \n\\- E.g. `\"d14d2e...\"`\n\n* * *\n\n## 4
\\. Export Methods\n\nOnce you have a **`SSLCertificate`**
object, you can **export** or **inspect** it:\n\n### 4.1
**`to_json(filepath=None)` → `Optional[str]`**\n\n*
Returns a JSON string containing the parsed certificate
fields.\n* If `filepath` is provided, saves it to disk
instead, returning `None`.\n\n**Usage**:\n\n`json_data =
cert.to_json() # returns JSON string
cert.to_json(\"certificate.json\") # writes file, returns
None`\n\n### 4.2 **`to_pem(filepath=None)` →
`Optional[str]`**\n\n* Returns a PEM-encoded string (common
for web servers).\n* If `filepath` is provided, saves it to
disk instead.\n\n`pem_str = cert.to_pem() # in-
memory PEM string cert.to_pem(\"/path/to/cert.pem\") #
saved to file`\n\n### 4.3 **`to_der(filepath=None)` →
`Optional[bytes]`**\n\n* Returns the original DER (binary
ASN.1) bytes.\n* If `filepath` is specified, writes the
bytes there instead.\n\n`der_bytes = cert.to_der()
cert.to_der(\"certificate.der\")`\n\n### 4.4 (Optional)
**`export_as_text()`**\n\n* If you see a method like
`export_as_text()`, it typically returns an OpenSSL-style
textual representation.\n* Not always needed, but can help
for debugging or manual inspection.\n\n* * *\n\n## 5\\.
Example Usage in Crawl4AI\n\nBelow is a minimal sample showing
how the crawler obtains an SSL cert from a site, then reads or
exports it. The code snippet:\n\n`import asyncio import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
CacheMode async def main(): tmp_dir = \"tmp\"
os.makedirs(tmp_dir, exist_ok=True) config =
CrawlerRunConfig( fetch_ssl_certificate=True,
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun(\"https://example.com\", config=config)
if result.success and result.ssl_certificate: cert
= result.ssl_certificate # 1. Basic Info
print(\"Issuer CN:\", cert.issuer.get(\"CN\", \"\"))
print(\"Valid until:\", cert.valid_until)
print(\"Fingerprint:\", cert.fingerprint) # 2.
Export cert.to_json(os.path.join(tmp_dir,
\"certificate.json\"))
cert.to_pem(os.path.join(tmp_dir, \"certificate.pem\"))
cert.to_der(os.path.join(tmp_dir, \"certificate.der\")) if
__name__ == \"__main__\": asyncio.run(main())`\n\n* * *\n
\n## 6\\. Notes & Best Practices\n\n1. **Timeout**:
`SSLCertificate.from_url` internally uses a default **10s**
socket connect and wraps SSL. \n2. **Binary Form**: The
certificate is loaded in ASN.1 (DER) form, then re-parsed by
`OpenSSL.crypto`. \n3. **Validation**: This does **not**
validate the certificate chain or trust store. It only fetches
and parses. \n4. **Integration**: Within Crawl4AI, you
typically just set `fetch_ssl_certificate=True` in
`CrawlerRunConfig`; the final result’s `ssl_certificate` is
automatically built. \n5. **Export**: If you need to store or analyze a cert, the `to_json` and `to_pem` exports are widely supported.\n\n* * *\n\n### Summary\n\n* **`SSLCertificate`**
is a convenience class for capturing and exporting the **TLS
certificate** from your crawled site(s).\n* Common usage is
in the **`CrawlResult.ssl_certificate`** field, accessible
after setting `fetch_ssl_certificate=True`.\n* Offers quick
access to essential certificate details (`issuer`, `subject`,
`fingerprint`) and is easy to export (PEM, DER, JSON) for
further analysis or server usage.\n\nUse it whenever you need
**insight** into a site’s certificate or require some form
of cryptographic or compliance check.",
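As a concrete illustration of the compliance-check use case mentioned in the Summary, here is a small sketch that flags certificates expiring within 30 days. It assumes `valid_until` arrives in the ASN.1/UTC layout noted in the Common Properties section (e.g. `20251231235959Z`); if your certificates report a different format, adjust the parsing accordingly.

```python
# Sketch: warn when a fetched certificate expires within 30 days.
# Assumes result.ssl_certificate was populated via fetch_ssl_certificate=True
# and that valid_until uses the ASN.1/UTC layout noted above (e.g. "20251231235959Z").
from datetime import datetime, timedelta, timezone

def expires_soon(cert, days=30):
    try:
        not_after = datetime.strptime(cert.valid_until, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)
    except (TypeError, ValueError):
        return None  # unexpected format; inspect cert.valid_until manually
    return not_after - datetime.now(timezone.utc) < timedelta(days=days)

# Usage (inside your crawl code, after arun()):
# if result.ssl_certificate and expires_soon(result.ssl_certificate):
#     print("Certificate for", result.url, "expires within 30 days")
```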
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/session-
management/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/advanced/session-management/",
"loadedTime": "2025-03-05T23:17:09.168Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/advanced/session-management/",
"title": "Session Management - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:08 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"c090e87b6dfd8587df01e7aa6e377b21\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Session Management - Crawl4AI Documentation
(v0.5.x)\nSession management in Crawl4AI is a powerful feature
that allows you to maintain state across multiple requests,
making it particularly suitable for handling complex multi-
step crawling tasks. It enables you to reuse the same browser
tab (or page object) across sequential actions and crawls,
which is beneficial for:\nPerforming JavaScript actions before
and after crawling.\nExecuting multiple sequential crawls
faster without needing to reopen tabs or allocate memory
repeatedly.\nNote: This feature is designed for sequential
workflows and is not suitable for parallel operations.\nBasic
Session Usage\nUse BrowserConfig and CrawlerRunConfig to
maintain state with a session_id:\nfrom crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig async with
AsyncWebCrawler() as crawler: session_id = \"my_session\" #
Define configurations config1 = CrawlerRunConfig( url=
\"https://example.com/page1\", session_id=session_id ) config2
= CrawlerRunConfig( url=\"https://example.com/page2\",
session_id=session_id ) # First request result1 = await
crawler.arun(config=config1) # Subsequent request using the
same session result2 = await crawler.arun(config=config2) #
Clean up when done await
crawler.crawler_strategy.kill_session(session_id) \nDynamic
Content with Sessions\nHere's an example of crawling GitHub
commits across multiple pages while preserving session state:
\nimport json from crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import CrawlerRunConfig from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.cache_context import CacheMode async def
crawl_dynamic_content(): async with AsyncWebCrawler() as
crawler: session_id = \"github_commits_session\" url =
\"https://github.com/microsoft/TypeScript/commits/main\"
all_commits = [] # Define extraction schema schema = { \"name
\": \"Commit Extractor\", \"baseSelector\": \"li.Box-sc-
g0xbh4-0\", \"fields\": [{ \"name\": \"title\", \"selector\":
\"h4.markdown-title\", \"type\": \"text\" }], }
extraction_strategy = JsonCssExtractionStrategy(schema) #
JavaScript and wait configurations js_next_page =
\"\"\"document.querySelector('a[data-testid=\"pagination-next-
button\"]').click();\"\"\" wait_for = \"\"\"() =>
document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0
\"\"\" # Crawl multiple pages for page in range(3): config =
CrawlerRunConfig( url=url, session_id=session_id,
extraction_strategy=extraction_strategy, js_code=js_next_page
if page > 0 else None, wait_for=wait_for if page > 0 else
None, js_only=page > 0, cache_mode=CacheMode.BYPASS ) result =
await crawler.arun(config=config) if result.success: commits =
json.loads(result.extracted_content)
all_commits.extend(commits) print(f\"Page {page + 1}: Found
{len(commits)} commits\") # Clean up session await
crawler.crawler_strategy.kill_session(session_id) return
all_commits \nExample 1: Basic Session-Based Crawling\nA
simple example using session-based crawling:\nimport asyncio
from crawl4ai.async_configs import BrowserConfig,
CrawlerRunConfig from crawl4ai.cache_context import CacheMode
async def basic_session_crawl(): async with AsyncWebCrawler()
as crawler: session_id = \"dynamic_content_session\" url =
\"https://example.com/dynamic-content\" for page in range(3):
config = CrawlerRunConfig( url=url, session_id=session_id,
js_code=\"document.querySelector('.load-more-button').click();
\" if page > 0 else None, css_selector=\".content-item\",
cache_mode=CacheMode.BYPASS ) result = await
crawler.arun(config=config) print(f\"Page {page + 1}: Found
{result.extracted_content.count('.content-item')} items\")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(basic_session_crawl()) \nThis example shows: 1.
Reusing the same session_id across multiple requests. 2.
Executing JavaScript to load more content dynamically. 3.
Properly closing the session to free resources.\nAdvanced
Technique 1: Custom Execution Hooks\nWarning: You might feel
confused by the end of the next few examples 😅, so make
sure you are comfortable with the order of the parts before
you start this.\nUse custom hooks to handle complex scenarios,
such as waiting for content to load dynamically:\nasync def
advanced_session_crawl_with_hooks(): first_commit = \"\" async
def on_execution_started(page): nonlocal first_commit try:
while True: await page.wait_for_selector(\"li.commit-item h4
\") commit = await page.query_selector(\"li.commit-item h4\")
commit = (await commit.evaluate(\"(element) =>
element.textContent\")).strip() if commit and commit !=
first_commit: first_commit = commit break await
asyncio.sleep(0.5) except Exception as e: print(f\"Warning:
New content didn't appear: {e}\") async with AsyncWebCrawler()
as crawler: session_id = \"commit_session\" url =
\"https://github.com/example/repo/commits/main\"
crawler.crawler_strategy.set_hook(\"on_execution_started\",
on_execution_started) js_next_page =
\"\"\"document.querySelector('a.pagination-next').click();
\"\"\" for page in range(3): config =
CrawlerRunConfig( url=url, session_id=session_id,
js_code=js_next_page if page > 0 else None, css_selector=
\"li.commit-item\", js_only=page > 0,
cache_mode=CacheMode.BYPASS ) result = await
crawler.arun(config=config) print(f\"Page {page + 1}: Found
{len(result.extracted_content)} commits\") await
crawler.crawler_strategy.kill_session(session_id)
asyncio.run(advanced_session_crawl_with_hooks()) \nThis
technique ensures new content loads before the next action.
\nAdvanced Technique 2: Integrated JavaScript Execution and
Waiting\nCombine JavaScript execution and waiting logic for
concise handling of dynamic content:\nasync def
integrated_js_and_wait_crawl(): async with AsyncWebCrawler()
as crawler: session_id = \"integrated_session\" url =
\"https://github.com/example/repo/commits/main\"
js_next_page_and_wait = \"\"\" (async () => { const
getCurrentCommit = () => document.querySelector('li.commit-
item h4').textContent.trim(); const initialCommit =
getCurrentCommit(); document.querySelector('a.pagination-
next').click(); while (getCurrentCommit() === initialCommit)
{ await new Promise(resolve => setTimeout(resolve, 100)); } })
(); \"\"\" for page in range(3): config =
CrawlerRunConfig( url=url, session_id=session_id,
js_code=js_next_page_and_wait if page > 0 else None,
css_selector=\"li.commit-item\", js_only=page > 0,
cache_mode=CacheMode.BYPASS ) result = await
crawler.arun(config=config) print(f\"Page {page + 1}: Found
{len(result.extracted_content)} commits\") await
crawler.crawler_strategy.kill_session(session_id)
asyncio.run(integrated_js_and_wait_crawl()) \nCommon Use Cases
for Sessions\n1. Authentication Flows: Login and interact with
secured pages.\n2. Pagination Handling: Navigate through
multiple pages.\n3. Form Submissions: Fill forms, submit, and
process results (see the sketch below).\n4. Multi-step Processes: Complete workflows
that span multiple actions.\n5. Dynamic Content Navigation:
Handle JavaScript-rendered or event-triggered content.",
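To make use case 3 (form submissions) concrete, here is a rough sketch that reuses one session to fill a form, submit it, and scrape the results. The URL and the selectors (#search-box, button.submit, div.result) are illustrative placeholders; the session_id, js_code, js_only, and wait_for parameters are used exactly as in the examples above.

```python
# Sketch: fill and submit a search form, then scrape the results in the same tab.
# The URL and selectors are illustrative placeholders; swap in your own.
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def form_submission_session():
    async with AsyncWebCrawler() as crawler:
        session_id = "form_session"
        url = "https://example.com/search"

        # Step 1: load the page that contains the form.
        await crawler.arun(config=CrawlerRunConfig(
            url=url, session_id=session_id, cache_mode=CacheMode.BYPASS
        ))

        # Step 2: fill the form and submit it inside the same tab,
        # then wait until result items appear.
        fill_and_submit = """
            document.querySelector('#search-box').value = 'crawl4ai';
            document.querySelector('button.submit').click();
        """
        result = await crawler.arun(config=CrawlerRunConfig(
            url=url,
            session_id=session_id,
            js_code=fill_and_submit,
            js_only=True,  # reuse the open page instead of re-navigating
            wait_for="css:div.result",
            css_selector="div.result",
            cache_mode=CacheMode.BYPASS,
        ))
        if result.success:
            print(result.markdown[:500])

        await crawler.crawler_strategy.kill_session(session_id)

asyncio.run(form_submission_session())
```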
"markdown": "# Session Management - Crawl4AI Documentation
(v0.5.x)\n\nSession management in Crawl4AI is a powerful
feature that allows you to maintain state across multiple
requests, making it particularly suitable for handling complex
multi-step crawling tasks. It enables you to reuse the same
browser tab (or page object) across sequential actions and
crawls, which is beneficial for:\n\n* **Performing
JavaScript actions before and after crawling.**\n*
**Executing multiple sequential crawls faster** without
needing to reopen tabs or allocate memory repeatedly.\n
\n**Note:** This feature is designed for sequential workflows
and is not suitable for parallel operations.\n\n* * *\n\n####
Basic Session Usage\n\nUse `BrowserConfig` and
`CrawlerRunConfig` to maintain state with a `session_id`:\n
\n`from crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig async with AsyncWebCrawler() as crawler:
session_id = \"my_session\" # Define configurations
config1 = CrawlerRunConfig( url=
\"https://example.com/page1\", session_id=session_id )
config2 = CrawlerRunConfig( url=
\"https://example.com/page2\", session_id=session_id )
# First request result1 = await
crawler.arun(config=config1) # Subsequent request using
the same session result2 = await
crawler.arun(config=config2) # Clean up when done
await crawler.crawler_strategy.kill_session(session_id)`\n\n*
* *\n\n#### Dynamic Content with Sessions\n\nHere's an example
of crawling GitHub commits across multiple pages while
preserving session state:\n\n`import json from crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import CrawlerRunConfig from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.cache_context import CacheMode async def crawl_dynamic_content(): async
with AsyncWebCrawler() as crawler: session_id =
\"github_commits_session\" url =
\"https://github.com/microsoft/TypeScript/commits/main\"
all_commits = [] # Define extraction schema
schema = { \"name\": \"Commit Extractor\",
\"baseSelector\": \"li.Box-sc-g0xbh4-0\", \"fields
\": [{ \"name\": \"title\", \"selector\":
\"h4.markdown-title\", \"type\": \"text
\" }], } extraction_strategy =
JsonCssExtractionStrategy(schema) # JavaScript and
wait configurations js_next_page =
\"\"\"document.querySelector('a[data-testid=\"pagination-next-
button\"]').click();\"\"\" wait_for = \"\"\"() =>
document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0
\"\"\" # Crawl multiple pages for page in
range(3): config =
CrawlerRunConfig( url=url,
session_id=session_id,
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS ) result
= await crawler.arun(config=config) if
result.success: commits =
json.loads(result.extracted_content)
all_commits.extend(commits) print(f\"Page
{page + 1}: Found {len(commits)} commits\") # Clean
up session await
crawler.crawler_strategy.kill_session(session_id)
return all_commits`\n\n* * *\n\n## Example 1: Basic Session-
Based Crawling\n\nA simple example using session-based
crawling:\n\n`import asyncio from crawl4ai.async_configs
import BrowserConfig, CrawlerRunConfig from
crawl4ai.cache_context import CacheMode async def
basic_session_crawl(): async with AsyncWebCrawler() as
crawler: session_id = \"dynamic_content_session\"
url = \"https://example.com/dynamic-content\" for
page in range(3): config =
CrawlerRunConfig( url=url,
session_id=session_id, js_code=
\"document.querySelector('.load-more-button').click();\" if
page > 0 else None, css_selector=\".content-
item\",
cache_mode=CacheMode.BYPASS ) result
= await crawler.arun(config=config) print(f\"Page
{page + 1}: Found {result.extracted_content.count('.content-
item')} items\") await
crawler.crawler_strategy.kill_session(session_id)
asyncio.run(basic_session_crawl())`\n\nThis example shows: 1.
Reusing the same `session_id` across multiple requests. 2.
Executing JavaScript to load more content dynamically. 3.
Properly closing the session to free resources.\n\n* * *\n\n##
Advanced Technique 1: Custom Execution Hooks\n\n> Warning: You
might feel confused by the end of the next few examples 😅,
so make sure you are comfortable with the order of the parts
before you start this.\n\nUse custom hooks to handle complex
scenarios, such as waiting for content to load dynamically:\n
\n`async def advanced_session_crawl_with_hooks():
first_commit = \"\" async def on_execution_started(page):
nonlocal first_commit try: while True:
await page.wait_for_selector(\"li.commit-item h4\")
commit = await page.query_selector(\"li.commit-item h4\")
commit = (await commit.evaluate(\"(element) =>
element.textContent\")).strip() if commit and
commit != first_commit: first_commit =
commit break await
asyncio.sleep(0.5) except Exception as e:
print(f\"Warning: New content didn't appear: {e}\") async
with AsyncWebCrawler() as crawler: session_id =
\"commit_session\" url =
\"https://github.com/example/repo/commits/main\"
crawler.crawler_strategy.set_hook(\"on_execution_started\",
on_execution_started) js_next_page =
\"\"\"document.querySelector('a.pagination-next').click();
\"\"\" for page in range(3): config =
CrawlerRunConfig( url=url,
session_id=session_id, js_code=js_next_page if
page > 0 else None, css_selector=\"li.commit-
item\", js_only=page > 0,
cache_mode=CacheMode.BYPASS ) result
= await crawler.arun(config=config) print(f\"Page
{page + 1}: Found {len(result.extracted_content)} commits\")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(advanced_session_crawl_with_hooks())`\n\nThis
technique ensures new content loads before the next action.\n
\n* * *\n\n## Advanced Technique 2: Integrated JavaScript
Execution and Waiting\n\nCombine JavaScript execution and
waiting logic for concise handling of dynamic content:\n
\n`async def integrated_js_and_wait_crawl(): async with
AsyncWebCrawler() as crawler: session_id =
\"integrated_session\" url =
\"https://github.com/example/repo/commits/main\"
js_next_page_and_wait = \"\"\" (async () =>
{ const getCurrentCommit = () =>
document.querySelector('li.commit-item
h4').textContent.trim(); const initialCommit =
getCurrentCommit();
document.querySelector('a.pagination-next').click();
while (getCurrentCommit() === initialCommit)
{ await new Promise(resolve =>
setTimeout(resolve, 100)); } })();
\"\"\" for page in range(3): config =
CrawlerRunConfig( url=url,
session_id=session_id,
js_code=js_next_page_and_wait if page > 0 else None,
css_selector=\"li.commit-item\",
js_only=page > 0,
cache_mode=CacheMode.BYPASS ) result
= await crawler.arun(config=config) print(f\"Page
{page + 1}: Found {len(result.extracted_content)} commits\")
await crawler.crawler_strategy.kill_session(session_id)
asyncio.run(integrated_js_and_wait_crawl())`\n\n* * *\n\n####
Common Use Cases for Sessions\n\n1. **Authentication
Flows**: Login and interact with secured pages (see the sketch below).\n\n2. **Pagination Handling**: Navigate through multiple pages.\n
\n3. **Form Submissions**: Fill forms, submit, and process
results.\n\n4. **Multi-step Processes**: Complete workflows
that span multiple actions.\n\n5. **Dynamic Content
Navigation**: Handle JavaScript-rendered or event-triggered
content.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/extraction/no-llm-
strategies/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/extraction/no-
llm-strategies/",
"loadedTime": "2025-03-05T23:17:14.635Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/extraction/no-
llm-strategies/",
"title": "LLM-Free Strategies - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:13 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"b6b32d293a9bf2f263fc59e36a5a9af0\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "LLM-Free Strategies - Crawl4AI Documentation
(v0.5.x)\nOne of Crawl4AI’s most powerful features is
extracting structured JSON from websites without relying on
large language models. By defining a schema with CSS or XPath
selectors, you can extract data instantly—even from complex
or nested HTML structures—without the cost, latency, or
environmental impact of an LLM.\nWhy avoid LLM for basic
extractions?\n1. Faster & Cheaper: No API calls or GPU
overhead.\n2. Lower Carbon Footprint: LLM inference can be
energy-intensive. A well-defined schema is practically carbon-
free.\n3. Precise & Repeatable: CSS/XPath selectors do exactly
what you specify. LLM outputs can vary or hallucinate.\n4.
Scales Readily: For thousands of pages, schema-based
extraction runs quickly and in parallel.\nBelow, we’ll
explore how to craft these schemas and use them with
JsonCssExtractionStrategy (or JsonXPathExtractionStrategy if
you prefer XPath). We’ll also highlight advanced features
like nested fields and base element attributes.\nA schema
defines:\n1. A base selector that identifies each “container” element on the page (e.g., a product row, a blog post card).
\n2. Fields describing which CSS/XPath selectors to use for
each piece of data you want to capture (text, attribute, HTML
block, etc.).\n3. Nested or list types for repeated or
hierarchical structures. \nFor example, if you have a list of
products, each one might have a name, price, reviews, and “related products.” This approach is faster and more
reliable than an LLM for consistent, structured pages.\n2.
Simple Example: Crypto Prices\nLet’s begin with a simple
schema-based extraction using the JsonCssExtractionStrategy.
Below is a snippet that extracts cryptocurrency prices from a
site (similar to the legacy Coinbase example). Notice we don’t call any LLM:\nimport json import asyncio from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def extract_crypto_prices(): # 1. Define a simple
extraction schema schema = { \"name\": \"Crypto Prices\",
\"baseSelector\": \"div.crypto-row\", # Repeated elements
\"fields\": [ { \"name\": \"coin_name\", \"selector\":
\"h2.coin-name\", \"type\": \"text\" }, { \"name\": \"price\",
\"selector\": \"span.coin-price\", \"type\": \"text\" } ] } #
2. Create the extraction strategy extraction_strategy =
JsonCssExtractionStrategy(schema, verbose=True) # 3. Set up
your crawler config (if needed) config = CrawlerRunConfig( #
e.g., pass js_code or wait_for if the page is dynamic #
wait_for=\"css:.crypto-row:nth-child(20)\" cache_mode =
CacheMode.BYPASS, extraction_strategy=extraction_strategy, )
async with AsyncWebCrawler(verbose=True) as crawler: # 4. Run
the crawl and extraction result = await crawler.arun( url=
\"https://example.com/crypto-prices\", config=config ) if not
result.success: print(\"Crawl failed:\", result.error_message)
return # 5. Parse the extracted JSON data =
json.loads(result.extracted_content) print(f\"Extracted
{len(data)} coin entries\") print(json.dumps(data[0], indent=
2) if data else \"No data found\")
asyncio.run(extract_crypto_prices()) \nHighlights:
\nbaseSelector: Tells us where each “item” (crypto row)
is. \nfields: Two fields (coin_name, price) using simple CSS
selectors. \nEach field defines a type (e.g., text, attribute,
html, regex, etc.).\nNo LLM is needed, and the performance is
near-instant for hundreds or thousands of items.\nXPath
Example with raw:// HTML\nBelow is a short example
demonstrating XPath extraction plus the raw:// scheme. We’ll
pass a dummy HTML directly (no network request) and define the
extraction strategy in CrawlerRunConfig.\nimport json import
asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import
JsonXPathExtractionStrategy async def
extract_crypto_prices_xpath(): # 1. Minimal dummy HTML with
some repeating rows dummy_html = \"\"\" <html> <body> <div
class='crypto-row'> <h2 class='coin-name'>Bitcoin</h2> <span
class='coin-price'>$28,000</span> </div> <div class='crypto-
row'> <h2 class='coin-name'>Ethereum</h2> <span class='coin-
price'>$1,800</span> </div> </body> </html> \"\"\" # 2. Define
the JSON schema (XPath version) schema = { \"name\": \"Crypto
Prices via XPath\", \"baseSelector\": \"//div[@class='crypto-
row']\", \"fields\": [ { \"name\": \"coin_name\", \"selector
\": \".//h2[@class='coin-name']\", \"type\": \"text\" },
{ \"name\": \"price\", \"selector\": \".//span[@class='coin-
price']\", \"type\": \"text\" } ] } # 3. Place the strategy in
the CrawlerRunConfig config =
CrawlerRunConfig( extraction_strategy=JsonXPathExtractionStrat
egy(schema, verbose=True) ) # 4. Use raw:// scheme to pass
dummy_html directly raw_url = f\"raw://{dummy_html}\" async
with AsyncWebCrawler(verbose=True) as crawler: result = await
crawler.arun( url=raw_url, config=config ) if not
result.success: print(\"Crawl failed:\", result.error_message)
return data = json.loads(result.extracted_content) print(f
\"Extracted {len(data)} coin rows\") if data: print(\"First
item:\", data[0]) asyncio.run(extract_crypto_prices_xpath())
\nKey Points:\n1. JsonXPathExtractionStrategy is used instead
of JsonCssExtractionStrategy.\n2. baseSelector and each field’s \"selector\" use XPath instead of CSS.\n3. raw:// lets us
pass dummy_html with no real network request—handy for local
testing.\n4. Everything (including the extraction strategy) is
in CrawlerRunConfig. \nThat’s how you keep the config self-
contained, illustrate XPath usage, and demonstrate the raw
scheme for direct HTML input—all while avoiding the old
approach of passing extraction_strategy directly to arun().
\n3. Advanced Schema & Nested Structures\nReal sites often
have nested or repeated data—like categories containing
products, which themselves have a list of reviews or features.
For that, we can define nested or list (and even nested_list)
fields.\nSample E-Commerce HTML\nWe have a sample e-commerce
HTML file on GitHub (example):
\nhttps://gist.githubusercontent.com/githubusercontent/2d7b8ba
3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920d
ee5e1fd2/sample_ecommerce.html \nThis snippet includes
categories, products, features, reviews, and related items.
Let’s see how to define a schema that fully captures that
structure without LLM. \nschema = { \"name\": \"E-commerce
Product Catalog\", \"baseSelector\": \"div.category\", # (1)
We can define optional baseFields if we want to extract
attributes # from the category container \"baseFields\":
[ {\"name\": \"data_cat_id\", \"type\": \"attribute\",
\"attribute\": \"data-cat-id\"}, ], \"fields\": [ { \"name\":
\"category_name\", \"selector\": \"h2.category-name\", \"type
\": \"text\" }, { \"name\": \"products\", \"selector\":
\"div.product\", \"type\": \"nested_list\", # repeated sub-
objects \"fields\": [ { \"name\": \"name\", \"selector\":
\"h3.product-name\", \"type\": \"text\" }, { \"name\": \"price
\", \"selector\": \"p.product-price\", \"type\": \"text\" },
{ \"name\": \"details\", \"selector\": \"div.product-details
\", \"type\": \"nested\", # single sub-object \"fields\":
[ { \"name\": \"brand\", \"selector\": \"span.brand\", \"type
\": \"text\" }, { \"name\": \"model\", \"selector\":
\"span.model\", \"type\": \"text\" } ] }, { \"name\":
\"features\", \"selector\": \"ul.product-features li\", \"type
\": \"list\", \"fields\": [ {\"name\": \"feature\", \"type\":
\"text\"} ] }, { \"name\": \"reviews\", \"selector\":
\"div.review\", \"type\": \"nested_list\", \"fields\":
[ { \"name\": \"reviewer\", \"selector\": \"span.reviewer\",
\"type\": \"text\" }, { \"name\": \"rating\", \"selector\":
\"span.rating\", \"type\": \"text\" }, { \"name\": \"comment
\", \"selector\": \"p.review-text\", \"type\": \"text\" } ] },
{ \"name\": \"related_products\", \"selector\": \"ul.related-
products li\", \"type\": \"list\", \"fields\": [ { \"name\":
\"name\", \"selector\": \"span.related-name\", \"type\":
\"text\" }, { \"name\": \"price\", \"selector\":
\"span.related-price\", \"type\": \"text\" } ] } ] } ] } \nKey
Takeaways:\nNested vs. List: \ntype: \"nested\" means a single
sub-object (like details). \ntype: \"list\" means multiple
items that are simple dictionaries or single text fields.
\ntype: \"nested_list\" means repeated complex objects (like
products or reviews).\nBase Fields: We can extract attributes
from the container element via \"baseFields\". For instance,
\"data_cat_id\" might be data-cat-id=\"elect123\".
\nTransforms: We can also define a transform if we want to
lower/upper case, strip whitespace, or even run a custom
function.\nimport json import asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
ecommerce_schema = { # ... the advanced schema from
above ... } async def extract_ecommerce_data(): strategy =
JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
config = CrawlerRunConfig() async with
AsyncWebCrawler(verbose=True) as crawler: result = await
crawler.arun( url=
\"https://gist.githubusercontent.com/githubusercontent/2d7b8ba
3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920d
ee5e1fd2/sample_ecommerce.html\",
extraction_strategy=strategy, config=config ) if not
result.success: print(\"Crawl failed:\", result.error_message)
return # Parse the JSON output data =
json.loads(result.extracted_content) print(json.dumps(data,
indent=2) if data else \"No data found.\")
asyncio.run(extract_ecommerce_data()) \nIf all goes well, you
get a structured JSON array with each “category,”
containing an array of products. Each product includes
details, features, reviews, etc. All of that without an LLM.
\n4. Why “No LLM” Is Often Better\n1. Zero Hallucination:
Schema-based extraction doesn’t guess text. It either finds
it or not.\n2. Guaranteed Structure: The same schema yields
consistent JSON across many pages, so your downstream pipeline
can rely on stable keys.\n3. Speed: LLM-based extraction can
be 10–1000x slower for large-scale crawling.\n4. Scalable:
Adding or updating a field is a matter of adjusting the
schema, not re-tuning a model.\nWhen might you consider an
LLM? Possibly if the site is extremely unstructured or you
want AI summarization. But always try a schema approach first
for repeated or consistent data patterns.\n5. Base Element
Attributes & Additional Fields\nIt’s easy to extract
attributes (like href, src, or data-xxx) from your base or
nested elements using:\n{ \"name\": \"href\", \"type\":
\"attribute\", \"attribute\": \"href\", \"default\": null }
\nYou can define them in baseFields (extracted from the main
container element) or in each field’s sub-lists. This is
especially helpful if you need an item’s link or ID stored
in the parent <div>.\n6. Putting It All Together: Larger
Example\nConsider a blog site. We have a schema that extracts
the URL from each post card (via baseFields with an
\"attribute\": \"href\"), plus the title, date, summary, and
author:\nschema = { \"name\": \"Blog Posts\", \"baseSelector
\": \"a.blog-post-card\", \"baseFields\": [ {\"name\":
\"post_url\", \"type\": \"attribute\", \"attribute\": \"href
\"} ], \"fields\": [ {\"name\": \"title\", \"selector\":
\"h2.post-title\", \"type\": \"text\", \"default\": \"No Title
\"}, {\"name\": \"date\", \"selector\": \"time.post-date\",
\"type\": \"text\", \"default\": \"\"}, {\"name\": \"summary
\", \"selector\": \"p.post-summary\", \"type\": \"text\",
\"default\": \"\"}, {\"name\": \"author\", \"selector\":
\"span.post-author\", \"type\": \"text\", \"default\":
\"\"} ] } \nThen run with JsonCssExtractionStrategy(schema) to
get an array of blog post objects, each with \"post_url\",
\"title\", \"date\", \"summary\", \"author\".\n7. Tips & Best
Practices\n1. Inspect the DOM in Chrome DevTools or Firefox’s Inspector to find stable selectors.\n2. Start Simple:
Verify you can extract a single field. Then add complexity
like nested objects or lists.\n3. Test your schema on partial
HTML or a test page before a big crawl.\n4. Combine with JS
Execution if the site loads content dynamically. You can pass
js_code or wait_for in CrawlerRunConfig.\n5. Look at Logs when
verbose=True: if your selectors are off or your schema is
malformed, it’ll often show warnings.\n6. Use baseFields if
you need attributes from the container element (e.g., href,
data-id), especially for the “parent” item.\n7.
Performance: For large pages, make sure your selectors are as
narrow as possible.\n8. Schema Generation Utility\nWhile
manually crafting schemas is powerful and precise, Crawl4AI
now offers a convenient utility to automatically generate
extraction schemas using LLM. This is particularly useful
when:\nYou're dealing with a new website structure and want a
quick starting point\nYou need to extract complex nested data
structures\nYou want to avoid the learning curve of CSS/XPath
selector syntax\nUsing the Schema Generator\nThe schema
generator is available as a static method on both
JsonCssExtractionStrategy and JsonXPathExtractionStrategy. You
can choose between OpenAI's GPT-4 or the open-source Ollama
for schema generation:\nfrom crawl4ai.extraction_strategy
import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
from crawl4ai.async_configs import LlmConfig # Sample HTML
with product information html = \"\"\" <div class=\"product-
card\"> <h2 class=\"title\">Gaming Laptop</h2> <div class=
\"price\">$999.99</div> <div class=\"specs\"> <ul> <li>16GB
RAM</li> <li>1TB SSD</li> </ul> </div> </div> \"\"\" # Option
1: Using OpenAI (requires API token) css_schema =
JsonCssExtractionStrategy.generate_schema( html, schema_type=
\"css\", llmConfig = LlmConfig(provider=\"openai/gpt-4o
\",api_token=\"your-openai-token\") ) # Option 2: Using Ollama
(open source, no token needed) xpath_schema =
JsonXPathExtractionStrategy.generate_schema( html,
schema_type=\"xpath\", llmConfig = LlmConfig(provider=
\"ollama/llama3.3\", api_token=None) # Not needed for Ollama )
# Use the generated schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(css_schema) \nLLM
Provider Options\nOpenAI GPT-4 (openai/gpt-4o)\nDefault
provider\nRequires an API token\nGenerally provides more
accurate schemas\nSet via environment variable: OPENAI_API_KEY
\nOllama (ollama/llama3.3)\nOpen source alternative\nNo API
token required\nSelf-hosted option\nGood for development and
testing\nBenefits of Schema Generation\nOne-Time Cost: While
schema generation uses LLM, it's a one-time cost. The
generated schema can be reused for unlimited extractions
without further LLM calls.\nSmart Pattern Recognition: The LLM
analyzes the HTML structure and identifies common patterns,
often producing more robust selectors than manual attempts.
\nAutomatic Nesting: Complex nested structures are
automatically detected and properly represented in the schema.
\nLearning Tool: The generated schemas serve as excellent
examples for learning how to write your own schemas.\nBest
Practices\nReview Generated Schemas: While the generator is
smart, always review and test the generated schema before
using it in production.\nProvide Representative HTML: The
better your sample HTML represents the overall structure, the
more accurate the generated schema will be.\nConsider Both CSS
and XPath: Try both schema types and choose the one that works
best for your specific case.\nCache Generated Schemas: Since
generation uses LLM, save successful schemas for reuse.\nAPI
Token Security: Never hardcode API tokens. Use environment
variables or secure configuration management.\nChoose Provider
Wisely: \nUse OpenAI for production-quality schemas\nUse
Ollama for development, testing, or when you need a self-
hosted solution\nThat's it for Extracting JSON (No LLM)!
You've seen how schema-based approaches (either CSS or XPath)
can handle everything from simple lists to deeply nested
product catalogs—instantly, with minimal overhead. Enjoy
building robust scrapers that produce consistent, structured
JSON for your data pipelines!\n9. Conclusion\nWith
JsonCssExtractionStrategy (or JsonXPathExtractionStrategy),
you can build powerful, LLM-free pipelines that:\nScrape any
consistent site for structured data. \nSupport nested objects,
repeating lists, or advanced transformations. \nScale to
thousands of pages quickly and reliably.\nNext Steps:\nCombine
your extracted JSON with advanced filtering or summarization
in a second pass if needed. \nFor dynamic pages, combine
strategies with js_code or infinite scroll hooking to ensure
all content is loaded.\nRemember: For repeated, structured
data, you don’t need to pay for or wait on an LLM. A well-
crafted schema plus CSS or XPath gets you the data faster,
cleaner, and cheaper—the real power of Crawl4AI.\nLast
Updated: 2025-01-01",
"markdown": "# LLM-Free Strategies - Crawl4AI Documentation
(v0.5.x)\n\nOne of Crawl4AI’s **most powerful** features is
extracting **structured JSON** from websites **without**
relying on large language models. By defining a **schema**
with CSS or XPath selectors, you can extract data instantly—
even from complex or nested HTML structures—without the
cost, latency, or environmental impact of an LLM.\n\n**Why
avoid LLM for basic extractions?**\n\n1. **Faster &
Cheaper**: No API calls or GPU overhead. \n2. **Lower
Carbon Footprint**: LLM inference can be energy-intensive. A
well-defined schema is practically carbon-free. \n3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate. \n4. **Scales Readily**: For thousands of pages, schema-based
extraction runs quickly and in parallel.\n\nBelow, we’ll
explore how to craft these schemas and use them with
**JsonCssExtractionStrategy** (or
**JsonXPathExtractionStrategy** if you prefer XPath). We’ll
also highlight advanced features like **nested fields** and
**base element attributes**.\n\n* * *\n\nA schema defines:\n
\n1. A **base selector** that identifies each “container”
element on the page (e.g., a product row, a blog post card).
\n 2. **Fields** describing which CSS/XPath selectors to
use for each piece of data you want to capture (text,
attribute, HTML block, etc.). \n 3. **Nested** or
**list** types for repeated or hierarchical structures.\n\nFor
example, if you have a list of products, each one might have a
name, price, reviews, and “related products.” This
approach is faster and more reliable than an LLM for
consistent, structured pages.\n\n* * *\n\n## 2\\. Simple
Example: Crypto Prices\n\nLet’s begin with a **simple**
schema-based extraction using the `JsonCssExtractionStrategy`.
Below is a snippet that extracts cryptocurrency prices from a
site (similar to the legacy Coinbase example). Notice we
**don’t** call any LLM:\n\n`import json import asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy async def extract_crypto_prices():
# 1. Define a simple extraction schema schema =
{ \"name\": \"Crypto Prices\", \"baseSelector
\": \"div.crypto-row\", # Repeated elements
\"fields\": [ { \"name\":
\"coin_name\", \"selector\": \"h2.coin-name\",
\"type\": \"text\" },
{ \"name\": \"price\",
\"selector\": \"span.coin-price\", \"type\":
\"text\" } ] } # 2. Create the
extraction strategy extraction_strategy =
JsonCssExtractionStrategy(schema, verbose=True) # 3. Set
up your crawler config (if needed) config =
CrawlerRunConfig( # e.g., pass js_code or wait_for if
the page is dynamic # wait_for=\"css:.crypto-row:nth-
child(20)\" cache_mode = CacheMode.BYPASS,
extraction_strategy=extraction_strategy, ) async with
AsyncWebCrawler(verbose=True) as crawler: # 4. Run the
crawl and extraction result = await
crawler.arun( url=\"https://example.com/crypto-
prices\", config=config ) if not
result.success: print(\"Crawl failed:\",
result.error_message) return # 5. Parse
the extracted JSON data =
json.loads(result.extracted_content) print(f
\"Extracted {len(data)} coin entries\")
print(json.dumps(data[0], indent=2) if data else \"No data
found\") asyncio.run(extract_crypto_prices())`\n
\n**Highlights**:\n\n* **`baseSelector`**: Tells us where
each “item” (crypto row) is.\n* **`fields`**: Two fields
(`coin_name`, `price`) using simple CSS selectors.\n* Each
field defines a **`type`** (e.g., `text`, `attribute`, `html`,
`regex`, etc.).\n\nNo LLM is needed, and the performance is
**near-instant** for hundreds or thousands of items.\n\n* * *
\n\n### **XPath Example with `raw://` HTML**\n\nBelow is a
short example demonstrating **XPath** extraction plus the
**`raw://`** scheme. We’ll pass a **dummy HTML** directly
(no network request) and define the extraction strategy in
`CrawlerRunConfig`.\n\n`import json import asyncio from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.extraction_strategy import
JsonXPathExtractionStrategy async def
extract_crypto_prices_xpath(): # 1. Minimal dummy HTML
with some repeating rows dummy_html = \"\"\" <html>
<body> <div class='crypto-row'> <h2
class='coin-name'>Bitcoin</h2> <span class='coin-
price'>$28,000</span> </div> <div
class='crypto-row'> <h2 class='coin-name'>
Ethereum</h2> <span class='coin-price'>$1,800</span>
</div> </body> </html> \"\"\" # 2. Define
the JSON schema (XPath version) schema = { \"name
\": \"Crypto Prices via XPath\", \"baseSelector\":
\"//div[@class='crypto-row']\", \"fields\":
[ { \"name\": \"coin_name\",
\"selector\": \".//h2[@class='coin-name']\",
\"type\": \"text\" },
{ \"name\": \"price\",
\"selector\": \".//span[@class='coin-price']\",
\"type\": \"text\" } ] } # 3.
Place the strategy in the CrawlerRunConfig config =
CrawlerRunConfig( extraction_strategy=JsonXPathExtract
ionStrategy(schema, verbose=True) ) # 4. Use raw://
scheme to pass dummy_html directly raw_url = f
\"raw://{dummy_html}\" async with
AsyncWebCrawler(verbose=True) as crawler: result =
await crawler.arun( url=raw_url,
config=config ) if not result.success:
print(\"Crawl failed:\", result.error_message)
return data = json.loads(result.extracted_content)
print(f\"Extracted {len(data)} coin rows\") if data:
print(\"First item:\", data[0])
asyncio.run(extract_crypto_prices_xpath())`\n\n**Key Points**:
\n\n1. **`JsonXPathExtractionStrategy`** is used instead of
`JsonCssExtractionStrategy`. \n2. **`baseSelector`** and
each field’s `\"selector\"` use **XPath** instead of CSS.
\n3. **`raw://`** lets us pass `dummy_html` with no real
network request—handy for local testing. \n4\\. Everything
(including the extraction strategy) is in
**`CrawlerRunConfig`**.\n\nThat’s how you keep the config
self-contained, illustrate **XPath** usage, and demonstrate
the **raw** scheme for direct HTML input—all while avoiding
the old approach of passing `extraction_strategy` directly to
`arun()`.\n\n* * *\n\n## 3\\. Advanced Schema & Nested
Structures\n\nReal sites often have **nested** or repeated
data—like categories containing products, which themselves
have a list of reviews or features. For that, we can define
**nested** or **list** (and even **nested\\_list**) fields.\n
\n### Sample E-Commerce HTML\n\nWe have a **sample e-
commerce** HTML file on GitHub (example):\n
\n`https://gist.githubusercontent.com/githubusercontent/2d7b8b
a3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920
dee5e1fd2/sample_ecommerce.html`\n\nThis snippet includes
categories, products, features, reviews, and related items.
Let’s see how to define a schema that fully captures that
structure **without LLM**.\n\n`schema = { \"name\": \"E-
commerce Product Catalog\", \"baseSelector\":
\"div.category\", # (1) We can define optional baseFields
if we want to extract attributes # from the category
container \"baseFields\": [ {\"name\":
\"data_cat_id\", \"type\": \"attribute\", \"attribute\":
\"data-cat-id\"}, ], \"fields\":
[ { \"name\": \"category_name\",
\"selector\": \"h2.category-name\", \"type\":
\"text\" }, { \"name\": \"products
\", \"selector\": \"div.product\",
\"type\": \"nested_list\", # repeated sub-objects
\"fields\": [ { \"name\":
\"name\", \"selector\": \"h3.product-name
\", \"type\": \"text\" },
{ \"name\": \"price\",
\"selector\": \"p.product-price\", \"type
\": \"text\" },
{ \"name\": \"details\",
\"selector\": \"div.product-details\",
\"type\": \"nested\", # single sub-object
\"fields\":
[ { \"name
\": \"brand\", \"selector\":
\"span.brand\", \"type\": \"text
\" },
{ \"name\": \"model\",
\"selector\": \"span.model\",
\"type\": \"text
\" } ]
}, { \"name\": \"features
\", \"selector\": \"ul.product-features li
\", \"type\": \"list\",
\"fields\": [ {\"name\": \"feature\",
\"type\": \"text\"} ] },
{ \"name\": \"reviews\",
\"selector\": \"div.review\", \"type\":
\"nested_list\", \"fields\":
[ { \"name
\": \"reviewer\", \"selector\":
\"span.reviewer\", \"type\":
\"text\" },
{ \"name\": \"rating\",
\"selector\": \"span.rating\",
\"type\": \"text\" },
{ \"name\": \"comment\",
\"selector\": \"p.review-text\",
\"type\": \"text
\" } ]
}, { \"name\":
\"related_products\", \"selector\":
\"ul.related-products li\", \"type\":
\"list\", \"fields\":
[ { \"name
\": \"name\", \"selector\":
\"span.related-name\", \"type\":
\"text\" },
{ \"name\": \"price\",
\"selector\": \"span.related-price\",
\"type\": \"text
\" } ]
} ] } ] }`\n\nKey Takeaways:\n\n*
**Nested vs. List**:\n* **`type: \"nested\"`** means a
**single** sub-object (like `details`).\n* **`type: \"list
\"`** means multiple items that are **simple** dictionaries or
single text fields.\n* **`type: \"nested_list\"`** means
repeated **complex** objects (like `products` or `reviews`).
\n* **Base Fields**: We can extract **attributes** from the
container element via `\"baseFields\"`. For instance, `
\"data_cat_id\"` might be `data-cat-id=\"elect123\"`.\n*
**Transforms**: We can also define a `transform` if we want to
lower/upper case, strip whitespace, or even run a custom
function.\n\n`import json import asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
ecommerce_schema = { # ... the advanced schema from
above ... } async def extract_ecommerce_data(): strategy
= JsonCssExtractionStrategy(ecommerce_schema, verbose=True)
config = CrawlerRunConfig() async with
AsyncWebCrawler(verbose=True) as crawler: result =
await crawler.arun( url=
\"https://gist.githubusercontent.com/githubusercontent/2d7b8ba
3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920d
ee5e1fd2/sample_ecommerce.html\",
extraction_strategy=strategy,
config=config ) if not result.success:
print(\"Crawl failed:\", result.error_message)
return # Parse the JSON output data =
json.loads(result.extracted_content)
print(json.dumps(data, indent=2) if data else \"No data found.
\") asyncio.run(extract_ecommerce_data())`\n\nIf all goes
well, you get a **structured** JSON array with each “category,” containing an array of `products`. Each product
includes `details`, `features`, `reviews`, etc. All of that
**without** an LLM.\n\n* * *\n\n## 4\\. Why “No LLM” Is
Often Better\n\n1. **Zero Hallucination**: Schema-based
extraction doesn’t guess text. It either finds it or not.
\n2. **Guaranteed Structure**: The same schema yields
consistent JSON across many pages, so your downstream pipeline
can rely on stable keys. \n3. **Speed**: LLM-based
extraction can be 10–1000x slower for large-scale crawling.
\n4. **Scalable**: Adding or updating a field is a matter of
adjusting the schema, not re-tuning a model.\n\n**When might
you consider an LLM?** Possibly if the site is extremely
unstructured or you want AI summarization. But always try a
schema approach first for repeated or consistent data
patterns.\n\n* * *\n\n## 5\\. Base Element Attributes &
Additional Fields\n\nIt’s easy to **extract attributes**
(like `href`, `src`, or `data-xxx`) from your base or nested
elements using:\n\n`{ \"name\": \"href\", \"type\":
\"attribute\", \"attribute\": \"href\", \"default\":
null }`\n\nYou can define them in **`baseFields`** (extracted
from the main container element) or in each field’s sub-
lists. This is especially helpful if you need an item’s link
or ID stored in the parent `<div>`.\n\n* * *\n\n## 6\\.
Putting It All Together: Larger Example\n\nConsider a blog
site. We have a schema that extracts the **URL** from each
post card (via `baseFields` with an `\"attribute\": \"href
\"`), plus the title, date, summary, and author:\n\n`schema =
{ \"name\": \"Blog Posts\", \"baseSelector\": \"a.blog-
post-card\", \"baseFields\": [ {\"name\": \"post_url\",
\"type\": \"attribute\", \"attribute\": \"href\"} ],
\"fields\": [ {\"name\": \"title\", \"selector\":
\"h2.post-title\", \"type\": \"text\", \"default\": \"No Title
\"}, {\"name\": \"date\", \"selector\": \"time.post-date
\", \"type\": \"text\", \"default\": \"\"}, {\"name\":
\"summary\", \"selector\": \"p.post-summary\", \"type\":
\"text\", \"default\": \"\"}, {\"name\": \"author\",
\"selector\": \"span.post-author\", \"type\": \"text\",
\"default\": \"\"} ] }`\n\nThen run with
`JsonCssExtractionStrategy(schema)` to get an array of blog
post objects, each with `\"post_url\"`, `\"title\"`, `\"date
\"`, `\"summary\"`, `\"author\"`.\n\n* * *\n\n## 7\\. Tips &
Best Practices\n\n1. **Inspect the DOM** in Chrome DevTools
or Firefox’s Inspector to find stable selectors. \n2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists. \n3. **Test** your schema on partial HTML or a test page before a
big crawl. \n4. **Combine with JS Execution** if the site
loads content dynamically. You can pass `js_code` or
`wait_for` in `CrawlerRunConfig`. \n5. **Look at Logs**
when `verbose=True`: if your selectors are off or your schema
is malformed, it’ll often show warnings. \n6. **Use
baseFields** if you need attributes from the container element
(e.g., `href`, `data-id`), especially for the “parent” item. \n7. **Performance**: For large pages, make sure your
selectors are as narrow as possible.\n\n* * *\n\n## 8\\.
Schema Generation Utility\n\nWhile manually crafting schemas
is powerful and precise, Crawl4AI now offers a convenient
utility to **automatically generate** extraction schemas using
LLM. This is particularly useful when:\n\n1. You're dealing
with a new website structure and want a quick starting point
\n2. You need to extract complex nested data structures\n3.
You want to avoid the learning curve of CSS/XPath selector
syntax\n\n### Using the Schema Generator\n\nThe schema
generator is available as a static method on both
`JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`.
You can choose between OpenAI's GPT-4 or the open-source
Ollama for schema generation:\n\n`from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy,
JsonXPathExtractionStrategy from crawl4ai.async_configs import
LlmConfig # Sample HTML with product information html =
\"\"\" <div class=\"product-card\"> <h2 class=\"title\">
Gaming Laptop</h2> <div class=\"price\">$999.99</div>
<div class=\"specs\"> <ul> <li>16GB
RAM</li> <li>1TB SSD</li> </ul> </div>
</div> \"\"\" # Option 1: Using OpenAI (requires API token)
css_schema =
JsonCssExtractionStrategy.generate_schema( html,
schema_type=\"css\", llmConfig = LlmConfig(provider=
\"openai/gpt-4o\",api_token=\"your-openai-token\") ) # Option
2: Using Ollama (open source, no token needed) xpath_schema =
JsonXPathExtractionStrategy.generate_schema( html,
schema_type=\"xpath\", llmConfig = LlmConfig(provider=
\"ollama/llama3.3\", api_token=None) # Not needed for
Ollama ) # Use the generated schema for fast, repeated
extractions strategy = JsonCssExtractionStrategy(css_schema)`
\n\n### LLM Provider Options\n\n1. **OpenAI GPT-4 (`openai/gpt-4o`)**\n* Default provider\n* Requires an API token\n* Generally provides more accurate schemas\n* Set via environment variable: `OPENAI_API_KEY`\n\n2. **Ollama (`ollama/llama3.3`)**\n* Open source alternative\n* No API token required\n* Self-hosted option\n* Good for
development and testing\n\n### Benefits of Schema Generation\n
\n1. **One-Time Cost**: While schema generation uses LLM,
it's a one-time cost. The generated schema can be reused for
unlimited extractions without further LLM calls.\n2. **Smart
Pattern Recognition**: The LLM analyzes the HTML structure and
identifies common patterns, often producing more robust
selectors than manual attempts.\n3. **Automatic Nesting**:
Complex nested structures are automatically detected and
properly represented in the schema.\n4. **Learning Tool**:
The generated schemas serve as excellent examples for learning
how to write your own schemas.\n\n### Best Practices\n\n1.
**Review Generated Schemas**: While the generator is smart,
always review and test the generated schema before using it in
production.\n2. **Provide Representative HTML**: The better
your sample HTML represents the overall structure, the more
accurate the generated schema will be.\n3. **Consider Both
CSS and XPath**: Try both schema types and choose the one that
works best for your specific case.\n4. **Cache Generated
Schemas**: Since generation uses LLM, save successful schemas
for reuse (see the sketch after this list).\n5. **API Token Security**: Never hardcode API
tokens. Use environment variables or secure configuration
management.\n6. **Choose Provider Wisely**:\n* Use OpenAI for production-quality schemas\n* Use Ollama for development, testing, or when you need a self-hosted solution
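A minimal sketch of point 4 (caching generated schemas) might look like the following. The cache path and the environment-variable handling are illustrative choices, and it assumes `generate_schema` returns a JSON-serializable schema dict, as the usage example above implies.

```python
# Sketch: generate a schema once, cache it to disk, and reuse it on later runs.
# The file path is a placeholder; generate_schema/LlmConfig usage mirrors the
# Schema Generation Utility example above, with the token read from the
# OPENAI_API_KEY environment variable as recommended.
import json
import os
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import LlmConfig

SCHEMA_PATH = "schemas/product_card.json"

def load_or_generate_schema(sample_html: str) -> dict:
    if os.path.exists(SCHEMA_PATH):
        with open(SCHEMA_PATH) as f:
            return json.load(f)  # reuse the cached schema: no further LLM calls
    schema = JsonCssExtractionStrategy.generate_schema(
        sample_html,
        schema_type="css",
        llmConfig=LlmConfig(provider="openai/gpt-4o", api_token=os.environ["OPENAI_API_KEY"]),
    )
    os.makedirs(os.path.dirname(SCHEMA_PATH), exist_ok=True)
    with open(SCHEMA_PATH, "w") as f:
        json.dump(schema, f, indent=2)  # one-time LLM cost, cached for reuse
    return schema

# Later: strategy = JsonCssExtractionStrategy(load_or_generate_schema(sample_html))
```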
\n\nThat's it for **Extracting JSON (No LLM)**! You've seen
how schema-based approaches (either CSS or XPath) can handle
everything from simple lists to deeply nested product
catalogs—instantly, with minimal overhead. Enjoy building
robust scrapers that produce consistent, structured JSON for
your data pipelines!\n\n* * *\n\n## 9\\. Conclusion\n\nWith
**JsonCssExtractionStrategy** (or
**JsonXPathExtractionStrategy**), you can build powerful,
**LLM-free** pipelines that:\n\n* Scrape any consistent site
for structured data.\n* Support nested objects, repeating
lists, or advanced transformations.\n* Scale to thousands of
pages quickly and reliably.\n\n**Next Steps**:\n\n* Combine
your extracted JSON with advanced filtering or summarization
in a second pass if needed.\n* For dynamic pages, combine
strategies with `js_code` or infinite scroll hooking to ensure
all content is loaded.\n\n**Remember**: For repeated,
structured data, you don’t need to pay for or wait on an
LLM. A well-crafted schema plus CSS or XPath gets you the data
faster, cleaner, and cheaper—**the real power** of Crawl4AI.
\n\n**Last Updated**: 2025-01-01",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/multi-url-
crawling/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/multi-
url-crawling/",
"loadedTime": "2025-03-05T23:17:19.654Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/multi-
url-crawling/",
"title": "Multi-URL Crawling - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:18 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"77a0d5542f197fe8137c13cbf49c8b08\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Multi-URL Crawling - Crawl4AI Documentation
(v0.5.x)\nAdvanced Multi-URL Crawling with Dispatchers\nHeads
Up: Crawl4AI supports advanced dispatchers for parallel or
throttled crawling, providing dynamic rate limiting and memory
usage checks. The built-in arun_many() function uses these
dispatchers to handle concurrency efficiently.\n1.
Introduction\nWhen crawling many URLs:\nBasic: Use arun() in a
loop (simple but less efficient)\nBetter: Use arun_many(),
which efficiently handles multiple URLs with proper
concurrency control (see the minimal example below)\nBest: Customize dispatcher behavior for
your specific needs (memory management, rate limits,
etc.)\nWhy Dispatchers? \nAdaptive: Memory-based dispatchers
can pause or slow down based on system resources\nRate-
limiting: Built-in rate limiting with exponential backoff for
429/503 responses\nReal-time Monitoring: Live dashboard of
ongoing tasks, memory usage, and performance\nFlexibility:
Choose between memory-adaptive or semaphore-based concurrency
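\nAs a minimal reference for the arun_many() pattern (a sketch with
placeholder URLs; with no dispatcher argument, the default
MemoryAdaptiveDispatcher described below is used):\nimport asyncio\n
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode\n
\n
async def main():\n
    urls = [\"https://example.com/page1\", \"https://example.com/page2\"]\n
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)\n
    async with AsyncWebCrawler() as crawler:\n
        # No dispatcher argument: arun_many() falls back to the default\n
        results = await crawler.arun_many(urls=urls, config=run_config)\n
        for result in results:\n
            print(result.url, \"OK\" if result.success else result.error_message)\n
\n
asyncio.run(main())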
\n2. Core Components\n2.1 Rate Limiter\nclass RateLimiter: def
__init__( # Random delay range between requests base_delay:
Tuple[float, float] = (1.0, 3.0), # Maximum backoff delay
max_delay: float = 60.0, # Retries before giving up
max_retries: int = 3, # Status codes triggering backoff
rate_limit_codes: List[int] = [429, 503] ) \nRateLimiter Constructor
Parameters\nThe RateLimiter is a utility that helps manage the
pace of requests to avoid overloading servers or getting
blocked due to rate limits. It operates internally to delay
requests and handle retries but can be configured using its
constructor parameters.\nParameters of the RateLimiter
constructor:\n1. base_delay (Tuple[float, float], default:
(1.0, 3.0))\nThe range for a random delay (in seconds) between
consecutive requests to the same domain.\nA random delay is
chosen between base_delay[0] and base_delay[1] for each
request. \nThis prevents sending requests at a predictable
frequency, reducing the chances of triggering rate limits.
\nExample:\nIf base_delay = (2.0, 5.0), delays could be
randomly chosen as 2.3s, 4.1s, etc.\n2. max_delay (float,
default: 60.0)\nThe maximum allowable delay when rate-limiting
errors occur.\nWhen servers return rate-limit responses (e.g.,
429 or 503), the delay increases exponentially with jitter.
\nThe max_delay ensures the delay doesn’t grow unreasonably
high, capping it at this value.\nExample:\nFor a max_delay =
30.0, even if backoff calculations suggest a delay of 45s, it
will cap at 30s.\n3. max_retries (int, default: 3)\nThe
maximum number of retries for a request if rate-limiting
errors occur.\nAfter encountering a rate-limit response, the
RateLimiter retries the request up to this number of times.
\nIf all retries fail, the request is marked as failed, and
the process continues.\nExample:\nIf max_retries = 3, the
system retries a failed request three times before giving up.
\n4. rate_limit_codes (List[int], default: [429, 503])\nA list
of HTTP status codes that trigger the rate-limiting logic.
\nThese status codes indicate the server is overwhelmed or
actively limiting requests. \nYou can customize this list to
include other codes based on specific server behavior.
\nExample:\nIf rate_limit_codes = [429, 503, 504], the crawler
will back off on these three error codes.\nHow to Use the
RateLimiter:\nHere’s an example of initializing and using a
RateLimiter in your project:\nfrom crawl4ai import RateLimiter
# Create a RateLimiter with custom settings rate_limiter =
RateLimiter( base_delay=(2.0, 4.0), # Random delay between 2-4
seconds max_delay=30.0, # Cap delay at 30 seconds max_retries=
5, # Retry up to 5 times on rate-limiting errors
rate_limit_codes=[429, 503] # Handle these HTTP status codes )
# RateLimiter will handle delays and retries internally # No
additional setup is required for its operation \nThe
RateLimiter integrates seamlessly with dispatchers like
MemoryAdaptiveDispatcher and SemaphoreDispatcher, ensuring
requests are paced correctly without user intervention. Its
internal mechanisms manage delays and retries to avoid
overwhelming servers while maximizing efficiency.\n2.2 Crawler
Monitor\nThe CrawlerMonitor provides real-time visibility into
crawling operations:\nfrom crawl4ai import CrawlerMonitor,
DisplayMode monitor = CrawlerMonitor( # Maximum rows in live
display max_visible_rows=15, # DETAILED or AGGREGATED view
display_mode=DisplayMode.DETAILED ) \nDisplay Modes:
\nDETAILED: Shows individual task status, memory usage, and
timing\nAGGREGATED: Displays summary statistics and overall
progress\n3. Available Dispatchers\n3.1
MemoryAdaptiveDispatcher (Default)\nAutomatically manages
concurrency based on system memory usage:\nfrom
crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=90.0, #
Pause if memory exceeds this check_interval=1.0, # How often
to check memory max_session_permit=10, # Maximum concurrent
tasks rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2 ),
monitor=CrawlerMonitor( # Optional monitoring
max_visible_rows=15, display_mode=DisplayMode.DETAILED ) )
\nConstructor Parameters:\n1. memory_threshold_percent (float,
default: 90.0)\nSpecifies the memory usage threshold (as a
percentage). If system memory usage exceeds this value, the
dispatcher pauses crawling to prevent system overload.\n2.
check_interval (float, default: 1.0)\nThe interval (in
seconds) at which the dispatcher checks system memory usage.
\n3. max_session_permit (int, default: 10)\nThe maximum number
of concurrent crawling tasks allowed. This ensures resource
limits are respected while maintaining concurrency.\n4.
memory_wait_timeout (float, default: 300.0)\nOptional timeout
(in seconds). If memory usage exceeds memory_threshold_percent
for longer than this duration, a MemoryError is raised.\n5.
rate_limiter (RateLimiter, default: None)\nOptional rate-
limiting logic to avoid server-side blocking (e.g., for
handling 429 or 503 errors). See RateLimiter for details.\n6.
monitor (CrawlerMonitor, default: None)\nOptional monitoring
for real-time task tracking and performance insights. See
CrawlerMonitor for details.\n3.2 SemaphoreDispatcher\nProvides
simple concurrency control with a fixed limit:\nfrom
crawl4ai.async_dispatcher import SemaphoreDispatcher
dispatcher = SemaphoreDispatcher( max_session_permit=20, #
Maximum concurrent tasks rate_limiter=RateLimiter( # Optional
rate limiting base_delay=(0.5, 1.0), max_delay=10.0 ),
monitor=CrawlerMonitor( # Optional monitoring
max_visible_rows=15, display_mode=DisplayMode.DETAILED ) )
\nConstructor Parameters:\n1. max_session_permit (int,
default: 20)\nThe maximum number of concurrent crawling tasks
allowed, irrespective of semaphore slots.\n2. rate_limiter
(RateLimiter, default: None)\nOptional rate-limiting logic to
avoid overwhelming servers. See RateLimiter for details.\n3.
monitor (CrawlerMonitor, default: None)\nOptional monitoring
for tracking task progress and resource usage. See
CrawlerMonitor for details.\n4. Usage Examples\n4.1 Batch
Processing (Default)\nasync def crawl_batch(): browser_config
= BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS, stream=False #
Default: get all results at once ) dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=70.0,
check_interval=1.0, max_session_permit=10,
monitor=CrawlerMonitor( display_mode=DisplayMode.DETAILED ) )
async with AsyncWebCrawler(config=browser_config) as crawler:
# Get all results at once results = await
crawler.arun_many( urls=urls, config=run_config,
dispatcher=dispatcher ) # Process all results after completion
for result in results: if result.success: await
process_result(result) else: print(f\"Failed to crawl
{result.url}: {result.error_message}\") \nReview:\n- Purpose:
Executes a batch crawl with all URLs processed together after
crawling is complete.\n- Dispatcher: Uses
MemoryAdaptiveDispatcher to manage concurrency and system
memory.\n- Stream: Disabled (stream=False), so all results are
collected at once for post-processing.\n- Best Use Case: When
you need to analyze results in bulk rather than individually
during the crawl.\n4.2 Streaming Mode\nasync def
crawl_streaming(): browser_config =
BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS, stream=True #
Enable streaming mode ) dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=70.0,
check_interval=1.0, max_session_permit=10,
monitor=CrawlerMonitor( display_mode=DisplayMode.DETAILED ) )
async with AsyncWebCrawler(config=browser_config) as crawler:
# Process results as they become available async for result in
await crawler.arun_many( urls=urls, config=run_config,
dispatcher=dispatcher ): if result.success: # Process each
result immediately await process_result(result) else: print(f
\"Failed to crawl {result.url}: {result.error_message}\")
\nReview:\n- Purpose: Enables streaming to process results as
soon as they’re available.\n- Dispatcher: Uses
MemoryAdaptiveDispatcher for concurrency and memory
management.\n- Stream: Enabled (stream=True), allowing real-
time processing during crawling.\n- Best Use Case: When you
need to act on results immediately, such as for real-time
analytics or progressive data storage.\n4.3 Semaphore-based
Crawling\nasync def crawl_with_semaphore(urls): browser_config
= BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig(cache_mode=CacheMode.BYPASS) dispatcher =
SemaphoreDispatcher( semaphore_count=5,
rate_limiter=RateLimiter( base_delay=(0.5, 1.0), max_delay=
10.0 ), monitor=CrawlerMonitor( max_visible_rows=15,
display_mode=DisplayMode.DETAILED ) ) async with
AsyncWebCrawler(config=browser_config) as crawler: results =
await crawler.arun_many( urls, config=run_config,
dispatcher=dispatcher ) return results \nReview:\n- Purpose:
Uses SemaphoreDispatcher to limit concurrency with a fixed
number of slots.\n- Dispatcher: Configured with a semaphore to
control parallel crawling tasks.\n- Rate Limiter: Prevents
servers from being overwhelmed by pacing requests.\n- Best Use
Case: When you want precise control over the number of
concurrent requests, independent of system memory.\n4.4
Robots.txt Consideration\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def main():
urls = [ \"https://example1.com\", \"https://example2.com\",
\"https://example3.com\" ] config =
CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
check_robots_txt=True, # Will respect robots.txt for each URL
semaphore_count=3 # Max concurrent requests ) async with
AsyncWebCrawler() as crawler: async for result in
crawler.arun_many(urls, config=config): if result.success:
print(f\"Successfully crawled {result.url}\") elif
result.status_code == 403 and \"robots.txt\" in
result.error_message: print(f\"Skipped {result.url} - blocked
by robots.txt\") else: print(f\"Failed to crawl {result.url}:
{result.error_message}\") if __name__ == \"__main__\":
asyncio.run(main()) \nReview:\n- Purpose: Ensures compliance
with robots.txt rules for ethical and legal web crawling.\n-
Configuration: Set check_robots_txt=True to validate each URL
against robots.txt before crawling.\n- Dispatcher: Handles
requests with concurrency limits (semaphore_count=3).\n- Best
Use Case: When crawling websites that strictly enforce
robots.txt policies or for responsible crawling practices.\n5.
Dispatch Results\nEach crawl result includes dispatch
information:\n@dataclass class DispatchResult: task_id: str
memory_usage: float peak_memory: float start_time: datetime
end_time: datetime error_message: str = \"\" \nAccess via
result.dispatch_result:\nfor result in results: if
result.success: dr = result.dispatch_result print(f\"URL:
{result.url}\") print(f\"Memory: {dr.memory_usage:.1f}MB\")
print(f\"Duration: {dr.end_time - dr.start_time}\") \n6.
Summary\n1. Two Dispatcher Types:\nMemoryAdaptiveDispatcher
(default): Dynamic concurrency based on memory
\nSemaphoreDispatcher: Fixed concurrency limit\n2. Optional
Components:\nRateLimiter: Smart request pacing and backoff
\nCrawlerMonitor: Real-time progress visualization\n3. Key
Benefits:\nAutomatic memory management\nBuilt-in rate limiting
\nLive progress monitoring\nFlexible concurrency control
\nChoose the dispatcher that best fits your needs:
\nMemoryAdaptiveDispatcher: For large crawls or limited
resources\nSemaphoreDispatcher: For simple, fixed-concurrency
scenarios",
"markdown": "# Multi-URL Crawling - Crawl4AI Documentation
(v0.5.x)\n\n## Advanced Multi-URL Crawling with Dispatchers\n
\n> **Heads Up**: Crawl4AI supports advanced dispatchers for
**parallel** or **throttled** crawling, providing dynamic rate
limiting and memory usage checks. The built-in `arun_many()`
function uses these dispatchers to handle concurrency
efficiently.\n\n## 1\\. Introduction\n\nWhen crawling many
URLs:\n\n* **Basic**: Use `arun()` in a loop (simple but
less efficient)\n* **Better**: Use `arun_many()`, which
efficiently handles multiple URLs with proper concurrency
control\n* **Best**: Customize dispatcher behavior for your
specific needs (memory management, rate limits, etc.)\n\n**Why
Dispatchers?**\n\n* **Adaptive**: Memory-based dispatchers
can pause or slow down based on system resources\n* **Rate-
limiting**: Built-in rate limiting with exponential backoff
for 429/503 responses\n* **Real-time Monitoring**: Live
dashboard of ongoing tasks, memory usage, and performance\n*
**Flexibility**: Choose between memory-adaptive or semaphore-
based concurrency\n\n* * *\n\n## 2\\. Core Components\n\n###
2.1 Rate Limiter\n\n`class RateLimiter: def
__init__( # Random delay range between requests
base_delay: Tuple[float, float] = (1.0, 3.0), #
Maximum backoff delay max_delay: float = 60.0,
# Retries before giving up max_retries: int = 3,
# Status codes triggering backoff rate_limit_codes:
List[int] = [429, 503] )`\n\n#### RateLimiter Constructor Parameters
\n\nThe **RateLimiter** is a utility that helps manage the
pace of requests to avoid overloading servers or getting
blocked due to rate limits. It operates internally to delay
requests and handle retries but can be configured using its
constructor parameters.\n\n**Parameters of the `RateLimiter`
constructor:**\n\n1. **`base_delay`** (`Tuple[float,
float]`, default: `(1.0, 3.0)`) \n  The range for a
random delay (in seconds) between consecutive requests to the
same domain.\n\n* A random delay is chosen between
`base_delay[0]` and `base_delay[1]` for each request.\n*
This prevents sending requests at a predictable frequency,
reducing the chances of triggering rate limits.\n\n**Example:
** \nIf `base_delay = (2.0, 5.0)`, delays could be randomly
chosen as `2.3s`, `4.1s`, etc.\n\n* * *\n\n2. **`max_delay`**
(`float`, default: `60.0`) \n  The
maximum allowable delay when rate-limiting errors occur.\n\n*
When servers return rate-limit responses (e.g., 429 or 503),
the delay increases exponentially with jitter.\n* The
`max_delay` ensures the delay doesn’t grow unreasonably
high, capping it at this value.\n\n**Example:** \nFor a
`max_delay = 30.0`, even if backoff calculations suggest a
delay of `45s`, it will cap at `30s`.\n\n* * *\n\n3. **`max_retries`**
(`int`, default: `3`) \n  The maximum
number of retries for a request if rate-limiting errors occur.
\n\n* After encountering a rate-limit response, the
`RateLimiter` retries the request up to this number of times.
\n* If all retries fail, the request is marked as failed,
and the process continues.\n\n**Example:** \nIf `max_retries
= 3`, the system retries a failed request three times before
giving up.\n\n* * *\n\n4. **`rate_limit_codes`**
(`List[int]`, default: `[429, 503]`) \n  A list of HTTP
status codes that trigger the rate-limiting logic.\n\n*
These status codes indicate the server is overwhelmed or
actively limiting requests.\n* You can customize this list
to include other codes based on specific server behavior.\n
\n**Example:** \nIf `rate_limit_codes = [429, 503, 504]`, the
crawler will back off on these three error codes.\n\n* * *\n
\n**How to Use the `RateLimiter`:**\n\nHere’s an example of
initializing and using a `RateLimiter` in your project:\n
\n`from crawl4ai import RateLimiter # Create a RateLimiter
with custom settings rate_limiter =
RateLimiter( base_delay=(2.0, 4.0), # Random delay
between 2-4 seconds max_delay=30.0, # Cap delay at
30 seconds max_retries=5, # Retry up to 5 times
on rate-limiting errors rate_limit_codes=[429, 503] #
Handle these HTTP status codes ) # RateLimiter will handle
delays and retries internally # No additional setup is
required for its operation`\n\nThe `RateLimiter` integrates
seamlessly with dispatchers like `MemoryAdaptiveDispatcher`
and `SemaphoreDispatcher`, ensuring requests are paced
correctly without user intervention. Its internal mechanisms
manage delays and retries to avoid overwhelming servers while
maximizing efficiency.\n\n### 2.2 Crawler Monitor\n\nThe
CrawlerMonitor provides real-time visibility into crawling
operations:\n\n`from crawl4ai import CrawlerMonitor,
DisplayMode monitor = CrawlerMonitor( # Maximum rows in
live display max_visible_rows=15, #
DETAILED or AGGREGATED view
display_mode=DisplayMode.DETAILED )`\n\n**Display Modes**:\n
\n1. **DETAILED**: Shows individual task status, memory
usage, and timing\n2. **AGGREGATED**: Displays summary
statistics and overall progress\n\n* * *\n\n## 3\\. Available
Dispatchers\n\n### 3.1 MemoryAdaptiveDispatcher (Default)\n
\nAutomatically manages concurrency based on system memory
usage:\n\n`from crawl4ai.async_dispatcher import
MemoryAdaptiveDispatcher dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=90.0,
# Pause if memory exceeds this check_interval=1.0,
# How often to check memory max_session_permit=10,
# Maximum concurrent tasks
rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(1.0, 2.0), max_delay=30.0,
max_retries=2 ), monitor=CrawlerMonitor( #
Optional monitoring max_visible_rows=15,
display_mode=DisplayMode.DETAILED ) )`\n\n**Constructor
Parameters:**\n\n1. **`memory_threshold_percent`** (`float`,
default: `90.0`) \n  Specifies the memory usage threshold
(as a percentage). If system memory usage exceeds this value,
the dispatcher pauses crawling to prevent system overload.\n
\n2. **`check_interval`** (`float`, default: `1.0`) \n  The
interval (in seconds) at which the dispatcher checks
system memory usage.\n\n3. **`max_session_permit`** (`int`,
default: `10`) \n  The maximum number of concurrent
crawling tasks allowed. This ensures resource limits are
respected while maintaining concurrency.\n\n4. **`memory_wait_timeout`**
(`float`, default: `300.0`) \n  Optional timeout (in seconds).
If memory usage exceeds
`memory_threshold_percent` for longer than this duration, a
`MemoryError` is raised.\n\n5. **`rate_limiter`**
(`RateLimiter`, default: `None`) \n  Optional rate-
limiting logic to avoid server-side blocking (e.g., for
handling 429 or 503 errors). See **RateLimiter** for details.
\n\n6. **`monitor`** (`CrawlerMonitor`, default: `None`)
\n  Optional monitoring for real-time task tracking and
performance insights. See **CrawlerMonitor** for details.\n\n*
* *\n\n### 3.2 SemaphoreDispatcher\n\nProvides simple
concurrency control with a fixed limit:\n\n`from
crawl4ai.async_dispatcher import SemaphoreDispatcher
dispatcher = SemaphoreDispatcher( max_session_permit=20,
# Maximum concurrent tasks
rate_limiter=RateLimiter( # Optional rate limiting
base_delay=(0.5, 1.0), max_delay=10.0 ),
monitor=CrawlerMonitor( # Optional monitoring
max_visible_rows=15,
display_mode=DisplayMode.DETAILED ) )`\n\n**Constructor
Parameters:**\n\n1. **`max_session_permit`** (`int`,
default: `20`) \n  The maximum number of concurrent
crawling tasks allowed, irrespective of semaphore slots.\n
\n2. **`rate_limiter`** (`RateLimiter`, default: `None`)
\n  Optional rate-limiting logic to avoid overwhelming
servers. See **RateLimiter** for details.\n\n3. **`monitor`**
(`CrawlerMonitor`, default: `None`) \n  Optional monitoring for
tracking task progress and resource
usage. See **CrawlerMonitor** for details.\n\n* * *\n\n## 4\\.
Usage Examples\n\n### 4.1 Batch Processing (Default)\n\n`async
def crawl_batch(): browser_config =
BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
stream=False # Default: get all results at once )
dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=
70.0, check_interval=1.0, max_session_permit=
10,
monitor=CrawlerMonitor( display_mode=DisplayMode.D
ETAILED ) ) async with
AsyncWebCrawler(config=browser_config) as crawler: #
Get all results at once results = await
crawler.arun_many( urls=urls,
config=run_config, dispatcher=dispatcher )
# Process all results after completion for result in
results: if result.success: await
process_result(result) else:
print(f\"Failed to crawl {result.url}:
{result.error_message}\")`\n\n**Review:** \n\\- **Purpose:**
Executes a batch crawl with all URLs processed together after
crawling is complete. \n\\- **Dispatcher:** Uses
`MemoryAdaptiveDispatcher` to manage concurrency and system
memory. \n\\- **Stream:** Disabled (`stream=False`), so all
results are collected at once for post-processing. \n\\-
**Best Use Case:** When you need to analyze results in bulk
rather than individually during the crawl.\n\n* * *\n\n### 4.2
Streaming Mode\n\n`async def crawl_streaming():
browser_config = BrowserConfig(headless=True, verbose=False)
run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
stream=True # Enable streaming mode ) dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=
70.0, check_interval=1.0, max_session_permit=
10,
monitor=CrawlerMonitor( display_mode=DisplayMode.D
ETAILED ) ) async with
AsyncWebCrawler(config=browser_config) as crawler: #
Process results as they become available async for
result in await crawler.arun_many( urls=urls,
config=run_config,
dispatcher=dispatcher ): if
result.success: # Process each result
immediately await process_result(result)
else: print(f\"Failed to crawl {result.url}:
{result.error_message}\")`\n\n**Review:** \n\\- **Purpose:**
Enables streaming to process results as soon as they’re
available. \n\\- **Dispatcher:** Uses
`MemoryAdaptiveDispatcher` for concurrency and memory
management. \n\\- **Stream:** Enabled (`stream=True`),
allowing real-time processing during crawling. \n\\- **Best
Use Case:** When you need to act on results immediately, such
as for real-time analytics or progressive data storage.\n\n* *
*\n\n### 4.3 Semaphore-based Crawling\n\n`async def
crawl_with_semaphore(urls): browser_config =
BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig(cache_mode=CacheMode.BYPASS) dispatcher
= SemaphoreDispatcher( semaphore_count=5,
rate_limiter=RateLimiter( base_delay=(0.5, 1.0),
max_delay=10.0 ),
monitor=CrawlerMonitor( max_visible_rows=15,
display_mode=DisplayMode.DETAILED ) ) async
with AsyncWebCrawler(config=browser_config) as crawler:
results = await crawler.arun_many( urls,
config=run_config, dispatcher=dispatcher )
return results`\n\n**Review:** \n\\- **Purpose:** Uses
`SemaphoreDispatcher` to limit concurrency with a fixed number
of slots. \n\\- **Dispatcher:** Configured with a semaphore
to control parallel crawling tasks. \n\\- **Rate Limiter:**
Prevents servers from being overwhelmed by pacing requests.
\n\\- **Best Use Case:** When you want precise control over
the number of concurrent requests, independent of system
memory.\n\n* * *\n\n### 4.4 Robots.txt Consideration\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode async def main(): urls =
[ \"https://example1.com\",
\"https://example2.com\", \"https://example3.com
\" ] config =
CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
check_robots_txt=True, # Will respect robots.txt for each URL
semaphore_count=3 # Max concurrent requests )
async with AsyncWebCrawler() as crawler: async for
result in crawler.arun_many(urls, config=config):
if result.success: print(f\"Successfully
crawled {result.url}\") elif result.status_code ==
403 and \"robots.txt\" in result.error_message:
print(f\"Skipped {result.url} - blocked by robots.txt\")
else: print(f\"Failed to crawl {result.url}:
{result.error_message}\") if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Review:** \n\\- **Purpose:**
Ensures compliance with `robots.txt` rules for ethical and
legal web crawling. \n\\- **Configuration:** Set
`check_robots_txt=True` to validate each URL against
`robots.txt` before crawling. \n\\- **Dispatcher:** Handles
requests with concurrency limits (`semaphore_count=3`). \n\\-
**Best Use Case:** When crawling websites that strictly
enforce robots.txt policies or for responsible crawling
practices.\n\n* * *\n\n## 5\\. Dispatch Results\n\nEach crawl
result includes dispatch information:\n\n`@dataclass class
DispatchResult: task_id: str memory_usage: float
peak_memory: float start_time: datetime end_time:
datetime error_message: str = \"\"`\n\nAccess via
`result.dispatch_result`:\n\n`for result in results: if
result.success: dr = result.dispatch_result
print(f\"URL: {result.url}\") print(f\"Memory:
{dr.memory_usage:.1f}MB\") print(f\"Duration:
{dr.end_time - dr.start_time}\")`\n\n## 6\. Summary\n\n1. **Two
Dispatcher Types**:\n\n* MemoryAdaptiveDispatcher
(default): Dynamic concurrency based on memory\n*
SemaphoreDispatcher: Fixed concurrency limit\n\n2. **Optional
Components**:\n\n* RateLimiter: Smart request
pacing and backoff\n* CrawlerMonitor: Real-time progress
visualization\n\n3. **Key Benefits**:\n\n* Automatic
memory management\n* Built-in rate limiting\n* Live
progress monitoring\n* Flexible concurrency control\n
\nChoose the dispatcher that best fits your needs:\n\n*
**MemoryAdaptiveDispatcher**: For large crawls or limited
resources\n* **SemaphoreDispatcher**: For simple, fixed-
concurrency scenarios",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/extraction/clustring-
strategies/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/extraction/clustring-
strategies/",
"loadedTime": "2025-03-05T23:17:29.446Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/extraction/clustring-strategies/",
"title": "Clustering Strategies - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:28 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"4c541eadd267ed92507ddb26f86e3477\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Clustering Strategies - Crawl4AI Documentation
(v0.5.x)\nCosine Strategy\nThe Cosine Strategy in Crawl4AI
uses similarity-based clustering to identify and extract
relevant content sections from web pages. This strategy is
particularly useful when you need to find and extract content
based on semantic similarity rather than structural patterns.
\nHow It Works\nThe Cosine Strategy: 1. Breaks down page
content into meaningful chunks 2. Converts text into vector
representations 3. Calculates similarity between chunks 4.
Clusters similar content together 5. Ranks and filters content
based on relevance\nBasic Usage\nfrom
crawl4ai.extraction_strategy import CosineStrategy strategy =
CosineStrategy( semantic_filter=\"product reviews\", # Target
content type word_count_threshold=10, # Minimum words per
cluster sim_threshold=0.3 # Similarity threshold ) async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://example.com/reviews\",
extraction_strategy=strategy ) content =
result.extracted_content \nConfiguration Options\nCore
Parameters\nCosineStrategy( # Content Filtering
semantic_filter: str = None, # Keywords/topic for content
filtering word_count_threshold: int = 10, # Minimum words per
cluster sim_threshold: float = 0.3, # Similarity threshold
(0.0 to 1.0) # Clustering Parameters max_dist: float = 0.2, #
Maximum distance for clustering linkage_method: str = 'ward',
# Clustering linkage method top_k: int = 3, # Number of top
categories to extract # Model Configuration model_name: str =
'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable logging ) \nParameter Details
\n1. semantic_filter - Sets the target topic or content type -
Use keywords relevant to your desired content - Example:
\"technical specifications\", \"user reviews\", \"pricing
information\"\n2. sim_threshold - Controls how similar content
must be to be grouped together - Higher values (e.g., 0.8)
mean stricter matching - Lower values (e.g., 0.3) allow more
variation \n# Strict matching strategy =
CosineStrategy(sim_threshold=0.8) # Loose matching strategy =
CosineStrategy(sim_threshold=0.3) \n3. word_count_threshold -
Filters out short content blocks - Helps eliminate noise and
irrelevant content \n# Only consider substantial paragraphs
strategy = CosineStrategy(word_count_threshold=50) \n4.
top_k - Number of top content clusters to return - Higher
values return more diverse content \n# Get top 5 most relevant
content clusters strategy = CosineStrategy(top_k=5) \nUse
Cases\n1. Article Content Extraction\nstrategy =
CosineStrategy( semantic_filter=\"main article content\",
word_count_threshold=100, # Longer blocks for articles top_k=1
# Usually want single main content ) result = await
crawler.arun( url=\"https://example.com/blog/post\",
extraction_strategy=strategy ) \n2. Product Review Analysis
\nstrategy = CosineStrategy( semantic_filter=\"customer
reviews and ratings\", word_count_threshold=20, # Reviews can
be shorter top_k=10, # Get multiple reviews sim_threshold=0.4
# Allow variety in review content ) \n3. Technical
Documentation\nstrategy = CosineStrategy( semantic_filter=
\"technical specifications documentation\",
word_count_threshold=30, sim_threshold=0.6, # Stricter
matching for technical content max_dist=0.3 # Allow related
technical sections ) \nAdvanced Features\nCustom Clustering
\nstrategy = CosineStrategy( linkage_method='complete', #
Alternative clustering method max_dist=0.4, # Larger clusters
model_name='sentence-transformers/paraphrase-multilingual-
MiniLM-L12-v2' # Multilingual support ) \nContent Filtering
Pipeline\nstrategy = CosineStrategy( semantic_filter=\"pricing
plans features\", word_count_threshold=15, sim_threshold=0.5,
top_k=3 ) async def extract_pricing_features(url: str): async
with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=url, extraction_strategy=strategy ) if
result.success: content = json.loads(result.extracted_content)
return { 'pricing_features': content, 'clusters':
len(content), 'similarity_scores': [item['score'] for item in
content] } \nBest Practices\n1. Adjust Thresholds
Iteratively - Start with default values - Adjust based on
results - Monitor clustering quality\n2. Choose Appropriate
Word Count Thresholds - Higher for articles (100+) - Lower for
reviews/comments (20+) - Medium for product descriptions (50
+)\n3. Optimize Performance \nstrategy =
CosineStrategy( word_count_threshold=10, # Filter early top_k=
5, # Limit results verbose=True # Monitor performance ) \n4.
Handle Different Content Types \n# For mixed content pages
strategy = CosineStrategy( semantic_filter=\"product features
\", sim_threshold=0.4, # More flexible matching max_dist=0.3,
# Larger clusters top_k=3 # Multiple relevant sections )
\nError Handling\ntry: result = await crawler.arun( url=
\"https://example.com\", extraction_strategy=strategy ) if
result.success: content = json.loads(result.extracted_content)
if not content: print(\"No relevant content found\") else:
print(f\"Extraction failed: {result.error_message}\") except
Exception as e: print(f\"Error during extraction: {str(e)}\")
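\nAs a side note on how these similarity scores behave, the following
standalone sketch (not Crawl4AI code) computes the cosine similarity
that sim_threshold is compared against, using the same default
embedding model named above; the two sample sentences are purely
illustrative:\nimport numpy as np\n
from sentence_transformers import SentenceTransformer\n
\n
model = SentenceTransformer(\"sentence-transformers/all-MiniLM-L6-v2\")\n
a, b = model.encode([\"Great battery life and fast charging.\", \"The battery lasts all day.\"])\n
# Cosine similarity: dot product of the two embeddings over the product of their norms\n
score = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))\n
print(f\"similarity = {score:.2f}\")  # chunks scoring above sim_threshold are candidates for the same cluster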
\nThe Cosine Strategy is particularly effective when: -
Content structure is inconsistent - You need semantic
understanding - You want to find similar content blocks -
Structure-based extraction (CSS/XPath) isn't reliable\nIt
works well with other strategies and can be used as a pre-
processing step for LLM-based extraction.",
"markdown": "# Clustering Strategies - Crawl4AI
Documentation (v0.5.x)\n\n## Cosine Strategy\n\nThe Cosine
Strategy in Crawl4AI uses similarity-based clustering to
identify and extract relevant content sections from web pages.
This strategy is particularly useful when you need to find and
extract content based on semantic similarity rather than
structural patterns.\n\n## How It Works\n\nThe Cosine
Strategy: 1. Breaks down page content into meaningful chunks
2. Converts text into vector representations 3. Calculates
similarity between chunks 4. Clusters similar content together
5. Ranks and filters content based on relevance\n\n## Basic
Usage\n\n`from crawl4ai.extraction_strategy import
CosineStrategy strategy =
CosineStrategy( semantic_filter=\"product reviews\", #
Target content type word_count_threshold=10, #
Minimum words per cluster sim_threshold=0.3
# Similarity threshold ) async with AsyncWebCrawler() as
crawler: result = await crawler.arun( url=
\"https://example.com/reviews\",
extraction_strategy=strategy ) content =
result.extracted_content`\n\n## Configuration Options\n\n###
Core Parameters\n\n`CosineStrategy( # Content Filtering
semantic_filter: str = None, # Keywords/topic for
content filtering word_count_threshold: int = 10, #
Minimum words per cluster sim_threshold: float = 0.3,
# Similarity threshold (0.0 to 1.0) # Clustering
Parameters max_dist: float = 0.2, # Maximum
distance for clustering linkage_method: str = 'ward',
# Clustering linkage method top_k: int = 3,
# Number of top categories to extract # Model
Configuration model_name: str = 'sentence-
transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable logging )`\n\n###
Parameter Details\n\n1. **semantic\\_filter** - Sets the
target topic or content type - Use keywords relevant to your
desired content - Example: \"technical specifications\",
\"user reviews\", \"pricing information\"\n\n2. **sim
\\_threshold** - Controls how similar content must be to be
grouped together - Higher values (e.g., 0.8) mean stricter
matching - Lower values (e.g., 0.3) allow more variation\n\n`#
Strict matching strategy = CosineStrategy(sim_threshold=0.8)
# Loose matching strategy = CosineStrategy(sim_threshold=0.3)`
\n\n3. **word\\_count\\_threshold** - Filters out short
content blocks - Helps eliminate noise and irrelevant content
\n\n`# Only consider substantial paragraphs strategy =
CosineStrategy(word_count_threshold=50)`\n\n4. **top\\_k** -
Number of top content clusters to return - Higher values
return more diverse content\n\n`# Get top 5 most relevant
content clusters strategy = CosineStrategy(top_k=5)`\n\n## Use
Cases\n\n### 1\\. Article Content Extraction\n\n`strategy =
CosineStrategy( semantic_filter=\"main article content\",
word_count_threshold=100, # Longer blocks for articles
top_k=1 # Usually want single main content )
result = await crawler.arun( url=
\"https://example.com/blog/post\",
extraction_strategy=strategy )`\n\n### 2\\. Product Review
Analysis\n\n`strategy = CosineStrategy( semantic_filter=
\"customer reviews and ratings\", word_count_threshold=20,
# Reviews can be shorter top_k=10, # Get
multiple reviews sim_threshold=0.4 # Allow variety
in review content )`\n\n### 3\\. Technical Documentation\n
\n`strategy = CosineStrategy( semantic_filter=\"technical
specifications documentation\", word_count_threshold=30,
sim_threshold=0.6, # Stricter matching for technical
content max_dist=0.3 # Allow related technical
sections )`\n\n## Advanced Features\n\n### Custom Clustering\n
\n`strategy = CosineStrategy( linkage_method='complete',
# Alternative clustering method max_dist=0.4,
# Larger clusters model_name='sentence-
transformers/paraphrase-multilingual-MiniLM-L12-v2' #
Multilingual support )`\n\n### Content Filtering Pipeline\n
\n`strategy = CosineStrategy( semantic_filter=\"pricing
plans features\", word_count_threshold=15,
sim_threshold=0.5, top_k=3 ) async def
extract_pricing_features(url: str): async with
AsyncWebCrawler() as crawler: result = await
crawler.arun( url=url,
extraction_strategy=strategy ) if
result.success: content =
json.loads(result.extracted_content) return
{ 'pricing_features': content,
'clusters': len(content), 'similarity_scores':
[item['score'] for item in content] }`\n\n## Best
Practices\n\n1. **Adjust Thresholds Iteratively** - Start
with default values - Adjust based on results - Monitor
clustering quality\n\n2. **Choose Appropriate Word Count
Thresholds** - Higher for articles (100+) - Lower for
reviews/comments (20+) - Medium for product descriptions (50
+)\n\n3. **Optimize Performance**\n\n`strategy =
CosineStrategy( word_count_threshold=10, # Filter early
top_k=5, # Limit results verbose=True
# Monitor performance )`\n\n4. **Handle Different Content
Types**\n\n`# For mixed content pages strategy =
CosineStrategy( semantic_filter=\"product features\",
sim_threshold=0.4, # More flexible matching max_dist=
0.3, # Larger clusters top_k=3 #
Multiple relevant sections )`\n\n## Error Handling\n\n`try:
result = await crawler.arun( url=\"https://example.com
\", extraction_strategy=strategy ) if
result.success: content =
json.loads(result.extracted_content) if not content:
print(\"No relevant content found\") else: print(f
\"Extraction failed: {result.error_message}\") except
Exception as e: print(f\"Error during extraction:
{str(e)}\")`\n\nThe Cosine Strategy is particularly effective
when: - Content structure is inconsistent - You need semantic
understanding - You want to find similar content blocks -
Structure-based extraction (CSS/XPath) isn't reliable\n\nIt
works well with other strategies and can be used as a pre-
processing step for LLM-based extraction.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/extraction/llm-
strategies/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/extraction/llm-
strategies/",
"loadedTime": "2025-03-05T23:17:32.190Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/extraction/llm-
strategies/",
"title": "LLM Strategies - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:23 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"b86b27dde5be474f1a2a4653b142085e\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "LLM Strategies - Crawl4AI Documentation
(v0.5.x)\nIn some cases, you need to extract complex or
unstructured information from a webpage that a simple
CSS/XPath schema cannot easily parse. Or you want AI-driven
insights, classification, or summarization. For these
scenarios, Crawl4AI provides an LLM-based extraction strategy
that:\nWorks with any large language model supported by
LightLLM (Ollama, OpenAI, Claude, and more). \nAutomatically
splits content into chunks (if desired) to handle token
limits, then combines results. \nLets you define a schema
(like a Pydantic model) or a simpler “block” extraction
approach.\nImportant: LLM-based extraction can be slower and
costlier than schema-based approaches. If your page data is
highly structured, consider using JsonCssExtractionStrategy or
JsonXPathExtractionStrategy first. But if you need AI to
interpret or reorganize content, read on!\n1. Why Use an LLM?
\nComplex Reasoning: If the site’s data is unstructured,
scattered, or full of natural language context. \nSemantic
Extraction: Summaries, knowledge graphs, or relational data
that require comprehension. \nFlexible: You can pass
instructions to the model to do more advanced transformations
or classification.\n2. Provider-Agnostic via LightLLM
\nCrawl4AI uses a “provider string” (e.g., \"openai/gpt-4o
\", \"ollama/llama2.0\", \"aws/titan\") to identify your LLM.
Any model that LightLLM supports is fair game. You just
provide:\nprovider: The <provider>/<model_name> identifier
(e.g., \"openai/gpt-4\", \"ollama/llama2\",
\"huggingface/google-flan\", etc.). \napi_token: If needed
(for OpenAI, HuggingFace, etc.); local models or Ollama might
not require it. \napi_base (optional): If your provider has a
custom endpoint. \nThis means you aren’t locked into a
single LLM vendor. Switch or experiment easily.\n3.1 Flow\n1.
Chunking (optional): The HTML or markdown is split into
smaller segments if it’s very long (based on
chunk_token_threshold, overlap, etc.).\n2. Prompt
Construction: For each chunk, the library forms a prompt that
includes your instruction (and possibly schema or examples).
\n3. LLM Inference: Each chunk is sent to the model in
parallel or sequentially (depending on your concurrency).\n4.
Combining: The results from each chunk are merged and parsed
into JSON.\n\"schema\": The model tries to return JSON
conforming to your Pydantic-based schema. \n\"block\": The
model returns freeform text, or smaller JSON structures, which
the library collects. \nFor structured data, \"schema\" is
recommended. You provide
schema=YourPydanticModel.model_json_schema().\n4. Key
Parameters\nBelow is an overview of important LLM extraction
parameters. All are typically set inside
LLMExtractionStrategy(...). You then put that strategy in your
CrawlerRunConfig(..., extraction_strategy=...).\n1. provider
(str): e.g., \"openai/gpt-4\", \"ollama/llama2\".\n2.
api_token (str): The API key or token for that model. May not
be needed for local models.\n3. schema (dict): A JSON schema
describing the fields you want. Usually generated by
YourModel.model_json_schema().\n4. extraction_type (str):
\"schema\" or \"block\".\n5. instruction (str): Prompt text
telling the LLM what you want extracted. E.g., “Extract
these fields as a JSON array.” \n6. chunk_token_threshold
(int): Maximum tokens per chunk. If your content is huge, you
can break it up for the LLM.\n7. overlap_rate (float): Overlap
ratio between adjacent chunks. E.g., 0.1 means 10% of each
chunk is repeated to preserve context continuity.\n8.
apply_chunking (bool): Set True to chunk automatically. If you
want a single pass, set False.\n9. input_format (str):
Determines which crawler result is passed to the LLM. Options
include:\n- \"markdown\": The raw markdown (default).\n-
\"fit_markdown\": The filtered “fit†markdown if you used
a content filter.\n- \"html\": The cleaned or raw HTML.\n10.
extra_args (dict): Additional LLM parameters like temperature,
max_tokens, top_p, etc.\n11. show_usage(): A method you can
call to print out usage info (token usage per chunk, total
cost if known). \nExample:\nextraction_strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"openai/gpt-4\", api_token=\"YOUR_OPENAI_KEY\"),
schema=MyModel.model_json_schema(), extraction_type=\"schema
\", instruction=\"Extract a list of items from the text with
'name' and 'price' fields.\", chunk_token_threshold=1200,
overlap_rate=0.1, apply_chunking=True, input_format=\"html\",
extra_args={\"temperature\": 0.1, \"max_tokens\": 1000},
verbose=True ) \n5. Putting It in CrawlerRunConfig\nImportant:
In Crawl4AI, all strategy definitions should go inside the
CrawlerRunConfig, not directly as a param in arun(). Here’s
a full example:\nimport os import asyncio import json from
pydantic import BaseModel, Field from typing import List from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode, LlmConfig from
crawl4ai.extraction_strategy import LLMExtractionStrategy
class Product(BaseModel): name: str price: str async def
main(): # 1. Define the LLM extraction strategy llm_strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"openai/gpt-4o-mini\",
api_token=os.getenv('OPENAI_API_KEY')),
schema=Product.schema_json(), # Or use model_json_schema()
extraction_type=\"schema\", instruction=\"Extract all product
objects with 'name' and 'price' from the content.\",
chunk_token_threshold=1000, overlap_rate=0.0,
apply_chunking=True, input_format=\"markdown\", # or \"html\",
\"fit_markdown\" extra_args={\"temperature\": 0.0,
\"max_tokens\": 800} ) # 2. Build the crawler config
crawl_config =
CrawlerRunConfig( extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS ) # 3. Create a browser config if
needed browser_cfg = BrowserConfig(headless=True) async with
AsyncWebCrawler(config=browser_cfg) as crawler: # 4. Let's say
we want to crawl a single page result = await
crawler.arun( url=\"https://example.com/products\",
config=crawl_config ) if result.success: # 5. The extracted
content is presumably JSON data =
json.loads(result.extracted_content) print(\"Extracted items:
\", data) # 6. Show usage stats llm_strategy.show_usage() #
prints token usage else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n6. Chunking Details\n6.1
chunk_token_threshold\nIf your page is large, you might exceed
your LLM’s context window. chunk_token_threshold sets the
approximate max tokens per chunk. The library calculates
word→token ratio using word_token_rate (often ~0.75 by
default). If chunking is enabled (apply_chunking=True), the
text is split into segments.\n6.2 overlap_rate\nTo keep
context continuous across chunks, we can overlap them. E.g.,
overlap_rate=0.1 means each subsequent chunk includes 10% of
the previous chunk’s text. This is helpful if your needed
info might straddle chunk boundaries.\n6.3 Performance &
Parallelism\nBy chunking, you can potentially process multiple
chunks in parallel (depending on your concurrency settings and
the LLM provider). This reduces total time if the site is huge
or has many sections.\n7. Input Format\nBy default,
LLMExtractionStrategy uses input_format=\"markdown\", meaning
the crawler’s final markdown is fed to the LLM. You can
change to:\nhtml: The cleaned HTML or raw HTML (depending on
your crawler config) goes into the LLM. \nfit_markdown: If you
used, for instance, PruningContentFilter, the “fit”
version of the markdown is used. This can drastically reduce
tokens if you trust the filter. \nmarkdown: Standard markdown
output from the crawler’s markdown_generator.\nThis setting
is crucial: if the LLM instructions rely on HTML tags, pick
\"html\". If you prefer a text-based approach, pick \"markdown
\".\nLLMExtractionStrategy( # ... input_format=\"html\", #
Instead of \"markdown\" or \"fit_markdown\" ) \n8. Token Usage
& Show Usage\nTo keep track of tokens and cost, each chunk is
processed with an LLM call. We record usage in:\nusages
(list): token usage per chunk or call. \ntotal_usage: sum of
all chunk calls. \nshow_usage(): prints a usage report (if the
provider returns usage data).\nllm_strategy =
LLMExtractionStrategy(...) # ... llm_strategy.show_usage() #
e.g. “Total usage: 1241 tokens across 2 chunk calls” \nIf
your model provider doesn’t return usage info, these fields
might be partial or empty.\n9. Example: Building a Knowledge
Graph\nBelow is a snippet combining LLMExtractionStrategy with
a Pydantic schema for a knowledge graph. Notice how we pass an
instruction telling the model what to parse.\nimport os import
json import asyncio from typing import List from pydantic
import BaseModel, Field from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import LLMExtractionStrategy
class Entity(BaseModel): name: str description: str class
Relationship(BaseModel): entity1: Entity entity2: Entity
description: str relation_type: str class
KnowledgeGraph(BaseModel): entities: List[Entity]
relationships: List[Relationship] async def main(): # LLM
extraction strategy llm_strat =
LLMExtractionStrategy( provider=\"openai/gpt-4\",
api_token=os.getenv('OPENAI_API_KEY'),
schema=KnowledgeGraph.schema_json(), extraction_type=\"schema
\", instruction=\"Extract entities and relationships from the
content. Return valid JSON.\", chunk_token_threshold=1400,
apply_chunking=True, input_format=\"html\",
extra_args={\"temperature\": 0.1, \"max_tokens\": 1500} )
crawl_config =
CrawlerRunConfig( extraction_strategy=llm_strat,
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler(config=BrowserConfig(headless=True)) as
crawler: # Example page url =
\"https://www.nbcnews.com/business\" result = await
crawler.arun(url=url, config=crawl_config) if result.success:
with open(\"kb_result.json\", \"w\", encoding=\"utf-8\") as f:
f.write(result.extracted_content) llm_strat.show_usage() else:
print(\"Crawl failed:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \nKey Observations:
\nextraction_type=\"schema\" ensures we get JSON fitting our
KnowledgeGraph. \ninput_format=\"html\" means we feed HTML to
the model. \ninstruction guides the model to output a
structured knowledge graph. \n10. Best Practices & Caveats\n1.
Cost & Latency: LLM calls can be slow or expensive. Consider
chunking or smaller coverage if you only need partial data.
\n2. Model Token Limits: If your page + instruction exceed the
context window, chunking is essential.\n3. Instruction
Engineering: Well-crafted instructions can drastically improve
output reliability.\n4. Schema Strictness: \"schema\"
extraction tries to parse the model output as JSON. If the
model returns invalid JSON, partial extraction might happen,
or you might get an error.\n5. Parallel vs. Serial: The
library can process multiple chunks in parallel, but you must
watch out for rate limits on certain providers.\n6. Check
Output: Sometimes, an LLM might omit fields or produce
extraneous text. You may want to post-validate with Pydantic
or do additional cleanup.\n11. Conclusion\nLLM-based
extraction in Crawl4AI is provider-agnostic, letting you
choose from hundreds of models via LightLLM. It’s perfect
for semantically complex tasks or generating advanced
structures like knowledge graphs. However, it’s slower and
potentially costlier than schema-based approaches. Keep these
tips in mind:\nPut your LLM strategy in CrawlerRunConfig.
\nUse input_format to pick which form (markdown, HTML,
fit_markdown) the LLM sees. \nTweak chunk_token_threshold,
overlap_rate, and apply_chunking to handle large content
efficiently. \nMonitor token usage with show_usage().\nIf your
site’s data is consistent or repetitive, consider
JsonCssExtractionStrategy first for speed and simplicity. But
if you need an AI-driven approach, LLMExtractionStrategy
offers a flexible, multi-provider solution for extracting
structured JSON from any website.\nNext Steps:\n1. Experiment
with Different Providers\n- Try switching the provider (e.g.,
\"ollama/llama2\", \"openai/gpt-4o\", etc.) to see differences
in speed, accuracy, or cost.\n- Pass different extra_args like
temperature, top_p, and max_tokens to fine-tune your results.
\n2. Performance Tuning\n- If pages are large, tweak
chunk_token_threshold, overlap_rate, or apply_chunking to
optimize throughput.\n- Check the usage logs with show_usage()
to keep an eye on token consumption and identify potential
bottlenecks.\n3. Validate Outputs\n- If using extraction_type=
\"schema\", parse the LLM’s JSON with a Pydantic model for a
final validation step.\n- Log or handle any parse errors
gracefully, especially if the model occasionally returns
malformed JSON.\n4. Explore Hooks & Automation\n- Integrate
LLM extraction with hooks for complex pre/post-processing.\n-
Use a multi-step pipeline: crawl, filter, LLM-extract, then
store or index results for further analysis.\nLast Updated:
2025-01-01\nThat’s it for Extracting JSON (LLM)—now you
can harness AI to parse, classify, or reorganize data on the
web. Happy crawling!",
"markdown": "# LLM Strategies - Crawl4AI Documentation
(v0.5.x)\n\nIn some cases, you need to extract **complex or
unstructured** information from a webpage that a simple
CSS/XPath schema cannot easily parse. Or you want **AI**\\-
driven insights, classification, or summarization. For these
scenarios, Crawl4AI provides an **LLM-based extraction
strategy** that:\n\n1. Works with **any** large language
model supported by [LightLLM](https://github.com/LightLLM)
(Ollama, OpenAI, Claude, and more).\n2. Automatically splits
content into chunks (if desired) to handle token limits, then
combines results.\n3. Lets you define a **schema** (like a
Pydantic model) or a simpler “block” extraction approach.
\n\n**Important**: LLM-based extraction can be slower and
costlier than schema-based approaches. If your page data is
highly structured, consider using
[`JsonCssExtractionStrategy`]
(https://crawl4ai.com/mkdocs/extraction/no-llm-strategies/) or
[`JsonXPathExtractionStrategy`]
(https://crawl4ai.com/mkdocs/extraction/no-llm-strategies/)
first. But if you need AI to interpret or reorganize content,
read on!\n\n* * *\n\n## 1\\. Why Use an LLM?\n\n* **Complex
Reasoning**: If the site’s data is unstructured, scattered,
or full of natural language context.\n* **Semantic
Extraction**: Summaries, knowledge graphs, or relational data
that require comprehension.\n* **Flexible**: You can pass
instructions to the model to do more advanced transformations
or classification.\n\n* * *\n\n## 2\\. Provider-Agnostic via
LightLLM\n\nCrawl4AI uses a “provider string” (e.g., `
\"openai/gpt-4o\"`, `\"ollama/llama2.0\"`, `\"aws/titan\"`) to
identify your LLM. **Any** model that LightLLM supports is
fair game. You just provide:\n\n* **`provider`**: The
`<provider>/<model_name>` identifier (e.g., `\"openai/gpt-4
\"`, `\"ollama/llama2\"`, `\"huggingface/google-flan\"`,
etc.).\n* **`api_token`**: If needed (for OpenAI,
HuggingFace, etc.); local models or Ollama might not require
it.\n* **`api_base`** (optional): If your provider has a
custom endpoint.\n\nThis means you **aren’t locked** into a
single LLM vendor. Switch or experiment easily.\n\n* * *\n
\n### 3.1 Flow\n\n1. **Chunking** (optional): The HTML or
markdown is split into smaller segments if it’s very long
(based on `chunk_token_threshold`, overlap, etc.). \n2. **Prompt
Construction**: For each chunk, the library forms a
prompt that includes your **`instruction`** (and possibly
schema or examples). \n3. **LLM Inference**: Each chunk is
sent to the model in parallel or sequentially (depending on
your concurrency). \n4. **Combining**: The results from
each chunk are merged and parsed into JSON.\n\n* **`\"schema
\"`**: The model tries to return JSON conforming to your
Pydantic-based schema.\n* **`\"block\"`**: The model returns
freeform text, or smaller JSON structures, which the library
collects.\n\nFor structured data, `\"schema\"` is recommended.
You provide `schema=YourPydanticModel.model_json_schema()`.\n
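\nFor instance, a minimal Pydantic model for `\"schema\"` mode might
look like this sketch (the `Product` fields are illustrative):\n\n`from pydantic import BaseModel\n
\n
class Product(BaseModel):\n
    name: str\n
    price: str\n
\n
# Pass the resulting dict as the schema argument of LLMExtractionStrategy\n
schema = Product.model_json_schema()`\n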
\n* * *\n\n## 4\\. Key Parameters\n\nBelow is an overview of
important LLM extraction parameters. All are typically set
inside `LLMExtractionStrategy(...)`. You then put that
strategy in your `CrawlerRunConfig(...,
extraction_strategy=...)`.\n\n1. **`provider`** (str): e.g.,
`\"openai/gpt-4\"`, `\"ollama/llama2\"`. \n2.â
€€**`api_token`** (str): The API key or token for that model.
May not be needed for local models. \n3. **`schema`**
(dict): A JSON schema describing the fields you want. Usually
generated by `YourModel.model_json_schema()`. \n4. **`extraction_type`**
(str): `\"schema\"` or `\"block\"`.
\n5. **`instruction`** (str): Prompt text telling the LLM
what you want extracted. E.g., “Extract these fields as a
JSON array.” \n6. **`chunk_token_threshold`** (int):
Maximum tokens per chunk. If your content is huge, you can
break it up for the LLM. \n7. **`overlap_rate`** (float):
Overlap ratio between adjacent chunks. E.g., `0.1` means 10%
of each chunk is repeated to preserve context continuity.
\n8. **`apply_chunking`** (bool): Set `True` to chunk
automatically. If you want a single pass, set `False`. \n9. **`input_format`**
(str): Determines **which** crawler
result is passed to the LLM. Options include: \n\\- `
\"markdown\"`: The raw markdown (default). \n\\- `
\"fit_markdown\"`: The filtered “fit†markdown if you used
a content filter. \n\\- `\"html\"`: The cleaned or raw HTML.
\n10. **`extra_args`** (dict): Additional LLM parameters
like `temperature`, `max_tokens`, `top_p`, etc. \n11. **`show_usage()`**: A
method you can call to print out usage
info (token usage per chunk, total cost if known).\n
\n**Example**:\n\n`extraction_strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"openai/gpt-4\", api_token=\"YOUR_OPENAI_KEY\"),
schema=MyModel.model_json_schema(), extraction_type=
\"schema\", instruction=\"Extract a list of items from the
text with 'name' and 'price' fields.\",
chunk_token_threshold=1200, overlap_rate=0.1,
apply_chunking=True, input_format=\"html\",
extra_args={\"temperature\": 0.1, \"max_tokens\": 1000},
verbose=True )`\n\n* * *\n\n## 5\\. Putting It in
`CrawlerRunConfig`\n\n**Important**: In Crawl4AI, all strategy
definitions should go inside the `CrawlerRunConfig`, not
directly as a param in `arun()`. Here's a full example:\n
\n`import os import asyncio import json from pydantic import
BaseModel, Field from typing import List from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode,
LlmConfig from crawl4ai.extraction_strategy import
LLMExtractionStrategy class Product(BaseModel): name: str
price: str async def main(): # 1. Define the LLM
extraction strategy llm_strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"openai/gpt-4o-mini\",
api_token=os.getenv('OPENAI_API_KEY')),
schema=Product.schema_json(), # Or use model_json_schema()
extraction_type=\"schema\", instruction=\"Extract all
product objects with 'name' and 'price' from the content.\",
chunk_token_threshold=1000, overlap_rate=0.0,
apply_chunking=True, input_format=\"markdown\", # or
\"html\", \"fit_markdown\" extra_args={\"temperature
\": 0.0, \"max_tokens\": 800} ) # 2. Build the
crawler config crawl_config =
CrawlerRunConfig( extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS ) # 3. Create a browser
config if needed browser_cfg =
BrowserConfig(headless=True) async with
AsyncWebCrawler(config=browser_cfg) as crawler: # 4.
Let's say we want to crawl a single page result =
await crawler.arun( url=
\"https://example.com/products\",
config=crawl_config ) if result.success:
# 5. The extracted content is presumably JSON data
= json.loads(result.extracted_content)
print(\"Extracted items:\", data) # 6. Show usage
stats llm_strategy.show_usage() # prints token
usage else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main())`\n\n* * *\n\n## 6\\. Chunking Details\n
\n### 6.1 `chunk_token_threshold`\n\nIf your page is large,
you might exceed your LLM's context window. **`chunk_token_threshold`** sets
the approximate max tokens per chunk. The library calculates word→token ratio
using
`word_token_rate` (often ~0.75 by default). If chunking is
enabled (`apply_chunking=True`), the text is split into
segments.\n\n### 6.2 `overlap_rate`\n\nTo keep context
continuous across chunks, we can overlap them. E.g.,
`overlap_rate=0.1` means each subsequent chunk includes 10% of
the previous chunk's text. This is helpful if your needed info might straddle
chunk boundaries.
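\n\nAs a rough, standalone illustration of the overlap arithmetic (not the
library's internal implementation; `max_words` stands in for the word budget the
library derives from `chunk_token_threshold` and its word→token factor):\n\n`def sketch_chunks(text, max_words=900, overlap_rate=0.1):
    words = text.split()
    # With 10% overlap, each new chunk starts 90% of a chunk length further on
    step = max(1, int(max_words * (1 - overlap_rate)))
    return [\" \".join(words[i:i + max_words]) for i in range(0, len(words), step)]`\n\n### 6.3 Performance &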
Parallelism\n\nBy chunking, you can potentially process
multiple chunks in parallel (depending on your concurrency
settings and the LLM provider). This reduces total time if the
site is huge or has many sections.\n\n* * *\n\n## 7\\. Input
Format\n\nBy default, **LLMExtractionStrategy** uses
`input_format=\"markdown\"`, meaning the **crawler’s final
markdown** is fed to the LLM. You can change to:\n\n*
**`html`**: The cleaned HTML or raw HTML (depending on your
crawler config) goes into the LLM.\n* **`fit_markdown`**: If
you used, for instance, `PruningContentFilter`, the “fit”
version of the markdown is used. This can drastically reduce
tokens if you trust the filter.\n* **`markdown`**: Standard
markdown output from the crawler's `markdown_generator`.\n
\nThis setting is crucial: if the LLM instructions rely on
HTML tags, pick `\"html\"`. If you prefer a text-based
approach, pick `\"markdown\"`.\n
\n`LLMExtractionStrategy( # ... input_format=\"html\",
# Instead of \"markdown\" or \"fit_markdown\" )`\n\n* * *\n
\n## 8\\. Token Usage & Show Usage\n\nTo keep track of tokens
and cost, each chunk is processed with an LLM call. We record
usage in:\n\n* **`usages`** (list): token usage per chunk or
call.\n* **`total_usage`**: sum of all chunk calls.\n*
**`show_usage()`**: prints a usage report (if the provider
returns usage data).\n\n`llm_strategy =
LLMExtractionStrategy(...) # ... llm_strategy.show_usage() #
e.g. “Total usage: 1241 tokens across 2 chunk calls” `\n\nIf your model
provider doesn't return usage info, these
fields might be partial or empty.\n\n* * *\n\n## 9\\. Example:
Building a Knowledge Graph\n\nBelow is a snippet combining
**`LLMExtractionStrategy`** with a Pydantic schema for a
knowledge graph. Notice how we pass an **`instruction`**
telling the model what to parse.\n\n`import os import json
import asyncio from typing import List from pydantic import
BaseModel, Field from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import LLMExtractionStrategy
class Entity(BaseModel): name: str description: str
class Relationship(BaseModel): entity1: Entity
entity2: Entity description: str relation_type: str
class KnowledgeGraph(BaseModel): entities: List[Entity]
relationships: List[Relationship] async def main(): # LLM
extraction strategy llm_strat =
LLMExtractionStrategy( provider=\"openai/gpt-4\",
api_token=os.getenv('OPENAI_API_KEY'),
schema=KnowledgeGraph.schema_json(), extraction_type=
\"schema\", instruction=\"Extract entities and
relationships from the content. Return valid JSON.\",
chunk_token_threshold=1400, apply_chunking=True,
input_format=\"html\", extra_args={\"temperature\":
0.1, \"max_tokens\": 1500} ) crawl_config =
CrawlerRunConfig( extraction_strategy=llm_strat,
cache_mode=CacheMode.BYPASS ) async with
AsyncWebCrawler(config=BrowserConfig(headless=True)) as
crawler: # Example page url =
\"https://www.nbcnews.com/business\" result = await
crawler.arun(url=url, config=crawl_config) if
result.success: with open(\"kb_result.json\", \"w
\", encoding=\"utf-8\") as f:
f.write(result.extracted_content)
llm_strat.show_usage() else: print(\"Crawl
failed:\", result.error_message) if __name__ == \"__main__\":
asyncio.run(main())`\n\n**Key Observations**:\n\n*
**`extraction_type=\"schema\"`** ensures we get JSON fitting
our `KnowledgeGraph`.\n* **`input_format=\"html\"`** means
we feed HTML to the model.\n* **`instruction`** guides the
model to output a structured knowledge graph.\n\n* * *\n\n##
10\\. Best Practices & Caveats\n\n1. **Cost & Latency**: LLM
calls can be slow or expensive. Consider chunking or smaller
coverage if you only need partial data. \n2. **Model Token
Limits**: If your page + instruction exceed the context
window, chunking is essential. \n3. **Instruction
Engineering**: Well-crafted instructions can drastically
improve output reliability. \n4. **Schema Strictness**: `
\"schema\"` extraction tries to parse the model output as
JSON. If the model returns invalid JSON, partial extraction
might happen, or you might get an error. \n5. **Parallel
vs. Serial**: The library can process multiple chunks in
parallel, but you must watch out for rate limits on certain
providers. \n6. **Check Output**: Sometimes, an LLM might
omit fields or produce extraneous text. You may want to post-
validate with Pydantic or do additional cleanup (see the sketch below).
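\n\nA minimal post-validation sketch, assuming a `result` returned by `arun()`
as in the examples above and an illustrative `Product` model (adapt the fields
to your own schema):\n\n`import json
from pydantic import BaseModel, ValidationError
class Product(BaseModel):
    name: str
    price: str
try:
    raw_items = json.loads(result.extracted_content)
    items = [Product(**item) for item in raw_items]
except (json.JSONDecodeError, ValidationError) as err:
    print(\"Post-validation failed:\", err)`\n\n* * *\n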
\n## 11\\. Conclusion\n\n**LLM-based extraction** in Crawl4AI
is **provider-agnostic**, letting you choose from hundreds of
models via LightLLM. It's perfect for **semantically
complex** tasks or generating advanced structures like
knowledge graphs. However, it's **slower** and potentially
costlier than schema-based approaches. Keep these tips in
mind:\n\n* Put your LLM strategy **in `CrawlerRunConfig`**.
\n* Use **`input_format`** to pick which form (markdown,
HTML, fit\\_markdown) the LLM sees.\n* Tweak
**`chunk_token_threshold`**, **`overlap_rate`**, and
**`apply_chunking`** to handle large content efficiently.\n*
Monitor token usage with `show_usage()`.\n\nIf your site's
data is consistent or repetitive, consider
[`JsonCssExtractionStrategy`]
(https://crawl4ai.com/mkdocs/extraction/no-llm-strategies/)
first for speed and simplicity. But if you need an **AI-
driven** approach, `LLMExtractionStrategy` offers a flexible,
multi-provider solution for extracting structured JSON from
any website.\n\n**Next Steps**:\n\n1. **Experiment with
Different Providers** \n\\- Try switching the `provider`
(e.g., `\"ollama/llama2\"`, `\"openai/gpt-4o\"`, etc.) to see
differences in speed, accuracy, or cost. \n\\- Pass different
`extra_args` like `temperature`, `top_p`, and `max_tokens` to
fine-tune your results.\n\n2. **Performance Tuning** \n\\-
If pages are large, tweak `chunk_token_threshold`,
`overlap_rate`, or `apply_chunking` to optimize throughput.
\n\\- Check the usage logs with `show_usage()` to keep an eye
on token consumption and identify potential bottlenecks.\n
\n3. **Validate Outputs** \n\\- If using `extraction_type=
\"schema\"`, parse the LLM’s JSON with a Pydantic model for
a final validation step. \n\\- Log or handle any parse errors
gracefully, especially if the model occasionally returns
malformed JSON.\n\n4. **Explore Hooks & Automation** \n\\-
Integrate LLM extraction with [hooks]
(https://crawl4ai.com/mkdocs/advanced/hooks-auth/) for complex
pre/post-processing. \n\\- Use a multi-step pipeline: crawl,
filter, LLM-extract, then store or index results for further
analysis.\n\n**Last Updated**: 2025-01-01\n\n* * *\n\nThat's it for
**Extracting JSON (LLM)**—now you can harness AI to
parse, classify, or reorganize data on the web. Happy
crawling!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/crawl-
dispatcher/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/advanced/crawl-
dispatcher/",
"loadedTime": "2025-03-05T23:17:33.406Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/advanced/crawl-
dispatcher/",
"title": "Crawl Dispatcher - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:26 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"973f255a4f916259384c408afeea5a7f\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Crawl Dispatcher - Crawl4AI Documentation
(v0.5.x)\nWe're excited to announce a Crawl Dispatcher
module that can handle thousands of crawling tasks
simultaneously. By efficiently managing system resources
(memory, CPU, network), this dispatcher ensures high-
performance data extraction at scale. It also provides real-
time monitoring of each crawler's status, memory usage, and
overall progress.\nStay tuned—this feature is coming soon in
an upcoming release of Crawl4AI! For the latest news, keep an
eye on our changelogs and follow @unclecode on X.\nBelow is a
sample of how the dispatcher's performance monitor might look in action:\nWe
can't wait to bring you this
streamlined, scalable approach to multi-URL crawling—watch
this space for updates!",
"markdown": "# Crawl Dispatcher - Crawl4AI Documentation
(v0.5.x)\n\nWe're excited to announce a **Crawl Dispatcher**
module that can handle **thousands** of crawling tasks
simultaneously. By efficiently managing system resources
(memory, CPU, network), this dispatcher ensures high-
performance data extraction at scale. It also provides **real-
time monitoring** of each crawler's status, memory usage,
and overall progress.\n\nStay tuned—this feature is **coming
soon** in an upcoming release of Crawl4AI! For the latest
news, keep an eye on our changelogs and follow [@unclecode]
(https://twitter.com/unclecode) on X.\n\nBelow is a **sample**
of how the dispatcher's performance monitor might look in
action:\n\n![Crawl Dispatcher Performance Monitor]
(https://crawl4ai.com/mkdocs/assets/images/dispatcher.png)\n
\nWe can't wait to bring you this streamlined, **scalable**
approach to multi-URL crawling—**watch this space** for
updates!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/advanced/identity-based-
crawling/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/advanced/identity-based-
crawling/",
"loadedTime": "2025-03-05T23:17:35.961Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/advanced/identity-based-crawling/",
"title": "Identity Based Crawling - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:35 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"6fa45c754b3c6ded5868499d22181838\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Identity Based Crawling - Crawl4AI Documentation
(v0.5.x)\nPreserve Your Identity with Crawl4AI\nCrawl4AI
empowers you to navigate and interact with the web using your
authentic digital identity, ensuring you're recognized as a
human and not mistaken for a bot. This tutorial covers:\n1.
Managed Browsers – The recommended approach for persistent
profiles and identity-based crawling.\n2. Magic Mode – A
simplified fallback solution for quick automation without
persistent identity.\n1. Managed Browsers: Your Digital
Identity Solution\nManaged Browsers let developers create and
use persistent browser profiles. These profiles store local
storage, cookies, and other session data, letting you browse
as your real self—complete with logins, preferences, and
cookies.\nKey Benefits\nAuthentic Browsing Experience: Retain
session data and browser fingerprints as though you're a
normal user. \nEffortless Configuration: Once you log in or
solve CAPTCHAs in your chosen data directory, you can re-run
crawls without repeating those steps. \nEmpowered Data Access:
If you can see the data in your own browser, you can automate
its retrieval with your genuine identity.\nBelow is a partial
update to your Managed Browsers tutorial, specifically the
section about creating a user-data directory using Playwright's Chromium
binary rather than a system-wide Chrome/Edge. We'll show how to locate that
binary and launch it with a --
user-data-dir argument to set up your profile. You can then
point BrowserConfig.user_data_dir to that folder for
subsequent crawls.\nCreating a User Data Directory (Command-
Line Approach via Playwright)\nIf you installed Crawl4AI
(which installs Playwright under the hood), you already have a
Playwright-managed Chromium on your system. Follow these steps
to launch that Chromium from your command line, specifying a
custom data directory:\n1. Find the Playwright Chromium
binary: - On most systems, installed browsers go under a
~/.cache/ms-playwright/ folder or similar path.\n- To see an
overview of installed browsers, run: \npython -m playwright
install --dry-run \nor \nplaywright install --dry-run
\n(depending on your environment). This shows where Playwright
keeps Chromium. \nFor instance, you might see a path like:
\n~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
\non Linux, or a corresponding folder on macOS/Windows.\n2.
Launch the Playwright Chromium binary with a custom user-data
directory: \n# Linux example ~/.cache/ms-
playwright/chromium-1234/chrome-linux/chrome \\ --user-data-
dir=/home/<you>/my_chrome_profile \n# macOS example
(Playwright's internal binary) ~/Library/Caches/ms-
playwright/chromium-1234/chrome-
mac/Chromium.app/Contents/MacOS/Chromium \\ --user-data-
dir=/Users/<you>/my_chrome_profile \n# Windows example
(PowerShell/cmd) \"C:\\Users\\<you>\\AppData\\Local\\ms-
playwright\\chromium-1234\\chrome-win\\chrome.exe\" ^ --user-
data-dir=\"C:\\Users\\<you>\\my_chrome_profile\" \nReplace the
path with the actual subfolder indicated in your ms-playwright
cache structure.\n- This opens a fresh Chromium with your new
or existing data folder.\n- Log into any sites or configure
your browser the way you want.\n- Close when done—your
profile data is saved in that folder.\n3. Use that folder in
BrowserConfig.user_data_dir: \nfrom crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig( headless=True,
use_managed_browser=True, user_data_dir=
\"/home/<you>/my_chrome_profile\", browser_type=\"chromium\" )
\n- Next time you run your code, it reuses that folder—
preserving your session data, cookies, local storage, etc.
\n3. Using Managed Browsers in Crawl4AI\nOnce you have a data
directory with your session data, pass it to BrowserConfig:
\nimport asyncio from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig async def main(): # 1)
Reference your persistent data directory browser_config =
BrowserConfig( headless=True, # 'True' for automated runs
verbose=True, use_managed_browser=True, # Enables persistent
browser strategy browser_type=\"chromium\", user_data_dir=
\"/path/to/my-chrome-profile\" ) # 2) Standard crawl config
crawl_config = CrawlerRunConfig( wait_for=\"css:.logged-in-
content\" ) async with AsyncWebCrawler(config=browser_config)
as crawler: result = await crawler.arun(url=
\"https://example.com/private\", config=crawl_config) if
result.success: print(\"Successfully accessed private data
with your identity!\") else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \nWorkflow\n1. Login externally (via CLI
or your normal Chrome with --user-data-dir=...).\n2. Close
that browser.\n3. Use the same folder in user_data_dir= in
Crawl4AI.\n4. Crawl – The site sees your identity as if you're the same user
who just logged in.\n4. Magic Mode: Simplified Automation\nIf you don't need a
persistent
profile or identity-based approach, Magic Mode offers a quick
way to simulate human-like browsing without storing long-term
data.\nfrom crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://example.com\",
config=CrawlerRunConfig( magic=True, # Simplifies a lot of
interaction remove_overlay_elements=True, page_timeout=
60000 ) ) \nMagic Mode:\nSimulates a user-like experience
\nRandomizes user agent & navigator\nRandomizes interactions &
timings \nMasks automation signals \nAttempts pop-up handling
\nBut it's no substitute for true user-based sessions if you
want a fully legitimate identity-based solution.\n5. Comparing
Managed Browsers vs. Magic Mode\nFeature Managed Browsers
Magic Mode \nSession Persistence\tFull localStorage/cookies
retained in user_data_dir\tNo persistent data (fresh each
run)\t\nGenuine Identity\tReal user profile with full rights &
preferences\tEmulated user-like patterns, but no actual
identity\t\nComplex Sites\tBest for login-gated sites or heavy
config\tSimple tasks, minimal login or config needed\t\nSetup
\tExternal creation of user_data_dir, then use in Crawl4AI
\tSingle-line approach (magic=True)\t\nReliability\tExtremely
consistent (same data across runs)\tGood for smaller tasks,
can be less stable\t\n6. Using the BrowserProfiler Class
\nCrawl4AI provides a dedicated BrowserProfiler class for
managing browser profiles, making it easy to create, list, and
delete profiles for identity-based browsing.\nCreating and
Managing Profiles with BrowserProfiler\nThe BrowserProfiler
class offers a comprehensive API for browser profile
management:\nimport asyncio from crawl4ai import
BrowserProfiler async def manage_profiles(): # Create a
profiler instance profiler = BrowserProfiler() # Create a
profile interactively - opens a browser window profile_path =
await profiler.create_profile( profile_name=\"my-login-profile
\" # Optional: name your profile ) print(f\"Profile saved at:
{profile_path}\") # List all available profiles profiles =
profiler.list_profiles() for profile in profiles: print(f
\"Profile: {profile['name']}\") print(f\" Path:
{profile['path']}\") print(f\" Created:
{profile['created']}\") print(f\" Browser type:
{profile['type']}\") # Get a specific profile path by name
specific_profile = profiler.get_profile_path(\"my-login-
profile\") # Delete a profile when no longer needed success =
profiler.delete_profile(\"old-profile-name\")
asyncio.run(manage_profiles()) \nHow profile creation works:
1. A browser window opens for you to interact with 2. You log
in to websites, set preferences, etc. 3. When you're done,
press 'q' in the terminal to close the browser 4. The profile
is saved in the Crawl4AI profiles directory 5. You can use the
returned path with BrowserConfig.user_data_dir\nInteractive
Profile Management\nThe BrowserProfiler also offers an
interactive management console that guides you through profile
creation, listing, and deletion:\nimport asyncio from crawl4ai
import BrowserProfiler, AsyncWebCrawler, BrowserConfig #
Define a function to use a profile for crawling async def
crawl_with_profile(profile_path, url): browser_config =
BrowserConfig( headless=True, use_managed_browser=True,
user_data_dir=profile_path ) async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun(url) return result async def main(): #
Create a profiler instance profiler = BrowserProfiler() #
Launch the interactive profile manager # Passing the crawl
function as a callback adds a \"crawl with profile\" option
await
profiler.interactive_manager(crawl_callback=crawl_with_profile
) asyncio.run(main()) \nLegacy Methods\nFor backward
compatibility, the previous methods on ManagedBrowser are
still available, but they delegate to the new BrowserProfiler
class:\nfrom crawl4ai.browser_manager import ManagedBrowser #
These methods still work but use BrowserProfiler internally
profiles = ManagedBrowser.list_profiles() \nComplete Example
\nSee the full example in
docs/examples/identity_based_browsing.py for a complete
demonstration of creating and using profiles for authenticated
browsing using the new BrowserProfiler class.\n7. Summary
\nCreate your user-data directory either:\nBy launching
Chrome/Chromium externally with --user-data-dir=/some/path
\nOr by using the built-in BrowserProfiler.create_profile()
method\nOr through the interactive interface with
profiler.interactive_manager()\nLog in or configure sites as
needed, then close the browser\nReference that folder in
BrowserConfig(user_data_dir=\"...\") +
use_managed_browser=True\nList and reuse profiles with
BrowserProfiler.list_profiles()\nManage your profiles with the
dedicated BrowserProfiler class\nEnjoy persistent sessions
that reflect your real identity\nIf you only need quick,
ephemeral automation, Magic Mode might suffice\nRecommended:
Always prefer a Managed Browser for robust, identity-based
crawling and simpler interactions with complex sites. Use
Magic Mode for quick tasks or prototypes where persistent data
is unnecessary.\nWith these approaches, you preserve your
authentic browsing environment, ensuring the site sees you
exactly as a normal user—no repeated logins or wasted
time.",
"markdown": "# Identity Based Crawling - Crawl4AI
Documentation (v0.5.x)\n\n## Preserve Your Identity with
Crawl4AI\n\nCrawl4AI empowers you to navigate and interact
with the web using your **authentic digital identity**,
ensuring you're recognized as a human and not mistaken for a bot. This tutorial
covers:\n\n1. **Managed Browsers** – The recommended approach for persistent
profiles and
identity-based crawling. \n2. **Magic Mode** – A
simplified fallback solution for quick automation without
persistent identity.\n\n* * *\n\n## 1\\. Managed Browsers:
Your Digital Identity Solution\n\n**Managed Browsers** let
developers create and use **persistent browser profiles**.
These profiles store local storage, cookies, and other session
data, letting you browse as your **real self**—complete with
logins, preferences, and cookies.\n\n### Key Benefits\n\n*
**Authentic Browsing Experience**: Retain session data and
browser fingerprints as though you're a normal user.\n*
**Effortless Configuration**: Once you log in or solve
CAPTCHAs in your chosen data directory, you can re-run crawls
without repeating those steps.\n* **Empowered Data Access**:
If you can see the data in your own browser, you can automate
its retrieval with your genuine identity.\n\n* * *\n\nBelow is
a **partial update** to your **Managed Browsers** tutorial,
specifically the section about **creating a user-data
directory** using **Playwright's Chromium** binary rather than a system-wide
Chrome/Edge. We'll show how to **locate**
that binary and launch it with a `--user-data-dir` argument to
set up your profile. You can then point
`BrowserConfig.user_data_dir` to that folder for subsequent
crawls.\n\n* * *\n\n### Creating a User Data Directory
(Command-Line Approach via Playwright)\n\nIf you installed
Crawl4AI (which installs Playwright under the hood), you
already have a Playwright-managed Chromium on your system.
Follow these steps to launch that **Chromium** from your
command line, specifying a **custom** data directory:\n\n1. **Find** the
Playwright Chromium binary: - On most systems,
installed browsers go under a `~/.cache/ms-playwright/` folder
or similar path. \n\\- To see an overview of installed
browsers, run:\n\n`python -m playwright install --dry-run`\n
\nor\n\n`playwright install --dry-run`\n\n(depending on your
environment). This shows where Playwright keeps Chromium.\n\n*
For instance, you might see a path like:\n \n
`~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome`\n
\n on Linux, or a corresponding folder on macOS/Windows.\n
\n2. **Launch** the Playwright Chromium binary with a
**custom** user-data directory:\n\n`# Linux example
~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
\\ --user-data-dir=/home/<you>/my_chrome_profile`\n\n`#
macOS example (Playwright's internal binary)
~/Library/Caches/ms-playwright/chromium-1234/chrome-
mac/Chromium.app/Contents/MacOS/Chromium \\ --user-data-
dir=/Users/<you>/my_chrome_profile`\n\n`# Windows example
(PowerShell/cmd) \"C:\\Users\\<you>\\AppData\\Local\\ms-
playwright\\chromium-1234\\chrome-win\\chrome.exe\" ^ --
user-data-dir=\"C:\\Users\\<you>\\my_chrome_profile\"`\n
\n**Replace** the path with the actual subfolder indicated in
your `ms-playwright` cache structure. \n\\- This **opens** a
fresh Chromium with your new or existing data folder. \n\\-
**Log into** any sites or configure your browser the way you
want. \n\\- **Close** when done—your profile data is saved
in that folder.\n\n3. **Use** that folder in
**`BrowserConfig.user_data_dir`**:\n\n`from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig( headless=True,
use_managed_browser=True, user_data_dir=
\"/home/<you>/my_chrome_profile\", browser_type=\"chromium
\" )`\n\n\\- Next time you run your code, it reuses that
folder—**preserving** your session data, cookies, local
storage, etc.\n\n* * *\n\n## 3\\. Using Managed Browsers in
Crawl4AI\n\nOnce you have a data directory with your session
data, pass it to **`BrowserConfig`**:\n\n`import asyncio from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig async def main(): # 1) Reference your
persistent data directory browser_config =
BrowserConfig( headless=True, # 'True' for
automated runs verbose=True,
use_managed_browser=True, # Enables persistent browser
strategy browser_type=\"chromium\",
user_data_dir=\"/path/to/my-chrome-profile\" ) # 2)
Standard crawl config crawl_config =
CrawlerRunConfig( wait_for=\"css:.logged-in-content
\" ) async with
AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url=\"https://example.com/private
\", config=crawl_config) if result.success:
print(\"Successfully accessed private data with your identity!
\") else: print(\"Error:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main())`\n\n### Workflow\n\n1. **Login**
externally (via CLI or your normal Chrome with `--user-data-
dir=...`). \n2. **Close** that browser. \n3. **Use** the
same folder in `user_data_dir=` in Crawl4AI. \n4. **Crawl**
– The site sees your identity as if you're the same user
who just logged in.\n\n* * *\n\n## 4\\. Magic Mode: Simplified
Automation\n\nIf you **don't** need a persistent profile or
identity-based approach, **Magic Mode** offers a quick way to
simulate human-like browsing without storing long-term data.\n
\n`from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
async with AsyncWebCrawler() as crawler: result = await
crawler.arun( url=\"https://example.com\",
config=CrawlerRunConfig( magic=True, # Simplifies
a lot of interaction remove_overlay_elements=True,
page_timeout=60000 ) )`\n\n**Magic Mode**:\n\n*
Simulates a user-like experience\n* Randomizes user agent &
navigator\n* Randomizes interactions & timings\n* Masks
automation signals\n* Attempts pop-up handling\n\n**But**
it's no substitute for **true** user-based sessions if you
want a fully legitimate identity-based solution.\n\n* * *\n
\n## 5\\. Comparing Managed Browsers vs. Magic Mode\n\n|
Feature | **Managed Browsers** | **Magic Mode** |\n| --- | ---
| --- |\n| **Session Persistence** | Full localStorage/cookies
retained in user\\_data\\_dir | No persistent data (fresh each
run) |\n| **Genuine Identity** | Real user profile with full
rights & preferences | Emulated user-like patterns, but no
actual identity |\n| **Complex Sites** | Best for login-gated
sites or heavy config | Simple tasks, minimal login or config
needed |\n| **Setup** | External creation of user\\_data
\\_dir, then use in Crawl4AI | Single-line approach
(`magic=True`) |\n| **Reliability** | Extremely consistent
(same data across runs) | Good for smaller tasks, can be less
stable |\n\n* * *\n\n## 6\\. Using the BrowserProfiler Class\n
\nCrawl4AI provides a dedicated `BrowserProfiler` class for
managing browser profiles, making it easy to create, list, and
delete profiles for identity-based browsing.\n\n### Creating
and Managing Profiles with BrowserProfiler\n\nThe
`BrowserProfiler` class offers a comprehensive API for browser
profile management:\n\n`import asyncio from crawl4ai import
BrowserProfiler async def manage_profiles(): # Create a
profiler instance profiler = BrowserProfiler() #
Create a profile interactively - opens a browser window
profile_path = await
profiler.create_profile( profile_name=\"my-login-
profile\" # Optional: name your profile ) print(f
\"Profile saved at: {profile_path}\") # List all
available profiles profiles = profiler.list_profiles()
for profile in profiles: print(f\"Profile:
{profile['name']}\") print(f\" Path:
{profile['path']}\") print(f\" Created:
{profile['created']}\") print(f\" Browser type:
{profile['type']}\") # Get a specific profile path by
name specific_profile = profiler.get_profile_path(\"my-
login-profile\") # Delete a profile when no longer needed
success = profiler.delete_profile(\"old-profile-name\")
asyncio.run(manage_profiles())`\n\n**How profile creation
works:** 1. A browser window opens for you to interact with 2.
You log in to websites, set preferences, etc. 3. When you're
done, press 'q' in the terminal to close the browser 4. The
profile is saved in the Crawl4AI profiles directory 5. You can
use the returned path with `BrowserConfig.user_data_dir`\n
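\nPutting steps 4 and 5 together, a minimal sketch that feeds the returned path
straight into `BrowserConfig` (reusing the classes shown above; the URL is a
placeholder):\n\n`import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, BrowserProfiler
async def crawl_with_new_profile(url):
    profiler = BrowserProfiler()
    # Interactive step: log in, then press 'q' in the terminal
    profile_path = await profiler.create_profile()
    cfg = BrowserConfig(headless=True, use_managed_browser=True, user_data_dir=profile_path)
    async with AsyncWebCrawler(config=cfg) as crawler:
        return await crawler.arun(url)
asyncio.run(crawl_with_new_profile(\"https://example.com/private\"))`\n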
\n### Interactive Profile Management\n\nThe `BrowserProfiler`
also offers an interactive management console that guides you
through profile creation, listing, and deletion:\n\n`import
asyncio from crawl4ai import BrowserProfiler, AsyncWebCrawler,
BrowserConfig # Define a function to use a profile for
crawling async def crawl_with_profile(profile_path, url):
browser_config = BrowserConfig( headless=True,
use_managed_browser=True,
user_data_dir=profile_path ) async with
AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(url) return result async
def main(): # Create a profiler instance profiler =
BrowserProfiler() # Launch the interactive profile
manager # Passing the crawl function as a callback adds a
\"crawl with profile\" option await
profiler.interactive_manager(crawl_callback=crawl_with_profile
) asyncio.run(main())`\n\n### Legacy Methods\n\nFor backward
compatibility, the previous methods on `ManagedBrowser` are
still available, but they delegate to the new
`BrowserProfiler` class:\n\n`from crawl4ai.browser_manager
import ManagedBrowser # These methods still work but use
BrowserProfiler internally profiles =
ManagedBrowser.list_profiles()`\n\n### Complete Example\n\nSee
the full example in `docs/examples/identity_based_browsing.py`
for a complete demonstration of creating and using profiles
for authenticated browsing using the new `BrowserProfiler`
class.\n\n* * *\n\n## 7\\. Summary\n\n* **Create** your
user-data directory either:\n* By launching Chrome/Chromium
externally with `--user-data-dir=/some/path`\n* Or by using
the built-in `BrowserProfiler.create_profile()` method\n* Or
through the interactive interface with
`profiler.interactive_manager()`\n* **Log in** or configure
sites as needed, then close the browser\n* **Reference**
that folder in `BrowserConfig(user_data_dir=\"...\")` +
`use_managed_browser=True`\n* **List and reuse** profiles
with `BrowserProfiler.list_profiles()`\n* **Manage** your
profiles with the dedicated `BrowserProfiler` class\n* Enjoy
**persistent** sessions that reflect your real identity\n*
If you only need quick, ephemeral automation, **Magic Mode**
might suffice\n\n**Recommended**: Always prefer a **Managed
Browser** for robust, identity-based crawling and simpler
interactions with complex sites. Use **Magic Mode** for quick
tasks or prototypes where persistent data is unnecessary.\n
\nWith these approaches, you preserve your **authentic**
browsing environment, ensuring the site sees you exactly as a
normal user—no repeated logins or wasted time.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/extraction/chunking/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/extraction/chunking/",
"loadedTime": "2025-03-05T23:17:41.239Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/extraction/chunking/",
"title": "Chunking - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:39 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"c45d969ef20a687fb22bfa5d46bf2edc\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Chunking - Crawl4AI Documentation
(v0.5.x)\nChunking Strategies\nChunking strategies are
critical for dividing large texts into manageable parts,
enabling effective content processing and extraction. These
strategies are foundational in cosine similarity-based
extraction techniques, which allow users to retrieve only the
most relevant chunks of content for a given query.
Additionally, they facilitate direct integration into RAG
(Retrieval-Augmented Generation) systems for structured and
scalable workflows.\nWhy Use Chunking?\n1. Cosine Similarity
and Query Relevance: Prepares chunks for semantic similarity
analysis. 2. RAG System Integration: Seamlessly processes and
stores chunks for retrieval. 3. Structured Processing: Allows
for diverse segmentation methods, such as sentence-based,
topic-based, or windowed approaches.\nMethods of Chunking\n1.
Regex-Based Chunking\nSplits text based on regular expression
patterns, useful for coarse segmentation.\nCode Example:
\nimport re class RegexChunking: def __init__(self, patterns=None):
self.patterns = patterns or [r'\\n\\n'] # Default pattern for
paragraphs def chunk(self, text): paragraphs = [text] for
pattern in self.patterns: paragraphs = [seg for p in
paragraphs for seg in re.split(pattern, p)] return paragraphs
# Example Usage text = \"\"\"This is the first paragraph. This
is the second paragraph.\"\"\" chunker = RegexChunking()
print(chunker.chunk(text)) \n2. Sentence-Based Chunking
\nDivides text into sentences using NLP tools, ideal for
extracting meaningful statements.\nCode Example: \nfrom
nltk.tokenize import sent_tokenize class NlpSentenceChunking:
def chunk(self, text): sentences = sent_tokenize(text) return
[sentence.strip() for sentence in sentences] # Example Usage
text = \"This is sentence one. This is sentence two.\" chunker
= NlpSentenceChunking() print(chunker.chunk(text)) \n3. Topic-
Based Segmentation\nUses algorithms like TextTiling to create
topic-coherent chunks.\nCode Example: \nfrom nltk.tokenize
import TextTilingTokenizer class TopicSegmentationChunking:
def __init__(self): self.tokenizer = TextTilingTokenizer() def
chunk(self, text): return self.tokenizer.tokenize(text) #
Example Usage text = \"\"\"This is an introduction. This is a
detailed discussion on the topic.\"\"\" chunker =
TopicSegmentationChunking() print(chunker.chunk(text)) \n4.
Fixed-Length Word Chunking\nSegments text into chunks of a
fixed word count.\nCode Example: \nclass
FixedLengthWordChunking: def __init__(self, chunk_size=100):
self.chunk_size = chunk_size def chunk(self, text): words =
text.split() return [' '.join(words[i:i + self.chunk_size])
for i in range(0, len(words), self.chunk_size)] # Example
Usage text = \"This is a long text with many words to be
chunked into fixed sizes.\" chunker =
FixedLengthWordChunking(chunk_size=5)
print(chunker.chunk(text)) \n5. Sliding Window Chunking
\nGenerates overlapping chunks for better contextual
coherence.\nCode Example: \nclass SlidingWindowChunking: def
__init__(self, window_size=100, step=50): self.window_size =
window_size self.step = step def chunk(self, text): words =
text.split() chunks = [] for i in range(0, len(words) -
self.window_size + 1, self.step): chunks.append('
'.join(words[i:i + self.window_size])) return chunks # Example
Usage text = \"This is a long text to demonstrate sliding
window chunking.\" chunker =
SlidingWindowChunking(window_size=5, step=2)
print(chunker.chunk(text)) \nCombining Chunking with Cosine
Similarity\nTo enhance the relevance of extracted content,
chunking strategies can be paired with cosine similarity
techniques. Here's an example workflow:\nCode Example:
\nfrom sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity class
CosineSimilarityExtractor: def __init__(self, query):
self.query = query self.vectorizer = TfidfVectorizer() def
find_relevant_chunks(self, chunks): vectors =
self.vectorizer.fit_transform([self.query] + chunks)
similarities = cosine_similarity(vectors[0:1],
vectors[1:]).flatten() return [(chunks[i], similarities[i])
for i in range(len(chunks))] # Example Workflow text =
\"\"\"This is a sample document. It has multiple sentences. We
are testing chunking and similarity.\"\"\" chunker =
SlidingWindowChunking(window_size=5, step=3) chunks =
chunker.chunk(text) query = \"testing chunking\" extractor =
CosineSimilarityExtractor(query) relevant_chunks =
extractor.find_relevant_chunks(chunks)
print(relevant_chunks)",
"markdown": "# Chunking - Crawl4AI Documentation (v0.5.x)\n
\n## Chunking Strategies\n\nChunking strategies are critical
for dividing large texts into manageable parts, enabling
effective content processing and extraction. These strategies
are foundational in cosine similarity-based extraction
techniques, which allow users to retrieve only the most
relevant chunks of content for a given query. Additionally,
they facilitate direct integration into RAG (Retrieval-
Augmented Generation) systems for structured and scalable
workflows.\n\n### Why Use Chunking?\n\n1. **Cosine
Similarity and Query Relevance**: Prepares chunks for semantic
similarity analysis. 2. **RAG System Integration**:
Seamlessly processes and stores chunks for retrieval. 3. **Structured
Processing**: Allows for diverse segmentation
methods, such as sentence-based, topic-based, or windowed
approaches.\n\n### Methods of Chunking\n\n#### 1\\. Regex-
Based Chunking\n\nSplits text based on regular expression
patterns, useful for coarse segmentation.\n\n**Code Example**:
\n\n`import re class RegexChunking: def __init__(self,
patterns=None): self.patterns = patterns or [r'\\n
\\n'] # Default pattern for paragraphs def chunk(self,
text): paragraphs = [text] for pattern in
self.patterns: paragraphs = [seg for p in
paragraphs for seg in re.split(pattern, p)] return
paragraphs # Example Usage text = \"\"\"This is the first
paragraph. This is the second paragraph.\"\"\" chunker =
RegexChunking() print(chunker.chunk(text))`\n\n#### 2\\.
Sentence-Based Chunking\n\nDivides text into sentences using
NLP tools, ideal for extracting meaningful statements.\n
\n**Code Example**:\n\n`from nltk.tokenize import
sent_tokenize class NlpSentenceChunking: def chunk(self,
text): sentences = sent_tokenize(text) return
[sentence.strip() for sentence in sentences] # Example Usage
text = \"This is sentence one. This is sentence two.\" chunker
= NlpSentenceChunking() print(chunker.chunk(text))`\n\n#### 3
\\. Topic-Based Segmentation\n\nUses algorithms like
TextTiling to create topic-coherent chunks.\n\n**Code
Example**:\n\n`from nltk.tokenize import TextTilingTokenizer
class TopicSegmentationChunking: def __init__(self):
self.tokenizer = TextTilingTokenizer() def chunk(self,
text): return self.tokenizer.tokenize(text) # Example
Usage text = \"\"\"This is an introduction. This is a detailed
discussion on the topic.\"\"\" chunker =
TopicSegmentationChunking() print(chunker.chunk(text))`\n
\n#### 4\\. Fixed-Length Word Chunking\n\nSegments text into
chunks of a fixed word count.\n\n**Code Example**:\n\n`class
FixedLengthWordChunking: def __init__(self, chunk_size=
100): self.chunk_size = chunk_size def
chunk(self, text): words = text.split() return
[' '.join(words[i:i + self.chunk_size]) for i in range(0,
len(words), self.chunk_size)] # Example Usage text = \"This
is a long text with many words to be chunked into fixed sizes.
\" chunker = FixedLengthWordChunking(chunk_size=5)
print(chunker.chunk(text))`\n\n#### 5\\. Sliding Window
Chunking\n\nGenerates overlapping chunks for better contextual
coherence.\n\n**Code Example**:\n\n`class
SlidingWindowChunking: def __init__(self, window_size=100,
step=50): self.window_size = window_size
self.step = step def chunk(self, text): words =
text.split() chunks = [] for i in range(0,
len(words) - self.window_size + 1, self.step):
chunks.append(' '.join(words[i:i + self.window_size]))
return chunks # Example Usage text = \"This is a long text to
demonstrate sliding window chunking.\" chunker =
SlidingWindowChunking(window_size=5, step=2)
print(chunker.chunk(text))`\n\n### Combining Chunking with
Cosine Similarity\n\nTo enhance the relevance of extracted
content, chunking strategies can be paired with cosine
similarity techniques. Here's an example workflow:\n\n**Code
Example**:\n\n`from sklearn.feature_extraction.text import
TfidfVectorizer from sklearn.metrics.pairwise import
cosine_similarity class CosineSimilarityExtractor: def
__init__(self, query): self.query = query
self.vectorizer = TfidfVectorizer() def
find_relevant_chunks(self, chunks): vectors =
self.vectorizer.fit_transform([self.query] + chunks)
similarities = cosine_similarity(vectors[0:1],
vectors[1:]).flatten() return [(chunks[i],
similarities[i]) for i in range(len(chunks))] # Example
Workflow text = \"\"\"This is a sample document. It has
multiple sentences. We are testing chunking and similarity.
\"\"\" chunker = SlidingWindowChunking(window_size=5, step=3)
chunks = chunker.chunk(text) query = \"testing chunking\"
extractor = CosineSimilarityExtractor(query) relevant_chunks =
extractor.find_relevant_chunks(chunks)
print(relevant_chunks)`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/async-webcrawler/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/api/async-
webcrawler/",
"loadedTime": "2025-03-05T23:17:42.540Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/api/async-
webcrawler/",
"title": "AsyncWebCrawler - Crawl4AI Documentation
(v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:40 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"68d525978a1b3900841fcaa8bb49ddf4\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "AsyncWebCrawler - Crawl4AI Documentation
(v0.5.x)\nThe AsyncWebCrawler is the core class for
asynchronous web crawling in Crawl4AI. You typically create it
once, optionally customize it with a BrowserConfig (e.g.,
headless, user agent), then run multiple arun() calls with
different CrawlerRunConfig objects.\nRecommended usage:\n1.
Create a BrowserConfig for global browser settings. \n2.
Instantiate AsyncWebCrawler(config=browser_config). \n3. Use
the crawler in an async context manager (async with) or manage
start/close manually. \n4. Call arun(url,
config=crawler_run_config) for each page you want.\n1.
Constructor Overview\nclass AsyncWebCrawler: def
__init__( self, crawler_strategy:
Optional[AsyncCrawlerStrategy] = None, config:
Optional[BrowserConfig] = None, always_bypass_cache: bool =
False, # deprecated always_by_pass_cache: Optional[bool] =
None, # also deprecated base_directory: str = ...,
thread_safe: bool = False, **kwargs, ): \"\"\" Create an
AsyncWebCrawler instance. Args: crawler_strategy: (Advanced)
Provide a custom crawler strategy if needed. config: A
BrowserConfig object specifying how the browser is set up.
always_bypass_cache: (Deprecated) Use
CrawlerRunConfig.cache_mode instead. base_directory: Folder
for storing caches/logs (if relevant). thread_safe: If True,
attempts some concurrency safeguards. Usually False. **kwargs:
Additional legacy or debugging parameters. \"\"\" ) ###
Typical Initialization ```python from crawl4ai import
AsyncWebCrawler, BrowserConfig browser_cfg =
BrowserConfig( browser_type=\"chromium\", headless=True,
verbose=True ) crawler = AsyncWebCrawler(config=browser_cfg)
\nNotes:\nLegacy parameters like always_bypass_cache remain
for backward compatibility, but prefer to set caching in
CrawlerRunConfig.\n2. Lifecycle: Start/Close or Context
Manager\n2.1 Context Manager (Recommended)\nasync with
AsyncWebCrawler(config=browser_cfg) as crawler: result = await
crawler.arun(\"https://example.com\") # The crawler
automatically starts/closes resources \nWhen the async with
block ends, the crawler cleans up (closes the browser, etc.).
\n2.2 Manual Start & Close\ncrawler =
AsyncWebCrawler(config=browser_cfg) await crawler.start()
result1 = await crawler.arun(\"https://example.com\") result2
= await crawler.arun(\"https://another.com\") await
crawler.close() \nUse this style if you have a long-running
application or need full control of the crawler's lifecycle.
\n3. Primary Method: arun()\nasync def arun( self, url: str,
config: Optional[CrawlerRunConfig] = None, # Legacy parameters
for backward compatibility... ) -> CrawlResult: ... \n3.1 New
Approach\nYou pass a CrawlerRunConfig object that sets up
everything about a crawl—content filtering, caching, session
reuse, JS code, screenshots, etc.\nimport asyncio from
crawl4ai import CrawlerRunConfig, CacheMode run_cfg =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS, css_selector=
\"main.article\", word_count_threshold=10, screenshot=True )
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(\"https://example.com/news\",
config=run_cfg) print(\"Crawled HTML length:\",
len(result.cleaned_html)) if result.screenshot:
print(\"Screenshot base64 length:\", len(result.screenshot))
\n3.2 Legacy Parameters Still Accepted\nFor backward
compatibility, arun() can still accept direct arguments like
css_selector=..., word_count_threshold=..., etc., but we
strongly advise migrating them into a CrawlerRunConfig.\n4.
Batch Processing: arun_many()\nasync def arun_many( self,
urls: List[str], config: Optional[CrawlerRunConfig] = None, #
Legacy parameters maintained for backwards
compatibility... ) -> List[CrawlResult]: \"\"\" Process
multiple URLs with intelligent rate limiting and resource
monitoring. \"\"\" \n4.1 Resource-Aware Crawling\nThe
arun_many() method now uses an intelligent dispatcher that:
\nMonitors system memory usage\nImplements adaptive rate
limiting\nProvides detailed progress monitoring\nManages
concurrent crawls efficiently\n4.2 Example Usage\nCheck page
Multi-url Crawling for a detailed example of how to use
arun_many().\n### 4.3 Key Features 1. **Rate Limiting** -
Automatic delay between requests - Exponential backoff on rate
limit detection - Domain-specific rate limiting - Configurable
retry strategy 2. **Resource Monitoring** - Memory usage
tracking - Adaptive concurrency based on system load -
Automatic pausing when resources are constrained 3. **Progress
Monitoring** - Detailed or aggregated progress display - Real-
time status updates - Memory usage statistics 4. **Error
Handling** - Graceful handling of rate limits - Automatic
retries with backoff - Detailed error reporting --- ## 5.
`CrawlResult` Output Each `arun()` returns a **`CrawlResult`**
containing: - `url`: Final URL (if redirected). - `html`:
Original HTML. - `cleaned_html`: Sanitized HTML. -
`markdown_v2`: Deprecated. Instead just use regular
`markdown` - `extracted_content`: If an extraction strategy
was used (JSON for CSS/LLM strategies). - `screenshot`, `pdf`:
If screenshots/PDF requested. - `media`, `links`: Information
about discovered images/links. - `success`, `error_message`:
Status info. For details, see [CrawlResult doc](./crawl-
result.md). --- ## 6. Quick Example Below is an example
hooking it all together: ```python import asyncio from
crawl4ai import AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode from crawl4ai.extraction_strategy
import JsonCssExtractionStrategy import json async def main():
# 1. Browser config browser_cfg = BrowserConfig( browser_type=
\"firefox\", headless=False, verbose=True ) # 2. Run config
schema = { \"name\": \"Articles\", \"baseSelector\":
\"article.post\", \"fields\": [ { \"name\": \"title\",
\"selector\": \"h2\", \"type\": \"text\" }, { \"name\": \"url
\", \"selector\": \"a\", \"type\": \"attribute\", \"attribute
\": \"href\" } ] } run_cfg =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
extraction_strategy=JsonCssExtractionStrategy(schema),
word_count_threshold=15, remove_overlay_elements=True,
wait_for=\"css:.post\" # Wait for posts to appear ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result = await
crawler.arun( url=\"https://example.com/blog\",
config=run_cfg ) if result.success: print(\"Cleaned HTML
length:\", len(result.cleaned_html)) if
result.extracted_content: articles =
json.loads(result.extracted_content) print(\"Extracted
articles:\", articles[:2]) else: print(\"Error:\",
result.error_message) asyncio.run(main()) \nExplanation:\nWe
define a BrowserConfig with Firefox, no headless, and
verbose=True. \nWe define a CrawlerRunConfig that bypasses
cache, uses a CSS extraction schema, has a
word_count_threshold=15, etc. \nWe pass them to
AsyncWebCrawler(config=...) and arun(url=..., config=...).\n7.
Best Practices & Migration Notes\n1. Use BrowserConfig for
global settings about the browser's environment. 2. Use
CrawlerRunConfig for per-crawl logic (caching, content
filtering, extraction strategies, wait conditions). 3. Avoid
legacy parameters like css_selector or word_count_threshold
directly in arun(). Instead:\nrun_cfg =
CrawlerRunConfig(css_selector=\".main-content\",
word_count_threshold=20) result = await crawler.arun(url=\"...
\", config=run_cfg) \n4. Context Manager usage is simplest
unless you want a persistent crawler across many calls.\n8.
Summary\nAsyncWebCrawler is your entry point to asynchronous
crawling:\nConstructor accepts BrowserConfig (or defaults).
\narun(url, config=CrawlerRunConfig) is the main method for
single-page crawls. \narun_many(urls, config=CrawlerRunConfig)
handles concurrency across multiple URLs. \nFor advanced
lifecycle control, use start() and close() explicitly.
\nMigration: \nIf you used AsyncWebCrawler(browser_type=
\"chromium\", css_selector=\"...\"), move browser settings to
BrowserConfig(...) and content/crawl logic to
CrawlerRunConfig(...).\nThis modular approach ensures your
code is clean, scalable, and easy to maintain. For any
advanced or rarely used parameters, see the BrowserConfig
docs.",
"markdown": "# AsyncWebCrawler - Crawl4AI Documentation
(v0.5.x)\n\nThe **`AsyncWebCrawler`** is the core class for
asynchronous web crawling in Crawl4AI. You typically create
it **once**, optionally customize it with a
**`BrowserConfig`** (e.g., headless, user agent), then **run**
multiple **`arun()`** calls with different
**`CrawlerRunConfig`** objects.\n\n**Recommended usage**:\n
\n1. **Create** a `BrowserConfig` for global browser
settings. \n\n2. **Instantiate**
`AsyncWebCrawler(config=browser_config)`. \n\n3. **Use**
the crawler in an async context manager (`async with`) or
manage start/close manually. \n\n4. **Call** `arun(url,
config=crawler_run_config)` for each page you want.\n\n* * *\n
\n## 1. Constructor Overview\n\n`class AsyncWebCrawler:
def __init__( self, crawler_strategy:
Optional[AsyncCrawlerStrategy] = None, config:
Optional[BrowserConfig] = None, always_bypass_cache:
bool = False, # deprecated
always_by_pass_cache: Optional[bool] = None, # also deprecated
base_directory: str = ..., thread_safe: bool = False,
**kwargs, ): \"\"\" Create an
AsyncWebCrawler instance. Args:
crawler_strategy: (Advanced) Provide a custom
crawler strategy if needed. config:
A BrowserConfig object specifying how the browser is set up.
always_bypass_cache: (Deprecated) Use
CrawlerRunConfig.cache_mode instead.
base_directory: Folder for storing
caches/logs (if relevant). thread_safe:
If True, attempts some concurrency safeguards. Usually
False. **kwargs: Additional
legacy or debugging parameters. \"\"\" ) ###
Typical Initialization ```python from crawl4ai import
AsyncWebCrawler, BrowserConfig browser_cfg =
BrowserConfig( browser_type=\"chromium\",
headless=True, verbose=True ) crawler =
AsyncWebCrawler(config=browser_cfg)`\n\n**Notes**:\n\n*
**Legacy** parameters like `always_bypass_cache` remain for
backward compatibility, but prefer to set **caching** in
`CrawlerRunConfig`.\n\n* * *\n\n## 2. Lifecycle: Start/Close
or Context Manager\n\n### 2.1 Context Manager (Recommended)\n
\n`async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun(\"https://example.com\") # The
crawler automatically starts/closes resources`\n\nWhen the
`async with` block ends, the crawler cleans up (closes the
browser, etc.).\n\n### 2.2 Manual Start & Close\n\n`crawler =
AsyncWebCrawler(config=browser_cfg) await crawler.start()
result1 = await crawler.arun(\"https://example.com\") result2
= await crawler.arun(\"https://another.com\") await
crawler.close()`\n\nUse this style if you have a **long-
running** application or need full control of the crawler's
lifecycle.\n\n* * *\n\n## 3. Primary Method: `arun()`\n
\n`async def arun( self, url: str, config:
Optional[CrawlerRunConfig] = None, # Legacy parameters for
backward compatibility... ) -> CrawlResult: ...`\n\n###
3.1 New Approach\n\nYou pass a `CrawlerRunConfig` object that
sets up everything about a crawl—content filtering, caching,
session reuse, JS code, screenshots, etc.\n\n`import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode run_cfg =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
css_selector=\"main.article\", word_count_threshold=10,
screenshot=True ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result =
await crawler.arun(\"https://example.com/news\",
config=run_cfg) print(\"Crawled HTML length:\",
len(result.cleaned_html)) if result.screenshot:
print(\"Screenshot base64 length:\", len(result.screenshot))`
\n\n### 3.2 Legacy Parameters Still Accepted\n\nFor
**backward** compatibility, `arun()` can still accept direct
arguments like `css_selector=...`, `word_count_threshold=...`,
etc., but we strongly advise migrating them into a
**`CrawlerRunConfig`**.\n\n* * *\n\n## 4. Batch Processing:
`arun_many()`\n\n`async def arun_many( self, urls:
List[str], config: Optional[CrawlerRunConfig] = None,
# Legacy parameters maintained for backwards
compatibility... ) -> List[CrawlResult]: \"\"\"
Process multiple URLs with intelligent rate limiting and
resource monitoring. \"\"\"`\n\n### 4.1 Resource-Aware
Crawling\n\nThe `arun_many()` method now uses an intelligent
dispatcher that:\n\n* Monitors system memory usage\n*
Implements adaptive rate limiting\n* Provides detailed
progress monitoring\n* Manages concurrent crawls efficiently
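As a minimal sketch of the batch call shape (the URLs are placeholders and the config is optional), assuming the default dispatcher:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        # The dispatcher schedules the URLs concurrently and, in the default
        # non-streaming mode, returns one CrawlResult per URL as a list.
        results = await crawler.arun_many(
            urls=['https://example.com', 'https://example.org'],
            config=run_cfg,
        )
        for res in results:
            print(res.url, '->', 'OK' if res.success else res.error_message)

asyncio.run(main())
```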
\n\n### 4.2 Example Usage\n\nCheck page [Multi-url Crawling]
(https://crawl4ai.com/mkdocs/advanced/multi-url-crawling/) for
a detailed example of how to use `arun_many()`.\n\n### 4.3 Key Features\n\n1. **Rate Limiting**\n   - Automatic delay between requests\n   - Exponential backoff on rate limit detection\n   - Domain-specific rate limiting\n   - Configurable retry strategy\n2. **Resource Monitoring**\n   - Memory usage tracking\n   - Adaptive concurrency based on system load\n   - Automatic pausing when resources are constrained\n3. **Progress Monitoring**\n   - Detailed or aggregated progress display\n   - Real-time status updates\n   - Memory usage statistics\n4. **Error Handling**\n   - Graceful handling of rate limits\n   - Automatic retries with backoff\n   - Detailed error reporting\n\n* * *\n\n
## 5. `CrawlResult` Output\n\nEach `arun()` returns a **`CrawlResult`** containing:\n\n* `url`: Final URL (if redirected).\n* `html`: Original HTML.\n* `cleaned_html`: Sanitized HTML.\n* `markdown_v2`: Deprecated. Instead just use regular `markdown`.\n* `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).\n* `screenshot`, `pdf`: If screenshots/PDF requested.\n* `media`, `links`: Information about discovered images/links.\n* `success`, `error_message`: Status info.\n\nFor details, see [CrawlResult doc](./crawl-result.md).\n\n* * *\n\n
## 6. Quick Example\n\nBelow is an example hooking it all together:\n\n
`import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai.extraction_strategy import JsonCssExtractionStrategy import json async def main(): # 1. Browser config browser_cfg = BrowserConfig( browser_type=\"firefox\", headless=False, verbose=True ) # 2. Run config schema = { \"name\": \"Articles\", \"baseSelector\": \"article.post\", \"fields\": [ { \"name\": \"title\", \"selector\": \"h2\", \"type\": \"text\" }, { \"name\": \"url\", \"selector\": \"a\", \"type\": \"attribute\", \"attribute\": \"href\" } ] } run_cfg = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, extraction_strategy=JsonCssExtractionStrategy(schema), word_count_threshold=15, remove_overlay_elements=True, wait_for=\"css:.post\" # Wait for posts to appear ) async with AsyncWebCrawler(config=browser_cfg) as crawler: result = await crawler.arun( url=\"https://example.com/blog\", config=run_cfg ) if result.success: print(\"Cleaned HTML length:\", len(result.cleaned_html)) if result.extracted_content: articles = json.loads(result.extracted_content) print(\"Extracted articles:\", articles[:2]) else: print(\"Error:\", result.error_message) asyncio.run(main())`
\n\n**Explanation**:\n\n* We define a **`BrowserConfig`**
with Firefox, no headless, and `verbose=True`. \n* We
define a **`CrawlerRunConfig`** that **bypasses cache**, uses
a **CSS** extraction schema, has a `word_count_threshold=15`,
etc. \n* We pass them to `AsyncWebCrawler(config=...)` and
`arun(url=..., config=...)`.\n\n* * *\n\n## 7. Best
Practices & Migration Notes\n\n1. **Use** `BrowserConfig`
for **global** settings about the browser’s environment. 
2. **Use** `CrawlerRunConfig` for **per-crawl** logic
(caching, content filtering, extraction strategies, wait
conditions).  3. **Avoid** legacy parameters like
`css_selector` or `word_count_threshold` directly in
`arun()`. Instead:\n\n`run_cfg =
CrawlerRunConfig(css_selector=\".main-content\",
word_count_threshold=20) result = await crawler.arun(url=\"...
\", config=run_cfg)`\n\n4. **Context Manager** usage is
simplest unless you want a persistent crawler across many
calls.\n\n* * *\n\n## 8. Summary\n\n**AsyncWebCrawler** is
your entry point to asynchronous crawling:\n\n*
**Constructor** accepts **`BrowserConfig`** (or defaults). \n* **`arun(url, config=CrawlerRunConfig)`** is the main
method for single-page crawls. \n* **`arun_many(urls,
config=CrawlerRunConfig)`** handles concurrency across
multiple URLs. \n* For advanced lifecycle control, use
`start()` and `close()` explicitly. \n\n**Migration**:\n\n*
If you used `AsyncWebCrawler(browser_type=\"chromium\",
css_selector=\"...\")`, move browser settings to
`BrowserConfig(...)` and content/crawl logic to
`CrawlerRunConfig(...)`.\n\nThis modular approach ensures your
code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the
[BrowserConfig docs]
(https://crawl4ai.com/mkdocs/api/parameters/).",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/arun/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/api/arun/",
"loadedTime": "2025-03-05T23:17:42.638Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/api/arun/",
"title": "arun() - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:40 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"8457a185249ac8b83e23b6c3096726d4\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "arun() - Crawl4AI Documentation (v0.5.x)\narun()
Parameter Guide (New Approach)\nIn Crawl4AI’s latest
configuration model, nearly all parameters that once went
directly to arun() are now part of CrawlerRunConfig. When
calling arun(), you provide:\nawait crawler.arun( url=
\"https://example.com\", config=my_run_config ) \nBelow is an
organized look at the parameters that can go inside
CrawlerRunConfig, divided by their functional areas. For
Browser settings (e.g., headless, browser_type), see
BrowserConfig.\n1. Core Usage\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode async def main():
run_config = CrawlerRunConfig( verbose=True, # Detailed
logging cache_mode=CacheMode.ENABLED, # Use normal read/write
cache check_robots_txt=True, # Respect robots.txt rules # ...
other parameters ) async with AsyncWebCrawler() as crawler:
result = await crawler.arun( url=\"https://example.com\",
config=run_config ) # Check if blocked by robots.txt if not
result.success and result.status_code == 403: print(f\"Error:
{result.error_message}\") \nKey Fields: - verbose=True logs
each crawl step. - cache_mode decides how to read/write the
local crawl cache.\n2. Cache Control\ncache_mode (default:
CacheMode.ENABLED)\nUse a built-in enum from CacheMode:
\nENABLED: Normal caching – reads if available, writes if
missing.\nDISABLED: No caching – always refetch pages.
\nREAD_ONLY: Reads from cache only; no new writes.
\nWRITE_ONLY: Writes to cache but doesn’t read existing
data.\nBYPASS: Skips reading cache for this crawl (though it
might still write if set up that way).\nrun_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS ) \nAdditional
flags:\nbypass_cache=True acts like CacheMode.BYPASS.
\ndisable_cache=True acts like CacheMode.DISABLED.
\nno_cache_read=True acts like CacheMode.WRITE_ONLY.
\nno_cache_write=True acts like CacheMode.READ_ONLY.\n3.
Content Processing & Selection\n3.1 Text Processing
\nrun_config = CrawlerRunConfig( word_count_threshold=10, #
Ignore text blocks <10 words only_text=False, # If True, tries
to remove non-text elements keep_data_attributes=False # Keep
or discard data-* attributes ) \n3.2 Content Selection
\nrun_config = CrawlerRunConfig( css_selector=\".main-content
\", # Focus on .main-content region only excluded_tags=[\"form
\", \"nav\"], # Remove entire tag blocks remove_forms=True, #
Specifically strip <form> elements
remove_overlay_elements=True, # Attempt to remove
modals/popups ) \n3.3 Link Handling\nrun_config =
CrawlerRunConfig( exclude_external_links=True, # Remove
external links from final content
exclude_social_media_links=True, # Remove links to known
social sites exclude_domains=[\"ads.example.com\"], # Exclude
links to these domains
exclude_social_media_domains=[\"facebook.com\",\"twitter.com
\"], # Extend the default list ) \n3.4 Media Filtering
\nrun_config = CrawlerRunConfig( exclude_external_images=True
# Strip images from other domains ) \n4. Page Navigation &
Timing\n4.1 Basic Browser Flow\nrun_config =
CrawlerRunConfig( wait_for=\"css:.dynamic-content\", # Wait
for .dynamic-content delay_before_return_html=2.0, # Wait 2s
before capturing final HTML page_timeout=60000, # Navigation &
script timeout (ms) ) \nKey Fields:\nwait_for: \n
\"css:selector\" or \n\"js:() => boolean\"\ne.g. js:() =>
document.querySelectorAll('.item').length > 10.\nmean_delay &
max_range: define random delays for arun_many() calls.
\nsemaphore_count: concurrency limit when crawling multiple
URLs.\n4.2 JavaScript Execution\nrun_config =
CrawlerRunConfig( js_code=[ \"window.scrollTo(0,
document.body.scrollHeight);\",
\"document.querySelector('.load-more')?.click();\" ],
js_only=False ) \njs_code can be a single string or a list of
strings. \njs_only=True means “I’m continuing in the same session with new JS steps, no new full navigation.” \n4.3
Anti-Bot\nrun_config = CrawlerRunConfig( magic=True,
simulate_user=True, override_navigator=True ) \n- magic=True
tries multiple stealth features. - simulate_user=True mimics
mouse movements or random delays. - override_navigator=True
fakes some navigator properties (like user agent checks). \n5.
Session Management\nsession_id: \nrun_config =
CrawlerRunConfig( session_id=\"my_session123\" ) \nIf re-used
in subsequent arun() calls, the same tab/page context is
continued (helpful for multi-step tasks or stateful browsing).
\nrun_config = CrawlerRunConfig( screenshot=True, # Grab a
screenshot as base64 screenshot_wait_for=1.0, # Wait 1s before
capturing pdf=True, # Also produce a PDF
image_description_min_word_threshold=5, # If analyzing alt
text image_score_threshold=3, # Filter out low-score images )
\nWhere they appear: - result.screenshot → Base64 screenshot string. - result.pdf → Byte array with PDF data. \nFor
advanced data extraction (CSS/LLM-based), set
extraction_strategy:\nrun_config =
CrawlerRunConfig( extraction_strategy=my_css_or_llm_strategy )
\nThe extracted data will appear in result.extracted_content.
\n8. Comprehensive Example\nBelow is a snippet combining many
parameters:\nimport asyncio from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, CacheMode from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main(): # Example schema schema = { \"name\":
\"Articles\", \"baseSelector\": \"article.post\", \"fields\":
[ {\"name\": \"title\", \"selector\": \"h2\", \"type\": \"text
\"}, {\"name\": \"link\", \"selector\": \"a\", \"type\":
\"attribute\", \"attribute\": \"href\"} ] } run_config =
CrawlerRunConfig( # Core verbose=True,
cache_mode=CacheMode.ENABLED, check_robots_txt=True, # Respect
robots.txt rules # Content word_count_threshold=10,
css_selector=\"main.content\", excluded_tags=[\"nav\",
\"footer\"], exclude_external_links=True, # Page & JS js_code=
\"document.querySelector('.show-more')?.click();\", wait_for=
\"css:.loaded-block\", page_timeout=30000, # Extraction
extraction_strategy=JsonCssExtractionStrategy(schema), #
Session session_id=\"persistent_session\", # Media
screenshot=True, pdf=True, # Anti-bot simulate_user=True,
magic=True, ) async with AsyncWebCrawler() as crawler: result
= await crawler.arun(\"https://example.com/posts\",
config=run_config) if result.success: print(\"HTML length:\",
len(result.cleaned_html)) print(\"Extraction JSON:\",
result.extracted_content) if result.screenshot:
print(\"Screenshot length:\", len(result.screenshot)) if
result.pdf: print(\"PDF bytes length:\", len(result.pdf))
else: print(\"Error:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) \nWhat we covered:\n1.
Crawling the main content region, ignoring external links. 2.
Running JavaScript to click “.show-more”. 3. Waiting for “.loaded-block” to appear. 4. Generating a screenshot & PDF of the final page. 5. Extracting repeated “article.post” 
elements with a CSS-based extraction strategy.\n9. Best
Practices\n1. Use BrowserConfig for global browser settings
(headless, user agent). 2. Use CrawlerRunConfig to handle the
specific crawl needs: content filtering, caching, JS,
screenshot, extraction, etc. 3. Keep your parameters
consistent in run configs – especially if you’re part of a large codebase with multiple crawls. 4. Limit large concurrency (semaphore_count) if the site or your system can’t handle it. 5. For dynamic pages, set js_code or
scan_full_page so you load all content.\n10. Conclusion\nAll
parameters that used to be direct arguments to arun() now
belong in CrawlerRunConfig. This approach:\nMakes code clearer
and more maintainable. \nMinimizes confusion about which
arguments affect global vs. per-crawl behavior. \nAllows you
to create reusable config objects for different pages or
tasks.\nFor a full reference, check out the CrawlerRunConfig
Docs. \nHappy crawling with your structured, flexible config
approach!",
"markdown": "# arun() - Crawl4AI Documentation (v0.5.x)\n
\n## `arun()` Parameter Guide (New Approach)\n\nIn Crawl4AI’s **latest** configuration model, nearly all parameters that
once went directly to `arun()` are now part of
**`CrawlerRunConfig`**. When calling `arun()`, you provide:
\n\n`await crawler.arun( url=\"https://example.com\",
config=my_run_config )`\n\nBelow is an organized look at the
parameters that can go inside `CrawlerRunConfig`, divided by
their functional areas. For **Browser** settings (e.g.,
`headless`, `browser_type`), see [BrowserConfig]
(https://crawl4ai.com/mkdocs/api/parameters/).\n\n* * *\n\n##
1. Core Usage\n\n`from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode async def main(): run_config
= CrawlerRunConfig( verbose=True, #
Detailed logging cache_mode=CacheMode.ENABLED, # Use
normal read/write cache check_robots_txt=True, #
Respect robots.txt rules # ... other
parameters ) async with AsyncWebCrawler() as crawler:
result = await crawler.arun( url=
\"https://example.com\",
config=run_config ) # Check if blocked by
robots.txt if not result.success and
result.status_code == 403: print(f\"Error:
{result.error_message}\")`\n\n**Key Fields**: - `verbose=True`
logs each crawl step. - `cache_mode` decides how to
read/write the local crawl cache.\n\n* * *\n\n## 2. Cache
Control\n\n**`cache_mode`** (default: `CacheMode.ENABLED`)
\nUse a built-in enum from `CacheMode`:\n\n* `ENABLED`:
Normal caching – reads if available, writes if missing.\n* `DISABLED`: No caching – always refetch pages.\n* `READ_ONLY`: Reads from cache only; no new writes.\n* `WRITE_ONLY`: Writes to cache but doesn’t read existing
data.\n* `BYPASS`: Skips reading cache for this crawl
(though it might still write if set up that way).\n
\n`run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS )`\n
\n**Additional flags**:\n\n* `bypass_cache=True` acts like
`CacheMode.BYPASS`.\n* `disable_cache=True` acts like
`CacheMode.DISABLED`.\n* `no_cache_read=True` acts like
`CacheMode.WRITE_ONLY`.\n* `no_cache_write=True` acts like
`CacheMode.READ_ONLY`.\n\n* * *\n\n## 3. Content Processing
& Selection\n\n### 3.1 Text Processing\n\n`run_config =
CrawlerRunConfig( word_count_threshold=10, # Ignore text
blocks <10 words only_text=False, # If True,
tries to remove non-text elements
keep_data_attributes=False # Keep or discard data-*
attributes )`\n\n### 3.2 Content Selection\n\n`run_config =
CrawlerRunConfig( css_selector=\".main-content\", # Focus
on .main-content region only excluded_tags=[\"form\",
\"nav\"], # Remove entire tag blocks remove_forms=True,
# Specifically strip <form> elements
remove_overlay_elements=True, # Attempt to remove
modals/popups )`\n\n### 3.3 Link Handling\n\n`run_config =
CrawlerRunConfig( exclude_external_links=True, #
Remove external links from final content
exclude_social_media_links=True, # Remove links to known
social sites exclude_domains=[\"ads.example.com\"], #
Exclude links to these domains
exclude_social_media_domains=[\"facebook.com\",\"twitter.com
\"], # Extend the default list )`\n\n### 3.4 Media Filtering\n
\n`run_config =
CrawlerRunConfig( exclude_external_images=True # Strip
images from other domains )`\n\n* * *\n\n## 4. Page
Navigation & Timing\n\n### 4.1 Basic Browser Flow\n
\n`run_config = CrawlerRunConfig( wait_for=\"css:.dynamic-
content\", # Wait for .dynamic-content
delay_before_return_html=2.0, # Wait 2s before capturing
final HTML page_timeout=60000, # Navigation &
script timeout (ms) )`\n\n**Key Fields**:\n\n* `wait_for`:
\n* `\"css:selector\"` or\n* `\"js:() => boolean\"` \n
e.g. `js:() => document.querySelectorAll('.item').length >
10`.\n \n* `mean_delay` & `max_range`: define random
delays for `arun_many()` calls. \n \n*
`semaphore_count`: concurrency limit when crawling multiple
URLs.\n\n### 4.2 JavaScript Execution\n\n`run_config =
CrawlerRunConfig( js_code=[ \"window.scrollTo(0,
document.body.scrollHeight);\",
\"document.querySelector('.load-more')?.click();\" ],
js_only=False )`\n\n* `js_code` can be a single string or a
list of strings. \n* `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.” \n\n### 4.3 Anti-Bot\n\n`run_config =
CrawlerRunConfig( magic=True, simulate_user=True,
override_navigator=True )`\n\n\\- `magic=True` tries multiple
stealth features. - `simulate_user=True` mimics mouse movements or random delays. - `override_navigator=True`
fakes some navigator properties (like user agent checks).\n\n*
* *\n\n## 5. Session Management\n\n**`session_id`**:\n
\n`run_config = CrawlerRunConfig( session_id=
\"my_session123\" )`\n\nIf re-used in subsequent `arun()`
calls, the same tab/page context is continued (helpful for
multi-step tasks or stateful browsing).\n\n* * *\n
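As an illustrative sketch (not taken from this page) of that multi-step pattern, reusing one `session_id` and sending only JavaScript on the second call; the URL and selector are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Step 1: normal navigation that opens (and names) the session.
    step1 = CrawlerRunConfig(session_id='my_session123', cache_mode=CacheMode.BYPASS)
    # Step 2: stay in the same tab and only run JS, no full re-navigation.
    step2 = step1.clone(
        js_only=True,
        js_code='document.querySelector(".load-more")?.click();',
    )

    async with AsyncWebCrawler() as crawler:
        first = await crawler.arun('https://example.com/list', config=step1)
        second = await crawler.arun('https://example.com/list', config=step2)
        print(len(first.cleaned_html), len(second.cleaned_html))

asyncio.run(main())
```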
\n`run_config = CrawlerRunConfig( screenshot=True,
# Grab a screenshot as base64 screenshot_wait_for=1.0,
# Wait 1s before capturing pdf=True, #
Also produce a PDF image_description_min_word_threshold=5,
# If analyzing alt text image_score_threshold=3,
# Filter out low-score images )`\n\n**Where they appear**: -
`result.screenshot` → Base64 screenshot string. - `result.pdf` → Byte array with PDF data.\n\n* * *\n\n**For
advanced data extraction** (CSS/LLM-based), set
`extraction_strategy`:\n\n`run_config =
CrawlerRunConfig( extraction_strategy=my_css_or_llm_strate
gy )`\n\nThe extracted data will appear in
`result.extracted_content`.\n\n* * *\n\n## 8. Comprehensive
Example\n\nBelow is a snippet combining many parameters:\n
\n`import asyncio from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, CacheMode from crawl4ai.extraction_strategy
import JsonCssExtractionStrategy async def main(): #
Example schema schema = { \"name\": \"Articles\",
\"baseSelector\": \"article.post\", \"fields\":
[ {\"name\": \"title\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"link\",
\"selector\": \"a\", \"type\": \"attribute\", \"attribute\":
\"href\"} ] } run_config =
CrawlerRunConfig( # Core verbose=True,
cache_mode=CacheMode.ENABLED, check_robots_txt=True,
# Respect robots.txt rules # Content
word_count_threshold=10, css_selector=\"main.content
\", excluded_tags=[\"nav\", \"footer\"],
exclude_external_links=True, # Page & JS
js_code=\"document.querySelector('.show-more')?.click();\",
wait_for=\"css:.loaded-block\", page_timeout=30000,
# Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
# Session session_id=\"persistent_session\",
# Media screenshot=True, pdf=True, #
Anti-bot simulate_user=True, magic=True, )
async with AsyncWebCrawler() as crawler: result =
await crawler.arun(\"https://example.com/posts\",
config=run_config) if result.success:
print(\"HTML length:\", len(result.cleaned_html))
print(\"Extraction JSON:\", result.extracted_content)
if result.screenshot: print(\"Screenshot
length:\", len(result.screenshot)) if result.pdf:
print(\"PDF bytes length:\", len(result.pdf)) else:
print(\"Error:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main())`\n\n**What we covered**:
\n\n1. **Crawling** the main content region, ignoring
external links. 2. Running **JavaScript** to click “.show-more”. 3. **Waiting** for “.loaded-block” to appear. 4. Generating a **screenshot** & **PDF** of the final page. 5. Extracting repeated “article.post” 
elements with a **CSS-based** extraction strategy.\n\n* * *\n
\n## 9. Best Practices\n\n1. **Use `BrowserConfig` for
global browser** settings (headless, user agent). 2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc. 3. Keep your **parameters consistent** in run configs – especially if you’re part of a large codebase with multiple crawls. 4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it. 5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.\n\n* * *\n\n
## 10. Conclusion\n\nAll parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:\n\n* Makes code **clearer** and **more maintainable**. \n* Minimizes confusion about which arguments affect global vs. per-crawl behavior. \n*
Allows you to create **reusable** config objects for different
pages or tasks.\n\nFor a **full** reference, check out the
[CrawlerRunConfig Docs]
(https://crawl4ai.com/mkdocs/api/parameters/). \n\nHappy
crawling with your **structured, flexible** config approach!",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/arun_many/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/api/arun_many/",
"loadedTime": "2025-03-05T23:17:44.067Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/api/arun_many/",
"title": "arun_many() - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:42 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"25c4ef11151c26a52822ad1f9201b3b2\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "arun_many() - Crawl4AI Documentation
(v0.5.x)\narun_many(...) Reference\nNote: This function is
very similar to arun() but focused on concurrent or batch
crawling. If you’re unfamiliar with arun() usage, please
read that doc first, then review this for differences.
\nFunction Signature\nasync def arun_many( urls:
Union[List[str], List[Any]], config:
Optional[CrawlerRunConfig] = None, dispatcher:
Optional[BaseDispatcher] = None, ... ) ->
Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
\"\"\" Crawl multiple URLs concurrently or in batches. :param
urls: A list of URLs (or tasks) to crawl. :param config:
(Optional) A default `CrawlerRunConfig` applying to each
crawl. :param dispatcher: (Optional) A concurrency controller
(e.g. MemoryAdaptiveDispatcher). ... :return: Either a list of
`CrawlResult` objects, or an async generator if streaming is
enabled. \"\"\" \nDifferences from arun()\n1. Multiple URLs:
\nInstead of crawling a single URL, you pass a list of them
(strings or tasks). \nThe function returns either a list of
CrawlResult or an async generator if streaming is enabled.\n2.
Concurrency & Dispatchers: \ndispatcher param allows advanced
concurrency control. \nIf omitted, a default dispatcher (like
MemoryAdaptiveDispatcher) is used internally. \nDispatchers
handle concurrency, rate limiting, and memory-based adaptive
throttling (see Multi-URL Crawling).\n3. Streaming Support:
\nEnable streaming by setting stream=True in your
CrawlerRunConfig.\nWhen streaming, use async for to process
results as they become available.\nIdeal for processing large
numbers of URLs without waiting for all to complete.\n4.
Parallel Execution: \narun_many() can run multiple requests
concurrently under the hood. \nEach CrawlResult might also
include a dispatch_result with concurrency details (like
memory usage, start/end times).\nBasic Example (Batch Mode)\n#
Minimal usage: The default dispatcher will be used results =
await crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\"], config=CrawlerRunConfig(stream=False)
# Default behavior ) for res in results: if res.success:
print(res.url, \"crawled OK!\") else: print(\"Failed:\",
res.url, \"-\", res.error_message) \nStreaming Example\nconfig
= CrawlerRunConfig( stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS ) # Process results as they
complete async for result in await
crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\", \"https://site3.com\"],
config=config ): if result.success: print(f\"Just completed:
{result.url}\") # Process each result immediately
process_result(result) \nWith a Custom Dispatcher\ndispatcher
= MemoryAdaptiveDispatcher( memory_threshold_percent=70.0,
max_session_permit=10 ) results = await
crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\", \"https://site3.com\"],
config=my_run_config, dispatcher=dispatcher ) \nKey Points: -
Each URL is processed by the same or separate sessions,
depending on the dispatcher’s strategy. - dispatch_result in
each CrawlResult (if using concurrency) can hold memory and
timing info. - If you need to handle authentication or session
IDs, pass them in each individual task or within your run
config.\nReturn Value\nEither a list of CrawlResult objects,
or an async generator if streaming is enabled. You can iterate
to check result.success or read each item’s
extracted_content, markdown, or dispatch_result.\nDispatcher
Reference\nMemoryAdaptiveDispatcher: Dynamically manages
concurrency based on system memory usage.
\nSemaphoreDispatcher: Fixed concurrency limit, simpler but
less adaptive. \nFor advanced usage or custom settings, see
Multi-URL Crawling with Dispatchers.\nCommon Pitfalls\n1.
Large Lists: If you pass thousands of URLs, be mindful of
memory or rate-limits. A dispatcher can help. \n2. Session
Reuse: If you need specialized logins or persistent contexts,
ensure your dispatcher or tasks handle sessions accordingly.
\n3. Error Handling: Each CrawlResult might fail for different
reasons – always check result.success or the error_message
before proceeding.\nConclusion\nUse arun_many() when you want
to crawl multiple URLs simultaneously or in controlled
parallel tasks. If you need advanced concurrency features
(like memory-based adaptive throttling or complex rate-
limiting), provide a dispatcher. Each result is a standard
CrawlResult, possibly augmented with concurrency stats
(dispatch_result) for deeper inspection. For more details on
concurrency logic and dispatchers, see the Advanced Multi-URL
Crawling docs.",
"markdown": "# arun\\_many() - Crawl4AI Documentation
(v0.5.x)\n\n## `arun_many(...)` Reference\n\n> **Note**: This
function is very similar to [`arun()`]
(https://crawl4ai.com/mkdocs/api/arun/) but focused on
**concurrent** or **batch** crawling. If you’re unfamiliar
with `arun()` usage, please read that doc first, then review
this for differences.\n\n## Function Signature\n\n``async def
arun_many( urls: Union[List[str], List[Any]], config:
Optional[CrawlerRunConfig] = None, dispatcher:
Optional[BaseDispatcher] = None, ... ) ->
Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
\"\"\" Crawl multiple URLs concurrently or in
batches. :param urls: A list of URLs (or tasks) to
crawl. :param config: (Optional) A default
`CrawlerRunConfig` applying to each crawl. :param
dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher). ... :return: Either a
list of `CrawlResult` objects, or an async generator if
streaming is enabled. \"\"\"``\n\n## Differences from
`arun()`\n\n1. **Multiple URLs**:\n\n* Instead of crawling
a single URL, you pass a list of them (strings or tasks). \n* The function returns either a **list** of
`CrawlResult` or an **async generator** if streaming is
enabled.\n\n2. **Concurrency & Dispatchers**:\n\n*
**`dispatcher`** param allows advanced concurrency control. \n* If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally. \n*
Dispatchers handle concurrency, rate limiting, and memory-
based adaptive throttling (see [Multi-URL Crawling]
(https://crawl4ai.com/mkdocs/advanced/multi-url-crawling/)).\n
\n3. **Streaming Support**:\n\n* Enable streaming by
setting `stream=True` in your `CrawlerRunConfig`.\n* When
streaming, use `async for` to process results as they become
available.\n* Ideal for processing large numbers of URLs
without waiting for all to complete.\n\n4. **Parallel Execution**:\n\n* `arun_many()` can run multiple requests concurrently under the hood. \n* Each
`CrawlResult` might also include a **`dispatch_result`** with
concurrency details (like memory usage, start/end times).\n
\n### Basic Example (Batch Mode)\n\n`# Minimal usage: The
default dispatcher will be used results = await
crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\"],
config=CrawlerRunConfig(stream=False) # Default behavior )
for res in results: if res.success: print(res.url,
\"crawled OK!\") else: print(\"Failed:\", res.url,
\"-\", res.error_message)`\n\n### Streaming Example\n\n`config
= CrawlerRunConfig( stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS ) # Process results as they
complete async for result in await
crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\", \"https://site3.com\"],
config=config ): if result.success: print(f\"Just
completed: {result.url}\") # Process each result
immediately process_result(result)`\n\n### With a
Custom Dispatcher\n\n`dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=70.0,
max_session_permit=10 ) results = await
crawler.arun_many( urls=[\"https://site1.com\",
\"https://site2.com\", \"https://site3.com\"],
config=my_run_config, dispatcher=dispatcher )`\n\n**Key
Points**: - Each URL is processed by the same or separate
sessions, depending on the dispatcher’s strategy. -
`dispatch_result` in each `CrawlResult` (if using concurrency)
can hold memory and timing info. - If you need to handle
authentication or session IDs, pass them in each individual
task or within your run config.\n\n### Return Value\n\nEither
a **list** of [`CrawlResult`]
(https://crawl4ai.com/mkdocs/api/crawl-result/) objects, or an
**async generator** if streaming is enabled. You can iterate
to check `result.success` or read each item’s
`extracted_content`, `markdown`, or `dispatch_result`.\n\n* *
*\n\n## Dispatcher Reference\n\n*
**`MemoryAdaptiveDispatcher`**: Dynamically manages
concurrency based on system memory usage. \n* **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive. \n\nFor advanced usage or custom
settings, see [Multi-URL Crawling with Dispatchers]
(https://crawl4ai.com/mkdocs/advanced/multi-url-crawling/).\n
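A rough sketch combining a memory-adaptive dispatcher with streaming results; the dispatcher's import path and this exact combination are assumptions, so adjust them to your installed version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # assumed import path

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # throttle when system memory passes 70%
        max_session_permit=10,          # at most 10 concurrent sessions
    )
    config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        # With stream=True, results arrive as each crawl finishes.
        async for result in await crawler.arun_many(
            urls=['https://site1.com', 'https://site2.com', 'https://site3.com'],
            config=config,
            dispatcher=dispatcher,
        ):
            print(result.url, result.success)

asyncio.run(main())
```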
\n* * *\n\n## Common Pitfalls\n\n1. **Large Lists**: If you
pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help. \n\n2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly. \n\n
3. **Error Handling**: Each `CrawlResult` might fail for different reasons – always check `result.success` or the
`error_message` before proceeding.\n\n* * *\n\n## Conclusion\n
\nUse `arun_many()` when you want to **crawl multiple URLs**
simultaneously or in controlled parallel tasks. If you need
advanced concurrency features (like memory-based adaptive
throttling or complex rate-limiting), provide a
**dispatcher**. Each result is a standard `CrawlResult`,
possibly augmented with concurrency stats (`dispatch_result`)
for deeper inspection. For more details on concurrency logic
and dispatchers, see the [Advanced Multi-URL Crawling]
(https://crawl4ai.com/mkdocs/advanced/multi-url-crawling/)
docs.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/parameters/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/api/parameters/",
"loadedTime": "2025-03-05T23:17:49.945Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/api/parameters/",
"title": "Browser, Crawler & LLM Config - Crawl4AI
Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:47 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"dbe1e33ad68171faa3d9c5ad72ab3fd5\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Browser, Crawler & LLM Config\n1. BrowserConfig â
€“ Controlling the Browser\nBrowserConfig focuses on how the
browser is launched and behaves. This includes headless mode,
proxies, user agents, and other environment tweaks.\nfrom
crawl4ai import AsyncWebCrawler, BrowserConfig browser_cfg =
BrowserConfig( browser_type=\"chromium\", headless=True,
viewport_width=1280, viewport_height=720, proxy=
\"http://user:pass@proxy:8080\", user_agent=\"Mozilla/5.0
(X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0
Safari/537.36\", ) \n1.1 Parameter Highlights\nParameter Type
/ Default What It Does \nbrowser_type\t\"chromium\", \"firefox
\", \"webkit\"\n(default: \"chromium\")\tWhich browser engine
to use. \"chromium\" is typical for many sites, \"firefox\" or
\"webkit\" for specialized tests.\t\nheadless\tbool (default:
True)\tHeadless means no visible UI. False is handy for
debugging.\t\nviewport_width\tint (default: 1080)\tInitial
page width (in px). Useful for testing responsive layouts.\t
\nviewport_height\tint (default: 600)\tInitial page height (in
px).\t\nproxy\tstr (default: None)\tSingle-proxy URL if you
want all traffic to go through it, e.g.
\"http://user:pass@proxy:8080\".\t\nproxy_config\tdict
(default: None)\tFor advanced or multi-proxy needs, specify
details like {\"server\": \"...\", \"username\": \"...
\", ...}.\t\nuse_persistent_context\tbool (default: False)\tIf
True, uses a persistent browser context (keep cookies,
sessions across runs). Also sets use_managed_browser=True.\t
\nuser_data_dir\tstr or None (default: None)\tDirectory to
store user data (profiles, cookies). Must be set if you want
permanent sessions.\t\nignore_https_errors\tbool (default:
True)\tIf True, continues despite invalid certificates (common
in dev/staging).\t\njava_script_enabled\tbool (default:
True)\tDisable if you want no JS overhead, or if only static
content is needed.\t\ncookies\tlist (default: [])\tPre-set
cookies, each a dict like {\"name\": \"session\", \"value\":
\"...\", \"url\": \"...\"}.\t\nheaders\tdict (default:
{})\tExtra HTTP headers for every request, e.g. {\"Accept-
Language\": \"en-US\"}.\t\nuser_agent\tstr (default: Chrome-
based UA)\tYour custom or random user agent. user_agent_mode=
\"random\" can shuffle it.\t\nlight_mode\tbool (default:
False)\tDisables some background features for performance
gains.\t\ntext_mode\tbool (default: False)\tIf True, tries to
disable images/other heavy content for speed.\t
\nuse_managed_browser\tbool (default: False)\tFor advanced “managed” interactions (debugging, CDP usage). Typically
set automatically if persistent context is on.\t\nextra_args
\tlist (default: [])\tAdditional flags for the underlying
browser process, e.g. [\"--disable-extensions\"].\t\nTips: -
Set headless=False to visually debug how pages load or how
interactions proceed.\n- If you need authentication storage or
repeated sessions, consider use_persistent_context=True and
specify user_data_dir.\n- For large pages, you might need a
bigger viewport_width and viewport_height to handle dynamic
content.\n2. CrawlerRunConfig – Controlling Each Crawl
\nWhile BrowserConfig sets up the environment,
CrawlerRunConfig details how each crawl operation should
behave: caching, content filtering, link or domain blocking,
timeouts, JavaScript code, etc.\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig run_cfg =
CrawlerRunConfig( wait_for=\"css:.main-content\",
word_count_threshold=15, excluded_tags=[\"nav\", \"footer\"],
exclude_external_links=True, stream=True, # Enable streaming
for arun_many() ) \n2.1 Parameter Highlights\nWe group them by
category. \nA) Content Processing\nParameter Type / Default
What It Does \nword_count_threshold\tint (default: ~
200)\tSkips text blocks below X words. Helps ignore trivial
sections.\t\nextraction_strategy\tExtractionStrategy (default:
None)\tIf set, extracts structured data (CSS-based, LLM-based,
etc.).\t\nmarkdown_generator\tMarkdownGenerationStrategy
(None)\tIf you want specialized markdown output (citations,
filtering, chunking, etc.).\t\ncss_selector\tstr
(None)\tRetains only the part of the page matching this
selector.\t\nexcluded_tags\tlist (None)\tRemoves entire tags
(e.g. [\"script\", \"style\"]).\t\nexcluded_selector\tstr
(None)\tLike css_selector but to exclude. E.g.
\"#ads, .tracker\".\t\nonly_text\tbool (False)\tIf True, tries
to extract text-only content.\t\nprettiify\tbool (False)\tIf
True, beautifies final HTML (slower, purely cosmetic).\t
\nkeep_data_attributes\tbool (False)\tIf True, preserve data-*
attributes in cleaned HTML.\t\nremove_forms\tbool (False)\tIf
True, remove all <form> elements.\t\nB) Caching & Session
\nParameter Type / Default What It Does \ncache_mode
\tCacheMode or None\tControls how caching is handled (ENABLED,
BYPASS, DISABLED, etc.). If None, typically defaults to
ENABLED.\t\nsession_id\tstr or None\tAssign a unique ID to
reuse a single browser session across multiple arun() calls.\t
\nbypass_cache\tbool (False)\tIf True, acts like
CacheMode.BYPASS.\t\ndisable_cache\tbool (False)\tIf True,
acts like CacheMode.DISABLED.\t\nno_cache_read\tbool
(False)\tIf True, acts like CacheMode.WRITE_ONLY (writes cache
but never reads).\t\nno_cache_write\tbool (False)\tIf True,
acts like CacheMode.READ_ONLY (reads cache but never writes).
\t\nUse these for controlling whether you read or write from a
local content cache. Handy for large batch crawls or repeated
site visits.\nC) Page Navigation & Timing\nParameter Type /
Default What It Does \nwait_until\tstr
(domcontentloaded)\tCondition for navigation to “complete”. Often \"networkidle\" or \"domcontentloaded\".\t
\npage_timeout\tint (60000 ms)\tTimeout for page navigation or
JS steps. Increase for slow sites.\t\nwait_for\tstr or None
\tWait for a CSS (\"css:selector\") or JS (\"js:() => bool\")
condition before content extraction.\t\nwait_for_images\tbool
(False)\tWait for images to load before finishing. Slows down
if you only want text.\t\ndelay_before_return_html\tfloat
(0.1)\tAdditional pause (seconds) before final HTML is
captured. Good for last-second updates.\t\ncheck_robots_txt
\tbool (False)\tWhether to check and respect robots.txt rules
before crawling. If True, caches robots.txt for efficiency.\t
\nmean_delay and max_range\tfloat (0.1, 0.3)\tIf you call
arun_many(), these define random delay intervals between
crawls, helping avoid detection or rate limits.\t
\nsemaphore_count\tint (5)\tMax concurrency for arun_many().
Increase if you have resources for parallel crawls.\t\nD) Page
Interaction\nParameter Type / Default What It Does \njs_code
\tstr or list[str] (None)\tJavaScript to run after load. E.g.
\"document.querySelector('button')?.click();\".\t\njs_only
\tbool (False)\tIf True, indicates we’re reusing an existing
session and only applying JS. No full reload.\t
\nignore_body_visibility\tbool (True)\tSkip checking if <body>
is visible. Usually best to keep True.\t\nscan_full_page\tbool
(False)\tIf True, auto-scroll the page to load dynamic content
(infinite scroll).\t\nscroll_delay\tfloat (0.2)\tDelay between
scroll steps if scan_full_page=True.\t\nprocess_iframes\tbool
(False)\tInlines iframe content for single-page extraction.\t
\nremove_overlay_elements\tbool (False)\tRemoves potential
modals/popups blocking the main content.\t\nsimulate_user
\tbool (False)\tSimulate user interactions (mouse movements)
to avoid bot detection.\t\noverride_navigator\tbool
(False)\tOverride navigator properties in JS for stealth.\t
\nmagic\tbool (False)\tAutomatic handling of popups/consent
banners. Experimental.\t\nadjust_viewport_to_content\tbool
(False)\tResizes viewport to match page content height.\t\nIf
your page is a single-page app with repeated JS updates, set
js_only=True in subsequent calls, plus a session_id for
reusing the same tab.\nE) Media Handling\nParameter Type /
Default What It Does \nscreenshot\tbool (False)\tCapture a
screenshot (base64) in result.screenshot.\t
\nscreenshot_wait_for\tfloat or None\tExtra wait time before
the screenshot.\t\nscreenshot_height_threshold\tint (~
20000)\tIf the page is taller than this, alternate screenshot
strategies are used.\t\npdf\tbool (False)\tIf True, returns a
PDF in result.pdf.\t\nimage_description_min_word_threshold
\tint (~50)\tMinimum words for an image’s alt text or
description to be considered valid.\t\nimage_score_threshold
\tint (~3)\tFilter out low-scoring images. The crawler scores
images by relevance (size, context, etc.).\t
\nexclude_external_images\tbool (False)\tExclude images from
other domains.\t\nF) Link/Domain Handling\nParameter Type /
Default What It Does \nexclude_social_media_domains\tlist
(e.g. Facebook/Twitter)\tA default list can be extended. Any
link to these domains is removed from final output.\t
\nexclude_external_links\tbool (False)\tRemoves all links
pointing outside the current domain.\t
\nexclude_social_media_links\tbool (False)\tStrips links
specifically to social sites (like Facebook or Twitter).\t
\nexclude_domains\tlist ([])\tProvide a custom list of domains
to exclude (like [\"ads.com\", \"trackers.io\"]).\t\nUse these
for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).\nG) Debug &
Logging\nParameter Type / Default What It Does \nverbose\tbool
(True)\tPrints logs detailing each step of crawling,
interactions, or errors.\t\nlog_console\tbool (False)\tLogs
the page’s JavaScript console output if you want deeper JS
debugging.\t\n2.2 Helper Methods\nBoth BrowserConfig and
CrawlerRunConfig provide a clone() method to create modified
copies:\n# Create a base configuration base_config =
CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
word_count_threshold=200 ) # Create variations using clone()
stream_config = base_config.clone(stream=True) no_cache_config
= base_config.clone( cache_mode=CacheMode.BYPASS,
stream=True ) \nThe clone() method is particularly useful when
you need slightly different configurations for different use
cases, without modifying the original config.\n2.3 Example
Usage\nimport asyncio from crawl4ai import AsyncWebCrawler,
BrowserConfig, CrawlerRunConfig, CacheMode async def main(): #
Configure the browser browser_cfg =
BrowserConfig( headless=False, viewport_width=1280,
viewport_height=720, proxy=\"http://user:pass@myproxy:8080\",
text_mode=True ) # Configure the run run_cfg =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS, session_id=
\"my_session\", css_selector=\"main.article\",
excluded_tags=[\"script\", \"style\"],
exclude_external_links=True, wait_for=\"css:.article-loaded\",
screenshot=True, stream=True ) async with
AsyncWebCrawler(config=browser_cfg) as crawler: result = await
crawler.arun( url=\"https://example.com/news\",
config=run_cfg ) if result.success: print(\"Final cleaned_html
length:\", len(result.cleaned_html)) if result.screenshot:
print(\"Screenshot captured (base64, length):\",
len(result.screenshot)) else: print(\"Crawl failed:\",
result.error_message) if __name__ == \"__main__\":
asyncio.run(main()) \n2.4 Compliance & Ethics\nParameter Type / Default What It Does \ncheck_robots_txt\tbool (False)\tWhen True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend.\t
\nuser_agent\tstr (None)\tUser agent string to identify your crawler. Used for robots.txt checking when enabled.\t
\nrun_config = CrawlerRunConfig( check_robots_txt=True, # Enable robots.txt compliance user_agent=\"MyBot/1.0\" # Identify your crawler )
\n3. LlmConfig - Setting up LLM providers\nLlmConfig is useful
to pass LLM provider config to strategies and functions that
rely on LLMs to do extraction, filtering, schema generation
etc. Currently it can be used in the
following -\nLLMExtractionStrategy\nLLMContentFilter
\nJsonCssExtractionStrategy.generate_schema
\nJsonXPathExtractionStrategy.generate_schema\n3.1 Parameters
\nParameter Type / Default What It Does \nprovider\t
\"ollama/llama3\",\"groq/llama3-70b-8192\",
\"groq/llama3-8b-8192\", \"openai/gpt-4o-mini\" ,
\"openai/gpt-4o\",\"openai/o1-mini\",\"openai/o1-preview\",
\"openai/o3-mini\",\"openai/o3-mini-high\",
\"anthropic/claude-3-haiku-20240307\",\"anthropic/claude-3-
opus-20240229\",\"anthropic/claude-3-sonnet-20240229\",
\"anthropic/claude-3-5-sonnet-20240620\",\"gemini/gemini-pro
\",\"gemini/gemini-1.5-pro\",\"gemini/gemini-2.0-flash\",
\"gemini/gemini-2.0-flash-exp\",\"gemini/gemini-2.0-flash-
lite-preview-02-05\",\"deepseek/deepseek-chat\"\n(default:
\"openai/gpt-4o-mini\")\tWhich LLM provoder to use.\t
\napi_token\t1.Optional. When not provided explicitly,
api_token will be read from environment variables based on
provider. For example: If a gemini model is passed as provider
then,\"GEMINI_API_KEY\" will be read from environment
variables \n2. API token of LLM provider \neg: api_token =
\"gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv\"
\n3. Environment variable - use with prefix \"env:\"
\neg:api_token = \"env: GROQ_API_KEY\"\tAPI token to use for
the given provider\t\nbase_url\tOptional. Custom API endpoint
\tIf your provider has a custom endpoint\t\n3.2 Example Usage
\nllmConfig = LlmConfig(provider=\"openai/gpt-4o-mini\",
api_token=os.getenv(\"OPENAI_API_KEY\")) \n4. Putting It All
Together\nUse BrowserConfig for global browser settings:
engine, headless, proxy, user agent. \nUse CrawlerRunConfig
for each crawl’s context: how to filter content, handle
caching, wait for dynamic elements, or run JS. \nPass both
configs to AsyncWebCrawler (the BrowserConfig) and then to
arun() (the CrawlerRunConfig). \nUse LlmConfig for LLM
provider configurations that can be used across all
extraction, filtering, schema generation tasks. Can be used
in - LLMExtractionStrategy, LLMContentFilter,
JsonCssExtractionStrategy.generate_schema &
JsonXPathExtractionStrategy.generate_schema\n# Create a
modified copy with the clone() method stream_cfg =
run_cfg.clone( stream=True, cache_mode=CacheMode.BYPASS )",
"markdown": "# Browser, Crawler & LLM Config\n\n## 1.â
€€**BrowserConfig** – Controlling the Browser\n
\n`BrowserConfig` focuses on **how** the browser is launched
and behaves. This includes headless mode, proxies, user
agents, and other environment tweaks.\n\n`from crawl4ai import
AsyncWebCrawler, BrowserConfig browser_cfg =
BrowserConfig( browser_type=\"chromium\",
headless=True, viewport_width=1280, viewport_height=
720, proxy=\"http://user:pass@proxy:8080\",
user_agent=\"Mozilla/5.0 (X11; Linux x86_64)
AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36\", )`\n\n##
1.1 Parameter Highlights\n\n| **Parameter** | **Type /
Default** | **What It Does** |\n| --- | --- | --- |\n|
**`browser_type`** | `\"chromium\"`, `\"firefox\"`, `\"webkit
\"` <br>_(default: `\"chromium\"`)_ | Which browser engine to
use. `\"chromium\"` is typical for many sites, `\"firefox\"`
or `\"webkit\"` for specialized tests. |\n| **`headless`** |
`bool` (default: `True`) | Headless means no visible UI.
`False` is handy for debugging. |\n| **`viewport_width`** |
`int` (default: `1080`) | Initial page width (in px). Useful
for testing responsive layouts. |\n| **`viewport_height`** |
`int` (default: `600`) | Initial page height (in px). |\n|
**`proxy`** | `str` (default: `None`) | Single-proxy URL if
you want all traffic to go through it, e.g. `
\"http://user:pass@proxy:8080\"`. |\n| **`proxy_config`** |
`dict` (default: `None`) | For advanced or multi-proxy needs,
specify details like `{\"server\": \"...\", \"username\":
\"...\", ...}`. |\n| **`use_persistent_context`** | `bool`
(default: `False`) | If `True`, uses a **persistent** browser
context (keep cookies, sessions across runs). Also sets
`use_managed_browser=True`. |\n| **`user_data_dir`** | `str or
None` (default: `None`) | Directory to store user data
(profiles, cookies). Must be set if you want permanent
sessions. |\n| **`ignore_https_errors`** | `bool` (default:
`True`) | If `True`, continues despite invalid certificates
(common in dev/staging). |\n| **`java_script_enabled`** |
`bool` (default: `True`) | Disable if you want no JS overhead,
or if only static content is needed. |\n| **`cookies`** |
`list` (default: `[]`) | Pre-set cookies, each a dict like
`{\"name\": \"session\", \"value\": \"...\", \"url\": \"...
\"}`. |\n| **`headers`** | `dict` (default: `{}`) | Extra HTTP
headers for every request, e.g. `{\"Accept-Language\": \"en-US
\"}`. |\n| **`user_agent`** | `str` (default: Chrome-based UA)
| Your custom or random user agent. `user_agent_mode=\"random
\"` can shuffle it. |\n| **`light_mode`** | `bool` (default:
`False`) | Disables some background features for performance
gains. |\n| **`text_mode`** | `bool` (default: `False`) | If
`True`, tries to disable images/other heavy content for speed.
|\n| **`use_managed_browser`** | `bool` (default: `False`) |
For advanced “managed” interactions (debugging, CDP
usage). Typically set automatically if persistent context is
on. |\n| **`extra_args`** | `list` (default: `[]`) |
Additional flags for the underlying browser process, e.g.
`[\"--disable-extensions\"]`. |\n\n**Tips**: - Set
`headless=False` to visually **debug** how pages load or how
interactions proceed. \n\\- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`. \n
\\- For large pages, you might need a bigger `viewport_width`
and `viewport_height` to handle dynamic content.\n\n* * *\n
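As a minimal sketch of the persistent-profile tip above (the profile directory is a placeholder path):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    headless=True,
    use_persistent_context=True,           # keep cookies/sessions across runs
    user_data_dir='./my_browser_profile',  # placeholder path; required for persistence
    viewport_width=1280,
    viewport_height=720,
)
crawler = AsyncWebCrawler(config=browser_cfg)
```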
\n## 2. **CrawlerRunConfig** – Controlling Each Crawl\n
\nWhile `BrowserConfig` sets up the **environment**,
`CrawlerRunConfig` details **how** each **crawl operation**
should behave: caching, content filtering, link or domain
blocking, timeouts, JavaScript code, etc.\n\n`from crawl4ai
import AsyncWebCrawler, CrawlerRunConfig run_cfg =
CrawlerRunConfig( wait_for=\"css:.main-content\",
word_count_threshold=15, excluded_tags=[\"nav\", \"footer
\"], exclude_external_links=True, stream=True, #
Enable streaming for arun_many() )`\n\n## 2.1 Parameter
Highlights\n\nWe group them by category.\n\n### A) **Content
Processing**\n\n| **Parameter** | **Type / Default** | **What
It Does** |\n| --- | --- | --- |\n| **`word_count_threshold`**
| `int` (default: ~200) | Skips text blocks below X words.
Helps ignore trivial sections. |\n| **`extraction_strategy`**
| `ExtractionStrategy` (default: None) | If set, extracts
structured data (CSS-based, LLM-based, etc.). |\n|
**`markdown_generator`** | `MarkdownGenerationStrategy` (None)
| If you want specialized markdown output (citations,
filtering, chunking, etc.). |\n| **`css_selector`** | `str`
(None) | Retains only the part of the page matching this
selector. |\n| **`excluded_tags`** | `list` (None) | Removes
entire tags (e.g. `[\"script\", \"style\"]`). |\n|
**`excluded_selector`** | `str` (None) | Like `css_selector`
but to exclude. E.g. `\"#ads, .tracker\"`. |\n|
**`only_text`** | `bool` (False) | If `True`, tries to extract
text-only content. |\n| **`prettiify`** | `bool` (False) | If
`True`, beautifies final HTML (slower, purely cosmetic). |\n|
**`keep_data_attributes`** | `bool` (False) | If `True`,
preserve `data-*` attributes in cleaned HTML. |\n|
**`remove_forms`** | `bool` (False) | If `True`, remove all
`<form>` elements. |\n\n* * *\n\n### B) **Caching & Session**
\n\n| **Parameter** | **Type / Default** | **What It Does** |
\n| --- | --- | --- |\n| **`cache_mode`** | `CacheMode or
None` | Controls how caching is handled (`ENABLED`, `BYPASS`,
`DISABLED`, etc.). If `None`, typically defaults to `ENABLED`.
|\n| **`session_id`** | `str or None` | Assign a unique ID to
reuse a single browser session across multiple `arun()` calls.
|\n| **`bypass_cache`** | `bool` (False) | If `True`, acts
like `CacheMode.BYPASS`. |\n| **`disable_cache`** | `bool`
(False) | If `True`, acts like `CacheMode.DISABLED`. |\n|
**`no_cache_read`** | `bool` (False) | If `True`, acts like
`CacheMode.WRITE_ONLY` (writes cache but never reads). |\n|
**`no_cache_write`** | `bool` (False) | If `True`, acts like
`CacheMode.READ_ONLY` (reads cache but never writes). |\n\nUse
these for controlling whether you read or write from a local
content cache. Handy for large batch crawls or repeated site
visits.\n\n* * *\n\n### C) **Page Navigation & Timing**\n\n|
**Parameter** | **Type / Default** | **What It Does** |\n| ---
| --- | --- |\n| **`wait_until`** | `str` (domcontentloaded) |
Condition for navigation to “complete”. Often `
\"networkidle\"` or `\"domcontentloaded\"`. |\n|
**`page_timeout`** | `int` (60000 ms) | Timeout for page
navigation or JS steps. Increase for slow sites. |\n|
**`wait_for`** | `str or None` | Wait for a CSS (`
\"css:selector\"`) or JS (`\"js:() => bool\"`) condition
before content extraction. |\n| **`wait_for_images`** | `bool`
(False) | Wait for images to load before finishing. Slows down
if you only want text. |\n| **`delay_before_return_html`** |
`float` (0.1) | Additional pause (seconds) before final HTML
is captured. Good for last-second updates. |\n|
**`check_robots_txt`** | `bool` (False) | Whether to check and
respect robots.txt rules before crawling. If True, caches
robots.txt for efficiency. |\n| **`mean_delay`** and
**`max_range`** | `float` (0.1, 0.3) | If you call
`arun_many()`, these define random delay intervals between
crawls, helping avoid detection or rate limits. |\n|
**`semaphore_count`** | `int` (5) | Max concurrency for
`arun_many()`. Increase if you have resources for parallel
crawls. |\n\n* * *\n\n### D) **Page Interaction**\n\n|
**Parameter** | **Type / Default** | **What It Does** |\n| ---
| --- | --- |\n| **`js_code`** | `str or list[str]` (None) |
JavaScript to run after load. E.g. `
\"document.querySelector('button')?.click();\"`. |\n|
**`js_only`** | `bool` (False) | If `True`, indicates we’re
reusing an existing session and only applying JS. No full
reload. |\n| **`ignore_body_visibility`** | `bool` (True) |
Skip checking if `<body>` is visible. Usually best to keep
`True`. |\n| **`scan_full_page`** | `bool` (False) | If
`True`, auto-scroll the page to load dynamic content (infinite
scroll). |\n| **`scroll_delay`** | `float` (0.2) | Delay
between scroll steps if `scan_full_page=True`. |\n|
**`process_iframes`** | `bool` (False) | Inlines iframe
content for single-page extraction. |\n|
**`remove_overlay_elements`** | `bool` (False) | Removes
potential modals/popups blocking the main content. |\n|
**`simulate_user`** | `bool` (False) | Simulate user
interactions (mouse movements) to avoid bot detection. |\n|
**`override_navigator`** | `bool` (False) | Override
`navigator` properties in JS for stealth. |\n| **`magic`** |
`bool` (False) | Automatic handling of popups/consent banners.
Experimental. |\n| **`adjust_viewport_to_content`** | `bool`
(False) | Resizes viewport to match page content height. |\n
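A minimal sketch of combining these interaction parameters with session reuse; the URL, the `button.load-more` click, and the `.new-items` wait condition are placeholders, not part of the API:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # First call: full page load, auto-scroll, and keep the tab alive via session_id
        first_cfg = CrawlerRunConfig(
            session_id="sp_session",   # reuse this browser tab later
            scan_full_page=True,       # auto-scroll to trigger lazy content
            scroll_delay=0.3,
        )
        result1 = await crawler.arun(url="https://example.com/feed", config=first_cfg)

        # Second call: same tab, JS only (no full reload), then wait for new items
        more_cfg = CrawlerRunConfig(
            session_id="sp_session",
            js_only=True,
            js_code="document.querySelector('button.load-more')?.click();",
            wait_for="css:.new-items",  # placeholder selector
        )
        result2 = await crawler.arun(url="https://example.com/feed", config=more_cfg)
        print(len(result1.html), len(result2.html))

asyncio.run(main())
```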
\nIf your page is a single-page app with repeated JS updates,
set `js_only=True` in subsequent calls, plus a `session_id`
for reusing the same tab.\n\n* * *\n\n### E) **Media
Handling**\n\n| **Parameter** | **Type / Default** | **What It
Does** |\n| --- | --- | --- |\n| **`screenshot`** | `bool`
(False) | Capture a screenshot (base64) in
`result.screenshot`. |\n| **`screenshot_wait_for`** | `float
or None` | Extra wait time before the screenshot. |\n|
**`screenshot_height_threshold`** | `int` (~20000) | If the
page is taller than this, alternate screenshot strategies are
used. |\n| **`pdf`** | `bool` (False) | If `True`, returns a
PDF in `result.pdf`. |\n|
**`image_description_min_word_threshold`** | `int` (~50) |
Minimum words for an image’s alt text or description to be
considered valid. |\n| **`image_score_threshold`** | `int` (~
3) | Filter out low-scoring images. The crawler scores images
by relevance (size, context, etc.). |\n|
**`exclude_external_images`** | `bool` (False) | Exclude
images from other domains. |\n\n* * *\n\n### F) **Link/Domain
Handling**\n\n| **Parameter** | **Type / Default** | **What It
Does** |\n| --- | --- | --- |\n|
**`exclude_social_media_domains`** | `list` (e.g.
Facebook/Twitter) | A default list can be extended. Any link
to these domains is removed from final output. |\n|
**`exclude_external_links`** | `bool` (False) | Removes all
links pointing outside the current domain. |\n|
**`exclude_social_media_links`** | `bool` (False) | Strips
links specifically to social sites (like Facebook or Twitter).
|\n| **`exclude_domains`** | `list` (\\[\\]) | Provide a
custom list of domains to exclude (like `[\"ads.com\",
\"trackers.io\"]`). |\n\nUse these for link-level content
filtering (often to keep crawls “internal” or to remove
spammy domains).\n\n* * *\n\n### G) **Debug & Logging**\n\n|
**Parameter** | **Type / Default** | **What It Does** |\n| ---
| --- | --- |\n| **`verbose`** | `bool` (True) | Prints logs
detailing each step of crawling, interactions, or errors. |\n|
**`log_console`** | `bool` (False) | Logs the page’s
JavaScript console output if you want deeper JS debugging. |\n
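Pulling together the caching (B), timing (C), and logging (G) parameters above, here is a hedged sketch of a polite `arun_many()` batch; the URLs are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    cfg = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # read and write the local cache
        wait_until="networkidle",      # wait for network traffic to settle
        page_timeout=90_000,           # 90 s budget for slow sites
        mean_delay=0.5,                # random delays between arun_many() crawls
        max_range=1.0,
        semaphore_count=3,             # cap parallel crawls
        check_robots_txt=True,         # respect robots.txt
        verbose=True,                  # step-by-step logging
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://example.com/a", "https://example.com/b"],  # placeholders
            config=cfg,
        )
        for r in results:
            print(r.url, r.success)

asyncio.run(main())
```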
\n* * *\n\n## 2.2 Helper Methods\n\nBoth `BrowserConfig` and
`CrawlerRunConfig` provide a `clone()` method to create
modified copies:\n\n`# Create a base configuration base_config
= CrawlerRunConfig( cache_mode=CacheMode.ENABLED,
word_count_threshold=200 ) # Create variations using clone()
stream_config = base_config.clone(stream=True) no_cache_config
= base_config.clone( cache_mode=CacheMode.BYPASS,
stream=True )`\n\nThe `clone()` method is particularly useful
when you need slightly different configurations for different
use cases, without modifying the original config.\n\n## 2.3
Example Usage\n\n``import asyncio from crawl4ai import
AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def main(): # Configure the browser browser_cfg
= BrowserConfig( headless=False,
viewport_width=1280, viewport_height=720,
proxy=\"http://user:pass@myproxy:8080\",
text_mode=True ) # Configure the run run_cfg =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
session_id=\"my_session\", css_selector=\"main.article
\", excluded_tags=[\"script\", \"style\"],
exclude_external_links=True, wait_for=\"css:.article-
loaded\", screenshot=True, stream=True )
async with AsyncWebCrawler(config=browser_cfg) as crawler:
result = await crawler.arun( url=
\"https://example.com/news\",
config=run_cfg ) if result.success:
print(\"Final cleaned_html length:\",
len(result.cleaned_html)) if result.screenshot:
print(\"Screenshot captured (base64, length):\",
len(result.screenshot)) else:
print(\"Crawl failed:\", result.error_message) if __name__ ==
\"__main__\": asyncio.run(main()) ## 2.4 Compliance &
Ethics | **Parameter** | **Type / Default** |
**What It Does**
|
|-----------------------|-------------------------|-----------
--------------------------------------------------------------
---------------------------------------------| |
**`check_robots_txt`**| `bool` (False) | When True,
checks and respects robots.txt rules before crawling. Uses
efficient caching with SQLite backend. | |
**`user_agent`** | `str` (None) | User agent
string to identify your crawler. Used for robots.txt checking
when enabled. | ```python
run_config = CrawlerRunConfig( check_robots_txt=True, #
Enable robots.txt compliance user_agent=\"MyBot/1.0\" #
Identify your crawler )``\n\n## 3\\. **LlmConfig** - Setting
up LLM providers\n\nLlmConfig is useful to pass LLM provider
config to strategies and functions that rely on LLMs to do
extraction, filtering, schema generation, etc. Currently it can
be used in the following (a short sketch follows this list):\n\n1. LLMExtractionStrategy\n2.
LLMContentFilter\n3. JsonCssExtractionStrategy.generate
\\_schema\n4. JsonXPathExtractionStrategy.generate\\_schema\n
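A minimal sketch of reusing one LlmConfig across several of these; an `OPENAI_API_KEY` in the environment, plus the sample HTML and instructions, are assumptions for illustration only:

```python
import os
from crawl4ai.async_configs import LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy, JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import LLMContentFilter

# One provider config, reused everywhere an LLM is needed
llm_config = LlmConfig(provider="openai/gpt-4o-mini",
                       api_token=os.getenv("OPENAI_API_KEY"))

# 1. LLM-based extraction
extraction = LLMExtractionStrategy(llmConfig=llm_config,
                                   instruction="Extract article title and author")

# 2. LLM-based content filtering (used inside a markdown generator)
content_filter = LLMContentFilter(llmConfig=llm_config,
                                  instruction="Keep only the key concepts")

# 3. One-off schema generation from a sample snippet (illustrative HTML)
schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'><h2>Name</h2><span class='price'>$99</span></div>",
    llmConfig=llm_config,
    query="Extract product name and price",
)
print(schema)
```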
\n## 3.1 Parameters\n\n| **Parameter** | **Type / Default** |
**What It Does** |\n| --- | --- | --- |\n| **`provider`** | `
\"ollama/llama3\",\"groq/llama3-70b-8192\",
\"groq/llama3-8b-8192\", \"openai/gpt-4o-mini\" ,
\"openai/gpt-4o\",\"openai/o1-mini\",\"openai/o1-preview\",
\"openai/o3-mini\",\"openai/o3-mini-high\",
\"anthropic/claude-3-haiku-20240307\",\"anthropic/claude-3-
opus-20240229\",\"anthropic/claude-3-sonnet-20240229\",
\"anthropic/claude-3-5-sonnet-20240620\",\"gemini/gemini-pro
\",\"gemini/gemini-1.5-pro\",\"gemini/gemini-2.0-flash\",
\"gemini/gemini-2.0-flash-exp\",\"gemini/gemini-2.0-flash-
lite-preview-02-05\",\"deepseek/deepseek-chat\"`
<br>_(default: `\"openai/gpt-4o-mini\"`)_ | Which LLM provider
to use. |\n| **`api_token`** | 1\\. Optional. When not provided
explicitly, api\\_token will be read from environment
variables based on provider. For example: If a gemini model is
passed as provider then,`\"GEMINI_API_KEY\"` will be read from
environment variables <br>2\\. API token of LLM provider
<br>eg: `api_token = \"gsk_
1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv\"` <br>3
\\. Environment variable - use with prefix \"env:\" <br>eg:
`api_token = \"env: GROQ_API_KEY\"` | API token to use for the
given provider |\n| **`base_url`** | Optional. Custom API
endpoint | If your provider has a custom endpoint |\n\n## 3.2
Example Usage\n\n`llmConfig = LlmConfig(provider=
\"openai/gpt-4o-mini\", api_token=os.getenv(\"OPENAI_API_KEY
\"))`\n\n## 4\\. Putting It All Together\n\n* **Use**
`BrowserConfig` for **global** browser settings: engine,
headless, proxy, user agent.\n* **Use** `CrawlerRunConfig`
for each crawl’s **context**: how to filter content, handle
caching, wait for dynamic elements, or run JS.\n* **Pass**
both configs to `AsyncWebCrawler` (the `BrowserConfig`) and
then to `arun()` (the `CrawlerRunConfig`).\n* **Use**
`LlmConfig` for LLM provider configurations that can be used
across all extraction, filtering, schema generation tasks. Can
be used in - `LLMExtractionStrategy`, `LLMContentFilter`,
`JsonCssExtractionStrategy.generate_schema` &
`JsonXPathExtractionStrategy.generate_schema`\n\n`# Create a
modified copy with the clone() method stream_cfg =
run_cfg.clone( stream=True,
cache_mode=CacheMode.BYPASS )`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/crawl-result/",
"crawl": {
"loadedUrl": "https://crawl4ai.com/mkdocs/api/crawl-
result/",
"loadedTime": "2025-03-05T23:17:50.954Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl": "https://docs.crawl4ai.com/api/crawl-
result/",
"title": "CrawlResult - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:48 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"c2fa3473e84551c3b62ad7e1d7a00560\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "CrawlResult - Crawl4AI Documentation
(v0.5.x)\nCrawlResult Reference\nThe CrawlResult class
encapsulates everything returned after a single crawl
operation. It provides the raw or processed content, details
on links and media, plus optional metadata (like screenshots,
PDFs, or extracted JSON).\nLocation:
crawl4ai/crawler/models.py (for reference)\nclass
CrawlResult(BaseModel): url: str html: str success: bool
cleaned_html: Optional[str] = None media: Dict[str,
List[Dict]] = {} links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None screenshot:
Optional[str] = None pdf : Optional[bytes] = None markdown:
Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None metadata:
Optional[dict] = None error_message: Optional[str] = None
session_id: Optional[str] = None response_headers:
Optional[dict] = None status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None ... \nBelow
is a field-by-field explanation and possible usage patterns.
\n1. Basic Crawl Info\n1.1 url (str)\nWhat: The final crawled
URL (after any redirects).\nUsage: \nprint(result.url) # e.g.,
\"https://example.com/\" \n1.2 success (bool)\nWhat: True if
the crawl pipeline ended without major errors; False
otherwise.\nUsage: \nif not result.success: print(f\"Crawl
failed: {result.error_message}\") \n1.3 status_code
(Optional[int])\nWhat: The page’s HTTP status code (e.g.,
200, 404).\nUsage: \nif result.status_code == 404:
print(\"Page not found!\") \n1.4 error_message
(Optional[str])\nWhat: If success=False, a textual description
of the failure.\nUsage: \nif not result.success:
print(\"Error:\", result.error_message) \n1.5 session_id
(Optional[str])\nWhat: The ID used for reusing a browser
context across multiple calls.\nUsage: \n# If you used
session_id=\"login_session\" in CrawlerRunConfig, see it here:
print(\"Session:\", result.session_id) \nWhat: Final HTTP
response headers.\nUsage: \nif result.response_headers:
print(\"Server:\", result.response_headers.get(\"Server\",
\"Unknown\")) \n1.7 ssl_certificate
(Optional[SSLCertificate])\nWhat: If
fetch_ssl_certificate=True in your CrawlerRunConfig,
result.ssl_certificate contains a SSLCertificate object
describing the site’s certificate. You can export the cert
in multiple formats (PEM/DER/JSON) or access its properties
like issuer, subject, valid_from, valid_until, etc. Usage:
\nif result.ssl_certificate: print(\"Issuer:\",
result.ssl_certificate.issuer) \n2. Raw / Cleaned Content\n2.1
html (str)\nWhat: The original unmodified HTML from the final
page load.\nUsage: \n# Possibly large print(len(result.html))
\n2.2 cleaned_html (Optional[str])\nWhat: A sanitized HTML
version: scripts, styles, or excluded tags are removed based
on your CrawlerRunConfig.\nUsage:
\nprint(result.cleaned_html[:500]) # Show a snippet \n2.3
fit_html (Optional[str])\nWhat: If a content filter or
heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit”
or post-filter version.\nWhen: This is only present if your
markdown_generator or content_filter produces it.\nUsage: \nif
result.markdown.fit_html: print(\"High-value HTML content:\",
result.markdown.fit_html[:300]) \n3. Markdown Fields\n3.1 The
Markdown Generation Approach\nCrawl4AI can convert
HTML→Markdown, optionally including:\nRaw markdown \nLinks
as citations (with a references section) \nFit markdown if a
content filter is used (like Pruning or
BM25)\nMarkdownGenerationResult includes: - raw_markdown
(str): The full HTML→Markdown conversion.\n-
markdown_with_citations (str): Same markdown, but with link
references as academic-style citations.\n- references_markdown
(str): The reference list or footnotes at the end.\n-
fit_markdown (Optional[str]): If content filtering
(Pruning/BM25) was applied, the filtered “fit” text.\n-
fit_html (Optional[str]): The HTML that led to fit_markdown.
\nUsage: \nif result.markdown: md_res = result.markdown
print(\"Raw MD:\", md_res.raw_markdown[:300])
print(\"Citations MD:\", md_res.markdown_with_citations[:300])
print(\"References:\", md_res.references_markdown) if
md_res.fit_markdown: print(\"Pruned text:\",
md_res.fit_markdown[:300]) \n3.2 markdown (Optional[Union[str,
MarkdownGenerationResult]])\nWhat: Holds the
MarkdownGenerationResult.\nUsage:
\nprint(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html) \nImportant: “Fit” content
(in fit_markdown/fit_html) exists in result.markdown, only if
you used a filter (like PruningContentFilter or
BM25ContentFilter) within a MarkdownGenerationStrategy. \n4.1
media (Dict[str, List[Dict]])\nWhat: Contains info about
discovered images, videos, or audio. Typically keys: \"images
\", \"videos\", \"audios\".\nCommon Fields in each item:\nsrc
(str): Media URL \nalt or title (str): Descriptive text
\nscore (float): Relevance score if the crawler’s heuristic
found it “important” \ndesc or description
(Optional[str]): Additional context extracted from surrounding
text \nUsage: \nimages = result.media.get(\"images\", []) for
img in images: if img.get(\"score\", 0) > 5: print(\"High-
value image:\", img[\"src\"]) \n4.2 links (Dict[str,
List[Dict]])\nWhat: Holds internal and external link data.
Usually two keys: \"internal\" and \"external\".\nCommon
Fields:\nhref (str): The link target \ntext (str): Link text
\ntitle (str): Title attribute \ncontext (str): Surrounding
text snippet \ndomain (str): If external, the domain\nUsage:
\nfor link in result.links[\"internal\"]: print(f\"Internal
link to {link['href']} with text {link['text']}\") \n5.
Additional Fields\n5.1 extracted_content
(Optional[str])\nWhat: If you used extraction_strategy (CSS,
LLM, etc.), the structured output (JSON).\nUsage: \nif
result.extracted_content: data =
json.loads(result.extracted_content) print(data) \n5.2
downloaded_files (Optional[List[str]])\nWhat: If
accept_downloads=True in your BrowserConfig + downloads_path,
lists local file paths for downloaded items.\nUsage: \nif
result.downloaded_files: for file_path in
result.downloaded_files: print(\"Downloaded:\", file_path)
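For context, a hedged sketch of how this field typically gets populated, assuming both accept_downloads and downloads_path live on BrowserConfig as noted above; the URL and the `a.download` selector are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        accept_downloads=True,          # allow file downloads
        downloads_path="./downloads",   # placeholder local folder
    )
    run_cfg = CrawlerRunConfig(
        # placeholder selector: click the first download link on the page
        js_code="document.querySelector('a.download')?.click();",
        delay_before_return_html=2.0,   # give the download a moment to start
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com/files", config=run_cfg)
        if result.downloaded_files:
            for file_path in result.downloaded_files:
                print("Downloaded:", file_path)

asyncio.run(main())
```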
\n5.3 screenshot (Optional[str])\nWhat: Base64-encoded
screenshot if screenshot=True in CrawlerRunConfig.\nUsage:
\nimport base64 if result.screenshot: with open(\"page.png\",
\"wb\") as f: f.write(base64.b64decode(result.screenshot))
\n5.4 pdf (Optional[bytes])\nWhat: Raw PDF bytes if pdf=True
in CrawlerRunConfig.\nUsage: \nif result.pdf: with
open(\"page.pdf\", \"wb\") as f: f.write(result.pdf) \n5.5
metadata (Optional[dict])\nWhat: Page-level metadata if
discovered (title, description, OG data, etc.).\nUsage: \nif
result.metadata: print(\"Title:\", result.metadata.get(\"title
\")) print(\"Author:\", result.metadata.get(\"author\")) \n6.
dispatch_result (optional)\nA DispatchResult object providing
additional concurrency and resource usage information when
crawling URLs in parallel (e.g., via arun_many() with custom
dispatchers). It contains:\ntask_id: A unique identifier for
the parallel task.\nmemory_usage (float): The memory (in MB)
used at the time of completion.\npeak_memory (float): The peak
memory usage (in MB) recorded during the task’s execution.
\nstart_time / end_time (datetime): Time range for this
crawling task.\nerror_message (str): Any dispatcher- or
concurrency-related error encountered.\n# Example usage: for
result in results: if result.success and
result.dispatch_result: dr = result.dispatch_result print(f
\"URL: {result.url}, Task ID: {dr.task_id}\") print(f\"Memory:
{dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)\")
print(f\"Duration: {dr.end_time - dr.start_time}\") \nNote:
This field is typically populated when using arun_many(...)
alongside a dispatcher (e.g., MemoryAdaptiveDispatcher or
SemaphoreDispatcher). If no concurrency or dispatcher is used,
dispatch_result may remain None. \n7. Example: Accessing
Everything\nasync def handle_result(result: CrawlResult): if
not result.success: print(\"Crawl error:\",
result.error_message) return # Basic info print(\"Crawled URL:
\", result.url) print(\"Status code:\", result.status_code) #
HTML print(\"Original HTML size:\", len(result.html))
print(\"Cleaned HTML size:\", len(result.cleaned_html or
\"\")) # Markdown output if result.markdown: print(\"Raw
Markdown:\", result.markdown.raw_markdown[:300])
print(\"Citations Markdown:\",
result.markdown.markdown_with_citations[:300]) if
result.markdown.fit_markdown: print(\"Fit Markdown:\",
result.markdown.fit_markdown[:200]) # Media & Links if
\"images\" in result.media: print(\"Image count:\",
len(result.media[\"images\"])) if \"internal\" in
result.links: print(\"Internal link count:\",
len(result.links[\"internal\"])) # Extraction strategy result
if result.extracted_content: print(\"Structured data:\",
result.extracted_content) # Screenshot/PDF if
result.screenshot: print(\"Screenshot length:\",
len(result.screenshot)) if result.pdf: print(\"PDF bytes
length:\", len(result.pdf)) \n8. Key Points & Future\n1.
Deprecated legacy properties of CrawlResult\n- markdown_v2 -
Deprecated in v0.5. Just use markdown. It holds the
MarkdownGenerationResult now! - fit_markdown and fit_html -
Deprecated in v0.5. They can now be accessed via
MarkdownGenerationResult in result.markdown. eg:
result.markdown.fit_markdown and result.markdown.fit_html\n2.
Fit Content\n- fit_markdown and fit_html appear in
MarkdownGenerationResult, only if you used a content filter
(like PruningContentFilter or BM25ContentFilter) inside your
MarkdownGenerationStrategy or set them directly.\n- If no
filter is used, they remain None.\n3. References & Citations
\n- If you enable link citations in your
DefaultMarkdownGenerator (options={\"citations\": True}), you’ll
see markdown_with_citations plus a references_markdown
block. This helps large language models or academic-like
referencing.\n4. Links & Media\n- links[\"internal\"] and
links[\"external\"] group discovered anchors by domain.\n-
media[\"images\"] / [\"videos\"] / [\"audios\"] store
extracted media elements with optional scoring or context.\n5.
Error Cases\n- If success=False, check error_message (e.g.,
timeouts, invalid URLs).\n- status_code might be None if we
failed before an HTTP response.\nUse CrawlResult to glean all
final outputs and feed them into your data pipelines, AI
models, or archives. With the synergy of a properly configured
BrowserConfig and CrawlerRunConfig, the crawler can produce
robust, structured results here in CrawlResult.",
"markdown": "# CrawlResult - Crawl4AI Documentation
(v0.5.x)\n\n## `CrawlResult` Reference\n\nThe
**`CrawlResult`** class encapsulates everything returned after
a single crawl operation. It provides the **raw or processed
content**, details on links and media, plus optional metadata
(like screenshots, PDFs, or extracted JSON).\n\n**Location**:
`crawl4ai/crawler/models.py` (for reference)\n\n`class
CrawlResult(BaseModel): url: str html: str
success: bool cleaned_html: Optional[str] = None
media: Dict[str, List[Dict]] = {} links: Dict[str,
List[Dict]] = {} downloaded_files: Optional[List[str]] =
None screenshot: Optional[str] = None pdf :
Optional[bytes] = None markdown: Optional[Union[str,
MarkdownGenerationResult]] = None extracted_content:
Optional[str] = None metadata: Optional[dict] = None
error_message: Optional[str] = None session_id:
Optional[str] = None response_headers: Optional[dict] =
None status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None ...`\n
\nBelow is a **field-by-field** explanation and possible usage
patterns.\n\n* * *\n\n## 1\\. Basic Crawl Info\n\n### 1.1
**`url`** _(str)_\n\n**What**: The final crawled URL (after
any redirects). \n**Usage**:\n\n`print(result.url) # e.g.,
\"https://example.com/\"`\n\n### 1.2 **`success`** _(bool)_\n
\n**What**: `True` if the crawl pipeline ended without major
errors; `False` otherwise. \n**Usage**:\n\n`if not
result.success: print(f\"Crawl failed:
{result.error_message}\")`\n\n### 1.3 **`status_code`**
_(Optional\\[int\\])_\n\n**What**: The page’s HTTP status
code (e.g., 200, 404). \n**Usage**:\n\n`if result.status_code
== 404: print(\"Page not found!\")`\n\n### 1.4
**`error_message`** _(Optional\\[str\\])_\n\n**What**: If
`success=False`, a textual description of the failure.
\n**Usage**:\n\n`if not result.success: print(\"Error:\",
result.error_message)`\n\n### 1.5 **`session_id`** _(Optional
\\[str\\])_\n\n**What**: The ID used for reusing a browser
context across multiple calls. \n**Usage**:\n\n`# If you used
session_id=\"login_session\" in CrawlerRunConfig, see it here:
print(\"Session:\", result.session_id)`\n\n**What**: Final
HTTP response headers. \n**Usage**:\n\n`if
result.response_headers: print(\"Server:\",
result.response_headers.get(\"Server\", \"Unknown\"))`\n\n###
1.7 **`ssl_certificate`** _(Optional\\[SSLCertificate\\])_\n
\n**What**: If `fetch_ssl_certificate=True` in your
CrawlerRunConfig, **`result.ssl_certificate`** contains a
[**`SSLCertificate`**]
(https://crawl4ai.com/mkdocs/advanced/ssl-certificate/) object
describing the site’s certificate. You can export the cert
in multiple formats (PEM/DER/JSON) or access its properties
like `issuer`, `subject`, `valid_from`, `valid_until`, etc.
**Usage**:\n\n`if result.ssl_certificate: print(\"Issuer:
\", result.ssl_certificate.issuer)`\n\n* * *\n\n## 2\\. Raw /
Cleaned Content\n\n### 2.1 **`html`** _(str)_\n\n**What**: The
**original** unmodified HTML from the final page load.
\n**Usage**:\n\n`# Possibly large print(len(result.html))`\n
\n### 2.2 **`cleaned_html`** _(Optional\\[str\\])_\n
\n**What**: A sanitized HTML version: scripts, styles, or
excluded tags are removed based on your `CrawlerRunConfig`.
\n**Usage**:\n\n`print(result.cleaned_html[:500]) # Show a
snippet`\n\n### 2.3 **`fit_html`** _(Optional\\[str\\])_\n
\n**What**: If a **content filter** or heuristic (e.g.,
Pruning/BM25) modifies the HTML, the “fit” or post-filter
version. \n**When**: This is **only** present if your
`markdown_generator` or `content_filter` produces it.
\n**Usage**:\n\n`if result.markdown.fit_html:
print(\"High-value HTML content:\",
result.markdown.fit_html[:300])`\n\n* * *\n\n## 3\\. Markdown
Fields\n\n### 3.1 The Markdown Generation Approach\n\nCrawl4AI
can convert HTML→Markdown, optionally including:\n\n*
**Raw** markdown\n* **Links as citations** (with a
references section)\n* **Fit** markdown if a **content
filter** is used (like Pruning or BM25)\n
\n**`MarkdownGenerationResult`** includes: -
**`raw_markdown`** _(str)_: The full HTML→Markdown
conversion. \n\\- **`markdown_with_citations`** _(str)_: Same
markdown, but with link references as academic-style
citations. \n\\- **`references_markdown`** _(str)_: The
reference list or footnotes at the end. \n\\-
**`fit_markdown`** _(Optional\\[str\\])_: If content filtering
(Pruning/BM25) was applied, the filtered “fit” text. \n
\\- **`fit_html`** _(Optional\\[str\\])_: The HTML that led to
`fit_markdown`.\n\n**Usage**:\n\n`if result.markdown:
md_res = result.markdown print(\"Raw MD:\",
md_res.raw_markdown[:300]) print(\"Citations MD:\",
md_res.markdown_with_citations[:300]) print(\"References:
\", md_res.references_markdown) if md_res.fit_markdown:
print(\"Pruned text:\", md_res.fit_markdown[:300])`\n\n### 3.2
**`markdown`** _(Optional\\[Union\\[str,
MarkdownGenerationResult\\]\\])_\n\n**What**: Holds the
`MarkdownGenerationResult`. \n**Usage**:\n
\n`print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)`\n\n**Important**: “Fit”
content (in `fit_markdown`/`fit_html`) exists in
result.markdown, only if you used a **filter** (like
**PruningContentFilter** or **BM25ContentFilter**) within a
`MarkdownGenerationStrategy`.\n\n* * *\n\n### 4.1 **`media`**
_(Dict\\[str, List\\[Dict\\]\\])_\n\n**What**: Contains info
about discovered images, videos, or audio. Typically keys: `
\"images\"`, `\"videos\"`, `\"audios\"`. \n**Common Fields**
in each item:\n\n* `src` _(str)_: Media URL\n* `alt` or
`title` _(str)_: Descriptive text\n* `score` _(float)_:
Relevance score if the crawler’s heuristic found it
“important” \n* `desc` or `description` _(Optional\\[str
\\])_: Additional context extracted from surrounding text\n
\n**Usage**:\n\n`images = result.media.get(\"images\", []) for
img in images: if img.get(\"score\", 0) > 5:
print(\"High-value image:\", img[\"src\"])`\n\n### 4.2
**`links`** _(Dict\\[str, List\\[Dict\\]\\])_\n\n**What**:
Holds internal and external link data. Usually two keys: `
\"internal\"` and `\"external\"`. \n**Common Fields**:\n\n*
`href` _(str)_: The link target\n* `text` _(str)_: Link text
\n* `title` _(str)_: Title attribute\n* `context` _(str)_:
Surrounding text snippet\n* `domain` _(str)_: If external,
the domain\n\n**Usage**:\n\n`for link in
result.links[\"internal\"]: print(f\"Internal link to
{link['href']} with text {link['text']}\")`\n\n* * *\n\n## 5
\\. Additional Fields\n\n### 5.1 **`extracted_content`**
_(Optional\\[str\\])_\n\n**What**: If you used
**`extraction_strategy`** (CSS, LLM, etc.), the structured
output (JSON). \n**Usage**:\n\n`if result.extracted_content:
data = json.loads(result.extracted_content) print(data)`\n
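For context, a minimal sketch of one way `extracted_content` gets populated, using a CSS-based strategy passed through `CrawlerRunConfig`; the schema and URL are illustrative only:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Illustrative schema for repeated product cards
schema = {
    "name": "Products",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

async def main():
    strategy = JsonCssExtractionStrategy(schema)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",  # placeholder
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        if result.extracted_content:
            print(json.loads(result.extracted_content))

asyncio.run(main())
```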
\n### 5.2 **`downloaded_files`** _(Optional\\[List\\[str
\\]\\])_\n\n**What**: If `accept_downloads=True` in your
`BrowserConfig` + `downloads_path`, lists local file paths for
downloaded items. \n**Usage**:\n\n`if
result.downloaded_files: for file_path in
result.downloaded_files: print(\"Downloaded:\",
file_path)`\n\n### 5.3 **`screenshot`** _(Optional\\[str\\])_
\n\n**What**: Base64-encoded screenshot if `screenshot=True`
in `CrawlerRunConfig`. \n**Usage**:\n\n`import base64 if
result.screenshot: with open(\"page.png\", \"wb\") as f:
f.write(base64.b64decode(result.screenshot))`\n\n### 5.4
**`pdf`** _(Optional\\[bytes\\])_\n\n**What**: Raw PDF bytes
if `pdf=True` in `CrawlerRunConfig`. \n**Usage**:\n\n`if
result.pdf: with open(\"page.pdf\", \"wb\") as f:
f.write(result.pdf)`\n\n### 5.5 **`metadata`** _(Optional
\\[dict\\])_\n\n**What**: Page-level metadata if discovered
(title, description, OG data, etc.). \n**Usage**:\n\n`if
result.metadata: print(\"Title:\",
result.metadata.get(\"title\")) print(\"Author:\",
result.metadata.get(\"author\"))`\n\n* * *\n\n## 6\\.
`dispatch_result` (optional)\n\nA `DispatchResult` object
providing additional concurrency and resource usage
information when crawling URLs in parallel (e.g., via
`arun_many()` with custom dispatchers). It contains:\n\n*
**`task_id`**: A unique identifier for the parallel task.\n*
**`memory_usage`** (float): The memory (in MB) used at the
time of completion.\n* **`peak_memory`** (float): The peak
memory usage (in MB) recorded during the task’s execution.
\n* **`start_time`** / **`end_time`** (datetime): Time range
for this crawling task.\n* **`error_message`** (str): Any
dispatcher- or concurrency-related error encountered.\n\n`#
Example usage: for result in results: if result.success
and result.dispatch_result: dr =
result.dispatch_result print(f\"URL: {result.url},
Task ID: {dr.task_id}\") print(f\"Memory:
{dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)\")
print(f\"Duration: {dr.end_time - dr.start_time}\")`\n\n>
**Note**: This field is typically populated when using
`arun_many(...)` alongside a **dispatcher** (e.g.,
`MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no
concurrency or dispatcher is used, `dispatch_result` may
remain `None`.\n\n* * *\n\n## 7\\. Example: Accessing
Everything\n\n`async def handle_result(result: CrawlResult):
if not result.success: print(\"Crawl error:\",
result.error_message) return # Basic info
print(\"Crawled URL:\", result.url) print(\"Status code:
\", result.status_code) # HTML print(\"Original HTML
size:\", len(result.html)) print(\"Cleaned HTML size:\",
len(result.cleaned_html or \"\")) # Markdown output
if result.markdown: print(\"Raw Markdown:\",
result.markdown.raw_markdown[:300]) print(\"Citations
Markdown:\", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown: print(\"Fit
Markdown:\", result.markdown.fit_markdown[:200]) # Media
& Links if \"images\" in result.media:
print(\"Image count:\", len(result.media[\"images\"])) if
\"internal\" in result.links: print(\"Internal link
count:\", len(result.links[\"internal\"])) # Extraction
strategy result if result.extracted_content:
print(\"Structured data:\", result.extracted_content) #
Screenshot/PDF if result.screenshot:
print(\"Screenshot length:\", len(result.screenshot)) if
result.pdf: print(\"PDF bytes length:\",
len(result.pdf))`\n\n* * *\n\n## 8\\. Key Points & Future\n
\n1. **Deprecated legacy properties of CrawlResult** \n\\-
`markdown_v2` - Deprecated in v0.5. Just use `markdown`. It
holds the `MarkdownGenerationResult` now! - `fit_markdown` and
`fit_html` - Deprecated in v0.5. They can now be accessed via
`MarkdownGenerationResult` in `result.markdown`. eg:
`result.markdown.fit_markdown` and `result.markdown.fit_html`
\n\n2. **Fit Content** \n\\- **`fit_markdown`** and
**`fit_html`** appear in MarkdownGenerationResult, only if you
used a content filter (like **PruningContentFilter** or
**BM25ContentFilter**) inside your
**MarkdownGenerationStrategy** or set them directly. \n\\- If
no filter is used, they remain `None`.\n\n3. **References &
Citations** \n\\- If you enable link citations in your
`DefaultMarkdownGenerator` (`options={\"citations\": True}`),
you’ll see `markdown_with_citations` plus a
**`references_markdown`** block. This helps large language
models or academic-like referencing.\n\n4. **Links & Media**
\n\\- `links[\"internal\"]` and `links[\"external\"]` group
discovered anchors by domain. \n\\- `media[\"images\"]` /
`[\"videos\"]` / `[\"audios\"]` store extracted media elements
with optional scoring or context.\n\n5. **Error Cases** \n
\\- If `success=False`, check `error_message` (e.g., timeouts,
invalid URLs). \n\\- `status_code` might be `None` if we
failed before an HTTP response.\n\nUse **`CrawlResult`** to
glean all final outputs and feed them into your data
pipelines, AI models, or archives. With the synergy of a
properly configured **BrowserConfig** and
**CrawlerRunConfig**, the crawler can produce robust,
structured results here in **`CrawlResult`**.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/api/strategies/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/api/strategies/",
"loadedTime": "2025-03-05T23:17:51.950Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/",
"depth": 1,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/api/strategies/",
"title": "Strategies - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:48 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"d577136b9833af34b3b5c6b8b2b189ae\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Strategies - Crawl4AI Documentation (v0.5.x)\nThis
documentation covers the API reference for extraction and
chunking strategies in Crawl4AI.\nAll extraction strategies
inherit from the base ExtractionStrategy class and implement
two key methods: - extract(url: str, html: str) ->
List[Dict[str, Any]] - run(url: str, sections: List[str]) ->
List[Dict[str, Any]]\nUsed for extracting structured data
using Language Models.\nLLMExtractionStrategy( # Required
Parameters provider: str = DEFAULT_PROVIDER, # LLM provider
(e.g., \"ollama/llama2\") api_token: Optional[str] = None, #
API token # Extraction Configuration instruction: str = None,
# Custom extraction instruction schema: Dict = None, #
Pydantic model schema for structured data extraction_type: str
= \"block\", # \"block\" or \"schema\" # Chunking Parameters
chunk_token_threshold: int = 4000, # Maximum tokens per chunk
overlap_rate: float = 0.1, # Overlap between chunks
word_token_rate: float = 0.75, # Word to token conversion rate
apply_chunking: bool = True, # Enable/disable chunking # API
Configuration base_url: str = None, # Base URL for API
extra_args: Dict = {}, # Additional provider arguments
verbose: bool = False # Enable verbose logging )
\nCosineStrategy\nUsed for content similarity-based extraction
and clustering.\nCosineStrategy( # Content Filtering
semantic_filter: str = None, # Topic/keyword filter
word_count_threshold: int = 10, # Minimum words per cluster
sim_threshold: float = 0.3, # Similarity threshold #
Clustering Parameters max_dist: float = 0.2, # Maximum cluster
distance linkage_method: str = 'ward', # Clustering method
top_k: int = 3, # Top clusters to return # Model Configuration
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', #
Embedding model verbose: bool = False # Enable verbose
logging ) \nUsed for CSS selector-based structured data
extraction.\nJsonCssExtractionStrategy( schema: Dict[str,
Any], # Extraction schema verbose: bool = False # Enable
verbose logging ) # Schema Structure schema = { \"name\": str,
# Schema name \"baseSelector\": str, # Base CSS selector
\"fields\": [ # List of fields to extract { \"name\": str, #
Field name \"selector\": str, # CSS selector \"type\": str, #
Field type: \"text\", \"attribute\", \"html\", \"regex\"
\"attribute\": str, # For type=\"attribute\" \"pattern\": str,
# For type=\"regex\" \"transform\": str, # Optional:
\"lowercase\", \"uppercase\", \"strip\" \"default\": Any #
Default value if extraction fails } ] } \nChunking Strategies
\nAll chunking strategies inherit from ChunkingStrategy and
implement the chunk(text: str) -> list method.\nRegexChunking
\nSplits text based on regex patterns.
\nRegexChunking( patterns: List[str] = None # Regex patterns
for splitting # Default: [r'\\n\\n'] ) \nSlidingWindowChunking
\nCreates overlapping chunks with a sliding window approach.
\nSlidingWindowChunking( window_size: int = 100, # Window size
in words step: int = 50 # Step size between windows )
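Because every chunking strategy implements the same chunk(text) -> list interface described above, a strategy can also be exercised on its own. A small sketch, assuming RegexChunking and SlidingWindowChunking live in the same crawl4ai.chunking_strategy module as OverlappingWindowChunking; the sample text is arbitrary:

```python
from crawl4ai.chunking_strategy import RegexChunking, SlidingWindowChunking

text = "First paragraph.\n\nSecond paragraph with a few more words for the window demo."

# Split on blank lines (the documented default pattern)
print(RegexChunking().chunk(text))

# Overlapping word windows: 10-word windows, advancing 5 words at a time
print(SlidingWindowChunking(window_size=10, step=5).chunk(text))
```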
\nOverlappingWindowChunking\nCreates chunks with specified
overlap.\nOverlappingWindowChunking( window_size: int = 1000,
# Chunk size in words overlap: int = 100 # Overlap size in
words ) \nUsage Examples\nfrom pydantic import BaseModel from
crawl4ai.extraction_strategy import LLMExtractionStrategy from
crawl4ai.async_configs import LlmConfig # Define schema class
Article(BaseModel): title: str content: str author: str #
Create strategy strategy = LLMExtractionStrategy( llmConfig =
LlmConfig(provider=\"ollama/llama2\"),
schema=Article.schema(), instruction=\"Extract article details
\" ) # Use with crawler result = await crawler.arun( url=
\"https://example.com/article\",
extraction_strategy=strategy ) # Access extracted data data =
json.loads(result.extracted_content) \nfrom
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Define schema schema = { \"name\": \"Product List\",
\"baseSelector\": \".product-card\", \"fields\": [ { \"name\":
\"title\", \"selector\": \"h2.title\", \"type\": \"text\" },
{ \"name\": \"price\", \"selector\": \".price\", \"type\":
\"text\", \"transform\": \"strip\" }, { \"name\": \"image\",
\"selector\": \"img\", \"type\": \"attribute\", \"attribute\":
\"src\" } ] } # Create and use strategy strategy =
JsonCssExtractionStrategy(schema) result = await
crawler.arun( url=\"https://example.com/products\",
extraction_strategy=strategy ) \nContent Chunking\nfrom
crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.async_configs import LlmConfig # Create chunking
strategy chunker = OverlappingWindowChunking( window_size=500,
# 500 words per chunk overlap=50 # 50 words overlap ) # Use
with extraction strategy strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"ollama/llama2\"), chunking_strategy=chunker ) result = await
crawler.arun( url=\"https://example.com/long-article\",
extraction_strategy=strategy ) \nBest Practices\n1. Choose the
Right Strategy - Use LLMExtractionStrategy for complex,
unstructured content - Use JsonCssExtractionStrategy for well-
structured HTML - Use CosineStrategy for content similarity
and clustering\n2. Optimize Chunking \n# For long documents
strategy = LLMExtractionStrategy( chunk_token_threshold=2000,
# Smaller chunks overlap_rate=0.1 # 10% overlap ) \n3. Handle
Errors \ntry: result = await crawler.arun( url=
\"https://example.com\", extraction_strategy=strategy ) if
result.success: content = json.loads(result.extracted_content)
except Exception as e: print(f\"Extraction failed: {e}\") \n4.
Monitor Performance \nstrategy = CosineStrategy( verbose=True,
# Enable logging word_count_threshold=20, # Filter short
content top_k=5 # Limit results )",
"markdown": "# Strategies - Crawl4AI Documentation
(v0.5.x)\n\nThis documentation covers the API reference for
extraction and chunking strategies in Crawl4AI.\n\nAll
extraction strategies inherit from the base
`ExtractionStrategy` class and implement two key methods: -
`extract(url: str, html: str) -> List[Dict[str, Any]]` -
`run(url: str, sections: List[str]) -> List[Dict[str, Any]]`\n
\nUsed for extracting structured data using Language Models.\n
\n`LLMExtractionStrategy( # Required Parameters
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g.,
\"ollama/llama2\") api_token: Optional[str] = None, #
API token # Extraction Configuration instruction: str
= None, # Custom extraction instruction
schema: Dict = None, # Pydantic model schema
for structured data extraction_type: str = \"block\",
# \"block\" or \"schema\" # Chunking Parameters
chunk_token_threshold: int = 4000, # Maximum tokens per
chunk overlap_rate: float = 0.1, # Overlap
between chunks word_token_rate: float = 0.75, # Word
to token conversion rate apply_chunking: bool = True,
# Enable/disable chunking # API Configuration
base_url: str = None, # Base URL for API
extra_args: Dict = {}, # Additional provider
arguments verbose: bool = False # Enable
verbose logging )`\n\n### CosineStrategy\n\nUsed for content
similarity-based extraction and clustering.\n
\n`CosineStrategy( # Content Filtering
semantic_filter: str = None, # Topic/keyword filter
word_count_threshold: int = 10, # Minimum words per
cluster sim_threshold: float = 0.3, # Similarity
threshold # Clustering Parameters max_dist: float =
0.2, # Maximum cluster distance
linkage_method: str = 'ward', # Clustering method
top_k: int = 3, # Top clusters to return
# Model Configuration model_name: str = 'sentence-
transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable verbose logging )`
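A hedged sketch of wiring `CosineStrategy` into a crawl, assuming it is passed like any other extraction strategy; the topic filter and URL are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    strategy = CosineStrategy(
        semantic_filter="machine learning",  # placeholder topic filter
        word_count_threshold=10,             # skip very short clusters
        top_k=3,                             # keep the top 3 clusters
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",  # placeholder
            extraction_strategy=strategy,
        )
        print(result.extracted_content)

asyncio.run(main())
```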
\n\nUsed for CSS selector-based structured data extraction.\n
\n`JsonCssExtractionStrategy( schema: Dict[str, Any], #
Extraction schema verbose: bool = False # Enable
verbose logging ) # Schema Structure schema = { \"name\":
str, # Schema name \"baseSelector\": str,
# Base CSS selector \"fields\": [ # List of
fields to extract { \"name\": str, #
Field name \"selector\": str, # CSS selector
\"type\": str, # Field type: \"text\", \"attribute\",
\"html\", \"regex\" \"attribute\": str, # For
type=\"attribute\" \"pattern\": str, # For type=
\"regex\" \"transform\": str, # Optional:
\"lowercase\", \"uppercase\", \"strip\" \"default
\": Any # Default value if extraction
fails } ] }`\n\n## Chunking Strategies\n\nAll
chunking strategies inherit from `ChunkingStrategy` and
implement the `chunk(text: str) -> list` method.\n\n###
RegexChunking\n\nSplits text based on regex patterns.\n
\n`RegexChunking( patterns: List[str] = None # Regex
patterns for splitting #
Default: [r'\\n\\n'] )`\n\n### SlidingWindowChunking\n
\nCreates overlapping chunks with a sliding window approach.\n
\n`SlidingWindowChunking( window_size: int = 100, #
Window size in words step: int = 50 # Step
size between windows )`\n\n### OverlappingWindowChunking\n
\nCreates chunks with specified overlap.\n
\n`OverlappingWindowChunking( window_size: int = 1000, #
Chunk size in words overlap: int = 100 # Overlap
size in words )`\n\n## Usage Examples\n\n`from pydantic import
BaseModel from crawl4ai.extraction_strategy import
LLMExtractionStrategy from crawl4ai.async_configs import
LlmConfig # Define schema class Article(BaseModel):
title: str content: str author: str # Create strategy
strategy = LLMExtractionStrategy( llmConfig =
LlmConfig(provider=\"ollama/llama2\"),
schema=Article.schema(), instruction=\"Extract article
details\" ) # Use with crawler result = await
crawler.arun( url=\"https://example.com/article\",
extraction_strategy=strategy ) # Access extracted data data =
json.loads(result.extracted_content)`\n\n`from
crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Define schema schema = { \"name\": \"Product List\",
\"baseSelector\": \".product-card\", \"fields\":
[ { \"name\": \"title\",
\"selector\": \"h2.title\", \"type\": \"text
\" }, { \"name\": \"price\",
\"selector\": \".price\", \"type\": \"text\",
\"transform\": \"strip\" },
{ \"name\": \"image\", \"selector\":
\"img\", \"type\": \"attribute\",
\"attribute\": \"src\" } ] } # Create and use
strategy strategy = JsonCssExtractionStrategy(schema) result =
await crawler.arun( url=\"https://example.com/products\",
extraction_strategy=strategy )`\n\n### Content Chunking\n
\n`from crawl4ai.chunking_strategy import
OverlappingWindowChunking from crawl4ai.async_configs import
LlmConfig # Create chunking strategy chunker =
OverlappingWindowChunking( window_size=500, # 500 words
per chunk overlap=50 # 50 words overlap ) # Use
with extraction strategy strategy =
LLMExtractionStrategy( llmConfig = LlmConfig(provider=
\"ollama/llama2\"), chunking_strategy=chunker ) result =
await crawler.arun( url=\"https://example.com/long-article
\", extraction_strategy=strategy )`\n\n## Best Practices\n
\n1. **Choose the Right Strategy** - Use
`LLMExtractionStrategy` for complex, unstructured content -
Use `JsonCssExtractionStrategy` for well-structured HTML - Use
`CosineStrategy` for content similarity and clustering\n\n2.
**Optimize Chunking**\n\n`# For long documents strategy =
LLMExtractionStrategy( chunk_token_threshold=2000, #
Smaller chunks overlap_rate=0.1 # 10% overlap )`
\n\n3. **Handle Errors**\n\n`try: result = await
crawler.arun( url=\"https://example.com\",
extraction_strategy=strategy ) if result.success:
content = json.loads(result.extracted_content) except
Exception as e: print(f\"Extraction failed: {e}\")`\n\n4.
**Monitor Performance**\n\n`strategy =
CosineStrategy( verbose=True, # Enable logging
word_count_threshold=20, # Filter short content top_k=5
# Limit results )`",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/blog/releases/0.5.0/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/blog/releases/0.5.0/",
"loadedTime": "2025-03-05T23:17:52.836Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/blog/",
"depth": 2,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/blog/releases/0.5.0/",
"title": "Crawl4AI v0.5.0 Release Notes - Crawl4AI
Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:50 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"67d22bd41766474ad73803fb3ab3e322\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Crawl4AI v0.5.0 Release Notes - Crawl4AI
Documentation (v0.5.x)\nRelease Theme: Power, Flexibility, and
Scalability\nCrawl4AI v0.5.0 is a major release focused on
significantly enhancing the library's power, flexibility, and
scalability. Key improvements include a new deep crawling
system, a memory-adaptive dispatcher for handling large-scale
crawls, multiple crawling strategies (including a fast HTTP-
only crawler), Docker deployment options, and a powerful
command-line interface (CLI). This release also includes
numerous bug fixes, performance optimizations, and
documentation updates.\nImportant Note: This release contains
several breaking changes. Please review the \"Breaking Changes
\" section carefully and update your code accordingly.\nKey
Features\n1. Deep Crawling\nCrawl4AI now supports deep
crawling, allowing you to explore websites beyond the initial
URLs. This is controlled by the deep_crawl_strategy parameter
in CrawlerRunConfig. Several strategies are available:
\nBFSDeepCrawlStrategy (Breadth-First Search): Explores the
website level by level. (Default)\nDFSDeepCrawlStrategy
(Depth-First Search): Explores each branch as deeply as
possible before backtracking.\nBestFirstCrawlingStrategy: Uses
a scoring function to prioritize which URLs to crawl next.
\nimport time from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, BFSDeepCrawlStrategy from
crawl4ai.content_scraping_strategy import
LXMLWebScrapingStrategy from crawl4ai.deep_crawling import
DomainFilter, ContentTypeFilter, FilterChain,
URLPatternFilter, KeywordRelevanceScorer,
BestFirstCrawlingStrategy import asyncio # Create a filter
chain to filter urls based on patterns, domains and content
type filter_chain =
FilterChain( [ DomainFilter( allowed_domains=[\"docs.crawl4ai.
com\"], blocked_domains=[\"old.docs.crawl4ai.com\"], ),
URLPatternFilter(patterns=[\"*core*\", \"*advanced*\"],),
ContentTypeFilter(allowed_types=[\"text/html\"]), ] ) # Create
a keyword scorer that prioritises the pages with certain
keywords first keyword_scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 ) # Set up the
configuration deep_crawl_config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlingStrateg
y( max_depth=2, include_external=False,
filter_chain=filter_chain, url_scorer=keyword_scorer, ),
scraping_strategy=LXMLWebScrapingStrategy(), stream=True,
verbose=True, ) async def main(): async with AsyncWebCrawler()
as crawler: start_time = time.perf_counter() results = []
async for result in await crawler.arun(url=
\"https://docs.crawl4ai.com\", config=deep_crawl_config):
print(f\"Crawled: {result.url} (Depth:
{result.metadata['depth']}), score:
{result.metadata['score']:.2f}\") results.append(result)
duration = time.perf_counter() - start_time print(f\"\\n✅
Crawled {len(results)} high-value pages in {duration:.2f}
seconds\") asyncio.run(main()) \nBreaking Change: The
max_depth parameter is now part of CrawlerRunConfig and
controls the depth of the crawl, not the number of concurrent
crawls. The arun() and arun_many() methods are now decorated
to handle deep crawling strategies. Imports for deep crawling
strategies have changed. See the Deep Crawling documentation
for more details.\n2. Memory-Adaptive Dispatcher\nThe new
MemoryAdaptiveDispatcher dynamically adjusts concurrency based
on available system memory and includes built-in rate
limiting. This prevents out-of-memory errors and avoids
overwhelming target websites.\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher
import asyncio # Configure the dispatcher (optional, defaults
are used if not provided) dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=80.0, #
Pause if memory usage exceeds 80% check_interval=0.5, # Check
memory every 0.5 seconds ) async def batch_mode(): async with
AsyncWebCrawler() as crawler: results = await
crawler.arun_many( urls=[\"https://docs.crawl4ai.com\",
\"https://github.com/unclecode/crawl4ai\"],
config=CrawlerRunConfig(stream=False), # Batch mode
dispatcher=dispatcher, ) for result in results: print(f
\"Crawled: {result.url} with status code:
{result.status_code}\") async def stream_mode(): async with
AsyncWebCrawler() as crawler: # OR, for streaming: async for
result in await
crawler.arun_many( urls=[\"https://docs.crawl4ai.com\",
\"https://github.com/unclecode/crawl4ai\"],
config=CrawlerRunConfig(stream=True),
dispatcher=dispatcher, ): print(f\"Crawled: {result.url} with
status code: {result.status_code}\") print(\"Dispatcher in
batch mode:\") asyncio.run(batch_mode()) print(\"-\" * 50)
print(\"Dispatcher in stream mode:\")
asyncio.run(stream_mode()) \nBreaking Change:
AsyncWebCrawler.arun_many() now uses MemoryAdaptiveDispatcher
by default. Existing code that relied on unbounded concurrency
may require adjustments.\n3. Multiple Crawling Strategies
(Playwright and HTTP)\nCrawl4AI now offers two crawling
strategies:\nAsyncPlaywrightCrawlerStrategy (Default): Uses
Playwright for browser-based crawling, supporting JavaScript
rendering and complex interactions.\nAsyncHTTPCrawlerStrategy:
A lightweight, fast, and memory-efficient HTTP-only crawler.
Ideal for simple scraping tasks where browser rendering is
unnecessary.\nfrom crawl4ai import AsyncWebCrawler,
CrawlerRunConfig, HTTPCrawlerConfig from
crawl4ai.async_crawler_strategy import
AsyncHTTPCrawlerStrategy import asyncio # Use the HTTP crawler
strategy http_crawler_config = HTTPCrawlerConfig( method=\"GET
\", headers={\"User-Agent\": \"MyCustomBot/1.0\"},
follow_redirects=True, verify_ssl=True ) async def main():
async with
AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(brow
ser_config =http_crawler_config)) as crawler: result = await
crawler.arun(\"https://example.com\") print(f\"Status code:
{result.status_code}\") print(f\"Content length:
{len(result.html)}\") asyncio.run(main()) \n4. Docker
Deployment\nCrawl4AI can now be easily deployed as a Docker
container, providing a consistent and isolated environment.
The Docker image includes a FastAPI server with both streaming
and non-streaming endpoints.\n# Build the image (from the
project root) docker build -t crawl4ai . # Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai \nAPI
Endpoints:\n/crawl (POST): Non-streaming crawl.\n/crawl/stream
(POST): Streaming crawl (NDJSON).\n/health (GET): Health
check.\n/schema (GET): Returns configuration schemas.
\n/md/{url} (GET): Returns markdown content of the URL.
\n/llm/{url} (GET): Returns LLM extracted content.\n/token
(POST): Get JWT token\nBreaking Changes:\nDocker deployment
now requires a .llm.env file for API keys.\nDocker deployment
now requires Redis and a new config.yml structure.\nServer
startup now uses supervisord instead of direct process
management.\nDocker server now requires authentication by
default (JWT tokens).\nSee the Docker deployment documentation
for detailed instructions.\n5. Command-Line Interface (CLI)\nA
new CLI (crwl) provides convenient access to Crawl4AI's
functionality from the terminal.\n# Basic crawl crwl
https://example.com # Get markdown output crwl
https://example.com -o markdown # Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml # Use
LLM-based extraction crwl https://example.com -e extract.yml -
s schema.json # Ask a question about the crawled content crwl
https://example.com -q \"What is the main topic?\" # See usage
examples crwl --example \nSee the CLI documentation for more
details.\n6. LXML Scraping Mode\nAdded LXMLWebScrapingStrategy
for faster HTML parsing using the lxml library. This can
significantly improve scraping performance, especially for
large or complex pages. Set
scraping_strategy=LXMLWebScrapingStrategy() in your
CrawlerRunConfig.\nBreaking Change: The ScrapingMode enum has
been replaced with a strategy pattern. Use WebScrapingStrategy
(default) or LXMLWebScrapingStrategy.\n7. Proxy Rotation
\nAdded ProxyRotationStrategy abstract base class with
RoundRobinProxyStrategy concrete implementation.\nimport re
from crawl4ai import ( AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode, RoundRobinProxyStrategy, ) import
asyncio from crawl4ai.configs import ProxyConfig async def
main(): # Load proxies and create rotation strategy proxies =
ProxyConfig.from_env() #eg: export PROXIES=
\"ip1:port1:username1:password1,ip2:port2:username2:password2
\" if not proxies: print(\"No proxies found in environment.
Set PROXIES env variable!\") return proxy_strategy =
RoundRobinProxyStrategy(proxies) # Create configs
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=proxy_strategy ) async with
AsyncWebCrawler(config=browser_config) as crawler: urls =
[\"https://httpbin.org/ip\"] * (len(proxies) * 2) # Test each
proxy twice print(\"\\n📈 Initializing crawler with proxy
rotation...\") async with
AsyncWebCrawler(config=browser_config) as crawler:
print(\"\\n🚀 Starting batch crawl with proxy rotation...\")
results = await crawler.arun_many( urls=urls,
config=run_config ) for result in results: if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\\.){3}[0-9]{1,3}',
result.html) current_proxy = run_config.proxy_config if
run_config.proxy_config else None if current_proxy and
ip_match: print(f\"URL {result.url}\") print(f\"Proxy
{current_proxy.server} -> Response IP: {ip_match.group(0)}\")
verified = ip_match.group(0) == current_proxy.ip if verified:
print(f\"✅ Proxy working! IP matches: {current_proxy.ip}\")
else: print(\"❌ Proxy failed or IP mismatch!\")
print(\"---\") asyncio.run(main()) \nOther Changes and
Improvements\nAdded: LLMContentFilter for intelligent markdown
generation. This new filter uses an LLM to create more focused
and relevant markdown output.\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig import asyncio
llm_config = LlmConfig(provider=\"gemini/gemini-1.5-pro\",
api_token=\"env:GEMINI_API_KEY\") markdown_generator =
DefaultMarkdownGenerator( content_filter=LLMContentFilter(llmC
onfig=llm_config, instruction=\"Extract key concepts and
summaries\") ) config =
CrawlerRunConfig(markdown_generator=markdown_generator) async
def main(): async with AsyncWebCrawler() as crawler: result =
await crawler.arun(\"https://docs.crawl4ai.com\",
config=config) print(result.markdown.fit_markdown)
asyncio.run(main()) \nAdded: URL redirection tracking. The
crawler now automatically follows HTTP redirects (301, 302,
307, 308) and records the final URL in the redirected_url
field of the CrawlResult object. No code changes are required
to enable this; it's automatic.\nAdded: LLM-powered schema
generation utility. A new generate_schema method has been
added to JsonCssExtractionStrategy and
JsonXPathExtractionStrategy. This greatly simplifies creating
extraction schemas.\nfrom crawl4ai.extraction_strategy import
JsonCssExtractionStrategy from crawl4ai.async_configs import
LlmConfig llm_config = LlmConfig(provider=\"gemini/gemini-1.5-
pro\", api_token=\"env:GEMINI_API_KEY\") schema =
JsonCssExtractionStrategy.generate_schema( html=\"<div
class='product'><h2>Product Name</h2><span
class='price'>$99</span></div>\", llmConfig = llm_config,
query=\"Extract product name and price\" ) print(schema)
\nExpected Output (may vary slightly due to LLM) \n{ \"name\":
\"ProductExtractor\", \"baseSelector\": \"div.product\",
\"fields\": [ {\"name\": \"name\", \"selector\": \"h2\",
\"type\": \"text\"}, {\"name\": \"price\", \"selector\":
\".price\", \"type\": \"text\"} ] } \nAdded: robots.txt
compliance support. The crawler can now respect robots.txt
rules. Enable this by setting check_robots_txt=True in
CrawlerRunConfig.\nconfig =
CrawlerRunConfig(check_robots_txt=True) \nAdded: PDF
processing capabilities. Crawl4AI can now extract text,
images, and metadata from PDF files (both local and remote).
This uses a new PDFCrawlerStrategy and
PDFContentScrapingStrategy.\nfrom crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from crawl4ai.processors.pdf
import PDFCrawlerStrategy, PDFContentScrapingStrategy import
asyncio async def main(): async with
AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as
crawler: result = await
crawler.arun( \"https://arxiv.org/pdf/2310.06825.pdf\",
config=CrawlerRunConfig( scraping_strategy=PDFContentScrapingS
trategy() ) ) print(result.markdown) # Access extracted text
print(result.metadata) # Access PDF metadata (title, author,
etc.) asyncio.run(main()) \nAdded: Support for frozenset
serialization. Improves configuration serialization,
especially for sets of allowed/blocked domains. No code
changes required.\nAdded: New LlmConfig parameter. This new
parameter can be passed for extraction, filtering, and schema
generation tasks. It simplifies passing provider strings, API
tokens, and base URLs across all sections where LLM
configuration is necessary. It also enables reuse and allows
for quick experimentation between different LLM
configurations.\nfrom crawl4ai.async_configs import LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig #
Example of using LlmConfig with LLMExtractionStrategy
llm_config = LlmConfig(provider=\"openai/gpt-4o\", api_token=
\"YOUR_API_KEY\") strategy =
LLMExtractionStrategy(llmConfig=llm_config, schema=...) #
Example usage within a crawler async with AsyncWebCrawler() as
crawler: result = await crawler.arun( url=
\"https://example.com\",
config=CrawlerRunConfig(extraction_strategy=strategy) )
\nBreaking Change: Removed old parameters like provider,
api_token, base_url, and api_base from LLMExtractionStrategy
and LLMContentFilter. Users should migrate to using the
LlmConfig object. \nChanged: Improved browser context
management and added shared data support. (Breaking Change:
BrowserContext API updated). Browser contexts are now managed
more efficiently, reducing resource usage. A new shared_data
dictionary is available in the BrowserContext to allow passing
data between different stages of the crawling process.
Breaking Change: The BrowserContext API has changed, and the
old get_context method is deprecated.\nChanged: Renamed
final_url to redirected_url in CrawledURL. This improves
consistency and clarity. Update any code referencing the old
field name.\nChanged: Improved type hints and removed unused
files. This is an internal improvement and should not require
code changes.\nChanged: Reorganized deep crawling
functionality into dedicated module. (Breaking Change: Import
paths for DeepCrawlStrategy and related classes have changed).
This improves code organization. Update imports to use the new
crawl4ai.deep_crawling module.\nChanged: Improved HTML
handling and cleanup codebase. (Breaking Change: Removed
ssl_certificate.json file). This removes an unused file. If
you were relying on this file for custom certificate
validation, you'll need to implement an alternative approach.
\nChanged: Enhanced serialization and config handling.
(Breaking Change: FastFilterChain has been replaced with
FilterChain). This change simplifies config and improves the
serialization.\nAdded: Modified the license to Apache 2.0 with
a required attribution clause. See the LICENSE file for
details. All users must now clearly attribute the Crawl4AI
project when using, distributing, or creating derivative
works.\nFixed: Prevent memory leaks by ensuring proper closure
of Playwright pages. No code changes required.\nFixed: Make
model fields optional with default values (Breaking Change:
Code relying on all fields being present may need adjustment).
Fields in data models (like CrawledURL) are now optional, with
default values (usually None). Update code to handle potential
None values.\nFixed: Adjust memory threshold and fix
dispatcher initialization. This is an internal bug fix; no
code changes are required.\nFixed: Ensure proper exit after
running doctor command. No code changes are required.\nFixed:
JsonCss selector and crawler improvements.\nFixed: Long-page screenshot not working (#403)\nDocumentation: Updated
documentation URLs to the new domain.\nDocumentation: Added
SERP API project example.\nDocumentation: Added clarifying
comments for CSS selector behavior.\nDocumentation: Add Code
of Conduct for the project (#410)\nBreaking Changes Summary
\nDispatcher: The MemoryAdaptiveDispatcher is now the default
for arun_many(), changing concurrency behavior. The return
type of arun_many depends on the stream parameter.\nDeep
Crawling: max_depth is now part of CrawlerRunConfig and
controls crawl depth. Import paths for deep crawling
strategies have changed.\nBrowser Context: The BrowserContext
API has been updated.\nModels: Many fields in data models are
now optional, with default values.\nScraping Mode:
ScrapingMode enum replaced by strategy pattern
(WebScrapingStrategy, LXMLWebScrapingStrategy).\nContent
Filter: Removed content_filter parameter from
CrawlerRunConfig. Use extraction strategies or markdown
generators with filters instead.\nRemoved: Synchronous
WebCrawler, CLI, and docs management functionality.\nDocker:
Significant changes to Docker deployment, including new
requirements and configuration.\nFile Removed: Removed
ssl_certificate.json file which might affect existing
certificate validations\nRenamed: final_url to redirected_url
for consistency\nConfig: FastFilterChain has been replaced
with FilterChain\nDeep-Crawl: DeepCrawlStrategy.arun now
returns Union[CrawlResultT, List[CrawlResultT],
AsyncGenerator[CrawlResultT, None]]\nProxy: Removed
synchronous WebCrawler support and related rate limiting
configurations\nMigration Guide\nUpdate Imports: Adjust
imports for DeepCrawlStrategy, BreadthFirstSearchStrategy, and
related classes due to the new deep_crawling module structure.
\nCrawlerRunConfig: Move max_depth to CrawlerRunConfig. If
using content_filter, migrate to an extraction strategy or a
markdown generator with a filter.\narun_many(): Adapt code to
the new MemoryAdaptiveDispatcher behavior and the return type.
\nBrowserContext: Update code using the BrowserContext API.
\nModels: Handle potential None values for optional fields in
data models.\nScraping: Replace ScrapingMode enum with
WebScrapingStrategy or LXMLWebScrapingStrategy.\nDocker:
Review the updated Docker documentation and adjust your
deployment accordingly.\nCLI: Migrate to the new crwl command
and update any scripts using the old CLI.\nProxy: Removed synchronous WebCrawler support and related rate limiting configurations.\nConfig: Replace FastFilterChain with FilterChain",
"markdown": "# Crawl4AI v0.5.0 Release Notes - Crawl4AI
Documentation (v0.5.x)\n\n**Release Theme: Power, Flexibility,
and Scalability**\n\nCrawl4AI v0.5.0 is a major release
focused on significantly enhancing the library's power,
flexibility, and scalability. Key improvements include a new
**deep crawling** system, a **memory-adaptive dispatcher** for
handling large-scale crawls, **multiple crawling strategies**
(including a fast HTTP-only crawler), **Docker** deployment
options, and a powerful **command-line interface (CLI)**. This
release also includes numerous bug fixes, performance
optimizations, and documentation updates.\n\n**Important Note:
** This release contains several **breaking changes**. Please
review the \"Breaking Changes\" section carefully and update
your code accordingly.\n\n## Key Features\n\n### 1\\. Deep
Crawling\n\nCrawl4AI now supports deep crawling, allowing you
to explore websites beyond the initial URLs. This is
controlled by the `deep_crawl_strategy` parameter in
`CrawlerRunConfig`. Several strategies are available:\n\n*
**`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores
the website level by level. (Default)\n*
**`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each
branch as deeply as possible before backtracking.\n*
**`BestFirstCrawlingStrategy`:** Uses a scoring function to
prioritize which URLs to crawl next.\n\n`import time from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
BFSDeepCrawlStrategy from crawl4ai.content_scraping_strategy
import LXMLWebScrapingStrategy from crawl4ai.deep_crawling
import DomainFilter, ContentTypeFilter, FilterChain,
URLPatternFilter, KeywordRelevanceScorer,
BestFirstCrawlingStrategy import asyncio # Create a filter
chain to filter urls based on patterns, domains and content
type filter_chain =
FilterChain( [ DomainFilter( allowed_d
omains=[\"docs.crawl4ai.com\"],
blocked_domains=[\"old.docs.crawl4ai.com\"], ),
URLPatternFilter(patterns=[\"*core*\", \"*advanced*\"],),
ContentTypeFilter(allowed_types=[\"text/html\"]), ] ) #
Create a keyword scorer that prioritises the pages with
certain keywords first keyword_scorer =
KeywordRelevanceScorer( keywords=[\"crawl\", \"example\",
\"async\", \"configuration\"], weight=0.7 ) # Set up the
configuration deep_crawl_config =
CrawlerRunConfig( deep_crawl_strategy=BestFirstCrawlingStr
ategy( max_depth=2, include_external=False,
filter_chain=filter_chain,
url_scorer=keyword_scorer, ),
scraping_strategy=LXMLWebScrapingStrategy(), stream=True,
verbose=True, ) async def main(): async with
AsyncWebCrawler() as crawler: start_time =
time.perf_counter() results = [] async for
result in await crawler.arun(url=\"https://docs.crawl4ai.com
\", config=deep_crawl_config): print(f\"Crawled:
{result.url} (Depth: {result.metadata['depth']}), score:
{result.metadata['score']:.2f}\")
results.append(result) duration =
time.perf_counter() - start_time print(f\"\\n✅
Crawled {len(results)} high-value pages in {duration:.2f}
seconds\") asyncio.run(main())`\n\n**Breaking Change:** The
`max_depth` parameter is now part of `CrawlerRunConfig` and
controls the _depth_ of the crawl, not the number of
concurrent crawls. The `arun()` and `arun_many()` methods are
now decorated to handle deep crawling strategies. Imports for
deep crawling strategies have changed. See the [Deep Crawling
documentation](https://crawl4ai.com/mkdocs/core/deep-
crawling/) for more details.\n\n### 2\\. Memory-Adaptive
Dispatcher\n\nThe new `MemoryAdaptiveDispatcher` dynamically
adjusts concurrency based on available system memory and
includes built-in rate limiting. This prevents out-of-memory
errors and avoids overwhelming target websites.\n\n`from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
MemoryAdaptiveDispatcher import asyncio # Configure the
dispatcher (optional, defaults are used if not provided)
dispatcher =
MemoryAdaptiveDispatcher( memory_threshold_percent=80.0,
# Pause if memory usage exceeds 80% check_interval=0.5, #
Check memory every 0.5 seconds ) async def batch_mode():
async with AsyncWebCrawler() as crawler: results =
await
crawler.arun_many( urls=[\"https://docs.crawl4ai.c
om\", \"https://github.com/unclecode/crawl4ai\"],
config=CrawlerRunConfig(stream=False), # Batch mode
dispatcher=dispatcher, ) for result in
results: print(f\"Crawled: {result.url} with
status code: {result.status_code}\") async def stream_mode():
async with AsyncWebCrawler() as crawler: # OR, for
streaming: async for result in await
crawler.arun_many( urls=[\"https://docs.crawl4ai.c
om\", \"https://github.com/unclecode/crawl4ai\"],
config=CrawlerRunConfig(stream=True),
dispatcher=dispatcher, ): print(f
\"Crawled: {result.url} with status code:
{result.status_code}\") print(\"Dispatcher in batch mode:\")
asyncio.run(batch_mode()) print(\"-\" * 50) print(\"Dispatcher
in stream mode:\") asyncio.run(stream_mode())`\n\n**Breaking
Change:** `AsyncWebCrawler.arun_many()` now uses
`MemoryAdaptiveDispatcher` by default. Existing code that
relied on unbounded concurrency may require adjustments.\n
\n### 3\\. Multiple Crawling Strategies (Playwright and
HTTP)\n\nCrawl4AI now offers two crawling strategies:\n\n*
**`AsyncPlaywrightCrawlerStrategy` (Default):** Uses
Playwright for browser-based crawling, supporting JavaScript
rendering and complex interactions.\n*
**`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and
memory-efficient HTTP-only crawler. Ideal for simple scraping
tasks where browser rendering is unnecessary.\n\n`from
crawl4ai import AsyncWebCrawler, CrawlerRunConfig,
HTTPCrawlerConfig from crawl4ai.async_crawler_strategy import
AsyncHTTPCrawlerStrategy import asyncio # Use the HTTP
crawler strategy http_crawler_config =
HTTPCrawlerConfig( method=\"GET\",
headers={\"User-Agent\": \"MyCustomBot/1.0\"},
follow_redirects=True, verify_ssl=True ) async def
main(): async with
AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(brow
ser_config =http_crawler_config)) as crawler: result =
await crawler.arun(\"https://example.com\") print(f
\"Status code: {result.status_code}\") print(f
\"Content length: {len(result.html)}\") asyncio.run(main())`
\n\n### 4\\. Docker Deployment\n\nCrawl4AI can now be easily
deployed as a Docker container, providing a consistent and
isolated environment. The Docker image includes a FastAPI
server with both streaming and non-streaming endpoints.\n\n`#
Build the image (from the project root) docker build -t
crawl4ai . # Run the container docker run -d -p 8000:8000 --
name crawl4ai crawl4ai`\n\n**API Endpoints:**\n\n* `/crawl`
(POST): Non-streaming crawl.\n* `/crawl/stream` (POST):
Streaming crawl (NDJSON).\n* `/health` (GET): Health check.
\n* `/schema` (GET): Returns configuration schemas.\n*
`/md/{url}` (GET): Returns markdown content of the URL.\n*
`/llm/{url}` (GET): Returns LLM extracted content.\n*
`/token` (POST): Get JWT token\n\n**Breaking Changes:**\n\n*
Docker deployment now requires a `.llm.env` file for API keys.
\n* Docker deployment now requires Redis and a new
`config.yml` structure.\n* Server startup now uses
`supervisord` instead of direct process management.\n*
Docker server now requires authentication by default (JWT
tokens).\n\nSee the [Docker deployment documentation]
(https://crawl4ai.com/mkdocs/core/docker-deployment/) for
detailed instructions.\n\n### 5\\. Command-Line Interface
(CLI)\n\nA new CLI (`crwl`) provides convenient access to
Crawl4AI's functionality from the terminal.\n\n`# Basic crawl
crwl https://example.com # Get markdown output crwl
https://example.com -o markdown # Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml # Use
LLM-based extraction crwl https://example.com -e extract.yml -
s schema.json # Ask a question about the crawled content crwl
https://example.com -q \"What is the main topic?\" # See
usage examples crwl --example`\n\nSee the [CLI documentation]
(https://crawl4ai.com/mkdocs/blog/releases/docs/md_v2/core/cli
.md) for more details.\n\n### 6\\. LXML Scraping Mode\n\nAdded
`LXMLWebScrapingStrategy` for faster HTML parsing using the
`lxml` library. This can significantly improve scraping
performance, especially for large or complex pages. Set
`scraping_strategy=LXMLWebScrapingStrategy()` in your
`CrawlerRunConfig`.\n\n**Breaking Change:** The `ScrapingMode`
enum has been replaced with a strategy pattern. Use
`WebScrapingStrategy` (default) or `LXMLWebScrapingStrategy`.
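Below is a minimal editorial sketch (not taken from the release notes) of enabling the LXML scraper through `CrawlerRunConfig`; the import path matches the deep-crawling example earlier on this page.

```python
# Minimal sketch: switch the scraping strategy to the faster LXML parser.
# Import path matches the deep-crawling example shown earlier in these notes.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(result.markdown[:300])  # same result fields as the default strategy

asyncio.run(main())
```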
\n\n### 7\\. Proxy Rotation\n\nAdded `ProxyRotationStrategy`
abstract base class with `RoundRobinProxyStrategy` concrete
implementation.\n\n`import re from crawl4ai import
( AsyncWebCrawler, BrowserConfig,
CrawlerRunConfig, CacheMode,
RoundRobinProxyStrategy, ) import asyncio from
crawl4ai.configs import ProxyConfig async def main(): #
Load proxies and create rotation strategy proxies =
ProxyConfig.from_env() #eg: export PROXIES=
\"ip1:port1:username1:password1,ip2:port2:username2:password2
\" if not proxies: print(\"No proxies found in
environment. Set PROXIES env variable!\") return
proxy_strategy = RoundRobinProxyStrategy(proxies) #
Create configs browser_config =
BrowserConfig(headless=True, verbose=False) run_config =
CrawlerRunConfig( cache_mode=CacheMode.BYPASS,
proxy_rotation_strategy=proxy_strategy ) async with
AsyncWebCrawler(config=browser_config) as crawler:
urls = [\"https://httpbin.org/ip\"] * (len(proxies) * 2) #
Test each proxy twice print(\"\\n📈 Initializing
crawler with proxy rotation...\") async with
AsyncWebCrawler(config=browser_config) as crawler:
print(\"\\n🚀 Starting batch crawl with proxy rotation...\")
results = await crawler.arun_many( urls=urls,
config=run_config ) for result in
results: if result.success:
ip_match = re.search(r'(?:[0-9]{1,3}\\.){3}[0-9]{1,3}',
result.html) current_proxy =
run_config.proxy_config if run_config.proxy_config else None
if current_proxy and ip_match: print(f
\"URL {result.url}\") print(f\"Proxy
{current_proxy.server} -> Response IP: {ip_match.group(0)}\")
verified = ip_match.group(0) == current_proxy.ip
if verified: print(f\"✅ Proxy
working! IP matches: {current_proxy.ip}\")
else: print(\"❌ Proxy failed or
IP mismatch!\") print(\"---\")
asyncio.run(main())`\n\n## Other Changes and Improvements\n\n*
**Added: `LLMContentFilter` for intelligent markdown
generation.** This new filter uses an LLM to create more
focused and relevant markdown output.\n\n`from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig import asyncio
llm_config = LlmConfig(provider=\"gemini/gemini-1.5-pro\",
api_token=\"env:GEMINI_API_KEY\") markdown_generator =
DefaultMarkdownGenerator( content_filter=LLMContentFilter(
llmConfig=llm_config, instruction=\"Extract key concepts and
summaries\") ) config =
CrawlerRunConfig(markdown_generator=markdown_generator) async
def main(): async with AsyncWebCrawler() as crawler:
result = await crawler.arun(\"https://docs.crawl4ai.com\",
config=config) print(result.markdown.fit_markdown)
asyncio.run(main())`\n\n* **Added: URL redirection tracking.
** The crawler now automatically follows HTTP redirects (301,
302, 307, 308) and records the final URL in the
`redirected_url` field of the `CrawlResult` object. No code
changes are required to enable this; it's automatic (see the sketch below).
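This is an editorial illustration (not part of the original notes), assuming only the default crawler setup:

```python
# Minimal sketch: redirects are followed automatically, so the only change on
# the caller side is reading the new field on the result object.
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("http://example.com")  # may redirect to https
        print("Requested URL:", result.url)
        print("Final URL:", result.redirected_url)

asyncio.run(main())
```
\n \n*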
**Added: LLM-powered schema generation utility.** A new
`generate_schema` method has been added to
`JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`.
This greatly simplifies creating extraction schemas.\n \n
\n`from crawl4ai.extraction_strategy import
JsonCssExtractionStrategy from crawl4ai.async_configs import
LlmConfig llm_config = LlmConfig(provider=
\"gemini/gemini-1.5-pro\", api_token=\"env:GEMINI_API_KEY\")
schema = JsonCssExtractionStrategy.generate_schema( html=
\"<div class='product'><h2>Product Name</h2><span
class='price'>$99</span></div>\", llmConfig = llm_config,
query=\"Extract product name and price\" ) print(schema)`\n
\nExpected Output (may vary slightly due to LLM)\n
\n`{ \"name\": \"ProductExtractor\", \"baseSelector\":
\"div.product\", \"fields\": [ {\"name\": \"name\",
\"selector\": \"h2\", \"type\": \"text\"}, {\"name\":
\"price\", \"selector\": \".price\", \"type\": \"text
\"} ] }`\n\n* **Added: robots.txt compliance support.**
The crawler can now respect `robots.txt` rules. Enable this by
setting `check_robots_txt=True` in `CrawlerRunConfig`.\n
\n`config = CrawlerRunConfig(check_robots_txt=True)`\n\n*
**Added: PDF processing capabilities.** Crawl4AI can now
extract text, images, and metadata from PDF files (both local
and remote). This uses a new `PDFCrawlerStrategy` and
`PDFContentScrapingStrategy`.\n\n`from crawl4ai import
AsyncWebCrawler, CrawlerRunConfig from crawl4ai.processors.pdf
import PDFCrawlerStrategy, PDFContentScrapingStrategy import
asyncio async def main(): async with
AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as
crawler: result = await
crawler.arun( \"https://arxiv.org/pdf/2310.06825.p
df\",
config=CrawlerRunConfig( scraping_strategy=PDF
ContentScrapingStrategy() ) )
print(result.markdown) # Access extracted text
print(result.metadata) # Access PDF metadata (title, author,
etc.) asyncio.run(main())`\n\n* **Added: Support for
frozenset serialization.** Improves configuration
serialization, especially for sets of allowed/blocked domains.
No code changes required.\n \n* **Added: New `LlmConfig`
parameter.** This new parameter can be passed for extraction,
filtering, and schema generation tasks. It simplifies passing
provider strings, API tokens, and base URLs across all
sections where LLM configuration is necessary. It also enables
reuse and allows for quick experimentation between different
LLM configurations.\n \n\n`from crawl4ai.async_configs
import LlmConfig from crawl4ai.extraction_strategy import
LLMExtractionStrategy from crawl4ai import AsyncWebCrawler,
CrawlerRunConfig # Example of using LlmConfig with
LLMExtractionStrategy llm_config = LlmConfig(provider=
\"openai/gpt-4o\", api_token=\"YOUR_API_KEY\") strategy =
LLMExtractionStrategy(llmConfig=llm_config, schema=...) #
Example usage within a crawler async with AsyncWebCrawler() as
crawler: result = await crawler.arun( url=
\"https://example.com\",
config=CrawlerRunConfig(extraction_strategy=strategy) )`\n
\n**Breaking Change:** Removed old parameters like `provider`,
`api_token`, `base_url`, and `api_base` from
`LLMExtractionStrategy` and `LLMContentFilter`. Users should
migrate to using the `LlmConfig` object.\n\n* **Changed:
Improved browser context management and added shared data
support. (Breaking Change:** `BrowserContext` API updated).
Browser contexts are now managed more efficiently, reducing
resource usage. A new `shared_data` dictionary is available in
the `BrowserContext` to allow passing data between different
stages of the crawling process. **Breaking Change:** The
`BrowserContext` API has changed, and the old `get_context`
method is deprecated.\n \n* **Changed:** Renamed
`final_url` to `redirected_url` in `CrawledURL`. This improves
consistency and clarity. Update any code referencing the old
field name.\n \n* **Changed:** Improved type hints and
removed unused files. This is an internal improvement and
should not require code changes.\n \n* **Changed:**
Reorganized deep crawling functionality into dedicated module.
(**Breaking Change:** Import paths for `DeepCrawlStrategy` and
related classes have changed). This improves code
organization. Update imports to use the new
`crawl4ai.deep_crawling` module.\n \n* **Changed:**
Improved HTML handling and cleanup codebase. (**Breaking
Change:** Removed `ssl_certificate.json` file). This removes
an unused file. If you were relying on this file for custom
certificate validation, you'll need to implement an
alternative approach.\n \n* **Changed:** Enhanced
serialization and config handling. (**Breaking Change:**
`FastFilterChain` has been replaced with `FilterChain`). This
change simplifies config and improves the serialization.\n
\n* **Added:** Modified the license to Apache 2.0 _with a
required attribution clause_. See the `LICENSE` file for
details. All users must now clearly attribute the Crawl4AI
project when using, distributing, or creating derivative
works.\n \n* **Fixed:** Prevent memory leaks by ensuring
proper closure of Playwright pages. No code changes required.
\n \n* **Fixed:** Make model fields optional with default
values (**Breaking Change:** Code relying on all fields being
present may need adjustment). Fields in data models (like
`CrawledURL`) are now optional, with default values (usually
`None`). Update code to handle potential `None` values.\n
\n* **Fixed:** Adjust memory threshold and fix dispatcher
initialization. This is an internal bug fix; no code changes
are required.\n \n* **Fixed:** Ensure proper exit after
running doctor command. No code changes are required.\n \n*
**Fixed:** JsonCss selector and crawler improvements.\n*
**Fixed:** Long-page screenshot not working (#403)\n*
**Documentation:** Updated documentation URLs to the new
domain.\n* **Documentation:** Added SERP API project
example.\n* **Documentation:** Added clarifying comments for
CSS selector behavior.\n* **Documentation:** Add Code of
Conduct for the project (#410)\n\n## Breaking Changes Summary
\n\n* **Dispatcher:** The `MemoryAdaptiveDispatcher` is now
the default for `arun_many()`, changing concurrency behavior.
The return type of `arun_many` depends on the `stream`
parameter.\n* **Deep Crawling:** `max_depth` is now part of
`CrawlerRunConfig` and controls crawl depth. Import paths for
deep crawling strategies have changed.\n* **Browser Context:
** The `BrowserContext` API has been updated.\n* **Models:**
Many fields in data models are now optional, with default
values.\n* **Scraping Mode:** `ScrapingMode` enum replaced
by strategy pattern (`WebScrapingStrategy`,
`LXMLWebScrapingStrategy`).\n* **Content Filter:** Removed
`content_filter` parameter from `CrawlerRunConfig`. Use
extraction strategies or markdown generators with filters
instead.\n* **Removed:** Synchronous `WebCrawler`, CLI, and
docs management functionality.\n* **Docker:** Significant
changes to Docker deployment, including new requirements and
configuration.\n* **File Removed**: Removed ssl
\\_certificate.json file which might affect existing
certificate validations\n* **Renamed**: final\\_url to
redirected\\_url for consistency\n* **Config**:
FastFilterChain has been replaced with FilterChain\n*
**Deep-Crawl**: DeepCrawlStrategy.arun now returns Union
\\[CrawlResultT, List\\[CrawlResultT\\], AsyncGenerator
\\[CrawlResultT, None\\]\\]\n* **Proxy**: Removed
synchronous WebCrawler support and related rate limiting
configurations\n\n## Migration Guide\n\n1. **Update Imports:
** Adjust imports for `DeepCrawlStrategy`,
`BreadthFirstSearchStrategy`, and related classes due to the
new `deep_crawling` module structure.\n2.
**`CrawlerRunConfig`:** Move `max_depth` to
`CrawlerRunConfig`. If using `content_filter`, migrate to an
extraction strategy or a markdown generator with a filter.\n3.
**`arun_many()`:** Adapt code to the new
`MemoryAdaptiveDispatcher` behavior and the return type.\n4.
**`BrowserContext`:** Update code using the `BrowserContext`
API.\n5. **Models:** Handle potential `None` values for
optional fields in data models.\n6. **Scraping:** Replace
`ScrapingMode` enum with `WebScrapingStrategy` or
`LXMLWebScrapingStrategy`.\n7. **Docker:** Review the updated
Docker documentation and adjust your deployment accordingly.
\n8. **CLI:** Migrate to the new `crwl` command and update
any scripts using the old CLI.\n9. **Proxy:** Removed synchronous WebCrawler support and related rate limiting configurations.\n10. **Config:** Replace FastFilterChain with FilterChain",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/blog/releases/0.4.1/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/blog/releases/0.4.1/",
"loadedTime": "2025-03-05T23:18:00.081Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/blog/",
"depth": 2,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/blog/releases/0.4.1/",
"title": "Release Summary for Version 0.4.1 (December 8,
2024): Major Efficiency Boosts with New Features! - Crawl4AI
Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:58 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"d8fa28cdd65af45b418f085358a027b7\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Release Summary for Version 0.4.1 (December 8,
2024): Major Efficiency Boosts with New Features!\nThis post
was generated with the help of ChatGPT, take everything with a
grain of salt. 🧂\nHi everyone,\nI just finished putting
together version 0.4.1 of Crawl4AI, and there are a few
changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s
get into it.\nHandling Lazy Loading Better (Images
Included)\nOne thing that always bugged me with crawlers is
how often they miss lazy-loaded content, especially images. In
this version, I made sure Crawl4AI waits for all images to
load before moving forward. This is useful because many modern
websites only load images when they’re in the viewport or after some JavaScript executes.\nHere’s how to enable it:
\nawait crawler.crawl( url=\"https://example.com\",
wait_for_images=True # Add this argument to ensure images are
fully loaded ) \nWhat this does is: 1. Waits for the page to
reach a \"network idle\" state. 2. Ensures all images on the
page have been completely loaded.\nThis single change handles
the majority of lazy-loading cases you’re likely to
encounter.\nText-Only Mode (Fast, Lightweight
Crawling)\nSometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling
to extract text data, you can enable text-only mode to speed
things up. By disabling images, JavaScript, and other heavy
resources, this mode makes crawling 3-4 times faster in most
cases.\nHere’s how to turn it on:\ncrawler =
AsyncPlaywrightCrawlerStrategy( text_mode=True # Set this to
True to enable text-only crawling ) \nWhen text_mode=True, the
crawler automatically: - Disables GPU processing. - Blocks
image and JavaScript resources. - Reduces the viewport size to
800x600 (you can override this with viewport_width and
viewport_height).\nIf you need to crawl thousands of pages
where you only care about text, this mode will save you a ton
of time and resources.\nAdjusting the Viewport Dynamically
\nAnother useful addition is the ability to dynamically adjust
the viewport size to match the content on the page. This is
particularly helpful when you’re working with responsive
layouts or want to ensure all parts of the page load properly.
\nHere’s how it works: 1. The crawler calculates the page’s width and height after it loads. 2. It adjusts the
viewport to fit the content dimensions. 3. (Optional) It uses
Chrome DevTools Protocol (CDP) to simulate zooming out so
everything fits in the viewport.\nTo enable this, use:\nawait
crawler.crawl( url=\"https://example.com\",
adjust_viewport_to_content=True # Dynamically adjusts the
viewport ) \nThis approach makes sure the entire page gets
loaded into the viewport, especially for layouts that load
content based on visibility.\nSimulating Full-Page Scrolling
\nSome websites load data dynamically as you scroll down the
page. To handle these cases, I added support for full-page
scanning. It simulates scrolling to the bottom of the page,
checking for new content, and capturing it all.\nHere’s an
example:\nawait crawler.crawl( url=\"https://example.com\",
scan_full_page=True, # Enables scrolling scroll_delay=0.2 #
Waits 200ms between scrolls (optional) ) \nWhat happens here:
1. The crawler scrolls down in increments, waiting for content
to load after each scroll. 2. It stops when no new content
appears (i.e., dynamic elements stop loading). 3. It scrolls
back to the top before finishing (if necessary).\nIf you’ve
ever had to deal with infinite scroll pages, this is going to
save you a lot of headaches.\nReusing Browser Sessions (Save
Time on Setup)\nBy default, every time you crawl a page, a new
browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s
more efficient to reuse the same session.\nI added a method
called create_session for this:\nsession_id = await
crawler.create_session() # Use the same session for multiple
crawls await crawler.crawl( url=\"https://example.com/page1\",
session_id=session_id # Reuse the session ) await
crawler.crawl( url=\"https://example.com/page2\",
session_id=session_id ) \nThis avoids creating a new tab for
every page, speeding up the crawl and reducing memory usage.
\nOther Updates\nHere are a few smaller updates I’ve made: -
Light Mode: Use light_mode=True to disable background
processes, extensions, and other unnecessary features, making
the browser more efficient (see the sketch after this list). - Logging: Improved logs to make
debugging easier. - Defaults: Added sensible defaults for
things like delay_before_return_html (now set to 0.1 seconds).
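A small editorial sketch of the Light Mode flag mentioned above (not from the original post; passing it to crawler.crawl is an assumption):

```python
# Illustrative sketch only: light_mode is described in the post, but the exact
# call site is assumed here to mirror the other flags demonstrated above.
await crawler.crawl(
    url="https://example.com",
    light_mode=True  # assumed: trims background work for a leaner browser
)
```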
\nHow to Get the Update\nYou can install or upgrade to version
0.4.1 like this:\npip install crawl4ai --upgrade \nAs always,
I’d love to hear your thoughts. If there’s something you
think could be improved or if you have suggestions for future
versions, let me know!\nEnjoy the new features, and happy
crawling! 🕷️ ",
"markdown": "# Release Summary for Version 0.4.1 (December
8, 2024): Major Efficiency Boosts with New Features!\n\n_This
post was generated with the help of ChatGPT, take everything
with a grain of salt. 🧂_\n\nHi everyone,\n\nI just finished
putting together version 0.4.1 of Crawl4AI, and there are a
few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.\n\n* * *\n\n### Handling Lazy Loading Better
(Images Included)\n\nOne thing that always bugged me with
crawlers is how often they miss lazy-loaded content,
especially images. In this version, I made sure Crawl4AI
**waits for all images to load** before moving forward. This
is useful because many modern websites only load images when
they’re in the viewport or after some JavaScript executes.\n\nHere’s how to enable it:\n\n`await crawler.crawl( url=
\"https://example.com\", wait_for_images=True # Add this
argument to ensure images are fully loaded )`\n\nWhat this
does is: 1. Waits for the page to reach a \"network idle\"
state. 2. Ensures all images on the page have been completely
loaded.\n\nThis single change handles the majority of lazy-
loading cases you’re likely to encounter.\n\n* * *\n\n###
Text-Only Mode (Fast, Lightweight Crawling)\n\nSometimes, you
don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you
can enable **text-only mode** to speed things up. By disabling
images, JavaScript, and other heavy resources, this mode makes
crawling **3-4 times faster** in most cases.\n\nHere’s how
to turn it on:\n\n`crawler =
AsyncPlaywrightCrawlerStrategy( text_mode=True # Set this
to True to enable text-only crawling )`\n\nWhen
`text_mode=True`, the crawler automatically: - Disables GPU
processing. - Blocks image and JavaScript resources. - Reduces
the viewport size to 800x600 (you can override this with
`viewport_width` and `viewport_height`).\n\nIf you need to
crawl thousands of pages where you only care about text, this
mode will save you a ton of time and resources.\n\n* * *\n
\n### Adjusting the Viewport Dynamically\n\nAnother useful
addition is the ability to **dynamically adjust the viewport
size** to match the content on the page. This is particularly
helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.\n\nHere’s how it works: 1. The crawler calculates the page’s width and
height after it loads. 2. It adjusts the viewport to fit the
content dimensions. 3. (Optional) It uses Chrome DevTools
Protocol (CDP) to simulate zooming out so everything fits in
the viewport.\n\nTo enable this, use:\n\n`await
crawler.crawl( url=\"https://example.com\",
adjust_viewport_to_content=True # Dynamically adjusts the
viewport )`\n\nThis approach makes sure the entire page gets
loaded into the viewport, especially for layouts that load
content based on visibility.\n\n* * *\n\n### Simulating Full-
Page Scrolling\n\nSome websites load data dynamically as you
scroll down the page. To handle these cases, I added support
for **full-page scanning**. It simulates scrolling to the
bottom of the page, checking for new content, and capturing it
all.\n\nHere’s an example:\n\n`await crawler.crawl( url=
\"https://example.com\", scan_full_page=True, # Enables
scrolling scroll_delay=0.2 # Waits 200ms between
scrolls (optional) )`\n\nWhat happens here: 1. The crawler
scrolls down in increments, waiting for content to load after
each scroll. 2. It stops when no new content appears (i.e.,
dynamic elements stop loading). 3. It scrolls back to the top
before finishing (if necessary).\n\nIf you’ve ever had to
deal with infinite scroll pages, this is going to save you a
lot of headaches.\n\n* * *\n\n### Reusing Browser Sessions
(Save Time on Setup)\n\nBy default, every time you crawl a
page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.\n\nI added a
method called `create_session` for this:\n\n`session_id =
await crawler.create_session() # Use the same session for
multiple crawls await crawler.crawl( url=
\"https://example.com/page1\", session_id=session_id #
Reuse the session ) await crawler.crawl( url=
\"https://example.com/page2\", session_id=session_id )`\n
\nThis avoids creating a new tab for every page, speeding up
the crawl and reducing memory usage.\n\n* * *\n\n### Other
Updates\n\nHere are a few smaller updates I’ve made: -
**Light Mode**: Use `light_mode=True` to disable background
processes, extensions, and other unnecessary features, making
the browser more efficient. - **Logging**: Improved logs to
make debugging easier. - **Defaults**: Added sensible defaults
for things like `delay_before_return_html` (now set to 0.1
seconds).\n\n* * *\n\n### How to Get the Update\n\nYou can
install or upgrade to version `0.4.1` like this:\n\n`pip
install crawl4ai --upgrade`\n\nAs always, I’d love to hear your thoughts. If there’s something you think could be
improved or if you have suggestions for future versions, let
me know!\n\nEnjoy the new features, and happy crawling! 🕷️ \n\n* * *",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/blog/releases/0.4.2/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/blog/releases/0.4.2/",
"loadedTime": "2025-03-05T23:18:00.539Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/blog/",
"depth": 2,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/blog/releases/0.4.2/",
"title": "0.4.2 - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:58 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"387e15e4bc4b8e65a196410efbca3407\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "0.4.2 - Crawl4AI Documentation (v0.5.x)\n🚀
Crawl4AI 0.4.2 Update: Smarter Crawling Just Got Easier (Dec
12, 2024)\nHey Developers,\nI’m excited to share Crawl4AI 0.4.2—a major upgrade that makes crawling smarter, faster, and a whole lot more intuitive. I’ve packed in a bunch of new features to simplify your workflows and improve your experience. Let’s cut to the chase!\n🔧 Configurable
Browser and Crawler Behavior\nYou’ve asked for better
control over how browsers and crawlers are configured, and now
you’ve got it. With the new BrowserConfig and
CrawlerRunConfig objects, you can set up your browser and
crawling behavior exactly how you want. No more cluttering
arun with a dozen arguments—just pass in your configs and
go.\nExample: \nfrom crawl4ai import BrowserConfig,
CrawlerRunConfig, AsyncWebCrawler browser_config =
BrowserConfig(headless=True, viewport_width=1920,
viewport_height=1080) crawler_config =
CrawlerRunConfig(cache_mode=\"BYPASS\") async with
AsyncWebCrawler(config=browser_config) as crawler: result =
await crawler.arun(url=\"https://example.com\",
config=crawler_config) print(result.markdown[:500]) \nThis
setup is a game-changer for scalability, keeping your code
clean and flexible as we add more parameters in the future.
\nRemember: If you like to use the old way, you can still pass
arguments directly to arun as before, no worries!\nStreamlined Session Management\nHere’s the big one: You can now pass local storage and cookies directly. Whether it’s
setting values programmatically or importing a saved JSON
state, managing sessions has never been easier. This is a
must-have for authenticated crawls—just export your storage
state once and reuse it effortlessly across runs.\nExample: 1.
Open a browser, log in manually, and export the storage state.
2. Import the JSON file for seamless authenticated crawling:
\nresult = await crawler.arun( url=
\"https://example.com/protected\", storage_state=
\"my_storage_state.json\" ) \n🔢 Handling Large Pages:
Supercharged Screenshots and PDF Conversion\nTwo big upgrades
here:\nBlazing-fast long-page screenshots: Turn extremely long
web pages into clean, high-quality screenshots—without breaking a sweat. It’s optimized to handle large content
without lag.\nFull-page PDF exports: Now, you can also convert
any page into a PDF with all the details intact. Perfect for
archiving or sharing complex layouts.\n🔧 Other Cool Stuff
\nAnti-bot enhancements: Magic mode now handles overlays, user
simulation, and anti-detection features like a pro.
\nJavaScript execution: Execute custom JS snippets to handle
dynamic content. No more wrestling with endless page
interactions.\n📊 Performance Boosts and Dev-friendly
Updates\nFaster rendering and viewport adjustments for better
performance.\nImproved cookie and local storage handling for
seamless authentication.\nBetter debugging with detailed logs
and actionable error messages.\nUse Cases You’ll Love
\n1. Authenticated Crawls: Login once, export your storage
state, and reuse it across multiple requests without the
headache. 2. Long-page Screenshots: Perfect for blogs, e-
commerce pages, or any endless-scroll website. 3. PDF Export:
Create professional-looking page PDFs in seconds.\nLet’s Get
Crawling\nCrawl4AI 0.4.2 is ready for you to download and try.
I’m always looking for ways to improve, so don’t hold back—share your thoughts and feedback.\nHappy Crawling! 🚀",
"markdown": "# 0.4.2 - Crawl4AI Documentation (v0.5.x)\n\n##
🚀 Crawl4AI 0.4.2 Update: Smarter Crawling Just Got Easier
(Dec 12, 2024)\n\n### Hey Developers,\n\nI’m excited to share Crawl4AI 0.4.2—a major upgrade that makes crawling smarter, faster, and a whole lot more intuitive. I’ve packed in a bunch of new features to simplify your workflows and improve your experience. Let’s cut to the chase!\n\n* * *\n
\n### 🔧 **Configurable Browser and Crawler Behavior**\n
\nYou’ve asked for better control over how browsers and crawlers are configured, and now you’ve got it. With the new
`BrowserConfig` and `CrawlerRunConfig` objects, you can set up
your browser and crawling behavior exactly how you want. No
more cluttering `arun` with a dozen arguments—just pass in
your configs and go.\n\n**Example:**\n\n`from crawl4ai import
BrowserConfig, CrawlerRunConfig, AsyncWebCrawler
browser_config = BrowserConfig(headless=True, viewport_width=
1920, viewport_height=1080) crawler_config =
CrawlerRunConfig(cache_mode=\"BYPASS\") async with
AsyncWebCrawler(config=browser_config) as crawler: result
= await crawler.arun(url=\"https://example.com\",
config=crawler_config) print(result.markdown[:500])`\n
\nThis setup is a game-changer for scalability, keeping your
code clean and flexible as we add more parameters in the
future.\n\nRemember: If you like to use the old way, you can
still pass arguments directly to `arun` as before, no worries!
\n\n* * *\n\n### **Streamlined Session Management**\n\nHere’s the big one: You can now pass local storage and cookies directly. Whether it’s setting values
programmatically or importing a saved JSON state, managing
sessions has never been easier. This is a must-have for
authenticated crawls—just export your storage state once and
reuse it effortlessly across runs.\n\n**Example:** 1. Open a
browser, log in manually, and export the storage state. 2.
Import the JSON file for seamless authenticated crawling:\n
\n`result = await crawler.arun( url=
\"https://example.com/protected\", storage_state=
\"my_storage_state.json\" )`\n\n* * *\n\n### 🔢 **Handling
Large Pages: Supercharged Screenshots and PDF Conversion**\n
\nTwo big upgrades here:\n\n* **Blazing-fast long-page
screenshots**: Turn extremely long web pages into clean, high-
quality screenshots—without breaking a sweat. It’s
optimized to handle large content without lag.\n \n*
**Full-page PDF exports**: Now, you can also convert any page
into a PDF with all the details intact. Perfect for archiving
or sharing complex layouts; see the sketch below.
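This is an editorial illustration, not from the original post; the screenshot/pdf flags and the result fields are assumed names:

```python
# Illustrative sketch only: parameter and field names below are assumptions.
import asyncio
import base64

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            screenshot=True,  # assumed flag: capture a long-page screenshot
            pdf=True,         # assumed flag: export the rendered page as a PDF
        )
        if getattr(result, "screenshot", None):  # assumed: base64-encoded image
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))
        if getattr(result, "pdf", None):  # assumed: raw PDF bytes
            with open("page.pdf", "wb") as f:
                f.write(result.pdf)

asyncio.run(main())
```
\n \n\n* * *\n\n### 🔧 **Other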
Cool Stuff**\n\n* **Anti-bot enhancements**: Magic mode now
handles overlays, user simulation, and anti-detection features
like a pro.\n* **JavaScript execution**: Execute custom JS
snippets to handle dynamic content. No more wrestling with
endless page interactions; see the sketch below.
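This is an editorial illustration, not from the original post; the `js_code` and `magic` parameter names are assumptions:

```python
# Illustrative sketch only: js_code and magic are assumed parameter names.
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code="window.scrollTo(0, document.body.scrollHeight);",  # assumed kwarg
            magic=True,  # assumed kwarg for the anti-bot "magic mode"
        )
        print(result.markdown[:300])

asyncio.run(main())
```
\n\n* * *\n\n### 📊 **Performance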
Boosts and Dev-friendly Updates**\n\n* Faster rendering and
viewport adjustments for better performance.\n* Improved
cookie and local storage handling for seamless authentication.
\n* Better debugging with detailed logs and actionable error
messages.\n\n* * *\n\n### **Use Cases You’ll Love**\n\n1. **Authenticated Crawls**: Login once, export your storage state, and reuse it across multiple requests without the headache. 2. **Long-page Screenshots**: Perfect for blogs, e-commerce pages, or any endless-scroll website. 3. **PDF Export**: Create professional-looking page PDFs in
seconds.\n\n* * *\n\n### Let’s Get Crawling\n\nCrawl4AI 0.4.2 is ready for you to download and try. I’m always looking for ways to improve, so don’t hold back—share your thoughts and feedback.\n\nHappy Crawling! 🚀",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url":
"https://crawl4ai.com/mkdocs/deploy/docker/README.md",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/deploy/docker/README.md",
"loadedTime": "2025-03-05T23:18:00.637Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/blog/",
"depth": 2,
"httpStatusCode": 404
},
"metadata": {
"canonicalUrl":
"https://crawl4ai.com/mkdocs/deploy/docker/README.md",
"title": "404 - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:58 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"92018cfd47d48a1bd7e35c31ef1330bc\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "404 - Crawl4AI Documentation (v0.5.x)\nThe page you
requested could not be found.",
"markdown": "# 404 - Crawl4AI Documentation (v0.5.x)\n\nThe
page you requested could not be found.\n\n[]
(https://docs.crawl4ai.com/)",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url": "https://crawl4ai.com/mkdocs/blog/releases/0.4.0/",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/blog/releases/0.4.0/",
"loadedTime": "2025-03-05T23:18:01.038Z",
"referrerUrl": "https://crawl4ai.com/mkdocs/blog/",
"depth": 2,
"httpStatusCode": 200
},
"metadata": {
"canonicalUrl":
"https://docs.crawl4ai.com/blog/releases/0.4.0/",
"title": "Release Summary for Version 0.4.0 (December 1,
2024) - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:17:59 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"6b4c7614d2e7e758bfdac94d6e84a3b3\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "Release Summary for Version 0.4.0 (December 1,
2024)\nOverview\nThe 0.4.0 release introduces significant
improvements to content filtering, multi-threaded environment
handling, user-agent generation, and test coverage. Key
highlights include the introduction of the
PruningContentFilter, designed to automatically identify and
extract the most valuable parts of an HTML document, as well
as enhancements to the BM25ContentFilter to extend its
versatility and effectiveness.\nMajor Features and
Enhancements\n1. PruningContentFilter\nIntroduced a new
unsupervised content filtering strategy that scores and prunes
less relevant nodes in an HTML document based on metrics like
text and link density.\nFocuses on retaining the most valuable
parts of the content, making it highly effective for
extracting relevant information from complex web pages.\nFully
documented with updated README and expanded user guides.\n2.
User-Agent Generator\nAdded a user-agent generator utility
that resolves compatibility issues and supports customizable
user-agent strings.\nBy default, the generator randomizes user
agents for each request, adding diversity, but users can
customize it for tailored scenarios.\n3. Enhanced Thread
Safety\nImproved handling of multi-threaded environments by
adding better thread locks for parallel processing, ensuring
consistency and stability when running multiple threads.\n4.
Extended Content Filtering Strategies\nUsers now have access
to both the PruningContentFilter for unsupervised extraction
and the BM25ContentFilter for supervised filtering based on
user queries.\nEnhanced BM25ContentFilter with improved
capabilities to process page titles, meta tags, and
descriptions, allowing for more effective classification and
clustering of text chunks.\n5. Documentation Updates\nUpdated
examples and tutorials to promote the use of the
PruningContentFilter alongside the BM25ContentFilter,
providing clear instructions for selecting the appropriate
filter for each use case.\n6. Unit Test Enhancements\nAdded
unit tests for PruningContentFilter to ensure accuracy and
reliability.\nEnhanced BM25ContentFilter tests to cover
additional edge cases and performance metrics, particularly
for malformed HTML inputs.\nRevised Change Logs for Version
0.4.0\nPruningContentFilter (Dec 01, 2024)\nIntroduced the
PruningContentFilter to optimize content extraction by pruning
less relevant HTML nodes.\nAffected Files:
\ncrawl4ai/content_filter_strategy.py: Added a scoring-based
pruning algorithm.\nREADME.md: Updated to include
PruningContentFilter usage.
\ndocs/md_v2/basic/content_filtering.md: Expanded user
documentation, detailing the use and benefits of
PruningContentFilter.\nUnit Tests for PruningContentFilter
(Dec 01, 2024)\nAdded comprehensive unit tests for
PruningContentFilter to ensure correctness and efficiency.
\nAffected Files:\ntests/async/test_content_filter_prune.py:
Created tests covering different pruning scenarios to ensure
stability and correctness.\nEnhanced BM25ContentFilter Tests
(Dec 01, 2024)\nExpanded tests to cover additional extraction
scenarios and performance metrics, improving robustness.
\nAffected Files:\ntests/async/test_content_filter_bm25.py:
Added tests for edge cases, including malformed HTML inputs.
\nDocumentation and Example Updates (Dec 01, 2024)\nRevised
examples to illustrate the use of PruningContentFilter
alongside existing content filtering methods.\nAffected Files:
\ndocs/examples/quickstart_async.py: Enhanced example clarity
and usability for new users.\nExperimental Features\nThe
PruningContentFilter is still under experimental development,
and we continue to gather feedback for further refinements.
\nConclusion\nThis release significantly enhances the content
extraction capabilities of Crawl4ai with the introduction of
the PruningContentFilter, improved supervised filtering with
BM25ContentFilter, and robust multi-threaded handling.
Additionally, the user-agent generator provides much-needed
versatility, resolving compatibility issues faced by many
users.\nUsers are encouraged to experiment with the new
content filtering methods to determine which best suits their
needs.",
"markdown": "# Release Summary for Version 0.4.0 (December
1, 2024)\n\n## Overview\n\nThe 0.4.0 release introduces
significant improvements to content filtering, multi-threaded
environment handling, user-agent generation, and test
coverage. Key highlights include the introduction of the
PruningContentFilter, designed to automatically identify and
extract the most valuable parts of an HTML document, as well
as enhancements to the BM25ContentFilter to extend its
versatility and effectiveness.\n\n## Major Features and
Enhancements\n\n### 1\\. PruningContentFilter\n\n*
Introduced a new unsupervised content filtering strategy that
scores and prunes less relevant nodes in an HTML document
based on metrics like text and link density.\n* Focuses on
retaining the most valuable parts of the content, making it
highly effective for extracting relevant information from
complex web pages.\n* Fully documented with updated README
and expanded user guides.\n\n### 2\\. User-Agent Generator\n
\n* Added a user-agent generator utility that resolves
compatibility issues and supports customizable user-agent
strings.\n* By default, the generator randomizes user agents
for each request, adding diversity, but users can customize it
for tailored scenarios.\n\n### 3\\. Enhanced Thread Safety\n
\n* Improved handling of multi-threaded environments by
adding better thread locks for parallel processing, ensuring
consistency and stability when running multiple threads.\n
\n### 4\\. Extended Content Filtering Strategies\n\n* Users
now have access to both the PruningContentFilter for
unsupervised extraction and the BM25ContentFilter for
supervised filtering based on user queries.\n* Enhanced
BM25ContentFilter with improved capabilities to process page
titles, meta tags, and descriptions, allowing for more
effective classification and clustering of text chunks; see the sketch below.
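This is an editorial illustration, not part of the 0.4.0 notes; it uses the 0.5.x-style content_filter wiring shown earlier in these docs, and the constructor arguments are assumptions:

```python
# Illustrative sketch only: constructor arguments are assumed; the wiring
# mirrors the content_filter pattern from the 0.5.0 notes above.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter

# Unsupervised: prune low-value nodes by text/link-density scoring.
pruning_md = DefaultMarkdownGenerator(content_filter=PruningContentFilter())

# Supervised: keep chunks relevant to a query (kwarg name assumed).
bm25_md = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="release notes")
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(markdown_generator=pruning_md),
        )
        print(result.markdown.fit_markdown)  # filtered markdown output

asyncio.run(main())
```
\n\n###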
5\\. Documentation Updates\n\n* Updated examples and
tutorials to promote the use of the PruningContentFilter
alongside the BM25ContentFilter, providing clear instructions
for selecting the appropriate filter for each use case.\n\n###
6\\. Unit Test Enhancements\n\n* Added unit tests for
PruningContentFilter to ensure accuracy and reliability.\n*
Enhanced BM25ContentFilter tests to cover additional edge
cases and performance metrics, particularly for malformed HTML
inputs.\n\n## Revised Change Logs for Version 0.4.0\n\n###
PruningContentFilter (Dec 01, 2024)\n\n* Introduced the
PruningContentFilter to optimize content extraction by pruning
less relevant HTML nodes.\n* **Affected Files:**\n *
**crawl4ai/content\\_filter\\_strategy.py**: Added a scoring-
based pruning algorithm.\n * **README.md**: Updated to
include PruningContentFilter usage.\n * **docs/md
\\_v2/basic/content\\_filtering.md**: Expanded user
documentation, detailing the use and benefits of
PruningContentFilter.\n\n### Unit Tests for
PruningContentFilter (Dec 01, 2024)\n\n* Added comprehensive
unit tests for PruningContentFilter to ensure correctness and
efficiency.\n* **Affected Files:**\n *
**tests/async/test\\_content\\_filter\\_prune.py**: Created
tests covering different pruning scenarios to ensure stability
and correctness.\n\n### Enhanced BM25ContentFilter Tests (Dec
01, 2024)\n\n* Expanded tests to cover additional extraction
scenarios and performance metrics, improving robustness.\n*
**Affected Files:**\n * **tests/async/test\\_content
\\_filter\\_bm25.py**: Added tests for edge cases, including
malformed HTML inputs.\n\n### Documentation and Example
Updates (Dec 01, 2024)\n\n* Revised examples to illustrate
the use of PruningContentFilter alongside existing content
filtering methods.\n* **Affected Files:**\n *
**docs/examples/quickstart\\_async.py**: Enhanced example
clarity and usability for new users.\n\n## Experimental
Features\n\n* The PruningContentFilter is still under
experimental development, and we continue to gather feedback
for further refinements.\n\n## Conclusion\n\nThis release
significantly enhances the content extraction capabilities of
Crawl4ai with the introduction of the PruningContentFilter,
improved supervised filtering with BM25ContentFilter, and
robust multi-threaded handling. Additionally, the user-agent
generator provides much-needed versatility, resolving
compatibility issues faced by many users.\n\nUsers are
encouraged to experiment with the new content filtering
methods to determine which best suits their needs.",
"debug": {
"requestHandlerMode": "browser"
}
},
{
"url":
"https://crawl4ai.com/mkdocs/blog/releases/docs/md_v2/core/cli
.md",
"crawl": {
"loadedUrl":
"https://crawl4ai.com/mkdocs/blog/releases/docs/md_v2/core/cli
.md",
"loadedTime": "2025-03-05T23:18:01.440Z",
"referrerUrl":
"https://crawl4ai.com/mkdocs/blog/releases/0.5.0/",
"depth": 3,
"httpStatusCode": 404
},
"metadata": {
"canonicalUrl":
"https://crawl4ai.com/mkdocs/blog/releases/docs/md_v2/core/cli
.md",
"title": "404 - Crawl4AI Documentation (v0.5.x)",
"description": "🚀🤖 Crawl4AI, Open-source LLM-
Friendly Web Crawler & Scraper",
"author": null,
"keywords": null,
"languageCode": "en",
"jsonLd": null,
"headers": {
"server": "nginx/1.24.0 (Ubuntu)",
"date": "Wed, 05 Mar 2025 23:18:00 GMT",
"content-type": "text/html; charset=utf-8",
"transfer-encoding": "chunked",
"connection": "keep-alive",
"last-modified": "Tue, 04 Mar 2025 10:30:17 GMT",
"etag": "W/\"92018cfd47d48a1bd7e35c31ef1330bc\"",
"content-encoding": "gzip"
}
},
"screenshotUrl": null,
"text": "404 - Crawl4AI Documentation (v0.5.x)\nThe page you
requested could not be found.",
"markdown": "# 404 - Crawl4AI Documentation (v0.5.x)\n\nThe
page you requested could not be found.\n\n[]
(https://docs.crawl4ai.com/)",
"debug": {
"requestHandlerMode": "browser"
}
}]