Debugging Spiders

This document explains the most common techniques for debugging spiders. Consider the following Scrapy spider:

import scrapy
from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        "http://example.com/page1",
        "http://example.com/page2",
    )

    def parse(self, response):
        # <processing code not shown>
        # collect `item_urls`
        for item_url in item_urls:
            yield scrapy.Request(item_url, self.parse_item)

    def parse_item(self, response):
        # <processing code not shown>
        item = MyItem()
        # populate `item` fields
        # and extract item_details_url
        yield scrapy.Request(
            item_details_url, self.parse_details, cb_kwargs={"item": item}
        )

    def parse_details(self, response, item):
        # populate more `item` fields
        return item

Basically this is a simple spider which parses two pages of items (the start_urls). Items also
have a details page with additional information, so we use the cb_kwargs functionality of
Request to pass a partially populated item.

Parse Command

The most basic way of checking the output of your spider is to use the parse command. It allows you to check the behaviour of different parts of the spider at the method level. It has the advantage of being flexible and simple to use, but does not allow debugging code inside a method.
In order to see the item scraped from a specific url:

$ scrapy parse --spider=myspider -c parse_item -d 2 <item_url>


[ ... scrapy log lines crawling example.com spider ... ]

>>> STATUS DEPTH LEVEL 2 <<<


# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]

# Requests -----------------------------------------------------------------
[]

Using the --verbose or -v option we can see the status at each depth level:

$ scrapy parse --spider=myspider -c parse_item -d 2 -v <item_url>


[ ... scrapy log lines crawling example.com spider ... ]

>>> DEPTH LEVEL: 1 <<<


# Scraped Items ------------------------------------------------------------
[]

# Requests -----------------------------------------------------------------
[<GET item_details_url>]

>>> DEPTH LEVEL: 2 <<<


# Scraped Items ------------------------------------------------------------
[{'url': <item_url>}]

# Requests -----------------------------------------------------------------
[]

Checking items scraped from a single start_url can also be easily achieved using:

$ scrapy parse --spider=myspider -d 3 'http://example.com/page1'
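Since parse_details expects an item passed through cb_kwargs, you may also want to feed callback keyword arguments from the command line when testing that callback in isolation. A minimal sketch, assuming the --cbkwargs option of the parse command (which takes a JSON dict) and a placeholder details URL:

$ scrapy parse --spider=myspider -c parse_details --cbkwargs='{"item": {"name": "test"}}' <item_details_url>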

Scrapy Shell

While the parse command is very useful for checking the behaviour of a spider, it is of little help for checking what happens inside a callback, besides showing the response received and the output. How do you debug the situation when parse_details sometimes receives no item?

Fortunately, the shell is your bread and butter in this case (see Invoking the shell from
spiders to inspect responses):

from scrapy.shell import inspect_response

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        inspect_response(response, self)

See also: Invoking the shell from spiders to inspect responses.
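When inspect_response() is reached, a Scrapy shell opens in the middle of the crawl with the current response bound to response; once you exit the shell, the crawl resumes. A minimal sketch of such a session (the URL and CSS selector are illustrative assumptions, not part of the example project):

>>> response.url
'http://example.com/item-details/42'
>>> response.css("h1::text").get()  # hypothetical selector: check why the expected data is missing
>>> # hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawl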


Open in browser

Sometimes you just want to see how a certain response looks in a browser; you can use the open_in_browser() function for that:

scrapy.utils.response.open_in_browser(response: TextResponse, _openfunc: Callable[[str], Any] = <function open>) → Any

Open response in a local web browser, adjusting the base tag for external links to work,
e.g. so that images and styles are displayed.

For example:

from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    if "item name" not in response.text:  # response.text is the decoded body (response.body is bytes)
        open_in_browser(response)
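The Scrapy shell offers a related shortcut: inside scrapy shell, view(response) opens the fetched response in your browser in the same way, so you can eyeball a page without writing any spider code. A quick sketch (the URL is just the example project's page):

$ scrapy shell 'http://example.com/page1'
>>> view(response)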

Logging

Logging is another useful option for getting information about your spider run. Although not
as convenient, it comes with the advantage that the logs will be available in all future runs
should they be necessary again:

def parse_details(self, response, item=None):
    if item:
        # populate more `item` fields
        return item
    else:
        self.logger.warning("No item received for %s", response.url)
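To keep such messages around for later runs, you can route the log to a file through Scrapy's standard logging settings. A minimal sketch using LOG_FILE and LOG_LEVEL (the file name is just an example):

# settings.py (or the spider's custom_settings)
LOG_FILE = "myspider.log"  # write the log to a file instead of stderr
LOG_LEVEL = "WARNING"      # record only warnings and above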

For more information, check the Logging section.

Visual Studio Code

To debug spiders with Visual Studio Code, you can use the following launch.json:

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

Also, make sure you enable “User Uncaught Exceptions” to catch exceptions in your Scrapy spider.
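If you would rather debug a spider by name from the project root, instead of whichever file is currently open, a variant configuration could look like the sketch below; the spider name myspider and the workspace layout are assumptions based on the example above:

{
    "name": "Python: Scrapy crawl myspider",
    "type": "python",
    "request": "launch",
    "module": "scrapy",
    "args": ["crawl", "myspider"],
    "cwd": "${workspaceFolder}",
    "console": "integratedTerminal"
}

This entry goes inside the same "configurations" array as the one shown earlier.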
