In this chapter, we're going to work on a project together. We are going to write a simple scraper that finds and saves images from a web page. We'll focus on three parts:
- A simple HTTP webserver in Python
- A script that scrapes a given URL
- A GUI application that scrapes a given URL
So, if you haven't already done so, this would be the perfect time to start a console and position yourself in a folder called ch12 in the root of your project for this book. Within that folder, we'll create two Python modules (scrape.py and guiscrape.py) and a folder (simple_server). Within simple_server, we'll write our HTML page: index.html. Images will be stored in simple_server/img.
The structure in ch12 should look like this:
$ tree -A
.
├── guiscrape.py
├── scrape.py
└── simple_server
    ├── img
    │   ├── owl-alcohol.png
    │   ├── owl-book.png
    │   ├── owl-books.png
    │   ├── owl-ebook.jpg
    │   └── owl-rose.jpeg
    ├── index.html
    └── serve.sh
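If you'd rather create the empty skeleton from Python than by hand, something like the following sketch would do. The function name make_skeleton is hypothetical (it is not part of the book's code), and the owl images still have to be copied into simple_server/img separately.

```python
from pathlib import Path


def make_skeleton(root="ch12"):
    """Create the empty files and folders from the tree listing (no images)."""
    root = Path(root)
    # parents=True also creates root and simple_server on the way down.
    (root / "simple_server" / "img").mkdir(parents=True, exist_ok=True)
    for name in (
        "scrape.py",
        "guiscrape.py",
        "simple_server/index.html",
        "simple_server/serve.sh",
    ):
        # touch() creates the file if missing and leaves existing contents alone.
        (root / name).touch(exist_ok=True)
    return root
```

Calling make_skeleton() from the root of your project creates ch12 with everything except the images. Running it twice is harmless, because existing files and folders are left untouched.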
If you're using either Linux or macOS, you can do what I do and put the code to start the HTTP server in a serve.sh file. On Windows, you'll probably want to use a batch file.
The HTML page we're going to scrape has the following structure:
# simple_server/index.html
<!DOCTYPE html>
<html lang="en">
  <head><title>Cool Owls!</title></head>
  <body>
    <h1>Welcome to my owl gallery</h1>
    <div>
      <img src="img/owl-alcohol.png" height="128" />
      <img src="img/owl-book.png" height="128" />
      <img src="img/owl-books.png" height="128" />
      <img src="img/owl-ebook.jpg" height="128" />
      <img src="img/owl-rose.jpeg" height="128" />
    </div>
    <p>Do you like my owls?</p>
  </body>
</html>
It's an extremely simple page, so let's just note that it has five images: three PNGs and two JPGs. Although the two JPGs share the same format, one ends with .jpg and the other with .jpeg; both are valid extensions for this format.
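To double-check that count, here is a minimal sketch that pulls the src of every img tag out of the markup and tallies the file extensions. It uses only the standard library's html.parser and is meant as an illustration, not as the scraper we build in this chapter. The names ImgSrcCollector and count_extensions are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from pathlib import PurePosixPath


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here by default.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def count_extensions(html):
    """Return a Counter mapping lowercased file extension to image count."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return Counter(PurePosixPath(src).suffix.lower() for src in collector.sources)


# The image block from simple_server/index.html.
page = """
<div>
  <img src="img/owl-alcohol.png" height="128" />
  <img src="img/owl-book.png" height="128" />
  <img src="img/owl-books.png" height="128" />
  <img src="img/owl-ebook.jpg" height="128" />
  <img src="img/owl-rose.jpeg" height="128" />
</div>
"""

print(count_extensions(page))
```

The tally comes out as three .png, one .jpg, and one .jpeg. Code that filters images by type therefore has to treat .jpg and .jpeg as the same format.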
Python gives you a very simple HTTP server for free, as part of its standard library. From the simple_server folder, you can start it with the following command:
$ python -m http.server 8000
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
127.0.0.1 - - [06/May/2018 16:54:30] "GET / HTTP/1.1" 200 -
...
The line beginning with 127.0.0.1 is the log entry you get when you access http://localhost:8000, where our beautiful page will be served. Alternatively, you can put that command in a file called serve.sh and run it like this (make sure the file is executable first, for example with chmod +x serve.sh):
$ ./serve.sh
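If you want something that works unchanged on Linux, macOS, and Windows, a short Python launcher is another option. The following is a minimal sketch, assuming Python 3.7 or later (the directory argument and ThreadingHTTPServer both arrived in that version). The function name make_server and its defaults are hypothetical, not part of the book's code.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory="simple_server", port=8000):
    """Build (but do not start) an HTTP server for the files in directory."""
    # Bind the handler to a folder instead of relying on the current
    # working directory (the directory argument needs Python 3.7+).
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # An empty host string binds to all interfaces.
    return ThreadingHTTPServer(("", port), handler)


# To serve until interrupted with Ctrl+C, run this from the ch12 folder:
#     make_server().serve_forever()
```

Because the folder is passed explicitly, this version can be started from ch12 rather than from inside simple_server.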
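Running serve.sh has the same effect as typing the command yourself. If you have the code for this book, your page should show the "Welcome to my owl gallery" heading, the five owl images, and the "Do you like my owls?" line beneath them.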
Feel free to use any other set of images, as long as you include at least one PNG and one JPG and use relative paths, not absolute ones, in the src attribute. I got these lovely owls from https://openclipart.org/.
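A relative src can only be fetched once it has been combined with the address of the page it came from. The sketch below shows that resolution with the standard library's urllib.parse.urljoin, assuming the page is served at http://localhost:8000/ as above. The helper name absolute_image_url is hypothetical, and this is not necessarily how the chapter's scraper builds its URLs.

```python
from urllib.parse import urljoin

# Assumption: the page is served by the local server started earlier.
BASE_URL = "http://localhost:8000/"


def absolute_image_url(src, base_url=BASE_URL):
    """Resolve an <img> src value against the URL of the page it came from."""
    return urljoin(base_url, src)


print(absolute_image_url("img/owl-book.png"))
# http://localhost:8000/img/owl-book.png
```

If src already holds a full URL, urljoin returns it unchanged, so the base URL plays no part in the result.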