
Algonquin College of Applied Arts and Technology

CST8276 – Advanced Database Topics: Lab Assignment 8

Student Name: _____LE VY PHAM_____________

Student ID: ______041104662____________

Student email: _________pham0187@algonquinlive.com_________

Hand-in:
1. The lab assignment will be graded out of a maximum of 2 points.
2. This template should be used to submit your lab assignment. That means you should
save a copy of these instructions and then add your screenshots and other notes to that
copy. Then submit the updated version to Brightspace. This will thoroughly document
your work and make it much easier to review before the final lab exam.
3. Make sure you have enough screenshots to completely document that you have completed all
the steps.

Introduction
This assignment consists of researching how to use MongoDB with Python and building two small
Python applications that:
1. Collect data from a Web site by scraping the HTML using the Python-specific
‘BeautifulSoup’ add-on;
2. Store the resulting data in a Python object; and
3. Store the data in your MongoDB database.
You must have completed the MongoDB lab from when you were in CST2355 (Lab 4 - Week #9)
before attempting this assignment. (You need to have a working MongoDB server to store the
results.) I have posted that lab alongside this lab, just in case!


Step A – Install Python


1. Install Python. Ideally, the latest stable version (3.12.4 or later).
a. Python downloads for Windows are at:
https://www.python.org/downloads/windows/
Here are the important sub-steps:

(i) Select 3.12.4; then

(ii) Select the Windows Installer (64-bit). Run the installer and install to the default location.
Make sure you include the py launcher and all “easier to use” options. You will get a
confirmation when complete. Here’s mine.


You can find the online documentation for python at: https://docs.python.org/3.12/
a. Python setup and usage documentation is at:
https://docs.python.org/3.12/using/index.html
b. Python cmdline usage documentation is at:
https://docs.python.org/3.12/using/cmdline.html


Step B – Prepare the Python Environment and Create a Python Program

Please note that these steps are based on the excellent tutorial which is posted online at:
https://medium.com/swlh/web-scraping-with-python-using-beautifulsoup-and-mongodb-6f15f6b04d68
Also note that “python3” in the tutorial should be replaced with “python” to match the 3.12.4
download version.

Here are the condensed steps to get the environment ready and build the program (based on the
above-mentioned tutorial).

2. Set up the environment and run the command .\env\Scripts\activate.bat

3. Then a simple test to see if python is working….

This is followed by the cmdline python command (the activate script executes automatically
when invoked).
4. And then install Beautiful Soup:

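For reference, here is a condensed sketch of the commands covered in steps 2–4, assuming a
Windows command prompt and a virtual environment folder named env as in the tutorial (the
folder name is illustrative):

python -m venv env
.\env\Scripts\activate.bat
python --version
pip install requests beautifulsoup4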

5. Create a text file named “index1.py” to contain our initial python code (indentation is
important!)

import requests
from bs4 import BeautifulSoup

# make a request to the site and get it as a string
markup = requests.get('http://quotes.toscrape.com/').text

# pass the string to a BeautifulSoup object
soup = BeautifulSoup(markup, 'html.parser')

# this will hold all the quotes
quotes = []

# now we can select elements
for item in soup.select('.quote'):
    quote = {}
    quote['text'] = item.select_one('.text').get_text()
    quote['author'] = item.select_one('.author').get_text()

    # get the tags element
    tags = item.select_one('.tags')

    # get each tag text from the tags element
    quote['tags'] = [tag.get_text() for tag in tags.select('.tag')]
    quotes.append(quote)

print(quotes)

6. Try the program – run it in your environment and provide the screenshot showing the cmdline
invocation (‘python index1.py’), and the output (the quotes….).
PASTE YOUR SCREENSHOT BELOW:


7. Now, install the ‘pymongo’ add-on: pip3 install pymongo pymongo[srv]

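Before moving on, you can confirm that pymongo can reach your local server with a short
connectivity check. This is a minimal sketch, assuming MongoDB is running locally on the
default port 27017:

import pymongo

# connect to the local MongoDB server and issue a ping
client = pymongo.MongoClient('mongodb://localhost:27017')
client.admin.command('ping')
print('connected to MongoDB version', client.server_info()['version'])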

8. Modify the “index1.py” program to create our final “index.py” program, containing the Python
code that navigates to all the pages of the website to gather all the data. Note: your local
MongoDB service should be at the URL mongodb://localhost:27017

import requests
from bs4 import BeautifulSoup
import pymongo

def scrape_quotes():
    # walk every page of the site, following the 'Next' link until it disappears
    more_links = True
    page = 1
    quotes = []
    while(more_links):
        markup = requests.get(f'http://quotes.toscrape.com/page/{page}').text
        soup = BeautifulSoup(markup, 'html.parser')
        for item in soup.select('.quote'):
            quote = {}
            quote['text'] = item.select_one('.text').get_text()
            quote['author'] = item.select_one('.author').get_text()
            tags = item.select_one('.tags')
            quote['tags'] = [tag.get_text() for tag in tags.select('.tag')]
            quotes.append(quote)
        next_link = soup.select_one('.next > a')
        print(f'scraped page {page}')
        if(next_link):
            page += 1
        else:
            more_links = False
    return quotes

quotes = scrape_quotes()

# connect to the local server; database 'db', collection 'quotes'
client = pymongo.MongoClient('mongodb://localhost:27017')
db = client.db.quotes
try:
    db.insert_many(quotes)
    print(f'inserted {len(quotes)} quotes')
except Exception as e:
    print(f'an error occurred; quotes were not stored to db: {e}')

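As a quick sanity check before opening Compass, you can read the data back from the same
collection. This is a minimal sketch, assuming the database and collection names used above:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client.db.quotes

# count the stored documents and display one of them
print(collection.count_documents({}))
print(collection.find_one())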

9. Try the program – run it in your environment and provide the screenshot showing the cmdline
invocation (‘python index.py’), and the output.
PASTE YOUR SCREENSHOT BELOW:

10. Open MongoDB Compass and provide a screenshot showing (some of) your inserted quotes.
Here’s mine:


PASTE YOUR SCREENSHOT BELOW:


Step C – Using Python to Load a Table of Current Weather Data into MongoDB
11. Now we will insert data from a different web site. To do this, we will need to update our
Python program to gather data from that site.
a. The original site uses the data structure as shown in the “view source” shot below.

b. This structure is searched for as a “.quote” and then the nested data elements are
gathered individually.
c. Now we will try another type of parser in Beautiful Soup – the ‘lxml’.
12. Get ready by installing support for selecting items based on lxml. Run the command “pip
install beautifulsoup4 lxml”

13. We will use the weather data published by the Government of Canada at:
https://www.weather.gc.ca/canada_e.html
which has a table on the webpage that looks like this:


14. You should begin by naming your new python program “weather.py”. We will need to
gather the column names for the table, and the data in each row.
15. There is a fantastic set of videos online showing how to scrape a table. Note: you can
follow the videos to create a Python data item containing the headers of the table, and another
data structure to hold the rows of the table. (In the videos, the pandas add-on is used to save
the table as a CSV file.)
a. Part 1: how to get the headers: https://www.youtube.com/watch?v=T1qv3ksMDq4
b. Part 2: how to get the data: https://www.youtube.com/watch?v=aGCyqj8nPKw

If you follow the videos, you just need to add some code to create a data structure to hold the
data before inserting it into MongoDB, instead of building the dataframe that pandas uses.
Once you have the data structure ready, you store it the same way that our first example
stored the collection of quotes.
Note:
The data structure in MongoDB needs to have the following form:
a) an array of data objects corresponding to each row;
b) the column headers used as the field names within each row object for the
associated data; and
c) an extra field “last_modified”, containing a timestamp corresponding to the time
the script was executed. This requires a bit of care, but PyMongo will do the
conversion automatically when you do the insert_one() or insert_many(). As per the
online docs at: https://pymongo.readthedocs.io/en/stable/examples/datetimes.html

Python uses a dedicated data type, datetime.datetime, to represent dates
and times. MongoDB stores datetime values in coordinated universal time
(UTC), a global time standard.

Always use datetime.datetime.now(tz=datetime.timezone.utc), which
explicitly returns the current time in UTC, instead of
datetime.datetime.now(), with no arguments, which returns
the current local time.
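To tie the pieces of this note together, here is a minimal sketch of what the core of weather.py
could look like. The selectors, the assumption that the weather data is in the first table on the
page, and the database/collection names (db, weather) are illustrative only; adjust them to
match the actual page markup and your own naming.

import datetime
import requests
from bs4 import BeautifulSoup
import pymongo

# fetch the page and parse it with the lxml parser
markup = requests.get('https://www.weather.gc.ca/canada_e.html').text
soup = BeautifulSoup(markup, 'lxml')

# assumption: the weather data is in the first <table> on the page
table = soup.find('table')

# the column headers (first row) become the field names of each document
header_row = table.find('tr')
headers = [th.get_text(strip=True) for th in header_row.find_all('th')]

# one timestamp for the whole run, in UTC as the PyMongo docs recommend
run_time = datetime.datetime.now(tz=datetime.timezone.utc)

rows = []
for tr in table.find_all('tr')[1:]:
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    if len(cells) != len(headers):
        continue  # skip rows that do not line up with the headers
    row = dict(zip(headers, cells))
    row['last_modified'] = run_time
    rows.append(row)

# store the rows the same way the quotes were stored
client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client.db.weather
try:
    collection.insert_many(rows)
    print(f'inserted {len(rows)} rows')
except Exception as e:
    print(f'an error occurred; rows were not stored to db: {e}')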


16. Provide a screenshot showing the source code for your program. Run it in your
environment and provide the screenshot showing the cmdline invocation (‘python
weather.py’), and the output.

PASTE YOUR SCREENSHOT BELOW:


17. Open MongoDB Compass and provide a screenshot showing (some of) your inserted weather
data:
PASTE YOUR SCREENSHOT BELOW:


Make sure you have included all of your screenshots (Steps 6, 9, 10, 16, 17) to document your
work, and submit the file in Brightspace along with your index1.py and weather.py. And you’re
done!
