
Financial Data Science (FIN42110)

Dr. Richard McGee

Web Scraping

Introduction
Definitions

Web Scraping
Using tools to gather data you can see on a webpage. A wide range of web scraping techniques and tools exist. These can be as simple as copy/paste and increase in complexity to automation tools, HTML parsing, APIs and programming.
Definitions

HTTP
HyperText Transfer Protocol

• HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser.

• The protocol defines aspects of authentication, requests, status codes, persistent connections, client/server request/response, etc.
Definitions

HTML
HyperText Markup Language

• HyperText Markup Language (HTML) is the set of markup symbols or codes inserted into a file intended for display on the Internet. The markup tells web browsers how to display a web page’s words and images.

• Each individual piece of markup code (which falls between "<" and ">" characters) is referred to as an element, though many people also refer to it as a tag.
Definitions

XML
Extensible Markup Language

• Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.

• It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is about encoding data; HTML is about display.
XML Example

• We will check out the Books.xml example file on Brightspace.
• With a new data set it can be useful to use an XML viewer to view the hierarchy:
• https://www.xmlgrid.net/
XML parsing

from lxml.etree import fromstring

with open('Books.xml', 'r') as file:
    xml = file.read()

root = fromstring(xml)

# print the title of each book in the catalog
for book in root.xpath("/catalog/book"):
    print(book.xpath("title")[0].text)
Definitions

JSON
JavaScript Object Notation

• JSON is a lightweight computer data interchange format. It is a text-based, human-readable format for representing simple data structures and associative arrays (called objects) in serialization, and serves as an alternative to XML.
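As a quick sketch of the format, Python's built-in json module converts between JSON text and Python objects; the ticker and values below are made up for illustration:

```python
import json

# JSON text describing a quote for one ticker (made-up values)
text = '{"ticker": "NFLX", "prices": [410.5, 412.0], "active": true}'

quote = json.loads(text)            # parse JSON text -> dict/list/str/float/bool
print(quote["ticker"])              # JSON strings become Python str
print(quote["prices"][1])           # JSON arrays become Python lists
print(json.dumps(quote, indent=2))  # serialize back to (pretty-printed) JSON text
```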

Definitions

API
Application Programming Interface

• An application programming interface (API) is a connection between computers or between computer programs. It is a type of software interface, offering a service to other pieces of software.
Definitions

SOAP
Simple Object Access Protocol

• SOAP is an XML-based messaging protocol commonly used to implement an API.
Definitions

• Parsing
• The act of analyzing strings and symbols to extract only the data you need.
• Crawling
• Moving across or through a website to gather data from more than one URL or page.
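A minimal crawling sketch: the "website" here is a made-up dict of pages standing in for HTTP fetches, so the example is self-contained, and the regex pulls out the href values (the parsing step):

```python
import re
from collections import deque

# A toy "website": page name -> HTML (stands in for requests.get(url).text)
SITE = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a>',
}

def crawl(start):
    """Breadth-first crawl: follow links, visiting each page exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # "Parsing": extract just the href values from the raw HTML
        for link in re.findall(r'href="([^"]+)"', SITE[page]):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/a', '/b']
```

The `seen` set is what keeps a real crawler from revisiting pages (or looping forever on circular links).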

HTML Structure: div

<html>
<head>
<style>
.myDiv {
border: 5px outset red;
background-color: lightblue;
text-align: center;
}
</style>
</head>
<body>

<div class="myDiv">
<h2>This is a heading in a div element</h2>
<p>This is some text in a div element.</p>
</div>

</body>
</html>

• <div> defines a division or section, used as a container for other HTML elements.
• https://www.w3schools.com/Tags/tag_div.asp

HTML Structure: table/tr/td

<table>
<tr>
<td>Cell A</td>
<td>Cell B</td>
</tr>
<tr>
<td>Cell C</td>
<td>Cell D</td>
</tr>
</table>

• A table consists of one <table> element and one or more <tr>, <th>, and <td> elements
• https://www.w3schools.com/Tags/tag_table.asp
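As a sketch of how such a table is parsed in Python (assuming beautifulsoup4 is installed, per the package list later in these slides):

```python
from bs4 import BeautifulSoup

# The same table structure as the slide above
html = """
<table>
  <tr><td>Cell A</td><td>Cell B</td></tr>
  <tr><td>Cell C</td><td>Cell D</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list per <tr>, holding the text of each <td> in that row
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
print(rows)  # [['Cell A', 'Cell B'], ['Cell C', 'Cell D']]
```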

Robots.txt

• Instructs web robots (typically search engine robots) how to crawl pages on the website.
• Example: https://www.buzzfeed.com/robots.txt
• Accessing a site at too high a frequency will get you blocked!
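A sketch of checking these rules with the standard-library urllib.robotparser; the rules below are a made-up robots.txt, parsed from local lines rather than fetched over HTTP:

```python
from urllib import robotparser

# A small, made-up robots.txt
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.crawl_delay("*"))  # 10 - seconds to wait between requests
```

In a real crawler you would call rp.set_url(".../robots.txt") and rp.read() to fetch the live file, then respect the crawl delay between requests.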

Useful Python Packages

• pip install beautifulsoup4
• pip install requests
• pip install html5lib
• pip install yfinance
• pip install mplfinance
• pip install twython
• pip install selenium
• install the Chrome browser
• and a ChromeDriver matching the browser version
Crypto Punks
Example Project: Crypto Punk Pricing

https://www.larvalabs.com/cryptopunks

Step 1: Specify what you are looking for. In this case:

• a database of 10,000 crypto punks
• their key features
• their trade history and prices
• looking to explain prices with features.
Example Project: Crypto Punk Pricing

Examine the web page source:

Example Project: Crypto Punk Pricing

Step 2: Design your database structure

• for this project I will use a simple SQLite DB
• https://sqlitebrowser.org/dl/
• I will create two tables: a punk attribute table and a trade table.
Example Project: Crypto Punk Pricing

Step 3: Examine the web site structure (view page source in browser)

• CryptoPunks is nicely structured with one page per punk, numbered 1-10,000
• e.g. punk 1 is at https://www.larvalabs.com/cryptopunks/details/1
Example Project: Crypto Punk Pricing

• Example: Print trade dates and amounts for one punk.

import requests
from bs4 import BeautifulSoup

# Crypto Punk
#~~~~~~~~~~~~
BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
PunkNo = '1'
page = requests.get(BaseStr + PunkNo)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', attrs={'class': 'table'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if cols:
        cols = [ele.text.strip() for ele in cols]
        print(cols[4] + ' : ' + cols[3])
Yahoo Finance API
Web Scraping from Yahoo Finance

import yfinance as yf
import mplfinance as mpf
import numpy as np

ticker_name = 'NFLX'
yticker = yf.Ticker(ticker_name)
nflx = yticker.history(period="1y") # max, 1y, 3mo
....
# Compute log returns
nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))

https://pypi.org/project/yfinance/
https://pypi.org/project/mplfinance/

MPL output example

More Scraping
Web Scraping Example: House of Representatives

https://www.house.gov/representatives
Web Scraping Example

def main():
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href']
for a in soup('a')
if a.has_attr('href')]
print(len(all_urls))

Example from Data Science from Scratch, Joel Grus
Web Scraping Example

import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))
good_urls = list(set(good_urls))

Example from Data Science from Scratch, Joel Grus.
For regex see, e.g.: https://www.w3schools.com/python/python_regex.asp
Web Scraping Example

from bs4 import BeautifulSoup
import requests

def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

Example from Data Science from Scratch, Joel Grus
Web Scraping Example

import random
from typing import Dict, Set

good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        url = f"{house_url}/{pr_link}"
        text = requests.get(url).text

        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break  # done with this house_url

Example from Data Science from Scratch, Joel Grus
