FDS Web Scraping
Web Scraping
Introduction
Definitions
Web Scraping
Using tools to gather data you can see on a webpage.
A wide range of web scraping techniques and tools exist. They run from
simple copy/paste, through automation tools, HTML parsing and APIs,
up to custom programming.
Definitions
HTTP
HyperText Transfer Protocol
Definitions
HTML
HyperText Markup Language
Definitions
XML
Extensible Markup Language
XML Example
XML parsing
root = fromstring(xml)
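The one-line call above can be fleshed out as follows. This is a minimal sketch that assumes `fromstring` comes from the standard library's `xml.etree.ElementTree`. The `<catalog>` document and the `books` variable are hypothetical stand-ins, because the slide's own XML example did not survive.

```python
from xml.etree.ElementTree import fromstring

# Hypothetical sample document (the original slide's XML is not available).
xml = """<catalog>
  <book id="b1"><title>Web Scraping Basics</title><price>29.99</price></book>
  <book id="b2"><title>Parsing XML</title><price>19.50</price></book>
</catalog>"""

root = fromstring(xml)  # parse the string into a tree and return its root element

# Walk the <book> children, reading one attribute and two child texts from each.
books = [
    (book.get("id"), book.findtext("title"), float(book.findtext("price")))
    for book in root.findall("book")
]

print(root.tag)
for book_id, title, price in books:
    print(book_id, title, price)
```

`findall` and `findtext` only look at direct children here, which is enough for a flat document like this one.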
Definitions
JSON
JavaScript Object Notation
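As a small illustration, Python's standard `json` module converts between JSON text and Python objects. The `payload` string below is a hypothetical example of what a web API might return, not data from a real service.

```python
import json

# Hypothetical JSON text of the kind a web API might return.
payload = '{"ticker": "NFLX", "prices": [410.5, 415.25], "active": true}'

record = json.loads(payload)  # JSON text -> Python dict
prices = record["prices"]     # JSON arrays become Python lists

# Python dict -> JSON text again (sorted keys make the output stable).
round_trip = json.dumps(record, sort_keys=True)

print(record["ticker"], prices, record["active"])
print(round_trip)
```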
Definitions
API
Application Programming Interface
• An application programming
interface (API) is a connection
between computers or between
computer programs. It is a type of
software interface, offering a service
to other pieces of software.
Definitions
SOAP
Simple Object Access Protocol
Definitions
• Parsing: analyzing strings and symbols to reveal only the data
you need.
• Crawling: moving across or through a website to gather data
from more than one URL or page.
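The two ideas can be sketched together using only the standard library: a parser that pulls links out of HTML, and a crawler that follows them. The `SITE` dictionary is a hypothetical in-memory stand-in for a website so the example runs without network access. `LinkParser`, `extract_links` and `crawl` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Parsing step: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical pages keyed by URL (stands in for real HTTP requests).
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">home</a>',
}


def crawl(start_url):
    """Crawling step: breadth-first visit of every reachable page, once each."""
    seen, queue = [], [start_url]
    while queue:
        url = queue.pop(0)
        if url in seen or url not in SITE:
            continue
        seen.append(url)
        queue.extend(extract_links(SITE[url], url))
    return seen


visited = crawl("https://example.com/")
print(visited)
```

A real crawler would replace the `SITE` lookup with an HTTP request and would normally add rate limiting and a robots.txt check.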
HTML Structure: div
<html>
<head>
  <style>
    .myDiv {
      border: 5px outset red;
      background-color: lightblue;
      text-align: center;
    }
  </style>
</head>
<body>
  <div class="myDiv">
    <h2>This is a heading in a div element</h2>
    <p>This is some text in a div element.</p>
  </div>
</body>
</html>
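A scraper typically uses the `class` attribute to pick out one div. The sketch below does this with the standard library's `html.parser`. `DivTextParser` is an illustrative name, and the extra paragraph outside the div is added only to show that it is skipped.

```python
from html.parser import HTMLParser

HTML = """<html><body>
<div class="myDiv">
  <h2>This is a heading in a div element</h2>
  <p>This is some text in a div element.</p>
</div>
<p>Outside the div.</p>
</body></html>"""


class DivTextParser(HTMLParser):
    """Collect the text found inside <div> elements with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0   # > 0 while we are inside the target div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.texts.append(data.strip())


parser = DivTextParser("myDiv")
parser.feed(HTML)
div_texts = parser.texts
print(div_texts)
```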
HTML Structure: table/tr/td
<table>
  <tr>
    <td>Cell A</td>
    <td>Cell B</td>
  </tr>
  <tr>
    <td>Cell C</td>
    <td>Cell D</td>
  </tr>
</table>
• An HTML table consists of one <table> element and one or more <tr>, <th>, and <td> elements
• https://www.w3schools.com/Tags/tag_table.asp
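Because rows and cells nest so regularly, a table maps naturally onto a list of lists. The following is a minimal sketch using the standard library's `html.parser`. `TableParser` is an illustrative name, and it assumes a single, non-nested table.

```python
from html.parser import HTMLParser

TABLE_HTML = """<table>
<tr><td>Cell A</td><td>Cell B</td></tr>
<tr><td>Cell C</td><td>Cell D</td></tr>
</table>"""


class TableParser(HTMLParser):
    """Turn <tr>/<td>/<th> markup into a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(TABLE_HTML)
rows = parser.rows
print(rows)
```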
Robots.txt
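A site's robots.txt file tells crawlers which paths they may fetch, and Python's standard `urllib.robotparser` can evaluate those rules. The rules, the `my-scraper` user agent and the example.com URLs below are hypothetical. The file is parsed from a string so the sketch runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (normally served at <site>/robots.txt).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse rules from lines instead of downloading

# Ask whether a given user agent may fetch a given URL.
can_public = rp.can_fetch("my-scraper", "https://example.com/public/page.html")
can_private = rp.can_fetch("my-scraper", "https://example.com/private/data.html")

print(can_public, can_private)
```

Against a live site, `rp.set_url(...)` followed by `rp.read()` downloads the file instead of parsing a string.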
Useful Python Packages
Crypto Punks
Example Project: Crypto Punk Pricing
https://www.larvalabs.com/cryptopunks
Step 1: Specify what you are looking for. In this case:
Example Project: Crypto Punk Pricing
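import requests
from bs4 import BeautifulSoup

# Crypto Punk
# ~~~~~~~~~~~
BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
PunkNo = '1'

# Download the detail page for one punk and parse its HTML.
page = requests.get(BaseStr + PunkNo)
soup = BeautifulSoup(page.content, 'html.parser')

# Find the first table with class "table" and walk its rows.
table = soup.find('table', attrs={'class': 'table'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if len(cols) >= 5:  # skip header rows (no <td>) and rows too short to index
        cols = [ele.text.strip() for ele in cols]
        # Print the fifth and fourth columns of the row.
        print(cols[4] + ' : ' + cols[3])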
Yahoo Finance API
Web Scraping from Yahoo Finance
import yfinance as yf
import mplfinance as mpf
import numpy as np
ticker_name = 'NFLX'
yticker = yf.Ticker(ticker_name)
nflx = yticker.history(period="1y") # max, 1y, 3mo
....
# Compute log returns
nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))
https://pypi.org/project/yfinance/
https://pypi.org/project/mplfinance/
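The log-return line in the code above computes r_t = ln(P_t / P_{t-1}) for each day. The same calculation is sketched below in plain Python so it runs without pandas or a network connection. The `closes` values are hypothetical prices, not real NFLX data.

```python
import math

# Hypothetical closing prices, oldest first.
closes = [100.0, 102.0, 101.0, 105.0]

# The first day has no previous close, mirroring the NaN that shift(1) produces.
log_returns = [None] + [
    math.log(today / yesterday)
    for yesterday, today in zip(closes, closes[1:])
]

# Log returns are additive: their sum equals the log of the overall price ratio.
total = sum(r for r in log_returns if r is not None)

print(log_returns)
print(total, math.log(closes[-1] / closes[0]))
```

This additivity is one common reason for using log returns rather than simple percentage returns.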
MPL output example
More Scraping
Web Scraping Example: House of Representatives
https://www.house.gov/representatives
Web Scraping Example
def main():
    from bs4 import BeautifulSoup
    import requests

    url = "https://www.house.gov/representatives"
    text = requests.get(url).text
    soup = BeautifulSoup(text, "html5lib")

    # Collect the href of every <a> tag that has one.
    all_urls = [a['href']
                for a in soup('a')
                if a.has_attr('href')]
    print(len(all_urls))
Web Scraping Example
import re
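The slide shows only the `re` import, so the filtering step itself is not available. One plausible use is sketched below: keeping only links that look like a representative's own site. The pattern, the sample `all_urls` list and the resulting `good_urls` are assumptions for illustration, not the original code.

```python
import re

# Hypothetical sample of hrefs like those collected from the listing page.
all_urls = [
    "https://smith.house.gov",
    "https://smith.house.gov/",
    "http://example.house.gov",
    "https://www.house.gov/representatives#state",
    "/about",
    "https://twitter.com/housegov",
]

# Assumed rule: must start with http:// or https://
# and must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Keep matching URLs; the set removes duplicates, sorted() gives a stable order.
good_urls = sorted({url for url in all_urls if re.match(regex, url)})

print(good_urls)
```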
Web Scraping Example
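import random
from typing import Dict, Set

# Work with a small random sample of the sites.
good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")

# Inside a loop over the pages of each house_url (loop not shown):
if paragraph_mentions(text, 'data'):
    print(f"{house_url}")
    break  # done with this house_url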
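The helper `paragraph_mentions` is called above but its definition is not shown. The following is a hypothetical sketch of what such a helper could look like. It is written with the standard library's `html.parser` so that it runs without third-party packages, whereas the slides themselves parse HTML with BeautifulSoup.

```python
from html.parser import HTMLParser


class _ParagraphCollector(HTMLParser):
    """Gather the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def paragraph_mentions(text: str, keyword: str) -> bool:
    """Return True if any <p> in the HTML mentions the keyword (case-insensitive)."""
    collector = _ParagraphCollector()
    collector.feed(text)
    return any(keyword.lower() in p.lower() for p in collector.paragraphs)


# Only text inside <p> counts, so a keyword in a heading is ignored.
print(paragraph_mentions("<h1>Data</h1><p>Weekly news</p>", "data"))
print(paragraph_mentions("<p>The Open Data Act</p>", "data"))
```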