
Financial Data Science (FIN42110)

Dr. Richard McGee

Web Scraping

Introduction
Definitions

Web Scraping
Using tools to gather data you can see on a webpage. A wide range of web scraping techniques and tools exist. These can be as simple as copy/paste and increase in complexity to automation tools, HTML parsing, APIs and programming.
Definitions

HTTP
HyperText Transfer Protocol

• HTTP is the foundation of data communication for the World Wide Web, where hypertext documents include hyperlinks to other resources that the user can easily access, for example by a mouse click or by tapping the screen in a web browser.

• The protocol defines aspects of authentication, requests, status codes, persistent connections, client/server request/response, etc.
Definitions

HTML
HyperText Markup Language

• HyperText Markup Language (HTML) is the set of markup symbols or codes inserted into a file intended for display on the Internet. The markup tells web browsers how to display a web page’s words and images.

• Each individual piece of markup code (which falls between "<" and ">" characters) is referred to as an element, though many people also refer to it as a tag.
Definitions

XML
Extensible Markup Language

• Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data.

• It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is about encoding data; HTML is about display.
XML Example

• We will check out the Books.xml example file on Brightspace.
• With a new data set it can be useful to use an XML viewer to view the hierarchy:
• https://www.xmlgrid.net/
XML parsing

from lxml.etree import fromstring

with open('Books.xml', 'r') as file:
    xml = file.read()

root = fromstring(xml)

# print the title of each book in the catalog
for book in root.xpath("/catalog/book"):
    print(book.xpath("title")[0].text)
Definitions

JSON
JavaScript Object Notation

• JSON is a lightweight computer data interchange format. It is a text-based, human-readable format for representing simple data structures and associative arrays (called objects) in serialization, and serves as an alternative to XML.
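As a quick sketch of the format, Python's built-in json module converts between JSON text and Python objects; the ticker and values below are made up for illustration:

```python
import json

# JSON text describing a quote for one ticker (made-up values)
text = '{"ticker": "NFLX", "prices": [410.5, 412.0], "active": true}'

quote = json.loads(text)            # parse JSON text -> dict/list/str/float/bool
print(quote["ticker"])              # JSON strings become Python str
print(quote["prices"][1])           # JSON arrays become Python lists
print(json.dumps(quote, indent=2))  # serialize back to (pretty-printed) JSON text
```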

Definitions

API
Application Programming Interface

• An application programming interface (API) is a connection between computers or between computer programs. It is a type of software interface, offering a service to other pieces of software.
Definitions

SOAP
Simple Object Access Protocol

• SOAP is an XML-based messaging protocol commonly used to implement an API.
Definitions

• Parsing
• The act of analyzing strings and symbols to extract only the data you need.
• Crawling
• Moving across or through a website to gather data from more than one URL or page.
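A minimal crawling sketch: the "website" here is a made-up dict of pages standing in for HTTP fetches, so the example is self-contained, and the regex pulls out the href values (the parsing step):

```python
import re
from collections import deque

# A toy "website": page name -> HTML (stands in for requests.get(url).text)
SITE = {
    "/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a>',
}

def crawl(start):
    """Breadth-first crawl: follow links, visiting each page exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        # "Parsing": extract just the href values from the raw HTML
        for link in re.findall(r'href="([^"]+)"', SITE[page]):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("/"))  # ['/', '/a', '/b']
```

The `seen` set is what keeps a real crawler from revisiting pages (or looping forever on circular links).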

HTML Structure: div

<html>
<head>
<style>
.myDiv {
border: 5px outset red;
background-color: lightblue;
text-align: center;
}
</style>
</head>
<body>

<div class="myDiv">
<h2>This is a heading in a div element</h2>
<p>This is some text in a div element.</p>
</div>

</body>
</html>

• <div> defines a division or section, used as a container for other HTML elements.
• https://www.w3schools.com/Tags/tag_div.asp

HTML Structure: table/tr/td

<table>
<tr>
<td>Cell A</td>
<td>Cell B</td>
</tr>
<tr>
<td>Cell C</td>
<td>Cell D</td>
</tr>
</table>

• A table consists of one <table> element and one or more <tr>, <th>, and <td> elements
• https://www.w3schools.com/Tags/tag_table.asp
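As a sketch of how such a table is parsed in Python (assuming beautifulsoup4 is installed, per the package list later in these slides):

```python
from bs4 import BeautifulSoup

# The same table structure as the slide above
html = """
<table>
  <tr><td>Cell A</td><td>Cell B</td></tr>
  <tr><td>Cell C</td><td>Cell D</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list per <tr>, holding the text of each <td> in that row
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
print(rows)  # [['Cell A', 'Cell B'], ['Cell C', 'Cell D']]
```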

Robots.txt

• Instructs web robots (typically search engine robots) how to crawl pages on the website.
• Example: https://www.buzzfeed.com/robots.txt
• Accessing a site at too high a frequency will get you blocked!
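A sketch of checking these rules with the standard-library urllib.robotparser; the rules below are a made-up robots.txt, parsed from local lines rather than fetched over HTTP:

```python
from urllib import robotparser

# A small, made-up robots.txt
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.crawl_delay("*"))  # 10 - seconds to wait between requests
```

In a real crawler you would call rp.set_url(".../robots.txt") and rp.read() to fetch the live file, then respect the crawl delay between requests.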

Useful Python Packages

• pip install beautifulsoup4
• pip install requests
• pip install html5lib
• pip install yfinance
• pip install mplfinance
• pip install twython
• pip install selenium
• install the Chrome browser
• and a ChromeDriver matching the browser version
Crypto Punks
Example Project: Crypto Punk Pricing

https://www.larvalabs.com/cryptopunks

Step 1: Specify what you are looking for. In this case:

• a database of 10,000 crypto punks
• their key features
• their trade history and prices
• looking to explain prices with features.
Example Project: Crypto Punk Pricing

Examine the web page source:

Example Project: Crypto Punk Pricing

Step 2: Design your database structure

• for this project I will use a simple SQLite DB
• https://sqlitebrowser.org/dl/
• I will create two tables: a punk attribute table and a trade table.
Example Project: Crypto Punk Pricing

Step 3: Examine the web site structure (view page source in browser)

• CryptoPunks is nicely structured with one page per punk, numbered 1-10,000
• e.g. punk 1 is at https://www.larvalabs.com/cryptopunks/details/1
Example Project: Crypto Punk Pricing

• Example: Print trade dates and amounts for one punk.

import requests
from bs4 import BeautifulSoup

# Crypto Punk
#~~~~~~~~~~~~
BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
PunkNo = '1'
page = requests.get(BaseStr + PunkNo)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', attrs={'class': 'table'})
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    if cols:
        cols = [ele.text.strip() for ele in cols]
        print(cols[4] + ' : ' + cols[3])
Yahoo Finance API
Web Scraping from Yahoo Finance

import yfinance as yf
import mplfinance as mpf
import numpy as np

ticker_name = 'NFLX'
yticker = yf.Ticker(ticker_name)
nflx = yticker.history(period="1y") # max, 1y, 3mo
....
# Compute log returns
nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))

https://pypi.org/project/yfinance/
https://pypi.org/project/mplfinance/

MPL output example

More Scraping
Web Scraping Example: House of Representatives

https://www.house.gov/representatives
Web Scraping Example

def main():
from bs4 import BeautifulSoup
import requests

url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href']
for a in soup('a')
if a.has_attr('href')]
print(len(all_urls))

Example from Data Science from Scratch, Joel Grus
Web Scraping Example

import re

# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))
good_urls = list(set(good_urls))

Example from Data Science from Scratch, Joel Grus.
For regex see, e.g.: https://www.w3schools.com/python/python_regex.asp
Web Scraping Example

from bs4 import BeautifulSoup
import requests

def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]
    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

Example from Data Science from Scratch, Joel Grus
Web Scraping Example

import random
from typing import Dict, Set

good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")

press_releases: Dict[str, Set[str]] = {}

for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links

for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        url = f"{house_url}/{pr_link}"
        text = requests.get(url).text

        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break  # done with this house_url

Example from Data Science from Scratch, Joel Grus
