Chapter 1

This document discusses various techniques for importing data from the web in Python, including downloading files, making HTTP requests, and scraping web data. It covers using the urllib and requests packages to download files and make GET requests. BeautifulSoup is introduced as a tool for parsing HTML and extracting structured data. Examples are provided to demonstrate downloading a file from a URL, making GET requests with urllib and requests, exploring the BeautifulSoup object model, and extracting links and text from an HTML document.


Importing flat files from the web

Intermediate Importing Data in Python

Hugo Bowne-Anderson
Data Scientist at DataCamp
You’re already great at importing!
Flat files such as .txt and .csv

Pickled files, Excel spreadsheets, and many others!

Data from relational databases

You can do all these locally

What if your data is online?



Can you import web data?

You can: go to a URL and click to download files

BUT: not reproducible, not scalable



You’ll learn how to…
Import and locally save datasets from the web

Load datasets into pandas DataFrames

Make HTTP requests (GET requests)

Scrape web data such as HTML

Parse HTML into useful data (BeautifulSoup)

Use the urllib and requests packages
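One of the goals above, loading a dataset into a pandas DataFrame, can be sketched offline with an in-memory CSV (the column names and values here are made up, not a real web dataset):

```python
import io

import pandas as pd

# In-memory text stands in for a flat file fetched from the web,
# so this sketch runs without a network connection.
csv_text = "a;b;c\n1;2;3\n4;5;6\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)  # (2, 3)
```

The same `pd.read_csv()` call accepts a URL or a local path in place of the `StringIO` buffer.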



The urllib package
Provides interface for fetching data across the web

urlopen() - accepts URLs instead of file names



How to automate file download in Python
from urllib.request import urlretrieve
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
urlretrieve(url, 'winequality-white.csv')

('winequality-white.csv', <http.client.HTTPMessage at 0x103cf1128>)
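Once downloaded, the file is an ordinary local flat file. As a sketch, the standard library's csv module can read its semicolon-delimited format (the two rows below are tiny made-up values in the same style, not the real wine data):

```python
import csv
import io

# A tiny made-up sample in the same ';'-delimited style as
# winequality-white.csv (values are illustrative, not real data).
sample = '"fixed acidity";"volatile acidity";"quality"\n7.0;0.27;6\n6.3;0.30;6\n'

# DictReader consumes the header row and keys each row by column name
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)
print(len(rows))           # 2
print(rows[0]["quality"])  # 6
```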



Let's practice!

HTTP requests to import files from the web

Hugo Bowne-Anderson
Data Scientist at DataCamp
URL
Uniform/Universal Resource Locator

References to web resources

Focus: web addresses

Ingredients:
Protocol identifier - http:

Resource name - datacamp.com

These specify web addresses uniquely



HTTP
HyperText Transfer Protocol

Foundation of data communication for the web

HTTPS - more secure form of HTTP

Going to a website = sending HTTP request


GET request

urlretrieve() performs a GET request

HTML - HyperText Markup Language
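The Request object from urllib records which HTTP method will be used; a minimal sketch (nothing is actually sent over the network):

```python
from urllib.request import Request

# Build a Request without sending it; by default it is a GET request.
request = Request("https://www.wikipedia.org/")
print(request.get_method())  # GET
```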



GET requests using urllib
from urllib.request import urlopen, Request
url = "https://www.wikipedia.org/"
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
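Note that response.read() returns bytes, so real code usually decodes them to a string; a with-statement also closes the response automatically. A sketch, using a data: URL in place of a live web address so it runs offline:

```python
from urllib.request import urlopen

# A data: URL stands in for a real web address, so this runs offline.
url = "data:text/html;charset=utf-8,<title>Hello</title>"
with urlopen(url) as response:              # closed automatically on exit
    html = response.read().decode("utf-8")  # bytes -> str
print(html)  # <title>Hello</title>
```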



GET requests using requests

Used by “Her Majesty's Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed”



GET requests using requests
One of the most downloaded Python packages

import requests
url = "https://www.wikipedia.org/"
r = requests.get(url)
text = r.text



Let's practice!

Scraping the web in Python

Hugo Bowne-Anderson
Data Scientist at DataCamp
HTML
Mix of unstructured and structured data

Structured data:
Has a pre-defined data model, or

Organized in a defined manner

Unstructured data: neither of these properties



BeautifulSoup
Parse and extract structured data from HTML

Make tag soup beautiful and extract information



BeautifulSoup
from bs4 import BeautifulSoup
import requests
url = 'https://www.crummy.com/software/BeautifulSoup/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')



Prettified Soup
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">


<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>
Beautiful Soup: We called him Tortoise because he taught us.
</title>
<link href="mailto:leonardr@segfault.org" rev="made"/>
<link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
<meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
<meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
<meta content="Leonard Richardson" name="author"/>
</head>
<body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
<img align="right" src="10.1.jpg" width="250"/>
<br/>
<p>



Exploring BeautifulSoup
Many methods and attributes, such as:

print(soup.title)

<title>Beautiful Soup: We called him Tortoise because he taught us.</title>

print(soup.get_text())

Beautiful Soup: We called him Tortoise because he taught us.


You didn't write that awful page. You're just trying to
get some data out of it. Beautiful Soup is here to
help. Since 2004, it's been saving programmers hours or
days of work on quick-turnaround screen scraping
projects.



Exploring BeautifulSoup
find_all()

for link in soup.find_all('a'):
    print(link.get('href'))

bs4/download/
#Download
bs4/doc/
#HallOfFame
https://code.launchpad.net/beautifulsoup
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
http://www.candlemarkandgleam.com/shop/constellation-games/
http://constellation.crummy.com/Constellation%20Games%20excerpt.html
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
http://lxml.de/
http://code.google.com/p/html5lib/
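The same pattern works on any HTML. A self-contained sketch on a small inline document (the page content below is invented for illustration), assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

# A small invented page so the sketch runs without a network call.
html_doc = """
<html><head><title>Tiny page</title></head>
<body><p><a href="/docs">docs</a> and <a href="https://example.com">home</a></p></body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)  # Tiny page
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['/docs', 'https://example.com']
```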



Let's practice!