Chapter 1

This document discusses various techniques for importing data from the web in Python, including downloading files, making HTTP requests, and scraping web data. It covers using the urllib and requests packages to download files and make GET requests. BeautifulSoup is introduced as a tool for parsing HTML and extracting structured data. Examples are provided to demonstrate downloading a file from a URL, making GET requests with urllib and requests, exploring the BeautifulSoup object model, and extracting links and text from an HTML document.


Importing flat files from the web

Intermediate Importing Data in Python

Hugo Bowne-Anderson
Data Scientist at DataCamp
You’re already great at importing!
Flat files such as .txt and .csv

Pickled files, Excel spreadsheets, and many others!

Data from relational databases

You can do all these locally

What if your data is online?



Can you import web data?

You can: go to a URL and click to download files

BUT: not reproducible, not scalable



You’ll learn how to…
Import and locally save datasets from the web

Load datasets into pandas DataFrames

Make HTTP requests (GET requests)

Scrape web data such as HTML

Parse HTML into useful data (BeautifulSoup)

Use the urllib and requests packages
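One of the goals above, loading a dataset into a pandas DataFrame, can be sketched offline with an in-memory CSV (the column names and values here are made up, not a real web dataset):

```python
import io

import pandas as pd

# In-memory text stands in for a flat file fetched from the web,
# so this sketch runs without a network connection.
csv_text = "a;b;c\n1;2;3\n4;5;6\n"
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df.shape)  # (2, 3)
```

The same `pd.read_csv()` call accepts a URL or a local path in place of the `StringIO` buffer.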



The urllib package
Provides interface for fetching data across the web

urlopen() - accepts URLs instead of file names



How to automate file download in Python
from urllib.request import urlretrieve
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
urlretrieve(url, 'winequality-white.csv')

('winequality-white.csv', <http.client.HTTPMessage at 0x103cf1128>)
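Once downloaded, the file is an ordinary local flat file. As a sketch, the standard library's csv module can read its semicolon-delimited format (the two rows below are tiny made-up values in the same style, not the real wine data):

```python
import csv
import io

# A tiny made-up sample in the same ';'-delimited style as
# winequality-white.csv (values are illustrative, not real data).
sample = '"fixed acidity";"volatile acidity";"quality"\n7.0;0.27;6\n6.3;0.30;6\n'

# DictReader consumes the header row and keys each row by column name
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)
print(len(rows))           # 2
print(rows[0]["quality"])  # 6
```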



Let's practice!

HTTP requests to import files from the web

Hugo Bowne-Anderson
Data Scientist at DataCamp
URL
Uniform/Universal Resource Locator

References to web resources

Focus: web addresses

Ingredients:
Protocol identifier - http:

Resource name - datacamp.com

These specify web addresses uniquely



HTTP
HyperText Transfer Protocol

Foundation of data communication for the web

HTTPS - more secure form of HTTP

Going to a website = sending HTTP request


GET request

urlretrieve() performs a GET request

HTML - HyperText Markup Language
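The Request object from urllib records which HTTP method will be used; a minimal sketch (nothing is actually sent over the network):

```python
from urllib.request import Request

# Build a Request without sending it; by default it is a GET request.
request = Request("https://www.wikipedia.org/")
print(request.get_method())  # GET
```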



GET requests using urllib
from urllib.request import urlopen, Request
url = "https://www.wikipedia.org/"
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
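Note that response.read() returns bytes, so real code usually decodes them to a string; a with-statement also closes the response automatically. A sketch, using a data: URL in place of a live web address so it runs offline:

```python
from urllib.request import urlopen

# A data: URL stands in for a real web address, so this runs offline.
url = "data:text/html;charset=utf-8,<title>Hello</title>"
with urlopen(url) as response:              # closed automatically on exit
    html = response.read().decode("utf-8")  # bytes -> str
print(html)  # <title>Hello</title>
```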



GET requests using requests

Used by “Her Majesty's Government, Amazon, Google, Twilio, NPR, Obama for America, Twitter, Sony, and Federal U.S. Institutions that prefer to be unnamed”



GET requests using requests
One of the most downloaded Python packages

import requests
url = "https://www.wikipedia.org/"
r = requests.get(url)
text = r.text



Let's practice!

Scraping the web in Python

Hugo Bowne-Anderson
Data Scientist at DataCamp
HTML
Mix of unstructured and structured data

Structured data:
Has a pre-defined data model, or

Organized in a defined manner

Unstructured data: neither of these properties



BeautifulSoup
Parse and extract structured data from HTML

Make tag soup beautiful and extract information



BeautifulSoup
from bs4 import BeautifulSoup
import requests
url = 'https://www.crummy.com/software/BeautifulSoup/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')



Prettified Soup
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">


<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>
Beautiful Soup: We called him Tortoise because he taught us.
</title>
<link href="mailto:leonardr@segfault.org" rev="made"/>
<link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
<meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
<meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
<meta content="Leonard Richardson" name="author"/>
</head>
<body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
<img align="right" src="10.1.jpg" width="250"/>
<br/>
<p>



Exploring BeautifulSoup
Many methods and attributes, such as:

print(soup.title)

<title>Beautiful Soup: We called him Tortoise because he taught us.</title>

print(soup.get_text())

Beautiful Soup: We called him Tortoise because he taught us.


You didn't write that awful page. You're just trying to
get some data out of it. Beautiful Soup is here to
help. Since 2004, it's been saving programmers hours or
days of work on quick-turnaround screen scraping
projects.



Exploring BeautifulSoup
find_all()

for link in soup.find_all('a'):
    print(link.get('href'))

bs4/download/
#Download
bs4/doc/
#HallOfFame
https://code.launchpad.net/beautifulsoup
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
http://www.candlemarkandgleam.com/shop/constellation-games/
http://constellation.crummy.com/Constellation%20Games%20excerpt.html
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
http://lxml.de/
http://code.google.com/p/html5lib/
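The same pattern works on any HTML. A self-contained sketch on a small inline document (the page content below is invented for illustration), assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

# A small invented page so the sketch runs without a network call.
html_doc = """
<html><head><title>Tiny page</title></head>
<body><p><a href="/docs">docs</a> and <a href="https://example.com">home</a></p></body>
</html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)  # Tiny page
hrefs = [link.get("href") for link in soup.find_all("a")]
print(hrefs)  # ['/docs', 'https://example.com']
```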



Let's practice!