Workshop on Digital Data Analysis: methods of web scraping
(July 8th IBEI)
Jorge Luis Salcedo M
jorgelsalcedo@gmail.com ; jsalcedoma@uoc.edu
Outline
1. What is web scraping or web harvesting?
2. What are the potential uses of this data for interest group research?
3. Which free and low-cost instruments can we use?
4. Dissecting a newspaper web page with OutWit Pro
1.1. Free functions
1.1.1. Obtaining hyperlinks
1.1.2. Image extraction
1.1.3. Creating your own tables of data
1.1.4. Exporting your data in different formats
1.2. Pro version
1.2.1. Word frequency
1.2.2. Creating your personalized scrapers
1.2.3. Exploring multiple web pages
1.2.4. Macro automation
5. Some final considerations
Further readings
1. What is web scraping or web harvesting?
- You can always copy and paste, but it is time-consuming and prone to errors.
- The goal is to gather, in an automated fashion, freely available data in virtually any kind of online format.
- Web scraping is the process of extracting web information automatically and transforming it into a structured dataset.
- Scraping describes the method of extracting data hidden in documents, such as web pages or PDFs, and making it usable for further processing. It is among the most useful skills if you set out to investigate data, and most of the time it is not especially challenging. For the simplest ways of scraping you don’t even need to know how to write code.
2. What are the potential uses of this data for interest group research?
A growing amount of data is available on the Web:
Election results, budget allocations, legislative speeches
Social media data, newspaper articles
Some sources (web pages) that we are going to use
http://rss.cnn.com/rss/edition.rss
http://rss.elmundo.es/rss/
http://ep00.epimg.net/rss/elpais/portada.xml
http://www.cis.es/cis/export/sites/default/-Archivos/Marginales/2980_2999/2981/Cru298100SEXO.html
http://www.bcn.cat/estadistica/castella/dades/tpob/llars/a2010/persones/person01.htm
http://lobbyplag.eu/map
3. Which free and low-cost instruments can we use?
Google Chrome and Mozilla Firefox
Spreadsheet formulas
Feeds
HTML tables
Some apps
https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop
https://addons.mozilla.org/es/firefox/addon/dafizilla-table2clipboard/
Scraper (Google Chrome extension)
https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd
OutWit Pro
http://www.outwit.com/
Spreadsheet formulas
Go to http://drive.google.com, log in, and create a new spreadsheet.
Import feeds
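For readers who prefer a script to a spreadsheet, the feed import can be sketched in Python. This is only a minimal illustration, assuming a standard RSS 2.0 layout (items with a title and a link); the sample feed text and the function name parse_feed are invented for the example and are not part of the workshop material.

```python
import xml.etree.ElementTree as ET

# Tiny made-up RSS 2.0 document standing in for a real feed
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example news</title>
    <item><title>First headline</title><link>http://example.com/1</link></item>
    <item><title>Second headline</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return one (title, link) row per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        rows.append((title, link))
    return rows


rows = parse_feed(SAMPLE_FEED)
for title, link in rows:
    print(title, "|", link)
```

To try it on one of the feeds listed earlier, the feed text could be downloaded first (for instance with urllib.request.urlopen) and passed to parse_feed; real feeds may deviate from the layout assumed here.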
Import tables from web pages
Why don’t we use Excel? (Import data from the web)
http://www.cis.es/cis/export/sites/default/-Archivos/Marginales/2980_2999/2981/Cru298100SEXO.html
The last number in the importHTML formula indicates the position of the table in the document; just try different values until you find the matching one.
Other syntaxes
=importHTML("http://www.cis.es/cis/export/sites/default/-Archivos/Marginales/2980_2999/2981/Cru298100SEXO.html", "table", 8)
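What the formula does can be sketched in plain Python: collect every table on the page and return the one at the requested position. This is a rough illustration, not how the spreadsheet implements it; it assumes tables are not nested, and the names TableCollector, import_html_table and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # row being filled, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html_text, index):
    """Return the table at a 1-based position, like the formula's last argument."""
    collector = TableCollector()
    collector.feed(html_text)
    return collector.tables[index - 1]


# Made-up page: the first table is layout only, the second holds the data
SAMPLE_PAGE = """
<table><tr><td>layout only</td></tr></table>
<table>
  <tr><th>Sex</th><th>Count</th></tr>
  <tr><td>Men</td><td>10</td></tr>
  <tr><td>Women</td><td>12</td></tr>
</table>
"""

table = import_html_table(SAMPLE_PAGE, 2)
print(table)
```

The sample shows why the index has to be found by trial: pages often use extra tables for layout, so the data table is rarely the first one.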
Other apps to import tables and lists
https://addons.mozilla.org/es/firefox/addon/dafizilla-table2clipboard/
https://chrome.google.com/webstore/detail/table-capture/iebpjdmgckacbodjpijphcplhebcmeop
Some considerations
- Always take care with the punctuation in formulas (commas, semicolons, periods) and with spaces.
- A little cleanup is necessary: delete all empty rows and the header.
- Notice how, when you work with the sheet, the deleted rows appear again and again? This is because the formula keeps refreshing the content.
- In order to change or delete the content, we’ll need to copy the content of the first sheet into another sheet (paste values only).
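The cleanup described above can also be sketched in Python. This is a minimal illustration under two assumptions: the imported data is already a list of rows, and the header is the first non-empty row. The function name clean_rows and the sample rows are invented for the example.

```python
def clean_rows(rows, has_header=True):
    """Drop rows whose cells are all blank and, optionally, the header row."""
    kept = [row for row in rows if any(str(cell).strip() for cell in row)]
    return kept[1:] if has_header else kept


# Made-up import result with empty rows mixed in
imported = [
    ["Sex", "Count"],
    ["", ""],
    ["Men", "10"],
    ["   ", ""],
    ["Women", "12"],
]

cleaned = clean_rows(imported)
print(cleaned)
```

Because clean_rows builds a new list and leaves the imported one untouched, it plays the same role as pasting values into a second sheet: the cleaned copy no longer changes when the source refreshes.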
Scraper
Install the extension; it only works on Chrome.
https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd
Select the content, then right-click.
- It is crucial to inspect the page and try to identify the relevant tags.
- Do you see the small box on the upper left, labeled XPath?
- See how the tweet is within a <p> tag? Let’s add the tag to our XPath.
- In the “Columns” section, change the name of the first column to “Tweet”.
- Now let’s add the XPath for the tweet to it.
- The XPaths in the “Columns” section are relative: “./p” selects the <p> element inside each matched row element.
- Add “./p” as the XPath for the “Tweet” column and click “Scrape”.
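The idea of a row XPath plus relative column XPaths can be tried outside the browser. The sketch below uses Python's xml.etree.ElementTree, which supports only a limited XPath subset and requires well-formed markup; the sample markup and the class name "tweet" are assumptions made for the example, not the structure of any real page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a page of tweets
SAMPLE = """<html><body>
  <div class="tweet"><span>@user_a</span><p>First tweet text</p></div>
  <div class="tweet"><span>@user_b</span><p>Second tweet text</p></div>
  <div class="ad"><p>Not a tweet</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Row XPath: one match per tweet container (the class name is an assumption)
tweet_blocks = root.findall(".//div[@class='tweet']")

# Column XPath: "./p" is evaluated relative to each row element
tweets = [block.findtext("./p") for block in tweet_blocks]
print(tweets)
```

Because “./p” is resolved from each matched row, the <p> inside the advertisement block is never picked up, which is the point of keeping column XPaths relative.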
See more at: http://extract-web-data.com/xpath-review/#sthash.xxsBbkqV.dpuf
4. Dissecting a newspaper web page with OutWit Pro.
It works as a standalone application or as a Mozilla Firefox add-on.
Download and install: http://blog.outwit.com/
Features (Light free version vs. Pro):
- Unlimited extractions
- Link extraction
- Image extraction & download
- Email extraction
- Data extraction
- Simple text extraction
- RSS news extraction
- Colorized source
- Document extraction & download
- Words & groups of words
- Directories of links & queries
- Advanced scrapers
- Macro automation
- Periodical job execution
- Query generation matrices
- Advanced Dig functions
1.1. Free functions.
- It’s a data extractor.
- The log panel is at the top.
- The catch is my collection basket, where I can store all types of information. Anything of interest can be dragged to the catch from the page or any other view.
- I can identify all the out-links in a web page, as well as the documents and pictures.
1.1.1. Obtaining hyperlinks
Example: searching on Google or Google Scholar.
Extracting hyperlinks is a way to identify the web communication policy of an organization.
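As a rough script equivalent of link extraction, the sketch below collects the links on a page and keeps only those that point to another host. It is an illustration under assumptions, not a description of how OutWit works; the names LinkCollector and out_links and the sample markup are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address
                self.links.append(urljoin(self.base_url, href))


def out_links(html_text, base_url):
    """Return http(s) links that point to a different host than base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    own_host = urlparse(base_url).netloc
    return [
        url for url in collector.links
        if urlparse(url).scheme in ("http", "https")
        and urlparse(url).netloc != own_host
    ]


# Made-up page fragment for an imaginary organization
SAMPLE = """
<a href="/about">About us</a>
<a href="http://twitter.com/example_org">Twitter</a>
<a href="http://partner.example.net/report.pdf">Report</a>
<a name="top">no href</a>
"""

links = out_links(SAMPLE, "http://www.example.org/")
print(links)
```

Counting which external hosts an organization links to is one simple way to turn such a list into an indicator for research.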
1.1.2. Image extraction.
Example: searching Google Images.
The same applies when you are searching for documents: you do not need to load and select the files one by one.
1.1.3. Creating your own tables of data
1.1.4. Exporting your data in different formats.
Excel
HTML
CSV
TEXT
SQL
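For comparison, exporting scraped records to CSV from a script can be sketched with Python's standard csv module. The records and the helper name to_csv are made up for the example; the output is built in memory so the sketch does not touch any files.

```python
import csv
import io

# Made-up scraped records; note the comma inside the second title
rows = [
    {"title": "First headline", "link": "http://example.com/1"},
    {"title": "Headline, with a comma", "link": "http://example.com/2"},
]


def to_csv(records, fieldnames):
    """Serialize a list of dictionaries to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["title", "link"])
print(csv_text)
```

The csv module quotes the field that contains a comma, which is the kind of detail that tends to go wrong when CSV lines are assembled by hand.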
1.2. Pro version
1.2.1. Word frequency
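A word-frequency count is easy to reproduce on any extracted text. The sketch below is a minimal Python illustration, assuming that lowercasing the text and ignoring very short words is acceptable; the function name word_frequency and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def word_frequency(text, min_length=3):
    """Count how often each word (letters only, lowercased) occurs."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(word for word in words if len(word) >= min_length)


sample = "Lobby groups meet regulators. Regulators meet lobby groups and the press."
freq = word_frequency(sample)
print(freq.most_common(3))
```

In practice a stop-word list (articles, prepositions) would usually be removed as well, which this sketch leaves out.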
1.2.2. Creating your personalized scrapers.
This is useful for TOPSY, for some specific browsers, or for any web page.
1.2.3. Exploring multiple web pages
1.2.4. Macro automation
If you know that you need to run a task several times, or even regularly, the best alternative is a macro.
In addition, you can schedule a job.
5. Some final considerations.
- The main function of scraping is to convert semi-structured data into structured data that is easy to use for further processing. With a bit of programming this is a relatively simple task, and for single web pages it is feasible without any programming at all.
- Respect the hosting site's wishes: check whether an API exists or whether the data are available for download. Some websites "disallow" scrapers in their robots.txt.
- Limit your bandwidth use: wait one second after each hit, try to scrape websites during off-peak hours, and scrape only what you need, and just once.
- The fact that you can access some data doesn't mean you should use it for your research.
- Be aware of rate limits.
- There is an ongoing debate on the replication of social science research that uses this source of data.
- Be careful when scraping Google.
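Two of the considerations above, honoring robots.txt and pausing between hits, can be sketched with Python's standard library. This is an illustration under assumptions: the robots.txt content, the user agent name "workshop-bot", the URLs, and both helper functions are invented, and no network request is made here.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one is served at /robots.txt on the host
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
"""


def allowed_urls(urls, robots_text, user_agent="workshop-bot"):
    """Keep only the URLs that the given robots.txt rules allow."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_visit(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive hits."""
    results = []
    for position, url in enumerate(urls):
        if position:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


candidates = [
    "http://example.org/news/1",
    "http://example.org/private/list",
]
ok = allowed_urls(candidates, SAMPLE_ROBOTS)
print(ok)
```

The fetch argument of polite_visit is left to the caller (any function that downloads one URL), so the one-second default pause applies regardless of which download method is used.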
Further readings
http://www.slideshare.net/anniecushing/web-scraping-forcodeophobes
https://chrome.google.com/webstore/detail/iebpjdmgckacbodjpijphcplhebcmeop
https://chrome.google.com/webstore/detail/mbigbapnjcgaffohmbkdlecaccepngjd
https://docs.google.com/a/seerinteractive.com/spreadsheet/ccc?key=0Ak_0EzUuRyn0dDFYOWxwWGt0eUNkTlcySk9iMUdDOGc#gid=2
http://www.seerinteractive.com/blog/importxml-cookbook/
http://bit.ly/xpath-tutorial
https://docs.google.com/spreadsheet/ccc?key=0Ak_0EzUuRyn0dFVnZUNHQVRGZ1hES3IxY3hWdVVsNEE#gid=3