added web crawler · liuhuimin/python-scripts@7409f60 · GitHub

Commit 7409f60

added web crawler

1 parent 7a6597c commit 7409f60

File tree

2 files changed: +46 -0 lines changed

2 files changed

+46
-0
lines changed

08_basic_email_web_crawler.py

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
import re
import urlparse

import requests

# regexes: e-mail addresses and href links
email_re = re.compile(r'([\w.-]+@[\w.-]+\.\w+)')
link_re = re.compile(r'href="(.*?)"')


def crawl(url, maxlevel):
    result = set()

    if maxlevel <= 0:
        return result

    # Get the webpage
    req = requests.get(url)

    # Check if the request was successful
    if req.status_code != 200:
        return result

    # Find all emails on the current page
    result.update(email_re.findall(req.text))

    print "Crawled level: {}".format(maxlevel)

    # Find and follow all the links, one level deeper,
    # merging whatever the recursive calls collect
    for link in link_re.findall(req.text):
        # Get an absolute URL for the link
        link = urlparse.urljoin(url, link)
        result.update(crawl(link, maxlevel - 1))

    return result


emails = crawl('http://www.website_goes_here_dot_com', 2)

print "\nScraped e-mail addresses:"
for email in emails:
    print email
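
The script above targets Python 2 (print statements and the standalone urlparse module, which Python 3 folded into urllib.parse). For reference, a minimal Python 3 sketch of the same crawler, assuming only that requests is installed; the placeholder URL is carried over from the original:

import re
from urllib.parse import urljoin  # urlparse.urljoin lives here in Python 3

import requests

email_re = re.compile(r'([\w.-]+@[\w.-]+\.\w+)')
link_re = re.compile(r'href="(.*?)"')


def crawl(url, maxlevel):
    """Collect e-mail addresses from url, following links maxlevel pages deep."""
    result = set()
    if maxlevel <= 0:
        return result

    req = requests.get(url)
    if req.status_code != 200:
        return result

    # Harvest e-mails from the current page
    result.update(email_re.findall(req.text))
    print("Crawled level: {}".format(maxlevel))

    # Follow each link one level deeper and merge what it finds
    for link in link_re.findall(req.text):
        result.update(crawl(urljoin(url, link), maxlevel - 1))

    return result


if __name__ == '__main__':
    # placeholder URL as in the original script
    emails = crawl('http://www.website_goes_here_dot_com', 2)
    print("\nScraped e-mail addresses:")
    for email in emails:
        print(email)

A production version would also need a set of already-visited URLs to avoid re-fetching the same page, plus some politeness (rate limiting, robots.txt), none of which the committed script attempts.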

readme.md

Lines changed: 1 addition & 0 deletions
@@ -7,3 +7,4 @@
1. **05_load_json_without_dupes.py**: load json, convert to dict, raise error if there is a duplicate key
1. **06_execution_time.py**: class used for timing execution of code
1. **07_benchmark_permissions_loading_django.py**: benchmark loading of permissions in Django
1. **08_basic_email_web_crawler.py**: web crawler for grabbing emails from a website recursively
