Deep Web
Part II.B. Techniques and Tools:
Network Forensics
CSF: Forensics Cyber-Security
Fall 2015
Nuno Santos
Summary
} The Surface Web
} The Deep Web
2 CSF - Nuno Santos 2015/16
Remember where we are
} Our journey in this course:
} Part I: Foundations of digital forensics
} Part II: Techniques and tools
} A. Computer forensics
} B. Network forensics (current focus)
} C. Forensic data analysis
3 CSF - Nuno Santos 2015/16
Previously: Three key instruments in cybercrime
} Tools of cybercrime:
} Anonymity systems: how criminals hide their IDs
} Botnets: how to launch large-scale attacks
} Digital currency: how to make untraceable payments
4 CSF - Nuno Santos 2015/16
Today: One last key instrument – The Web itself
Offender
} Web allows for accessing services for criminal activity
} E.g., drug selling, weapon selling, etc.
} Provides huge source of information, used in:
} Crime premeditation, privacy violations, identity theft, extortion, etc.
} To find services and info, there are powerful search engines
} Google, Bing, Shodan, etc.
5 CSF - Nuno Santos 2015/16
The Web: powerful also for crime investigation
Investigator
} Powerful investigation tool about suspects
} Find evidence in blogs, social networks, browsing activity, etc.
} The playground where the crime itself is carried out
} Illegal transactions, cyber stalking, blackmail, fraud, etc.
6 CSF - Nuno Santos 2015/16
An eternal cat & mouse race (who’s who?)
} The sophistication of offenses (and investigations) is driven
by the nature and complexity of the Web
7 CSF - Nuno Santos 2015/16
The web is deep, very deep…
} What’s “visible” through typical search engines is minimal
8 CSF - Nuno Santos 2015/16
What can be found in the Deep Web?
} The Deep Web is not necessarily bad: it's just that the content is not directly indexed
} The part of the Deep Web where criminal activity is carried out is named the Dark Web
9 CSF - Nuno Santos 2015/16
Some examples of services in the Web “ocean”
10 CSF - Nuno Santos 2015/16
Offenders operate at all layers
} Investigators too!
11 CSF - Nuno Santos 2015/16
Roadmap
} The Surface Web
} The Deep Web
12 CSF - Nuno Santos 2015/16
The Surface Web
13 CSF - Nuno Santos 2015/16
The Surface Web
} The Surface Web is that portion of the World Wide Web
that is readily available to the general public and
searchable with standard web search engines
} AKA Visible Web, Clearnet, Indexed Web, Indexable Web or Lightnet
} As of June 14, 2015, Google's index of the surface web
contains about 14.5 billion pages
14 CSF - Nuno Santos 2015/16
Surface Web characteristics
} Distributed data
} 80 million web sites (hostnames responding) in April 2006
} 40 million active web sites (don’t redirect, …)
} High volatility
} Servers come and go …
} Large volume
} One study found 11.5 billion pages in January 2005 (at that
time Google indexed 8 billion pages)
15 CSF - Nuno Santos 2015/16
Surface Web characteristics
} Unstructured data
} Lots of duplicated content (30% estimate)
} Semantic duplication much higher
} Quality of data
} No required editorial process
} Many typos and misspellings (impacts information retrieval)
} Heterogeneous data
} Different media
} Different languages
16 CSF - Nuno Santos 2015/16
Surface Web composition by file type
} As of 2003, about 70% of Web content was images, HTML, PHP, and PDF files
17 CSF - Nuno Santos 2015/16
How to find content and services?
} Using search engines
1. A web crawler gathers a snapshot of the Web
2. The gathered pages are indexed for easy retrieval
3. User submits a search query
4. Search engine ranks pages that match the query and returns an ordered list
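} A minimal sketch of steps 2-4 in Python (purely illustrative; the page texts and query are made-up examples): build an inverted index over the crawled pages, then rank matching pages by how many query terms they contain

# Toy indexing and querying -- not a real search engine
from collections import defaultdict

pages = {                                   # step 1: pages fetched by the crawler (toy data)
    "http://example.org/a": "deep web content hidden from crawlers",
    "http://example.org/b": "surface web pages indexed by search engines",
}

index = defaultdict(set)                    # step 2: inverted index, term -> set of URLs
for url, text in pages.items():
    for term in text.lower().split():
        index[term].add(url)

query = "deep web"                          # step 3: the user's query
scores = defaultdict(int)                   # step 4: rank pages by number of matching terms
for term in query.lower().split():
    for url in index.get(term, ()):
        scores[url] += 1
for url, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(score, url)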
18 CSF - Nuno Santos 2015/16
How a typical search engine works
} Architecture of a typical search engine
[Diagram: users submit queries through an interface to the query engine, which answers them from the index; a crawler fetches pages from the Web and an indexer builds the index; all of this runs on lots and lots of computers]
19 CSF - Nuno Santos 2015/16
What a Web crawler does
} The Web crawler is a foundational species
} Without crawlers, there would be nothing to search
} Creates and repopulates search engine data by navigating the Web, fetching documents and files
20 CSF - Nuno Santos 2015/16
What a Web crawler is
} In general, it’s a program for downloading web pages
} Crawler AKA spider, bot, harvester
} Given an initial set of seed URLs, recursively download
every page that is linked from pages in the set
} A focused web crawler downloads only those pages whose
content satisfies some criterion
} The set of URLs waiting to be crawled is called the URL frontier
} Can include multiple pages from the same host
21 CSF - Nuno Santos 2015/16
Crawling the Web: Start from the seed pages
[Diagram: crawling starts from the seed pages; URLs already crawled and parsed surround the seeds, the URL frontier lies at their edge, and beyond it is the unseen Web]
22 CSF - Nuno Santos 2015/16
Crawling the Web: Keep expanding URL frontier
[Diagram: crawling threads pop URLs from the frontier; as pages are crawled and parsed, newly discovered links push the frontier further into the unseen Web]
23 CSF - Nuno Santos 2015/16
Web crawler algorithm is conceptually simple
} Basic Algorithm
Initialize queue Q with the initial set of known URLs
Until Q is empty or the page or time limit is exhausted:
Pop URL L from the front of Q
If L does not point to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...), continue loop
If L has already been visited, continue loop
Download page P for L
If P cannot be downloaded (e.g., 404 error, robot excluded), continue loop
Index P (e.g., add to inverted index or store a cached copy)
Parse P to obtain the list of new links N
Append N to the end of Q
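} A minimal runnable sketch of this algorithm in Python, assuming the requests and beautifulsoup4 libraries are installed (the seed URL is a placeholder and robot exclusion is omitted for brevity)

# Toy breadth-first crawler following the basic algorithm above
from collections import deque
from urllib.parse import urljoin, urldefrag
import requests
from bs4 import BeautifulSoup

def crawl(seeds, page_limit=50):
    queue, visited, store = deque(seeds), set(), {}
    while queue and len(store) < page_limit:         # page limit exhausted?
        url = queue.popleft()                        # pop URL L from the front of Q
        if url in visited:                           # already visited, continue loop
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:            # cannot download P (timeout, broken link)
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue                                 # L does not point to an HTML page
        store[url] = resp.text                       # "index" P (here: store a cached copy)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):      # parse P to obtain new links N
            link, _ = urldefrag(urljoin(url, a["href"]))
            queue.append(link)                       # append N to the end of Q
    return store

# crawl(["http://example.org/"])  # placeholder seed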
24 CSF - Nuno Santos 2015/16
But not so simple to build in practice
} Performance: How do you crawl 1,000,000,000 pages?
} Politeness: How do you avoid overloading servers?
} Failures: Broken links, time outs, spider traps.
} Strategies: How deep to go? Depth first or breadth first?
} Implementations: How do we store and update the URL
list and other data structures needed?
25 CSF - Nuno Santos 2015/16
Crawler performance measures
} Completeness
Is the algorithm guaranteed to find a solution when
there is one?
} Optimality
Is this solution optimal?
} Time complexity
How long does it take?
} Space complexity
How much memory does it require?
26 CSF - Nuno Santos 2015/16
No single crawler can crawl the entire Web
} Crawling technique may depend on goal
} Types of crawling goals:
} Create large broad index
} Create a focused topic or domain-specific index
} Target topic-relevant sites
} Index preset terms
} Create subset of content to model characteristics
of the Web
} Need to survey appropriately
} Cannot use simple depth-first or breadth-first
} Create up-to-date index
} Use estimated change frequencies
27 CSF - Nuno Santos 2015/16
Crawlers can also be used for nefarious purposes
} Spiders can be used to collect email addresses for
unsolicited communication
} From: http://spiders.must.die.net
28 CSF - Nuno Santos 2015/16
Crawler code available for free
29 CSF - Nuno Santos 2015/16
Spider traps
} A spider trap is a set of web pages that may be used to
cause a web crawler to make an infinite number of
requests or cause a poorly constructed crawler to crash
} To “catch” spambots or similar that waste a website's bandwidth
} Common techniques used are:
• Creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
• Dynamic pages, like calendars, that produce an infinite number of pages for a web crawler to follow
• Pages filled with many characters, crashing the lexical analyzer parsing the page
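} A crawler can defend itself with simple heuristics; a minimal Python sketch (the thresholds are arbitrary examples): reject URLs that are excessively long or deeply nested before adding them to the frontier

# Simple spider-trap heuristics: skip suspiciously long or deep URLs
from urllib.parse import urlparse

MAX_URL_LENGTH = 256      # arbitrary example thresholds
MAX_PATH_DEPTH = 10

def looks_like_trap(url):
    if len(url) > MAX_URL_LENGTH:
        return True
    return urlparse(url).path.count("/") > MAX_PATH_DEPTH

print(looks_like_trap("http://foo.com/bar/foo/bar/foo/bar/foo/bar/foo/bar/foo/bar/"))  # True
print(looks_like_trap("http://foo.com/bar/"))                                          # False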
30 CSF - Nuno Santos 2015/16
Search engines run specific and benign crawlers
} Search engines obtain their listings in two ways:
} The search engines "crawl" or "spider" documents by following hypertext links from one page to another
} Authors may submit their own Web pages
} As a result, only static Web content can be found on public
search engines
} Nevertheless, a lot of info can be retrieved by criminals and
investigators, especially when using “hidden” features of the
search engine
31 CSF - Nuno Santos 2015/16
Google hacking
} Google provides keywords for advanced searching
} Logic operators in search expressions
} Advanced query attributes: “login password filetype:pdf”
} intitle, allintitle
} inurl, allinurl
} allintext
} filetype
} site
} link, inanchor
} cache, info
} related, define
} daterange
} phonebook, rphonebook, bphonebook
} author, group, msgid, insubject
} stocks
32 CSF - Nuno Santos 2015/16
There’s entire books dedicated to Google hacking
Dornfest, Rael, Google Hacks 3rd ed,
O’Rielly, (2006)
Ethical Hacking: http://www.nc-net.info/2006conf/Ethical_Hacking_Presentation_October_2006.ppt
A cheat sheet of Google search features: http://www.google.com/intl/en/help/features.html
A Cheat Sheet for Google Search Hacks -- how to find information fast and efficiently: http://www.expertsforge.com/Security/hacking-everything-using-google-3.asp
33 CSF - Nuno Santos 2015/16
Google hacking examples: Simple word search
} A simple search: “cd ls .bash_history ssh”
} Can return surprising
results: this is the
contents of a
live .bash_history file
34 CSF - Nuno Santos 2015/16
Google hacking examples: URL searches
} inurl: find the search term within the URL
} Example queries:
inurl:admin
inurl:admin users mbox
inurl:admin users passwords
35 CSF - Nuno Santos 2015/16
Google hacking examples: File type searches
} filetype: narrow down search results to specific file type
filetype:xls “checking
account” “credit card”
36 CSF - Nuno Santos 2015/16
Google hacking examples: Finding servers
intitle:"Under
construction" "does not
currently have"
intitle:"Welcome to Windows
2000 Internet Services"
37 CSF - Nuno Santos 2015/16
Google hacking examples: Finding webcams
} To find open unprotected Internet
webcams that broadcast to the
web, use the following query:
} inurl:/view.shtml
} Can also search by manufacturer-specific URL patterns
} inurl:ViewerFrame?Mode=
} inurl:ViewerFrame?Mode=Refresh
} inurl:axis-cgi/jpg
} ...
38 CSF - Nuno Santos 2015/16
Google hacking examples: Finding webcams
} How to Find and View Millions of Free Live Web Cams: http://www.traveltowork.net/2009/02/how-to-find-view-free-live-web-cams/
} How to Hack Security Cameras: http://www.truveo.com/How-To-Hack-Security-Cameras/id/180144027190129591
} How to Hack Security Cams all over the World: http://www.youtube.com/watch?v=9VRN8BS02Rk&feature=related
39 CSF - Nuno Santos 2015/16
And we’re just scratching the surface…
What can be found in the depths of the Web?
40 CSF - Nuno Santos 2015/16
The Deep Web
41 CSF - Nuno Santos 2015/16
The Deep Web
} The Deep Web is the part of the Web that is not indexed by conventional search engines and therefore does not appear in search results
} Why is it not indexed by typical search engines?
42 CSF - Nuno Santos 2015/16
Some content can’t be found through URL traversal
• Dynamic web pages and searchable databases
– Response to a query or accessed only through a form (see the sketch after this list)
• Unlinked contents
– Pages without any backlinks
• Private web
– Sites requiring registration and login
• Limited access web
– Sites with captchas, no-cache pragma http headers
• Scripted pages
– Pages produced by JavaScript, Flash, etc.
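} For instance, database-backed content is typically reached only by submitting a query through a form, so a link-following crawler never sees the result pages; a hypothetical Python sketch (the site and form fields are invented for illustration)

# Content behind a search form is reachable only by submitting the form
import requests

resp = requests.post("http://library.example.org/search",          # hypothetical endpoint
                     data={"title": "forensics", "year": "2015"},  # hypothetical form fields
                     timeout=5)
print(resp.status_code, len(resp.text))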
43 CSF - Nuno Santos 2015/16
In other cases, content won't be found
} Crawling restrictions by site owner
} Use a robots.txt file to keep files off limits from spiders (see the robots.txt sketch after this list)
} Crawling restrictions by the search engine
} E.g.: a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
} Most search engines will not read past the ? in that URL
} Limitations of the crawling engine
} E.g., real-time data – changes rapidly – too “fresh”
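} A well-behaved crawler consults robots.txt before fetching; a minimal sketch using Python's standard urllib.robotparser (the URLs are placeholders taken from the example above)

# Check robots.txt before fetching a URL
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.website.com/robots.txt")
rp.read()

url = "http://www.website.com/cgi-bin/getpage.cgi?name=sitemap"
if rp.can_fetch("MyCrawler", url):
    print("allowed to fetch", url)
else:
    print("disallowed by robots.txt:", url)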
44 CSF - Nuno Santos 2015/16
How big is Deep Web?
} Studies suggest it’s approx. 500x the surface Web
} But cannot be determined accurately
} A 2001 study showed that just 60 deep Web sites exceeded the size of the surface Web (at that time) by 40x
45 CSF - Nuno Santos 2015/16
Distribution of Deep Web sites by content type
} Back in 2001, the biggest fraction went to databases
46 CSF - Nuno Santos 2015/16
Approaches for finding content in Deep Web
1. Specialized search engines
2. Directories
47 CSF - Nuno Santos 2015/16
Specialized search engines
} Crawl deeper
} Go beyond top page, or homepage
} Crawl focused
} Choose sources to spider—topical sites only
} Crawl informed
} Indexing based on knowledge of the specific subject
48 CSF - Nuno Santos 2015/16
Specialized search engines abound
} There’s hundreds of specialized search engines for almost
every topic
49 CSF - Nuno Santos 2015/16
Directories
} Collections of pre-screened web sites organized into categories based on a controlled ontology
} Including access to content in databases
} Ontology: classification of human knowledge into
topics, similar to traditional library catalogs
} Two maintenance models: open or closed
} Closed model: paid editors; quality control (Yahoo)
} Open model: volunteer editors (Open Directory Project)
50 CSF - Nuno Santos 2015/16
Example of ontology
} Ontologies allow for adding structure to Web content
51 CSF - Nuno Santos 2015/16
A particularly interesting search engine
} Shodan lets the user find specific types of computers connected
to the internet using a variety of filters
} Routers, servers, traffic lights, security cameras, home heating systems
} Control systems for water parks, gas stations, water plants, power grids,
nuclear power plants and particle-accelerating cyclotrons
} Why is it interesting?
} Many devices use "admin" as user name and "1234" as password, and the only software required to connect to them is a web browser
52 CSF - Nuno Santos 2015/16
How does Shodan work?
“Google crawls URLs – I don’t do that at all. The only thing I
do is randomly pick an IP out of all the IPs that exist,
whether it’s online or not being used, and I try to connect
to it on different ports. It’s probably not a part of the visible
web in the sense that you can’t just use a browser. It’s not
something that most people can easily discover, just because
it’s not visual in the same way a website is.”
John Matherly, Shodan's creator
} Shodan collects data mostly on HTTP servers (port 80)
} But also from FTP (21), SSH (22), Telnet (23), and SNMP (161)
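} Conceptually, such a scan just connects to an address and port and records whatever banner the service announces; a minimal illustrative Python sketch (192.0.2.1 is a reserved documentation address used as a placeholder; scanning hosts you do not own may be illegal)

# Banner grabbing: connect to a port and record what the service announces
import socket

def grab_banner(ip, port, timeout=3):
    try:
        with socket.create_connection((ip, port), timeout=timeout) as s:
            s.settimeout(timeout)
            return s.recv(1024).decode(errors="replace")
    except OSError:
        return None

# FTP, SSH, and Telnet servers usually announce themselves unprompted;
# for HTTP (80) a request would have to be sent first
for port in (21, 22, 23):
    print(port, grab_banner("192.0.2.1", port))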
53 CSF - Nuno Santos 2015/16
One can see through the eye of a webcam
54 CSF - Nuno Santos 2015/16
Play with the controls for a water treatment facility
55 CSF - Nuno Santos 2015/16
Find the creepiest stuff…
} Controls for a crematorium; accessible from your computer
56 CSF - Nuno Santos 2015/16
No words needed
} Controls of Caterpillar trucks connected to the Internet
57 CSF - Nuno Santos 2015/16
A Deep Web’s particular case
Dark Web
58 CSF - Nuno Santos 2015/16
Dark Web
} Dark Web is the Web content that exists on darknets
} Darknets are overlay networks that use the public Internet but require specific software or authorization to access
} Delivered over small peer-to-peer networks
} As hidden services on top of Tor
} The Dark Web forms a small part of the Deep Web,
the part of the Web not indexed by search engines
59 CSF - Nuno Santos 2015/16
The Dark Web is a haven for criminal activities
} Hacking services
} Fraud and fraud
services
} Markets for
illegal products
} Hitmen
} …
60 CSF - Nuno Santos 2015/16
Surface Web vs. Deep Web
Surface Web:
} Size: estimated to be 8+ billion (Google) to 45 billion (About.com) web pages
} Static, crawlable web pages
} Large amounts of unfiltered information
} Limited to what is easily found by search engines
Deep Web:
} Size: estimated to be 5 to 500x larger (BrightPlanet)
} Dynamically generated content that lives inside databases
} High-quality, managed, subject-specific content
} Growing faster than the surface web (BrightPlanet)
61 CSF - Nuno Santos 2015/16
Conclusions
} The Web is a major source of information for both
criminal and legal investigation activities
} The Web content that is typically accessible through
conventional search engines is named the Surface Web
and represents only a small fraction of the whole Web
} The Deep Web comprises the largest bulk of the Web; a small part of it (the Dark Web) is used specifically for carrying out criminal activities
62 CSF - Nuno Santos 2015/16
References
} Primary bibliography
} Michael K. Bergman, The Deep Web: Surfacing Hidden Value
http://brightplanet.com/wp-content/uploads/2012/03/12550176481-deepwebwhitepaper1.pdf
63 CSF - Nuno Santos 2015/16
Next class
} Flow analysis and intrusion detection
64 CSF - Nuno Santos 2015/16