PROLE 2008
Filtering Web Pages 1

Josep Silva 2
Computer Science Department
Technical University of Valencia
Camino de Vera s/n, E-46022 Valencia, Spain
Abstract
Nowadays, the Internet is the main source of information for millions of people and
enterprises. However, the information on the Internet has not yet been classified and,
consequently, searching for information is one of the most important tasks and
processes performed by users and systems. In particular, for human WWW users
the search for information is the main (and most time-consuming) task performed. To
face this problem, both the industrial and the academic communities have developed
many methods and tools to index and search web pages. The most widespread solution
is the use of search engines such as Google and Yahoo; however, while current search
engines can be a suitable solution for finding a particular webpage, they are useless for
finding the relevant information within such a page. Hence, once a webpage is found, the
user must search through it in order to verify whether the needed information is there. This
is a problem which until now has not been satisfactorily solved. In this paper we
present a tool able to automatically extract from a webpage the information (text,
images, etc.) related to a filtering criterion, without the use of semantic specifications
or indexes and without the need for offline parsing or compilation processes. This
tool has been published as an extension for the Firefox web browser.
Key words: Filtering, Web pages, HTML.
1 Introduction
The Internet is the main information provider in the world. Estimates vary
from 15 to some 30 billion webpages, shared with heterogeneous information
about practically all matters. However, the organization of this information
1 This work has been partially supported by the EU (FEDER) and the Spanish MEC
under grant TIN2005-09207-C03-02, by the ICT for EU-India Cross-Cultural Dissemination
Project ALA/95/23/2003/077-054, and by the Vicerrectorado de Innovación y Desarrollo
de la UPV under project TAMAT ref. 5771.

2 Email: jsilva@dsic.upv.es
This paper was generated using LaTeX macros provided by
Electronic Notes in Theoretical Computer Science
has not been centralized, and each independent node is responsible for maintaining
its own information. This makes the classification of the information contained in
webpages a very difficult task and, thus, the search for information is one of the
most important problems on the Internet nowadays. One of the most widespread
solutions for finding information on the Internet is the use of search engines. For
instance, the two most used search engines are Google and Yahoo, with more than
one hundred and fifty million searches every day (42.7% and 28.17% of the market)
[12]. A search engine is a system
with two main processes: indexing and searching. The interface between both
processes is an index. Indexes are built by means of a collection of robots
which visit webpages, process their contents, and index them by words. For
instance, given a word, an index returns the set of pages which contain this
word, ordered by an impact rating. Nevertheless, these indexes treat pages
as atoms; thus, they cannot specify which part of the webpage contains the
required information.
In consequence, current search engines are useful for finding webpages, but
they are useless for finding the needed information within a webpage. The information
contained in a webpage can be huge and heterogeneous; therefore, the search
for information in a webpage can still be a time-consuming task. When the
user is human, she is usually forced to read and scroll through part of the webpage
before she knows whether the page contains what is being searched for.
Surprisingly, the scientific community has paid little attention to this problem,
and currently there are no tools available to suitably filter the information of a
single webpage. In this work, we present a method to extract relevant
information from a webpage based on a filtering process. In our approach,
the user specifies a filtering criterion, i.e., a text which says what is being
searched for, and then a new webpage is automatically produced which only contains
the information related to the filtering criterion. To the best of our
knowledge, this is the first tool able to produce a slice from any webpage. In
the following sections we describe how this approach works and how it
has been implemented and distributed.
The rest of the paper is organized as follows: First, the next section
shows the usefulness of a web filtering tool by means of an example. In
Section 3, the filtering process is described. In Section 4, the Web Filtering
Toolbar, which is the implementation of our approach, is presented. We describe
related work in Section 5. Finally, in Section 6 we present the conclusions and
define some future lines of research.
2 Motivation: Filtering Web Pages
The main objective of our technique is web filtering. The usefulness of this
technique can easily be shown with an example. The simplest (and most common)
functionality of our technique is information filtering (removing
all the information which is of no interest). A small example of this is
(a) University of Cambridge’s website
(b) Filtered version
Fig. 1. Filtering the University of Cambridge’s website
the website produced by filtering the University of Cambridge’s website with
respect to the filtering criterion “students”. This is shown in Figure 1 where
all the information (including pictures, links, etc.) which is not related to
students has been removed.
However, our technique also makes more complex information retrieval processes
possible (extracting the relevant information and reconfiguring it in a convenient
and readable fashion). For instance, let us assume 3 that we are searching the
Internet for a list of papers co-authored by the researchers Germán Vidal and
Josep Silva. A first solution could be to find Germán Vidal's webpage and look
for his list of publications. To do so, we can use the Google search engine and
specify a search criterion; for instance, we can type "Germán Vidal papers". The
output of Google is shown in Figure 2 (a). Unfortunately, the result produced by
Google (and by the rest of current web search engines) is only a list of webpages.
Therefore, once a webpage is selected, we have to manually review the information
contained in it in order to find the information we are searching for.
As can be seen in Figure 2 (b), this process is time-consuming because
there are more than 90 publications on this webpage, and only 26 are co-authored
with Josep Silva. Therefore, the user must review all the publications to
know which of them are Josep Silva's.
3 This assumption is a real example performed on November 28th, 2007.

(a) Google's results
(b) Germán Vidal's papers webpage
(c) Germán Vidal's filtered webpage
Fig. 2. Searching for Germán Vidal and Josep Silva's papers in common

Now, let us assume that we have available a tool which allows us to filter
a webpage according to a filtering criterion. Since we are looking for Josep
Silva’s publications, we can type the filtering criterion “Josep Silva”. Then,
we press the filtering button and a new page is automatically generated from
the previous one by extracting all the information related to Josep Silva and
putting it together. The result is the list of publications co-authored by Germán
Vidal and Josep Silva (see Figure 2 (c)). In this case, we have found the desired
information with a single click. Note that, in this case, the result produced
is not limited to information removal: the relevant information has also been
reconfigured (note the scroll bars of the webpages).
This simple example describes the normal behavior of our tool. Although
this is the main functionality, the tool can be parameterized to alter its standard
behavior so that different kinds of filters can be used. The next section describes
how the filtering process works; we explain some additional options in the
implementation section.
3 The Filtering Technique
The main advantage of our approach to webpage filtering is its simplicity.
The basic idea is to see a webpage as a set of components and to delete (or
hide) all those components which are not relevant with respect to a filtering
criterion.
A webpage can be seen as a tree of labeled nodes. This tree is internally
represented in the Document Object Model (DOM) with a data structure
called 'document' whose root node is labeled "HTML" and which usually
has a child labeled "body". We use this model to view webpages as a
hierarchically organized collection of nodes. With the DOM API we can
explore the nodes of a webpage and query their properties.

Our filtering algorithm first constructs a tree-like data structure which contains
all the information of the webpage, hierarchically organized in document
order. This data structure is really an object which provides methods to query
the tree. Hence, after this phase, we have a webpage represented as a data
structure and a mechanism to explore it.
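As an illustration, this hierarchical representation and its document-order traversal can be sketched with plain objects. This is a minimal sketch under our own naming: makeNode and traverse (and the fields tagName, attributes, children) are illustrative stand-ins, since the actual tool works directly on the browser's DOM 'document' object.

```javascript
// A minimal stand-in for the DOM tree described above.
function makeNode(tagName, attributes, children) {
  return { tagName: tagName, attributes: attributes || {}, children: children || [] };
}

// Depth-first traversal in document order, calling visit on every node.
function traverse(node, visit) {
  visit(node);
  node.children.forEach(function (child) { traverse(child, visit); });
}

// A tiny page: an HTML root whose body contains a paragraph and an image.
var page = makeNode("HTML", {}, [
  makeNode("body", {}, [
    makeNode("P", { title: "Publications of Josep Silva" }, []),
    makeNode("IMG", { alt: "group photo" }, [])
  ])
]);

var visited = [];
traverse(page, function (n) { visited.push(n.tagName); });
// visited is now ["HTML", "body", "P", "IMG"], i.e., document order
```

The traversal visits each node before its children, which matches the document order in which the tool's data structure is built.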
Now, we need to define the notion of filtering criterion and determine what
result should be produced for it. In our setting, a filtering criterion is a
pair (words, flags), where words is a text and flags is a vector of flags for
the filtering tool. In general, the user only needs to specify words, because
flags contains options that can be left unchanged with their default values 4.
words specifies the information that the user is looking for; therefore, the
filtering technique has to produce a new webpage which only contains those
nodes containing information related to this text. This can easily be checked
by comparing words with the information contained in the attributes of each
node.

In the current implementation, we consider that a node in a webpage
is relevant with respect to a filtering criterion (words, flags) iff any of its
attributes contains the text words.
Since the nodes in the tree have different sets of attributes depending on
the kind of node considered, we have to check different properties for different
kinds of nodes. For instance, an image in a webpage is represented by a node
with the property tagName == "IMG". Images have a special attribute called
'alt' which contains an alternative text to display if the image cannot be
loaded. Thus, we compare this property when the node is an image.

4 See Section 4 for a complete description of the current flags.
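This relevance check can be sketched as follows, under the simplifying assumption that the attributes of interest are exposed as strings in a single attributes map. The name isRelevant is ours, not the tool's, and the actual tool inspects different DOM properties per node kind as explained above.

```javascript
// A node is relevant w.r.t. (words, flags) iff some attribute contains words.
// The comparison is made case-insensitive here; the attributes map is a
// simplified stand-in for the per-kind DOM properties ('alt' for IMG, etc.).
function isRelevant(node, words) {
  var needle = words.toLowerCase();
  return Object.keys(node.attributes).some(function (key) {
    var value = node.attributes[key];
    return typeof value === "string" && value.toLowerCase().indexOf(needle) !== -1;
  });
}

// An image whose 'alt' text mentions the search words is relevant;
// an unrelated navigation node is not.
var img = { tagName: "IMG", attributes: { alt: "Josep Silva at the workshop" } };
var menu = { tagName: "DIV", attributes: { id: "navigation-menu" } };
// isRelevant(img, "Josep Silva") is true; isRelevant(menu, "Josep Silva") is false
```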
Once the relevant nodes have been found, we proceed by extracting those
nodes which are related to them. Therefore, we do not restrict ourselves to exact
text comparison. Instead, we use a proximity function which relies on the
following assumption: "semantically related objects are syntactically close".
After our experiments and intensive use of the Web Filtering Toolbar, this
assumption has proven to be very precise in practice because webpages often
structure their information with tables, menus, and other similar components
which group related information together. Moreover, a filtering criterion can
match different parts of a document, and thus one filtering criterion usually
points to a set of nodes of the tree structure. Then, for each relevant node, we
extract the associated information by using the proximity function.
Although our tool is prepared to work with a lexicon to find information
related to a given query (e.g., synonyms), and it could also use metainformation
associated with a webpage's nodes, the current implementation does not exploit
these features. The current proximity function is a configurable (see Section 4
for details) distance measure which allows us to obtain nodes related to the
relevant ones.
Example 3.1 Consider the webpage on the left of Figure 3, whose relevant
node is 5. There are different possibilities for producing a new (filtered) webpage
with respect to node 5. For instance, we could delete all the nodes except 5;
or all the nodes except 5 and its descendants (see Figure 3 (c)); or all except 5,
its descendants, and its ancestors (see Figure 3 (b)), depending on how much
information related to 5 we want to keep.
However, not all the solutions of the previous example produce the expected
result: some of them would destroy the structure of the webpage because the
filtered nodes would appear at the top of the new webpage instead of at their
original position. In some cases, this is the expected result; for instance, this
is the case of Figure 2 (c). However, in general, the user does not want to
destroy the webpage structure, and thus non-relevant nodes cannot be destroyed.
For instance, in Figure 1 (b), the structure of the original webpage is kept.
Fortunately, there is an easy solution: we can hide the nodes not related to 5
instead of deleting them. All these possibilities can be configured by means of
the flags component of the filtering criterion, described in the next section.
To summarize, we filter a given webpage with respect to a filtering criterion
by
(i) Building a data structure containing all the elements in the webpage,
(ii) Traversing the data structure to find the relevant nodes with respect to
the filtering criterion, and
(iii) Deleting (or hiding) all the nodes from the webpage except the relevant
nodes, according to the filtering criterion's flags.

(a) Webpage  (b) Keep tree on  (c) Keep tree off
Fig. 3. Tree representation of a webpage

Fig. 4. Web Filtering Toolbar
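The three steps above can be sketched on the simplified tree representation. This is an illustrative sketch under our own naming (filterPage, matches, and the keep/hidden fields are not the tool's actual identifiers), with the relevance test reduced to an attribute lookup; ancestors of relevant nodes are kept so the tree stays connected.

```javascript
// Sketch of steps (ii) and (iii): mark the relevant nodes (a relevant node
// keeps its whole subtree, and ancestors of kept nodes are also kept), then
// either hide or delete everything else, depending on the flags.
function filterPage(root, words, keepStructure) {
  function matches(node) {
    var needle = words.toLowerCase();
    return Object.keys(node.attributes).some(function (k) {
      var v = node.attributes[k];
      return typeof v === "string" && v.toLowerCase().indexOf(needle) !== -1;
    });
  }
  // Step (ii): mark relevant subtrees and propagate "keep" up to the root.
  function mark(node, inherited) {
    node.keep = inherited || matches(node);
    node.children.forEach(function (c) { mark(c, node.keep); });
    if (node.children.some(function (c) { return c.keep; })) node.keep = true;
  }
  // Step (iii): hide non-relevant nodes (keeps blank areas) or delete them.
  function prune(node) {
    if (keepStructure) {
      node.hidden = !node.keep;
    } else {
      node.children = node.children.filter(function (c) { return c.keep; });
    }
    node.children.forEach(prune);
  }
  mark(root, false);
  prune(root);
  return root;
}

// Example: after filtering, only the node mentioning "silva" survives.
var tree = {
  tagName: "HTML", attributes: {}, children: [{
    tagName: "body", attributes: {}, children: [
      { tagName: "P", attributes: { title: "josep silva" }, children: [] },
      { tagName: "P", attributes: { title: "unrelated" }, children: [] }
    ]
  }]
};
filterPage(tree, "silva", false);
// tree.children[0].children now holds only the node titled "josep silva"
```

Passing keepStructure = true instead marks the non-relevant nodes as hidden without removing them, which corresponds to the behavior of Figure 1 (b).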
4 Implementation
After a period of evaluation and testing in the sandbox of the Firefox developers'
area, the Firefox experts community has made the tool public, and it is now
distributed as an official Firefox add-on. Therefore, an available version can be
freely downloaded from the Firefox webpage. In this section we show how to
download, install, and use the Firefox Web Filtering Toolbar (see Figure 4).
4.1 Installing Web Filtering Toolbar
The Web Filtering Toolbar is distributed as an xpi [5] file. An xpi package
is basically a ZIP file that, when opened by the browser utility, installs a
browser extension. Currently, this extension applies to both the Mozilla and
Firefox browsers. Since an xpi package contains all the necessary information to
install an extension, the user only has to drag the xpi and drop it onto the
browser window. The extension will be installed automatically.
The user interface of the Web Filtering Toolbar (like the Firefox interface
itself) has been implemented with XUL and Javascript. XUL is an XML-based
language which provides the interface components (such as buttons,
menus, etc.), and Javascript is used to control and execute the user actions.
We have opened the source code; therefore, the complete implementation (both
source code and executable) of the Web Filtering Toolbar is publicly available.
Fig. 5. Web Filtering Toolbar ’s options
To download the tool and some other materials, visit:
http://www.dsic.upv.es/~jsilva/webfiltering
and
http://www.firefox.com/addons
4.2 Using Web Filtering Toolbar
In the normal case, using the filter is as easy as typing a text t inside a text
box and pressing the "Filter" button. Then, the tool produces a slice of the
current webpage with respect to the filtering criterion (t, f), where f are the
flags representing the default options of the tool. However, these options can
easily be changed by the user when needed.
In its current version, our tool can be parameterized with four flags which
determine the shape of the final slice (see Figure 5):
Keep Structure. When activated, "keep structure" ensures that the components
of the slice keep the same positions as in the original webpage; i.e.,
the webpage's structure is kept and, thus, the final slice will contain blank
areas (see Figure 1). If the structure is not kept, all the data in the final
slice is reorganized so that no empty spaces exist (see Figure 2).
Format Size iFrames. Iframes allow us to embed a webpage inside another
webpage. If "format size iframes" is activated, the size of the iframes of
the original webpage is adapted to the slice. Otherwise, the original size is
kept. Usually, the webpage inside an iframe is bigger than the area reserved for
the iframe; hence, the iframe uses scroll bars. Often, the slice extracted from
an iframe is small, and thus reformatting the size of the iframes avoids
unnecessary empty areas produced by the scrolling.
Keep Tree. When a node in the tree representing a document belongs to
the slice, it is possible to also include all the nodes belonging to the path
between this node and the root (the ancestors). To do this, keep tree must be
activated. For instance, in the slice of Figure 3 (b) keep tree was activated,
while in the slice of Figure 3 (c) it was not.
Tolerance. The default tolerance of the slicer is 0. With a tolerance of 0, only
the relevant nodes and their descendants are included in the slice. If the
tolerance is incremented, then those nodes which are close to the relevant
nodes are also set as relevant. Concretely, if the tolerance is incremented by one,
the parents (and their descendants) of relevant nodes are set as relevant.
For instance, in Figure 3 (a), with a tolerance of 0 and keep tree deactivated,
the slice produced is the one shown in Figure 3 (c). With a tolerance of 1, nodes
1 (the parent of 5) and 4 (a descendant of 1) would be included in the
slice. With a tolerance of 2, all the nodes would be included in the slice.
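The effect of the tolerance flag can be sketched as an iterative promotion of parents. The function expandByTolerance and the parentOf map are our own illustrative names; the node ids follow Figure 3 (a), where node 1 is the parent of node 5 and hangs from the HTML root.

```javascript
// Each tolerance step promotes the parents of the currently relevant nodes
// to relevant; the descendants of every relevant node are then included by
// the normal slicing step described earlier.
function expandByTolerance(relevantIds, parentOf, tolerance) {
  var relevant = {};
  relevantIds.forEach(function (id) { relevant[id] = true; });
  for (var step = 0; step < tolerance; step++) {
    Object.keys(relevant).forEach(function (id) {
      var parent = parentOf[id];
      if (parent !== undefined) relevant[parent] = true;
    });
  }
  return Object.keys(relevant).sort();
}

// Parent map for part of the tree of Figure 3 (a): node 5's parent is 1,
// node 4 is another child of 1, and node 1 hangs from the root.
var parentOf = { 5: 1, 1: "HTML", 4: 1 };
// With tolerance 1, node 1 (the parent of 5) also becomes relevant; its
// descendants (such as node 4) are then included in the slice as well.
var expanded = expandByTolerance([5], parentOf, 1);
```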
5 Related Work
The most widely used tool for searching information on the Internet is Google [4].
However, Google is a search engine for finding webpages, and thus it is useless for
finding information inside webpages. Google has to be complemented with
other tools in order to find the desired information inside the reported documents.
It is possible to create a search engine and parameterize it according
to our needs. For instance, one of the main novel approaches for searching
information in webpages is Nutch [6]. Nutch is built on top of Lucene [9],
which is an API for text indexing and searching. Nutch divides naturally into
two pieces: the crawler and the searcher. The crawler fetches pages and turns
them into an inverted index, which the searcher uses to answer users' search
queries. The interface between the two pieces is the index. This approach
allows the user to find pages related to a specified criterion; however, the level
of granularity is the webpage, and thus it does not allow one to find information
in a single webpage. Moreover, Nutch needs an index to work; hence, it cannot
be used with any Internet page as our tool can.
There is, however, a kind of tool that can be applied to any webpage:
web content filtering tools [7,2]. These tools are responsible for analyzing
the content of a webpage and deciding whether it contains a particular kind of
content in order to forbid access to the webpage. The main application of such
tools is preventing children from accessing pornographic or racist websites, e.g.,
from school computers. These tools are in some sense related to ours
because they are applicable to any webpage and they must analyze its
content. Notable examples are WebGuard [8] and the system developed by
Marinilli et al. [11]. The main difference is that, while these tools only say
whether a webpage contains a given kind of information, our approach says
where in the page that information is.
There are other tools that really filter single webpages, for instance,
WebWatcher [10] and WebFilter [13]. Both of them rely on the use of an external
proxy server that acts as an intermediate layer between the user and the
visited website. The proxy filters the pages before they reach the user.
Unfortunately, they are quite limited. For instance, in WebFilter, the user must
provide (in advance!) a script for each URL she wants to filter. This can be
useful for a frequently visited webpage with advertisements, but it cannot be
used with unknown webpages as our tool can. Another advanced filtering
system based on scripts is Digestor [3]. On the other hand, WebWatcher is an
assistant which helps the user to search through different pages by selecting
which links are most likely to lead to the desired information. The selection
is based on a probabilistic model which uses a user profile and previous
(training) searches.
Despite the great effort put into search engines for the web, there are very
few works devoted to finding information inside a single webpage. Most of them,
e.g., FilterGus, cannot work with online pages, need semantic
specifications, or cannot work directly with HTML code. This is the case, for
instance, of the Phil system [1], a tool to filter XML documents. Phil allows one
to extract relevant data as well as to exclude useless and misleading contents
from an XML document by matching patterns against it. Unfortunately, to
filter a document, a semantic specification is needed (which usually implies
knowing the document structure); moreover, Phil cannot handle HTML
pages directly. Another similar approach is [15], which is able to extract a slice
of an XML document w.r.t. a slicing criterion. This work also argues that XML
slices can be used to produce webpage slices in the case that the webpage has
been generated from an XML document via XSLT. Unfortunately, the number
of pages on the Internet generated via XSLT is very small and, moreover, in this
approach the filtering is done at the level of the XML document, which usually is
not public. Therefore, this tool cannot be used to filter Internet webpages
in general either. Finally, there is another work [14] in which a part of a web
application can be extracted by using the well-known program slicing technique
[16]. This technique uses the control and data dependences of a web
application (which often involves several pages) to extract a slice; hence, it is
useless for extracting a slice from a single static HTML webpage.
To the best of our knowledge, the Web Filtering Toolbar is the first tool
that can filter any webpage; it works online, and it is autonomous (it does not
need profiles, scripts, previously generated indexes, or semantic specifications
about the webpage that is going to be filtered).
6 Conclusions and Future Work
Nowadays, the Internet contains billions of webpages which are constantly being
browsed, queried, and modified by Internet users. Often, the amount of information
retrieved by a human user while searching cannot be absorbed in a
pleasant and/or understandable fashion. To address this situation, the scientific
community has invested a lot of research effort in producing several indexing
and search systems which reduce the inherent complexity of such a massive
amount of data. However, most current search engines consider webpages
as the basic unit of information to be provided to the user.
From our point of view, this is frequently an error which forces the user to
load, read, and scroll through a lot of useless information. In this work, we have
introduced a new approach to web filtering which allows the user to automatically
filter irrelevant information from a webpage by specifying a filtering criterion.
This approach has been implemented and integrated into the Firefox browser,
producing a very practical tool that has demonstrated its usefulness after an
intensive phase of testing by real users.
There are some directions for future work that are still open. Regarding the
implementation, the current release cannot handle frames, because frames imply
that a webpage can display several HTML documents at the same time. We are
currently developing a new version of the Web Filtering Toolbar which includes
a special treatment for webpages containing frames. Another implementation
extension we plan to add is the use of a lexicon during the filtering
phase. This will make it possible to filter webpages using alternative filtering
criteria semantically related to the one specified by the user. For instance, if the
user is searching for information related to "cars", the filter could also include in
the slice all the information related to "autos" and "automobiles". Finally, we
plan to extend our tool so that it is able to filter XML documents. Some
preliminary research in this direction has shown that our technique is directly
applicable to XML documents. We would like to take advantage of the fact
that, with XML documents, we have available a DTD which provides helpful
information about the structure of the document.
7 Acknowledgements
The author wishes to thank Mercedes García for her help in the implementation
of the first version of the Web Filtering Toolbar.
References
[1] M. Baggi and D. Ballis. A lazy implementation of a language for approximate
filtering of XML documents. Technical report, University of Udine, 2007.
[2] E. Bertino, E. Ferrari, and A. Perego. Content-based filtering of web documents:
the maX system and the EUFORBIA project. International Journal of
Information Security, 2(1):45–58, 2003.
[3] T.W. Bickmore, A. Girgensohn, and J.W. Sullivan. Web page filtering and
re-authoring for mobile users. The Computer Journal, 42(6):534–546, 1999.
[4] N. Blachman. Google guide: Making searching even easier. Accessed October
10th, 2007. URL: http://www.googleguide.com.
[5] Mozilla Developer Center. Cross-platform installer module (XPI). Accessed
October 10th, 2007. URL: http://developer.mozilla.org/en/docs/XPI.
[6] The Apache Software Foundation. Nutch version 0.7 tutorial. Accessed
October 10th, 2007. URL: http://lucene.apache.org/nutch/tutorial.pdf.
[7] M. Hammami, Y. Chahir, and L. Chen. WebGuard: A web filtering engine
combining textual, structural, and visual content-based analysis. IEEE Trans.
Knowl. Data Eng., 18(2):272–284, 2006.
[8] M. Hammami, Y. Chahir, and L. Chen. WebGuard: A web filtering engine
combining textual, structural, and visual content-based analysis. IEEE Trans.
Knowl. Data Eng., 18(2):272–284, 2006.
[9] E. Hatcher and O. Gospodnetic. Lucene in Action (In Action series). Manning
Publications Co., Greenwich, CT, USA, 2004.
[10] T. Joachims, D. Freitag, and T. M. Mitchell. WebWatcher: a tour guide
for the World Wide Web. In Martha E. Pollack, editor, Proceedings of the 15th
International Joint Conference on Artificial Intelligence (IJCAI 97), pages 770–
775, Nagoya, JP, 1997. Morgan Kaufmann Publishers, San Francisco, US.
[11] M. Marinilli, A. Micarelli, and F. Sciarrone. A hybrid case-based architecture
for information filtering on the web. In Sascha Schmitt and Ivo Vollrath,
editors, Challenges for Case-Based Reasoning - Proceedings of the ICCBR’99
Workshops, Seeon Monastery, Germany, July 27-30, 1999, pages 23–32.
University of Kaiserslautern, Computer Science, 1999.
[12] NetRatings. Nielsen NetRatings: Internet data & ratings. Accessed October
10th, 2007. URL: http://www.netratings.com.
[13] J. Pereira, F. Fabret, H.A. Jacobsen, F. Llirbat, and D. Shasha. WebFilter: A
high-throughput XML-based publish and subscribe system. In Proceedings of
the 27th International Conference on Very Large Data Bases (VLDB '01), pages
723–725, Orlando, September 2001. Morgan Kaufmann.
[14] Filippo Ricca and Paolo Tonella. Web application slicing. In ICSM, pages
148–157, 2001.
[15] Josep Silva. Slicing XML documents. Electr. Notes Theor. Comput. Sci.,
157(2):187–192, 2006.
[16] M.D. Weiser. Program Slicing. IEEE Transactions on Software Engineering,
10(4):352–357, 1984.