Modeling Implicit Knowledge from Microformatted Websites
J.G. Ramos1 and J. Silva2
1 Instituto Tecnológico de La Piedad, La Piedad, México
2 Departamento de Sistemas y Computación, Universidad Politécnica de Valencia, Valencia, Spain
Abstract— Microformats are a technique to incorporate
semantics into web documents by means of standard XHTML
tags enriched with particular attributes. They are a set of
simple and open metadata that describe semantic units of
information which are called classes.
Despite their practicality, microformats lack a formal
model for representing relations between semantic units of
information and, hence, they do not describe knowledge.
This contrasts with semantic technologies that already
represent knowledge, such as RDF.
To address this problem, in this work we introduce
a formal model able to represent the classes provided by
microformats and their relations. In essence, we model
the semantic information in a graph-like structure, namely a
semantic network, whose edges are labeled with predicates.
We show how semantic networks allow us to model microformats in a very convenient way for reasoning and knowledge
extraction.
Keywords: Semantic Web, Microformats, Information retrieval.
1. Introduction
The most important difference between the Web and
the Semantic Web is that the latter incorporates
metadata into its web documents. Metadata provides the
web contents with descriptions, meaning and inter-relations.
Thus, the semantics of information and services on the web
is made explicit by adding metadata [1].
Many technologies for developing the Semantic Web
are already in use. For instance the Resource Description
Framework (RDF) [2] provides a markup language based
on the eXtensible Markup Language (XML) [3] in order
to describe data resources. However, RDF requires the
definition of special vocabularies (i.e., ontologies). To this
end, the Web Ontology Language (OWL) and RDF Schema
(RDFS) are useful in creating hierarchical vocabularies [4].
Nevertheless, efforts to extend the Web with meaning have
gained little traction, and these initiatives have been too
much trouble to implement at a large scale (see, e.g., the
discussion in [5]).
A novel approach for adding semantics to the Web is
Microformats [6], [7], [9]. Microformats are an initiative
that seeks to attach semantic data to web pages by
using simple extensions of the standard tags currently used
for web formatting in XHTML.
Example 1: Consider the following microformatted
XHTML code that describes information of a common
personal card.
<div class="vcard">
<span class="fn">Germán Vidal</span>
<div class="org">Tech. Univ. Of Valencia</div>
<div class="adr">
<div class="street-address">
Camino de Vera s/n, Room 2D42, DSIC building
</div>
<span class="locality">Valencia, Spain</span>
<span class="postal-code">E-46022</span>
</div>
<div class="tel">+34-96-387-7007</div>
</div>
This XHTML code uses the standard hCard microformat
[8], which is useful for representing people, companies,
organizations, and places data.
The class property qualifies each type of attribute
defined by the hCard microformat. The code starts with
the required main class vcard and classifies the information
with a set of self-explanatory classes: fn describes
name information, adr defines address details, and so on.
Observe that thanks to hCard attributes, the semantics of
this code is directly processable by current email clients,
PDAs, etc.
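To illustrate how such markup can be processed mechanically, the following minimal sketch lists the class attributes found in the code of Example 1. It is not part of the original proposal: the parser class `ClassCollector` and the constant `HCARD` are hypothetical names introduced here, and only Python's standard `html.parser` module is assumed.

```python
from html.parser import HTMLParser

# The hCard of Example 1, copied verbatim.
HCARD = """<div class="vcard">
<span class="fn">Germán Vidal</span>
<div class="org">Tech. Univ. Of Valencia</div>
<div class="adr">
<div class="street-address">
Camino de Vera s/n, Room 2D42, DSIC building
</div>
<span class="locality">Valencia, Spain</span>
<span class="postal-code">E-46022</span>
</div>
<div class="tel">+34-96-387-7007</div>
</div>"""


class ClassCollector(HTMLParser):
    """Hypothetical helper: records every class attribute in document order."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.append(value)


collector = ClassCollector()
collector.feed(HCARD)
print(collector.classes)
```

The sketch only enumerates the classes; it does not yet distinguish the information they qualify.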
From the code in Example 1 it is easy to see that one
of the most important advantages of microformats is their
simplicity of use and treatment. Indeed, one could argue that
microformats are basically augmented XHTML, but this
view underestimates the potential semantic contribution
of microformats.
Paradoxically, although microformats are widely used (currently, millions of instances are already in
use on the web [7]), there is no standard model for
representing semantic interrelations between microformats
or for representing knowledge in webpages or websites.
This is one reason why microformats are still not considered a
technology of the Semantic Web. Indeed, they are referred to
as the lowercase semantic web [9].
The lowercase semantic web requires new formal models,
methods and tools in order to represent and query the
embedded knowledge. For instance, in the Semantic Web
setting we can employ RDF to build a graph that represents
knowledge and, furthermore, we can extract information and
reason about it. Some approaches address the
problem of knowledge representation, but they rely on the
application of transformations from microformats to RDF
using a mechanism for Gleaning Resource Descriptions from
Dialects of Languages (GRDDL [19]).
In this work, we aim to match the expressivity of
RDF for knowledge representation and combine it with the
simplicity and power of microformats. As in the RDF setting,
we consider knowledge as sentences formed by the triple:
subject, predicate and object [4].
In particular we propose the use of semantic networks
(a convenient model for representing semantic data [10])
in order to model the knowledge which is implicit in
microformats. A semantic network is often used as a form
of knowledge representation [11]; and it is formalized as a
graph whose vertices represent classes (subjects or objects),
and whose edges represent semantic relations between the
concepts (predicates).
Once the implicit knowledge in microformats is modeled
in a semantic network, formal methods for information
extraction are needed to ensure a systematic and sound
treatment of the information. A few other approaches exist for modeling microformat interrelations;
however, these models (see, e.g., [12]) are devoted more
to efficiency aspects of information extraction than to
knowledge (sentence) representation. The main advantage
of our approach is that it is particularly appropriate for
knowledge representation and extraction.
The main contributions of this paper can be summarized
as follows:
• We adapt the concept of semantic network in order
to model the implicit knowledge which is present in
microformatted webpages and websites.
• We show how it is possible to extract knowledge from
microformats without applying transformations to RDF.
The rest of the paper is organized as follows. In Section 2,
we overview the topic of semantic networks and recall the
basic concepts related to them. In Section 3, we describe how
semantic networks can be built from microformatted web pages and websites. Then,
in Section 4, we contextualize our method for knowledge
extraction w.r.t. RDF. And finally, in Section 5 we discuss
the relevance of the method and conclude.
2. Semantic Networks
The concept of semantic network is fairly old in the
literature of cognitive science and artificial intelligence. In fact,
the term dates back to Ross Quillian’s
works [13], where he introduced it as a way of talking
about the organization of human semantic memory.
Nevertheless, it remains a common structure for knowledge representation, useful in a variety of modern problems
of artificial intelligence [11]. For instance, in the recent
Semantic Network Analysis Workshops [14], [15] many
applications of this formalism were discussed, e.g., for social
networks or hypertext networks.
A semantic network is a directed graph consisting of
nodes which represent concepts and edges which represent
semantic relations between the concepts. Sowa [16], [10]
introduced a classification of semantic networks, in which
the type of definitional networks emphasizes the subtype or
is-a relation between a concept type and a newly defined
subtype. This is the kind of semantic network that we will
use in this paper. In Figure 1, we present a typical example.
3. Modeling semantic knowledge from
microformatted websites
Roughly speaking, our method for modeling semantic
web knowledge in microformatted websites is composed of
two steps:
1) We define a model for representing isolated (microformatted) web pages.
2) We extend the model by adding new descriptors for
semantic relationships between web pages.
3.1 Constructing the semantic network from
the microformatted web pages
In order to represent semantic information in a semantic
network we should decide what is the relevant information
to be gathered and what we expect from a web knowledge
extraction query. In this work, we consider the microformats,
i.e., classes, as convenient entities for modeling, and then,
for indexing or referencing. In particular, classes are the
only annotated information and, hence, only they provide
semantic information to be modeled. For instance, in Example 1 we see that information is qualified with the classes of
the hCard microformat. Additional formatting information
such as fonts, layers, colors, etc. must be specified outside
the classes.
It should be clear that a microformatted web page contains
many instances of defined units of information (i.e., classes
that have associated metadata: org, url, locality,
etc.) and these instances can be repeated (i.e., when the same
metadata appears n times, it has n instances). Moreover,
classes can be classified as:
• Valued classes. They associate a value with the metadata. For instance, <span class = "locality">
Valencia, Spain </span>.
• Container classes. They are composed of other classes
and are used to organize the information hierarchically. For instance, vcard and adr.
Definition 2 (qualified information): Given a microformatted webpage $P$, each microformat class of $P$ is represented with a pair $(i, c)$, where $i$ is a label that uniquely
identifies class $c$ of $P$. Given a microformat class $(i, c)$ of the
form <XHTMLlabel class=c> $v_c$ </XHTMLlabel>,
$v_c$ is the value of $c$ and it is represented with the triple
$(i, c, v_c)$. Given a microformatted webpage $P$, the qualified
information of $P$ is $Q = \{v \mid (i, c, v) \in P\}$.
Fig. 1: A definitional semantic network.
In Example 1, (tel1, tel) and (vcard1, vcard) are
some microformat classes. The valued classes are
{fn, org, street-address, locality, postal-code, tel},
while vcard and adr are not because they do not have an
associated value; ‘Valencia, Spain’ is the value of locality,
‘E-46022’ is the value of postal-code, etc. Finally,
the qualified information is Q = {‘Germán Vidal’, ‘Tech.
Univ. Of Valencia’, ‘Camino de Vera s/n, Room 2D42,
DSIC building’, ‘Valencia, Spain’, ‘E-46022’, ‘+34-96-387-7007’}.
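The computation of the qualified information of Example 1 can be sketched as follows. This is only an illustration under stated assumptions: indexes are generated by appending an occurrence counter to the class name (as in vcard1, adr1), a class is treated as a container when it encloses other classes and as valued otherwise, and the names `QualifiedInfoParser` and `qualified_information` are hypothetical. The markup is assumed to be well nested and free of void elements such as `<br>`.

```python
from collections import Counter
from html.parser import HTMLParser

HCARD = """<div class="vcard">
<span class="fn">Germán Vidal</span>
<div class="org">Tech. Univ. Of Valencia</div>
<div class="adr">
<div class="street-address">
Camino de Vera s/n, Room 2D42, DSIC building
</div>
<span class="locality">Valencia, Spain</span>
<span class="postal-code">E-46022</span>
</div>
<div class="tel">+34-96-387-7007</div>
</div>"""


class QualifiedInfoParser(HTMLParser):
    """Hypothetical helper: builds one record per microformat class."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()  # occurrences seen so far, per class name
        self.stack = []          # currently open elements (None if unclassed)
        self.nodes = []          # class records in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls is None:
            self.stack.append(None)
            return
        self.counts[cls] += 1
        node = {"index": f"{cls}{self.counts[cls]}", "cls": cls,
                "text": [], "children": []}
        # Attach the record to the nearest enclosing class, if any.
        for parent in reversed(self.stack):
            if parent is not None:
                parent["children"].append(node)
                break
        self.stack.append(node)
        self.nodes.append(node)

    def handle_data(self, data):
        if self.stack and self.stack[-1] is not None:
            self.stack[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()


def qualified_information(markup):
    """Return (container classes as (i, c), valued classes as (i, c, v))."""
    parser = QualifiedInfoParser()
    parser.feed(markup)
    containers, triples = [], []
    for node in parser.nodes:
        if node["children"]:
            containers.append((node["index"], node["cls"]))
        else:
            value = " ".join("".join(node["text"]).split())
            triples.append((node["index"], node["cls"], value))
    return containers, triples


containers, triples = qualified_information(HCARD)
Q = [value for _, _, value in triples]  # a list, to keep document order
print(containers)
print(Q)
```

Here Q is kept as a list rather than a set only so that the values appear in document order.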
Given a container class (i, c), we often refer to the
contained classes of c as S(c), and to the indexes
of the contained classes of c as S(i).
For instance, in Example 1, (adr1, adr) and (vcard1, vcard) are container classes, with
S(vcard) = {fn, org, adr, tel} and S(vcard1) =
{fn1, org1, adr1, tel1}.
Definition 3 (semantic network): A directed graph is an
ordered pair $G = (V, E)$ where $V$ is a finite set of vertices
or nodes, and $E \subseteq V \times V$ is a set of ordered pairs $(v \rightarrow v')$
with $v, v' \in V$ called edges.
Given a microformatted webpage $P$ with a set of container
classes $C$ and qualified information $Q$, the semantic network
of $P$ is a directed graph $S = (V_s, E_s)$ where $V_s = Q \cup I$,
with $I = \{i \mid (i, c) \in C\}$; and $(v \xrightarrow{l} v') \in E_s$ iff $\exists (i_1, c_1) \in C$
with $i_2 \in S(i_1)$ and
• $(i_2, c_2) \in C$, $v = i_1$, $v' = i_2$ and $l = c_2$, or
• $(i_2, c_2, v_2) \in Q$, $v = i_1$, $v' = v_2$ and $l = c_2$.
Roughly speaking, the semantic network is a tree whose
leaves contain the information described by the metadata introduced in the microformat, whose internal nodes are (instances
of) classes used to classify the information, and whose edges are
labeled with classes.
Example 4: As an example of semantic network consider
the directed graph in Figure 2 constructed from Example 1.
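The construction of Definition 3 can be sketched in a few lines. The data below encodes Example 1 by hand; the dictionary names `CONTAINERS`, `CONTAINED` and `VALUED`, the indexes of the valued classes (fn1, org1, ...) and the function `semantic_network` are assumptions made for this illustration, not notation from the definition.

```python
# Container classes C, as index -> class.
CONTAINERS = {"vcard1": "vcard", "adr1": "adr"}

# S(i): indexes of the classes contained in each container.
CONTAINED = {
    "vcard1": ["fn1", "org1", "adr1", "tel1"],
    "adr1": ["street-address1", "locality1", "postal-code1"],
}

# Valued classes, as index -> (class, value).
VALUED = {
    "fn1": ("fn", "Germán Vidal"),
    "org1": ("org", "Tech. Univ. Of Valencia"),
    "tel1": ("tel", "+34-96-387-7007"),
    "street-address1": ("street-address",
                        "Camino de Vera s/n, Room 2D42, DSIC building"),
    "locality1": ("locality", "Valencia, Spain"),
    "postal-code1": ("postal-code", "E-46022"),
}


def semantic_network(containers, contained, valued):
    """Return (vertices, edges); each edge is (source, label, target)."""
    # Vertices are the container indexes plus the qualified information.
    vertices = set(containers) | {value for _, value in valued.values()}
    edges = set()
    for i1 in containers:
        for i2 in contained[i1]:
            if i2 in containers:
                # First case: the contained class is itself a container.
                edges.add((i1, containers[i2], i2))
            elif i2 in valued:
                # Second case: the contained class carries a value.
                c2, v2 = valued[i2]
                edges.add((i1, c2, v2))
    return vertices, edges


vertices, edges = semantic_network(CONTAINERS, CONTAINED, VALUED)
for edge in sorted(edges):
    print(edge)
```

Each edge is labeled with the class of its target, so the container adr1 is reached from vcard1 through the label adr.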
Once we have formalized the definition of semantic network, we can represent microformats in a structured way.
Now, we adapt the notion of sentence used in RDF to be
used with semantic networks.
Definition 5 (sentence): Given a semantic network
$S = (V_s, E_s)$, a sentence of $S$ is a triple
$s = \langle \mathit{subject}, \mathit{predicate}, \mathit{value} \rangle$ where $v \xrightarrow{p} v' \in E_s$,
$\mathit{subject} = v$, $\mathit{predicate} = p$ and $\mathit{value} = v'$.
In order to extract knowledge from a semantic network
we can build its associated adjacency matrix. An adjacency
matrix is a simple implementation of a semantic network
in which rows represent subjects, columns represent values/objects
of sentences, and each cell contains the predicate induced by
its row and column.
Example 6: The adjacency matrix of the semantic network of Figure 2 is shown in Figure 3.
The sentences extracted from the adjacency matrix are:
1) ⟨vcard1, fn, Germán Vidal⟩
2) ⟨vcard1, org, Tech. Univ. Of Valencia⟩
3) ⟨vcard1, adr, adr1⟩
4) ⟨vcard1, tel, +34-96-387-7007⟩
5) ⟨adr1, street-address, Camino de Vera s/n, Room 2D42, DSIC building⟩
6) ⟨adr1, locality, Valencia, Spain⟩
7) ⟨adr1, postal-code, E-46022⟩
They can be interpreted as: the personal card vcard1 has as
name Germán Vidal, the personal card vcard1 has an organization
which is Tech. Univ. Of Valencia, and so on.
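As a rough illustration of this step, the sketch below builds an adjacency matrix from the labeled edges of Example 1 and reads the sentences back from it. The list-of-lists layout and the function names `adjacency_matrix` and `sentences` are assumptions made for this example; the layout of Figure 3 may differ.

```python
# Labeled edges of the semantic network of Example 1: (subject, predicate, value).
EDGES = [
    ("vcard1", "fn", "Germán Vidal"),
    ("vcard1", "org", "Tech. Univ. Of Valencia"),
    ("vcard1", "adr", "adr1"),
    ("vcard1", "tel", "+34-96-387-7007"),
    ("adr1", "street-address", "Camino de Vera s/n, Room 2D42, DSIC building"),
    ("adr1", "locality", "Valencia, Spain"),
    ("adr1", "postal-code", "E-46022"),
]


def adjacency_matrix(edges):
    """Rows are subjects, columns are values/objects, cells hold predicates."""
    rows, cols = [], []
    for subject, _, value in edges:
        if subject not in rows:
            rows.append(subject)
        if value not in cols:
            cols.append(value)
    matrix = [[None] * len(cols) for _ in rows]
    for subject, predicate, value in edges:
        matrix[rows.index(subject)][cols.index(value)] = predicate
    return rows, cols, matrix


def sentences(rows, cols, matrix):
    """Read one (subject, predicate, value) triple per non-empty cell."""
    return [
        (rows[r], matrix[r][c], cols[c])
        for r in range(len(rows))
        for c in range(len(cols))
        if matrix[r][c] is not None
    ]


rows, cols, matrix = adjacency_matrix(EDGES)
extracted = sentences(rows, cols, matrix)
for sentence in extracted:
    print(sentence)
```

A cell-per-pair matrix stores a single predicate between a given subject and value, which is sufficient for the tree-shaped network of a single hCard.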
3.2 Extending the semantic network for websites
In many applications such as information brokers [20] it
is necessary to discover a set of microformats of similar type
among many pages. For instance, a common task is finding
all personal cards vcard [8] of a given domain in order to
be added to an electronic appointment book.
This requires representing many instances of a microformat, either within one webpage or across different web pages. To
this end, we extend the semantic network with a predicate relates to,
whose goal is to build a relation between two different
instances of container classes.
Given a web page with semantic network $(V, E)$ and
three container classes $(i_1, c_1)$, $(i_2, c_2)$ and $(i_3, c_3)$
with $i_1, i_2, i_3 \in V$ and $c_1 = c_2 = c_3$,
we construct the edges $i_1 \xrightarrow{r} i_2$, $i_2 \xrightarrow{r} i_3$ and $i_3 \xrightarrow{r} i_1$,
where $r$ is the predicate relates to.
Roughly speaking, we build a cycle among instances of
the same microformat type. This is useful to extract all
similar microformats in a website.
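The cycle construction can be sketched as follows. The text above states it for three instances; generalizing it to any number of instances of the same class is an assumption of this example, as are the function name `relates_to_edges` and the sample instance list.

```python
def relates_to_edges(instances):
    """Link instances of the same container class in a cycle.

    `instances` is a list of (index, class) pairs in document order.
    """
    groups = {}
    for index, cls in instances:
        groups.setdefault(cls, []).append(index)

    edges = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a single instance has nothing to relate to
        for k, source in enumerate(members):
            # The last instance points back to the first, closing the cycle.
            target = members[(k + 1) % len(members)]
            edges.append((source, "relates to", target))
    return edges


# Hypothetical page with three vcards, only the first of which has an adr.
INSTANCES = [("vcard1", "vcard"), ("adr1", "adr"),
             ("vcard2", "vcard"), ("vcard3", "vcard")]

cycle = relates_to_edges(INSTANCES)
for edge in cycle:
    print(edge)
```

Starting from any vcard instance and following the relates to edges visits every other vcard of the page before returning to the start.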
Fig. 2: A semantic network from Example 1.
Fig. 3: The adjacency matrix for Figure 2.
Fig. 4: The semantic network of a real website of Example 7.
Example 7: Let us consider the website
http://health-25.europages.co.uk/businessdirectory-europe/did-25/hc-21605/Hospitaland-medical-services.html
that contains information about twenty places of medical services in
Europe, i.e., there are twenty vcards. The following fragment of
code corresponds to the first vcard microformat:
<div class="vcard">
...
<div class="fn org">
CASA DI CURA SAN PIO X
</div>
...
<div class="adr">
<span class="street-address postal-code locality">
31, V. Nava 20159 MILANO (MI)
</span>
<span class="country-name">ITALY</span>
...
<span class="tel">+39 0269 511</span>
</span>
</div>
...
</div>
In the above code, we identify container and valued classes in
order to build the semantic network. In Figure 4, the artificial
relationship relates to is included, and it produces a cycle
between vcard instances. Consequently, a fragment of the
implicit knowledge is as follows:
1) ⟨vcard1, fn org, CASA DI CURA SAN PIO X⟩
2) ⟨vcard1, adr, adr1⟩
3) ⟨adr1, country-name, ITALY⟩
4) ⟨adr1, tel, +39 0269 511⟩
5) ⟨adr1, street-address postal-code locality, 31, V. Nava 20159 MILANO (MI)⟩
6) ⟨vcard1, relates to, vcard2⟩
7) . . .
The knowledge should be completed with sentences from
the twenty microformats of the real website. The extracted
knowledge could be useful for verbose reports produced by
software tools.
4. Contextualizing the model
The introduced model based on semantic networks was
inspired by RDF graphs. In this section we describe some
similarities between RDF and our knowledge model.
Example 8: Let us consider the following information
described with RDF and its associated graph, constructed
with the RDF Validator [17], in Figure 5.
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
<foaf:Person>
<foaf:name>Johnny Colt</foaf:name>
<foaf:mbox rdf:resource="mailto:jcolt@domain.com"/>
<foaf:homepage rdf:resource="http://www.jcolt.com/"/>
<foaf:nick>Johnny</foaf:nick>
<foaf:interest>
<rdf:Description
rdf:about="http://www.abc.org" rdfs:label="ABC"/>
</foaf:interest>
<foaf:knows>
<foaf:Person>
<foaf:name>Wendy Brown</foaf:name>
</foaf:Person>
</foaf:knows>
</foaf:Person>
</rdf:RDF>
The above RDF code introduces descriptions that use the
vocabulary of the ontology called FOAF (Friend Of A
Friend [18]), which is useful to specify information about people and their
friendship relations. We can observe metadata such as interest,
nick, homepage, etc., which are associated with particular
information. Observe that they are similar to our valued
classes. Moreover, there are container classes; for instance,
the metadata knows is converted to an artificial name in the
graph. We also keep this design principle, which
preserves each metadata item as the predicate of sentences. In general,
metadata are used as descriptors of some attribute, and thus they
are correctly considered as predicates of a sentence.
Now, let us observe in Figure 6 two representative sentences extracted from the graph of Figure 5. Both sentences
have been generated with the RDF Validator application
[17]. The first sentence is formed from two container metadata, i.e., we observe a sentence with an artificial name as
subject and another as object, and the semantics between them
is stated by the predicate knows. We proceed in a similar
way with vcard1 and adr1. The second sentence relates an
artificial name (a container class) with a particular value
"Wendy Brown" mediated by the predicate name. This is a
typical case of a subject (container class), predicate (valued
class) and information (value of a class). Therefore, the
expressive power of RDF is similar to that of the semantic network
representation of microformats.
Two reasons for using microformats are the following:
• Simplicity. Thousands and even millions of webpages
are published every day. It is impossible to ensure semantic preparation of these web documents by sophisticated annotating schemas. In contrast, microformats
offer a simple way for semantic preparation by web
developers [7].
• Vocabularies (ontologies) are not required. People can
develop their own ontologies and, eventually, millions of
ontologies can exist. With microformats the goal is
to use centralized standard vocabularies; this reduces
the effort needed to synchronize the employed terminology.
Both reasons to use microformats translate into reasons to use the representation model based on semantic
networks. Microformats are simple, so the model should
remain simple.
The vocabularies to produce sentences are already defined
in the metadata provided by microformats. Hence the model
should preserve such descriptors, i.e., the predicates are the
proper metadata defined for each microformat.
The standard GRDDL [19] is useful to transform microformats to RDF and then for knowledge extraction; however,
we propose that knowledge can be extracted directly from
microformats.
5. Conclusions
To the best of our knowledge, this is the first proposal
for knowledge modeling by employing microformats. This
is a problem of particular interest that was previously resolved
by first transforming microformats to RDF by employing
GRDDL [19]. However, we consider that it is possible to
maintain a simple schema (like microformats) in a simple
model of knowledge by means of semantic networks. In
this way, it is possible to get sentences directly from
microformats.
Despite the simplicity of microformats, there is an increasing number of developers that adopt their use in thousands
of websites every few months (see, e.g., [7]).
We can observe many potential and interesting applications of our approach. For instance, it can be used for
developing tools for knowledge extraction focused on
price comparison, automatic generation of academic exams,
automatic discovery of offers, etc. Moreover, extracted
knowledge is useful for producing verbose reports, which
could be exploited by, for instance, software agents.
Fig. 5: The RDF graph from Example 8.
Fig. 6: Two sentences from the RDF graph in Figure 5.
6. Acknowledgements
This work has been partially supported by the Spanish
Ministerio de Ciencia e Innovación under grant TIN2008-06622-C03-02, by the Generalitat Valenciana under grant
ACOMP/2010/042, by the Universidad Politécnica de Valencia (Program PAID-06-08) and by the Mexican Dirección
General de Educación Superior Tecnológica.
References
[1] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web.
Scientific American Magazine, May 2001.
[2] World Wide Web Consortium. Resource Description Framework
(RDF). URL: http://www.w3.org/RDF/, 2010.
[3] World Wide Web Consortium. Extensible Markup Language (XML).
URL: http://www.w3.org/XML/, 2009.
[4] Liyang Yu. Introduction to the Semantic Web and Semantic Web
Services. Chapman & Hall/CRC, 2007.
[5] T. Çelik. What’s the Next Big Thing on the Web? It May Be a Small,
Simple Thing - Microformats. Knowledge@Wharton, 2005.
[6] Microformats.org. The Official Microformats Site.
http://microformats.org/, 2009.
[7] R. Khare and T. Çelik. Microformats: a Pragmatic Path to the
Semantic Web. In WWW ’06: Proceedings of the 15th International
Conference on World Wide Web, pages 865–866. ACM, 2006.
[8] hCard. Simple, Open, Distributed Format for Representing
People, Companies, Organizations, and Places.
http://microformats.org/wiki/hcard, 2009.
[9] R. Khare. Microformats: The Next (Small) Thing on the Semantic
Web? IEEE Internet Computing, 10(1):68–75, 2006.
[10] J. F. Sowa. Semantic Networks. In S. C. Shapiro, editor, Encyclopedia
of Artificial Intelligence. John Wiley & Sons, 1992.
[11] R.T.N. Shetty, P.M. Riccio, and J. Quinqueton. Extended semantic network for knowledge representation. In Zhongzhi Shi, K. Shimohara,
and David Dagan Feng, editors, Intelligent Information Processing,
volume 228 of IFIP, pages 135–144. Springer, 2006.
[12] J. Guadalupe Ramos, Josep Silva, Gustavo Arroyo, and Juan C. Solorio.
A Technique for Information Retrieval from Microformatted Websites.
Lecture Notes in Computer Science, 5947/2010:344–351, 2010.
[13] R. Quillian. Semantic Memory. In Marvin Minsky, editor, Semantic
Information Processing. MIT Press, 1969.
[14] Gerd Stumme, Bettina Hoser, Christoph Schmitz, and Harith Alani,
editors. ISWC 2005 Workshop on Semantic Network Analysis, volume
171 of CEUR Workshop Proceedings, Galway, Ireland, 2005.
[15] Harith Alani, Bettina Hoser, Christoph Schmitz, and Gerd Stumme,
editors. Proceedings of the 2nd Workshop on Semantic Network
Analysis, 2006.
[16] J. F. Sowa, editor. Principles of Semantic Networks: Explorations in
the Representation of Knowledge. Morgan Kaufmann, 1991.
[17] World Wide Web Consortium. RDF Validation Service. URL:
http://www.w3.org/RDF/Validator/, 2007.
[18] Dan Brickley and Libby Miller. FOAF Vocabulary Specification 0.97.
URL: http://xmlns.com/foaf/spec/, 2010.
[19] World Wide Web Consortium. Gleaning Resource
Descriptions from Dialects of Languages (GRDDL). URL:
http://www.w3.org/TR/grddl/, 2007.
[20] Richard MacManus. Mozilla Does Microformats: Firefox 3 as
Information Broker. URL:
http://www.readwriteweb.com/archives/mozilla_does_microformats_firefox3.php, 2007.