Modeling Implicit Knowledge from Microformatted Websites

J.G. Ramos¹ and J. Silva²
¹ Instituto Tecnológico de La Piedad, La Piedad, México
² Departamento de Sistemas y Computación, Universidad Politécnica de Valencia, Valencia, Spain

Abstract: Microformats are a technique to incorporate semantics into web documents by means of standard XHTML tags enriched with particular attributes. They are a set of simple and open metadata that describe semantic units of information, which are called classes. Despite their practicality, microformats lack a formal model for representing relations between semantic units of information and, hence, they do not describe knowledge. This contrasts with semantic technologies that already represent knowledge, such as RDF. To address this problem, in this work we introduce a formal model able to represent the classes provided by microformats and their relations. In essence, we model the semantic information in a graph-like structure, namely a semantic network, whose edges are labeled with predicates. We show how semantic networks allow us to model microformats in a way that is very convenient for reasoning and knowledge extraction.

Keywords: Semantic Web, Microformats, Information retrieval.

1. Introduction

The most important difference between the Web and the Semantic Web is that the latter incorporates metadata into its web documents. Metadata provides the web contents with descriptions, meaning and inter-relations. Thus, the semantics of information and services on the web is made explicit by adding metadata [1]. Many technologies for developing the Semantic Web are already in use. For instance, the Resource Description Framework (RDF) [2] provides a markup language based on the eXtensible Markup Language (XML) [3] in order to describe data resources. However, RDF requires the definition of special vocabularies (i.e., ontologies).
For this aim, the Web Ontology Language (OWL) and RDF Schema (RDFS) are useful in creating hierarchical vocabularies [4]. Nevertheless, efforts to extend the Web with meaning have gained little traction, and these initiatives have proved too troublesome to implement at a large scale (see, e.g., the discussion in [5]).

A novel approach for adding semantics to the Web is microformats [6], [7], [9]. Microformats are an initiative that seeks to attach semantic data to web pages by using simple extensions of the standard tags currently used for web formatting in XHTML.

Example 1: Consider the following microformatted XHTML code that describes the information of a common personal card.

    <div class="vcard">
      <span class="fn">Germán Vidal</span>
      <div class="org">Tech. Univ. Of Valencia</div>
      <div class="adr">
        <div class="street-address">
          Camino de Vera s/n, Room 2D42, DSIC building
        </div>
        <span class="locality">Valencia, Spain</span>
        <span class="postal-code">E-46022</span>
      </div>
      <div class="tel">+34-96-387-7007</div>
    </div>

This XHTML code uses the standard hCard microformat [8], which is useful for representing data about people, companies, organizations, and places. The class property qualifies each type of attribute defined by the hCard microformat. The code starts with the required main class vcard and classifies the information with a set of classes which are self-explanatory: fn describes name information, adr defines address details, and so on. Observe that, thanks to the hCard attributes, the semantics of this code is directly processable by current email clients, PDAs, etc.

From the code in Example 1 it is easy to see that one of the most important advantages of microformats is their simplicity of use and treatment. Indeed, one could argue that microformats are basically augmented XHTML, but this assumption underestimates the potential semantic contribution of microformats.
Paradoxically, even though microformats are widely used (currently, millions of instances of them are already in use on the web [7]), there is no standard model for representing semantic interrelations between microformats, nor for knowledge representation in webpages or websites. This is one reason why microformats are still not considered a technology of the Semantic Web. Indeed, they are referred to as the lowercase semantic web [9].

The lowercase semantic web requires new formal models, methods and tools in order to represent and query the embedded knowledge. For instance, in the Semantic Web setting we can employ RDF to build a graph that represents knowledge and, furthermore, we can extract information and reason about it. There exist some approaches that face the problem of knowledge representation, but they rely on transformations from microformats to RDF using a mechanism for Gleaning Resource Descriptions from Dialects of Languages (GRDDL [19]).

In this work, we try to match the expressivity of RDF for knowledge representation and combine it with the simplicity and power of microformats. As in the RDF setting, we consider knowledge as sentences formed by the triple subject, predicate and object [4]. In particular, we propose the use of semantic networks (a convenient model for representing semantic data [10]) in order to model the knowledge which is implicit in microformats. A semantic network is often used as a form of knowledge representation [11]; it is formalized as a graph whose vertices represent classes (subjects or objects), and whose edges represent semantic relations between the concepts (predicates). Once the implicit knowledge in microformats is modeled in a semantic network, formal methods for information extraction are needed to ensure a systematic and sound treatment of the information.
There exist a few other approaches for modeling the interrelations of microformats. However, these models (see, e.g., [12]) are devoted more to efficiency aspects of information extraction than to the representation of knowledge (sentences). The main advantage of our approach is that it is particularly appropriate for knowledge representation and extraction.

The main contributions of this paper can be summarized as follows:
• We adapt the concept of semantic network in order to model the implicit knowledge which is present in microformatted webpages and websites.
• We show how it is possible to extract knowledge from microformats without applying transformations to RDF.

The rest of the paper is organized as follows. In Section 2, we overview the topic of semantic networks and recall the basic concepts related to them. In Section 3, we describe how semantic networks can be built from the semantic web. Then, in Section 4, we contextualize our method for knowledge extraction w.r.t. RDF. Finally, in Section 5 we discuss the relevance of the method and conclude.

2. Semantic Networks

The concept of semantic network is fairly old in the literature of cognitive science and artificial intelligence. In fact, the term dates back to Ross Quillian's work [13], where he introduced it as a way of talking about the organization of human semantic memory. Nevertheless, it remains a common structure for knowledge representation, which is useful in a variety of modern artificial intelligence problems [11]. For instance, in the recent Semantic Network Analysis Workshops [14], [15] many applications of this formalism were discussed, e.g., for social networks or hypertext networks.

A semantic network is a directed graph consisting of nodes which represent concepts and edges which represent semantic relations between the concepts.
Sowa [16], [10] introduced a classification of semantic networks, in which the type of definitional networks emphasizes the subtype or is-a relation between a concept type and a newly defined subtype. This is the kind of semantic network that we will use in this paper. In Figure 1, we present a typical example.

3. Modeling semantic knowledge from microformatted websites

Roughly speaking, our method for modeling semantic web knowledge in microformatted websites is composed of two steps:
1) We define a model for representing isolated (microformatted) web pages.
2) We extend the model by adding new descriptors for semantic relationships between web pages.

3.1 Constructing the semantic network from the microformatted web pages

In order to represent semantic information in a semantic network we should decide what relevant information is to be gathered and what we expect from a web knowledge extraction query. In this work, we consider the microformats, i.e., classes, as convenient entities for modeling and, then, for indexing or referencing. In particular, classes are the only annotated information and, hence, only they provide semantic information to be modeled. For instance, in Example 1 we see that the information is qualified with the classes of the hCard microformat. Additional formatting information such as fonts, layers, colors, etc. must be specified outside the classes.

It should be clear that a microformatted web page contains many instances of defined units of information (i.e., classes which have associated metadata: org, url, locality, etc.) and these instances can be repeated (i.e., when the same metadata appears n times, it has n instances). Moreover, classes can be classified as:
• Valued classes. They associate a value with the metadata. For instance, <span class = "locality"> Valencia, Spain </span>.
• Container classes. They are composed of other classes and are used to hierarchize the information. For instance, vcard and adr.
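As an illustration of this classification, the following Python sketch splits the classes of Example 1 into valued and container classes. It is only a sketch under stated assumptions: the names HCARD, ClassSplitter and split_classes are hypothetical, the markup is assumed to be well-formed XHTML (every opening tag has a matching closing tag), and a class is treated as a container exactly when another classed element is nested inside it.

```python
from html.parser import HTMLParser

# Example 1, reproduced as a string (hypothetical constant name).
HCARD = """
<div class="vcard">
  <span class="fn">Germán Vidal</span>
  <div class="org">Tech. Univ. Of Valencia</div>
  <div class="adr">
    <div class="street-address">
      Camino de Vera s/n, Room 2D42, DSIC building
    </div>
    <span class="locality">Valencia, Spain</span>
    <span class="postal-code">E-46022</span>
  </div>
  <div class="tel">+34-96-387-7007</div>
</div>
"""


class ClassSplitter(HTMLParser):
    """Split microformat classes into valued and container classes.

    Assumes well-formed XHTML; unclosed void tags would unbalance the stack.
    """

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open elements
        self.valued = {}         # class name -> value (last instance wins)
        self.containers = []     # container class names, in closing order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls is not None:
            # The nearest enclosing classed element now contains a class.
            for entry in reversed(self.open_elements):
                if entry["cls"] is not None:
                    entry["has_child_class"] = True
                    break
        self.open_elements.append(
            {"cls": cls, "has_child_class": False, "text": []}
        )

    def handle_data(self, data):
        if self.open_elements:
            self.open_elements[-1]["text"].append(data)

    def handle_endtag(self, tag):
        if not self.open_elements:
            return
        entry = self.open_elements.pop()
        # Collapse runs of whitespace so values match the printed text.
        text = " ".join("".join(entry["text"]).split())
        if entry["cls"] is None:
            # Plain formatting element: hand its text over to the parent.
            if self.open_elements:
                self.open_elements[-1]["text"].append(f" {text} ")
        elif entry["has_child_class"]:
            self.containers.append(entry["cls"])
        else:
            self.valued[entry["cls"]] = text


def split_classes(markup):
    """Return (valued, containers) for a microformatted fragment."""
    parser = ClassSplitter()
    parser.feed(markup)
    parser.close()
    return parser.valued, parser.containers


valued, containers = split_classes(HCARD)
```

Here valued maps each valued class of Example 1 to its value, while containers lists vcard and adr. Keying valued by class name is a simplification that keeps only one instance per class; distinguishing repeated instances requires unique labels for each class occurrence.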
Definition 2 (qualified information): Given a microformatted webpage $P$, each microformat class of $P$ is represented with a pair $(i, c)$ where $i$ is a label that uniquely identifies class $c$ of $P$. Given a microformat class $(i, c)$ of the form <XHTMLlabel class=c> $v_c$ </XHTMLlabel>, $v_c$ is the value of $c$ and it is represented with the triple $(i, c, v_c)$. Given a microformatted webpage $P$, the qualified information of $P$ is $Q = \{ v \mid (i, c, v) \in P \}$.

[Fig. 1: A definitional semantic network.]

In Example 1, (tel1, tel) and (vcard1, vcard) are some microformat classes. The valued classes are {fn, org, street-address, locality, postal-code, tel}, while vcard and adr are not valued because they do not have an associated value; 'Valencia, Spain' is the value of locality, 'E-46022' is the value of postal-code, etc.; and finally, the qualified information is Q = {'Germán Vidal', 'Tech. Univ. Of Valencia', 'Camino de Vera s/n, Room 2D42, DSIC building', 'Valencia, Spain', 'E-46022', '+34-96-387-7007'}.

Given a container class $(i, c)$, we refer to the contained classes of $c$ as $S(c)$, and to the indexes of the contained classes of $c$ as $S(i)$. For instance, in Example 1, (adr1, adr) and (vcard1, vcard) are container classes, S(vcard) = {fn, org, adr, tel} and S(vcard1) = {fn1, org1, adr1, tel1}.

Definition 3 (semantic network): A directed graph is an ordered pair $G = (V, E)$ where $V$ is a finite set of vertices or nodes, and $E \subseteq V \times V$ is a set of ordered pairs $(v \to v')$ with $v, v' \in V$ called edges. Given a microformatted webpage $P$ with a set of container classes $C$ and qualified information $Q$, the semantic network of $P$ is a directed graph $S = (V_s, E_s)$ where $V_s = Q \cup I$, with $I = \{ i \mid (i, c) \in C \}$; and $(v \xrightarrow{l} v') \in E_s$ iff $\exists (i_1, c_1) \in C$ with $i_2 \in S(i_1)$ and either
• $(i_2, c_2) \in C$, $v = i_1$, $v' = i_2$ and $l = c_2$, or
• $(i_2, c_2, v_2) \in P$ (so that $v_2 \in Q$), $v = i_1$, $v' = v_2$ and $l = c_2$.
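The construction in Definition 3 can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the dictionaries containers, valued and contained, the function semantic_network, and the index names (fn1, street-address1, and so on) are hypothetical encodings chosen here for $C$, the valued triples of $P$, and $S(i)$, filled in by hand for Example 1.

```python
# Container classes C of Example 1, encoded as index -> class.
containers = {"vcard1": "vcard", "adr1": "adr"}

# Valued classes of Example 1, encoded as index -> (class, value).
valued = {
    "fn1": ("fn", "Germán Vidal"),
    "org1": ("org", "Tech. Univ. Of Valencia"),
    "street-address1": (
        "street-address",
        "Camino de Vera s/n, Room 2D42, DSIC building",
    ),
    "locality1": ("locality", "Valencia, Spain"),
    "postal-code1": ("postal-code", "E-46022"),
    "tel1": ("tel", "+34-96-387-7007"),
}

# S(i): indexes of the classes contained in each container class.
contained = {
    "vcard1": ["fn1", "org1", "adr1", "tel1"],
    "adr1": ["street-address1", "locality1", "postal-code1"],
}


def semantic_network(containers, valued, contained):
    """Build (V_s, E_s); each edge is a labeled triple (v, l, v')."""
    # V_s = Q ∪ I: the values plus the indexes of the container classes.
    vertices = {value for _, value in valued.values()} | set(containers)
    edges = set()
    for i1 in containers:
        for i2 in contained.get(i1, []):
            if i2 in containers:
                # First case: the contained class is itself a container.
                edges.add((i1, containers[i2], i2))
            elif i2 in valued:
                # Second case: the contained class carries a value.
                c2, v2 = valued[i2]
                edges.add((i1, c2, v2))
    return vertices, edges


vertices, edges = semantic_network(containers, valued, contained)
```

For Example 1 this yields eight vertices (two container indexes and six values) and seven labeled edges, such as ("vcard1", "adr", "adr1") and ("adr1", "locality", "Valencia, Spain").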
Roughly speaking, the semantic network is a tree whose leaves contain the information described by the metadata introduced in the microformat, whose internal nodes are (instances of) classes used to classify the information, and whose edges are labeled with classes.

Example 4: As an example of a semantic network, consider the directed graph in Figure 2, constructed from Example 1.

Once we have formalized the definition of semantic network, we can represent microformats in a structured way. Now, we adapt the notion of sentence used in RDF to be used with semantic networks.

Definition 5 (sentence): Given a semantic network $S = (V_s, E_s)$, a sentence of $S$ is a triple $s = \langle \mathit{subject}, \mathit{predicate}, \mathit{value} \rangle$ where $(v \xrightarrow{p} v') \in E_s$, $\mathit{subject} = v$, $\mathit{predicate} = p$ and $\mathit{value} = v'$.

In order to extract knowledge from a semantic network we can build its associated adjacency matrix. An adjacency matrix is an efficient implementation of semantic networks where rows represent subjects, columns are values/objects of sentences, and each cell contains the predicate induced by its row and column.

Example 6: The adjacency matrix of the semantic network of Figure 2 is shown in Figure 3. The sentences extracted from the adjacency matrix are:
1) ⟨vcard1, fn, Germán Vidal⟩
2) ⟨vcard1, org, Tech. Univ. Of Valencia⟩
3) ⟨vcard1, adr, adr1⟩
4) ⟨vcard1, tel, +34-96-387-7007⟩
5) ⟨adr1, street-address, Camino de Vera s/n, Room 2D42, DSIC building⟩
6) ⟨adr1, locality, Valencia, Spain⟩
7) ⟨adr1, postal-code, E-46022⟩

They can be interpreted as follows: the personal card vcard1 has the name Germán Vidal; the personal card vcard1 has an organization, which is Tech. Univ. Of Valencia; and so on.

3.2 Extending the semantic network for websites

In many applications, such as information brokers [20], it is necessary to discover a set of microformats of a similar type among many pages. For instance, a common task is finding all personal cards vcard [8] of a given domain in order to add them to an electronic appointment book.
This implies representing many instances of a microformat from one webpage or from different web pages. To this end, we extend the semantic network with a predicate relates to, whose goal is to build a relation between two different instances of container classes. Given a web page with semantic network $(V, E)$, and given three container classes $(i_1, c_1)$ with $i_1 \in V$, $(i_2, c_2)$ with $i_2 \in V$, and $(i_3, c_3)$ with $i_3 \in V$, where $c_1 = c_2 = c_3$, we construct the edges $i_1 \xrightarrow{r} i_2$, $i_2 \xrightarrow{r} i_3$ and $i_3 \xrightarrow{r} i_1$, where $r$ is the predicate relates to. Roughly speaking, we build a cycle among instances of the same microformat type. This is useful to extract all similar microformats in a website.

[Fig. 2: A semantic network from Example 1.]

[Fig. 3: The adjacency matrix for Figure 2.]

[Fig. 4: The semantic network of a real website of Example 7.]

Example 7: Let us consider the website http://health-25.europages.co.uk/businessdirectory-europe/did-25/hc-21605/Hospitaland-medical-services.html, which contains information about twenty places offering medical services in Europe, i.e., there are twenty vcards. The following code fragment corresponds to the first vcard microformat:

    <div class="vcard">
      ...
      <div class="fn org"> CASA DI CURA SAN PIO X </div>
      ...
      <div class="adr">
        <span class="street-address postal-code locality"> 31, V. Nava 20159 MILANO (MI) </span>
        <span class="country-name">ITALY</span>
        ...
        <span class="tel">+39 0269 511</span>
        </span>
      </div>
      ...
    </div>

In the above code, we identify container and valued classes in order to build the semantic network. In Figure 4, the artificial relationship relates to is included, and it produces a cycle between vcard instances. Consequently, a fragment of the implicit knowledge is as follows:
1) ⟨vcard1, fn org, CASA DI CURA SAN PIO X⟩
2) ⟨vcard1, adr, adr1⟩
3) ⟨adr1, country-name, ITALY⟩
4) ⟨adr1, tel, +39 0269 511⟩
5) ⟨adr1, street-address postal-code locality, 31, V. Nava 20159 MILANO (MI)⟩
6) ⟨vcard1, relates to, vcard2⟩
7) . . .
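The cycle construction for the relates to predicate can be sketched as follows. This is a minimal Python sketch under assumptions that go beyond the text: the function name relates_to_edges and the instance list are hypothetical, the construction (stated in the text for three instances) is generalized to any number of instances of the same class, and a class with a single instance is assumed to produce no edge.

```python
from collections import defaultdict


def relates_to_edges(container_instances):
    """Link the instances of each container class in a cycle.

    container_instances is a list of (index, class) pairs, in the
    order in which the instances are found in the website.
    """
    by_class = defaultdict(list)
    for index, cls in container_instances:
        by_class[cls].append(index)

    edges = []
    for indexes in by_class.values():
        if len(indexes) < 2:
            continue  # assumption: a lone instance gets no edge
        for position, source in enumerate(indexes):
            # The last instance points back to the first one.
            target = indexes[(position + 1) % len(indexes)]
            edges.append((source, "relates to", target))
    return edges


# Hypothetical instances: three vcards and one adr.
instances = [
    ("vcard1", "vcard"),
    ("adr1", "adr"),
    ("vcard2", "vcard"),
    ("vcard3", "vcard"),
]
cycle = relates_to_edges(instances)
```

With these instances, cycle contains the sentences ⟨vcard1, relates to, vcard2⟩, ⟨vcard2, relates to, vcard3⟩ and ⟨vcard3, relates to, vcard1⟩, so every vcard of the website can be reached by following relates to from any one of them.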
The knowledge should be completed with the sentences from the twenty microformats of the real website. The extracted knowledge could be useful for verbose reports produced by software tools.

4. Contextualizing the model

The introduced model based on semantic networks was inspired by RDF graphs. In this section we describe some similarities between RDF and our knowledge model.

Example 8: Let us consider the following information described with RDF, and its associated graph in Figure 5, constructed with the RDF Validator [17].

    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:foaf="http://xmlns.com/foaf/0.1/"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
      <foaf:Person>
        <foaf:name>Johnny Colt</foaf:name>
        <foaf:mbox rdf:resource="mailto:jcolt@domain.com"/>
        <foaf:homepage rdf:resource="http://www.jcolt.com/"/>
        <foaf:nick>Johnny</foaf:nick>
        <foaf:interest>
          <rdf:Description rdf:about="http://www.abc.org" rdfs:label="ABC"/>
        </foaf:interest>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Wendy Brown</foaf:name>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>

[Fig. 5: The RDF graph from Example 8.]

The above RDF code introduces descriptions that use the vocabulary of the ontology called FOAF (Friend Of A Friend [18]), which is useful to specify information about people and their friendship relations. We can observe metadata such as interest, nick, homepage, etc., which are associated with particular information. Observe that they are similar to our valued classes. Moreover, there are container classes; for instance, the metadata knows is converted to an artificial name in the graph. We also keep this design principle, which allows us to preserve a metadata item as the predicate of sentences. In general, metadata are used as descriptors of some attribute, and thus they are correctly considered as the predicate of a sentence.

Now, let us observe in Figure 6 two representative sentences extracted from the graph of Figure 5. Both sentences have been generated with the RDF Validator application [17].

[Fig. 6: Two sentences from the RDF graph in Figure 5.]

The first sentence is formed from two container metadata, i.e., we observe a sentence with an artificial name as subject and another as object, and the semantics between them is stated by the predicate knows. We proceed in a similar way with vcard1 and adr1. The second sentence relates an artificial name (a container class) with a particular value "Wendy Brown" mediated by the predicate name. This is a typical case of a subject (container class), predicate (valued class) and information (value of a class). Therefore, the expressive power of RDF is similar to that of the semantic network representation of microformats.

Two reasons for using microformats are the following:
• Simplicity. Thousands and even millions of webpages are published every day. It is impossible to ensure the semantic preparation of these web documents by means of sophisticated annotation schemas. In contrast, microformats offer web developers a simple way to prepare documents semantically [7].
• Vocabularies (ontologies) are not required. People can develop their own ontologies and, eventually, millions of ontologies can exist. With microformats the goal is to use centralized standard vocabularies; this reduces the effort needed to synchronize the employed terminology.

Both reasons for using microformats translate into reasons for using the model of representation based on semantic networks. Microformats are simple, so the model should remain simple. The vocabularies to produce sentences are already defined in the metadata provided by microformats. Hence the model should preserve such descriptors, i.e., the predicates are the metadata defined for each microformat. The standard GRDDL [19] is useful to transform microformats to RDF and then extract knowledge; however, we propose that knowledge can be extracted directly from microformats.

5. Conclusions

To the best of our knowledge, this is the first proposal for knowledge modeling by employing microformats. This is a problem of particular interest that has previously been addressed by first transforming microformats to RDF with GRDDL [19]. However, we consider that it is possible to maintain a simple schema (like microformats) in a simple model of knowledge by means of semantic networks. In this way, it is possible to obtain sentences directly from microformats.

Despite the simplicity of microformats, an increasing number of developers adopt them, in thousands of websites every few months (see, e.g., [7]). We can observe many potential and interesting applications of our approach. For instance, it can be used to develop knowledge extraction tools focused on price comparison, automatic generation of academic exams, automatic discovery of offers, etc. Moreover, the extracted knowledge is useful for producing verbose reports, which could be exploited by, for instance, software agents.

6. Acknowledgements

This work has been partially supported by the Spanish Ministerio de Ciencia e Innovación under grant TIN200806622-C03-02, by the Generalitat Valenciana under grant ACOMP/2010/042, by the Universidad Politécnica de Valencia (Program PAID-06-08) and by the Mexican Dirección General de Educación Superior Tecnológica.

References

[1] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American Magazine, May 2001.
[2] World Wide Web Consortium. Resource Description Framework (RDF). URL: http://www.w3.org/RDF/, 2010.
[3] World Wide Web Consortium. Extensible Markup Language (XML). URL: http://www.w3.org/XML/, 2009.
[4] Liyang Yu. Introduction to the Semantic Web and Semantic Web Services. Chapman & Hall/CRC, 2007.
[5] T. Çelik. What's the Next Big Thing on the Web? It May Be a Small, Simple Thing - Microformats.
Knowledge@Wharton, 2005.
[6] Microformats.org. The Official Microformats Site. http://microformats.org/, 2009.
[7] R. Khare and T. Çelik. Microformats: a Pragmatic Path to the Semantic Web. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 865–866. ACM, 2006.
[8] hCard. Simple, Open, Distributed Format for Representing People, Companies, Organizations, and Places. http://microformats.org/wiki/hcard, 2009.
[9] R. Khare. Microformats: The Next (Small) Thing on the Semantic Web? IEEE Internet Computing, 10(1):68–75, 2006.
[10] J. F. Sowa. Semantic Networks. In S. C. Shapiro, editor, Encyclopedia of Artificial Intelligence. John Wiley & Sons, 1992.
[11] R.T.N. Shetty, P.M. Riccio, and J. Quinqueton. Extended semantic network for knowledge representation. In Zhongzhi Shi, K. Shimohara, and David Dagan Feng, editors, Intelligent Information Processing, volume 228 of IFIP, pages 135–144. Springer, 2006.
[12] J. Guadalupe Ramos, Josep Silva, Gustavo Arroyo, and Juan C. Solorio. A Technique for Information Retrieval from Microformatted Websites. Lecture Notes in Computer Science, 5947/2010:344–351, 2010.
[13] R. Quillian. Semantic Memory. In Marvin Minsky, editor, Semantic Information Processing. MIT Press, 1969.
[14] Gerd Stumme, Bettina Hoser, Christoph Schmitz, and Harith Alani, editors. ISWC 2005 Workshop on Semantic Network Analysis, volume 171 of CEUR Workshop Proceedings, Galway, Ireland, 2005.
[15] Harith Alani, Bettina Hoser, Christoph Schmitz, and Gerd Stumme, editors. Proceedings of the 2nd Workshop on Semantic Network Analysis, 2006.
[16] J. F. Sowa, editor. Principles of Semantic Networks: Explorations in the Representation of Knowledge. Morgan Kaufmann, 1991.
[17] World Wide Web Consortium. RDF Validation Service. URL: http://www.w3.org/RDF/Validator/, 2007.
[18] Dan Brickley and Libby Miller. FOAF Vocabulary Specification 0.97. URL: http://xmlns.com/foaf/spec/, 2010.
[19] World Wide Web Consortium. Gleaning Resource Descriptions from Dialects of Languages (GRDDL). URL: http://www.w3.org/TR/grddl/, 2007.
[20] Richard MacManus. Mozilla Does Microformats: Firefox 3 as Information Broker. URL: http://www.readwriteweb.com/archives/mozilla_does_microformats_firefox3.php, 2007.