1
Introduction to the Semantic Web
2
Let’s organize a trip to Budapest using the
Web!
3
You try to find a proper flight with …
4
… a big, reputable airline, or …
5
… or a low cost one
6
You have to find a hotel, so you look for…
7
… a really cheap accommodation, or …
8
… or a really luxurious one, or …
9
… an intermediate one …
10
Oops, that is no good: the page is in
Hungarian, which almost nobody
understands, but…
11
… this one could work
12
Of course, you could decide to trust a
specialized site…
13
… like this one, or…
14
… or this one
15
You may want to know something about
Budapest; look for some photographs…
16
… on flickr …
17
… on Google …
18
… or a (social) travel site
19
What happened here?
You had to consult a large number of sites, all
different in style, purpose, possibly
language…
You had to mentally integrate all that
information to achieve your goals
We all know that, sometimes, this is a long
and tedious process!
20
All those pages are only tips of respective icebergs:
the real data is hidden somewhere in databases, XML files,
Excel sheets, …
you only have access to what the Web page designers allow
you to see
Specialized sites (Expedia, TripAdvisor) do a bit
more:
they gather and combine data from other sources (usually
with the approval of the data owners)
but they still control how you see those sources
But sometimes you want to personalize: access
the original data and combine it yourself!
21
Put it another way…
We would like to extend the current Web to a “Web
of data”:
allow for applications to exploit the data directly
22
But wait! Isn’t that what mashup sites are
already doing?
23
A “mashup” example:
24
In some ways, yes, and that shows the huge power
that such a Web of data provides
But mashup sites are forced to do very ad-hoc jobs
various data sources expose their data via Web Services
each with a different API, a different logic, different structure
these sites are forced to reinvent the wheel many times
because there is no standard way of doing things
25
Put it another way (again)…
We would like to extend the current Web, in a
standard way, to a “Web of data”
What makes the current (document) Web work?
people create different documents
they give each document an address (i.e., a URI) and make it accessible
to others on the Web
26
Let us put it together
What we need for a Web of Data:
use URI-s to publish data, not only full documents
allow the data to link to other data
characterize/classify the data and the links (the “terms”) to
convey some extra meaning
and use standards for all these!
27
So What is the Semantic Web?
28
It is a collection of standard technologies to
realize a Web of Data.
The Semantic Web formalizes knowledge in a way
that improves decision making and can form
the basis for autonomous reasoning in the future.
The Semantic Web is an effort to make the content
of the WWW accessible and readable for
machines.
29
Web 3.0: the Semantic Web
• Web 3.0, also referred to as the Semantic Web or the
read-write-execute Web, is the era (2010 onwards)
that is regarded as the future of the Web.
• In this era, computers are expected to interpret information
more like humans do, via Artificial Intelligence and Machine
Learning.
• This helps to intelligently generate and distribute useful
content tailored to a particular need of a user.
30
Semantic web
The Semantic Web has a set of standards and best
practices for sharing data and the
semantics of that data over the web for use
by applications.
A set of standards:
• The RDF Data Model.
• The SPARQL query Language.
• OWL Standards for storing vocabularies and ontologies.
The best practices for sharing data over
the web:
• The use of URIs to name things
• The use of standards such as RDF and SPARQL
31
In what follows…
We will use a simplistic example to introduce the
main technical concepts
The details will come later in the course
32
The rough structure of data integration
1. Map the various data onto an abstract data
representation
make the data independent of its internal representation…
2. Merge the resulting representations
3. Start making queries on the whole!
queries that could not have been done on the individual data
sets
33
A simplified bookstore data
English books database (dataset “A”)

ID                 Author  Title             Publisher  Year
ISBN0-00-651409-X  id_xyz  The Glass Palace  id_qpr     2000

ID      Name           Home Page
id_xyz  Ghosh, Amitav  http://www.amitavghosh.com

ID      Publ. Name      City
id_qpr  Harper Collins  London
34
1st: export your data as a set of relations
35
French books database (dataset “F”)

     A                   B                      D           E
 1   ID                  Titre                  Traducteur  Original
 2   ISBN0 2020386682    Le Palais des miroirs  A13         ISBN-0-00-651409-X
 3
 6   ID                  Auteur
 7   ISBN-0-00-651409-X  A12
11   Nom
12   Ghosh, Amitav
13   Besse, Christianne
36
2nd: export your second set of data
37
3rd: start merging your data
38
3rd: start merging your data (cont.)
39
3rd: merge identical resources
40
Start making queries…
User of data “F” can now ask queries like:
“give me the title of the original”
well, … « donne-moi le titre de l’original »
This information is not in the dataset “F”…
…but can be retrieved by merging with dataset “A”!
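The merge-then-query idea can be sketched in a few lines of Python. This is only an illustration, not the tooling used for the slides: each dataset is modelled as a plain set of (subject, property, object) triples, merging is a set union, and the property names a:title and a:year are assumed for the example (f:titre and f:original appear in the RDF/XML shown later).

```python
# Each dataset is a set of (subject, property, object) triples.
# The "…" in the URIs is elided in the slides and is kept as-is here.
ORIGINAL = "http://…/isbn/000651409X"       # URI shared by both datasets
TRANSLATION = "http://…/isbn/2020386682"

# Dataset "A" (English books); a:title and a:year are assumed property names.
dataset_a = {
    (ORIGINAL, "a:title", "The Glass Palace"),
    (ORIGINAL, "a:year", "2000"),
}

# Dataset "F" (French books).
dataset_f = {
    (TRANSLATION, "f:titre", "Le palais des miroirs"),
    (TRANSLATION, "f:original", ORIGINAL),
}


def objects(graph, subject, prop):
    """Return all objects o such that (subject, prop, o) is in the graph."""
    return {o for (s, p, o) in graph if s == subject and p == prop}


def title_of_original(graph, book):
    """Follow f:original from the book, then read a:title of the target."""
    titles = set()
    for original in objects(graph, book, "f:original"):
        titles |= objects(graph, original, "a:title")
    return titles


# Merging is a set union; the shared URI makes the two graphs connect.
merged = dataset_a | dataset_f

print(title_of_original(dataset_f, TRANSLATION))  # nothing found in "F" alone
print(title_of_original(merged, TRANSLATION))     # found after the merge
```

The query only succeeds on the merged graph because both datasets use the same URI for the original book.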
41
However, more can be achieved…
We “feel” that a:author and f:auteur should be
the same
But an automatic merge does not know that!
Let us add some extra information to the merged
data:
a:author same as f:auteur
both identify a “Person”
a term that a community may have already defined:
a “Person” is uniquely identified by his/her name and, say,
homepage
it can be used as a “category” for certain type of resources
42
3rd revisited: use the extra knowledge
FOAF (Friend of a Friend) is a machine-readable
ontology describing persons, their activities, and their
relations to other people and objects. Anyone can use
FOAF to describe themselves. FOAF allows groups of
people to describe social networks without the need for
a centralised database.
43
Start making richer queries!
User of dataset “F” can now query:
“donne-moi la page d’accueil de l’auteur de l’original”
well… “give me the home page of the original’s ‘auteur’”
The information is not in datasets “F” or “A”…
…but was made available by:
merging datasets “A” and “F”
adding three simple extra statements as an extra “glue”
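The effect of the “glue” can be sketched as follows. This is a simplified illustration under stated assumptions: only one glue statement (a:author same as f:auteur) is modelled, as a lookup table of equivalent properties; the node labels _:a1 and _:f1 and the property names a:name, a:homepage and f:nom are hypothetical.

```python
ORIGINAL = "http://…/isbn/000651409X"       # "…" is elided in the slides
TRANSLATION = "http://…/isbn/2020386682"

# Merged data from "A" and "F". Node labels and the property names
# a:name, a:homepage and f:nom are assumptions made for this sketch.
merged = {
    (ORIGINAL, "a:author", "_:a1"),
    ("_:a1", "a:name", "Ghosh, Amitav"),
    ("_:a1", "a:homepage", "http://www.amitavghosh.com"),
    (TRANSLATION, "f:original", ORIGINAL),
    (ORIGINAL, "f:auteur", "_:f1"),
    ("_:f1", "f:nom", "Ghosh, Amitav"),
}

# The "glue": f:auteur is declared to mean the same as a:author.
GLUE = {"f:auteur": {"a:author"}}


def objects(graph, subject, prop, same_as=None):
    """Objects reachable from subject via prop or any equivalent property."""
    props = {prop} | (same_as or {}).get(prop, set())
    return {o for (s, p, o) in graph if s == subject and p in props}


def homepage_of_original_author(graph, book, same_as=None):
    """Follow f:original, then f:auteur, then read a:homepage."""
    pages = set()
    for original in objects(graph, book, "f:original"):
        for person in objects(graph, original, "f:auteur", same_as):
            pages |= objects(graph, person, "a:homepage")
    return pages


print(homepage_of_original_author(merged, TRANSLATION))        # no glue
print(homepage_of_original_author(merged, TRANSLATION, GLUE))  # with glue
```

Without the glue, the query only reaches the node from dataset “F”, which has no home page; with it, the query also reaches the node from dataset “A”.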
44
Combine with different datasets
Using, e.g., the “Person”, the dataset can be
combined with other sources
For example, data in Wikipedia can be extracted
using dedicated tools
e.g., the “dbpedia” project can extract the “infobox”
information from Wikipedia already…
45
Merge with Wikipedia data
46
Merge with Wikipedia data
47
Merge with Wikipedia data
48
Is that surprising?
It may look like it but, in fact, it should not be…
What happened via automatic means is done every
day by Web users!
The difference: a bit of extra rigour so that
machines could do this, too
49
What did we do?
We combined different datasets that
are somewhere on the web
are of different formats (MySQL, Excel sheet, XHTML, etc.)
have different names for relations
We could combine the data because some URI-s
were identical (the ISBN-s in this case)
We could add some simple additional information
(the “glue”), possibly using common terminologies
that a community has produced
As a result, new relations could be found and
retrieved
50
It could become even more powerful
We could add extra knowledge to the merged
datasets
e.g., a full classification of various types of library data
geographical information
etc.
This is where ontologies, extra rules, etc, come in
ontologies/rule sets can be relatively simple and small, or
huge, or anything in between…
Even more powerful queries can be asked as a
result
51
What did we do? (cont)
52
The Basis: RDF
53
RDF HISTORY
54
Resource Description Framework
55
Views of RDF
56
RDF triples (cont.)
An RDF Triple (s,p,o) is such that:
“s”, “p” are URI-s, i.e., resources on the Web; “o” is a URI or
a literal
“s”, “p”, and “o” stand for “subject”, “property”, and “object”
here is the complete triple:
RDF is a general model for such triples (with machine
readable formats like RDF/XML, Turtle, N3, RXR, …)
57
RDF triples (cont.)
Resources can use any URI, e.g.:
http://www.example.org/file.xml#element(home)
http://www.example.org/file.html#home
http://www.example.org/file2.xml#xpath1(//q[@a=b])
URI-s can also denote non Web entities:
http://www.ivan-herman.net/me is me
not my home page, not my publication list, but me
RDF triples form a directed, labelled graph
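To make the “directed, labelled graph” reading concrete, here is a minimal Python sketch. It is an illustration only: triples are plain tuples, and the adjacency structure and the helper name build_graph are choices made for this example, not part of RDF.

```python
from collections import defaultdict

BOOK = "http://…/isbn/2020386682"        # "…" is elided in the slides

# Two triples (subject, property, object) about the French book.
triples = [
    (BOOK, "f:titre", "Le palais des miroirs"),        # object is a literal
    (BOOK, "f:original", "http://…/isbn/000651409X"),  # object is a URI
]


def build_graph(triples):
    """Map each subject node to its outgoing (edge label, target) pairs."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


graph = build_graph(triples)
for label, target in graph[BOOK]:
    print(f"{BOOK} --{label}--> {target}")
```

Each triple becomes one edge: the subject is the source node, the property is the edge label, and the object is the target node.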
58
A simple RDF example (in RDF/XML)
<rdf:Description rdf:about="http://…/isbn/2020386682">
  <f:titre xml:lang="fr">Le palais des miroirs</f:titre>
  <f:original rdf:resource="http://…/isbn/000651409X"/>
</rdf:Description>
(Note: namespaces are used to simplify the URI-s)
59
Resource Description Framework (RDF)
It is a data model used to represent resources
A resource is described using triples
The triple (a statement) is the basic building block
It is a universal, machine-readable exchange format
RDF has an XML syntax
RDF also has other notations (Turtle, N-Triples, N3, JSON)
60
RDF Graph (e.g., the Beatles band)
61
This graph shows several nodes that represent entities such as the
Beatles band and one of their studio albums.
Each edge has an identifier that tells us what relationship holds between
those nodes. For example, the :member edge links a band to its
members. The rdf:type edge represents a special kind of relationship:
it links a node to its type.
Edges that link a node to a literal value are sometimes called “attributes”
of the node and are often used to represent the characteristics of the nodes.
This simple, flexible data model has a lot of expressive power to
represent complex situations, relationships, and other things of interest,
while also being appropriately abstract.
62
RDF in Turtle
63
RDF
RDF has a special nomenclature for
naming nodes and edges in a graph.
An edge is called a triple, the source
node is called a subject, the edge
name is called a predicate, and the
target node is called an object.
64
RDF NODES
There are three different kinds of RDF nodes:
•IRI (Internationalized Resource Identifier): An IRI is a Unicode string for
identifying nodes and edges in an unambiguous way.
IRIs are internationalized versions of URIs, which are generalizations of URLs.
•Blank node: Nodes without a user-visible identifier are called blank nodes
(“bnode” for short). A blank node is appropriate when the node does not need
to be referenced directly. A blank node can be reached by following its incident
edges from other nodes.
•Literal: Literals are concrete values of datatypes such as strings,
numbers, and dates.
In the example Beatles graph, we have several IRIs: :The_Beatles and :John_Lennon are
two examples. Literals in our example include the string "The Beatles", the date "1963-03-22",
and the integer 125.
65
IRI
IRIs, just like the URIs and URLs they generalize, are long strings that are not
easy to read or write. A full IRI can be serialized by simply enclosing it in
angle brackets:
A prefix is a short name that is mapped to a long namespace. A prefixed name
is the sequence of the prefix and the local name separated by a colon :. The
empty string is a valid prefix and called the default namespace for a graph.
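The mapping from prefixed names to full IRIs can be sketched as a small lookup. The default namespace http://example.org/music/ is a hypothetical value chosen for this example; the rdf and rdfs namespaces are the standard W3C ones.

```python
# Prefix table: short name -> namespace IRI.
PREFIXES = {
    "": "http://example.org/music/",  # hypothetical default namespace
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand prefix:local into a full IRI enclosed in angle brackets."""
    prefix, colon, local = prefixed_name.partition(":")
    if not colon:
        raise ValueError(f"not a prefixed name: {prefixed_name!r}")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return "<" + prefixes[prefix] + local + ">"


print(expand(":The_Beatles"))  # empty prefix: the default namespace
print(expand("rdf:type"))
```

A name such as :The_Beatles has the empty string before the colon, so it is resolved against the default namespace.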
66
Blank Nodes
Suppose we extend our example. We have a new use case that requires us to
capture, for all the tracks in an album, which side they belong to (when applicable,
for albums released on media with “sides”, e.g., vinyl, cassettes, etc.) and the order
of the song on the album/side. We can introduce a blank node between the album
and the song to attach this data:
Blank nodes do not have globally unique identifiers so when they are serialized a locally
unique, non-persistent label is used. These blank node names are serialized after
the special prefix _:.
67
Literals
Literals are serialized as their lexical value in double quotes followed by the datatype
after double carets (^^). The datatype is typically a built-in datatype from
XML Schema Datatypes (XSD) that defines many commonly used datatypes but
custom datatype IRIs can also be used. Some of the XSD datatypes can be
serialized without the explicit datatype or the quotes. The following table shows
examples of serializing different datatypes:
Serialization            Datatype     Description
"The Beatles"            xsd:string   Datatype can be omitted for string values
"1963-03-22"^^xsd:date   xsd:date     Date value
125                      xsd:integer  The datatype and double quotes can be omitted for integer values
3.0                      xsd:decimal  Arbitrary-precision decimals can be written without the datatype and quotes too. The presence of . in the number makes it a decimal.
3.2E4                    xsd:double   Double-precision floating point values can be written in scientific notation with the symbol E separating the mantissa from the exponent
true                     xsd:boolean  Lowercase strings true and false can be used for boolean values
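The serialization rules for literals can be sketched as a Python function that maps native values to Turtle text. This is a minimal illustration, not a complete serializer: the function name to_turtle_literal is made up for this example, only the datatypes discussed here are handled, and special float values such as infinity are not treated.

```python
from datetime import date
from decimal import Decimal


def to_turtle_literal(value):
    """Serialize a Python value as a Turtle literal (simplified sketch)."""
    if isinstance(value, bool):       # bool is a subclass of int: test it first
        return "true" if value else "false"
    if isinstance(value, int):        # xsd:integer: no quotes, no datatype
        return str(value)
    if isinstance(value, Decimal):    # xsd:decimal: the "." marks a decimal
        text = format(value, "f")
        return text if "." in text else text + ".0"
    if isinstance(value, float):      # xsd:double: scientific notation with E
        return format(value, "E")
    if isinstance(value, date):       # xsd:date: quoted value plus ^^datatype
        return f'"{value.isoformat()}"^^xsd:date'
    if isinstance(value, str):        # xsd:string: datatype can be omitted
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"unsupported type: {type(value).__name__}")


for v in ["The Beatles", date(1963, 3, 22), 125, Decimal("3.0"), 3.2e4, True]:
    print(to_turtle_literal(v))
```

Note that a double is printed with a full mantissa and a signed exponent (for example 3.200000E+04), which is a different but equally valid spelling of 3.2E4.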
68
RDF Syntax
Node-and-link visualization of graphs is convenient
and easy to understand on a small scale, but it is
not very useful for exchanging data between
systems.
There are several syntaxes to serialize an RDF graph
as text including syntaxes based on XML and
JSON. In this tutorial, we will use the Turtle syntax
which is also the basis of the SPARQL query
language.
69
RDF in Turtle
The serialization of the Beatles graph in Turtle syntax looks like this:
70
Turtle introduces some syntactic sugar:
•Multiple predicates: If two triples share the same subject, then the first triple can be
terminated with ; and the subject of the second triple can be omitted.
•Multiple objects: If two triples share the same subject and the same predicate the
objects can be separated with , without repeating the subject or the predicate.
•Types: The letter a can be used in place of rdf:type; you can read this as “is a”,
basically. For example, “Love Me Do is a song.”
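The three abbreviations above can be sketched as a tiny Python pretty-printer. It is an illustration only and not a full Turtle writer: it assumes every term is already a valid prefixed name, and the node name :Paul_McCartney is introduced just for this example.

```python
def to_turtle(triples):
    """Group triples by subject and predicate and emit abbreviated Turtle."""
    grouped = {}  # subject -> predicate -> list of objects (insertion order)
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)

    blocks = []
    for s, predicates in grouped.items():
        parts = []
        for p, objs in predicates.items():
            name = "a" if p == "rdf:type" else p        # "a" for rdf:type
            parts.append(f"{name} " + " , ".join(objs))  # "," between objects
        # ";" between predicates of the same subject, "." ends the block
        blocks.append(f"{s} " + " ;\n    ".join(parts) + " .")
    return "\n".join(blocks)


triples = [
    (":The_Beatles", "rdf:type", ":Band"),
    (":The_Beatles", ":member", ":John_Lennon"),
    (":The_Beatles", ":member", ":Paul_McCartney"),  # name assumed for the example
]
print(to_turtle(triples))
# :The_Beatles a :Band ;
#     :member :John_Lennon , :Paul_McCartney .
```

Three triples with the same subject collapse into one statement block, and the two :member triples share a single predicate.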
71
72
Named Graph
Sometimes it is useful to assign a name to an RDF
graph for the purposes of sane data management,
access control, or to attach metadata to the overall
graph rather than to individual nodes. The notion of
named graphs in RDF allows us to do that.
73
In the figure shown here, it is the triples, not the nodes, that are separated into named graphs, and different named graphs can share some common nodes; e.g., the :Please_Please_Me node appears both in the :Artist graph and the :Album graph.
It is possible to traverse the edges starting from one named graph and continue into another named graph via these shared nodes.
It is through this sharing of nodes across named graphs that the collection of named graphs (conceptually) constitutes a larger unified graph.
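One way to picture named graphs is to store each triple together with the name of its graph (a “quad”). The sketch below is only an illustration: which triple lives in which graph, and the node name :Love_Me_Do, are assumptions made for the example.

```python
# Each quad is (subject, property, object, graph name).
quads = [
    (":The_Beatles", ":album", ":Please_Please_Me", ":Artist"),
    (":Please_Please_Me", ":track", ":Love_Me_Do", ":Album"),
]


def graphs_of(node, quads):
    """Names of all graphs in which the node appears as subject or object."""
    return {g for (s, p, o, g) in quads if node in (s, o)}


def reachable(start, quads):
    """Nodes reachable from start, following edges across all named graphs."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for s, p, o, g in quads:
            if s == node and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


print(graphs_of(":Please_Please_Me", quads))  # the shared node is in both graphs
print(reachable(":The_Beatles", quads))       # traversal crosses graph boundaries
```

The traversal starts in the :Artist graph and continues into the :Album graph through the shared :Please_Please_Me node.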
74
RDF Class
Classes represent categories of nodes with similar
characteristics. Nodes that belong to a category are linked to
its class using the rdf:type (shorthand: a) property. Classes
themselves are instances of the meta-class rdfs:Class
:Band a rdfs:Class . # declaration of a class
:The_Beatles a :Band . # declaring an instance of a class
75
Properties
A property is a relation between subjects and objects. We have already seen many
examples of properties; :album, for example. We can use the rdf:Property class to
declare properties:
:track a rdf:Property .
:length a rdf:Property .
The range of the track property is defined to be the Song class, so the objects should be resources that
are instances of this class.
The range of the length property, on the other hand, is defined to be the built-in datatype xsd:integer,
so the objects should be integer literals.
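What a range declaration implies can be sketched in Python. Under RDFS semantics a range is used for inference rather than validation: if a property has a class as its range, every object of that property can be inferred to be an instance of that class. The rdfs:range statement and the node name :Love_Me_Do below are assumptions made for this example; the slide itself does not show the range declarations.

```python
# Schema triple: the range of :track is the class :Song (assumed declaration).
schema = {(":track", "rdfs:range", ":Song")}

# Data triple using the :track property (node names assumed for the example).
data = {(":Please_Please_Me", ":track", ":Love_Me_Do")}


def infer_types_from_range(schema, data):
    """Derive (object, rdf:type, class) for every property with a class range."""
    ranges = {p: c for (p, r, c) in schema if r == "rdfs:range"}
    return {(o, "rdf:type", ranges[p]) for (s, p, o) in data if p in ranges}


print(infer_types_from_range(schema, data))
```

The sketch covers class ranges only; for a datatype range such as xsd:integer the objects are literals, which cannot be the subject of an rdf:type triple.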
76
RDF Metadata
RDFS also provides two properties that can
be used to provide metadata about nodes,
classes, and properties:
•rdfs:label: provides a human-readable name
for a resource
•rdfs:comment: provides a human-readable
description of a resource
:length rdfs:label "length (in seconds)" ;
    rdfs:comment "The length of a song expressed in seconds".