CS549 Distributed Information Systems: Lecture 2: XML and Internet Databases
This lecture is based on Chapter 26 of Elmasri and Navathe, Fundamentals of Database Systems
Session Learning Outcomes
- Introduction
• What is XML
What is XML
• XML stands for Extensible Markup Language.
• Created in 1996. The first specification was published in 1998 by the W3C
• XML, like HTML, is a markup language, but unlike HTML it has no
predefined elements.
• You create your own elements and you assign them any name you like,
hence the term extensible.
How can XML be Used?
• XML is used to Exchange Data
What does XML look like
Relation
Books
Title     Author    Year
Java      John      1999
Pascal    Sara      1980
Basic     Mary      1975
Oracle    Emad      1999
….        ….        ….
XML document
<Books>
  <Book>
    <Title> Java </Title>
    <Author> John </Author>
    <Year> 1999 </Year>
  </Book>
  …
  <Book>
    <Title> Oracle </Title>
    <Author> Emad </Author>
    <Year> 1999 </Year>
  </Book>
  ….
</Books>
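The correspondence between the Books relation and the XML document can be sketched in code. The following is a minimal illustration using Python's standard xml.etree.ElementTree module; the function name books_to_rows and the two-book sample are assumptions made for this sketch, not part of the lecture.

```python
import xml.etree.ElementTree as ET

# Two of the books from the slide, written as a well-formed XML document.
BOOKS_XML = """
<Books>
  <Book><Title> Java </Title><Author> John </Author><Year> 1999 </Year></Book>
  <Book><Title> Oracle </Title><Author> Emad </Author><Year> 1999 </Year></Book>
</Books>
"""

def books_to_rows(xml_text):
    """Turn each <Book> element into a (title, author, year) tuple."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.findall("Book"):
        title = book.findtext("Title").strip()
        author = book.findtext("Author").strip()
        year = int(book.findtext("Year").strip())
        rows.append((title, author, year))
    return rows

rows = books_to_rows(BOOKS_XML)
for row in rows:
    print(row)
```

Each Book element plays the role of one tuple of the relation, and each child element plays the role of one column value.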
XML and HTML
• HTML is for humans and HTML describes web pages
• You don’t want to see error messages about the web pages you visit
• Browsers ignore and/or correct as many HTML errors as they can, so
HTML is often sloppy
• XML lets authors define their own tags and their own document
structure, hence the term extensible.
The Anatomy of XML Document
<?xml version="1.0"?>    XML declaration
<?xml-stylesheet type="text/xsl" href="template.xsl"?>    Processing instruction
Components of an XML Document
• Elements
• Each element has a beginning and ending tag
• <TAG_NAME>...</TAG_NAME>
• Elements can be empty (<TAG_NAME />)
• Attributes
• Describe an element; e.g., data type, data range, etc.
• Can only appear in the start tag
• Example: <Book ISBN="1-111-123">
• Processing instructions
• Encoding specification (Unicode by default)
• Namespace declaration
• Schema declaration
XML declaration
• The XML declaration looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
• Element and attribute names must not start with the letters xml (in any case: XML, Xml, ...)
XML Attributes
• XML elements can have attributes. Example: <Book ISBN="1-111-123">
• Data can be stored in child elements or in attributes
• Should you avoid using attributes?
• Here are some of the problems with using attributes:
• Attribute values are not easy to test against a DTD (Document Type Definition), which is used to
define the legal elements of an XML document
Distinction between sub-element and attribute
• In the context of documents, attributes are part of markup, while sub-element
contents are part of the basic document contents
• <Book>
…
<Publisher> McGraw Hill </Publisher>
…
</Book>
• Suggestion: use attributes for identifiers of elements, and use sub-elements for
contents
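As a small illustration of the suggestion above, the sketch below stores the identifier (ISBN) in an attribute and the content (Publisher) in a sub-element, then reads both back with Python's standard xml.etree.ElementTree module. Combining the ISBN example with the Publisher example in one element is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET

# Identifier as an attribute, content as a sub-element.
book = ET.fromstring(
    '<Book ISBN="1-111-123">'
    "<Publisher> McGraw Hill </Publisher>"
    "</Book>"
)

# An attribute is read from the start tag of the element.
isbn = book.get("ISBN")

# A sub-element's text is part of the document content.
publisher = book.findtext("Publisher").strip()

print(isbn, publisher)
```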
XML Validation
• Well-Formed XML document:
• Is an XML document with the correct basic syntax
Rules For Well-Formed XML
• Should begin with the XML declaration (optional in XML 1.0; if present, it must come first in the document)
• Must have one unique root element
• Every start tag must have a matching end tag
• XML tags are case sensitive
• XML Schema
• XML Schema is an XML based alternative to DTD
• XML Validators
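The well-formedness rules listed above can be checked with any XML parser, because a conforming parser rejects documents that break them. The following minimal sketch uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are assumptions made for illustration. It checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<Book><Year>1999</Year></Book>"
bad_case = "<Book><Year>1999</year></Book>"   # end tag differs in case
two_roots = "<Book/><Book/>"                  # more than one root element

for sample in (good, bad_case, two_roots):
    print(is_well_formed(sample), sample)
```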
CDATA
• By default, all text inside an XML document is parsed
• Text inside a CDATA section, written <![CDATA[ ... ]]>, is not parsed
• Any characters, even & and <, can occur inside a CDATA section
• The only real restriction is that the character sequence ]]> cannot occur
inside a CDATA section
• CDATA is useful when your text has a lot of illegal characters (for
example, if your XML document contains some HTML text)
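The effect of a CDATA section can be seen by parsing the same HTML fragment with and without it. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the note/body element names and the sample text are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inside CDATA the characters < and & are kept as plain text.
with_cdata = "<note><body><![CDATA[<b>Sale</b> on tea & coffee]]></body></note>"
body = ET.fromstring(with_cdata).findtext("body")
print(body)

# Without CDATA the bare & makes the document ill-formed.
without_cdata = "<note><body><b>Sale</b> on tea & coffee</body></note>"
try:
    ET.fromstring(without_cdata)
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False
print(parsed_without_cdata)
```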
XML and DTDs
• A DTD (Document Type Definition) describes the structure of one or more
XML documents.
Why DTDs?
• With DTD, each of your XML files can carry a description of its
own format with it.
• Your application can use a standard DTD to verify that the data
you receive from the outside world is valid.
Parsers
An XML example
• <novel>
<foreword>
<paragraph> This is a great novel </paragraph>
</foreword>
<chapter number="1">
<paragraph>It was a dark and stormy night.</paragraph>
<paragraph>Suddenly, a shot rang out!</paragraph>
</chapter>
</novel>
• An XML document contains (and the DTD describes):
• Elements, such as novel and paragraph, consisting of tags and content
• Attributes, such as number="1", consisting of a name and a value
• Entities (not used in this example)
A DTD example
• <!DOCTYPE novel [
<!ELEMENT novel (foreword, chapter+)>
<!ELEMENT foreword (paragraph+)>
<!ELEMENT chapter (paragraph+)>
<!ELEMENT paragraph (#PCDATA)>
<!ATTLIST chapter number CDATA #REQUIRED>
]>
• A novel consists of a foreword and one or more chapters, in that order
• Each chapter must have a number attribute
• A foreword consists of one or more paragraphs
• A chapter also consists of one or more paragraphs
• A paragraph consists of parsed character data (text that cannot contain any
other elements)
ELEMENT descriptions
• Suffixes:
?    optional          foreword?
+    one or more       chapter+
*    zero or more      appendix*
• Separators:
,    both, in order    foreword?, chapter+
|    or                section|chapter
• Grouping:
()   grouping          (section|chapter)+
• #REQUIRED    Attribute is not optional
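The suffixes, separators and grouping above behave much like regular-expression operators applied to the sequence of child element names. The sketch below makes that analogy concrete for the novel DTD. It is not a DTD validator: the content models are translated to regular expressions by hand, and the names CONTENT_MODELS, REQUIRED_ATTRIBUTES, element_conforms and document_conforms are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models. Every child tag name is followed by one
# space so that the regex operators +, ?, * and | apply to whole names.
CONTENT_MODELS = {
    "novel": r"(foreword )(chapter )+",   # (foreword, chapter+)
    "foreword": r"(paragraph )+",         # (paragraph+)
    "chapter": r"(paragraph )+",          # (paragraph+)
    "paragraph": r"",                     # (#PCDATA): no child elements
}
REQUIRED_ATTRIBUTES = {"chapter": ["number"]}   # number CDATA #REQUIRED

def element_conforms(element):
    """Check one element's children and required attributes."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        return False   # element not declared in the DTD
    child_tags = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, child_tags) is None:
        return False
    required = REQUIRED_ATTRIBUTES.get(element.tag, [])
    return all(name in element.attrib for name in required)

def document_conforms(root):
    """Check the root element and all of its descendants."""
    return all(element_conforms(element) for element in root.iter())

NOVEL = """<novel>
<foreword><paragraph> This is a great novel </paragraph></foreword>
<chapter number="1">
<paragraph>It was a dark and stormy night.</paragraph>
<paragraph>Suddenly, a shot rang out!</paragraph>
</chapter>
</novel>"""
NO_FOREWORD = '<novel><chapter number="1"><paragraph>x</paragraph></chapter></novel>'
NO_NUMBER = ("<novel><foreword><paragraph>x</paragraph></foreword>"
             "<chapter><paragraph>y</paragraph></chapter></novel>")

valid = document_conforms(ET.fromstring(NOVEL))
missing_foreword = document_conforms(ET.fromstring(NO_FOREWORD))
missing_number = document_conforms(ET.fromstring(NO_NUMBER))
print(valid, missing_foreword, missing_number)
```

A real validating parser reads the DTD itself and also checks attribute types and undeclared attributes, which this sketch ignores.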
Another example: XML
<?xml version="1.0"?>
<!DOCTYPE weatherReport SYSTEM "http://www.mysite.com/mydoc.dtd">
<weatherReport>
<date>05/29/2002</date>
<location>
<city>Philadelphia</city>
<state>PA</state>
<country>USA</country>
</location>
<temperature-range>
<high scale="F">84</high>
<low scale="F">51</low>
</temperature-range>
</weatherReport>
The DTD for this example
<!ELEMENT weatherReport (date, location, temperature-range)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (city, state, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT temperature-range ((low, high)|(high, low))>
<!ELEMENT low (#PCDATA)>
<!ELEMENT high (#PCDATA)>
<!ATTLIST low scale (C|F) #REQUIRED>
<!ATTLIST high scale (C|F) #REQUIRED>
XML Schema …
• <?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body> Don't forget me this weekend!</body>
</note>
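• A DTD file (for example, "note.dtd") can define the elements of the
XML document above ("note.xml"); an XML Schema can describe the same structure.
• The <schema> element is the root element of every XML Schema:
<xs:schema>
...
...
</xs:schema>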
• The <schema> element may contain some attributes. A schema declaration often looks
something like this:
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.w3schools.com"
xmlns="http://www.w3schools.com"
elementFormDefault="qualified">
...
...
</xs:schema>
Information Retrieval by Using
XPath and XQuery
XML Tree
• An XML file can be viewed as a tree with a root element and leaves that are
connected by branches.
• A tree has a root element and child elements.
• Any element can be a parent, so it can have sub-elements (child elements).
• An element can have text content and attributes.
• Children on the same level are called siblings (brothers and sisters).
<?xml version="1.0" encoding="ISO-8859-1" ?>
<library> <!-- root element -->
  <book>
    <chapter> </chapter>
    <chapter>
      <section>
        <paragraph>A </paragraph>
        <paragraph>B </paragraph>
      </section>
    </chapter>
  </book>
</library>
XPath
• XPath is a syntax used for selecting parts of an XML
document
• The way XPath describes paths to elements is similar to the
way an operating system describes paths to files
• XPath is almost a small programming language; it has
functions, tests, and expressions
• XPath is a W3C standard
Terminology
• library is the parent of book; book is the parent of the two chapters
• The two chapters are the children of book, and the section is the child of the
second chapter
• The two chapters of the book are siblings (they have the same parent)
• library, book, and the second chapter are the ancestors of the section
<library>
  <book>
    <chapter>
    </chapter>
    <chapter>
      <section>
        <paragraph>A </paragraph>
        <paragraph>B </paragraph>
      </section>
    </chapter>
  </book>
</library>
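The parent, child, sibling and ancestor relationships described here can be computed directly from a parsed tree. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the names parent_of and ancestors are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """
<library>
  <book>
    <chapter></chapter>
    <chapter>
      <section>
        <paragraph>A</paragraph>
        <paragraph>B</paragraph>
      </section>
    </chapter>
  </book>
</library>
"""

root = ET.fromstring(LIBRARY_XML)

# ElementTree elements do not store a link to their parent,
# so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}

def ancestors(element):
    """Return the tags of all ancestors of an element, nearest first."""
    tags = []
    while element in parent_of:
        element = parent_of[element]
        tags.append(element.tag)
    return tags

book = root.find("book")
chapters = list(book)                 # the children of book
section = root.find(".//section")

chapter_tags = [chapter.tag for chapter in chapters]
are_siblings = parent_of[chapters[0]] is parent_of[chapters[1]]
section_ancestors = ancestors(section)

print(chapter_tags)
print(are_siblings)
print(section_ancestors)
```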
• A path that does not begin with a / represents a path starting from the
current element
• Example: header/from
• A path that begins with // can start from anywhere in the document
• Example: //header/from selects every from element that is a child of a header
element
• This can be expensive, since it involves searching the entire document
Brackets and last()
• A number in brackets selects a particular matching child
• Example: /library/book[1] selects the first book of the library
• Example: //chapter/section[2] selects the second section of every chapter in the XML
document
• Example: //book/chapter[1]/section[2]
• Only matching elements are counted; for example, if a book has both sections and exercises,
the latter are ignored when counting sections
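The bracket examples above can be tried out with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. In this sketch the paths are written relative to the root element library (for example book[1] instead of /library/book[1], and .//chapter instead of //chapter), because ElementTree does not accept absolute paths on an element. The sample document, including the exercise element and the section texts, is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """
<library>
  <book>
    <chapter>
      <section>1.1</section>
      <exercise>E1</exercise>
      <section>1.2</section>
    </chapter>
    <chapter>
      <section>2.1</section>
      <section>2.2</section>
      <section>2.3</section>
    </chapter>
  </book>
</library>
"""

root = ET.fromstring(LIBRARY_XML)   # root is the library element

# Counterpart of /library/book[1]: the first book of the library.
first_books = root.findall("book[1]")

# Counterpart of //chapter/section[2]: the second section of every chapter.
# The exercise element is not counted, only section elements are.
second_sections = [s.text for s in root.findall(".//chapter/section[2]")]

# Counterpart of //book/chapter[1]/section[2].
specific = [s.text for s in root.findall(".//book/chapter[1]/section[2]")]

# last() selects the last matching child.
last_sections = [s.text for s in root.findall(".//chapter/section[last()]")]

print(len(first_books), second_sections, specific, last_sections)
```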
XQuery
• XQuery is the language for querying XML data
XQuery Example
• Example:
• Output
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
XQuery FLWOR Expressions
• The syntax of a FLWOR (pronounced "flower") expression looks like a combination of SQL and path
expressions
• The following path expression selects the title elements of all book
elements under the bookstore element whose price element has a
value higher than 30.
doc("books.xml")/bookstore/book[price>30]/title
• The following FLWOR expression will select exactly the same as the path
expression above
for $x in doc("books.xml")/bookstore/book
where $x/price>30
return $x/title
• Output
<title lang="en">XQuery Kick Start</title>
<title lang="en">Learning XML</title>
-- FLWOR briefly explained
for $x in doc("books.xml")/bookstore/book
where $x/price>30
order by $x/title
return $x/title
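FLWOR stands for For, Let, Where, Order by, Return. The query above uses four of these clauses, and each one has a close counterpart in ordinary code. The sketch below mirrors the query with Python's standard xml.etree.ElementTree module rather than an XQuery engine. The contents of books.xml are not shown in the lecture, so the inline sample is an assumption: only the Harry Potter price comes from the earlier output, and the other prices are hypothetical values chosen so that the same two titles are selected.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml (see the assumptions stated above).
BOOKS_XML = """
<bookstore>
  <book category="COOKING"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="CHILDREN"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="WEB"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="WEB"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
"""

bookstore = ET.fromstring(BOOKS_XML)

# for $x in doc("books.xml")/bookstore/book
# where $x/price > 30
selected = [
    book for book in bookstore.findall("book")
    if float(book.findtext("price")) > 30
]

# order by $x/title
selected.sort(key=lambda book: book.findtext("title"))

# return $x/title
titles = [book.findtext("title") for book in selected]
print(titles)
```

Because of the order by clause, the titles come back sorted alphabetically rather than in document order.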
- References
• http://www.w3schools.com/xml/default.asp
• http://www.w3.org/XML/
• XML Tutorials
• http://www.programmingtutorials.com/xml.aspx
• http://xml.coverpages.org/