PSIT-Pranveer Singh Institute of Technology

Kanpur-Delhi National Highway (NH-2), Bhauti, Kanpur-209305 (U.P.), India

XML-eXtensible Markup Language


1. INTRODUCTION

A markup language is a computer language that uses tags to define elements within a document. It is human-
readable, meaning markup files contain standard words, rather than typical programming syntax. While several
markup languages exist, the two most popular are HTML and XML.

a. Physical markup: focuses on presentation, e.g., HTML.

b. Logical markup: focuses on the storage and transfer of data, e.g., XML.

XML stands for eXtensible Markup Language. It is used for data exchange.

<?xml version="1.0" encoding="UTF-8"?>        → XML prolog

<message>                                     → root element
    <from> A </from>                          → child elements
    <to> B </to>
    <subject> good morning </subject>
    <data> please have some tea </data>
</message>

or, with the same fields written as attributes:

<message to="" from="" subject="" data=""/>

1.1. XML PROPERTIES


a) XML is a markup language that focuses on data rather than how the data looks.
b) XML is designed to store and transport data. It does not display anything by itself; presentation is left
to other technologies such as HTML.
c) XML is different from HTML. XML focuses on data while HTML focuses on how the data looks.
d) XML does not depend on software and hardware; it is platform and programming language independent.
e) Unlike HTML where most of the tags are predefined, XML doesn’t have predefined tags, rather you have
to create your tags.

Example 1: Write an XML document to store student data.

<students>
    <student>
        <rollno>1</rollno>
        <first_name>Akarsh</first_name>
        <last_name>Malhotra</last_name>
        <branch>CSE</branch>
        <section>A</section>
    </student>
    <student>
        <rollno>2</rollno>
        <first_name>Prateek</first_name>
        <last_name>Bajpai</last_name>
        <branch>CSE</branch>
        <section>A</section>
    </student>
</students>
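Because an XML document is a tree of named elements, any program can read the student data of Example 1 without knowing how it was produced. The following is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The names STUDENTS_XML and read_students are illustrative and are not part of these notes.

```python
import xml.etree.ElementTree as ET

# The student document from Example 1, held as a string for illustration.
STUDENTS_XML = """
<students>
    <student>
        <rollno>1</rollno>
        <first_name>Akarsh</first_name>
        <last_name>Malhotra</last_name>
        <branch>CSE</branch>
        <section>A</section>
    </student>
    <student>
        <rollno>2</rollno>
        <first_name>Prateek</first_name>
        <last_name>Bajpai</last_name>
        <branch>CSE</branch>
        <section>A</section>
    </student>
</students>
"""


def read_students(xml_text):
    """Return one dictionary per <student>, mapping each child tag to its text."""
    root = ET.fromstring(xml_text.strip())
    students = []
    for student in root.findall("student"):
        # Each child element (rollno, first_name, ...) becomes a dictionary entry.
        record = {child.tag: (child.text or "").strip() for child in student}
        students.append(record)
    return students


if __name__ == "__main__":
    for record in read_students(STUDENTS_XML):
        print(record["rollno"], record["first_name"], record["last_name"])
```

The tag names are the only contract between the writer and the reader of the document, which is why the same file can be consumed by programs written in any language.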

Department of Computer Science & Engineering



1.2. FEATURES OF XML

(i) XML focuses on data rather than how it looks: One reason XML is popular is that it focuses on data
rather than on data presentation. Presentation is handled by another markup language, such as HTML.
Keeping the data separate from its presentation gives us the freedom to present the data any way we
want once we receive it as XML. Two or more systems can receive the same XML data and each present
it differently, for example using HTML.
(ii) Easy and efficient data sharing: Since XML is software and hardware-independent, it is easier to share
data between different systems with different hardware and software configurations. Any system with
any programming language can read and process an XML document.
(iii) Compatibility with HTML: It is easy to read data from XML and display it on a GUI (graphical user
interface) using HTML. When the data changes over time, only the XML changes; we need not make any
changes in the HTML.
(iv) Supports platform transition: Changing to new systems and platforms is challenging mainly because it
involves converting data between incompatible formats, which often results in data loss. XML simplifies
this process because the data can be moved to the upgraded systems as XML, without such loss.
(v) Allows XML validation: An XML document can be validated using a DTD or an XML schema. Validation
checks that the document follows the declared structure, which avoids issues that may arise from
incorrect XML.
(vi) Adapts technology advancements: The reason why XML is popular and being used for a very long
time is that it can adapt to the new technologies because of its platform-independent nature.
(vii) XML supports Unicode: XML supports Unicode that allows it to communicate almost any information
in any written human language.

1.3. ADVANTAGES & DISADVANTAGES OF XML


1.3.1. ADVANTAGES
(i) XML is platform-independent and programming language independent, thus it can be used on any system
and supports the technology change when that happens.
(ii) XML supports Unicode. Unicode is an international encoding standard for use with different languages
and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across
different platforms and programs. This feature allows XML to transmit any information written in any
human language.
(iii) The data stored and transported using XML can be changed at any time without affecting the data
presentation. Generally, another markup language such as HTML is used for presentation: HTML gets the
data from XML and displays it on the GUI (graphical user interface). Once the data is updated in XML,
the change is reflected in the display without any change to the HTML.
(iv) XML allows validation using a DTD or a schema. Validation ensures that the XML document follows the
declared structure.
(v) XML simplifies data sharing between various systems because of its platform-independent nature. XML
data doesn’t require any conversion when transferred between different systems.

1.3.2. DISADVANTAGES
(i) XML syntax is verbose and redundant compared to other text-based data transmission formats such as
JSON.
(ii) The redundancy in the syntax of XML causes higher storage and transportation costs when the volume
of data is large.
(iii) XML documents are often less readable than other text-based data transmission formats such as JSON.
(iv) XML has no built-in array type; a list has to be represented by repeating an element.
(v) XML files are usually large because of this verbosity, and the exact size also depends on how the
author chooses to structure the document.
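The points about verbosity and arrays can be made concrete by serializing one record both ways. The sketch below is only an illustration: the record, its field names, and the helper functions to_json and to_xml are hypothetical and do not come from these notes.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical record containing a list, used only for this comparison.
student = {"rollno": 1, "first_name": "Akarsh", "subjects": ["DBMS", "OS", "CN"]}


def to_json(record):
    """JSON has a native array notation, so the list is written as [...]."""
    return json.dumps(record)


def to_xml(record):
    """XML has no array type; the list becomes repeated <subject> elements."""
    root = ET.Element("student")
    ET.SubElement(root, "rollno").text = str(record["rollno"])
    ET.SubElement(root, "first_name").text = record["first_name"]
    subjects = ET.SubElement(root, "subjects")
    for name in record["subjects"]:
        ET.SubElement(subjects, "subject").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(len(to_json(student)), to_json(student))
    print(len(to_xml(student)), to_xml(student))
```

Every value in the XML form is wrapped in an opening and a closing tag, which is where the extra size comes from.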


2. NEED OF XML

Different systems run different operating systems and hold data in different formats. Transferring data between
such systems is difficult because the data must be converted into a compatible format before it can be used on
another system. With XML, transferring data between such systems is easy, because XML does not depend on the
platform or the programming language.


• XML can be used to exchange data between compatible or incompatible applications, whether Web-based or not.
• XML simplifies the process of data exchange between two or more applications.

Now, the question is, why not use the existing Database Management System (DBMS) products such as Oracle,
SQL Server, IMS, IDMS, and Informix, etc., for exchanging data over the Internet (and also outside of the
Internet)? The reason is the incompatibility of various kinds. These DBMS products are extremely popular and
provide great data storage and access mechanisms. However, they are not always compatible with each other in
terms of sharing or transferring data. Their formats, internal representations, data types, encoding, etc., are
different. This creates problems in data exchange.

This is similar to a situation when one person understands only English and the other understands only Hindi.
English and Hindi by themselves are great languages. However, they are not compatible with each other.
Similarly, for instance, suppose organization X uses Oracle as its DBMS (relational) and organization Y uses IMS
as its DBMS (Hierarchical). Each of these DBMS systems internally represents the data in their formats as well
as by using data structures such as chains, indexes, lists, etc. Now, whenever X and Y want to exchange any kind
of data (say the list of products available, last month’s sales report, etc.), they would not be able to do this directly.
Consider the following figure.

Database Management Systems (DBMS) are incompatible with each other when it comes to data exchange.

If X and Y want to exchange data, the simple solution would be that they agree on a common data format, and use
that format for data exchange. For example, when X wants to send an inventory status to Y, it would first convert
that data from Oracle format into this common format and then send it to Y. When Y receives this data, it would
convert the data from this common format into IMS format, and then its applications can use it. In the simplest
case, this common format can be a text file.

This approach of exchanging data in the text format seems to be fine. After all, all that is needed is some data
transformation programs at both ends, which either read from or write to text format from the native (Oracle/IMS)
format. This approach would be very similar to the one used in our translator approach for human conversations.
But there are some issues with this approach as well, in addition to what we had discussed earlier in the context
of human conversations.

• For instance, suppose another organization Z now wants to do business with X and Y. Therefore, X and
Y now need to exchange data with Z also. Suppose that Z is already interacting with other business
partners such as A and B. Now, if Z is using a different text format for data exchange with A and B, its
data exchange text formats with X/Y and A/B would be different! That is, for exchanging the same data
with different business partners, different application programs might be required.
• Also, suppose that these business partners specify some business rules. For instance, Z mandates that a
sales order arriving from any of its business partners (i.e., A, B, X, or Y) must carry at least three items.
For this, appropriate logic can be incorporated in the application program at its end to validate this rule,
whenever it receives any sales order from one of its business partners. However, can we not apply this
business rule before the data is sent by any of the business partners, rather than first accepting the data
and then validating it? If different data exchanges among different business partners demand different
business rules like this, it might be difficult to apply them in the text format.

HTML is the de facto language of the Internet. HTML defines a set of tags describing how the Web browser
should display the contents of a particular document to the end-user. For example, it uses tags that indicate that a
particular portion of the text is to be made boldface, underlined, small, big, and so on. In addition, we can display
lists of values using bullets, or create tables on the screen by using HTML.

The similarity between XML and HTML is that both languages use tags to structure documents. This,
incidentally, is perhaps the only real similarity between the two!

Although XML uses tags to organize documents and their contents just as HTML does, it is not concerned with
the presentation features of a document. XML is more concerned with the meaning and rules of the data
contained in a document. XML describes what the various data items in a document mean, rather than describing
how to display them. Therefore, whereas HTML is an information presentation language, XML is an information
description language. Thus, conceptually, XML is quite similar to a data definition language. HTML concentrates
on the display/presentation of data to the end-user, whereas XML deals with the representation of data in
documents.


3. XML TERMINOLOGY

XML files conventionally use the .xml extension. Let us call the above file books.xml. As we can see, the file
contains information organized hierarchically, with some unfamiliar symbols. Let us understand this example step
by step. In the process, we will become familiar with XML syntax and terminology.


4. INTRODUCTION TO DTD

Consider an XML document that we intend to write for capturing bank account information. We would like to
see data such as the account number, account holder’s name, opening balance, type of account, etc., as the
fields for which we want to capture information. However, at the same time, we also wish to ensure that this
XML document does not contain any other irrelevant information. For instance, we would like to make sure
that our XML document does not contain information about students, books, projects, or data not needed.

In short, we need easy mechanisms for validating an XML document. For example, we should be able to
specify and validate, which elements, attributes, etc., are allowed in an XML document.

A DTD allows us to validate the contents of an XML document.

For example, a DTD allows us to specify that a book XML document can contain exactly one book
name and at most two author names. A DTD is usually a file with the extension .dtd, although this extension
is optional; technically, a DTD file need not have any extension. We can specify the relationship between an
XML document and a DTD. That is, we can state that for a given XML file, we want to use a given DTD file,
and we write the rules that we want to apply in that DTD file. Once this linkage is established, a validating
parser checks the contents of the XML document against these rules automatically whenever we attempt to make
use of the XML document.

Imagine a situation where we do not have anything such as a DTD. Yet, let us imagine that we want to
apply certain rules. How can we accomplish this? Well, there is no simple solution here. The programs that use
the XML document will need to perform all these validations before they can make use of the contents of the
XML document. Of course, it is not impossible. However, the validation would need to be performed by every
program that wants to use this XML document for any purpose. Otherwise, there is no guarantee that the XML
document is free of bad data.


A DTD frees application programs from the worry of validating the contents of an XML
document, because the validation rules are concentrated in just one place: inside the DTD. All other
parties interested in the contents of an XML document are free to concentrate on what they want to do,
i.e., to make use of the XML document the way they want and process it as appropriate, while a validating
parser checks the contents of the XML document against the DTD on behalf of any program or application.

• A DTD helps us specify the rules for validating the contents of an XML document in one place,
thereby allowing the application programs to concentrate on the processing of the XML document.
• A DTD is usually stored in a file with a .dtd extension.
• The contents of this file are purely textual.


4.1. DOCUMENT TYPE DEFINITION

An XML document contains a reference to a DTD file. This is similar to, for example, how a C program would
include references to various header files, or a Java program would include packages.

A DOCTYPE declaration in an XML document specifies that we want to include a reference to a DTD file.

Whenever any program (usually called an XML parser) reads our XML document containing a DOCTYPE
tag, it understands that we have defined a DTD for our XML document. Therefore, it attempts to also load and
interpret the contents of the DTD file. In other words, it applies the rules specified in the DTD to the contents
of our XML document for verifying them.

The DOCTYPE declaration stands for a document type declaration.

Note that the DOCTYPE tag is written as <!DOCTYPE …>.

There are two types of DTDs, internal DTD and external DTD, also called the internal subset and the
external subset, respectively.

An internal subset means that the contents of the DTD are inside an XML document itself. On the other hand,
an external subset means that an XML document has a reference to another file, which we call an external
subset.
Let us take a simple example. Suppose we want to define an XML document containing a book name as
the only element. We also wish to write a corresponding DTD, which will define the template or rule book for
our XML document. Then we have two situations: the DTD can be internal or external. Let us call our XML
document book.xml and our external DTD book.dtd. Note that when the DTD is internal, there is no need
to provide a separate name for the DTD (since the contents of the DTD are inside the XML
document anyway). But when the DTD is external, we must give this DTD file a name.


As we can see, when a DTD is internal, we embed the contents of the DTD inside the XML document, as
shown in case (a). However, when a DTD is external, we simply provide a reference to the DTD inside our
XML document, as shown in case (b). The actual DTD file has a separate existence of its own.
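The two cases can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original figure: the book title is made up, and the two declarations inside the internal DTD are inferred from the description of book.dtd in these notes. Python's xml.etree.ElementTree is used only to show that both documents carry the same content; it does not validate against the DTD and does not load book.dtd.

```python
import xml.etree.ElementTree as ET

# (a) Internal DTD: the declarations sit inside the DOCTYPE, between [ and ].
INTERNAL = """<?xml version="1.0"?>
<!DOCTYPE myBook [
  <!ELEMENT myBook (book_name)>
  <!ELEMENT book_name (#PCDATA)>
]>
<myBook>
  <book_name>Computer Networks</book_name>
</myBook>"""

# (b) External DTD: the document only names the file that holds the declarations.
EXTERNAL = """<?xml version="1.0"?>
<!DOCTYPE myBook SYSTEM "book.dtd">
<myBook>
  <book_name>Computer Networks</book_name>
</myBook>"""


def book_name(xml_text):
    """Return the root tag and the text of its <book_name> child."""
    root = ET.fromstring(xml_text)
    return root.tag, root.findtext("book_name")


if __name__ == "__main__":
    print(book_name(INTERNAL))
    print(book_name(EXTERNAL))
```

The element content is identical in both cases; only the place where the rules live differs.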

When should we use an internal DTD, and when should we use an external DTD? For simple situations,
internal DTDs work well. However, external DTDs help us in two ways:

(i) External DTDs allow us to define a DTD once and then refer to it from any number of XML documents.
Thus, they are reusable. Also, if we need to make any changes to the contents of the DTD, the change needs
to be made just once (to the DTD file).
(ii) External DTDs reduce the size of the XML documents since the XML documents now contain just a
reference to the DTD, rather than the actual contents of the DTD.

There is another keyword we need to remember in the context of internal DTDs: standalone.

An XML document can be declared as standalone if it does not depend on an external DTD.

The keyword standalone is written as an attribute of the XML declaration at the top of the document, for
example <?xml version="1.0" standalone="yes"?>.
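A parser reports the standalone setting together with the rest of the XML declaration. The sketch below, which assumes Python's standard xml.parsers.expat module, reads that flag; the function name read_standalone and the two sample documents are illustrative only.

```python
import xml.parsers.expat


def read_standalone(xml_text):
    """Return the standalone flag that expat reports for the XML declaration.

    expat passes 1 for standalone="yes", 0 for standalone="no",
    and -1 when the declaration does not mention standalone at all.
    """
    seen = {"standalone": None}

    def on_xml_declaration(version, encoding, standalone):
        seen["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_declaration
    parser.Parse(xml_text, True)
    return seen["standalone"]


# Hypothetical sample documents for the book example.
WITH_FLAG = '<?xml version="1.0" standalone="yes"?><myBook><book_name>X</book_name></myBook>'
WITHOUT_FLAG = '<?xml version="1.0"?><myBook><book_name>X</book_name></myBook>'

if __name__ == "__main__":
    print(read_standalone(WITH_FLAG))
    print(read_standalone(WITHOUT_FLAG))
```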


Let us now understand the syntax of the DTD declaration or reference, i.e., regardless of whether the DTD
is internal or external. We know that the internal DTD declaration looks like this in our example:

This DTD declaration indicates that our XML document will contain a root element called myBook,
which, in turn, contains an element called book_name. Also, the contents of the DTD need to be wrapped inside
square brackets. The brackets tell the XML parser where the DTD syntax starts and ends, and help
it differentiate between the DTD contents and the XML contents. On the other hand, the external DTD reference
looks like this:
<!DOCTYPE myBook SYSTEM "book.dtd">

This does not give us an idea about the actual contents of the DTD file, since the DTD is external.
Let us now worry about the DOCTYPE syntax. In general, the basic syntax for the DOCTYPE line is as shown
below:

Let us understand what it means.

1. The DOCTYPE keyword indicates that this is either an internal declaration of a DTD or a reference to
an external DTD.

2. Regardless of whether it is internal or external, this is followed by the name of the root element in the
XML document.

3. This is followed by the actual contents of the DTD (if the DTD is internal), or by the name of the DTD
file (if it is an external DTD). This is currently shown with dots (…).

Therefore, we can now enhance our DOCTYPE declaration as follows:


4.2. ELEMENT TYPE DECLARATION

Elements are the backbone of any XML document. If we want to associate a DTD with an XML document, we
need to declare in the DTD all the elements that we would like to see in the XML document. After all, a DTD
is a template or rule book for an XML document. An element is declared in a DTD by using an element type
declaration (the ELEMENT tag).

For example, to declare an element called book_name, we can use the following declaration:
<!ELEMENT book_name (#PCDATA)>

As we can see, book_name is the name of the element, and its content is declared as PCDATA (parsed character
data). In XML jargon, the element name is called the generic identifier and the description of its content is
called the content specification.

The element name must be unique within a DTD.

Let us consider an example. Suppose that we want to store just the name of a book in our XML document.
The example below shows a sample XML document and the corresponding DTD that specifies the rules for this
XML document. Note that we are using an external DTD. We have added line numbers only so that the
discussion can refer to individual lines. The actual XML document and DTD never have line numbers.

Understanding the XML document (book.xml)


Line 1 indicates that this is an XML document.
Line 2 is a comment.
Line 3 declares a document type reference. It indicates that our XML document makes use of an
external DTD. The name of this external DTD is book.dtd. Also, the root element of our XML document
is an element called myBook.


Lines 4–6 define the actual contents of our XML document. These consist of an element called
book_name.

Understanding the DTD (book.dtd)

Line 1 is an element type declaration. It indicates that the root element of the XML document that this
DTD will be used to verify has the name myBook. This root element (myBook) contains one sub-element
called book_name.
Line 2 states that the element book_name can contain parsed character data.
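To see how a parser reads element type declarations, the two declarations described above can be placed in an internal DTD and handed to a parser. The sketch below uses Python's standard xml.parsers.expat module and its ElementDeclHandler callback. The declarations are reconstructed from the description of book.dtd, the book title is made up, and the internal form is used because expat does not fetch external DTD files on its own.

```python
import xml.parsers.expat

# book.xml with the two book.dtd declarations moved into an internal DTD.
BOOK_XML = """<?xml version="1.0"?>
<!DOCTYPE myBook [
  <!ELEMENT myBook (book_name)>
  <!ELEMENT book_name (#PCDATA)>
]>
<myBook>
  <book_name>Computer Networks</book_name>
</myBook>"""


def declared_elements(xml_text):
    """Return the element names declared in the DTD, in declaration order."""
    names = []

    def on_element_declaration(name, model):
        # 'model' describes the content specification; only the name is kept here.
        names.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_declaration
    parser.Parse(xml_text, True)
    return names


if __name__ == "__main__":
    print(declared_elements(BOOK_XML))
```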

4.2.1. Specifying Sequences, Occurrences & Choices

Sequences: The first question is how we list more than one child element in an element type declaration. For
example, suppose that our book DTD needs to contain the book name and author name. For this, we simply
separate the two element names with a comma inside the parentheses. For example:

<!ELEMENT myBook (book_name, author)>

This declaration specifies that a myBook element should contain exactly one book name, followed by exactly
one author name, in that order. Any number of such book name-author pairs can exist only if the enclosing
declaration allows it, for instance by repeating the group with an occurrence indicator (see Occurrence below).
The following figure shows an example of specifying the address book.

As we can see, our address book contains sub-elements, such as street, region, postal code, locality, and country.
Each of these sub-elements is defined as a parsed character data field. Of course, we can extend the concept of
sub-elements further. That is, we can, for example, break down the street sub-element into street number and
street name. This is shown in the figure below.

Choices: Choices can be specified by using the pipe (|) character. This allows us to specify options of type A or
B. For example, we can specify that the result of an examination can be that the student has passed or failed (but
not both), as follows.

<!ELEMENT result (pass | fail)>


To a guest, we want to offer tea or coffee, but not both!
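Sequences and choices behave like regular expressions over the names of the child elements. The sketch below imitates that idea in Python; it is not a DTD validator. The regular expressions are hand-written translations of the two content models above, the helper names are illustrative, and the pass and fail elements are assumed to be empty.

```python
import re
import xml.etree.ElementTree as ET

# (book_name, author): book_name first, then author.
SEQUENCE_MODEL = re.compile(r"<book_name><author>")
# (pass | fail): exactly one of the two.
CHOICE_MODEL = re.compile(r"(<pass>|<fail>)")


def child_tags(element):
    """Join the tags of the direct children, e.g. '<book_name><author>'."""
    return "".join(f"<{child.tag}>" for child in element)


def satisfies(xml_text, model):
    """True if the children of the root element match the content model."""
    root = ET.fromstring(xml_text)
    return model.fullmatch(child_tags(root)) is not None


if __name__ == "__main__":
    print(satisfies("<result><pass/></result>", CHOICE_MODEL))
    print(satisfies("<result><pass/><fail/></result>", CHOICE_MODEL))
```

A result containing both pass and fail is rejected, and so is a myBook whose author comes before its book_name.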

Occurrence: The number of occurrences, or the frequency, of an element can be specified by using the plus
(+), asterisk (*), or question mark (?) characters. If we do not use any of the occurrence symbols (i.e., +, *, or ?),
then the element must occur exactly once. That is, the default frequency of an element is 1.

For example, we can specify that a book must contain one or more chapters as follows.

<!ELEMENT book (chapter+) >

We can apply the same concept to a group of sub-elements. For example, suppose that we want to specify
that a book must contain a title, followed by one or more chapter-author pairs. We can use this declaration.

<!ELEMENT book (title, (chapter, author)+ )>

A sample XML document conforming to this DTD declaration is shown in figure below:


Of course, the grouping of sub-elements for the purpose of specifying frequency is not restricted to the plus sign
(+). It can be done equally well for the asterisk (*) or question mark (?) symbols. The asterisk symbol (*) specifies
that the element may or may not occur. If it is used, it can repeat any number of times.

The DTD specifies that the XML document can depict zero or more employees in an organization. One sample
XML document has three employees, the other has none. Both are allowed. On the other hand, if we replace the
asterisk (*) with a plus sign (+), the situation changes. We must now have at least one employee. Therefore, the
empty organization case (i.e., an organization containing no employees) is now ruled out.

Finally, a question mark (?) indicates that the element is optional: it either does not occur at all or occurs
exactly once.

A nation can have only one president. This is indicated by the following declaration.

<!ELEMENT nation (president?) >

At times, of course, the nation may be without a president temporarily. However, at no point can a nation
have more than one president.
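The three occurrence indicators map directly onto the regular-expression operators of the same name. The following sketch checks documents against hand-translated content models; it only imitates what a validating parser does. The element names organization and employee are assumptions based on the description above, since the original figure is not reproduced here.

```python
import re
import xml.etree.ElementTree as ET

# Content models from this section, rewritten as regular expressions over child tags.
MODELS = {
    "book": re.compile(r"<title>(<chapter><author>)+"),  # (title, (chapter, author)+)
    "organization": re.compile(r"(<employee>)*"),         # (employee*)
    "nation": re.compile(r"(<president>)?"),              # (president?)
}


def is_valid(xml_text):
    """Check the children of the root element against the model for its tag."""
    root = ET.fromstring(xml_text)
    children = "".join(f"<{child.tag}>" for child in root)
    return MODELS[root.tag].fullmatch(children) is not None


if __name__ == "__main__":
    print(is_valid("<organization/>"))
    print(is_valid("<nation><president/><president/></nation>"))
```

An empty organization passes because * allows zero occurrences, whereas a nation with two presidents fails because ? allows at most one.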


4.3. ATTRIBUTE DECLARATION

Elements describe markup of an XML document. Attributes provide more details about the elements. An
element can have 0 or more attributes. For example, an employee XML document can contain elements to
depict the employee number, name, designation, and salary. The designation element, in turn, can have a
manager attribute that indicates the manager for that employee.

The keyword ATTLIST describes the attribute(s) for an element.

Figure shows an XML document containing an inline DTD. We can see that the element contains an
attribute.

We can see that the message element has three attributes: from, to, and subject. All the three attributes have a data
type of CDATA (which stands for character data), and a #REQUIRED keyword. The #REQUIRED keyword
indicates that this attribute must be a part of the element.
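A document of the kind described above can be sketched as follows. This is a reconstruction under stated assumptions rather than the original figure: the attribute values, the message text, and the (#PCDATA) content model are made up. Python's standard xml.parsers.expat module is used to list the attribute declarations through its AttlistDeclHandler callback; it reports the declarations but does not enforce #REQUIRED.

```python
import xml.parsers.expat

MESSAGE_XML = """<?xml version="1.0"?>
<!DOCTYPE message [
  <!ELEMENT message (#PCDATA)>
  <!ATTLIST message
      from    CDATA #REQUIRED
      to      CDATA #REQUIRED
      subject CDATA #REQUIRED>
]>
<message from="A" to="B" subject="greeting">Good morning</message>"""


def declared_attributes(xml_text):
    """Return (element, attribute, type, required) for every ATTLIST entry."""
    declared = []

    def on_attlist(element_name, attribute_name, attribute_type, default, required):
        declared.append((element_name, attribute_name, attribute_type, bool(required)))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(xml_text, True)
    return declared


if __name__ == "__main__":
    for entry in declared_attributes(MESSAGE_XML):
        print(entry)
```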


4.4. LIMITATIONS OF DTD

In spite of their several advantages, DTDs suffer from a number of limitations.

5. INTRODUCTION TO SCHEMA

We know that a DTD is used for validating the contents of an XML document. DTD is undoubtedly a very
important feature of the XML technology. However, there are a number of areas in which DTDs are weak. The
main argument against DTDs is that their syntax is not like that of XML documents. Therefore, people working
with DTDs have to learn a new syntax. Furthermore, this leads to practical problems: for example, we cannot
search for information inside DTDs or display their contents as HTML using ordinary XML tools.

A schema is an alternative to DTD.

It is expected that schemas would eventually completely replace most (but not all) features of DTDs. DTDs are
easier to write and provide support for some features (e.g., entities) better. However, schemas are much richer in
terms of their capabilities and extensibility. A schema document is a separate document, just like a DTD. However,
the syntax of a schema is like the syntax of an XML document. Therefore, we can state:

The main difference between a DTD and a schema is that the syntax of a DTD is different from that of XML.
However, the syntax of a schema is the same as that of XML.

In other words, a schema document is an XML document.

We declare an element in a DTD by using the syntax <!ELEMENT>. This is clearly not legal in XML. We cannot
begin an element declaration with an exclamation mark, as happens in the case of a DTD.
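The difference in syntax can be checked mechanically: a schema parses as an ordinary XML document, while a DTD declaration on its own does not. The sketch below uses Python's xml.etree.ElementTree for this check. The schema text is the small MESSAGE schema that these notes use later; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# A schema is written in XML syntax.
SCHEMA_TEXT = """<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="MESSAGE" type="xsd:string"/>
</xsd:schema>"""

# A DTD declaration uses its own syntax, which is not an XML element.
DTD_TEXT = "<!ELEMENT book_name (#PCDATA)>"


def is_well_formed_xml(text):
    """True if the text can be parsed as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed_xml(SCHEMA_TEXT))
    print(is_well_formed_xml(DTD_TEXT))
```

Because the schema is itself XML, the same tools used to search, transform, or display XML documents can be applied to it.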

We can use a very simple, yet powerful example to illustrate the difference between using a DTD and using a
schema. Suppose that we want to represent the marks of a student in an XML document. For this purpose, we
want to add an element called Marks to our root element Student. We will declare this element as of type
PCDATA in our DTD file. This will ensure that the parser checks for the existence of the Marks element in the
XML document. However, can it ensure that marks are numeric? Clearly, no! We cannot control what contents
the element Marks can have. These contents can very well be alphabetic or alphanumeric.

As we can see, the usage of PCDATA in the declaration of an element does not stop us from entering alphabetic
data in a Marks element. In other words, we cannot specify exactly what our elements should contain. This is
clearly not desirable. In the case of a schema, we can specify that our element should only
contain numeric data. Moreover, we can control many other aspects of the contents of elements, which is not
possible in the case of DTDs. We use the same terminology for the correctness of XML documents in the
case of a schema as in the case of DTDs: an XML document that conforms to the rules of a schema is called
a valid XML document. Otherwise, it is called invalid.
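The kind of check that a schema adds can be imitated in a few lines. The Python standard library does not include an XML Schema validator, so the sketch below is only an emulation of what declaring Marks with the schema type xsd:integer would enforce: an optional sign followed by digits. The function name and the sample documents are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Lexical form of xsd:integer: an optional sign followed by one or more digits.
INTEGER = re.compile(r"[+-]?[0-9]+")


def marks_are_numeric(xml_text):
    """True if the <Marks> child of the root exists and holds an integer."""
    root = ET.fromstring(xml_text)
    marks = root.findtext("Marks")
    return marks is not None and INTEGER.fullmatch(marks.strip()) is not None


if __name__ == "__main__":
    print(marks_are_numeric("<Student><Marks>85</Marks></Student>"))
    print(marks_are_numeric("<Student><Marks>eighty</Marks></Student>"))
```

A DTD that declares Marks as (#PCDATA) accepts both sample documents; a schema with a numeric type accepts only the first.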

Consider an XML document which contains a greeting message:

First and foremost, an XML schema is defined in a separate file. This file has the extension xsd.

In our example, the schema file is named message.xsd. The following declaration in our XML document indicates
that we want to associate this schema with our XML document:
<MESSAGE xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="message.xsd">

Let us dissect this statement:

1. The word MESSAGE indicates the root element of our XML document. There is nothing unusual
about it.

2. The declaration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" is an attribute. It defines
a namespace prefix and a namespace URI. Here, xmlns is the reserved keyword that introduces a namespace
declaration, and the namespace prefix is xsi. The namespace URI is
http://www.w3.org/2001/XMLSchema-instance. The namespace prefix can change (xsi is only a convention). The
namespace URI must be written exactly as shown. The namespace URI identifies the XML Schema instance
namespace, which defines the schema-related attributes (such as noNamespaceSchemaLocation) that an XML
document can carry.

3. The declaration xsi:noNamespaceSchemaLocation="message.xsd" specifies a particular schema
file which we want to associate with our XML document. In this case, we are stating that our XML
document wants to refer to a schema file whose name is message.xsd.

This is followed by the actual contents of our XML document. In this case, the contents are nothing but the
contents of our root element.

Note that the schema file is an XML file with an extension of xsd. That is, like any XML document, it begins with
an <?xml …?> declaration. The following lines specify that this is a schema file, and not an ordinary XML
document. They also contain the actual contents of the schema. Let us first reproduce them:
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
<xsd:element name = "MESSAGE" type = "xsd:string"/>
</xsd:schema>

Let us understand this step by step:

1. The declaration <xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"> indicates that this is a schema,
because its root element is named schema. It has a namespace prefix of xsd. The namespace URI is
http://www.w3.org/2001/XMLSchema. This means that our schema declarations use the vocabulary of the XML Schema
standard, which this URI identifies, and that we can use a namespace prefix of xsd to refer to that vocabulary
in our schema file.

2. The declaration <xsd:element name = "MESSAGE" type = "xsd:string"/> specifies that we want to use an element
called MESSAGE in our XML document. The type of this element is string. Also, we are using the namespace
prefix xsd. Recall that this namespace prefix was associated with the namespace URI
http://www.w3.org/2001/XMLSchema in our earlier statement.

3. The line </xsd:schema> specifies the end of the schema.
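
As a hedged illustration of this schema in use, the Java sketch below validates a few candidate documents against it. The class name MessageSchemaDemo, the method firstError and the sample greeting text are assumptions introduced here. The schema is passed in from a string rather than read from message.xsd, so the sample documents omit the xsi:noNamespaceSchemaLocation attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class MessageSchemaDemo {

    // The schema from the text, as it would appear in message.xsd.
    static final String MESSAGE_XSD =
        "<?xml version=\"1.0\"?>"
      + "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
      + "  <xsd:element name=\"MESSAGE\" type=\"xsd:string\"/>"
      + "</xsd:schema>";

    // Returns null when the document is valid, otherwise the parser's error text.
    static String firstError(String xml) {
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(MESSAGE_XSD)))
                .newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException | IOException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Plain text inside MESSAGE matches type xsd:string.
        System.out.println(firstError("<MESSAGE>Hello World!</MESSAGE>"));
        // A string-typed element may not contain child elements.
        System.out.println(firstError("<MESSAGE><TEXT>Hello</TEXT></MESSAGE>"));
        // GREETING is not declared anywhere in the schema.
        System.out.println(firstError("<GREETING>Hello</GREETING>"));
    }
}
```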

5.1. COMPLEX TYPE

5.1.1. Basics of Simple and Complex Types

Elements in a schema can be divided into two categories: simple and complex.

Simple Elements

Simple elements can contain only text. They cannot have sub-elements or attributes. The text that they can contain,
however, can be of various data types such as strings, numbers, dates, etc.

Complex Elements

Complex elements, on the other hand, can contain sub-elements, attributes, etc. Many times, they are made up of
one or more simple elements.

Suppose we want to capture student information in the form of the student’s roll number, name, marks, and result.
Then we can have all these individual blocks of information as simple elements. Then we will have a complex
element in the form of the root element. This complex element will encapsulate these individual simple elements.

Let us understand our schema:

1. <xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">:

We know that the root element of the schema is a reserved keyword called schema. The same is the case here.
The namespace prefix xsd maps to the namespace URI http://www.w3.org/2001/XMLSchema, as before. In
general, this will be true for any schema that we write.

2. <xsd:element name = "STUDENT" type = "StudentType"/>:

This declares STUDENT as the root element of our XML document. In the schema, it is called the top-level
element. Remember that in the case of a schema, the root element is always the keyword schema. Therefore, the
root element of an XML document is not the root of the corresponding schema. Instead, it is declared in the
schema inside the root element schema.

The STUDENT element is declared of type StudentType. This is a user-defined type.

Conceptually, a user-defined type is similar to a structure in C/C++ or a class in Java (without the
methods). It allows us to create our own custom type.

In other words, the schema specification allows us to create our own custom data types. For example,
we can create our own types for storing information about employees, departments, songs, friends, sports games,
and so on. We recognize StudentType as a user-defined type because its name does not carry the namespace prefix
xsd. Remember that all the standard data types provided by the XML schema specifications belong to the namespace
http://www.w3.org/2001/XMLSchema, which we have prefixed as xsd in the earlier statement.

3. <xsd:complexType name = "StudentType">:

Now that we have used our own type, we must explain what it represents and contains. That is exactly what
we are doing here. This statement indicates that we have used StudentType as a type earlier, and now we want to
define it. Also, note that we use the keyword complexType to designate that StudentType is a complex
type. This is similar to stating struct StudentType or class StudentType in C++/Java.

4. <xsd:sequence>:

Schemas allow us to force a sequence of simple elements within a complex element. We can specify that a
particular complex element must contain one or more simple elements in a strict sequence. Thus, if the complex
element is A, containing two simple elements B and C, we can mandate that C must follow B inside A. In other
words, the XML document must have:

<A>
<B> … </B>
<C>… </C>
</A>

This is accomplished by the sequence keyword.

5. <xsd:element name = "ROLL_NUMBER" type = "xsd:string"/>:

This declaration specifies that the first simple element inside our complex element is ROLL_NUMBER, of type
string. After this, we have NAME, MARKS, and RESULT as three more simple elements following
ROLL_NUMBER.
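
The schema walked through above can be assembled and tried out in a short Java sketch. This is a reconstruction under assumptions: only the type of ROLL_NUMBER (xsd:string) is stated in the text, so the types chosen here for NAME, MARKS and RESULT, the sample student data and the class name StudentSchemaDemo are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class StudentSchemaDemo {

    // STUDENT is the top-level element; StudentType is the user-defined complex type.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
      + "  <xsd:element name=\"STUDENT\" type=\"StudentType\"/>"
      + "  <xsd:complexType name=\"StudentType\">"
      + "    <xsd:sequence>"
      + "      <xsd:element name=\"ROLL_NUMBER\" type=\"xsd:string\"/>"
      + "      <xsd:element name=\"NAME\" type=\"xsd:string\"/>"     // assumed type
      + "      <xsd:element name=\"MARKS\" type=\"xsd:integer\"/>"   // assumed type
      + "      <xsd:element name=\"RESULT\" type=\"xsd:string\"/>"   // assumed type
      + "    </xsd:sequence>"
      + "  </xsd:complexType>"
      + "</xsd:schema>";

    // Returns true when the XML text is valid against the student schema.
    static boolean isValid(String xml) {
        try {
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Children appear in the order demanded by xsd:sequence.
        System.out.println(isValid(
            "<STUDENT><ROLL_NUMBER>101</ROLL_NUMBER><NAME>Asha</NAME>"
          + "<MARKS>85</MARKS><RESULT>Pass</RESULT></STUDENT>"));
        // NAME before ROLL_NUMBER breaks the sequence, so validation fails.
        System.out.println(isValid(
            "<STUDENT><NAME>Asha</NAME><ROLL_NUMBER>101</ROLL_NUMBER>"
          + "<MARKS>85</MARKS><RESULT>Pass</RESULT></STUDENT>"));
    }
}
```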

6. BASICS OF PARSING

Parsing of XML is the process of reading and validating an XML document and converting it into the desired
format. The program that does this job is called a parser.

An XML file is something that exists on the disk. So, the parser first has to bring it from the disk into the
main memory. More importantly, the parser has to make this in-memory representation of an XML file available
to the programmer in a form that the programmer is comfortable with. A parser reads a file from the disk, converts
it into an in-memory object and hands it over to the programmer. The programmer's responsibility is then to take
this object and manipulate it the way she wants. For example, the programmer may want to display the values of
certain elements, add some attributes, count the total number of elements, and so on.
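
The three manipulations just mentioned can be sketched in Java with the DOM parser bundled with the JDK. This is one possible illustration, not the only way to do it: the class name ParseBasicsDemo, the attribute name checked and the sample student data are assumptions, and the XML is read from a string instead of a disk file to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseBasicsDemo {

    // Asks the parser to turn XML text into an in-memory Document object.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Counts every element in the document, including the root element.
    static int countElements(Document doc) {
        return doc.getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<STUDENT><NAME>Asha</NAME><MARKS>85</MARKS></STUDENT>");

        // Display the value of a certain element.
        System.out.println(doc.getElementsByTagName("NAME").item(0).getTextContent());

        // Add an attribute to the root element.
        doc.getDocumentElement().setAttribute("checked", "yes");
        System.out.println(doc.getDocumentElement().getAttribute("checked"));

        // Count the total number of elements.
        System.out.println(countElements(doc));
    }
}
```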

6.1. PARSING APPROACHES

Suppose that someone younger in your family has returned from playing a cricket match. He is very excited about
it, and wants to describe what happened in the match. He can describe it in two ways: he can either describe
the match as it happened, event by event; or first describe the overall highlights and then get into
specific details.

When an XML document is to be presented to a Java program as an object, there are, similarly, two main
possibilities.

1. Present the document in bits and pieces, as and when we encounter certain sections or portions of the
document.

2. Present the entire document tree at one go. This means that the Java program has to then think of this
document tree as one object, and manipulate it the way it wants.

For example, consider an XML document

Now, we can look at this XML structure in two ways.

1. Go through the XML structure item by item (e.g., to start with, the line <?xml version="1.0"?>,
followed by the element <employees>, and so on).

2. Read the entire XML document in the memory as an object, and parse its contents as per the needs.

Technically, the first approach is called Simple API for XML (SAX), whereas the second is known as
Document Object Model (DOM).

In general, we can equate the SAX approach to our example of the step-by-step description of a cricket match.
The SAX approach works on an event model. This works as follows.

(i) The SAX parser keeps track of various events, and whenever an event is detected, it informs our Java
program.
(ii) Our Java program then needs to take an appropriate action, based on the requirements of handling
that event. For example, there could be an event Start element as shown in the diagram.
(iii) Our Java program needs to stay ready for such events throughout the parse, and take an appropriate
action each time one is reported.
(iv) Control comes back to the SAX parser, and steps (i) to (iii) repeat until the document ends.
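
The event model described in these steps can be sketched in Java with the SAX parser bundled with the JDK. The class name SaxEventsDemo, the log format and the sample employees data are assumptions made for illustration. The handler methods startElement, characters and endElement are the standard SAX callbacks.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {

    // Parses the XML text and returns a log of the events the parser reported.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();

        // The parser calls these methods back, one event at a time.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                log.add("Start element: " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in more than one chunk; blank chunks are skipped.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("Characters: " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("End element: " + qName);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<employees><employee>Ravi</employee></employees>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

Note that the program never sees the document as a whole. It only reacts to each event as the parser reports it.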

In general, we can equate the DOM approach to our example of the overall description of a cricket match. This
works as follows.

(i) The DOM approach parses through the whole XML document at one go. It creates an in-memory
tree-like structure of our XML document.
(ii) This tree-like structure is handed over to our Java program at one go, once it is ready. No events get
fired unlike what happens in SAX.
(iii) The Java program then takes over the control and deals with the tree the way it wants, without
actively interfacing with the parser on an event-by-event basis. Thus, there is no concept of
something such as Start element, Characters, End element, etc.
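
For contrast with SAX, the DOM steps above can be sketched in Java as follows. The parser hands over the complete tree, and the program then walks it on its own. The class name DomTreeDemo, the method outline and the sample employees data are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomTreeDemo {

    // Parses the whole document at one go and returns an indented outline
    // of its element names, one per line.
    static String outline(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        // From here on the parser is no longer involved; we own the tree.
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    // Visits a node and then its children, recording element names only.
    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<employees><employee><name>Ravi</name></employee></employees>"));
    }
}
```

Because the whole tree is in memory, the program can visit nodes in any order and as often as it likes, at the cost of holding the entire document in memory.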
