Entities and XSLT

March 14, 2001

Bob DuCharme

In XML, entities are named units of storage. Their names are assigned and associated with storage units in a DTD's entity declarations. These units may be internal entities, whose contents are specified as a string in the entity declaration itself, or they may be external entities, whose contents are outside of the entity declaration. Typically, this means that the external entity is a file outside of the DTD file which contains the entity declaration, but we don't say "file" in the general case because XML and XSLT work on operating systems that don't use the concept of files.

A DTD might declare an internal entity to act like a constant in a programming language. For example, if a document has many copyright notices that refer to the current year, declaring an entity cpdate to store the string "2001" and then putting the entity reference "&cpdate;" throughout the document means that updating the year value to "2002" for the whole document will only mean changing the declaration.
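As a minimal sketch of this idea, the document below declares cpdate in its internal DTD subset and references it twice. The report, section, and notice element names are made up for illustration; only the cpdate entity and its value come from the description above.

```xml
<?xml version="1.0"?>
<!DOCTYPE report [
<!-- Changing "2001" here updates every &cpdate; reference below. -->
<!ENTITY cpdate "2001">
]>
<report>
<section>
<notice>Copyright &cpdate; all rights reserved</notice>
</section>
<section>
<notice>Copyright &cpdate; all rights reserved</notice>
</section>
</report>
```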

Internal entities are especially popular for representing characters not available on computer keyboards. For example, while you could insert the "ñ" character in your document using the numeric character reference "&#241;" (or the hexadecimal version "&#xF1;"), storing this character reference in an entity called ntilde lets you put "España" in an XML document as "Espa&ntilde;a", which is much easier to read than "Espa&#241;a" or "Espa&#xF1;a". (It has the added bonus of being familiar to those who used the same entity reference in HTML -- perhaps without even knowing that it was an entity reference.)

An external entity can be a file that stores part of a DTD, which makes it an external parameter entity, or it can store part of a document, which makes it an external general entity. For example, the following XML document declares and references the external general entity ext1. (Comments in sample documents refer to filenames in this zip file.)

<!-- xq226.xml -->
<!DOCTYPE poem [
<!ENTITY ext1 SYSTEM "lines938-939.xml">
]>
<poem>
<verse>I therefore, I alone first undertook</verse>
<verse>To wing the desolate Abyss, and spy</verse>
&ext1;
<verse>Better abode, and my afflicted Powers</verse>
<verse>To settle here on Earth or in mid-air</verse>
</poem>

An XML parser reading this document will look for the external entity that the ext1 declaration identifies as lines938-939.xml and report an error if it doesn't find it. If it does find a file named lines938-939.xml that looks like this,


<!-- xq227.xml (lines938-939.xml) -->
<verse>This new created World, whereof in Hell</verse>
<verse>Fame is not silent, here in hope to find</verse>

it will pass something like the following to the application using that XML parser (for example, an XSLT processor):

<poem>
<verse>I therefore, I alone first undertook</verse>
<verse>To wing the desolate Abyss, and spy</verse>
<verse>This new created World, whereof in Hell</verse>
<verse>Fame is not silent, here in hope to find</verse>
<verse>Better abode, and my afflicted Powers</verse>
<verse>To settle here on Earth or in mid-air</verse>
</poem>
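The other kind of external entity mentioned earlier, the external parameter entity, pulls in part of a DTD rather than part of a document. The following sketch shows the general shape; the entity name versedecl and the file name versedecl.dtd are hypothetical and not part of the sample files.

```xml
<?xml version="1.0"?>
<!DOCTYPE poem [
<!-- versedecl.dtd is a hypothetical file assumed to contain:
     <!ELEMENT verse (#PCDATA)> -->
<!ENTITY % versedecl SYSTEM "versedecl.dtd">
%versedecl;
<!ELEMENT poem (verse+)>
]>
<poem>
<verse>I therefore, I alone first undertook</verse>
<verse>To wing the desolate Abyss, and spy</verse>
</poem>
```

Parameter entities are declared and referenced with a percent sign and can only be referenced inside the DTD, while general entities such as ext1 are referenced with an ampersand in the document itself.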

Because an XSLT stylesheet is an XML document, you can store and reference pieces of it using the same technique, but you'll find that the xsl:include and xsl:import instructions give you more control over how your pieces fit together. See my November column Combining Stylesheets with Include and Import for more detail.
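For comparison with the entity technique, here is a rough sketch of a stylesheet assembled with those two instructions. The file names main.xsl, base.xsl, and verses.xsl are assumptions made for illustration.

```xml
<!-- main.xsl: base.xsl and verses.xsl are hypothetical file names -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- xsl:import elements must come before all other top-level elements;
       templates in main.xsl take precedence over those in base.xsl. -->
  <xsl:import href="base.xsl"/>

  <!-- xsl:include treats the templates in verses.xsl as if they had been
       written directly in this stylesheet. -->
  <xsl:include href="verses.xsl"/>

  <xsl:template match="poem">
    <poem><xsl:apply-templates/></poem>
  </xsl:template>

</xsl:stylesheet>
```

Unlike an external entity, each included or imported file must be a complete stylesheet in its own right, and the XSLT processor rather than the XML parser does the combining.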

All these categories of entities are known as parsed entities because an XML parser reads them in, replaces each entity reference with the entity's contents, and parses them as part of the document. XML documents can also use unparsed entities to incorporate non-XML data such as image files. An unparsed entity is never used with an entity reference; instead, its name appears as the value of an attribute declared to be of type ENTITY or ENTITIES.
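A small sketch may make the unparsed case more concrete. The illustration element, its picture attribute, the jpeg notation, and the frontispiece.jpg file are all invented for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE poem [
<!-- The notation names the format of the non-XML data. -->
<!NOTATION jpeg SYSTEM "image/jpeg">
<!-- NDATA marks frontispiece as an unparsed entity. -->
<!ENTITY frontispiece SYSTEM "frontispiece.jpg" NDATA jpeg>
<!ELEMENT poem (illustration, verse+)>
<!ELEMENT illustration EMPTY>
<!-- An attribute of type ENTITY takes an unparsed entity's name as its value. -->
<!ATTLIST illustration picture ENTITY #REQUIRED>
<!ELEMENT verse (#PCDATA)>
]>
<poem>
<illustration picture="frontispiece"/>
<verse>I therefore, I alone first undertook</verse>
</poem>
```

Note that the document never contains "&frontispiece;". The parser reports the entity's system identifier and notation to the application, and an XSLT 1.0 stylesheet can retrieve the URI with the unparsed-entity-uri() function.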

When you apply an XSLT stylesheet to a document, if entities are declared and referenced in that document, your XSLT processor won't even know about them. An XSLT processor leaves the job of parsing the input document (reading it and figuring out what's what) to an XML parser; that's why the installation of some XSLT processors requires you to identify the XML parser you want them to use. (Others include an XML parser as part of their installation.) An important part of an XML parser's job is to resolve all entity references, so that if the input document's DTD declares a cpdate entity as having the value "2001" and the document has the line "copyright &cpdate; all rights reserved", the XML parser will pass along the text node "copyright 2001 all rights reserved" to put on the XSLT source tree. Newcomers to XSLT often ask how they can check for entity references such as "&nbsp;" or "&lt;" in the source tree, and the answer is: you can't. By the time the document's content reaches the source tree, it's too late.
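The following sketch illustrates the point. It assumes a hypothetical source document whose notice element holds the line quoted above; the hasYear element name is likewise made up.

```xml
<!-- Assumed source document (hypothetical notice element):
     <!DOCTYPE notice [ <!ENTITY cpdate "2001"> ]>
     <notice>copyright &cpdate; all rights reserved</notice> -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="notice">
    <!-- The text node already reads "copyright 2001 all rights reserved",
         so the test looks for the replacement text, not for an entity name. -->
    <hasYear><xsl:value-of select="contains(., '2001')"/></hasYear>
  </xsl:template>
</xsl:stylesheet>
```

With that source document, the hasYear element in the result contains "true". Nothing in the source tree records that "2001" came from an entity reference, so there is no XPath expression that could test for "&cpdate;" itself.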

How about entities in your result tree? You can't add entity declarations there, because although XSLT can add a document type declaration to a result tree, it can't add one with an internal DTD subset, which is the only way to add DTD declarations to a document entity.

There are, however, ways to add entity references. If you create an XML document in your result tree, and you add references to any entities other than the five that all XML processors are required to handle, whether they're declared or not (lt, gt, apos, quot, and amp), then your document must have a document type declaration that points to a DTD with declarations for your entities. If you're creating an HTML document, entity declarations aren't required, and most web browsers understand a wide variety of entity references for special characters such as "&eacute;" for the "é" character and "&ntilde;" for the "ñ" character.

Let's look at various approaches to creating an entity reference in a result tree. We'll use the following one-line document as a source document and try to add a text node to the result that includes the entity reference "&ntilde;" for the "ñ" character.

<test>Dagon his Name, Sea Monster</test>

If the stylesheet document has the appropriate entity declaration, the XML parser that feeds the stylesheet and source document to the XSLT processor will replace this entity reference in the stylesheet with the replacement text declared for it. For this stylesheet, it will replace "&ntilde;" with the Unicode value for the "ñ" character:

<!-- xq230.xsl: converts xq229.xml into xq231.xml -->

<!DOCTYPE stylesheet [
<!ENTITY ntilde "&#241;" ><!-- small n, tilde -->
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="test">
    <testOut>
      The Spanish word for "Spain" is "Espa&ntilde;a".
      <xsl:apply-templates/>
    </testOut>
  </xsl:template>
</xsl:stylesheet>

The actual "ñ" character and not an entity reference to it shows up in the result:

<?xml version="1.0" encoding="utf-8"?><testOut>
      The Spanish word for "Spain" is "España".
      Dagon his Name, Sea Monster</testOut>

Normally, your stylesheet doesn't need a DOCTYPE declaration, but if the stylesheet has references to any entities besides the five predeclared ones listed above, you must declare them inside a DOCTYPE declaration. The XML parser that reads in the stylesheet for your XSLT processor will replace any entity references with their entity values before giving the stylesheet to the XSLT processor.

This is handy, but not what we're looking for. We want to see an entity reference, not the entity it refers to, in the result document. XSLT offers no way to tell the XML processor not to make entity replacements. (Certain XSLT processors, such as Xalan, offer this option as a non-standard feature.) However, XSLT does offer a way to turn off its automatic "escaping" of certain characters -- that is, an XSLT processor's substitution of the entity reference "&amp;" for ampersands and "&lt;" for less-than characters in result tree text nodes. You can turn it off for your entire result tree with an xsl:output instruction that has a method attribute value of "text", and you can turn it off for a single xsl:text element by setting its disable-output-escaping attribute to "yes".
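As a quick sketch of the first option, the stylesheet below switches the whole result to the text output method. The "Milton &amp; Dagon" string is invented for the example; the test element is the one from the sample source document.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- With the text output method, nothing is escaped in the result. -->
  <xsl:output method="text"/>
  <xsl:template match="test">
    <!-- The parser turns the reference below into a plain ampersand
         before the XSLT processor ever sees the stylesheet. -->
    <xsl:text>Milton &amp; Dagon: </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the one-line test document, this writes "Milton & Dagon: Dagon his Name, Sea Monster" with a bare ampersand. Under the default xml output method, the same ampersand would be written as "&amp;".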

The disabling of output escaping is used too often in situations where it shouldn't be -- in particular, to create a less-than character that starts a tag or declaration that could be added to a result tree with a more appropriate XSLT instruction. Because it's essentially turning off something that the XSLT processor is supposed to do, it should be used sparingly.
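To illustrate the alternative, here is a sketch that builds an element and a comment with instructions designed for the job instead of writing out raw less-than characters. The comment text is made up for the example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="test">
    <!-- Discouraged: writing the start-tag by hand with
         <xsl:text disable-output-escaping="yes">&lt;testOut&gt;</xsl:text> -->
    <!-- Preferred: let the processor create the element and the comment. -->
    <xsl:element name="testOut">
      <xsl:comment> created by the stylesheet </xsl:comment>
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```

Because the processor creates real element and comment nodes here, it can guarantee matching tags in the output, which hand-written tag text cannot. A literal result element such as the testOut element in the earlier stylesheet does the same job when the element name is known in advance.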

The following version of the stylesheet resembles the previous one except for the replacement text specified in the ntilde declaration. It's an xsl:text instruction with "&amp;ntilde;" as its contents.

<!-- xq232.xsl: converts xq229.xml into xq233.xml -->

<!DOCTYPE stylesheet [
<!ENTITY ntilde "<xsl:text disable-output-escaping='yes'>&amp;ntilde;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output doctype-system="testOut.dtd"/>
  <xsl:template match="test">
    <testOut>
      The Spanish word for "Spain" is "Espa&ntilde;a".
      <xsl:apply-templates/>
    </testOut>
  </xsl:template>
</xsl:stylesheet>

The XML parser that reads the stylesheet and hands it off to the XSLT processor will replace that "&amp;" with a "&", but because the xsl:text element has its disable-output-escaping attribute set to "yes", the XSLT processor will pass along the "&ntilde;" string to the result tree without trying to resolve it. (If it did try to resolve it, it would cause an error, because having "&ntilde;" as the replacement text for the ntilde entity would be an illegal recursive entity declaration.) With the same test document, the new stylesheet creates this output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE testOut SYSTEM "testOut.dtd">
<testOut>
      The Spanish word for "Spain" is "Espa&ntilde;a".
      Dagon his Name, Sea Monster</testOut>

The new stylesheet has one more difference from the earlier one: it includes an xsl:output element. This element doesn't need a method attribute, because the default value of "xml" is fine, but the doctype-system attribute is important. If the result document has an "&ntilde;" entity reference, that entity must be declared somewhere. XSLT doesn't offer a way to include such declarations in an internal DTD subset of the document's DOCTYPE declaration, although some stylesheet developers have assembled hacks to add these declarations using disable-output-escaping kludges. The best way to ensure that these entities are properly declared is to give the result tree a DOCTYPE declaration with a SYSTEM identifier that points to a DTD with the necessary declarations. The example above adds a DOCTYPE declaration whose SYSTEM identifier points to a testOut.dtd file, which should include a declaration for the ntilde entity.
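The contents of testOut.dtd aren't shown in this column, so the following is only a guess at the minimum such a file might hold: an element declaration for testOut and the ntilde entity declaration used in the first stylesheet.

```xml
<!-- testOut.dtd: hypothetical contents -->
<!ELEMENT testOut (#PCDATA)>
<!ENTITY ntilde "&#241;"><!-- small n, tilde -->
```

With a DTD like this in place, a parser reading the result document can resolve "&ntilde;" to the "ñ" character.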

This trick works for any general entity reference you want in your result tree, whether it references an internal entity whose contents are included in the declaration (like the ntilde entities in the examples above) or an external entity whose contents are stored in an external file (like the ext1 entity at the beginning of this column, which references the lines938-939.xml file).
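Here is a sketch of the same trick applied to the external entity. It assumes a source poem element with four verse children and no entity reference of its own, plus a hypothetical poemOut.dtd that declares ext1. Keep in mind that support for disable-output-escaping is optional for XSLT processors.

```xml
<!-- Sketch only: poemOut.dtd is a hypothetical DTD assumed to contain
     <!ENTITY ext1 SYSTEM "lines938-939.xml"> -->
<!DOCTYPE stylesheet [
<!ENTITY ext1 "<xsl:text disable-output-escaping='yes'>&amp;ext1;</xsl:text>">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output doctype-system="poemOut.dtd"/>
  <xsl:template match="poem">
    <poem>
      <!-- Copy the first two verses, write the reference, copy the rest. -->
      <xsl:copy-of select="verse[position() &lt;= 2]"/>
      &ext1;
      <xsl:copy-of select="verse[position() &gt; 2]"/>
    </poem>
  </xsl:template>
</xsl:stylesheet>
```

The result should be a poem document with an "&ext1;" reference between its second and third verse elements, which a parser reading that result can resolve through the external DTD.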

To review, you can add any kind of entity reference you want to your result tree with the following two steps:

  • Reference the entity in your stylesheet at the point where you want the entity reference to appear in your result tree.

  • Declare the entity's contents in the stylesheet's DOCTYPE declaration to be an ampersand, the entity name, and a semicolon all inside of an xsl:text element with its disable-output-escaping attribute set to "yes".