XML - Managing Data Exchange/XHTML







Learning objectives

  • List the differences between XHTML and HTML
  • Create a valid, well-formed XHTML document
  • Convert an existing HTML document to XHTML
  • Decide when XHTML is more appropriate than HTML


In previous chapters, we have learned how to generate HTML documents from XML documents and XSL stylesheets. In this chapter, we will learn how to convert those HTML documents into valid XHTML. We will discuss why XHTML has evolved as a standard and when it should be used.

The Evolution of XHTML


Originally, Web pages were designed in HTML. Unfortunately, most implementations of this markup language tolerate all sorts of mistakes and bad formatting. Major browsers were designed to be forgiving, and poor code would display with few problems in most cases. This poor code was often not portable between browsers; for example, a page might render in Netscape but not in Internet Explorer, or vice versa. Compensating for human error and bad formatting takes processing power that small handheld devices might not have. Thus, when displaying data on handhelds, a tiny mistake can crash the device.

XHTML partially mitigates these problems. The processing burden is reduced by requiring XHTML documents to conform to the much stricter rules defined in XML. Aside from the stricter rules, HTML 4.01 and XHTML 1.0 are functionally equivalent. If a document breaks XML's well-formedness rules, an XHTML-compliant browser must not render the page. If a document is well-formed but invalid, an XHTML-compliant browser may render the page, so a significant number of mistakes still slip through.

In this chapter, we will examine in detail how to create an XHTML document.

The biggest problem with HTML from a design standpoint is that it was never meant to be a graphical design language. The original version of HTML was intended to structure human readable content (e.g. marking a section of text as a paragraph), not to format it (e.g. this paragraph should be displayed in 14pt Arial). HTML has evolved far past its original purpose and is being stretched and manipulated to cover cases that the original HTML designers never imagined.

The recommended solution is to use a separate language to describe the presentation of a group of documents. Cascading Style Sheets (CSS) is a language used for describing presentation. From version 1.1 of XHTML upwards, Web pages must be formatted using CSS or a language with equivalent capabilities, such as XSLT (XSL Transformations). The use of CSS or XSLT is optional in XHTML 1.0 unless the Strict variant is used. HTML 4.01 supports CSS but not XSLT.

So What is XHTML?


As you might have guessed, XHTML stands for eXtensible HyperText Markup Language. It is a cross between HTML and XML. It fulfills two major purposes that were ignored by HTML:

  1. XHTML is a stricter standard than HTML. XHTML documents must be well-formed just like regular XML. This reduces vagaries and inconsistency between browsers, because browsers do not have to decide how to display a badly-formed page. Malformed XHTML is not allowed.
    Note 1: Browsers only enforce well-formedness if the MIME type is set to application/xhtml+xml. If the MIME type is set to text/html, the browser will allow badly-formed documents. There are a large number of 'XHTML' documents on the web that are badly-formed and get away with it because their MIME type is text/html.
    Note 2: Browsers are not required to check for validity. See Invalid XHTML below for an example.
  2. XHTML allows for modularization (m12n). For different environments different element and attribute subsets can be defined.

The best thing about XHTML is that it is almost the same as HTML! If you know how to write an HTML document, you can create an XHTML document with little trouble. The most important thing to keep in mind is that, unlike HTML, where simple errors such as a missing closing tag are ignored by the browser, XHTML code must be written according to an exact specification. We will see later that adhering to these strict specifications actually allows XHTML to be more flexible than HTML.

XHTML Document Structure


At a minimum, an XHTML document must contain a DOCTYPE declaration and four elements: html, head, title, and body:

<!DOCTYPE ... >
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="...">
   <head>
      <title></title>
   </head>
   <body></body>
</html>

The opening html tag of an XHTML document must include a namespace declaration for the XHTML namespace.
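
As a rough check of the structure described above, the following Python sketch parses a document with the standard library's xml.etree.ElementTree module and looks for the four required elements in the XHTML namespace. The function name has_minimal_structure and the sample document are illustrative assumptions; the sketch does not validate the document against a DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def has_minimal_structure(source: str) -> bool:
    """Return True if html, head, title and body exist in the XHTML namespace."""
    root = ET.fromstring(source)  # raises ET.ParseError if not well-formed
    ns = {"x": XHTML_NS}
    return (
        root.tag == f"{{{XHTML_NS}}}html"
        and root.find("x:head/x:title", ns) is not None
        and root.find("x:body", ns) is not None
    )


# Hypothetical sample document used only for this illustration.
MINIMAL = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Example</title></head>
  <body><p>Hello</p></body>
</html>"""

print(has_minimal_structure(MINIMAL))
```

A document whose html element lacks the namespace declaration fails this check, because its elements are then not in the XHTML namespace at all.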

The DOCTYPE declaration should appear immediately before the html tag in an XHTML document. It can follow one of three formats.

XHTML 1.0 Strict

<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

The Strict declaration is the least forgiving. This is the preferred DOCTYPE for new documents. Strict documents tend to be streamlined and clean. All formatting will appear in Cascading Style Sheets rather than the document itself. Presentational markup that should be replaced by rules in the Cascading Style Sheet includes, but is not limited to:

<body text="blue">, <u>nderline</u>, <b>old</b>, <i>talics</i>, and <font color="#9900FF" face="Arial" size="+2">

In Strict documents, text and inline elements may not appear directly inside the body element; they must be nested within block-level elements such as p.

Incorrect Example:

<p>I hope that you enjoy</p> your stay.

Correct Example:

<p>I hope that you enjoy your stay.</p>

XHTML 1.0 Transitional

<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

This declaration is intended as a halfway house for migrating legacy HTML documents to XHTML 1.0 Strict. The W3C encourages authors to use the Strict DOCTYPE for new documents. (The XHTML 1.0 Transitional DTD refers readers to the relevant note in the HTML4.01 Transitional DTD.)

This DOCTYPE does not require CSS for formatting, although it is recommended. It generally tolerates inline elements found where block-level elements are expected.

There are a couple of reasons why you might choose this DOCTYPE for new documents.

  • You require backwards compatibility with browsers that support the formatting elements of XHTML but do not support CSS. This is a very small fraction of general users (less than 1%). Many browsers that don't support CSS don't support HTML 4.0 or XHTML either. However, it may be useful on a corporate intranet that has a larger than normal fraction of very old (pre-2000) browsers.
  • You need to link to frames. Using frames is discouraged as they work badly in many browsers.

XHTML 1.0 Frameset

<!DOCTYPE html
 PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

If you are creating a page with frames, this declaration is appropriate. However, since frames are generally discouraged when designing Web pages, this declaration should be used rarely.

XML Prolog


Additionally, XHTML authors are encouraged by the W3C to include the following XML declaration as the first line of each document:

<?xml version="1.0" encoding="UTF-8"?>

Although it is recommended by the standard, the XML declaration may cause problems in older Web browsers, including Internet Explorer version 6. It is up to the individual author to decide whether to include it.

Language


It is good practice to include the optional xml:lang attribute [1] on the html element to describe the document's primary language. For compatibility with HTML the lang attribute should also be specified with the same value. For an English language document use:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">

The xml:lang and lang attributes can also be specified on other elements to indicate changes of language within the document, e.g. a French quotation in an English document.

Converting HTML to XHTML


In this section, we will discover how to transform an HTML document into an XHTML document. We will examine each of the following rules:

  • Documents must be well-formed
    • Tags must be properly nested
    • Elements must be closed
  • Tags must be lowercase
  • Attribute names must be lowercase
  • Attribute values must be quoted
  • Attributes cannot be minimized
  • The name attribute is replaced with the id attribute (in XHTML 1.0 both name and id should be used with the same value to maintain backwards-compatibility).
  • Plain ampersands are not allowed
  • Scripts and CSS must be escaped (enclosed within the CDATA section delimiters <![CDATA[ and ]]>) or, preferably, moved into external files.

Documents must be well-formed


Because XHTML conforms to all XML standards, an XHTML document must be well-formed according to the W3C's recommendations for an XML document. Several of the rules here reemphasize this point. We will consider both incorrect and correct examples.

Tags must be properly nested


Browsers widely tolerate badly nested tags in HTML documents.

<b><u>
This text is probably bold and underlined, but inside incorrectly nested tags.
</b></u>

The text above would display as bold and underlined, even though the end tags are not in the proper order. An XHTML page will not display if the tags are improperly nested, because it would not be a well-formed XML document. The problem is easily fixed by closing the tags in the reverse of the order in which they were opened.

<b><u>
This text is bold and underlined and inside properly nested tags.
</u></b>
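
The difference between the two fragments can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this illustration, and the fragments are wrapped in a p element so that each has a single root.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


badly_nested = "<p><b><u>bold and underlined</b></u></p>"
properly_nested = "<p><b><u>bold and underlined</u></b></p>"

print(is_well_formed(badly_nested))
print(is_well_formed(properly_nested))
```

The parser rejects the first fragment as soon as it meets the mismatched end tag, which is the same reason an XHTML browser refuses to render such a page.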

Elements must be closed


Again, XHTML documents must be well-formed XML documents. For this reason, all elements must be closed. The HTML specifications listed some tags as having "optional" end tags, such as the <p> and <li> tags.

<p>Here is a list:
<ul>
   <li>Item 1
   <li>Item 2
   <li>Item 3
</ul>

In XHTML, the end tags must be included.

<p>Here is a list: </p>
<ul>
   <li>Item 1</li>
   <li>Item 2</li>
   <li>Item 3</li>
</ul>

What should we do about HTML tags that do not have a closing tag? Some special tags do not require or imply a closing tag.

<img src="titlebar.gif" alt="Title">
<hr>
<br>
<p>Welcome to my web page!</p>

In XHTML, the XML rule of including a closing slash within the tag must be followed.

<img src="titlebar.gif" alt="Title" />
<hr />
<br />
<p>Welcome to my Web page!</p>

Note that some of today's browsers will incorrectly render a page if the closing slash does not have a space before it (<br/>). Although it is not part of the official recommendation, you should always include the space (<br />) for compatibility purposes.

Here are the common empty elements in HTML:

  • area
  • base
  • basefont
  • br
  • hr
  • img
  • input
  • link
  • meta
  • param

Tags must be lowercase


In HTML, tags could be written in either lowercase or uppercase. In fact, some Web authors preferred to write tags in uppercase to make them easier to read. XHTML requires that all tags be lowercase.

<H1>This is an example of bad case.</h1>

This difference is necessary because XML is case-sensitive. XML would read <H1> and <h1> as different tags, causing problems in the above example. The problem can be easily fixed by changing all tags to lowercase.

<h1>This is an example of good case.</h1>

Attribute names must be lowercase


Following the pattern of writing all tags in lowercase, all attribute names must also be in lowercase.

<p CLASS="specialText">Important Notice</p>

The correct tags are easy to create.

<p class="specialText">Important Notice</p>

Attribute values must be quoted


Some HTML values do not require quotation marks around them. They are understood by browsers.

<table border=1 width=100%>
</table>

XHTML requires all attribute values to be quoted. Even numeric, percentage, and hexadecimal values must appear in quotation marks to be part of a proper XHTML document.

<table border="1"  width="100%">
</table>

Attributes cannot be minimized


HTML allowed some attributes to be written in shorthand, such as selected or noresize.

<form>
   <input checked ... />
   <input disabled ... />
</form>

When using XHTML, attribute minimization is forbidden. Instead, use the syntax x="x", where x is the attribute that was formerly minimized.

<form>
   <input checked="checked"  .../>
   <input disabled="disabled"  .../>
</form>

A complete list of minimized attributes follows:

  • checked
  • compact
  • declare
  • defer
  • disabled
  • ismap
  • nohref
  • noresize
  • noshade
  • nowrap
  • readonly
  • selected
  • multiple

The name attribute is replaced with the id attribute


HTML 4.01 standards define a name attribute for the tags a, applet, frame, iframe, img, and map.

<a name="anchor">
<img src="banner.gif" name="mybanner" />
</a>

XHTML has deprecated the name attribute. Instead, the id attribute is used. However, to ensure backwards compatibility with today's browsers, it is best to use both the name and id attributes.

<a name="anchor" id="anchor" >
<img src="banner.gif" name="mybanner" id="mybanner"  />
</a>

XHTML 1.1 removed name altogether, so as browser support improves it will eventually be unnecessary to use both attributes.

Plain ampersands are not allowed

Bare ampersands are illegal in XHTML, both in text and in attribute values such as URLs.

<a href="home.aspx?status=done&itWorked=false">Home & Garden</a>

They must instead be replaced with the equivalent entity reference &amp;.

<a href="home.aspx?status=done&amp;itWorked=false">Home &amp; Garden</a>
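
When converting many pages, this replacement can be automated. The Python sketch below escapes only those ampersands that do not already begin an entity or character reference, so running it twice does no harm. The pattern and the function name escape_bare_ampersands are assumptions made for this illustration; the pattern does not check that a named entity actually exists.

```python
import re

# An ampersand that does not already start a named, decimal,
# or hexadecimal reference such as &amp; &#38; or &#x26;
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace each bare ampersand with the entity reference &amp;."""
    return BARE_AMPERSAND.sub("&amp;", text)


html_link = '<a href="home.aspx?status=done&itWorked=false">Home & Garden</a>'
print(escape_bare_ampersands(html_link))
```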

Image alt attributes are mandatory


Because XHTML is designed to be viewed on different types of devices, some of which are not image-capable, alt attributes must be included for all images.

<img src="titlebar.gif">

Remember that the img tag must include a closing slash in XHTML!

<img src="titlebar.gif" alt="title"  />

Scripts and CSS must be escaped


Internal scripts and CSS often include characters like the ampersand and less-than characters.

<script language="JavaScript">
   <!--
      document.write('Hello World!'); 
   //-->
</script>

If you are using internal scripts or CSS, enclose them within a CDATA section, delimited by <![CDATA[ and ]]>. This marks them as character data that should not be parsed as markup. If you do not use a CDATA section, characters like & and < will be treated as the start of entity references (like &nbsp;) and tags (like <b>) respectively. This will cause your page to behave unpredictably, and it may invalidate your code.

Additionally, the type attribute is mandatory for scripts. The comment tags <!-- and --> that have traditionally been used to hide JavaScript from noncompliant browsers should not be included. The XML standard allows a parser to discard the contents of comments entirely, which would lose any script enclosed in them.

<script type="text/javascript" language="javascript">
/*<![CDATA[*/
   document.write('Hello World!');
/*]]>*/
</script>

Also, document.write() is not permitted in XHTML documents. You must use node creation methods such as document.createElementNS() instead. Confusingly, document.write() will appear to work as expected if the document is incorrectly served with a MIME type of text/html (the type for HTML documents) instead of application/xhtml+xml (the type for XHTML documents). If the MIME type is text/html, the document will be parsed as HTML, which allows document.write(). Parsing the document as HTML defeats the purpose of writing it in XHTML.

Similar changes must be made for internal stylesheets.

<style>
<!--
   .SpecialClass {
      color: #000000;
   }
-->
</style>

The type attribute must be included, and the CDATA section delimiters should be used.

<style type="text/css">
/*<![CDATA[*/
   .SpecialClass {
      color: #000000;
   }
/*]]>*/
</style>
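
The effect of the CDATA section can be checked with an XML parser. In the Python sketch below, the script body (with the hypothetical names a, b, c, and run) contains the characters < and &, which break parsing unless they sit inside a CDATA section. The helper name parses is also an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Script content placed directly in the element: < and & are read as markup.
UNESCAPED = """<script type="text/javascript">
if (a < b && c) { run(); }
</script>"""

# The same content inside a CDATA section hidden in JavaScript comments.
ESCAPED = """<script type="text/javascript">
/*<![CDATA[*/
if (a < b && c) { run(); }
/*]]>*/
</script>"""


def parses(fragment: str) -> bool:
    """Return True if the fragment is well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


print(parses(UNESCAPED))
print(parses(ESCAPED))
```

The comment markers /* and */ around the delimiters keep the script usable when the same page is parsed as HTML, where the CDATA markers would otherwise be read as script text.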

Because scripts and CSS may complicate an XHTML document, it is strongly recommended that they be placed in external .js and .css files, respectively. They can then be linked to from your XHTML document.

<script src="myscript.js" type="text/javascript"></script>

<link href="styles.css" type="text/css" rel="stylesheet" />

Some elements may not be nested


The W3C recommendations state that certain elements may not be contained within others in an XHTML document, even when no XML rules are violated by the inclusion. Elements affected are listed below.

Element    Cannot contain ...
a          a
pre        big, img, object, small, sub, sup
button     button, fieldset, form, iframe, input, isindex, label, select, textarea
label      label
form       form
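
Because an XML parser cannot detect these cases, they have to be checked separately. The Python sketch below walks a parsed document and reports every prohibited combination from the table above. It assumes that each prohibition applies at any depth of nesting, not only to direct children; the names PROHIBITED and find_prohibited_nesting are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Element prohibitions copied from the table in this section.
PROHIBITED = {
    "a": {"a"},
    "pre": {"big", "img", "object", "small", "sub", "sup"},
    "button": {"button", "fieldset", "form", "iframe", "input",
               "isindex", "label", "select", "textarea"},
    "label": {"label"},
    "form": {"form"},
}


def local_name(tag: str) -> str:
    """Strip a leading {namespace} from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_prohibited_nesting(source: str):
    """Return (ancestor, descendant) pairs that break the prohibitions."""
    violations = []

    def walk(element, ancestors):
        name = local_name(element.tag)
        for ancestor in ancestors:
            if name in PROHIBITED.get(ancestor, ()):
                violations.append((ancestor, name))
        for child in element:
            walk(child, ancestors + [name])

    walk(ET.fromstring(source), [])
    return violations


sample = '<form><p><a href="x">outer <a href="y">inner</a></a></p></form>'
print(find_prohibited_nesting(sample))
```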

When to convert


By now, it probably sounds as though converting an HTML document into XHTML is easy, but tedious. When would you want to convert your existing pages into XHTML? Before deciding to change your entire Web site, consider these questions.

  • Do you want your pages to be easily viewed over a nontraditional Internet-capable device, such as a PDA or Web-enabled telephone? Will this be a goal of your site in the future? XHTML is the language of choice for Web-enabled portable devices. Now may be a good time for you to commit to creating an all-XHTML site.
  • Do you plan to work with XML in the future? If so, XHTML may be a logical place to begin. If you head up a team of designers who are accustomed to using HTML, XHTML is a small step away. It may be less intimidating for beginners to learn XHTML than it is to try teaching them all about XML from scratch.
  • Is it important that your site be current with the most recent W3C standards? Staying on top of current standards will make your site more stable and help you stay updated in the future, as you will only have to make small changes to upgrade your site to the newest versions of XHTML as they are approved by the W3C.
  • Will you need to convert your documents to another format? As a valid XML document, XHTML can utilize XSL to be converted into text, plain HTML, another XHTML document, or another XML document. HTML cannot be used for this purpose.

If you answered yes to any of the above questions, then you should probably convert your Web site to XHTML.

MIME Types


XHTML 1.0 documents should be served with a MIME type of application/xhtml+xml to Web browsers that can accept this type. XHTML 1.0 may be served with the MIME type text/html to clients that cannot accept application/xhtml+xml, provided that the XHTML complies with the additional constraints in Appendix C of the XHTML 1.0 specification. If you cannot configure your Web server to serve documents as different MIME types, you probably should not convert your Web site to XHTML.

You should check that your XHTML documents are served correctly to browsers that support application/xhtml+xml, e.g. Mozilla Firefox. Use 'Page Info' to verify that the type is correct.

XHTML 1.1 documents are often not backwards compatible with HTML and should not be served with a MIME type of text/html.[2]
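
Choosing the MIME type per request is usually done by inspecting the HTTP Accept header sent by the browser. The Python sketch below shows one simplified way to make that decision. The function name choose_content_type is an assumption made for this illustration, and the parsing deliberately ignores wildcards such as */* and the relative ranking of types; it only asks whether the client lists application/xhtml+xml with a non-zero quality value.

```python
def choose_content_type(accept_header: str) -> str:
    """Pick a MIME type for an XHTML 1.0 page from an HTTP Accept header."""
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        if media_type.strip().lower() != "application/xhtml+xml":
            continue
        quality = 1.0  # default when no q parameter is given
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unreadable q value as a refusal
        if quality > 0:
            return "application/xhtml+xml"
    return "text/html"


# Header values in the style sent by two kinds of browser (illustrative).
print(choose_content_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(choose_content_type("image/gif, image/jpeg, */*"))
```

A server that varies its response in this way should normally also tell caches that the response depends on the Accept header.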

Help Converting


HTML Tidy


When creating HTML, it's very easy to make a mistake by leaving out an end tag or not properly nesting tags. HTML Tidy is an application that can correct a number of errors in poorly formed HTML documents and convert them into XHTML. Tidy can also format ugly code to be more readable, including code generated by WYSIWYG editors. HTML Tidy cannot generate clean code when it encounters problems that it is not sure how to fix. In these cases, it will report an error to let you know where the mistake is located in your document.

A few examples of problems that HTML Tidy can remedy:

  • Missing or mismatched end tags.
  • Improperly nested elements.
  • Mixed up tags.
  • A missing "/" needed to properly close a tag.
  • Missing tags in lists.
  • Missing quotes around attribute values.
  • A missing DOCTYPE: Tidy can insert the correct DOCTYPE value based on your code (it can also recognize and report proprietary elements).

HTML Tidy can also be customized at runtime using a wide array of command line arguments. It is capable of indenting code to make it more readable as well as replacing FONT, NOBR, and CENTER tags with style tags and rules using CSS. Tidy can also be taught new tags by declaring them in the configuration file.

You can read more about HTML Tidy at the W3C's HTML Tidy site, as well as download the application as a binary or get the source code. There are several sites that offer HTML Tidy as an online service including the W3C and Site Valet.

You can also validate your page using the validator available at http://validator.w3.org/.

When not to convert


You shouldn't convert your Web pages if they will always be served with a MIME type of text/html. Make sure you know how to configure your server or server-side script to perform HTTP content negotiation so that XHTML capable browsers receive XHTML marked as application/xhtml+xml. If you can't set up content negotiation, stick to HTML 4.01. People viewing your Web pages with mainstream browsers will be unable to tell the difference between a valid HTML 4.01 web page and a valid XHTML 1.0 Web page.

Make sure the automated tests you run on your site simulate connections from both XHTML-compatible browsers, e.g. Mozilla Firefox, and non-XHTML-compatible browsers, e.g. Internet Explorer 6.0. This is particularly important if you use JavaScript on your Web site. If maintaining two copies of your test suite is too time-consuming, don't convert.

Bear in mind that valid HTML 4.01 Strict documents generally require less effort to convert to XHTML 1.1 than valid XHTML 1.0 Transitional documents. A valid HTML 4.01 Strict document can only contain elements that are valid in XHTML 1.1, although a few attributes may need changing. XHTML 1.0 Transitional documents, on the other hand, can contain ten element types and more than a dozen attributes that are not valid in XHTML 1.1. The XHTML 1.0 Transitional body element alone has six attributes that are not supported in XHTML 1.1.

Don't be pressured into using XHTML by people talking vaguely about bad practice. Pin them down to what they mean by bad practice. If they start talking about separation of content and presentation, they have confused the differences between HTML and XHTML with the differences between the Transitional and Strict doctypes. Both XHTML 1.0 Transitional and HTML 4.01 Transitional allow you to mix presentation and content in the same document, i.e. they allow this type of bad practice. Both HTML 4.01 Strict and XHTML 1.0 Strict force you to move the bulk of the presentation (but not all of it) into CSS or an equivalent language. All four doctypes allow you to use embedded stylesheets, whereas true separation requires that all CSS and JavaScript be moved to external files.

XHTML 1.1


XHTML 1.0 is a suitable markup language for most purposes. It provides the option to separate content and presentation, which fits the needs of most Web authors. XHTML 1.1 enforces the separation of content and presentation. All deprecated elements and attributes have been removed. It also removes two attributes that were retained in XHTML 1.0 purely for backwards-compatibility. The lang attribute is replaced by xml:lang and name is replaced by id. Finally it adds support for ruby text found in East Asian documents.

DOCTYPE


The DOCTYPE for XHTML 1.1 is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

Modularization


The modularization of XHTML, or XHTML m12n, provides suggestions for customizing XHTML, either by integrating subsets of XHTML into other XML applications or by extending the XHTML element set. The framework defines two processes:

  • How to group elements and attributes into "modules"
  • How to combine modules to create new markup languages

The resulting languages, which the W3C calls "XHTML Host Languages", are based on the familiar XHTML structure but specialized for specific purposes. XHTML 1.1 is an example of a host language. It was created by grouping the different elements available to XHTML.

XHTML variations, while possible in theory, have not been widely adopted. There is continuing work being done to develop host languages, but their details are beyond the scope of this discussion.

Invalid XHTML


XHTML-compliant browsers are allowed to render invalid XHTML documents provided that the documents are well-formed. A simple example is given below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Invalid XHTML</title>
  </head> 
  <body>
     <p>This sentence contains a <p>nested paragraph.</p></p>
  </body>
</html>

Save the example as invalid.xhtml (the .xhtml extension is important) and open the page with Mozilla Firefox. The page will render even though it is invalid.
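
The same behavior can be reproduced outside a browser with any non-validating XML parser. The Python sketch below feeds the example document to the standard xml.etree.ElementTree module, which checks well-formedness but does not read the DTD, and then counts the p elements nested directly inside another p. The variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# The invalid but well-formed example document from this section.
INVALID_BUT_WELL_FORMED = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Invalid XHTML</title>
  </head>
  <body>
     <p>This sentence contains a <p>nested paragraph.</p></p>
  </body>
</html>"""

# Parsing succeeds because the parser never consults the DTD.
root = ET.fromstring(INVALID_BUT_WELL_FORMED)

ns = {"x": "http://www.w3.org/1999/xhtml"}
nested = root.findall(".//x:p/x:p", ns)
print(len(nested), "paragraph(s) nested inside another paragraph")
```

Catching the nested paragraph requires a validating tool, such as the W3C validator mentioned earlier, that compares the document against the DTD named in the DOCTYPE.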


Summary


XHTML stands for eXtensible HyperText Markup Language. XHTML is very similar to HTML, but it is stricter and easier to parse. XHTML documents must be well-formed just like regular XML. XHTML allows for modularization. XHTML code must be written according to an exact specification, unlike HTML, where simple errors such as a missing closing tag are ignored by the browser. Adhering to these strict specifications actually allows XHTML to be more flexible than HTML. The benefits described in this summary are only gained if the MIME type of the document is application/xhtml+xml. XHTML documents can be validated, but most browsers do not validate them.
