Using an XML Audit to Move SGML Data towards XML

This paper was originally presented at XML'98 in Chicago. The copy on the CD-ROM proceedings was apparently garbled during the composition process.

Charlie Halpern-Hamu, PhD, MBA

Structured-Text Consultant

Incremental Development, Inc.

charlie@IncrementalDevelopment.com

Abstract

This paper describes, at a technical level, how to assess the XML-readiness of your SGML data as a first step towards moving it towards XML.

This paper suggests an 'XML audit': a technical review of current markup practice with an eye towards simplification. The goal of an XML audit is to understand which portions of your current SGML application are not XML. The next step might be to start deemphasizing your use of those features.

Moving all the way to XML allows you to use XML tools that do not support full SGML. Even getting part way there means you can use a wider variety of SGML tools. In either case you will be simplifying work for both editorial and programming staff. Simpler is better.

This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium Note ( www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com ).

Introduction

This talk describes, at a technical level, how to assess the XML-readiness of your SGML data as a first step towards moving it towards XML.

XML Audit

This talk introduces the concept of an 'XML audit': a review of current markup practice with an eye towards simplification. An XML audit lets you know where you stand. Your next step might be to de-emphasize those SGML features that are not XML.

Motivation

But even if you choose to make no immediate change to your markup practices, an XML audit will give you valuable information that will help inform future decisions. You may discover that, give or take an angle-bracket or two, you are already doing XML.

Notes on Style

All discussion assumes the reference concrete syntax. So I will say 'left angle-bracket' or ' < ', but not 'start-tag open delimiter' or ' STAGO '. Similarly, I say 'white space' instead of 'separator'.

Where SGML and XML vary slightly in their nomenclature, I tend towards the SGML, since that's our starting point. Or I fall back towards spelling things out using the reference concrete syntax as described above.

Acknowledgments

This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium Note ( www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com ).

Clark's Note discusses XML options not available in SGML. This paper ignores these, only discussing those SGML options that are not available in XML.

In this paper, and to an even greater degree in the corresponding presentation, I've tried to give more prominence to the more commonly-used SGML features that are missing in XML.

I'd like to thank Larry Sulky for his copy edit. The only suggestion I didn't take was to change 'a journey of a thousand miles' to 'a journey of sixteen-hundred kilometres'.

How to Conduct an XML Audit

The key idea in conducting an XML audit is resisting the temptation to do more than simply review where you stand.

Who Should Attend

You need a selection of technical people: someone who knows the DTD, someone who knows editorial tagging practices, someone who knows about the programs that operate on the data as it flows in, through, and out of the organization.

An XML audit is for figuring out where you are, not where you are going. Consequently, you don't need managerial or technical decision-makers at the meeting. They will, however, want to understand and act on the final assessment.

How to Prepare

Make printouts of this paper, your SGML declaration(s), DTD(s), some sample data, and programs that act on this data. Distribute these items in advance to your attendees. Each attendee should review these items, especially those about which she is the designated expert. So the data architect should focus on the DTDs, the programmer the programs, etc. Ask attendees to note those aspects of your current SGML that are not XML, perhaps in the margins of this paper.

How to Proceed

Designate one person as the note-taker. As with the individual preparation step, it may be convenient to use a copy of this paper as a note-taking template. Move systematically through the headings in this paper and determine if they apply to your application.

Postpone discussions about how to recast SGML usage as simpler XML usage. Focus on simply listing those aspects of your SGML usage that go beyond XML. When you do find non-XML usages, include details of where. Do you have one use of the ' & ' connector or a dozen? Which elements? Try not to worry about why you use this aspect of SGML or how you might avoid it.

Results

The result of an XML audit should be an assessment report. Transcribe your notes into a complete list of the non-XML things you do. The next step will be to decide if it makes sense to change all or some of your markup practices.
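
A small script can give the note-taker a first tally to check by hand. The Python sketch below is only an illustration: the function name `audit`, the check labels and the regular expressions are assumptions of this example, not part of any existing tool. Pattern matching of this kind will miss constructs hidden behind parameter entities, and it assumes that no declaration contains a ' > ' before its end.

```python
import re
from collections import Counter

# One ELEMENT or ATTLIST declaration (assumes no '>' inside the declaration).
DECLARATION = re.compile(r"<!(?:ELEMENT|ATTLIST)\s+[^>]*>")

# Hypothetical check labels, each paired with a pattern for one non-XML construct.
CHECKS = {
    "name group in declaration": re.compile(r"<!(?:ELEMENT|ATTLIST)\s+\("),
    "tag minimization": re.compile(r"<!ELEMENT\s+(?:\([^)]*\)|\S+)\s+[-oO]\s+[-oO]\s"),
    "'&' connector": re.compile(r"<!ELEMENT\b[^>]*&"),
    "inclusion or exclusion": re.compile(r"<!ELEMENT\b[^>]*\s[-+]\("),
}


def audit(dtd_text):
    """Count the declarations in dtd_text that trigger each check."""
    tally = Counter()
    for match in DECLARATION.finditer(dtd_text):
        declaration = match.group(0)
        for label, pattern in CHECKS.items():
            if pattern.search(declaration):
                tally[label] += 1
    return tally


sample = """
<!ELEMENT (a | b) - - (#PCDATA)>
<!ELEMENT c (x & y)>
<!ELEMENT d (p | q)* +(w)>
<!ELEMENT e (#PCDATA)>
"""

for label, count in sorted(audit(sample).items()):
    print(count, label)
```

The counts answer 'one use or a dozen?'; finding out which elements are involved still means reading the declarations themselves.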

Stupid SGML Tricks

Those aspects of SGML that are not available in XML are listed in the sections that follow. The following organization has been used:

The Big Three

Elements
Attributes
Entities

Out of Band

Comments
Marked Sections
Processing Instructions

Miscellaneous

Characters
Minimization
Other
Obscure

The Big Three: Elements

Declaring Several Elements at Once

You can not declare several element types with the same declaration:

    <!ELEMENT (isnt-xml | isnt-xml2) (#PCDATA | em)*>

This habit makes finding element type declarations in a DTD more difficult. A better practice might be to use a parameter entity for the common content model:

    <!ENTITY % inline '(#PCDATA | em)'>
    <!ELEMENT okay-xml  %inline;>
    <!ELEMENT okay-xml2 %inline;>

Specifying Minimization

You can not specify minimization in XML element declarations:

    <!ELEMENT isnt-xml - - (#PCDATA | em)*>

If you are not using OMITTAG , you can leave this out of your SGML:

    <!ELEMENT okay-xml (#PCDATA | em)*>
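
If OMITTAG really is unused, the minimization fields can be removed mechanically. The following Python sketch shows one possible approach; the name `strip_minimization` and the regular expression are assumptions of this example, and the pattern covers only declarations written out literally, not those built from parameter entities.

```python
import re

# '<!ELEMENT name' (or a name group), followed by two minimization fields.
MINIMIZATION = re.compile(
    r"(<!ELEMENT\s+(?:\([^)]*\)|\S+))\s+[-oO]\s+[-oO](?=\s)"
)


def strip_minimization(dtd_text):
    """Remove the start-tag and end-tag minimization fields from ELEMENT declarations."""
    return MINIMIZATION.sub(r"\1", dtd_text)


print(strip_minimization("<!ELEMENT isnt-xml - - (#PCDATA | em)*>"))
```

Declarations that carry no minimization fields pass through unchanged.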

RCDATA and CDATA

You can not declare content to be RCDATA :

    <!ELEMENT isnt-xml RCDATA>

You can not declare content to be CDATA :

    <!ELEMENT isnt-xml CDATA>

The '&' Connector

You can not use the ' & ' connector:

    <!ELEMENT isnt-xml (phone & fax & email)>

If the random order is important to you, you can recast short lists by listing all the possible orders. Avoid SGML-ambiguous content models by factoring out commonalities:

    <!ELEMENT okay-xml ( (phone, ((fax, email) | (email, fax)))
                       | (fax, ((phone, email) | (email, phone)))
                       | (email, ((phone, fax) | (fax, phone))) )>

If you can enforce an order, do so:

    <!ELEMENT okay-xml (phone, fax, email)>

If you can't enforce an order, but your list is too long to recast without the & connector, you may need to loosen your content model:

    <!ELEMENT okay-xml (phone | fax | email)+>
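
The factored listing of every possible order, described earlier in this section, can be generated rather than typed. The Python sketch below is a minimal illustration; the function name `any_order` is an assumption of this example. Each alternative starts with a different element name, which is what keeps the model unambiguous.

```python
def any_order(names):
    """Return a content model that accepts each name exactly once, in any order."""
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        # Factor out the first name, then recurse on the remaining names.
        alternatives.append("(%s, %s)" % (first, any_order(rest)))
    return "(" + " | ".join(alternatives) + ")"


print("<!ELEMENT okay-xml %s>" % any_order(["phone", "fax", "email"]))
```

The output grows factorially with the number of names, which is why this recasting suits only short lists.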

Mixed Content

You can not have deprecated mixed content in XML:

    <!ELEMENT isnt-xml (em | #PCDATA)>

Indeed, the rules are stricter even than just avoiding deprecated mixed content:

    <!ELEMENT isnt-xml  (em | #PCDATA)*>
    <!ELEMENT isnt-xml2 (#PCDATA | em)+>

In a mixed content model, the #PCDATA must be listed first, the only connector permitted is ' | ', the only occurrence indicator permitted is ' * ', and the ' * ' must appear whenever there is a ' | ':

    <!ELEMENT okay-xml  (#PCDATA | em)*>
    <!ELEMENT okay-xml2 (#PCDATA)>
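
A content model can be tested against the two forms that the XML 1.0 grammar gives for mixed content. The Python sketch below is an illustration only: the name `is_xml_mixed` is an assumption of this example, and the `NAME` pattern is simplified to ASCII names, whereas XML allows many more name characters.

```python
import re

# Simplified element name: ASCII only (an assumption of this sketch).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"

MIXED = re.compile(
    r"\(\s*#PCDATA(?:\s*\|\s*" + NAME + r")*\s*\)\*"  # (#PCDATA | a | b)*
    r"|\(\s*#PCDATA\s*\)"                              # (#PCDATA)
)


def is_xml_mixed(model):
    """Return True if model is a mixed content model that XML accepts."""
    return MIXED.fullmatch(model.strip()) is not None


for model in ["(#PCDATA | em)*", "(#PCDATA)", "(em | #PCDATA)*", "(#PCDATA | em)"]:
    print(model, is_xml_mixed(model))
```

The check applies only to models that contain #PCDATA; element-only content models follow different rules.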

Inclusion Exceptions

XML does not allow inclusions on content models:

    <!ENTITY % text    '(#PCDATA)'>
    <!ELEMENT isnt-xml (heading | para)* +(warning)>
    <!ELEMENT heading  %text;>
    <!ELEMENT para     %text;>
    <!ELEMENT warning  %text;>

Element types declared using inclusions are often far looser than they need to be. Usually they can be recast using other mechanisms:

    <!ENTITY % text    '(#PCDATA | warning)*'>
    <!ELEMENT okay-xml (heading | para | warning)*>
    <!ELEMENT heading  %text;>
    <!ELEMENT para     %text;>
    <!ELEMENT warning  %text;>

Exclusion Exceptions

XML does not allow exclusions on content models:

    <!ENTITY % text    '(#PCDATA | em | etc | isnt-xml)*'>
    <!ELEMENT document (heading | para | isnt-xml)*>
    <!ELEMENT heading  %text;>
    <!ELEMENT para     %text;>
    <!ELEMENT em       %text;>
    <!ELEMENT etc      %text;>
    <!ELEMENT isnt-xml %text; -(isnt-xml)>

Sometimes exclusions can be recast using other mechanisms:

    <!ENTITY % text    '(#PCDATA | em | etc | okay-xml)*'>
    <!ELEMENT document (heading | para | okay-xml)*>
    <!ELEMENT heading  %text;>
    <!ELEMENT para     %text;>
    <!ELEMENT em       %text;>
    <!ELEMENT etc      %text;>
    <!ELEMENT okay-xml (#PCDATA | em | etc)*>

Other times, the easiest way to move to XML is to simply remove the exclusion, leaving the content model somewhat looser than it was.

Empty Elements

XML uses a special syntax for empty elements:

    <toc/>
    <toc depth='2'/>

XML also allows empty elements to have end tags:

    <toc></toc>
    <toc depth='2'></toc>

You should note which elements you declare as empty:

    <!ELEMENT toc EMPTY>
    <!ATTLIST toc depth CDATA #IMPLIED>

Here's one way to make the transition. This element declaration is looser than intended, but is both SGML and XML:

    <!ELEMENT toc (#PCDATA)> <!--Should be EMPTY.-->

In both SGML and XML the declaration above allows the markup below:

    <toc></toc>

If you have a very small number of such elements, you might consider if they could be recast as container elements or perhaps as attributes on other elements. Those DTDs that do not feature empty elements avoid a major area of incompatibility between XML and SGML as it is usually used.

Here's one way to change your SGML declaration so that it allows XML-style markup for empty elements:

    DELIM GENERAL SGMLREF
                  NET '/>'

The Big Three: Attributes

Attribute Declarations for Multiple Elements

You can only declare attributes for one element type at a time:

    <!ATTLIST (isnt-xml | isnt-xml2) attrib CDATA #IMPLIED>

XML will require that this be split into one ATTLIST declaration per element type:

    <!ATTLIST okay-xml  attrib CDATA #IMPLIED>
    <!ATTLIST okay-xml2 attrib CDATA #IMPLIED>

If removing the redundancy is important, this can be done using a parameter entity:

    <!ENTITY % attribute 'attrib CDATA #IMPLIED'>
    <!ATTLIST okay-xml  %attribute;>
    <!ATTLIST okay-xml2 %attribute;>
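
Splitting a declaration that names several element types is mechanical, for ATTLIST declarations as for ELEMENT declarations. The Python sketch below shows one way to do it; the name `split_name_group` is an assumption of this example, and the pattern expects the name group to be written out literally rather than supplied by a parameter entity.

```python
import re

# '<!ELEMENT (a | b) rest>' or '<!ATTLIST (a | b) rest>'
GROUPED = re.compile(r"<!(ELEMENT|ATTLIST)\s+\(([^)]*)\)\s+([^>]*)>")


def split_name_group(declaration):
    """Return one declaration per element type named in the leading name group."""
    match = GROUPED.fullmatch(declaration.strip())
    if match is None:
        return [declaration]  # no name group, so nothing to split
    keyword, group, rest = match.groups()
    # SGML allows '|', ',' or '&' between the names in a name group.
    names = [name.strip() for name in re.split(r"[|,&]", group)]
    return ["<!%s %s %s>" % (keyword, name, rest) for name in names if name]


for line in split_name_group("<!ATTLIST (isnt-xml | isnt-xml2) attrib CDATA #IMPLIED>"):
    print(line)
```

The result repeats the shared text in every declaration; introducing a parameter entity to remove that redundancy remains a manual decision.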

Declared Values for Attributes

XML does not include some declared values for attributes that can be used in SGML. Substituting other declared values may have little or no negative effect on your SGML environment while moving you one step closer to XML.

The following declared values are not allowed:

    <!ATTLIST isnt-xml  attrib NAME     #IMPLIED>
    <!ATTLIST isnt-xml2 attrib NAMES    #IMPLIED>
    <!ATTLIST isnt-xml3 attrib NUMBER   #IMPLIED>
    <!ATTLIST isnt-xml4 attrib NUMBERS  #IMPLIED>
    <!ATTLIST isnt-xml5 attrib NUTOKEN  #IMPLIED>
    <!ATTLIST isnt-xml6 attrib NUTOKENS #IMPLIED>

The following are allowed:

    <!ATTLIST okay-xml  attrib NOTATION (jpeg | tiff) #IMPLIED>
    <!ATTLIST okay-xml  attrib CDATA    #IMPLIED>
    <!ATTLIST okay-xml  attrib ENTITY   #IMPLIED>
    <!ATTLIST okay-xml  attrib ENTITIES #IMPLIED>
    <!ATTLIST okay-xml  attrib ID       #IMPLIED>
    <!ATTLIST okay-xml  attrib IDREF    #IMPLIED>
    <!ATTLIST okay-xml  attrib IDREFS   #IMPLIED>
    <!ATTLIST okay-xml  attrib NMTOKEN  #IMPLIED>
    <!ATTLIST okay-xml  attrib NMTOKENS #IMPLIED>
    <!ATTLIST okay-xml  attrib (this | that) #IMPLIED>

When you enumerate a list of options using a name token group, you must use the or-bar between them (SGML allows you to use the or-bar or comma interchangeably):

    <!ATTLIST isnt-xml  attrib (red, green, blue) #IMPLIED>
    <!ATTLIST okay-xml  attrib (red | green | blue) #IMPLIED>
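
These checks lend themselves to a simple scan of the DTD. The Python sketch below is an illustration under a stated simplification: it assumes one attribute per ATTLIST declaration and a single element name, as in the examples in this paper. The names `check_declared_values` and `NON_XML_DECLARED_VALUES` are assumptions of this example.

```python
import re

# Declared values that SGML accepts but XML does not.
NON_XML_DECLARED_VALUES = {"NAME", "NAMES", "NUMBER", "NUMBERS", "NUTOKEN", "NUTOKENS"}

# element name, attribute name, then a name token group or a keyword
ATTLIST = re.compile(r"<!ATTLIST\s+(\S+)\s+(\S+)\s+(\([^)]*\)|\S+)")


def check_declared_values(dtd_text):
    """Return (element, attribute, declared value, problem) for each finding."""
    problems = []
    for element, attribute, declared in ATTLIST.findall(dtd_text):
        if declared.upper() in NON_XML_DECLARED_VALUES:
            problems.append((element, attribute, declared, "declared value is not XML"))
        elif declared.startswith("(") and "," in declared:
            problems.append((element, attribute, declared, "use '|' between tokens"))
    return problems


dtd = """
<!ATTLIST isnt-xml3 attrib NUMBER #IMPLIED>
<!ATTLIST isnt-xml  attrib (red, green, blue) #IMPLIED>
<!ATTLIST okay-xml  attrib (red | green | blue) #IMPLIED>
"""
for finding in check_declared_values(dtd):
    print(finding)
```

A declaration that lists several attributes would need a real tokenizer; this sketch looks only at the first attribute it sees.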

Default Values for Attributes

These two default value declarations are not allowed in XML:

    <!ATTLIST isnt-xml  attrib CDATA #CURRENT>
    <!ATTLIST isnt-xml2 attrib CDATA #CONREF>

These four default value declarations are allowed:

    <!ATTLIST okay-xml  attrib CDATA #FIXED "only value">
    <!ATTLIST okay-xml2 attrib CDATA "default value">
    <!ATTLIST okay-xml3 attrib CDATA #REQUIRED>
    <!ATTLIST okay-xml4 attrib CDATA #IMPLIED>

Default values must be enclosed in quote marks:

    <!ATTLIST isnt-xml  attrib (this | that) this>
    <!ATTLIST okay-xml  attrib (this | that) "this">
    <!ATTLIST okay-xml2 attrib (this | that) 'this'>

Attribute Value Specification

You must use an attribute value literal, not an attribute value, in an attribute value specification. In other words, you must use quote marks when specifying an attribute value:

    <isnt-xml  attrib=this>...</isnt-xml>
    <okay-xml  attrib="this">...</okay-xml>
    <okay-xml2 attrib='this'>...</okay-xml2>

You must always spell out the attribute name; you can't imply it by using a name value:

    <isnt-xml red>...</isnt-xml>
    <okay-xml color="red">...</okay-xml>

Data Attributes

You can't use data attributes:

    <!NOTATION mpeg SYSTEM "mpgview.exe">
    <!ATTLIST #NOTATION mpeg isnt-xml (v2 | v3) #REQUIRED>
    <!ENTITY movie-a SYSTEM "movie-a.mpg" NDATA mpeg [isnt-xml="v2"]>
    <!ENTITY movie-b SYSTEM "movie-b.mpg" NDATA mpeg [isnt-xml="v3"]>

In some cases, the way to make this XML might be to expand your list of notations:

    <!NOTATION mpeg2 SYSTEM "mpgview2.exe">
    <!NOTATION mpeg3 SYSTEM "mpgview3.exe">
    <!ENTITY movie-a SYSTEM "movie-a.mpg" NDATA mpeg2>
    <!ENTITY movie-b SYSTEM "movie-b.mpg" NDATA mpeg3>

The Big Three: Entities

XML places various restrictions on entity declarations and entity references.

Internal Entities

You can't use data text internal entities ( CDATA , SDATA or PI ):

    <!ENTITY isnt-xml  CDATA "text">
    <!ENTITY isnt-xml2 SDATA "[adjust me]">
    <!ENTITY isnt-xml3 PI    "BRS ..YEAR">

You can't use bracketed text internal entities ( STARTTAG , ENDTAG , MS and MD ):

    <!ENTITY isnt-xml4 STARTTAG "gi">
    <!ENTITY isnt-xml5 ENDTAG   "gi">
    <!ENTITY isnt-xml6 MS       "CDATA[text">
    <!ENTITY isnt-xml7 MD       "--comment--">

Only the simplest form is allowed for internal entities:

    <!ENTITY okay-xml    "text">
    <!ENTITY okay-xml2   "[adjust me]">
    <!ENTITY okay-xml3   "<?BRS ..YEAR?>">
    <!ENTITY still-isnt-xml4 "<gi>">
    <!ENTITY still-isnt-xml5 "</gi>">
    <!ENTITY okay-xml4-5 "<gi></gi>">
    <!ENTITY okay-xml6   "<![CDATA[text]]>">
    <!ENTITY okay-xml7   "<!--comment-->">

Examples 4 and 5 are explained under 'Synchronicity', below.

External Entities

You can't use SUBDOC , CDATA or SDATA external entities:

    <!ENTITY isnt-xml  SYSTEM "url" SUBDOC>
    <!ENTITY isnt-xml2 SYSTEM "url" CDATA mpeg>
    <!ENTITY isnt-xml3 SYSTEM "url" SDATA mpeg>

External entities can have no entity type specified, or have NDATA specified:

    <!ENTITY okay-xml  SYSTEM "url">
    <!ENTITY okay-xml2 SYSTEM "url" NDATA mpeg>

PUBLIC Identifiers

The FORMAL feature allows you to use what are called 'formal public identifiers' to name entities such as portions of DTDs and portions of documents. XML allows public identifiers, but requires that they be followed by a system identifier to use in case the public identifier can not be resolved. External entities are identified in ENTITY declarations:

    <!ENTITY isnt-xml PUBLIC "-//Example//Entity Example//EN">
    <!ENTITY okay-xml PUBLIC "-//Example//Entity Example//EN"
                             "../examples/example.ent">

External entities are also identified in DOCTYPE declarations:

    <!DOCTYPE isnt-xml PUBLIC "-//Example//DTD Example//EN">
    <!DOCTYPE okay-xml PUBLIC "-//Example//DTD Example//EN"
                              "http://www.example.org/example.dtd">

The exception to this rule is the NOTATION declaration, which does not require a system identifier:

    <!NOTATION okay-xml  PUBLIC "ISO/IEC 10918:1993//NOTATION Digital
    Compression and Coding of Continuous-tone Still Images (JPEG)//EN">
    <!NOTATION okay-xml2 PUBLIC "ISO/IEC 10918:1993//NOTATION Digital
    Compression and Coding of Continuous-tone Still Images (JPEG)//EN"
                                "jpegview.exe">

SYSTEM Identifier

When you use SYSTEM identifiers for external entities, these identifiers must be URLs:

    <!DOCTYPE  okay-xml  SYSTEM "example.dtd">
    <!NOTATION okay-xml  SYSTEM "http://www.example.org/example.not">

Omitted System Identifier

In SGML, you can omit the system identifier after the SYSTEM keyword:

    <!ENTITY isnt-xml SYSTEM>

In XML, you must always include it:

    <!ENTITY okay-xml SYSTEM "example.ent">

Default Entity

You can not declare a default entity:

    <!ENTITY #DEFAULT "[isnt-xml]">

Semicolon

You can't leave the final semicolon off entity references, as SGML allows you to do in certain contexts:

    <isnt-xml>R&eacute;sum&eacute</isnt-xml>
    <okay-xml>R&eacute;sum&eacute;</okay-xml>

Synchronicity

SGML's deprecated obfuscatory entity references are disallowed in XML. Elements and marked sections need to start and end in the same entity.

Generally, everything needs to be balanced inside of each entity. This is important because it allows you to choose not to expand entities in certain contexts while still maintaining a balanced structure.

External Entities in Attributes

XML does not allow references to external entities in attribute literals:

    <!ENTITY external SYSTEM "file.txt">
    <!ENTITY internal "text">
    ...
    <isnt-xml attrib="&external;">
    <okay-xml attrib="&internal;">

References to External Data Entities in Content

You can refer to external data entities in content; but non-validating parsers are not required to include that entity. They may merely choose to note that they saw the reference and go on.

Parameter Entities

In a separate DTD file (the 'external subset'), parameter entities are allowed to appear inside of markup declarations. But in the internal subset of the DTD in an XML document, they can only appear where a whole markup declaration would be allowed:

    <!DOCTYPE document SYSTEM "document.dtd" [
    <!ENTITY % isnt-xml "p">
    <!ELEMENT %isnt-xml; (#PCDATA)>
    <!ENTITY % okay-xml SYSTEM "fragment.dtd">
    %okay-xml;
    ]>

Out of Band: Comments

XML restricts the variation in syntax and location of comments that SGML allows.

A typical SGML comment looks like this:

    <!--Okay XML.-->

The ' <! ' and ' > ' are called the comment declaration, and the ' --...-- ' is the comment proper.

Comments in Other Declarations

You can't slip comments into other declarations. So this is not allowed:

    <!ELEMENT p (#PCDATA | em)* --Isn't XML.-->

Empty Comments

You must have exactly one comment inside of each comment declaration. You are not allowed zero:

    <isnt-xml><!></isnt-xml>
    <okay-xml><!----></okay-xml>

Multiple Comments

And you are not allowed more than one:

    <!--Isn't XML.-- --Isn't XML.-->
    <!--Okay XML.- - - -Okay XML [FILE GARBAGED HERE, SORRY]

Out of Band: Marked Sections

[FILE GARBAGED HERE, SORRY] n empty status keyword specification. XML does not allow this:

    <![[isnt-xml]]>

TEMP

You can't use TEMP in a status keyword specification:

    <![ TEMP [isnt-xml]]>

RCDATA

You can't use RCDATA marked sections:

    <![ RCDATA [isnt-xml]]>

INCLUDE and IGNORE

You can't use INCLUDE or IGNORE marked sections in the document instance, but only in the DTD (and not in the internal subset of the DTD):

    <!DOCTYPE document SYSTEM "document.dtd" [
    <![ IGNORE [isnt-xml]]>
    ]>
    <document><![ IGNORE [isnt-xml]]></document>

Multiple Keywords

You can't use more than one status keyword in a single marked section:

    <![ INCLUDE CDATA [isnt-xml]]>

Parameter Entities

You can't use parameter entities to specify status keywords.

    <![ %maybe; [isnt-xml]]>

No Separators for CDATA Sections

You aren't allowed any white space around the word ' CDATA ' in a CDATA marked section start:

    <![CDATA [isnt-xml]]>
    <![ CDATA[isnt-xml]]>
    <![ CDATA [isnt-xml]]>
    <![CDATA[okay-xml]]>
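
Existing data can be brought into line with a single substitution. The Python sketch below is a minimal illustration; the name `tighten_cdata_starts` is an assumption of this example, and it deliberately leaves any other status keyword alone.

```python
import re

# '<![', optional white space, 'CDATA', optional white space, '['
CDATA_START = re.compile(r"<!\[\s*CDATA\s*\[")


def tighten_cdata_starts(text):
    """Rewrite every CDATA marked section start without white space."""
    return CDATA_START.sub("<![CDATA[", text)


print(tighten_cdata_starts("<![ CDATA [isnt-xml]]>"))
```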

Out of Band: Processing Instructions

XML uses a special syntax for processing instructions. You can imitate this XML syntax by using a similar convention for your SGML processing instructions. Processing instructions are closed in SGML with a right angle-bracket. In XML, they are closed by a question-mark right angle-bracket sequence:

    <?isnt-xml This is a processing instruction.>
    <?okay-xml This is a processing instruction.?>

In XML, the PIC (processing instruction close) delimiter is changed to ' ?> ' instead of the usual ' > '. If you make this change to your SGML declaration, then the first processing instruction above will not parse and the second will parse just as in XML. If you do not make this change, both will parse, but the second will contain the question mark as part of the content of the processing instruction, rather than as the ending delimiter.
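
Processing instructions that already sit in your data can be given the XML-style close mechanically. The Python sketch below is one possible approach; the name `close_pi_xml_style` is an assumption of this example, and it relies on the SGML rule that a processing instruction ends at the first right angle-bracket.

```python
import re

# An SGML processing instruction: '<?' up to the first '>'.
SGML_PI = re.compile(r"<\?([^>]*)>")


def close_pi_xml_style(text):
    """Make every processing instruction in text end with '?>'."""
    def fix(match):
        body = match.group(1)
        if body.endswith("?"):
            return match.group(0)  # already closed with '?>'
        return "<?" + body + "?>"
    return SGML_PI.sub(fix, text)


print(close_pi_xml_style("<?isnt-xml This is a processing instruction.>"))
```

Any program that reads these instructions must then expect, or ignore, the trailing question mark until the SGML declaration is changed as well.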

It is good practice to categorize your SGML processing instructions by always starting them with a name that says to which processor they are directed. In XML, this practice is a requirement. This name is called the PI 'target':

    <??> <!--This isn't XML because it has no target.-->
    <?okay-xml?>
    <?okay-xml2 The target is 'okay-xml2'.?>
    <?okay-xml3The target is 'okay-xml3The'.?>

The target ' xml ' has a special meaning in XML. To avoid confusion, any other capitalization of those three letters is reserved (and prohibited):

    <?xml This isn't XML.?>
    <?XML This isn't XML.?>
    <?XmL This isn't XML.?>
    <?xmlx This is technically okay but tempting fate.?>
    <?sgml This is okay XML.?>

Miscellaneous: Characters

Case Insensitivity

XML insists on case-sensitivity in places where SGML is typically insensitive. This can be a big headache at first, but it can ultimately simplify processing of the data. This is one of several places where SGML can be made to match XML by changing the SGML declaration you use.

First, adopt a standard capitalization for your element and attribute names. As a programmer afraid of carpal-tunnel syndrome, I suggest all lower case. Then, change ' NAMECASE GENERAL YES ' to ' NAMECASE GENERAL NO ' in your SGML declaration file.
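
Folding existing tags to the chosen capitalization can be scripted. The Python sketch below lower-cases element names in start and end tags only; the name `lowercase_tag_names` is an assumption of this example. Attribute names and the DTD itself need the same treatment and are not handled here, and text inside comments or CDATA marked sections that merely looks like a tag would be changed too.

```python
import re

# '<' or '</' followed by an element name (reference concrete syntax characters).
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9.-]*)")


def lowercase_tag_names(text):
    """Lower-case the element name in every start tag and end tag."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), text)


print(lowercase_tag_names("<Para><EM>text</Em></PARA>"))
```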

Odd Name Characters

All the name characters allowed by the reference concrete syntax are allowed by XML. So are thousands of others. But it's possible to have an SGML declaration that declares as name characters some characters that XML doesn't allow as name characters.

Character References without Semicolons

As with entity references, you can't leave off the final semicolon in a character reference:

    <isnt-xml>R&#233;sum&#233</isnt-xml>
    <okay-xml>R&#233;sum&#233;</okay-xml>
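
Unterminated references, whether to entities or to characters, are easy to search for. The Python sketch below is an illustration; the name `unterminated_references` is an assumption of this example, and it does not skip comments or CDATA marked sections, so expect some false positives there.

```python
import re

# '&#digits' or '&name' that is not followed by ';'.
# The lookaheads also reject further name or digit characters, so the
# pattern cannot match just the front part of a properly closed reference.
UNTERMINATED = re.compile(
    r"&(?:#[0-9]+(?![0-9;])|[A-Za-z][A-Za-z0-9.-]*(?![A-Za-z0-9.;-]))"
)


def unterminated_references(text):
    """Return every entity or character reference that lacks its semicolon."""
    return [match.group(0) for match in UNTERMINATED.finditer(text)]


print(unterminated_references("<isnt-xml>R&eacute;sum&eacute</isnt-xml>"))
print(unterminated_references("<isnt-xml>R&#233;sum&#233</isnt-xml>"))
```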

Named Character References

You can't use named character references:

    <isnt-xml>You can't use &#RE;, &#RS;, &#SPACE;,
    or a custom-defined function &#NAME;.</isnt-xml>

References to non-SGML Characters

You can't use a numeric character reference to include a non-SGML character in XML.

Miscellaneous: Minimization

XML does not include a wide variety of markup minimization features available in SGML. This section lists the more common types of minimization. Less commonly used minimization techniques are listed under 'Miscellaneous: Obscure SGML Features'.

OMITTAG

The OMITTAG feature is fairly commonly used. It allows you to completely leave out certain start and end tags when you can tell by the context that they are required. So, using this feature, you might leave out the start tag for a chapter title (provided that there were some data characters at the beginning of the chapter, yet all chapters were required by the DTD to start with a title) or the end tag for a chapter (provided that there was a start tag for the next chapter and the DTD didn't allow chapters to nest). Notice how both examples require consulting the DTD to determine which tags have been left out. XML does not allow tags to be omitted in this way.

Portions of SHORTTAG

The SHORTTAG feature allows various abbreviations to be made within a tag. This feature is officially declared to be ON in the SGML declaration for XML, because some of these abbreviations are in fact allowed. But many are not.

SHORTTAG: Empty Tags

Quite distinct from the idea of an empty element, there is the possibility in SGML of having empty tags. An empty start tag looks like this: ' <> '; and an empty end tag looks like this: ' </> '. Empty tags are allowed in SGML in certain contexts where it is clear what the missing element type name is.

    <isnt-xml><>Apparently, isnt-xml must always start
    with a certain element</></isnt-xml>

In XML, element type names must always be spelled out:

    <okay-xml><title>We need to spell out 'title'</title></okay-xml>

SHORTTAG: Unclosed Tags

Did you know that the final right angle-bracket is not always required on tags in SGML? Stomach-turning, isn't it? Sorry I ever mentioned it.

    <isnt-xml<isnt-xml2>text</isnt-xml2</isnt-xml>
    <okay-xml><okay-xml2>text</okay-xml2></okay-xml>

SHORTTAG: Leaving off Quote Marks

In SGML, you can sometimes leave off the quotes when specifying attribute values. This is not allowed in XML. See The Big Three: Attributes: Attribute Value Specification, above.

SHORTTAG: Leaving off Attribute Names

In SGML, you can sometimes leave off the attribute name when specifying attribute values. This is not allowed in XML. See The Big Three: Attributes: Attribute Value Specification, above.

Miscellaneous: Other Restrictions

< and &

You should not use ' & ' or ' < ' as data:

    <isnt-xml>In SGML, you can use & and < as data in
    certain contexts; when followed by a space, for example.</isnt-xml>

Use ' &amp; ' for ' & ' and ' &lt; ' for ' < ':

    <okay-xml>In SGML, you can use &amp; and &lt; as data in
    certain contexts; when followed by a space, for example.</okay-xml>
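
When data is generated by a program, the escaping is best done by a library routine. The Python sketch below uses `escape` from the standard `xml.sax.saxutils` module; the wrapper name `escape_data` is an assumption of this example. Apply it to raw character data only, because text that already contains markup or entity references would be escaped a second time.

```python
from xml.sax.saxutils import escape


def escape_data(text):
    """Replace '&', '<' and '>' in character data with entity references."""
    # escape() handles '&' before '<' and '>', so the '&' of a freshly
    # written '&lt;' is never escaped again.
    return escape(text)


print(escape_data("In SGML, you can use & and < as data"))
```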

The places where you can use the & and < characters without them being interpreted as markup are in comments, processing instructions, CDATA marked sections and in the literal entit [FILE GARBAGED HERE, SORRY]

... white space (the S ) after the entity name can not be left out:

    <!ENTITY isnt-xml"Scrunching allowed in SGML.">
    <!ENTITY okay-xml  "No scrunching in XML.">

But note that the start of an XML CDATA section is defined to include no white space:

    [19] CDStart ::= '<![CDATA['

SGML Declaration

Do not include the SGML declaration in the XML document entity. The SGML declaration for XML must be left implied.

Miscellaneous: Obscure SGML Features

There are a number of features of SGML of which you may be only dimly aware. You likely won't notice their absence from XML. The SHORTREF , RANK and DATATAG features are properly classified under the minimization category.

SHORTREF

XML does not allow SGML's SHORTREF feature, whereby certain short sequences (like double carriage-returns) are interpreted as abbreviated references to markup (like paragraph tags).

DATATAG

SGML includes a feature called DATATAG in which data acts as both markup and content. I've never encountered a use of this feature. Perhaps a raise of hands?

RANK

The RANK feature allows you to declare a set of elements that differ only in a numerical suffix (like the H1 , H2 , H3 heading elements in HTML) and then to type only <H> , which is interpreted as another heading at whatever level most recently occurred in the document.

Even when this feature is not turned on in your SGML declaration, you can still split the element type name into two parts in an element declaration. You can't do this in XML.

LINK

There are three variations on the LINK feature ( SIMPLE , EXPLICIT and IMPLICIT ). I've heard arguments for the importance of this feature, from people with more authority than I have. But I don't believe them. At any rate, XML does not include this feature. (Note: The SGML LINK feature is not for hyperlinking content, but rather for associating processing with an SGML document.)

CONCUR

The SGML CONCUR feature allows you to apply more than one DTD to the same data, simultaneously. XML does not include this feature.

Conclusion

A journey of a thousand miles starts with the first step. But before you take that step, you ought to determine where you stand. This will help you start out in the right direction. Or realize you're happy right where you are.

References

SGML is defined by 'ISO 8879:1986(E). Information processing - Text and Office Systems - Standard Generalized Markup Language (SGML). First edition - 1986-10-15', available from the International Organization for Standardization in Geneva.

XML is defined by 'Extensible Markup Language (XML) 1.0', a World Wide Web Consortium (W3C) Recommendation dated 1998 February 10 ( www.w3.org/TR/REC-xml ).

This paper is derived from James Clark's 'Comparison of SGML and XML', a World Wide Web Consortium (W3C) Note dated 1997 December 15 ( www.w3.org/TR/NOTE-sgml-xml-971215 by jjc@jclark.com ). A few details have apparently changed since Clark's Note was written: PUBLIC identifiers, for example, are now a part of XML.